IPL Prediction Using Machine Learning


IPL Winning Team Prediction

The Indian Premier League (IPL) has been postponed because of the coronavirus. This 13th edition of the IPL will run for almost 57 days in India. The tournament will be contested by 8 teams playing in a “home-and-away round-robin system”, with the top four at the end of the group phase progressing to the semi-finals.


The main objective of this project is to predict the IPL semi-finals and final based on each team's record.

1. Data Collection

I scraped data from Wikipedia and the IPLT20 website, comprising the records of the teams as of 2020, details of the 2020 IPL fixtures, and each team's history in previous IPL seasons. I stored this data in three separate CSV files. For the fourth file, I downloaded a dataset of IPL matches played between 2008 and 2019 from Kaggle as another CSV file. Then I cleaned the CSV file manually, as needed, to build a machine learning model from it.

2. Data Cleaning And Formatting

Load the two CSV files. results.csv contains the IPL match dates, team names, winning team name, ground (city) name, and winning margin. IPL 2020 Dataset.csv contains each team's appearances, titles won, semi-finals played, and finals played, plus a current rank that I assigned based on IPL trophies won.

import pandas as pd
IPL = pd.read_csv('datasets/IPL 2020 Dataset.csv')
results = pd.read_csv('datasets/results.csv')
# Example: select every match in which Chennai Super Kings played
df = results[(results['Team_1'] == 'Chennai Super Kings') | (results['Team_2'] == 'Chennai Super Kings')]
india = df.iloc[:]

3. Exploratory data analysis [EDA]

After that, I combine the teams participating this year with their past results by keeping only the matches in which they played.

IPL_Teams = ['Mumbai Indians', 'Chennai Super Kings', 'Delhi Capitals', 'Kings XI Punjab', 
            'Royal Challengers Bangalore', 'Kolkata Knight Riders', 'Sun Risers Hyderabad', 'Rajasthan Royals']
df_teams_1 = results[results['Team_1'].isin(IPL_Teams)]
df_teams_2 = results[results['Team_2'].isin(IPL_Teams)]
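# A match between two listed teams passes both filters, so drop the repeated rows
df_teams = pd.concat((df_teams_1, df_teams_2)).drop_duplicates()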
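
Because a match between two of the listed teams satisfies both filters, concatenating the two filtered frames can contain that match twice. A minimal sketch on made-up rows (the toy data and the variable names are assumptions, not the real dataset) shows the effect and how `drop_duplicates` removes it:

```python
import pandas as pd

# Toy results table (made-up rows, not the real dataset)
toy_results = pd.DataFrame({
    "Team_1": ["Mumbai Indians", "Deccan Chargers", "Chennai Super Kings"],
    "Team_2": ["Chennai Super Kings", "Mumbai Indians", "Pune Warriors"],
    "winner": ["Mumbai Indians", "Deccan Chargers", "Chennai Super Kings"],
})
current_teams = ["Mumbai Indians", "Chennai Super Kings"]

# Same two filters as in the article
as_team_1 = toy_results[toy_results["Team_1"].isin(current_teams)]
as_team_2 = toy_results[toy_results["Team_2"].isin(current_teams)]

# Row 0 (two current teams facing each other) is selected by both filters
combined = pd.concat((as_team_1, as_team_2))
deduplicated = combined.drop_duplicates()

print(len(combined), len(deduplicated))
```

Duplicated rows matter here because the same match could otherwise land in both the training and the test split.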

I remove the date, margin, and ground columns because these features are not important for the prediction.

# Drop columns that will not affect match outcomes
df_teams_2010 = df_teams.drop(['date','Margin', 'Ground'], axis=1)

4. Feature engineering and selection

I create two labels: label 1 if Team_1 won the match, and label 2 if Team_2 won.

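df_teams_2010 = df_teams_2010.reset_index(drop=True)
# Encode the outcome: 1 if Team_1 won the match, 2 if Team_2 won
df_teams_2010.loc[df_teams_2010.winner == df_teams_2010.Team_1,'winning_team']=1
df_teams_2010.loc[df_teams_2010.winner == df_teams_2010.Team_2, 'winning_team']=2
# Drop matches without a winner and keep the numeric label in the 'winner' column
df_teams_2010 = df_teams_2010.dropna(subset=['winning_team'])
df_teams_2010['winner'] = df_teams_2010['winning_team'].astype(int)
df_teams_2010 = df_teams_2010.drop(['winning_team'], axis=1)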
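
The labelling rule can also be written in one vectorised step. The following is only a sketch on made-up rows: the `matches` frame, the use of `numpy.select`, and the placeholder label 0 for matches without a winner are assumptions for illustration, not part of the original project.

```python
import numpy as np
import pandas as pd

# Made-up matches; the last one has no result
matches = pd.DataFrame({
    "Team_1": ["Mumbai Indians", "Chennai Super Kings", "Rajasthan Royals"],
    "Team_2": ["Chennai Super Kings", "Kolkata Knight Riders", "Delhi Capitals"],
    "winner": ["Mumbai Indians", "Kolkata Knight Riders", None],
})

# 1 if Team_1 won, 2 if Team_2 won, 0 (placeholder) when neither matches
conditions = [
    matches["winner"] == matches["Team_1"],
    matches["winner"] == matches["Team_2"],
]
matches["winning_team"] = np.select(conditions, [1, 2], default=0)

# Keep only the matches that actually have a winner
labelled = matches[matches["winning_team"] != 0]
print(labelled[["Team_1", "Team_2", "winning_team"]])
```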


Create dummy variables to convert the categorical team names into numeric features.

# Get dummy variables
final = pd.get_dummies(df_teams_2010, prefix=['Team_1', 'Team_2'], columns=['Team_1', 'Team_2'])

# Separate X and y sets
X = final.drop(['winner'], axis=1)
y = final["winner"]

# Separate train and test sets
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=42)
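
To see what the dummy-variable step produces, here is a minimal sketch on a made-up two-row frame (the `toy` data is an assumption for illustration). Each team name becomes its own indicator column, once for the Team_1 slot and once for the Team_2 slot, while the `winner` column is left untouched:

```python
import pandas as pd

# Made-up frame with the same three columns used in the article
toy = pd.DataFrame({
    "Team_1": ["Mumbai Indians", "Chennai Super Kings"],
    "Team_2": ["Chennai Super Kings", "Mumbai Indians"],
    "winner": [1, 2],
})

encoded = pd.get_dummies(toy, prefix=["Team_1", "Team_2"], columns=["Team_1", "Team_2"])

# Untouched columns come first, then one indicator column per team and slot
print(encoded.columns.tolist())
```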


I considered Logistic Regression, Random Forest, and K-Nearest Neighbours for training the model, and I chose Random Forest.
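
One way to compare the three candidates is cross-validation. The sketch below is only an illustration: it uses synthetic one-hot match data (the generator, the 70% win rule, and all variable names are assumptions, not the project's dataset) together with scikit-learn's `cross_val_score`.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic matches between 8 teams (assumption: not real IPL data)
rng = np.random.default_rng(0)
n_teams, n_matches = 8, 400
team_1 = rng.integers(0, n_teams, size=n_matches)
offset = rng.integers(1, n_teams, size=n_matches)
team_2 = (team_1 + offset) % n_teams  # always differs from team_1

# One-hot encode both slots, like the Team_1_* / Team_2_* dummy columns
rows = np.arange(n_matches)
X_demo = np.zeros((n_matches, 2 * n_teams))
X_demo[rows, team_1] = 1
X_demo[rows, n_teams + team_2] = 1

# Made-up rule: the lower-numbered team wins about 70% of the time
stronger_is_team_1 = team_1 < team_2
upset = rng.random(n_matches) < 0.3
y_demo = np.where(stronger_is_team_1 ^ upset, 1, 2)

candidates = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Random Forest": RandomForestClassifier(n_estimators=100, max_depth=20, random_state=0),
    "K-Nearest Neighbours": KNeighborsClassifier(n_neighbors=5),
}

# Mean 5-fold cross-validated accuracy for each candidate
mean_scores = {
    name: cross_val_score(model, X_demo, y_demo, cv=5).mean()
    for name, model in candidates.items()
}
for name, score in mean_scores.items():
    print(name, '%.3f' % score)
```

On the real data, `X_demo` and `y_demo` would be replaced by the `X` and `y` built earlier; the ranking of the models on synthetic data says nothing about their ranking on IPL results.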

5. Model building

from sklearn.ensemble import RandomForestClassifier
rf = RandomForestClassifier(n_estimators=100, max_depth=20)
rf.fit(X_train, y_train)

score = rf.score(X_train, y_train)
score2 = rf.score(X_test, y_test)

print("Training set accuracy: ", '%.3f'%(score))
print("Test set accuracy: ", '%.3f'%(score2))

Apply the model to the 2020 fixtures and rankings

fixtures = pd.read_csv('datasets/fixtures.csv')
ranking = pd.read_csv('datasets/ipl_rankings.csv') 

# List for storing the group stage games

pred_set = []

Next, I add new columns with the ranking position of each team and slice the dataset to keep the first 56 games.

fixtures.insert(1, 'first_position', fixtures['Team_1'].map(ranking.set_index('Team')['Position']))
fixtures.insert(2, 'second_position', fixtures['Team_2'].map(ranking.set_index('Team')['Position']))

# We only need the group stage games, so we have to slice the dataset

fixtures = fixtures.iloc[:56, :]
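
The lookup relies on team names being spelled identically in both files. As a hedged illustration on made-up data (the `ranking_demo` and `fixtures_demo` frames are assumptions), `Series.map` returns NaN for any team that has no entry in the ranking table, so it is worth checking for unmatched rows before predicting:

```python
import pandas as pd

# Made-up ranking table and fixtures (not the real files)
ranking_demo = pd.DataFrame({
    "Team": ["Mumbai Indians", "Chennai Super Kings", "Delhi Capitals"],
    "Position": [1, 2, 3],
})
fixtures_demo = pd.DataFrame({
    "Team_1": ["Delhi Capitals", "Mumbai Indians", "Sunrisers Hyderabad"],
    "Team_2": ["Mumbai Indians", "Chennai Super Kings", "Delhi Capitals"],
})

# Build the team -> position lookup once and reuse it for both columns
position_by_team = ranking_demo.set_index("Team")["Position"]
fixtures_demo["first_position"] = fixtures_demo["Team_1"].map(position_by_team)
fixtures_demo["second_position"] = fixtures_demo["Team_2"].map(position_by_team)

# Rows where a team name was not found in the ranking table
unmatched = fixtures_demo[
    fixtures_demo[["first_position", "second_position"]].isna().any(axis=1)
]
print(unmatched)
```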

Add each fixture to the new prediction dataset, placing the better-ranked team first.

for index, row in fixtures.iterrows():
    # Put the better-ranked team (lower position number) in the Team_1 slot
    if row['first_position'] < row['second_position']:
        pred_set.append({'Team_1': row['Team_1'], 'Team_2': row['Team_2'], 'winning_team': None})
    else:
        pred_set.append({'Team_1': row['Team_2'], 'Team_2': row['Team_1'], 'winning_team': None})
pred_set = pd.DataFrame(pred_set)
backup_pred_set = pred_set

After that, I get the dummy variables and add any columns that are missing compared to the training dataset.

pred_set = pd.get_dummies(pred_set, prefix=['Team_1', 'Team_2'], columns=['Team_1', 'Team_2'])

missing_cols = set(final.columns) - set(pred_set.columns)
for c in missing_cols:
    pred_set[c] = 0
pred_set = pred_set[final.columns]

pred_set = pred_set.drop(['winner'], axis=1)
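
The same alignment can be expressed with `DataFrame.reindex`, which adds the missing columns, fills them with 0, and fixes the column order in one call. This is an alternative sketch on made-up data (`train_columns` and `new_matches` are assumptions), not the code used in the project:

```python
import pandas as pd

# Feature columns the model was trained on (made-up example)
train_columns = [
    "Team_1_Chennai Super Kings",
    "Team_1_Mumbai Indians",
    "Team_2_Chennai Super Kings",
    "Team_2_Mumbai Indians",
]

# A single new match to predict
new_matches = pd.DataFrame({
    "Team_1": ["Mumbai Indians"],
    "Team_2": ["Chennai Super Kings"],
})
encoded_new = pd.get_dummies(
    new_matches, prefix=["Team_1", "Team_2"], columns=["Team_1", "Team_2"]
)

# Add absent columns as 0 and match the training column order
aligned = encoded_new.reindex(columns=train_columns, fill_value=0)
print(aligned.astype(int))
```

Keeping the prediction columns in the training order matters because the model matches features by position or by name, depending on the scikit-learn version.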

6. Model Results

predictions = rf.predict(pred_set)
for i in range(fixtures.shape[0]):
    print(backup_pred_set.iloc[i, 1] + " and " + backup_pred_set.iloc[i, 0])
    # Label 1 means Team_1 (column 0) won; otherwise Team_2 (column 1) won
    if predictions[i] == 1:
        print("Winner: " + backup_pred_set.iloc[i, 0])
    else:
        print("Winner: " + backup_pred_set.iloc[i, 1])

The full results are in the linked Jupyter notebook.

For the semi-finals, I chose four teams: Kolkata Knight Riders, Chennai Super Kings, Mumbai Indians, and Rajasthan Royals.

semi = [('Kolkata Knight Riders', 'Chennai Super Kings'),
            ('Mumbai Indians', 'Rajasthan Royals')]
def clean_and_predict(matches, ranking, final, logreg):
    positions = []
    for match in matches:
        positions.append(ranking.loc[ranking['Team'] == match[0],'Position'].iloc[0])
        positions.append(ranking.loc[ranking['Team'] == match[1],'Position'].iloc[0])
    pred_set = []
    i = 0
    j = 0

    while i < len(positions):
        dict1 = {}

        # Put the better-ranked team (lower position number) in the Team_1 slot
        if positions[i] < positions[i + 1]:
            dict1.update({'Team_1': matches[j][0], 'Team_2': matches[j][1]})
        else:
            dict1.update({'Team_1': matches[j][1], 'Team_2': matches[j][0]})

        pred_set.append(dict1)
        i += 2
        j += 1
    pred_set = pd.DataFrame(pred_set)
    backup_pred_set = pred_set

    pred_set = pd.get_dummies(pred_set, prefix=['Team_1', 'Team_2'], columns=['Team_1', 'Team_2'])

    missing_cols2 = set(final.columns) - set(pred_set.columns)
    for c in missing_cols2:
        pred_set[c] = 0
    pred_set = pred_set[final.columns]

    pred_set = pred_set.drop(['winner'], axis=1)

    predictions = logreg.predict(pred_set)
    for i in range(len(pred_set)):
        print(backup_pred_set.iloc[i, 1] + " and " + backup_pred_set.iloc[i, 0])
        # Label 1 means Team_1 (column 0) won; otherwise Team_2 (column 1) won
        if predictions[i] == 1:
            print("Winner: " + backup_pred_set.iloc[i, 0])
        else:
            print("Winner: " + backup_pred_set.iloc[i, 1])

Then I run the function for the semi-finals.

clean_and_predict(semi, ranking, final, rf)

Finally, I run the function for the final between Chennai Super Kings and Mumbai Indians.

finals = [('Chennai Super Kings', 'Mumbai Indians')]
clean_and_predict(finals, ranking, final, rf)

If the IPL 2020 final is between CSK and MI, this model predicts that MI will win.

The full project code is available here.

