Cricket Player Performance Prediction

The document outlines a Python script that utilizes Selenium and Pandas to scrape cricket player statistics from a website, processes the data into a DataFrame, and performs linear regression and ridge regression to predict future performance metrics for players. It includes steps for data extraction, cleaning, and merging batting and bowling statistics, followed by model training and evaluation. Finally, it generates predictions for various performance metrics based on historical data.

Uploaded by

995aarvee
Copyright
© All Rights Reserved

import time

import pandas as pd
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys

# Initialize webdriver (Chrome is assumed here; use the browser you have a driver for)
driver = webdriver.Chrome()

final_data = pd.DataFrame()

# Sample URL used in the example (update with the actual player URLs).
# The site address appears only as "[Link]" in this copy; replace it with the real base URL.
driver.get("[Link]player=KA+Pollard&role=all&format=T20I&groupby=match&start_date=2021-10-17&end_date=2022-10-17")

# List of players to iterate over (update this list with actual player names or IDs)
players = ["player1", "player2", "player3"]  # Example players

for i in players:
    # Enter the player's name in the search box
    search_box = driver.find_element(By.XPATH, '//*[@id="player"]')
    search_box.clear()
    search_box.send_keys(i)

    # Submit the form, retrying once if the first click fails
    submit_xpath = '/html/body/div[1]/div[1]/div[2]/div/form/input[3]'
    try:
        driver.find_element(By.XPATH, submit_xpath).click()
    except Exception:
        driver.find_element(By.XPATH, submit_xpath).click()
    time.sleep(3)

    try:
        # Batting data: .text yields one table row per line, with cells
        # separated by spaces; the last row is dropped
        bat = driver.find_element(By.XPATH, '//*[@id="T20I-Batting"]/div/table').text
        stats = pd.DataFrame(bat.split('\n'))[0].str.split(' ', expand=True)[0:-1]
        stats.columns = stats.iloc[0]
        stats = stats[1:]
        del stats['%']
        stats = stats[['Match', 'Runs', 'Balls', 'Out', '4s', '6s', 'Dot']]
        # Names are assigned in the same order as the selected columns
        stats.columns = ['Match', 'Runs Scored', 'Balls Played', 'Out',
                         '4s Scored', '6s Scored', 'Bat Dot%']
        time.sleep(5)
    except Exception:
        continue  # skip players without a usable batting table

    bowl_columns = ['Match', 'Overs Bowled', 'Runs Given', 'Wickets Taken', 'Econ',
                    'Bowl SR', '5W', '4s Given', '6s Given', 'Bowl Dot%']
    try:
        # Bowling data
        bowl = driver.find_element(By.XPATH, '//*[@id="T20I-Bowling"]/div/table').text
        stats2 = pd.DataFrame(bowl.split('\n'))[0].str.split(' ', expand=True)[0:-1]
        stats2.columns = stats2.iloc[0]
        stats2 = stats2[1:]
        stats2 = stats2[['Match', 'Overs', 'Runs', 'Wickets', 'Econ', 'SR', '5W',
                         '4s', '6s', 'Dot%']]
        # Names are assigned in the same order as the selected columns
        stats2.columns = bowl_columns
    except Exception:
        # No bowling table: fall back to an empty frame with the same columns
        stats2 = pd.DataFrame(columns=bowl_columns)

    # Left join keeps batting rows for matches without bowling figures
    overall = pd.merge(stats, stats2, on='Match', how='left')

    # Scraped cells are strings: convert to numbers, then treat missing values as 0
    value_cols = overall.columns.drop('Match')
    overall[value_cols] = overall[value_cols].apply(pd.to_numeric, errors='coerce')
    overall = overall.fillna(0)

    overall['overall'] = overall['Runs Scored'] + overall['Wickets Taken']  # Example calculation
    overall = overall.sort_values(by='Match')
    overall.insert(0, 'Player', i)
    final_data = pd.concat([final_data, overall], ignore_index=True)

driver.quit()

print(final_data)

from sklearn import linear_model
from sklearn.model_selection import train_test_split

# 'model1_df' is the DataFrame containing the data. It is not created by this
# script and must be defined before running this code.

# Linear Regression
# Fitting the model and checking the fit (score() returns R^2, not accuracy)

X = model1_df[model1_df.columns[1:-1]]
y = model1_df[model1_df.columns[-1]]

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=9999)

points_model = linear_model.LinearRegression().fit(X_train, y_train)

print('Training set R^2:', points_model.score(X_train, y_train))
print('Test set R^2:', points_model.score(X_test, y_test))

import pandas as pd
from sklearn import linear_model
from sklearn.model_selection import train_test_split

# Using ridge regression to predict the next match's performance based on the
# same player's past performance.


def fit_best_ridge(X, y):
    """Try alpha = 0..100 and refit at the alpha with the best mean train/test R^2."""
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=123)
    rows = []
    for j in range(0, 101):
        points = linear_model.Ridge(alpha=j).fit(X_train, y_train)
        rows.append({'Alpha': j,
                     'Train': points.score(X_train, y_train),
                     'Test': points.score(X_test, y_test)})
    ridge = pd.DataFrame(rows)
    ridge['Average'] = ridge[['Train', 'Test']].mean(axis=1)
    k = float(ridge.loc[ridge['Average'].idxmax(), 'Alpha'])
    return linear_model.Ridge(alpha=k).fit(X_train, y_train)


# Output column -> (feature column positions, target column position).
# The positions assume final_data has the wide column layout used by the
# original analysis; adjust them to match your own DataFrame.
# Note that the first two entries share target position 22.
targets = {
    'next_runs_given': (slice(2, 11), 22),
    'next_balls_faced': (slice(11, 21), 22),
    'next_wkts_taken': (slice(11, 21), 25),
    'next_overs_faced': (slice(11, 21), 24),
}

# Assumed definition: every player that has rows in final_data
players_list = final_data['Player'].unique()

models = pd.DataFrame()

for i in players_list:
    player = final_data[final_data['Player'] == i]
    # Drop rows with missing values (assumed step; Ridge cannot be fitted on NaN)
    player_new = player.dropna()
    latest = player_new.tail(1)  # the player's most recent match

    for name, (features, target) in targets.items():
        X = player_new.iloc[:, features]
        y = player_new.iloc[:, target]
        model = fit_best_ridge(X, y)
        # Predict from the same feature columns the model was trained on
        prediction = model.predict(latest.iloc[:, features])
        models.loc[i, name] = round(prediction[0], 0)

# Display the models DataFrame with predictions
print(models)

Common questions

Next-match predictions come from Ridge regression models trained on each player's previous performances. A separate model is fitted for each metric (runs given, balls faced, wickets taken, and overs faced). Each model uses the best 'alpha' found during hyperparameter tuning and is applied to the player's latest data point, which represents their most recent match, to forecast the next one.
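The idea of predicting from the latest data point can be sketched on a small, made-up history for one player. The numbers, the `Next Runs` column, and the use of `shift(-1)` to build a next-match target are all assumptions for illustration; the script does not show how its own target columns were built.

```python
import pandas as pd
from sklearn.linear_model import Ridge

# Invented match history for a single player (illustrative values only)
history = pd.DataFrame({
    "Runs Scored":  [10, 25, 31, 8, 44, 19, 27, 36],
    "Balls Played": [12, 20, 22, 9, 30, 15, 21, 24],
})

# Assumed target: the runs scored in the following match
history["Next Runs"] = history["Runs Scored"].shift(-1)

# The last row has no "next match" yet, so it is excluded from training
train = history.dropna()
features = ["Runs Scored", "Balls Played"]
model = Ridge(alpha=1.0).fit(train[features], train["Next Runs"])

# Predict the next match from the most recent row
latest = history[features].tail(1)
prediction = model.predict(latest)
print(round(float(prediction[0]), 0))
```

With so few rows the forecast itself means little; the point is only the shape of the workflow: train on past rows, then call `predict` on the last one.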

The script tries every integer 'alpha' from 0 to 100 to see how the strength of regularization affects model performance. Each value shifts the trade-off between bias and variance: a small 'alpha' lets the model fit the training data closely, while a large 'alpha' constrains the coefficients. Scoring every candidate on both the training and the test split lets the script pick a value that fits the training data without performing much worse on held-out data. Because the test split also drives the choice, its score is no longer an independent estimate of performance on new data.

Using different feature columns for different models lets each model draw on the inputs most relevant to its target (runs, balls, wickets, or overs). Matching the feature set to the outcome variable helps the model capture the underlying relationships, and leaving out unrelated columns reduces the number of coefficients to estimate, which lowers the risk of overfitting.

The hyperparameter 'alpha' in Ridge regression controls the strength of the regularization applied to the model's coefficients. A larger 'alpha' penalizes large coefficients more severely, reducing model complexity and countering overfitting, whereas a smaller 'alpha' allows more complex models. In this analysis, 'alpha' is chosen by iterating over the values 0 to 100 and selecting the one that yields the highest average of the training and test scores. The aim is a model that performs comparably on both splits rather than one that only fits the training data.
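The shrinking effect of 'alpha' can be seen directly from the closed-form ridge solution. The sketch below is an illustration on synthetic data, not part of the script: `ridge_coefficients` is a hypothetical helper, and it omits the intercept that scikit-learn's `Ridge` fits by default.

```python
import numpy as np


def ridge_coefficients(X, y, alpha):
    """Closed-form ridge solution w = (X^T X + alpha * I)^-1 X^T y (no intercept)."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(n_features), X.T @ y)


# Synthetic data: 30 samples, 3 features, known coefficients plus noise
rng = np.random.default_rng(0)
X = rng.normal(size=(30, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=30)

alphas = [0, 1, 10, 100]
norms = [float(np.linalg.norm(ridge_coefficients(X, y, a))) for a in alphas]

# The coefficient norm shrinks as alpha grows
for a, n in zip(alphas, norms):
    print(f"alpha={a:>3}  ||w||={n:.3f}")
```

At `alpha=0` this reduces to ordinary least squares; every increase in 'alpha' pulls the coefficient vector closer to zero.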

Linear regression models assume a linear relationship between the independent and dependent variables, which may not hold for cricket performance given its multifaceted nature. The model's accuracy can be affected by outliers or non-linear patterns, and it may not capture complex interactions between variables. Moreover, cricket performances are influenced by external factors such as playing conditions and opposition quality, which simple linear models do not account for. Ridge regression mitigates some of these issues by reducing overfitting, yet it cannot address non-linearities or factors missing from the data.

Challenges in using Selenium include page elements being changed or removed without notice, which breaks extraction scripts, and dynamically loaded content, which can lead to incomplete data captures. These challenges can be addressed by implementing exception handling to retry actions, using explicit waits to ensure elements have loaded, and regularly updating the scripts to align with website changes. Additionally, alternatives to Selenium, or an API if one is available, could provide more stable and easier access to the data.
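The retry idea can be sketched without a browser. The `retry` helper and the `flaky_click` stand-in below are hypothetical names used for illustration; in a real scraper the action would be a Selenium call such as a `find_element(...).click()`, and Selenium's `WebDriverWait` would cover the explicit-wait part.

```python
import time


def retry(action, attempts=3, delay=0.0, exceptions=(Exception,)):
    """Call action() until it succeeds or attempts run out, then re-raise the last error."""
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            return action()
        except exceptions as err:
            last_error = err
            if attempt < attempts:
                time.sleep(delay)  # pause before the next attempt
    raise last_error


# Stand-in for a flaky browser action: fails twice, then succeeds
calls = {"count": 0}


def flaky_click():
    calls["count"] += 1
    if calls["count"] < 3:
        raise RuntimeError("element not ready")
    return "clicked"


result = retry(flaky_click, attempts=5)
print(result, calls["count"])  # clicked 3
```

Bounding the number of attempts matters: an unbounded retry loop would hang forever on a page whose layout has actually changed.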

Selenium supports the data collection process by automating web interactions to scrape cricket player statistics from online sources. It simulates user actions such as navigating pages, entering player names, and fetching updated datasets without manual intervention. Because it interacts programmatically with web elements, the extraction of large datasets becomes repeatable and less prone to human error than manual collection, providing the inputs for the subsequent Ridge regression analysis.

The script merges each player's batting and bowling statistics on the 'Match' column. It then adds an 'overall' column, the sum of 'Runs Scored' and 'Wickets Taken', as a simple combined measure of a player's all-round contribution in a game; the script itself labels this an example calculation. The merged rows are sorted by match before being appended to the final dataset used for the performance predictions.
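The merge and the 'overall' column can be illustrated on two tiny, invented tables. A left join is assumed here so that matches without bowling figures are kept; pandas' default for `merge` is an inner join, which would drop them.

```python
import pandas as pd

# Invented per-match figures for one player (illustrative values only)
batting = pd.DataFrame({"Match": ["M1", "M2", "M3"], "Runs Scored": [34, 12, 50]})
bowling = pd.DataFrame({"Match": ["M1", "M3"], "Wickets Taken": [2, 1]})

# Keep every batting row; matches with no bowling entry get 0 wickets
overall = pd.merge(batting, bowling, on="Match", how="left").fillna(0)

# Same example calculation as in the script
overall["overall"] = overall["Runs Scored"] + overall["Wickets Taken"]
overall = overall.sort_values(by="Match")
print(overall)
```

Filling missing values before the sum matters: adding a number to NaN gives NaN, so match M2 would otherwise lose its runs as well.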

The implementation evaluates its models with a hold-out train-test split, which divides the data into separate training and testing sets. This is not cross-validation: each model is trained on one subset and scored once on the other. The value returned by `score` is R^2 (the coefficient of determination), not an accuracy. In the Ridge step, 'alpha' is chosen by the highest average of the training and test R^2, so the test score of the chosen model is an optimistic estimate; k-fold cross-validation or a separate validation split would give a less biased one.
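For comparison, k-fold cross-validation scores each 'alpha' on several held-out folds instead of a single split. The sketch below is an assumption-laden illustration, not the script's method: the data is synthetic, `mean_cv_r2` is a hypothetical helper, and the alpha grid is arbitrary.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

# Synthetic data: 60 samples, 4 features, known coefficients plus noise
rng = np.random.default_rng(42)
X = rng.normal(size=(60, 4))
y = X @ np.array([1.5, 0.0, -2.0, 0.7]) + rng.normal(scale=0.5, size=60)


def mean_cv_r2(alpha, X, y, folds=5):
    """Mean R^2 of Ridge(alpha) across k held-out folds."""
    scores = cross_val_score(Ridge(alpha=alpha), X, y, cv=folds, scoring="r2")
    return scores.mean()


alphas = [0.01, 0.1, 1, 10, 100]
cv_scores = {a: mean_cv_r2(a, X, y) for a in alphas}

# Pick the alpha with the best average score on held-out folds
best_alpha = max(cv_scores, key=cv_scores.get)
print(best_alpha, round(cv_scores[best_alpha], 3))
```

Because every fold serves once as held-out data, the chosen 'alpha' depends less on one particular random split, which is useful when each player has only a few matches.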

The primary objective of using Ridge regression in this cricket performance analysis is to predict a player's performance in their next match from their past performance. This includes predicting runs given, balls faced, wickets taken, and overs faced. Ridge regression also helps with multicollinearity: its penalty on large coefficients stabilizes the estimates when the input statistics are strongly correlated.