
Machine Learning Lab 6 Programs

The document provides Python programs for various machine learning algorithms, including Central Tendency Measures (mean, median, mode), measures of dispersion (variance, standard deviation), K-Nearest Neighbors (KNN) for classification and regression, Decision Trees for classification and regression, and Naïve Bayes classification. Each section includes code examples and outputs demonstrating the performance of the algorithms on datasets, primarily using the Iris dataset. The results indicate high accuracy and performance metrics for the algorithms applied.

Machine Learning

1. Write a Python program to compute Central Tendency Measures: Mean, Median, Mode

CODE:
from collections import Counter


def compute_mean(numbers):
    # Arithmetic mean: sum of the values divided by their count
    return sum(numbers) / len(numbers)


def compute_median(numbers):
    sorted_numbers = sorted(numbers)
    n = len(sorted_numbers)
    if n % 2 == 0:
        # Even count: average the two middle values
        mid = n // 2
        return (sorted_numbers[mid - 1] + sorted_numbers[mid]) / 2
    else:
        # Odd count: take the single middle value
        return sorted_numbers[n // 2]


def compute_mode(numbers):
    # Return every value that occurs with the highest frequency
    count = Counter(numbers)
    max_count = max(count.values())
    mode = [num for num, freq in count.items() if freq == max_count]
    return mode if mode else None


if __name__ == "__main__":
    # Sample input; change this list to test with different data
    data = [1, 2, 3, 4, 5, 6, 6, 7, 8, 8, 8]
    mean = compute_mean(data)
    median = compute_median(data)
    mode = compute_mode(data)
    print(f"Data: {data}")
    print(f"Mean: {mean}")
    print(f"Median: {median}")
    print(f"Mode: {mode}")

OUTPUT:

Data: [1, 2, 3, 4, 5, 6, 6, 7, 8, 8, 8]
Mean: 5.2727272727272725
Median: 6
Mode: [8]

2. Measure of Dispersion: Variance, Standard Deviation

PROGRAM CODE:

def compute_mean(numbers):
    return sum(numbers) / len(numbers)


def compute_variance(numbers):
    # Population variance: mean of the squared deviations from the mean
    mean = compute_mean(numbers)
    squared_diff = [(x - mean) ** 2 for x in numbers]
    variance = sum(squared_diff) / len(numbers)
    return variance


def compute_standard_deviation(numbers):
    # Standard deviation is the square root of the variance
    variance = compute_variance(numbers)
    standard_deviation = variance ** 0.5
    return standard_deviation


if __name__ == "__main__":
    # Take a list of numbers from the user
    input_data = input("Enter a list of numbers separated by spaces: ")
    try:
        # Convert the user input into a list of floats
        data = [float(num) for num in input_data.split()]
        # Calculate the measures of dispersion
        variance = compute_variance(data)
        standard_deviation = compute_standard_deviation(data)
        print(f"Data: {data}")
        print(f"Variance: {variance}")
        print(f"Standard Deviation: {standard_deviation}")
    except ValueError:
        print("Invalid input! Please enter a list of numbers separated by spaces.")

OUTPUT:

Enter a list of numbers separated by spaces: 10 12 16 20 23 26 29 45
Data: [10.0, 12.0, 16.0, 20.0, 23.0, 26.0, 29.0, 45.0]
Variance: 109.484375
Standard Deviation: 10.463478150213723

3. Here is an example of applying the K-Nearest Neighbors (KNN) algorithm for both classification and regression using Python. We'll use the popular scikit-learn library and some sample datasets to illustrate the concepts.

PROGRAM CODE:

# Import necessary libraries
from sklearn.datasets import load_iris, make_regression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier, KNeighborsRegressor
from sklearn.metrics import accuracy_score, mean_squared_error

# ---------------- KNN for Classification ---------------- #

# Load the Iris dataset for classification
iris = load_iris()
X_classification = iris.data
y_classification = iris.target

# Split the dataset into training and testing sets
X_train_c, X_test_c, y_train_c, y_test_c = train_test_split(
    X_classification, y_classification, test_size=0.3, random_state=42
)

# Initialize the KNN classifier with k=3
knn_classifier = KNeighborsClassifier(n_neighbors=3)

# Train the model
knn_classifier.fit(X_train_c, y_train_c)

# Predict on the test set
y_pred_c = knn_classifier.predict(X_test_c)

# Calculate accuracy
accuracy = accuracy_score(y_test_c, y_pred_c)
print("Classification Results:")
print(f"Accuracy: {accuracy * 100:.2f}%")

# ---------------- KNN for Regression ---------------- #

# Create a synthetic dataset for regression
X_regression, y_regression = make_regression(
    n_samples=200, n_features=1, noise=10, random_state=42
)

# Split the dataset into training and testing sets
X_train_r, X_test_r, y_train_r, y_test_r = train_test_split(
    X_regression, y_regression, test_size=0.3, random_state=42
)

# Initialize the KNN regressor with k=3
knn_regressor = KNeighborsRegressor(n_neighbors=3)

# Train the model
knn_regressor.fit(X_train_r, y_train_r)

# Predict on the test set
y_pred_r = knn_regressor.predict(X_test_r)

# Calculate mean squared error
mse = mean_squared_error(y_test_r, y_pred_r)
print("\nRegression Results:")
print(f"Mean Squared Error: {mse:.2f}")

OUTPUT:

Classification Results:
Accuracy: 100.00%

Regression Results:
Mean Squared Error: 269.01

4. Here’s a Python program to demonstrate the Decision Tree Algorithm for a classification problem using the Iris dataset. The program also includes parameter tuning using Grid Search for better results.

import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.tree import DecisionTreeClassifier, plot_tree
from sklearn.metrics import accuracy_score, classification_report

# Load the Iris dataset
iris = load_iris()
X = iris.data
y = iris.target

# Hold out 30% of the samples for testing
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Train a decision tree with default parameters
dt_classifier = DecisionTreeClassifier(random_state=42)
dt_classifier.fit(X_train, y_train)
y_pred = dt_classifier.predict(X_test)

accuracy = accuracy_score(y_test, y_pred)
print("Decision Tree Classification Results(Default Parameters):")
print(f"Accuracy:{accuracy * 100:.2f}%")
print("\nClassification Report:")
print(classification_report(y_test, y_pred))

# Visualize the default tree
# list(...) because some scikit-learn releases reject a NumPy array for class_names
plt.figure(figsize=(15, 10))
plot_tree(
    dt_classifier,
    filled=True,
    feature_names=iris.feature_names,
    class_names=list(iris.target_names),
)
plt.title("Decision Tree Visualization")
plt.show()

# Hyperparameter combinations to evaluate with Grid Search
param_grid = {
    "criterion": ["gini", "entropy"],
    "max_depth": [None, 3, 5, 10],
    "min_samples_split": [2, 5, 10],
    "min_samples_leaf": [1, 2, 4],
}

# 5-fold cross-validated search on the training set
grid_search = GridSearchCV(
    estimator=dt_classifier, param_grid=param_grid, cv=5, scoring="accuracy"
)
grid_search.fit(X_train, y_train)

best_params = grid_search.best_params_
best_model = grid_search.best_estimator_
y_pred_tuned = best_model.predict(X_test)

accuracy_tuned = accuracy_score(y_test, y_pred_tuned)
print("\nDecision Tree Classification Results(Tuned Parameters):")
print(f"Accuracy:{accuracy_tuned * 100:.2f}%")
print(f"Best Parameters: {best_params}")
print("\nClassification Report:")
print(classification_report(y_test, y_pred_tuned))

# Visualize the tuned tree
plt.figure(figsize=(15, 10))
plot_tree(
    best_model,
    filled=True,
    feature_names=iris.feature_names,
    class_names=list(iris.target_names),
)
plt.title("Tuned Decision Tree Visualization")
plt.show()

OUTPUT:
Decision Tree Classification Results(Default Parameters):
Accuracy:100.00%

Classification Report:
              precision    recall  f1-score   support

           0       1.00      1.00      1.00        19
           1       1.00      1.00      1.00        13
           2       1.00      1.00      1.00        13

    accuracy                           1.00        45
   macro avg       1.00      1.00      1.00        45
weighted avg       1.00      1.00      1.00        45

Decision Tree Classification Results(Tuned Parameters):
Accuracy:100.00%
Best Parameters: {'criterion': 'gini', 'max_depth': None, 'min_samples_leaf': 1, 'min_samples_split': 10}

Classification Report:
              precision    recall  f1-score   support

           0       1.00      1.00      1.00        19
           1       1.00      1.00      1.00        13
           2       1.00      1.00      1.00        13

    accuracy                           1.00        45
   macro avg       1.00      1.00      1.00        45
weighted avg       1.00      1.00      1.00        45

5. Here’s an example of using the Decision Tree algorithm for regression in Python. We'll use a synthetic regression dataset and evaluate the model's performance based on metrics such as Mean Squared Error (MSE) and R² score.

import matplotlib.pyplot as plt
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor, plot_tree
from sklearn.metrics import mean_squared_error, r2_score

# Create a synthetic one-feature regression dataset
X, y = make_regression(n_samples=200, n_features=1, noise=15, random_state=42)

# Hold out 30% of the samples for testing
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Train the regressor and predict on the test set
dt_regressor = DecisionTreeRegressor(random_state=42)
dt_regressor.fit(X_train, y_train)
y_pred = dt_regressor.predict(X_test)

# Evaluate with MSE and R²
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
print("Decision Tree Regression Results:")
print(f"Mean Squared Error (MSE): {mse:.2f}")
print(f"R² Score: {r2:.2f}")

# Visualize the fitted tree
plt.figure(figsize=(12, 8))
plot_tree(dt_regressor, filled=True, feature_names=["Feature"], rounded=True)
plt.title("Decision Tree Visualization")
plt.show()

# Compare actual and predicted targets over the feature axis
plt.figure(figsize=(8, 6))
plt.scatter(X_test, y_test, color="blue", label="Actual Values")
plt.scatter(X_test, y_pred, color="red", label="Predicted Values")
plt.title("Decision Tree Regression: Predictions vs Actual Values")
plt.xlabel("Feature")
plt.ylabel("Target")
plt.legend()
plt.show()

OUTPUT:

Decision Tree Regression Results:
Mean Squared Error (MSE): 527.88
R² Score: 0.94

6. Here’s a demonstration of the Naïve Bayes Classification algorithm using Python. We'll use the Gaussian Naïve Bayes model from sklearn and apply it to the Iris dataset to classify different species of flowers.

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score, classification_report

# Load the Iris dataset
iris = load_iris()
X, y = iris.data, iris.target

# Hold out 20% of the samples for testing
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Train the Gaussian Naïve Bayes classifier and predict on the test set
nb_classifier = GaussianNB()
nb_classifier.fit(X_train, y_train)
y_pred = nb_classifier.predict(X_test)

# Evaluate with accuracy and a per-class report
accuracy = accuracy_score(y_test, y_pred)
report = classification_report(y_test, y_pred, target_names=iris.target_names)
print(f"Accuracy: {accuracy:.2f}")
print("Classification Report:\n", report)

OUTPUT:
Accuracy: 1.00
Classification Report:
              precision    recall  f1-score   support

      setosa       1.00      1.00      1.00        10
  versicolor       1.00      1.00      1.00         9
   virginica       1.00      1.00      1.00        11

    accuracy                           1.00        30
   macro avg       1.00      1.00      1.00        30
weighted avg       1.00      1.00      1.00        30


Common questions


Cross-validation, such as k-fold, gives a more reliable estimate of how well a model generalizes and makes overfitting to a single train/test split easier to detect. It splits the dataset into k parts, then iteratively trains on k-1 parts and validates on the remaining one, so that every part serves as validation data exactly once. Comparing the k scores shows whether a classifier or regressor performs well across different partitions of the data rather than on one subset only.
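The splitting scheme described above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not scikit-learn's implementation: `k_fold_indices` is a hypothetical helper name, and the folds are contiguous and unshuffled.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, validation_indices) for each of k contiguous folds.

    Hypothetical helper for illustration; the folds are not shuffled.
    """
    # Spread any remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        stop = start + size
        validation = list(range(start, stop))
        train = list(range(0, start)) + list(range(stop, n_samples))
        yield train, validation
        start = stop


folds = list(k_fold_indices(n_samples=10, k=3))
for train, validation in folds:
    print(f"train={train} validation={validation}")
```

Each index lands in exactly one validation fold, which is the property that lets every sample contribute to the evaluation once.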

Variance and standard deviation quantify how far the data spread around the mean. Variance is the average squared deviation from the mean. Standard deviation is the square root of the variance, so it is expressed in the same units as the data and is easier to interpret. Both measures help in assessing data variability.
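As a cross-check, the same quantities can be computed with Python's standard `statistics` module. The sketch below reuses the sample input from program 2 and assumes the population form (dividing by n), which is what that program computes; the sample form (dividing by n - 1) is shown only for contrast.

```python
import math
import statistics

# Sample input from program 2
data = [10.0, 12.0, 16.0, 20.0, 23.0, 26.0, 29.0, 45.0]

# Population variance divides the summed squared deviations by n
population_variance = statistics.pvariance(data)
# Sample variance divides by n - 1 instead
sample_variance = statistics.variance(data)

# Standard deviation is the square root of the variance
population_std = math.sqrt(population_variance)

print(f"Population variance: {population_variance}")
print(f"Sample variance: {sample_variance}")
print(f"Population standard deviation: {population_std}")
```

The population figures agree with the output listed for program 2; the sample variance is larger because the divisor is smaller.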

The 'random_state' parameter makes experiments reproducible by fixing the seed of the random number generator used for dataset splitting and for randomized choices during tree construction. With 'random_state' set, the same partitions and tree structures are produced on every run, which allows results to be compared and debugged. Reported results can then be traced to a specific configuration rather than to random variation.
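The effect of a fixed seed can be illustrated with the standard library alone. This is only an analogue of what 'random_state' does, not scikit-learn's code: `seeded_split` is a hypothetical helper, and it takes the test-set size as a count rather than a fraction.

```python
import random


def seeded_split(items, n_test, seed):
    """Hypothetical helper: shuffle a copy with a seeded generator, then split."""
    rng = random.Random(seed)  # private generator, independent of global state
    shuffled = list(items)
    rng.shuffle(shuffled)
    return shuffled[n_test:], shuffled[:n_test]


# Two calls with the same seed produce the same partition
train_a, test_a = seeded_split(range(10), n_test=3, seed=42)
train_b, test_b = seeded_split(range(10), n_test=3, seed=42)

print(test_a == test_b and train_a == train_b)
```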

Accuracy, the fraction of correct predictions, is a straightforward metric for evaluating classification performance. For both Decision Trees and Naive Bayes, the high accuracy observed suggests effective classification. However, accuracy can be misleading on imbalanced datasets, where it may not reflect predictive performance across all classes, so additional metrics such as precision, recall, and F1-score are needed for a comprehensive evaluation.
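A small made-up example shows how accuracy can mislead on imbalanced data. The labels below are invented for illustration, the "model" simply predicts the majority class every time, and the helper functions are hypothetical plain-Python versions of the metrics rather than library calls.

```python
# Invented imbalanced labels: 95 negatives (0) and 5 positives (1)
y_true = [0] * 95 + [1] * 5
# A degenerate "model" that always predicts the majority class
y_pred = [0] * 100


def accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def precision(y_true, y_pred, positive=1):
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    # Report 0.0 when nothing was predicted positive
    return tp / (tp + fp) if (tp + fp) else 0.0


def recall(y_true, y_pred, positive=1):
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    return tp / (tp + fn) if (tp + fn) else 0.0


acc = accuracy(y_true, y_pred)
prec = precision(y_true, y_pred)
rec = recall(y_true, y_pred)
print(f"Accuracy: {acc:.2f}, Precision: {prec:.2f}, Recall: {rec:.2f}")
```

The accuracy is high even though no positive sample is ever found, which is why recall on the minority class is worth reporting alongside it.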

Plotting actual vs predicted values supports regression model evaluation by visually indicating prediction accuracy and bias. Closer alignment to the identity line suggests high accuracy, while deviations reveal systematic errors or variance. This method enhances understanding of outliers, model fit, and potential regions for improvement, complementing quantitative metrics like MSE and R².

In KNN, the choice of 'k' significantly impacts model performance by affecting bias and variance. A smaller 'k' fits the training data closely, which reduces bias but increases variance and sensitivity to noise, and can result in overfitting. Conversely, a larger 'k' introduces more bias by smoothing the decision boundary, which can underfit but is less sensitive to noise. For classification this affects accuracy, while for regression it affects the mean squared error.
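The sensitivity of a small 'k' to noise can be seen in a tiny pure-Python sketch. The one-dimensional data set is invented for illustration and contains one deliberately mislabeled point; `knn_predict` is a hypothetical helper, not the scikit-learn classifier used in program 3.

```python
from collections import Counter


def knn_predict(train_points, train_labels, query, k):
    """Majority vote among the k training points closest to query (1-D distance)."""
    order = sorted(range(len(train_points)), key=lambda i: abs(train_points[i] - query))
    votes = Counter(train_labels[i] for i in order[:k])
    return votes.most_common(1)[0][0]


# Invented data: class "a" lies around 1-3 and class "b" around 7-9,
# plus one noisy "b" point at 2.2 inside the "a" cluster.
points = [1.0, 1.5, 2.0, 2.2, 2.5, 3.0, 7.0, 7.5, 8.0, 8.5, 9.0]
labels = ["a", "a", "a", "b", "a", "a", "b", "b", "b", "b", "b"]

query = 2.25
prediction_k1 = knn_predict(points, labels, query, k=1)
prediction_k5 = knn_predict(points, labels, query, k=5)
print(f"k=1 -> {prediction_k1}, k=5 -> {prediction_k5}")
```

With k=1 the query copies the label of the single noisy neighbor, while with k=5 the surrounding "a" points outvote it.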

Model evaluation in decision tree regression uses MSE to quantify the average squared difference between predicted and actual values. A lower MSE means the predictions are closer to the actual values. The R² score measures explanatory power: the proportion of variance in the dependent variable that the model explains. The R² of 0.94 reported for program 5 means the model explains about 94% of the variance in the test targets.
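Both metrics are short formulas, and writing them out by hand can make the definitions concrete. The sketch below uses invented values; `mse` and `r_squared` are hypothetical helper names that follow the usual definitions (R² = 1 - SS_res / SS_tot), not the scikit-learn functions themselves.

```python
def mse(y_true, y_pred):
    """Mean of the squared differences between actual and predicted values."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def r_squared(y_true, y_pred):
    """1 minus the ratio of residual to total sum of squares."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


# Invented values for illustration
y_true = [3.0, 5.0, 7.0, 9.0]
y_pred = [2.0, 5.0, 8.0, 9.0]

error = mse(y_true, y_pred)
score = r_squared(y_true, y_pred)
print(f"MSE: {error:.2f}, R²: {score:.2f}")
```

Here two of the four predictions are off by one unit, giving an MSE of 0.5, and the residual sum of squares is one tenth of the total, giving an R² of 0.9.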

Grid Search systematically tests combinations of hyperparameters, such as 'criterion', 'max_depth', 'min_samples_split', and 'min_samples_leaf' in decision tree classifiers, and selects the combination with the best cross-validated score. This exhaustive search can uncover configurations that yield higher accuracy. In the run shown for program 4, however, the tuned model reached the same 100% test accuracy as the default parameters, so Grid Search confirmed the result rather than improving on it.
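To see what "exhaustive" means here, the grid from program 4 can be expanded by hand with `itertools.product`. This sketch only enumerates the combinations and counts the cross-validation fits for cv=5; it does not train anything, and it leaves out the final refit on the full training set that GridSearchCV performs by default.

```python
from itertools import product

# Same grid as in program 4
param_grid = {
    "criterion": ["gini", "entropy"],
    "max_depth": [None, 3, 5, 10],
    "min_samples_split": [2, 5, 10],
    "min_samples_leaf": [1, 2, 4],
}

# One dict per candidate configuration
names = list(param_grid)
combinations = [dict(zip(names, values)) for values in product(*param_grid.values())]

cv_folds = 5
total_fits = len(combinations) * cv_folds

print(f"Candidates: {len(combinations)}")
print(f"Cross-validation fits: {total_fits}")
print(f"First candidate: {combinations[0]}")
```

The count grows multiplicatively with every added parameter value, which is the main cost to keep in mind when widening a grid.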

A Decision Tree for regression predicts continuous numeric outcomes by partitioning the feature space into regions with homogeneous target values, choosing splits with criteria such as Mean Squared Error. A classification tree instead predicts class labels, splitting the data to minimize impurity measures such as Gini or entropy so that each node contains mostly one class. The distinction lies in the output type: continuous for regression and categorical for classification.
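The two kinds of node score can be written out in a few lines. This is a simplified sketch of the quantities only, with hypothetical helper names and invented inputs; it does not search for splits the way a real tree learner does.

```python
from collections import Counter


def gini_impurity(labels):
    """Classification node score: 1 minus the sum of squared class proportions."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def node_mse(targets):
    """Regression node score: mean squared deviation from the node's mean target."""
    mean = sum(targets) / len(targets)
    return sum((t - mean) ** 2 for t in targets) / len(targets)


pure_node = gini_impurity(["a", "a", "a", "a"])
mixed_node = gini_impurity(["a", "a", "b", "b"])
constant_targets = node_mse([2.0, 2.0, 2.0])
spread_targets = node_mse([1.0, 3.0])

print(f"Gini: pure={pure_node}, mixed={mixed_node}")
print(f"MSE: constant={constant_targets}, spread={spread_targets}")
```

In both cases a score of zero means the node is perfectly homogeneous, which is what splitting tries to approach.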

Measures of central tendency (mean, median, and mode) summarize where a data distribution is centered. The mean is the average value and reflects every observation. The median is the middle value and is robust against outliers, especially in skewed distributions. The mode is the most frequent value. Comparing the three also hints at the shape of the distribution: a mean far from the median suggests skew or outliers.
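The robustness of the median can be checked with the standard `statistics` module. The sketch below reuses the sample list from program 1 and appends one invented outlier (1000) purely for illustration; `statistics.multimode` requires Python 3.8 or later.

```python
import statistics

# Sample data from program 1
data = [1, 2, 3, 4, 5, 6, 6, 7, 8, 8, 8]
# Same data with one invented extreme value
with_outlier = data + [1000]

mean_before = statistics.mean(data)
mean_after = statistics.mean(with_outlier)
median_before = statistics.median(data)
median_after = statistics.median(with_outlier)
modes = statistics.multimode(data)

print(f"Mean: {mean_before:.2f} -> {mean_after:.2f}")
print(f"Median: {median_before} -> {median_after}")
print(f"Mode(s): {modes}")
```

The single outlier pulls the mean far from the bulk of the data, while the median stays at the same value.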
