1. Write programs in Python using NumPy library to do the following: a.
Compute the mean, standard
deviation, and variance of a two dimensional random integer array along the second axis.
b. Create a 2-dimensional array of size m x n integer elements, also print the shape, type and data type of
the array and then reshape it into an n x m array, where n and m are user inputs given at the run time.
c. Test whether the elements of a given 1D array are zero, non-zero and NaN. Record the indices of these
elements in three separate arrays.
d. Create three random arrays of the same size: Array1, Array2 and Array3. Subtract Array 2 from Array3
and store in Array4. Create another array Array5 having two times the values in Array1. Find Co-variance and
Correlation of Array1 with Array4 and Array5 respectively.
e. Create two random arrays of the same size 10: Array1, and Array2. Find the sum of the first half of both
the arrays and product of the second half of both the arrays.
2. Do the following using PANDAS Series:
a. Create a series with 5 elements. Display the series sorted on index and also sorted on values
seperately
b. Create a series with N elements with some duplicate values. Find the minimum and maximum
ranks assigned to the values using ‘first’ and ‘max’ methods
c. Display the index value of the minimum and maximum element of a Series
import pandas as pd
# a. Create a series with 5 elements
s1 = [Link]([45, 12, 78, 23, 56], index=['e', 'a', 'd', 'b', 'c'])
print("Original Series:")
print(s1)
# Sorted by index
print("\nSeries Sorted by Index:")
print(s1.sort_index())
# Sorted by values
print("\nSeries Sorted by Values:")
print(s1.sort_values())
# b. Create a series with N elements containing duplicate values
s2 = [Link]([50, 20, 50, 40, 20, 60, 40])
print("\nSeries with Duplicate Values:")
print(s2)
# Ranking using 'first' method
print("\nRank using 'first' method:")
print([Link](method='first'))
# Ranking using 'max' method
print("\nRank using 'max' method:")
print([Link](method='max'))
# c. Display index of minimum and maximum element
s3 = [Link]([34, 67, 12, 89, 45], index=['A', 'B', 'C', 'D', 'E'])
print("\nSeries:")
print(s3)
print("\nIndex of Minimum Value:", [Link]())
print("Index of Maximum Value:", [Link]())
3. Create a data frame having at least 3 columns and 50 rows to store numeric data generated using
a random function. Replace 10% of the values by null values whose index positions are generated using
random function. Do the following: a. Identify and count missing values in a data frame.
b. Drop the column having more than 5 null values.
c. Identify the row label having maximum of the sum of all values in a row and drop that row.
d. Sort the data frame on the basis of the first column.
e. Remove all duplicates from the first column.
f. Find the correlation between first and second column and covariance between second and third
column.
g. Discretize the second column and create 5 bins.
import pandas as pd
import numpy as np
# Create DataFrame with 50 rows and 3 columns using random numbers
[Link](1)
df = [Link]([Link](1,100,size=(50,3)),
columns=['Col1','Col2','Col3'])
print("Original DataFrame:")
print(df)
# Replace 10% values with NaN at random positions
for i in range(int([Link] * 0.1)): # 10% of total values
r = [Link](0,50)
c = [Link](0,3)
[Link][r,c] = [Link]
print("\nDataFrame after inserting NaN values:")
print(df)
# a. Identify and count missing values
print("\nMissing values in each column:")
print([Link]().sum())
# b. Drop column having more than 5 null values
df = [Link][:, [Link]().sum() <= 5]
print("\nDataFrame after dropping columns with more than 5 null values:")
print(df)
# c. Find row with maximum row sum and drop it
row_label = [Link](axis=1).idxmax()
print("\nRow with maximum sum:", row_label)
df = [Link](row_label)
print("\nDataFrame after dropping that row:")
print(df)
# d. Sort the DataFrame by first column
df = df.sort_values(by=[Link][0])
print("\nSorted DataFrame by first column:")
print(df)
# e. Remove duplicates from first column
df = df.drop_duplicates(subset=[Link][0])
print("\nAfter removing duplicates from first column:")
print(df)
# f. Correlation between first and second column
if len([Link]) >= 2:
print("\nCorrelation between first and second column:")
print(df[[Link][0]].corr(df[[Link][1]]))
# Covariance between second and third column
if len([Link]) >= 3:
print("\nCovariance between second and third column:")
print(df[[Link][1]].cov(df[[Link][2]]))
# g. Discretize second column into 5 bins
if len([Link]) >= 2:
bins = [Link](df[[Link][1]], 5)
print("\nDiscretized second column into 5 bins:")
print(bins)
4. Consider two excel files having attendance of two workshos. Each file has three fields ‘Name’,
‘Date, duration (in minutes) where names are unique within a file. Note that duration may take one of three
values (30, 40, 50) only. Import the data into two data frames and do the following: a. Perform merging of
the two data frames to find the names of students who had attended both workshops.
b. Find names of all students who have attended a single workshop only.
c. Merge two data frames row-wise and find the total number of records in the data frame.
d. Merge two data frames row-wise and use two columns viz. names and dates as multi-row indexes.
Generate descriptive statistics for this hierarchical data frame.
import pandas as pd
# Import Excel files into DataFrames
df1 = pd.read_excel("[Link]")
df2 = pd.read_excel("[Link]")
print("Workshop 1 Data:")
print(df1)
print("\nWorkshop 2 Data:")
print(df2)
# a. Students who attended both workshops
both = [Link](df1, df2, on="Name")
print("\nStudents who attended both workshops:")
print(both["Name"])
# b. Students who attended only one workshop
only_df1 = df1[~df1["Name"].isin(df2["Name"])]
only_df2 = df2[~df2["Name"].isin(df1["Name"])]
single = [Link]([only_df1, only_df2])
print("\nStudents who attended only one workshop:")
print(single["Name"])
# c. Merge row-wise and count total records
combined = [Link]([df1, df2], axis=0)
print("\nTotal number of records after row-wise merge:")
print(len(combined))
# d. Row-wise merge and create multi-row index (Name, Date)
hier_df = combined.set_index (["Name", "Date"])
print("\nHierarchical DataFrame:")
print(hier_df)
print("\nDescriptive Statistics:")
print(hier_df.describe())
5. Using Iris data, plot the following with proper legend and axis labels: (Download IRIS data from:
[Link] or import it from sklearn datasets)
a. Plot bar chart to show the frequency of each class label in the data.
b. Draw a scatter plot for Petal width vs sepal width and fit a regression line
c. Plot density distribution for feature petal length.
d. Use a pair plot to show pairwise bivariate distribution in the Iris Dataset.
e. Draw heatmap for the four numeric attributes
f. Compute mean, mode, median, standard deviation, confidence interval and standard error for
each feature
g. Compute correlation coefficients between each pair of features and plot heatmap
import pandas as pd
import seaborn as sns
import [Link] as plt
from [Link] import load_iris
from scipy import stats
import numpy as np
# Load Iris dataset
iris = load_iris(as_frame=True)
df = [Link]
df['species'] = df['target'].map(dict(enumerate(iris.target_names)))
print("First 5 rows of dataset:")
print([Link]())
# a. Bar chart for frequency of each class label
[Link]()
df['species'].value_counts().plot(kind='bar')
[Link]("Frequency of each Iris Class")
[Link]("Species")
[Link]("Frequency")
[Link](["Count"])
[Link]()
# b. Scatter plot: Petal width vs Sepal width with regression line
[Link]()
[Link](x='sepal width (cm)', y='petal width (cm)', data=df)
[Link]("Petal Width vs Sepal Width")
[Link]("Sepal Width")
[Link]("Petal Width")
[Link]()
# c. Density distribution for Petal Length
[Link]()
[Link](df['petal length (cm)'], fill=True)
[Link]("Density Distribution of Petal Length")
[Link]("Petal Length")
[Link]()
# d. Pair plot
[Link](df, hue="species")
[Link]()
# e. Heatmap for numeric attributes
[Link]()
numeric_df = [Link][:,0:4]
[Link](numeric_df.corr(), annot=True)
[Link]("Heatmap of Numeric Attributes")
[Link]()
# f. Statistical measures
print("\nStatistical Measures:")
for col in numeric_df.columns:
mean = numeric_df[col].mean()
median = numeric_df[col].median()
mode = numeric_df[col].mode()[0]
std = numeric_df[col].std()
sem = [Link](numeric_df[col])
ci = [Link](0.95, len(numeric_df[col])-1, loc=mean, scale=sem)
print(f"\nFeature: {col}")
print("Mean:", mean)
print("Median:", median)
print("Mode:", mode)
print("Standard Deviation:", std)
print("Standard Error:", sem)
print("95% Confidence Interval:", ci)
# g. Correlation coefficients and heatmap
corr = numeric_df.corr()
print("\nCorrelation Matrix:")
print(corr)
[Link]()
[Link](corr, annot=True, cmap="coolwarm")
[Link]("Correlation Heatmap of Iris Features")
[Link]()
6. Consider the following data frame containing a family name, gender of the family member and
her/his monthly income in each record.
Write a program in Python using Pandas to perform the following:
a. Calculate and display familywise gross monthly income.
b. Calculate and display the member with the highest monthly income.
c. Calculate and display monthly income of all members with income greater than Rs. 60000.00.
d. Calculate and display the average monthly income of the female members
a. Find total number of passengers with age less than 30
b. Find total fare paid by passengers of first class
c. Compare number of survivors of each passenger class
d. Compute descriptive statistics for any numeric attribute gender wise
Name Gender MonthlyIncome
Shah Male 114000.00
Vats Male 65000.00
Vats Female 43150.00
Kumar Female 69500.00
Vats Female 155000.00
Kumar Male 103000.00
Shah Male 55000.00
Shah Female 112400.00
Kumar Female 81030.00
Vats Male 71900.00
import pandas as pd
# Creating the DataFrame
data = {
"Name": ["Shah", "Vats", "Vats", "Kumar", "Vats", "Kumar", "Shah", "Shah", "Kumar", "Vats"],
"Gender": ["Male", "Male", "Female", "Female", "Female", "Male", "Male", "Female", "Female", "Male"],
"MonthlyIncome": [114000.00, 65000.00, 43150.00, 69500.00, 155000.00, 103000.00, 55000.00, 112400.00,
81030.00, 71900.00]
df = [Link](data)
# a. Familywise gross monthly income
family_income = [Link]("Name")["MonthlyIncome"].sum()
print("Familywise Gross Monthly Income:")
print(family_income)
# b. Member with the highest monthly income
highest_income = [Link][df["MonthlyIncome"].idxmax()]
print("\nMember with Highest Monthly Income:")
print(highest_income)
# c. Members with income greater than Rs. 60000
income_above_60k = df[df["MonthlyIncome"] > 60000]
print("\nMembers with Monthly Income greater than Rs. 60000:")
print(income_above_60k)
# d. Average monthly income of female members
avg_female_income = df[df["Gender"] == "Female"]["MonthlyIncome"].mean()
print("\nAverage Monthly Income of Female Members:")
print(avg_female_income)
7. Load Titanic dataset from sklearn liabrary and do the following:
a. Find total number of passengers with age less than 30
b. Find total fare paid by passengers of first class
c. Compare number of survivors of each passenger class
d. Compute descriptive statistics for any numeric attribute genderwise
import pandas as pd
from [Link] import fetch_openml
# Load Titanic dataset
titanic = fetch_openml(name='titanic', version=1, as_frame=True)
df = [Link]
print("First 5 records of dataset:")
print([Link]())
# a. Total number of passengers with age < 30
age_less_30 = df[df['age'] < 30]
print("\nTotal passengers with age less than 30:", len(age_less_30))
# b. Total fare paid by passengers of first class
first_class_fare = df[df['pclass'] == 1]['fare'].sum()
print("\nTotal fare paid by passengers of first class:", first_class_fare)
# c. Compare number of survivors of each passenger class
survivors_by_class = df[df['survived'] == 1].groupby('pclass')['survived'].count()
print("\nNumber of survivors in each passenger class:")
print(survivors_by_class)
# d. Descriptive statistics for a numeric attribute (Fare) gender-wise
gender_stats = [Link]('sex')['fare'].describe()
print("\nDescriptive statistics of Fare gender-wise:")
print(gender_stats)