Numpy Pandas

The document provides an introduction to NumPy and Pandas, essential libraries for numerical computing and data manipulation in Python. It covers creating and manipulating NumPy arrays, basic array operations, and introduces Pandas' Series and DataFrame structures. The document includes code examples for installation, array creation, and various operations to facilitate data analysis.

Uploaded by

ibraheeeem0218

Topics to be covered:

NumPy: An introduction to numerical computing with Python, focusing on arrays and mathematical operations.
Pandas: Learning how to work with structured data using DataFrames, including data loading, cleaning, and basic analysis.
Data Import: Understanding how to import various data formats into your Python environment.

Let's get started!

Reasoning: The next step is to install the NumPy and pandas libraries by running pip install in a new code cell.

!pip install numpy pandas

Requirement already satisfied: numpy in /usr/local/lib/python3.12/dist-packages (2.0.2)


Requirement already satisfied: pandas in /usr/local/lib/python3.12/dist-packages (2.2.2)
Requirement already satisfied: python-dateutil>=2.8.2 in /usr/local/lib/python3.12/dist-packages (from pandas) (2.9.0.post0)
Requirement already satisfied: pytz>=2020.1 in /usr/local/lib/python3.12/dist-packages (from pandas) (2025.2)
Requirement already satisfied: tzdata>=2022.7 in /usr/local/lib/python3.12/dist-packages (from pandas) (2025.3)
Requirement already satisfied: six>=1.5 in /usr/local/lib/python3.12/dist-packages (from python-dateutil>=2.8.2->pandas) (1.

These libraries, NumPy and Pandas, are fundamental tools in Python for data science. They provide powerful data structures and
functions essential for numerical computation and data manipulation, and will be used extensively throughout this notebook.

NumPy: Creating and Manipulating Arrays

Subtask:
Provide a comprehensive section on creating NumPy arrays from various sources (lists, arange, zeros, ones, random) and basic array
manipulations. Include at least 5 distinct examples, fulfilling 'Program 1: Introduction to NumPy Arrays'.

Reasoning: The next step is to import the NumPy library and demonstrate the creation of various NumPy arrays as specified in the
instructions for 'Program 1: Introduction to NumPy Arrays'. This includes creating arrays from a Python list and using np.arange(),
np.zeros(), np.ones(), and np.random.rand() or np.random.randint().

import numpy as np

# 1. Create a NumPy array from a Python list
list_data = [1, 2, 3, 4, 5]
array_from_list = np.array(list_data)
print(f"Array from list: {array_from_list}")
print(f"Type of array from list: {type(array_from_list)}\n")

Array from list: [1 2 3 4 5]
Type of array from list: <class 'numpy.ndarray'>

# 2. Create a NumPy array using np.arange()
array_arange = np.arange(10)  # Numbers from 0 to 9
print(f"Array using arange: {array_arange}\n")

array_arange_custom = np.arange(2, 10)  # Numbers from 2 to 9
print(f"Array using arange: {array_arange_custom}\n")

array_arange_custom_step = np.arange(2, 15, 3)  # From 2 up to (not including) 15, in steps of 3
print(f"Array using arange: {array_arange_custom_step}\n")

Array using arange: [0 1 2 3 4 5 6 7 8 9]

Array using arange: [2 3 4 5 6 7 8 9]

Array using arange: [ 2 5 8 11 14]

# 3. Create a NumPy array filled with zeros using np.zeros()
# Shape: (3, 4), dtype: float
array_zeros = np.zeros((3, 4), dtype=float)
print(f"Array of zeros:\n{array_zeros}\n")

Array of zeros:
[[0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]]

# 4. Create a NumPy array filled with ones using np.ones()
# Shape: (2, 3), dtype: int
array_ones = np.ones((2, 3), dtype=int)
print(f"Array of ones:\n{array_ones}\n")

Array of ones:
[[1 1 1]
 [1 1 1]]

# 5. Create a NumPy array with random numbers using np.random.rand()
# A 2x2 array of random floats in [0, 1)
array_random_float = np.random.rand(2, 2)
print(f"Array of random floats:\n{array_random_float}\n")

Array of random floats:


[[0.41323894 0.61809523]
[0.40657137 0.9904206 ]]

# Optional: Create a NumPy array with random integers using np.random.randint()
# A 3x3 array of random integers between 0 (inclusive) and 10 (exclusive)
array_random_int = np.random.randint(0, 10, size=(3, 3))
print(f"Array of random integers:\n{array_random_int}")

Array of random integers:


[[0 0 4]
[4 8 3]
[4 8 0]]
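
The random arrays above change on every run. The following is a minimal sketch of one way to make them reproducible, assuming NumPy's Generator interface (np.random.default_rng) and an arbitrary seed of 42; neither appears in the original notebook.

```python
import numpy as np

# Two generators built from the same (arbitrary) seed produce identical streams.
rng_a = np.random.default_rng(seed=42)
rng_b = np.random.default_rng(seed=42)

floats_a = rng_a.random((2, 2))              # floats in [0, 1)
floats_b = rng_b.random((2, 2))
ints_a = rng_a.integers(0, 10, size=(3, 3))  # integers in [0, 10)
ints_b = rng_b.integers(0, 10, size=(3, 3))

print(np.array_equal(floats_a, floats_b))  # True
print(np.array_equal(ints_a, ints_b))      # True
```

Passing the generator object around, rather than relying on global random state, keeps each experiment's randomness explicit.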

Basic Array Operations: Joining, Splitting, Searching, and Sorting

NumPy arrays are highly versatile, allowing for a wide range of operations to manipulate and analyze data efficiently. This section will
cover some fundamental array operations, including combining arrays (joining), dividing arrays (splitting), finding specific elements
(searching), and ordering elements (sorting).

Reasoning: Following the introduction to basic array operations, the next step is to provide code examples demonstrating joining,
splitting, searching, and sorting NumPy arrays to fulfill the 'basic array manipulations' part of the subtask.

import numpy as np

# --- 1. Joining Arrays ---
print("\n--- Joining Arrays ---")
arr1 = np.array([1, 2, 3])
arr2 = np.array([4, 5, 6])

# Concatenate 1D arrays
joined_array_1d = np.concatenate((arr1, arr2))
print(f"Concatenated 1D array: {joined_array_1d}")

--- Joining Arrays ---


Concatenated 1D array: [1 2 3 4 5 6]

arr_a = np.array([[1, 2], [3, 4]])
arr_b = np.array([[5, 6], [7, 8]])

# Stacking arrays vertically
stacked_vertical = np.vstack((arr_a, arr_b))
print(f"\nVertically stacked arrays:\n{stacked_vertical}")

# Stacking arrays horizontally
stacked_horizontal = np.hstack((arr_a, arr_b))
print(f"\nHorizontally stacked arrays:\n{stacked_horizontal}")

Vertically stacked arrays:


[[1 2]
[3 4]
[5 6]
[7 8]]

Horizontally stacked arrays:


[[1 2 5 6]
[3 4 7 8]]

# --- 2. Splitting Arrays ---
print("\n--- Splitting Arrays ---")
long_array = np.arange(1, 10)

# Split into 3 equal parts
split_array_equal = np.array_split(long_array, 3)
print(f"Split into 3 equal parts: {split_array_equal}")

# Split a 2D array vertically
matrix = np.array([[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]])
split_matrix_v = np.vsplit(matrix, 3)  # Split into 3 rows
print(f"\nVertically split matrix (into rows):\n{split_matrix_v}")

split_matrix_h = np.hsplit(matrix, 2)  # Split into 2 column blocks
print(f"\nHorizontally split matrix (into cols):\n{split_matrix_h}")

--- Splitting Arrays ---
Split into 3 equal parts: [array([1, 2, 3]), array([4, 5, 6]), array([7, 8, 9])]

Vertically split matrix (into rows):
[array([[1, 2, 3, 4]]), array([[5, 6, 7, 8]]), array([[ 9, 10, 11, 12]])]

Horizontally split matrix (into cols):
[array([[ 1,  2],
       [ 5,  6],
       [ 9, 10]]), array([[ 3,  4],
       [ 7,  8],
       [11, 12]])]

# --- 3. Searching Arrays ---
print("\n--- Searching Arrays ---")
search_array = np.array([10, 20, 30, 40, 50, 20])

# Find indices where value is 20
search_result = np.where(search_array == 20)
print(f"Indices where value is 20: {search_result}")

--- Searching Arrays ---


Indices where value is 20: (array([1, 5]),)
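
The output above is a tuple because np.where(condition) returns one index array per dimension. A short sketch, reusing the same values, of how to unpack that result and of the three-argument form of np.where (the variable names indices and capped are illustrative only):

```python
import numpy as np

search_array = np.array([10, 20, 30, 40, 50, 20])

# np.where(condition) returns a tuple; element 0 holds the indices for a 1D array.
indices = np.where(search_array == 20)[0]
print(indices)  # [1 5]

# np.flatnonzero gives the same indices directly.
print(np.flatnonzero(search_array == 20))  # [1 5]

# Three-argument form: take 30 where the condition holds, else the original value.
capped = np.where(search_array > 30, 30, search_array)
print(capped)  # [10 20 30 30 30 20]
```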

# --- 4. Sorting Arrays ---
print("\n--- Sorting Arrays ---")
unsorted_array = np.array([3, 1, 4, 1, 5, 9, 2, 6])

# Sort the array
sorted_array = np.sort(unsorted_array)
print(f"Sorted array: {sorted_array}")

# Sort a 2D array along axis=1 (the values within each row are sorted)
unsorted_matrix = np.array([[3, 2, 1], [6, 5, 4]])
sorted_matrix_axis1 = np.sort(unsorted_matrix, axis=1)
print(f"\nSorted 2D array (axis=1):\n{sorted_matrix_axis1}")

--- Sorting Arrays ---


Sorted array: [1 1 2 3 4 5 6 9]

Sorted 2D array (axis=1):


[[1 2 3]
[4 5 6]]

Array Attributes, Indexing, Slicing, and Iterating

Beyond creation and basic manipulation, understanding NumPy array attributes, how to access specific elements (indexing), extract
subarrays (slicing), and process elements one by one (iterating) is crucial for effective data handling. These operations are fundamental
for data analysis and scientific computing.
Reasoning: Following the introduction to array attributes, indexing, slicing, and iterating, the next step is to provide code examples
demonstrating these concepts with various NumPy arrays.

import numpy as np

# Sample array for demonstration
arr = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
print(f"Original Array:\n{arr}\n")

# --- 1. Array Attributes ---
print("--- Array Attributes ---")
print(f"Shape of array (dimensions): {arr.shape}")  # (rows, columns)
print(f"Number of dimensions (rank): {arr.ndim}")
print(f"Total number of elements: {arr.size}")
print(f"Data type of elements: {arr.dtype}\n")

Original Array:
[[1 2 3]
[4 5 6]
[7 8 9]]

--- Array Attributes ---


Shape of array (dimensions): (3, 3)
Number of dimensions (rank): 2
Total number of elements: 9
Data type of elements: int64

# --- 2. Indexing ---


print("--- Indexing ---")
# Access a single element (row, column)
print(f"Element at (0, 0): {arr[0, 0]}")
print(f"Element at (1, 2): {arr[1, 2]}\n")

--- Indexing ---


Element at (0, 0): 1
Element at (1, 2): 6

# --- 3. Slicing ---


print("--- Slicing ---")
# Get the first row
print(f"First row: {arr[0, :]}")
# Get the last column
print(f"Last column: {arr[:, -1]}")
# Get a sub-array (rows 0-1, columns 1-2)
print(f"Sub-array (rows 0-1, cols 1-2):\n{arr[0:2, 1:3]}\n")

--- Slicing ---


First row: [1 2 3]
Last column: [3 6 9]
Sub-array (rows 0-1, cols 1-2):
[[2 3]
[5 6]]

# --- 4. Iterating ---
print("--- Iterating ---")
print("Iterating through rows:")
for row in arr:
    print(row)

print("\nIterating through flattened array (element by element):")
for element in arr.flat:
    print(element, end=' ')
print("\n")

# Conditional Indexing (Boolean Masking)
print("--- Conditional Indexing ---")
conditional_array = np.arange(10, 20).reshape(2, 5)
print(f"Original conditional array:\n{conditional_array}")
# Select elements greater than 15
mask = conditional_array > 15
print(f"Elements > 15: {conditional_array[mask]}\n")

--- Iterating ---


Iterating through rows:
[1 2 3]
[4 5 6]
[7 8 9]
Iterating through flattened array (element by element):
1 2 3 4 5 6 7 8 9

--- Conditional Indexing ---


Original conditional array:
[[10 11 12 13 14]
[15 16 17 18 19]]
Elements > 15: [16 17 18 19]
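
Boolean masks can be combined, and they behave differently from slices when the result is modified. The sketch below is illustrative rather than part of the original notebook; it uses the same 2x5 array of 10..19 under the assumed name values.

```python
import numpy as np

values = np.arange(10, 20).reshape(2, 5)

# Combine conditions with & (and) / | (or); each condition needs its own parentheses.
between = values[(values > 12) & (values < 17)]
print(between)  # [13 14 15 16]

# A mask can also be used to assign in place (done on a copy here).
clipped = values.copy()
clipped[clipped > 15] = 15
print(clipped)

# Basic slices are views: writing through the slice changes the original array.
view = values[0, :2]
view[0] = -1
print(values[0, 0])  # -1

# Boolean-mask selection returns a copy: the original is unaffected.
selected = values[values > 15]
selected[0] = 999
print(values.max())  # 19
```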

Pandas: Overview, Series, and DataFrame Creation

Subtask:
Introduce the pandas library, explain Series and DataFrames, and demonstrate various ways to create them (from dictionaries, lists,
NumPy arrays, CSVs).

Pandas: Overview, Series, and DataFrame Creation

Pandas is a powerful, open-source data analysis and manipulation library for Python. It is built on top of the NumPy library and provides
data structures and operations for manipulating numerical tables and time series.

Key Data Structures:


1. Series: A one-dimensional labeled array capable of holding any data type (integers, strings, floats, Python objects, etc.). It's
essentially a column in a spreadsheet or a SQL table. Each element has an associated label, called an index.

2. DataFrame: A two-dimensional labeled data structure with columns of potentially different types. You can think of it like a
spreadsheet, a SQL table, or a dictionary of Series objects. It is the most commonly used pandas object and is ideal for
representing tabular data.

Pandas is crucial for data science because it simplifies tasks such as data loading, cleaning, manipulation, and analysis, making it an
indispensable tool for working with structured data.
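
The description of a DataFrame as "a dictionary of Series objects" can be made concrete with a small sketch. The names and values below are illustrative assumptions, not data from the notebook.

```python
import pandas as pd

# Two Series sharing the same index labels (illustrative data).
ages = pd.Series([24, 27, 22], index=['alice', 'bob', 'charlie'])
cities = pd.Series(['New York', 'Chicago', 'Houston'], index=['alice', 'bob', 'charlie'])

# A DataFrame can be built from a dictionary of Series; the keys become column names.
people = pd.DataFrame({'age': ages, 'city': cities})
print(people)

# Selecting one column gives back a Series with the same index.
age_column = people['age']
print(type(age_column).__name__)  # Series
print(age_column['bob'])          # 27
```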

Reasoning: The next step is to import the pandas library and demonstrate the creation of various Pandas data structures (Series and
DataFrames) from different sources as outlined in the instructions.

import pandas as pd
import numpy as np

print("--- Pandas Library Imported ---")

# 1. Create a Pandas Series from a Python list
series_data = [10, 20, 30, 40, 50]
labels = ['A', 'B', 'C', 'D', 'E']
my_series = pd.Series(series_data)
my_series_l = pd.Series(series_data, index=labels)
print(f"\nSeries from list:\n{my_series}")
print(f"\nSeries from list with labels:\n{my_series_l}")
print(f"Type of Series: {type(my_series)}\n")

--- Pandas Library Imported ---

Series from list:
0    10
1    20
2    30
3    40
4    50
dtype: int64

Series from list with labels:
A    10
B    20
C    30
D    40
E    50
dtype: int64
Type of Series: <class 'pandas.core.series.Series'>

my_series_l['A':'D']

A    10
B    20
C    30
D    40
dtype: int64
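
Note that the label slice 'A':'D' includes 'D'. A brief sketch contrasting label-based slicing (.loc) with position-based slicing (.iloc), using a Series like the one above under the assumed name s:

```python
import pandas as pd

s = pd.Series([10, 20, 30, 40, 50], index=['A', 'B', 'C', 'D', 'E'])

by_label = s.loc['A':'D']   # label slice: the end label 'D' is included
by_position = s.iloc[0:3]   # positional slice: position 3 is excluded

print(list(by_label.index))     # ['A', 'B', 'C', 'D']
print(list(by_position.index))  # ['A', 'B', 'C']
```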

# 2. Create a Pandas DataFrame from a Python dictionary
dict_data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'David'],
    'Age': [24, 27, 22, 32],
    'City': ['New York', 'Los Angeles', 'Chicago', 'Houston']
}
df_from_dict = pd.DataFrame(dict_data)
print(f"DataFrame from dictionary:\n{df_from_dict}\n")

DataFrame from dictionary:


Name Age City
0 Alice 24 New York
1 Bob 27 Los Angeles
2 Charlie 22 Chicago
3 David 32 Houston

# 3. Create a Pandas DataFrame from a list of lists or a 2D NumPy array

# From a list of lists
list_of_lists_data = [['Red', 1], ['Green', 2], ['Blue', 3]]
df_from_list_of_lists = pd.DataFrame(list_of_lists_data, columns=['Color', 'Value'])
print(f"DataFrame from list of lists:\n{df_from_list_of_lists}\n")

# From a 2D NumPy array
# Note: a NumPy array holds a single dtype, so the quantities below are stored as strings.
numpy_array_data = np.array([['Apple', 10], ['Banana', 20], ['Cherry', 15]])
df_from_numpy_array = pd.DataFrame(numpy_array_data, columns=['Fruit', 'Quantity'])
print(f"DataFrame from NumPy array:\n{df_from_numpy_array}\n")

DataFrame from list of lists:


Color Value
0 Red 1
1 Green 2
2 Blue 3

DataFrame from NumPy array:


Fruit Quantity
0 Apple 10
1 Banana 20
2 Cherry 15
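
Because a NumPy array holds a single dtype, the mixed text-and-number array above stores the quantities as strings. The following sketch, which rebuilds the same data under the assumed names fruit_array and fruit_df, shows one way to convert the column before doing arithmetic:

```python
import numpy as np
import pandas as pd

# Mixing text and numbers in one array stores everything as strings.
fruit_array = np.array([['Apple', 10], ['Banana', 20], ['Cherry', 15]])
fruit_df = pd.DataFrame(fruit_array, columns=['Fruit', 'Quantity'])
print(fruit_df['Quantity'].iloc[0] == '10')  # True: the value is the string '10'

# Convert the column to integers before doing arithmetic on it.
fruit_df['Quantity'] = fruit_df['Quantity'].astype(int)
print(fruit_df['Quantity'].sum())  # 45
```

Building the DataFrame from a dictionary or a list of lists avoids the issue, since each column then keeps its own type.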

# 4. Create a Pandas DataFrame by reading a CSV file from a direct URL


# Example: A sample CSV from a GitHub repository
csv_url = '[Link]
try:
    df_from_csv = pd.read_csv(csv_url)
    print(f"DataFrame from CSV (first 5 rows):\n{df_from_csv.head()}\n")
except Exception as e:
    print(f"Error reading CSV from URL: {e}")

Error reading CSV from URL: HTTP Error 404: Not Found

Reasoning: The previous attempt to read a CSV from a URL failed due to a 404 error. To complete the subtask of demonstrating
DataFrame creation from a CSV, I need to provide a valid and accessible CSV URL. I will use a different, publicly available CSV URL.
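
When a remote file is unavailable, as with the 404 error above, pd.read_csv can still be demonstrated without network access. The sketch below assumes a small made-up CSV held in memory (csv_text is hypothetical and unrelated to the notebook's datasets).

```python
import io
import pandas as pd

# Illustrative CSV content held in memory (not the dataset used in the notebook).
csv_text = """name,age,city
Alice,24,New York
Bob,27,Los Angeles
Charlie,22,Chicago
"""

# pd.read_csv accepts any file-like object, not only paths and URLs.
df_local = pd.read_csv(io.StringIO(csv_text))
print(df_local.head())
print(df_local.dtypes)
```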

Pandas: Basic DataFrame Operations

Subtask:
Cover essential DataFrame operations including indexing, re-indexing, dropping entries, selecting entries, and calculating summary
statistics. Provide at least 5 distinct examples, fulfilling:

Program 4: Introduction to pandas Data Frames
Program 5: Handling Missing Data and Summary Statistics

Basic DataFrame Operations

This section delves into the fundamental operations for manipulating Pandas DataFrames. We will explore how to select specific data
using various indexing methods, re-arrange data through re-indexing, remove unwanted data by dropping entries, and gain insights into
the dataset by calculating summary statistics.

Reasoning: The next step is to demonstrate various basic DataFrame operations including indexing, re-indexing, dropping entries,
selecting entries, and calculating summary statistics as outlined in the instructions.

import pandas as pd
import numpy as np

# Assuming df_from_dict and df_from_csv_fixed are already defined from previous cells
# For demonstration purposes, re-creating them if this cell is run independently
try:
    df_from_dict  # Check if df_from_dict exists
    df_from_csv_fixed  # Check if df_from_csv_fixed exists
except NameError:
    print("Re-creating df_from_dict and df_from_csv_fixed for standalone execution.")

    valid_csv_url = '[Link]
    df_from_csv_fixed = pd.read_csv(valid_csv_url)

import pandas as pd

# Try reading a CSV from a different, valid public URL


# Example: Iris dataset from UCI Machine Learning Repository via a raw GitHub link
valid_csv_url = '[Link]

try:
    df_from_csv_fixed = pd.read_csv(valid_csv_url)
    print(f"DataFrame from fixed CSV URL (first 5 rows):\n{df_from_csv_fixed.head()}\n")
except Exception as e:
    print(f"Error reading CSV from the new URL: {e}")

DataFrame from fixed CSV URL (first 5 rows):


sepal_length sepal_width petal_length petal_width species
0 5.1 3.5 1.4 0.2 setosa
1 4.9 3.0 1.4 0.2 setosa
2 4.7 3.2 1.3 0.2 setosa
3 4.6 3.1 1.5 0.2 setosa
4 5.0 3.6 1.4 0.2 setosa

dict_data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'David'],
    'Age': [24, 27, 22, 32],
    'City': ['New York', 'Los Angeles', 'Chicago', 'Houston']
}
df_from_dict = pd.DataFrame(dict_data)
print(df_from_dict)
print("\n--- Basic DataFrame Operations ---\n")

# --- 1. Indexing ---


print("1. Indexing:")
# Select a single column
print(f"\nSingle column 'Name':\n{df_from_dict['Name']}")

# Select multiple columns


print(f"\nMultiple columns 'Name' and 'Age':\n{df_from_dict[['Name', 'Age']]}")

# Select rows by label using .loc[] (first row)


print(f"\nRow by label (index 0) using .loc[]:\n{df_from_dict.loc[0]}")

# Select rows by integer position using .iloc[] (second row)


print(f"\nRow by integer position (index 1) using .iloc[]:\n{df_from_dict.iloc[1]}")

Name Age City


0 Alice 24 New York
1 Bob 27 Los Angeles
2 Charlie 22 Chicago
3 David 32 Houston

--- Basic DataFrame Operations ---

1. Indexing:

Single column 'Name':


0 Alice
1 Bob
2 Charlie
3 David
Name: Name, dtype: object

Multiple columns 'Name' and 'Age':


Name Age
0 Alice 24
1 Bob 27
2 Charlie 22
3 David 32

Row by label (index 0) using .loc[]:


Name Alice
Age 24
City New York
Name: 0, dtype: object

Row by integer position (index 1) using .iloc[]:


Name Bob
Age 27
City Los Angeles
Name: 1, dtype: object
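
With the default 0..n-1 index, .loc and .iloc look interchangeable. A small sketch with a non-default index (the labels 10, 20, 30 and the name people are illustrative assumptions) makes the difference visible:

```python
import pandas as pd

people = pd.DataFrame(
    {'Name': ['Alice', 'Bob', 'Charlie'], 'Age': [24, 27, 22]},
    index=[10, 20, 30],  # illustrative non-default labels
)

print(people.loc[20, 'Name'])  # label 20 -> Bob
print(people.iloc[0, 0])       # first row, first column -> Alice

# Rows and columns can be selected together by label.
subset = people.loc[[10, 30], ['Name']]
print(subset)
```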


# --- 2. Re-indexing ---


print("\n2. Re-indexing:")
# Create a new DataFrame with a custom index
new_index_df = df_from_dict.copy()
new_index_df.index = ['a', 'b', 'c', 'd']
print(f"\nDataFrame with new custom index:\n{new_index_df}")

# Re-ordering existing index (example: reverse order)


reordered_df = df_from_dict.reindex(index=[3, 2, 1, 0])
print(f"\nDataFrame with reordered index:\n{reordered_df}")

2. Re-indexing:

DataFrame with new custom index:


Name Age City
a Alice 24 New York
b Bob 27 Los Angeles
c Charlie 22 Chicago
d David 32 Houston

DataFrame with reordered index:


Name Age City
3 David 32 Houston
2 Charlie 22 Chicago
1 Bob 27 Los Angeles
0 Alice 24 New York
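
Assigning to .index relabels the existing rows, whereas reindex() aligns the data to the labels requested. A minimal sketch, with an assumed two-row DataFrame and a deliberately missing label 'z', shows what happens when a requested label does not exist:

```python
import pandas as pd

df = pd.DataFrame({'Name': ['Alice', 'Bob'], 'Age': [24, 27]}, index=['a', 'b'])

# reindex aligns to the requested labels; 'z' does not exist, so its row is all NaN.
realigned = df.reindex(['b', 'a', 'z'])
print(realigned)

# fill_value supplies a placeholder for labels that were missing.
filled = df.reindex(['b', 'a', 'z'], fill_value='unknown')
print(filled)
```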

# --- 3. Dropping Entries ---


print("\n3. Dropping Entries:")
# Drop a column; drop() returns a new DataFrame and leaves the original unchanged
df_no_city = df_from_dict.drop('City', axis=1)
print(f"\nDataFrame after dropping 'City' column:\n{df_no_city}")

# Drop a row by index label (e.g., index 1, 'Bob')


df_no_bob = df_from_dict.drop(1, axis=0)
print(f"\nDataFrame after dropping row with index 1:\n{df_no_bob}")

3. Dropping Entries:

DataFrame after dropping 'City' column:


Name Age
0 Alice 24
1 Bob 27
2 Charlie 22
3 David 32

DataFrame after dropping row with index 1:


Name Age City
0 Alice 24 New York
2 Charlie 22 Chicago
3 David 32 Houston

# --- 4. Selecting Entries (Filtering) ---


print("\n4. Selecting Entries (from Iris dataset):")
# Filter rows where 'sepal_length' is greater than 5.0
filtered_iris_df = df_from_csv_fixed[df_from_csv_fixed['sepal_length'] > 5.0]
print(f"\nIris DataFrame where 'sepal_length' > 5.0 (first 5 rows):\n{filtered_iris_df.head()}")
print(f"Total rows after filtering: {len(filtered_iris_df)}")

4. Selecting Entries (from Iris dataset):

Iris DataFrame where 'sepal_length' > 5.0 (first 5 rows):


sepal_length sepal_width petal_length petal_width species
0 5.1 3.5 1.4 0.2 setosa
5 5.4 3.9 1.7 0.4 setosa
10 5.4 3.7 1.5 0.2 setosa
14 5.8 4.0 1.2 0.2 setosa
15 5.7 4.4 1.5 0.4 setosa
Total rows after filtering: 118
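
Filters like the one above can be combined. The sketch below uses a small hand-made DataFrame (flowers, with made-up rows in the style of the Iris columns) rather than the downloaded dataset, so it runs without network access.

```python
import pandas as pd

flowers = pd.DataFrame({
    'sepal_length': [5.1, 4.9, 6.3, 5.8, 7.1],
    'species': ['setosa', 'setosa', 'virginica', 'versicolor', 'virginica'],
})

# Combine conditions with & / | and wrap each one in parentheses.
long_virginica = flowers[(flowers['sepal_length'] > 6.0) & (flowers['species'] == 'virginica')]
print(long_virginica)

# isin() tests membership in a list of values.
two_species = flowers[flowers['species'].isin(['setosa', 'versicolor'])]
print(len(two_species))  # 3
```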

# --- 5. Calculating Summary Statistics ---


print("\n5. Calculating Summary Statistics (from Iris dataset):")
# Use .describe() to get an overview of numerical columns
print(f"\nSummary statistics for Iris DataFrame:\n{df_from_csv_fixed.describe()}")

# Calculate the mean of a specific column, e.g., 'petal_width'


mean_petal_width = df_from_csv_fixed['petal_width'].mean()
print(f"\nMean of 'petal_width' column: {mean_petal_width:.2f}")

5. Calculating Summary Statistics (from Iris dataset):

Summary statistics for Iris DataFrame:


sepal_length sepal_width petal_length petal_width
count 150.000000 150.000000 150.000000 150.000000
mean 5.843333 3.057333 3.758000 1.199333
std 0.828066 0.435866 1.765298 0.762238
min 4.300000 2.000000 1.000000 0.100000
25% 5.100000 2.800000 1.600000 0.300000
50% 5.800000 3.000000 4.350000 1.300000
75% 6.400000 3.300000 5.100000 1.800000
max 7.900000 4.400000 6.900000 2.500000

Mean of 'petal_width' column: 1.20

Pandas: Handling Missing Data

Subtask:
Demonstrate techniques for identifying, analyzing, and handling missing data (e.g., isnull(), dropna(), fillna()). Provide at least 5 distinct
examples, fulfilling 'Program 5: Handling Missing Data and Summary Statistics'.

Pandas: Handling Missing Data

Missing data is a common challenge in real-world datasets and can significantly impact the accuracy and reliability of data analysis and
machine learning models. Pandas provides robust tools to identify, analyze, and handle these missing values effectively.

Why is handling missing data important?


Prevents errors: Many analytical functions and models cannot operate with missing values.
Improves model performance: Properly handled missing data can lead to more accurate models.
Maintains data integrity: Understanding missing patterns can reveal issues in data collection.

In this section, we will explore various techniques to manage missing data using Pandas, including identifying where data is missing,
dropping incomplete rows or columns, and filling missing values with appropriate substitutes.

Reasoning: The next step is to import the pandas and numpy libraries and then create a sample DataFrame with intentional missing
values for demonstration, as per the instructions.

import pandas as pd
import numpy as np

print("--- Pandas and NumPy Libraries Imported ---")

# Create a sample DataFrame with intentional missing values
data = {
    'A': [1, 2, np.nan, 4, 5],
    'B': [6, np.nan, 8, 9, 10],
    'C': [11, 12, 13, np.nan, 15],
    'D': [16, 17, 18, 19, np.nan],
    'E': [21, 22, 23, 24, 25],
    'F': [np.nan, np.nan, np.nan, np.nan, np.nan]  # A column with all NaN values
}
df_missing = pd.DataFrame(data)

print("\nOriginal DataFrame with Missing Values:")


print(df_missing)

--- Pandas and NumPy Libraries Imported ---

Original DataFrame with Missing Values:


A B C D E F
0 1.0 6.0 11.0 16.0 21 NaN
1 2.0 NaN 12.0 17.0 22 NaN
2 NaN 8.0 13.0 18.0 23 NaN
3 4.0 9.0 NaN 19.0 24 NaN
4 5.0 10.0 15.0 NaN 25 NaN

Reasoning: Now that a sample DataFrame with missing values has been created, the next step is to demonstrate how to identify these
missing values using isnull() , isnull().sum() , and isnull().sum().sum() as per the subtask instructions.

print("\n--- Identifying Missing Values ---")

# a. Use df.isnull() to get a boolean DataFrame indicating missing values
print("\nBoolean DataFrame for missing values (df.isnull()):")
print(df_missing.isnull())

--- Identifying Missing Values ---

Boolean DataFrame for missing values (df.isnull()):


A B C D E F
0 False False False False False True
1 False True False False False True
2 True False False False False True
3 False False True False False True
4 False False False True False True

# b. Use df.isnull().sum() to count missing values per column
print("\nMissing values count per column (df.isnull().sum()):")
print(df_missing.isnull().sum())

Missing values count per column (df.isnull().sum()):


A 1
B 1
C 1
D 1
E 0
F 5
dtype: int64

# c. Use df.isnull().sum().sum() to get the total number of missing values in the DataFrame
total_missing_values = df_missing.isnull().sum().sum()
print(f"\nTotal number of missing values in the DataFrame: {total_missing_values}")

Total number of missing values in the DataFrame: 9

# Additionally, check for non-missing values
print("\nNon-missing values count per column (df.notnull().sum()):")
print(df_missing.notnull().sum())

Non-missing values count per column (df.notnull().sum()):
A 4
B 4
C 4
D 4
E 5
F 0
dtype: int64
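
Raw counts are easier to compare across columns as percentages. A short sketch, assuming a reduced three-column version of the DataFrame above under the name df:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'A': [1, 2, np.nan, 4, 5],
    'E': [21, 22, 23, 24, 25],
    'F': [np.nan] * 5,
})

# The mean of a boolean column is the fraction of True values.
missing_pct = df.isnull().mean() * 100
print(missing_pct)

# Rows that contain at least one missing value.
rows_with_missing = df[df.isnull().any(axis=1)]
print(len(rows_with_missing))  # 5, because column 'F' is NaN in every row
```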

Reasoning: The next step is to demonstrate how to handle missing values by dropping them using df.dropna(),
df.dropna(how='all'), and df.dropna(axis=1) as specified in the instructions.

print("\n--- Handling Missing Values: Dropping Entries ---")

# Original DataFrame for comparison


print("\nOriginal DataFrame:")
print(df_missing)

--- Handling Missing Values: Dropping Entries ---

Original DataFrame:
A B C D E F
0 1.0 6.0 11.0 16.0 21 NaN
1 2.0 NaN 12.0 17.0 22 NaN
2 NaN 8.0 13.0 18.0 23 NaN
3 4.0 9.0 NaN 19.0 24 NaN
4 5.0 10.0 15.0 NaN 25 NaN

# a. Use df.dropna() to drop rows containing any missing values
df_dropped_any_row = df_missing.dropna()
print("\nDataFrame after dropping rows with ANY missing values (df.dropna()):")
print(df_dropped_any_row)

DataFrame after dropping rows with ANY missing values (df.dropna()):


Empty DataFrame
Columns: [A, B, C, D, E, F]
Index: []

# b. Use df.dropna(how='all') to drop rows only if all their values are missing
df_dropped_all_row = df_missing.dropna(how='all')
print("\nDataFrame after dropping rows with ALL missing values (df.dropna(how='all')):")
print(df_dropped_all_row)

DataFrame after dropping rows with ALL missing values (df.dropna(how='all')):


A B C D E F
0 1.0 6.0 11.0 16.0 21 NaN
1 2.0 NaN 12.0 17.0 22 NaN
2 NaN 8.0 13.0 18.0 23 NaN
3 4.0 9.0 NaN 19.0 24 NaN
4 5.0 10.0 15.0 NaN 25 NaN

# c. Use df.dropna(axis=1) to drop columns containing any missing values
df_dropped_any_col = df_missing.dropna(axis=1)
print("\nDataFrame after dropping columns with ANY missing values (df.dropna(axis=1)):")
print(df_dropped_any_col)

DataFrame after dropping columns with ANY missing values (df.dropna(axis=1)):


E
0 21
1 22
2 23
3 24
4 25

# Show dropping columns where all values are NaN
df_dropped_all_col = df_missing.dropna(axis=1, how='all')
print("\nDataFrame after dropping columns with ALL missing values (df.dropna(axis=1, how='all')):")
print(df_dropped_all_col)

DataFrame after dropping columns with ALL missing values (df.dropna(axis=1, how='all')):
A B C D E
0 1.0 6.0 11.0 16.0 21
1 2.0 NaN 12.0 17.0 22
2 NaN 8.0 13.0 18.0 23
3 4.0 9.0 NaN 19.0 24
4 5.0 10.0 15.0 NaN 25
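
Dropping on "any" or "all" is often too coarse, as the empty result of dropna() above shows. The sketch below illustrates two finer-grained options, the subset and thresh parameters of dropna(), on a small assumed DataFrame (not the notebook's df_missing).

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'A': [1, 2, np.nan, 4],
    'B': [6, np.nan, np.nan, 9],
    'C': [11, 12, np.nan, 14],
})

# subset: only look at the listed columns when deciding which rows to drop.
drop_if_a_missing = df.dropna(subset=['A'])
print(list(drop_if_a_missing.index))  # [0, 1, 3]

# thresh: keep rows that have at least this many non-missing values.
at_least_two = df.dropna(thresh=2)
print(list(at_least_two.index))       # [0, 1, 3]

at_least_three = df.dropna(thresh=3)
print(list(at_least_three.index))     # [0, 3]
```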

Reasoning: The next step is to demonstrate how to handle missing values by filling them using df.fillna() with a scalar, column
means, and ffill/bfill methods, as outlined in the subtask instructions.

print("\n--- Handling Missing Values: Filling Entries ---")

# Original DataFrame for comparison


print("\nOriginal DataFrame:")
print(df_missing)

--- Handling Missing Values: Filling Entries ---

Original DataFrame:
A B C D E F
0 1.0 6.0 11.0 16.0 21 NaN
1 2.0 NaN 12.0 17.0 22 NaN
2 NaN 8.0 13.0 18.0 23 NaN
3 4.0 9.0 NaN 19.0 24 NaN
4 5.0 10.0 15.0 NaN 25 NaN

# a. Use df.fillna(value) to fill all NaNs with a specific scalar value (e.g., 0)
df_filled_zero = df_missing.fillna(0)
print("\nDataFrame after filling all NaNs with 0 (df.fillna(0)):")
print(df_filled_zero)

DataFrame after filling all NaNs with 0 (df.fillna(0)):


A B C D E F
0 1.0 6.0 11.0 16.0 21 0.0
1 2.0 0.0 12.0 17.0 22 0.0
2 0.0 8.0 13.0 18.0 23 0.0
3 4.0 9.0 0.0 19.0 24 0.0
4 5.0 10.0 15.0 0.0 25 0.0

# b. Use df.fillna(df.mean()) to fill NaNs with the mean of their respective numeric columns
# Note: column 'F' is a float column containing only NaN, so its mean is also NaN
# and fillna() leaves 'F' unfilled.
df_filled_mean = df_missing.fillna(df_missing.mean(numeric_only=True))
print("\nDataFrame after filling NaNs with column means (df.fillna(df.mean())):")
print(df_filled_mean)

DataFrame after filling NaNs with column means (df.fillna(df.mean())):


A B C D E F
0 1.0 6.00 11.00 16.0 21 NaN
1 2.0 8.25 12.00 17.0 22 NaN
2 3.0 8.00 13.00 18.0 23 NaN
3 4.0 9.00 12.75 19.0 24 NaN
4 5.0 10.00 15.00 17.5 25 NaN

# c. Use df.fillna(method='ffill') to forward-fill missing values
df_filled_ffill = df_missing.fillna(method='ffill')
print("\nDataFrame after forward-filling NaNs (df.fillna(method='ffill')):")
print(df_filled_ffill)

DataFrame after forward-filling NaNs (df.fillna(method='ffill')):
     A     B     C     D   E   F
0  1.0   6.0  11.0  16.0  21 NaN
1  2.0   6.0  12.0  17.0  22 NaN
2  2.0   8.0  13.0  18.0  23 NaN
3  4.0   9.0  13.0  19.0  24 NaN
4  5.0  10.0  15.0  19.0  25 NaN
/tmp/[Link]: FutureWarning: DataFrame.fillna with 'method' is deprecated and will raise in a future v
  df_filled_ffill = df_missing.fillna(method='ffill')

# Optional: Use df.fillna(method='bfill') to backward-fill missing values
df_filled_bfill = df_missing.fillna(method='bfill')
print("\nDataFrame after backward-filling NaNs (df.fillna(method='bfill')):")
print(df_filled_bfill)

DataFrame after backward-filling NaNs (df.fillna(method='bfill')):
     A     B     C     D   E   F
0  1.0   6.0  11.0  16.0  21 NaN
1  2.0   8.0  12.0  17.0  22 NaN
2  4.0   8.0  13.0  18.0  23 NaN
3  4.0   9.0  15.0  19.0  24 NaN
4  5.0  10.0  15.0   NaN  25 NaN
/tmp/[Link]: FutureWarning: DataFrame.fillna with 'method' is deprecated and will raise in a future ve
  df_filled_bfill = df_missing.fillna(method='bfill')

Reasoning: The previous code generated a FutureWarning for df.fillna(method='ffill') and df.fillna(method='bfill').
To address this and use the recommended syntax, I will replace fillna(method='ffill') with ffill() and
fillna(method='bfill') with bfill().

print(
"\n--- Handling Missing Values: Filling Entries ---")

# Original DataFrame for comparison


print("\nOriginal DataFrame:")
print(df_missing)

--- Handling Missing Values: Filling Entries ---

Original DataFrame:
A B C D E F
0 1.0 6.0 11.0 16.0 21 NaN
1 2.0 NaN 12.0 17.0 22 NaN
2 NaN 8.0 13.0 18.0 23 NaN
3 4.0 9.0 NaN 19.0 24 NaN
4 5.0 10.0 15.0 NaN 25 NaN

# a. Use [Link](value) to fill all NaNs with a specific scalar value (e.g., 0)
df_filled_zero = df_missing.fillna(0)
print("\nDataFrame after filling all NaNs with 0 ([Link](0)):")
print(df_filled_zero)

DataFrame after filling all NaNs with 0 ([Link](0)):


A B C D E F
0 1.0 6.0 11.0 16.0 21 0.0
1 2.0 0.0 12.0 17.0 22 0.0
2 0.0 8.0 13.0 18.0 23 0.0
3 4.0 9.0 0.0 19.0 24 0.0
4 5.0 10.0 15.0 0.0 25 0.0

# b. Use [Link]([Link]()) to fill NaNs with the mean of their respective numeric columns
# Note: .mean() will ignore non-numeric columns like 'F' if it had object dtype, but here 'F' is float with all NaN.
# For column 'F', mean() would be NaN, so fillna([Link]()) would still leave 'F' as NaN.
# Let's re-evaluate mean for 'F' specifically for clearer demonstration if needed, but for now, rely on default behavior.
df_filled_mean = df_missing.fillna(df_missing.mean(numeric_only=True))
print("\nDataFrame after filling NaNs with column means ([Link]([Link]())):")
print(df_filled_mean)

DataFrame after filling NaNs with column means (df.fillna(df.mean())):
     A      B      C     D   E   F
0  1.0   6.00  11.00  16.0  21 NaN
1  2.0   8.25  12.00  17.0  22 NaN
2  3.0   8.00  13.00  18.0  23 NaN
3  4.0   9.00  12.75  19.0  24 NaN
4  5.0  10.00  15.00  17.5  25 NaN
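
Column 'F' stays NaN in the output above because the mean of an all-NaN column is itself NaN. The following is a minimal sketch of two possible ways to deal with such a column. It uses a small stand-in frame (`df_demo` is a hypothetical name, not the notebook's `df_missing`), and the fallback value 0 is an assumption chosen only for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in frame: 'A' has one gap, 'F' is entirely NaN
df_demo = pd.DataFrame({"A": [1.0, np.nan, 3.0], "F": [np.nan, np.nan, np.nan]})

col_means = df_demo.mean(numeric_only=True)  # the mean of 'F' is NaN
filled = df_demo.fillna(col_means)           # fills 'A', leaves 'F' untouched

# Option 1: drop columns that contain no data at all
dropped = filled.dropna(axis=1, how="all")

# Option 2: fall back to a constant for whatever is still missing
# (0 is an assumption; pick a value that makes sense for your data)
fallback = filled.fillna(0)

print(dropped)
print(fallback)
```

Which option fits depends on whether the empty column carries any meaning for the analysis; dropping it is usually the safer default.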

# c. Use df.ffill() to forward-fill missing values
df_filled_ffill = df_missing.ffill()
print("\nDataFrame after forward-filling NaNs (df.ffill()):")
print(df_filled_ffill)

DataFrame after forward-filling NaNs (df.ffill()):
     A     B     C     D   E   F
0  1.0   6.0  11.0  16.0  21 NaN
1  2.0   6.0  12.0  17.0  22 NaN
2  2.0   8.0  13.0  18.0  23 NaN
3  4.0   9.0  13.0  19.0  24 NaN
4  5.0  10.0  15.0  19.0  25 NaN

# Optional: Use df.bfill() to backward-fill missing values
df_filled_bfill = df_missing.bfill()
print("\nDataFrame after backward-filling NaNs (df.bfill()):")
print(df_filled_bfill)

DataFrame after backward-filling NaNs (df.bfill()):
     A     B     C     D   E   F
0  1.0   6.0  11.0  16.0  21 NaN
1  2.0   8.0  12.0  17.0  22 NaN
2  4.0   8.0  13.0  18.0  23 NaN
3  4.0   9.0  15.0  19.0  24 NaN
4  5.0  10.0  15.0   NaN  25 NaN
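
Backward fill cannot fill a gap in the last row (column D, row 4 above), and forward fill has the same limitation for the first row. The sketch below shows one common workaround, chaining the two, on a hypothetical Series `s`. It also shows the `limit` parameter, which caps how many consecutive NaNs are filled.

```python
import numpy as np
import pandas as pd

# Hypothetical series with a leading gap, a two-value gap, and a trailing gap
s = pd.Series([np.nan, 1.0, np.nan, np.nan, 4.0, np.nan])

# Forward-fill first, then backward-fill whatever is left at the start
both = s.ffill().bfill()

# Fill at most one consecutive NaN after each valid value
limited = s.ffill(limit=1)

print(both.tolist())
print(limited.tolist())
```

Chaining fills every gap as long as the column has at least one valid value; a column that is entirely NaN, like 'F', is still left unchanged.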

Importing Datasets

Subtask:
Provide examples of how to import datasets from common external sources using direct URLs. Include at least 3 distinct examples.

Importing Datasets from External Sources

Accessing datasets from external sources is a fundamental step in any data science workflow. Whether the data resides on platforms
like Kaggle or GitHub, or in public repositories such as the UCI Machine Learning Repository, pandas can load CSV files directly
into a DataFrame with pd.read_csv(), which accepts both local file paths and URLs.

This section demonstrates three ways to bring external data into your Colab environment: downloading a Kaggle dataset and reading
it with pd.read_csv(), fetching JSON records from a public API, and scraping a web page.
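
pd.read_csv() takes a local path, a URL string, or any file-like object as its first argument. The sketch below uses an in-memory buffer as a stand-in for a remote file so that it runs without network access. The URL in the comment and the column names are hypothetical placeholders, not a real dataset.

```python
import io

import pandas as pd

# With a real remote file you would pass the URL string directly, e.g.
# df = pd.read_csv("https://example.com/data.csv")  # hypothetical URL

# Stand-in for the downloaded file contents (made-up data)
csv_text = "name,score\nAsha,91\nBilal,78\n"
df_url_demo = pd.read_csv(io.StringIO(csv_text))

print(df_url_demo.shape)
print(df_url_demo.dtypes)
```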

import os

import kagglehub
import pandas as pd

# Download the latest version; dataset_download returns the local directory
path = kagglehub.dataset_download("amar5693/student-placement-prediction-dataset")
print("Path to dataset files:", path)

# Build the CSV path from the returned directory instead of hard-coding it
csv_path = os.path.join(path, "Student_Placement_Career_Prediction_Dataset.csv")
data = pd.read_csv(csv_path)
print("First 5 rows of the dataset:")
data.head()

Using Colab cache for faster access to the 'student-placement-prediction-dataset' dataset.


Path to dataset files: /kaggle/input/student-placement-prediction-dataset
First 5 rows of the dataset:
   student_id  age  gender      branch  cgpa  internship_count  project_count  certifications_count  coding_skills_score  com
0           1   23    Male          IT  8.99                 4              6                     2                   41
1           2   24  Female  Mechanical  8.36                 4              7                     5                   42
2           3   22  Female         EEE  7.18                 4              5                     2                   84
3           4   24  Female  Mechanical  6.28                 0              2                     4                   60
4           5   24  Female          IT  5.30                 3              4                     1                   78


data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 13 columns):
 #   Column                      Non-Null Count  Dtype
---  ------                      --------------  -----
 0   student_id                  10000 non-null  int64
 1   age                         10000 non-null  int64
 2   gender                      10000 non-null  object
 3   branch                      10000 non-null  object
 4   cgpa                        10000 non-null  float64
 5   internship_count            10000 non-null  int64
 6   project_count               10000 non-null  int64
 7   certifications_count        10000 non-null  int64
 8   coding_skills_score         10000 non-null  int64
 9   communication_skills_score  10000 non-null  int64
 10  soft_skills_score           10000 non-null  int64
 11  hackathon_participation     10000 non-null  object
 12  placement_status            10000 non-null  object
dtypes: float64(1), int64(8), object(4)
memory usage: 1015.8+ KB

data.describe()

        student_id           age          cgpa  internship_count  project_count  certifications_count  coding_skills_score  co
count  10000.00000  10000.000000  10000.000000      10000.000000    10000.00000           10000.00000         10000.000000
mean    5000.50000     22.499900      7.520948          1.996500        4.01950               2.50910            70.026600
std     2886.89568      1.708711      1.447154          1.405591        2.00148               1.71657            17.485545
min        1.00000     20.000000      5.000000          0.000000        1.00000               0.00000            40.000000
25%     2500.75000     21.000000      6.280000          1.000000        2.00000               1.00000            55.000000
50%     5000.50000     23.000000      7.530000          2.000000        4.00000               3.00000            70.000000
75%     7500.25000     24.000000      8.780000          3.000000        6.00000               4.00000            85.000000
max    10000.00000     25.000000     10.000000          4.000000        7.00000               5.00000           100.000000

# Count missing (null) values in each column; 0 means the column has no gaps
null_count = data.isnull().sum()
null_count

student_id                    0
age                           0
gender                        0
branch                        0
cgpa                          0
internship_count              0
project_count                 0
certifications_count          0
coding_skills_score           0
communication_skills_score    0
soft_skills_score             0
hackathon_participation       0
placement_status              0
dtype: int64
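
The zeros above are counts of missing values, which agrees with the 10000 non-null entries reported for every column earlier. As a quick sketch on a hypothetical frame (`df_check` and its values are made up for illustration), isnull() and notnull() are complements, so their per-column sums always add up to the number of rows:

```python
import numpy as np
import pandas as pd

# Hypothetical frame with gaps in both a numeric and a text column
df_check = pd.DataFrame({"x": [1.0, np.nan, 3.0], "y": ["a", None, None]})

missing = df_check.isnull().sum()   # number of NaN/None entries per column
present = df_check.notnull().sum()  # number of valid entries per column

print(missing)
print(present)
```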

# Program to fetch data from a public API using requests library


import requests

# Step 1: API URL


url = "[Link]

# Step 2: Send GET request to the API
response = requests.get(url)

# Step 3: Check if request was successful
if response.status_code == 200:
    # Step 4: Convert response to JSON
    data = response.json()

    # Step 5: Display first 5 records (slicing is safe even if fewer are returned)
    print("Fetched Data from API:\n")
    for post in data[:5]:
        print("Post ID:", post['id'])
        print("Title:", post['title'])
        print("Body:", post['body'])
        print("-" * 50)
else:
    print("Failed to fetch data")
    print("Status Code:", response.status_code)

Fetched Data from API:

Post ID: 1
Title: sunt aut facere repellat provident occaecati excepturi optio reprehenderit
Body: quia et suscipit
suscipit recusandae consequuntur expedita et cum
reprehenderit molestiae ut ut quas totam
nostrum rerum est autem sunt rem eveniet architecto
--------------------------------------------------
Post ID: 2
Title: qui est esse
Body: est rerum tempore vitae
sequi sint nihil reprehenderit dolor beatae ea dolores neque
fugiat blanditiis voluptate porro vel nihil molestiae ut reiciendis
qui aperiam non debitis possimus qui neque nisi nulla
--------------------------------------------------
Post ID: 3
Title: ea molestias quasi exercitationem repellat qui ipsa sit aut
Body: et iusto sed quo iure
voluptatem occaecati omnis eligendi aut ad
voluptatem doloribus vel accusantium quis pariatur
molestiae porro eius odio et labore et velit aut
--------------------------------------------------
Post ID: 4
Title: eum et est occaecati
Body: ullam et saepe reiciendis voluptatem adipisci
sit amet autem assumenda provident rerum culpa
quis hic commodi nesciunt rem tenetur doloremque ipsam iure
quis sunt voluptatem rerum illo velit
--------------------------------------------------
Post ID: 5
Title: nesciunt quas odio
Body: repudiandae veniam quaerat sunt sed
alias aut fugiat sit autem sed est
voluptatem omnis possimus esse voluptatibus quis
est aut tenetur dolor neque
--------------------------------------------------
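
A list of JSON records with 'id', 'title', and 'body' keys, like the one printed above, maps directly onto a DataFrame. The sketch below uses two hand-written records with shortened, made-up values in place of a live API response; `posts_df` and `body_length` are hypothetical names introduced for illustration.

```python
import pandas as pd

# Stand-in for the parsed JSON response (made-up, shortened values)
records = [
    {"id": 1, "title": "first post", "body": "line one\nline two"},
    {"id": 2, "title": "second post", "body": "another body"},
]

# Each dict becomes a row; the keys become column names
posts_df = pd.DataFrame(records)

# Derived column: number of characters in each body
posts_df["body_length"] = posts_df["body"].str.len()

print(posts_df[["id", "title", "body_length"]])
```

Once the records are in a DataFrame, the inspection methods shown earlier (head(), info(), describe()) apply to API data as well.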

# Program 2: Introduction to Web Scraping using Beautiful Soup
# Objective: demonstrate how to fetch a web page, parse its HTML, and extract data using the Beautiful Soup library.

import requests
from bs4 import BeautifulSoup

# Step 1: URL of the webpage to scrape


url = "[Link]

# Step 2: Send HTTP request to the webpage
response = requests.get(url)

# Step 3: Check if request is successful
if response.status_code == 200:
    # Step 4: Parse HTML content using Beautiful Soup
    soup = BeautifulSoup(response.text, 'html.parser')

    # Step 5: Find all quote text elements
    quotes = soup.find_all('span', class_='text')

    # Step 6: Display first 5 quotes (slicing is safe even if fewer are found)
    print("Quotes scraped from website:\n")
    for i, quote in enumerate(quotes[:5], start=1):
        print(f"Quote {i}: {quote.text}")
else:
    print("Failed to retrieve webpage")

Quotes scraped from website:

Quote 1: “The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”
Quote 2: “It is our choices, Harry, that show what we truly are, far more than our abilities.”
Quote 3: “There are only two ways to live your life. One is as though nothing is a miracle. The other is as though everythin
Quote 4: “The person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid.”
Quote 5: “Imperfection is beauty, madness is genius and it's better to be absolutely ridiculous than absolutely boring.”
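
Scraped elements can be collected into a DataFrame in the same way as API records. The sketch below parses a small inline HTML string instead of a live page so that it runs offline. The markup is a hypothetical example: only the `span` with class `text` comes from the program above, while the `div.quote` wrapper and the `small.author` tag are assumptions about how such a page might be structured.

```python
import pandas as pd
from bs4 import BeautifulSoup

# Hypothetical markup standing in for a downloaded page
html = """
<div class="quote"><span class="text">First quote.</span><small class="author">Author A</small></div>
<div class="quote"><span class="text">Second quote.</span><small class="author">Author B</small></div>
"""

soup_demo = BeautifulSoup(html, "html.parser")

rows = []
for block in soup_demo.find_all("div", class_="quote"):
    text_tag = block.find("span", class_="text")
    author_tag = block.find("small", class_="author")
    # Guard against missing tags so one malformed block does not stop the loop
    rows.append({
        "quote": text_tag.get_text(strip=True) if text_tag else None,
        "author": author_tag.get_text(strip=True) if author_tag else None,
    })

quotes_df = pd.DataFrame(rows)
print(quotes_df)
```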
