NumPy and Pandas Tutorial Guide

NumPy and Pandas for Data Analysis AI ML Training

NumPy Tutorial
Introduction

NumPy (Numerical Python) is a library for the Python programming language, adding support
for large, multi-dimensional arrays and matrices, along with a large collection of high-level
mathematical functions to operate on these arrays.

Installation

To install NumPy, use the following command:

pip install numpy

Basic Operations

Importing NumPy

import numpy as np

Creating Arrays

# Create a 1D array
array_1d = np.array([1, 2, 3, 4, 5])
print(array_1d)

# Create a 2D array
array_2d = np.array([[1, 2, 3], [4, 5, 6]])
print(array_2d)

# Create an array with zeros
zeros_array = np.zeros((3, 4))
print(zeros_array)

# Create an array with ones
ones_array = np.ones((2, 3))
print(ones_array)

# Create an identity matrix
identity_matrix = np.eye(3)
print(identity_matrix)

# Create an array with a range of values
range_array = np.arange(10, 20, 2)
print(range_array)

# Create an array with evenly spaced values
linspace_array = np.linspace(0, 1, 5)
print(linspace_array)

Array Operations

# Arithmetic operations
a = np.array([1, 2, 3])
b = np.array([4, 5, 6])

LinkedIn: [Link]/in/nidhi-grover-raheja-904211138 1 |Pa ge


print(a + b) # Addition
print(a - b) # Subtraction
print(a * b) # Element-wise multiplication
print(a / b) # Element-wise division

# Matrix multiplication
matrix_a = np.array([[1, 2], [3, 4]])
matrix_b = np.array([[5, 6], [7, 8]])
print(np.dot(matrix_a, matrix_b))

# Broadcasting
array_broadcast = np.array([1, 2, 3])
print(array_broadcast + 1) # Adds 1 to each element

# Statistical operations
print(np.mean(a)) # Mean
print(np.median(a)) # Median
print(np.std(a)) # Standard deviation
print(np.sum(a)) # Sum
print(np.min(a)) # Minimum
print(np.max(a)) # Maximum

Indexing and Slicing

array = np.array([1, 2, 3, 4, 5, 6])

# Indexing
print(array[0]) # First element
print(array[-1]) # Last element

# Slicing
print(array[1:4]) # Elements from index 1 to 3
print(array[:3]) # First three elements
print(array[3:]) # Elements from index 3 to end
print(array[::2]) # Every second element

Reshaping Arrays

array = np.arange(1, 10)
reshaped_array = array.reshape((3, 3))
print(reshaped_array)

# Flattening arrays
flattened_array = reshaped_array.flatten()
print(flattened_array)

Pandas Tutorial
Introduction

Pandas is a library providing high-performance, easy-to-use data structures and data analysis
tools for the Python programming language.

Installation

To install Pandas, use the following command:

pip install pandas

Basic Operations

Importing Pandas

import pandas as pd

Creating DataFrames

# Create a DataFrame from a dictionary
data = {
    'Name': ['John', 'Anna', 'Peter', 'Linda'],
    'Age': [28, 24, 35, 32],
    'City': ['New York', 'Paris', 'Berlin', 'London']
}
df = pd.DataFrame(data)
print(df)

# Create a DataFrame from a CSV file
df_from_csv = pd.read_csv('path_to_csv_file.csv')
print(df_from_csv)

Viewing Data

# Display the first few rows
print(df.head())

# Display the last few rows
print(df.tail())

# Display the data types of columns
print(df.dtypes)

# Display the shape of the DataFrame
print(df.shape)

# Display summary statistics
print(df.describe())

Selecting Data

# Select a single column
print(df['Name'])

# Select multiple columns
print(df[['Name', 'City']])

# Select rows by integer position
print(df.iloc[0]) # First row
print(df.iloc[0:2]) # First two rows (end position is exclusive)

# Select rows by label
print(df.loc[0]) # Row labeled 0 (the first row here)
print(df.loc[0:2]) # Rows labeled 0 through 2: three rows (end label is inclusive)

# Conditional selection
print(df[df['Age'] > 30])

Adding and Dropping Columns

# Add a new column
df['Country'] = ['USA', 'France', 'Germany', 'UK']
print(df)

# Drop a column
df = df.drop('Country', axis=1)
print(df)

Handling Missing Data

# Create a DataFrame with missing values
data_with_nan = {
    'Name': ['John', 'Anna', 'Peter', 'Linda'],
    'Age': [28, None, 35, 32],
    'City': ['New York', 'Paris', None, 'London']
}
df_nan = pd.DataFrame(data_with_nan)
print(df_nan)

# Drop rows with missing values
df_dropped_nan = df_nan.dropna()
print(df_dropped_nan)

# Fill missing values
df_filled_nan = df_nan.fillna({'Age': df_nan['Age'].mean(), 'City': 'Unknown'})
print(df_filled_nan)

Grouping and Aggregating Data

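# Group by a column and calculate mean
# (numeric_only=True skips the text column 'Name', which cannot be averaged)
print(df.groupby('City').mean(numeric_only=True))

# Group by multiple columns and calculate sum
print(df.groupby(['City', 'Name']).sum())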

Merging DataFrames

# Create two DataFrames
df1 = pd.DataFrame({'Name': ['John', 'Anna'], 'Age': [28, 24]})
df2 = pd.DataFrame({'Name': ['Peter', 'Linda'], 'City': ['Berlin', 'London']})

# Concatenate DataFrames
df_concat = pd.concat([df1, df2], ignore_index=True)
print(df_concat)

# Merge DataFrames
# (df1 and df2 share no names, so this inner merge returns an empty DataFrame;
# how='outer' would keep all four rows)
df_merge = pd.merge(df1, df2, on='Name', how='inner')
print(df_merge)

Exporting Data

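# Export DataFrame to CSV ('output.csv' is a placeholder file name)
df.to_csv('output.csv', index=False)

# Export DataFrame to Excel (placeholder file name; needs an Excel writer such as openpyxl)
df.to_excel('output.xlsx', index=False)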

Advanced Pandas Tutorial


Handling Time Series Data

Pandas provides robust support for time series data. Here's how to work with it.

Creating Time Series Data

# Create a date range
date_range = pd.date_range(start='2023-01-01', periods=10, freq='D')
print(date_range)

# Create a DataFrame with time series data
# (random sample values are assumed here; requires: import numpy as np)
time_series_data = {
    'Date': date_range,
    'Value': np.random.randn(10)
}
df_time_series = pd.DataFrame(time_series_data)
df_time_series.set_index('Date', inplace=True)
print(df_time_series)

Resampling Time Series Data

# Resample to weekly frequency and calculate the mean
df_resampled = df_time_series.resample('W').mean()
print(df_resampled)

# Resample to monthly frequency and calculate the sum
# ('ME' means month end; pandas versions before 2.2 use 'M' instead)
df_resampled_monthly = df_time_series.resample('ME').sum()
print(df_resampled_monthly)

Working with Categorical Data


# Create a DataFrame with categorical data
data = {
    'Name': ['John', 'Anna', 'Peter', 'Linda'],
    'City': ['New York', 'Paris', 'Berlin', 'London'],
    'Gender': ['Male', 'Female', 'Male', 'Female']
}
df_categorical = pd.DataFrame(data)

# Convert a column to categorical type
df_categorical['Gender'] = df_categorical['Gender'].astype('category')
print(df_categorical)

# Get the categories and codes
print(df_categorical['Gender'].cat.categories)
print(df_categorical['Gender'].cat.codes)

Pivot Tables
# Create a DataFrame
data = {
    'Name': ['John', 'Anna', 'John', 'Anna', 'John', 'Anna'],
    'Month': ['Jan', 'Jan', 'Feb', 'Feb', 'Mar', 'Mar'],
    'Sales': [150, 200, 130, 210, 170, 220]
}
df_sales = pd.DataFrame(data)

# Create a pivot table
pivot_table = df_sales.pivot_table(values='Sales', index='Name',
                                   columns='Month', aggfunc='sum')
print(pivot_table)

Handling Large Datasets


# Read a large CSV file in chunks
chunk_size = 1000
chunks = pd.read_csv('large_dataset.csv', chunksize=chunk_size)

# Process each chunk
for chunk in chunks:
    # Perform operations on the chunk
    print(chunk.shape)

Applying Functions

Using apply()

# Create a DataFrame
data = {
    'A': [1, 2, 3],
    'B': [4, 5, 6]
}
df = pd.DataFrame(data)

# Define a function
def add_one(x):
    return x + 1

# Apply the function to each element
# (DataFrame.map needs pandas 2.1 or later; older versions use df.applymap)
print(df.map(add_one))

# Apply the function to each column
print(df.apply(lambda x: x + 1))

# Apply the function to each row
print(df.apply(lambda x: x + 1, axis=1))

Joining DataFrames
# Create two DataFrames
df1 = pd.DataFrame({
    'key': ['A', 'B', 'C', 'D'],
    'value': [1, 2, 3, 4]
})
df2 = pd.DataFrame({
    'key': ['B', 'D', 'E', 'F'],
    'value': [5, 6, 7, 8]
})

# Both frames have a 'value' column, so the results contain value_x (from df1)
# and value_y (from df2)

# Inner join
inner_joined = pd.merge(df1, df2, on='key', how='inner')
print(inner_joined)

# Left join
left_joined = pd.merge(df1, df2, on='key', how='left')
print(left_joined)

# Right join
right_joined = pd.merge(df1, df2, on='key', how='right')
print(right_joined)

# Outer join
outer_joined = pd.merge(df1, df2, on='key', how='outer')
print(outer_joined)

Window Functions
# Create a DataFrame with time series data
# (random sample values are assumed here; requires: import numpy as np)
data = {
    'Date': pd.date_range(start='2023-01-01', periods=10, freq='D'),
    'Value': np.random.randn(10)
}
df = pd.DataFrame(data)
df.set_index('Date', inplace=True)

# Calculate rolling mean (the first two results are NaN until the window fills)
rolling_mean = df['Value'].rolling(window=3).mean()
print(rolling_mean)

# Calculate expanding sum
expanding_sum = df['Value'].expanding().sum()
print(expanding_sum)

# Calculate exponentially weighted mean
ewm_mean = df['Value'].ewm(span=3).mean()
print(ewm_mean)

Handling JSON Data


from io import StringIO

# Create a JSON string
json_str = '''
[
{"Name": "John", "Age": 28, "City": "New York"},
{"Name": "Anna", "Age": 24, "City": "Paris"},
{"Name": "Peter", "Age": 35, "City": "Berlin"}
]
'''

# Read JSON string into DataFrame
# (recent pandas versions deprecate passing a literal JSON string, so wrap it in StringIO)
df_json = pd.read_json(StringIO(json_str))
print(df_json)

# Export DataFrame to JSON, one record per line ('output.json' is a placeholder file name)
df_json.to_json('output.json', orient='records', lines=True)

Advanced Indexing with MultiIndex


# Create a MultiIndex DataFrame
arrays = [
    ['A', 'A', 'B', 'B'],
    ['one', 'two', 'one', 'two']
]
index = pd.MultiIndex.from_arrays(arrays, names=('first', 'second'))
df_multi = pd.DataFrame({'value': [1, 2, 3, 4]}, index=index)
print(df_multi)

# Accessing data in MultiIndex DataFrame
print(df_multi.loc['A'])
print(df_multi.loc[('A', 'one')])

Combining DataFrames with concat and append


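# Create DataFrames
df1 = pd.DataFrame({
    'A': ['A0', 'A1', 'A2'],
    'B': ['B0', 'B1', 'B2']
})
df2 = pd.DataFrame({
    'A': ['A3', 'A4', 'A5'],
    'B': ['B3', 'B4', 'B5']
})

# Concatenate DataFrames
concatenated = pd.concat([df1, df2], ignore_index=True)
print(concatenated)

# Append DataFrames
# (DataFrame.append was removed in pandas 2.0; pd.concat gives the same result.
# On older versions this was df1.append(df2, ignore_index=True))
appended = pd.concat([df1, df2], ignore_index=True)
print(appended)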

Performance Tips
# Use vectorized operations instead of loops
data = pd.DataFrame({
    'A': range(1000000),
    'B': range(1000000)
})

# Inefficient way: Using loops
data['C'] = [x + y for x, y in zip(data['A'], data['B'])]

# Efficient way: Using vectorized operations
data['C'] = data['A'] + data['B']

Common questions

NumPy arrays are created with functions such as np.array(), np.zeros(), or np.ones(). Each array has a fixed shape and a single data type, which makes numerical computation efficient but means that growing or restructuring an array produces a new array. Pandas DataFrames, in contrast, are designed to be modified: columns can be added or dropped, missing values handled, and data loaded from external sources such as CSV and JSON files. This flexibility makes Pandas well suited to time-series manipulation, summary statistics, and merging datasets.

Grouping and aggregating in Pandas splits a dataset into groups based on the distinct values of one or more columns and applies an aggregation function such as mean or sum to each group, revealing trends and distributions in the data. NumPy can compute the same aggregates on whole arrays with functions like np.sum and np.mean, but it has no built-in notion of labeled groups. A call such as df.groupby('City').mean(numeric_only=True) therefore expresses a per-category analysis in one line that would require manual masking in NumPy.
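
To make the comparison concrete, here is a minimal sketch that computes the same per-city mean both ways. The small DataFrame and its values are hypothetical and are not taken from the tutorial above.

```python
import numpy as np
import pandas as pd

# Hypothetical data: two cities, two people each
df = pd.DataFrame({
    'City': ['Paris', 'Berlin', 'Paris', 'Berlin'],
    'Age': [24, 35, 30, 41],
})

# Pandas: group by label and aggregate in one expression
pandas_means = df.groupby('City')['Age'].mean()

# NumPy: build one boolean mask per distinct city by hand
cities = df['City'].to_numpy()
ages = df['Age'].to_numpy()
numpy_means = {city: ages[cities == city].mean() for city in np.unique(cities)}

print(pandas_means)
print(numpy_means)
```

Both approaches give the same numbers; the Pandas version simply keeps the group labels attached to the result.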

NumPy supports n-dimensional arrays and a suite of mathematical operations on them, making it a solid foundation for data processing tasks. Pandas builds on NumPy with data structures such as Series and DataFrames, which add labeled indexing, grouping, and merging. NumPy is therefore often used for performance-critical numerical work, while Pandas offers easier and more expressive ways to structure, filter, and aggregate data, making them complementary tools in a data analyst's toolkit.

The key difference between merging and joining DataFrames in Pandas lies in how rows are aligned. pd.merge() combines DataFrames on common key columns (or indices) and supports inner, outer, left, and right joins; the join type determines which rows from the inputs appear in the result. DataFrame.join() is closely related but aligns on the index by default, which is simpler when the datasets already share an index. Choosing the right method and join type preserves the relationships in the data and avoids losing or duplicating rows.
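
The distinction can be sketched with two small frames. The names `left` and `right` and their values are hypothetical, chosen only for illustration: `pd.merge` matches on a key column, while `DataFrame.join` matches on the index.

```python
import pandas as pd

# Hypothetical example frames (not from the tutorial above)
left = pd.DataFrame({'key': ['A', 'B', 'C'], 'x': [1, 2, 3]})
right = pd.DataFrame({'key': ['B', 'C', 'D'], 'y': [20, 30, 40]})

# merge: align rows on the shared 'key' column
merged = pd.merge(left, right, on='key', how='inner')

# join: aligns on the index, so move 'key' into the index first
joined = left.set_index('key').join(right.set_index('key'), how='inner')

print(merged)
print(joined)
```

Both results contain only the keys present in both frames; they differ in whether `key` ends up as a column or as the index.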

The NumPy examples in this guide filter data by position, through indexing and slicing; array[1:3], for example, relies on known positions. NumPy does also support boolean masks such as array[array > 3]. Pandas builds on this with conditional selection on labeled columns, such as df[df['Age'] > 30]. This works much like a SQL WHERE clause, letting analysts filter large datasets with logical conditions across multiple columns.
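
For comparison, value-based filtering is also available on plain NumPy arrays through boolean masks. A small sketch, assuming an `ages` array whose values mirror the 'Age' column used earlier:

```python
import numpy as np

# Assumed values, mirroring the 'Age' column from the DataFrame examples
ages = np.array([28, 24, 35, 32])

mask = ages > 30       # boolean array, one entry per element
over_30 = ages[mask]   # keeps only the elements where the mask is True

print(mask)
print(over_30)
```

The Pandas expression df[df['Age'] > 30] follows the same pattern, with the mask applied to whole rows instead of single elements.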

MultiIndex DataFrames make data selection and manipulation more complex: accessing data requires a clear understanding of the multi-level structure, and the resulting code can become longer and less intuitive. These challenges can be addressed by calling reset_index() to flatten the DataFrame when convenient, or by using loc indexers to target exactly the slices needed. Clear documentation and consistent level names further reduce the complexity of working with multi-level data.
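
A short sketch of these two approaches, reusing the MultiIndex layout from the tutorial. Note that `xs` is used here as one convenient way to select by an inner level; it is an addition for illustration, not something the text above prescribes.

```python
import pandas as pd

index = pd.MultiIndex.from_arrays(
    [['A', 'A', 'B', 'B'], ['one', 'two', 'one', 'two']],
    names=('first', 'second'),
)
df_multi = pd.DataFrame({'value': [1, 2, 3, 4]}, index=index)

# Flatten the index levels into ordinary columns
flat = df_multi.reset_index()

# Select every row whose 'second' level equals 'one'
ones = df_multi.xs('one', level='second')

print(flat)
print(ones)
```

After reset_index(), the usual column-based filters apply, which is often easier to read than multi-level indexing.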

Pivot tables extend the analytical capabilities of Pandas by transforming a DataFrame into a summary table organized along multiple dimensions, using aggregation functions such as sum or mean. This is particularly useful for business analytics, because rearranging categorical data this way highlights trends, comparisons, and distributions. For example, pivot_table with index and columns arguments computes and arranges the summaries automatically, with no manual calculation.

Window functions in Pandas support time-series analysis by applying an operation over a specified number of recent observations, which helps identify trends and reduce noise. For instance, rolling().mean() computes a moving average that smooths out short-term fluctuations, while ewm().mean() computes an exponentially weighted moving average that gives more weight to recent observations. These operations are common in financial analysis, stock market forecasting, and weather monitoring, where they help reveal underlying patterns in time-dependent data.

Filling missing data is often preferable to dropping it when preparing datasets for machine learning, because dropping rows or columns can discard valuable information, especially in sparse datasets, and may leave a biased or undersized training set. Filling missing entries with the mean, the median, or a constant value preserves the original dataset size. Simple imputation can also distort a feature's distribution, so the choice of fill value should be checked against the data.
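
As an illustration of the fill options mentioned here, the following sketch counts the missing values and then fills 'Age' with the median rather than the mean. The data repeats the tutorial's missing-value example; using the median is an assumption made for this example only.

```python
import pandas as pd

df_nan = pd.DataFrame({
    'Name': ['John', 'Anna', 'Peter', 'Linda'],
    'Age': [28, None, 35, 32],
    'City': ['New York', 'Paris', None, 'London'],
})

# Count missing values per column before deciding how to handle them
n_missing = df_nan.isna().sum()

# Fill numeric gaps with the median and text gaps with a constant
filled = df_nan.fillna({'Age': df_nan['Age'].median(), 'City': 'Unknown'})

print(n_missing)
print(filled)
```

The median is less sensitive to outliers than the mean, which can matter for skewed columns.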

Vectorized operations in Pandas and NumPy are crucial for performance because they are implemented in C, which makes them much faster than Python loops. On large datasets, a loop pays the Python interpreter's overhead for every element, while a vectorized operation runs a single low-level computation over the whole array. For example, adding two columns with data['C'] = data['A'] + data['B'] is significantly faster than looping over the elements. This practice not only speeds up computation but also leads to cleaner, more readable code.
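
The difference can be measured directly. This is a rough timing sketch, assuming a smaller frame than the tutorial's (100,000 rows instead of 1,000,000) to keep the run short; the actual timings depend on the machine and the pandas version.

```python
import time
import pandas as pd

n = 100_000  # assumed size, smaller than the tutorial's 1,000,000
data = pd.DataFrame({'A': range(n), 'B': range(n)})

# Loop-based addition
start = time.perf_counter()
loop_result = pd.Series([x + y for x, y in zip(data['A'], data['B'])])
loop_seconds = time.perf_counter() - start

# Vectorized addition
start = time.perf_counter()
vector_result = data['A'] + data['B']
vector_seconds = time.perf_counter() - start

print(f"loop: {loop_seconds:.4f}s, vectorized: {vector_seconds:.4f}s")
```

Both versions produce the same values; only the time taken differs.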
