What is Python?
Python is an interpreted, high-level, general-purpose programming language.
Created by Guido van Rossum and released in 1991, Python emphasizes code
readability with its clean and easy-to-understand syntax. It supports multiple
programming paradigms such as object-oriented, imperative, functional, and
procedural programming.
Key Features of Python:
Easy to learn and use: Python has a simple syntax similar to English.
Open-source: Freely available for use and modification.
Extensive Libraries: Rich libraries such as Pandas, NumPy, Matplotlib, Seaborn, etc.
Platform Independent: The same Python code generally runs unchanged on Windows, macOS, and Linux.
Interpreted Language: Code runs without a separate compile step, so an error surfaces as soon as the failing line executes, which makes debugging easier.
Dynamic Typing: No need to declare variable types explicitly.
Applications of Python in Data Analysis:
Python has become one of the most widely used languages for data analysis because of its support for:
Data Cleaning and Preprocessing: Libraries like Pandas make it easy to filter,
manipulate, and clean datasets.
Statistical Analysis: Libraries like SciPy and Statsmodels provide robust statistical
methods.
Data Visualization: With Matplotlib and Seaborn, Python allows the creation of
interactive and publication-quality graphs.
Machine Learning & Predictive Analytics: Integrated tools like scikit-learn and
TensorFlow.
Automation: Scripts can automate repetitive data processing tasks.
Basics of Python Programming
Variables and Data Types
A variable is a container for storing data values. Python variables are dynamically
typed, so you don't need to declare the type.
Common Data Types:
int – integer numbers
float – decimal numbers
str – text (strings)
bool – Boolean values (True, False)
list – ordered, mutable sequence
tuple – ordered, immutable sequence
set – unordered collection of unique items
dict – key-value pairs
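As a quick illustration of the types listed above, the following sketch creates one value of each type and inspects it with the built-in type() function. The variable names and values are made up for this example.

```python
# One illustrative value per built-in type (names and values are hypothetical).
samples = {
    "int": 42,
    "float": 3.14,
    "str": "hello",
    "bool": True,
    "list": [1, 2, 3],
    "tuple": (1, 2, 3),
    "set": {1, 2, 2, 3},      # the duplicate 2 is dropped
    "dict": {"key": "value"},
}

for label, value in samples.items():
    # type(value).__name__ is the short name of the value's type
    print(f"{label:<6} {type(value).__name__:<6} {value!r}")
```

Each row prints the label, the type name Python reports, and the value, so the two names should match on every line.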
1. Variables in Python
What is a Variable?
A variable in Python is a named storage location used to hold a value that can be
modified during program execution. It acts as a reference to a memory location where
data is stored.
x = 10
Rules for Naming Variables
1. Must begin with a letter (A–Z or a–z) or an underscore _.
2. The rest of the name can include letters, numbers, and underscores.
3. Case-sensitive – Age and age are different variables.
4. No reserved keywords (e.g., if, for, class, def, etc.).
5. Cannot contain spaces – use underscores (_) instead.
6. Should be meaningful – always use descriptive names for clarity.
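The mechanical rules above (1, 2, 4, and 5) can be checked in code. The sketch below combines str.isidentifier() with keyword.iskeyword() from the standard library; the helper name is_valid_name is an assumption introduced for this example.

```python
import keyword


def is_valid_name(name: str) -> bool:
    """Return True if `name` can be used as a variable name."""
    # isidentifier() enforces the character rules (no leading digit, no spaces);
    # iskeyword() rejects reserved words such as `if` or `class`.
    return name.isidentifier() and not keyword.iskeyword(name)


for candidate in ["name", "_age", "total_marks", "2ndname", "my name", "class"]:
    print(f"{candidate!r:15} -> {is_valid_name(candidate)}")
```

Note that isidentifier() is slightly more permissive than rule 1 as written, because Python 3 also accepts many non-ASCII letters in names. Rule 6 (meaningful names) is a style guideline and cannot be checked this way.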
Valid Variable Names:
name = "John"
_age = 25
total_marks = 95.5
Invalid Variable Names:
2ndname = "Alice"
my name = "Bob"
Variable Assignment
Python supports multiple ways of assigning values to variables:
Single assignment:
x = 5
Multiple assignment:
x, y, z = 1, 2, 3
Same value to multiple variables:
a = b = c = 10
Variable Types (Implicit Typing)
Python is dynamically typed — the type of a variable is determined automatically
when a value is assigned.
x = 10 # int
y = 3.14 # float
name = "Ram" # str
Use type() to check the variable type:
type(x) # Output: <class 'int'>
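One consequence of dynamic typing is that the same name can later refer to a value of a different type. A minimal sketch, using a made-up variable called value:

```python
value = 10
first_type = type(value).__name__    # "int"

value = "ten"                        # the same name now refers to a str
second_type = type(value).__name__   # "str"

print(first_type, second_type)

# isinstance() is the usual way to test a type inside a condition
if isinstance(value, str):
    print("value currently holds text")
```

Rebinding a name like this is legal, but reusing one name for unrelated types tends to make code harder to follow.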
2. List in Python
What is a List?
A List is a mutable, ordered collection of items. Items can be of any data type (integer,
string, float, another list, etc.).
fruits = ["apple", "banana", "cherry"]
Characteristics of Lists:
Mutable: You can change, add, or remove items.
Ordered: Maintains the order of insertion.
Allows duplicate values.
Supports indexing and slicing.
Common List Operations:
fruits.append("mango") # Add item
fruits.remove("banana") # Remove item
fruits[0] = "kiwi" # Modify item
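The characteristics above mention indexing and slicing. The following sketch shows both, along with insert() and pop(); the list contents are illustrative.

```python
fruits = ["apple", "banana", "cherry", "mango"]

first = fruits[0]              # indexing starts at 0
last = fruits[-1]              # negative indexes count from the end
middle = fruits[1:3]           # slice: items at positions 1 and 2
reversed_copy = fruits[::-1]   # a step of -1 returns a reversed copy

fruits.insert(1, "kiwi")       # insert at position 1
popped = fruits.pop()          # remove and return the last item

print(first, last, middle)
print(fruits)
```

Slicing always returns a new list, so reversed_copy and middle are unaffected by the later insert() and pop() calls.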
3. Tuple in Python
What is a Tuple?
A Tuple is an immutable, ordered collection of items. Once created, its elements cannot be modified.
dimensions = (10, 20, 30)
Characteristics of Tuples:
Immutable: Cannot modify after creation.
Ordered: Preserves order of elements.
Typically use less memory than lists and can be marginally faster to create and read.
Useful for fixed collections like coordinates, RGB values, etc.
Tuple Use-Cases:
Representing read-only data.
Used as dictionary keys (tuples are hashable as long as all their elements are hashable).
Preferred when data integrity must be preserved.
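The points above can be sketched in a few lines: unpacking a tuple, the error raised on attempted modification, and a tuple used as a dictionary key. The names point and label_by_cell are made up for this example.

```python
point = (10, 20)
x, y = point                     # unpack the tuple into two names

try:
    point[0] = 99                # tuples do not allow item assignment
except TypeError as exc:
    error_message = str(exc)
    print("Cannot modify:", error_message)

# A tuple of hashable items can serve as a dictionary key
label_by_cell = {(0, 0): "origin", (1, 0): "east", (0, 1): "north"}
print(label_by_cell[(1, 0)])
```

A list could not be used as a key here, because lists are mutable and therefore not hashable.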
4. Dictionary in Python
What is a Dictionary?
A Dictionary is a mutable collection of key-value pairs. It is used to
store data values like a real-life dictionary where each word (key) has a definition
(value).
student = { "name": "Alice", "age": 20, "course": "Data Science" }
Characteristics of Dictionaries:
Key-value mapping.
Keys are unique and must be hashable (for example strings, numbers, or tuples of immutable items).
Values can be any data type.
Unordered (in versions < Python 3.7) but insertion-ordered in ≥ Python 3.7.
Common Dictionary Operations:
student["age"] = 21 # Modify value
student["grade"] = "A" # Add new key-value pair
del student["course"] # Delete a key
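Beyond the operations above, a few other dictionary idioms come up often: get() with a default, membership tests, and iteration over items(). A minimal sketch that starts from a fresh copy of the student dictionary:

```python
student = {"name": "Alice", "age": 20, "course": "Data Science"}

# get() returns a default instead of raising KeyError for a missing key
grade = student.get("grade", "not graded")

# `in` tests for the presence of a key, not a value
has_age = "age" in student

student["grade"] = "A"
for key, value in student.items():
    print(f"{key}: {value}")

# Since Python 3.7, keys come back in insertion order
keys_in_order = list(student)
print(keys_in_order)
```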
Pandas and Data Preprocessing
What is Pandas?
Pandas (the name derives from "panel data" and is also a play on "Python data
analysis") is a powerful and flexible open-source library built on top of NumPy.
It is designed specifically for data manipulation and analysis.
Key Features:
Data wrangling: filtering, transformation, merging, reshaping
Reading/writing from multiple file formats
Handling missing data efficiently
Supports labeled axes (rows and columns)
Pandas Data Structures
1. Series
A one-dimensional array with axis labels.
Can hold any data type (integers, strings, floats, etc.)
2. DataFrame
A two-dimensional labeled data structure with columns that can be of different types
(like a table in a database or an Excel spreadsheet).
More commonly used in real-world analysis.
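To make the two structures concrete, here is a minimal sketch that builds a Series and a DataFrame from made-up student data. It assumes pandas is installed and imported under the conventional alias pd.

```python
import pandas as pd

# Series: one-dimensional values with an index of labels
marks = pd.Series([85, 92, 78], index=["Alice", "Bob", "Charlie"], name="marks")

# DataFrame: a table whose columns may have different types
students = pd.DataFrame({
    "name": ["Alice", "Bob", "Charlie"],
    "age": [20, 21, 19],
    "marks": [85.0, 92.5, 78.0],
})

print(marks["Bob"])       # label-based access into the Series
print(students.shape)     # (rows, columns)
print(students.dtypes)    # one dtype per column

# Selecting a single column of a DataFrame returns a Series
age_column = students["age"]
```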
Data Importing and Exporting
Data can be imported from CSV, Excel, SQL, JSON, and more.
Exporting allows processed data to be saved in a desired format for reporting or further
use.
Commonly used functions:
read_csv(), read_excel(), to_csv(), to_excel(), etc.
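A minimal round trip with read_csv() and to_csv() might look like the sketch below. To keep the example self-contained it reads from and writes to in-memory text buffers (io.StringIO) rather than real files; the CSV content is made up. With a real file you would pass a path instead.

```python
import io

import pandas as pd

csv_text = "name,age,marks\nAlice,20,85.0\nBob,21,92.5\n"

# read_csv() accepts a file path or any file-like object
df = pd.read_csv(io.StringIO(csv_text))
print(df)

# to_csv() writes to a path or buffer; index=False omits the row labels
buffer = io.StringIO()
df.to_csv(buffer, index=False)
exported = buffer.getvalue()
print(exported)
```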
Data Cleaning & Preprocessing
1. Handling Missing Values
Missing values are common in real-world datasets.
They can be handled by:
Removing missing data
Replacing with mean/median/mode
Using forward/backward fill
2. Data Imputation
Filling missing values using statistics (mean, median, etc.)
Maintains data consistency and avoids dropping rows/columns unnecessarily.
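The strategies above can be sketched with pandas as follows. The DataFrame and its column names are made up for illustration; which strategy is appropriate depends on the dataset.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "city": ["Pune", "Delhi", "Agra", "Surat"],
    "temp": [30.0, np.nan, 34.0, np.nan],
})

# Count missing values per column
missing_counts = df.isna().sum()
print(missing_counts)

# 1. Remove rows that contain any missing value
dropped = df.dropna()

# 2. Impute with a statistic (here, the mean of the known values)
mean_filled = df.fillna({"temp": df["temp"].mean()})

# 3. Forward fill: carry the last known value down the column
forward_filled = df.ffill()

print(mean_filled)
print(forward_filled)
```

Dropping rows is the simplest option but discards data, while imputation keeps every row at the cost of introducing estimated values.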
3. Data Transformation
Includes renaming columns, converting data types, scaling values, formatting strings,
etc.
Helps prepare data in a standard format for analysis or machine learning.
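As a sketch of these transformations, the example below renames columns, cleans up strings, converts a text column to integers, and applies min-max scaling. The data and column names are hypothetical, and min-max scaling is only one of several possible scaling methods.

```python
import pandas as pd

raw = pd.DataFrame({
    "Student Name": ["  alice ", "BOB", "charlie  "],
    "Marks": ["85", "92", "78"],          # numbers stored as text
})

# Rename columns to short, consistent names
clean = raw.rename(columns={"Student Name": "name", "Marks": "marks"})

# Format strings: trim whitespace and normalize capitalization
clean["name"] = clean["name"].str.strip().str.title()

# Convert the text column to integers
clean["marks"] = clean["marks"].astype(int)

# Min-max scaling maps the column onto the range 0 to 1
low, high = clean["marks"].min(), clean["marks"].max()
clean["marks_scaled"] = (clean["marks"] - low) / (high - low)

print(clean)
```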
Data Visualization
Importance of Visualization
Helps understand trends, patterns, and relationships in data.
Makes complex data easier to interpret.
Essential for reporting insights to stakeholders.
Aids in data storytelling.
Matplotlib – Basic Visualizations
1. Line Plot
Shows trends over intervals (e.g., time).
Good for stock prices, temperature variation.
2. Bar Chart
Used for comparing quantities across categories.
3. Scatter Plot
Displays correlation/relationship between two numerical variables.
4. Histogram
Shows the distribution of values in a dataset.
Seaborn – Advanced Visualizations
Seaborn is built on top of Matplotlib and offers a high-level interface for attractive
statistical graphics.
1. Box Plot
Displays distribution and outliers.
Highlights median and quartiles.
2. Violin Plot
Combines box plot and kernel density estimation.
3. Pair Plot
Creates scatter plots for all pairwise combinations of variables in a dataset.
4. Heatmap
Used for correlation matrices or any matrix-style data.
Color-coded representation of values.
Supplementary: Introduction to NumPy
What is NumPy?
NumPy (Numerical Python) is the foundation for scientific computing in Python. It
provides support for large, multi-dimensional arrays and matrix operations, along
with a collection of high-level mathematical functions.
Key Concepts:
Efficient memory management and performance.
Array broadcasting (performing operations on arrays of different shapes).
Vectorized operations (faster than Python loops).
Random number generation, linear algebra, statistics, etc.
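The concepts above can be illustrated with a short sketch. The array values are made up, and the seed is fixed only so the random sample is repeatable.

```python
import numpy as np

# Vectorized operation: the multiplication applies to every element at once
prices = np.array([100.0, 250.0, 80.0])
with_tax = prices * 1.1

# Broadcasting: a (2, 3) array plus a (3,) array; the 1-D array is
# applied to each row of the 2-D array
matrix = np.array([[1, 2, 3], [4, 5, 6]])
offsets = np.array([10, 20, 30])
shifted = matrix + offsets

# Statistics along an axis: axis=0 gives one mean per column
column_means = shifted.mean(axis=0)

# Random number generation with a seeded generator
rng = np.random.default_rng(seed=0)
sample = rng.normal(size=5)

print(with_tax)
print(shifted)
print(column_means)
```

The same calculations written as Python loops would give identical results but run noticeably slower on large arrays.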
1. Introduction to Data Visualization and Its Importance
What is Data Visualization?
Data Visualization refers to the graphical representation of information and data. By
using visual elements like charts, graphs, and maps, data visualization tools make it
easier to understand trends, outliers, and patterns in data.
Importance in Data Analysis:
Helps identify patterns and trends.
Makes complex data easier to understand.
Facilitates better decision-making.
Supports storytelling with data.
Useful for data exploration and reporting.
Tools commonly used: Matplotlib, Seaborn, Plotly, Tableau, etc.
2. Basic Plots Using Matplotlib
Matplotlib
Matplotlib is the foundational Python library for creating static, animated, and
interactive visualizations.
import matplotlib.pyplot as plt
A. Line Plot
Used to visualize trends over time or continuous variables.
import matplotlib.pyplot as plt
x = [1, 2, 3, 4, 5]
y = [10, 15, 13, 18, 16]
plt.plot(x, y) # Draw the line
plt.title("Line Plot")
plt.xlabel("X-Axis")
plt.ylabel("Y-Axis")
plt.grid(True) # Show grid lines
plt.show()
B. Bar Plot
Used to compare categories or discrete data.
import matplotlib.pyplot as plt
x = ["Apple", "Banana", "Cherry"]
y = [10, 15, 7]
plt.bar(x, y, color='skyblue') # One bar per category
plt.title("Bar Plot")
plt.ylabel("Quantity")
plt.show()
C. Scatter Plot
Used to visualize the relationship between two numerical variables.
import matplotlib.pyplot as plt
x = [1, 2, 3, 4, 5]
y = [5, 7, 6, 8, 9]
plt.scatter(x, y, color='red') # One point per (x, y) pair
plt.title("Scatter Plot")
plt.xlabel("X Value")
plt.ylabel("Y Value")
plt.show()
D. Histogram
Used to show the distribution of a variable (frequency plot).
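import numpy as np
import matplotlib.pyplot as plt
data = np.random.randn(1000) # 1000 samples from a standard normal distribution
plt.hist(data, bins=30, color='green')
plt.title("Histogram")
plt.xlabel("Value")
plt.ylabel("Frequency")
plt.show()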
3. Advanced Visualization Using Seaborn
What is Seaborn?
Seaborn is a high-level visualization library built on top of Matplotlib. It provides
more attractive and informative statistical visualizations with less code.
import seaborn as sns
import matplotlib.pyplot as plt
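The examples below use Seaborn's sample datasets such as tips, iris, and titanic, which sns.load_dataset() downloads on first use (an internet connection is required).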
A. Box Plot
Used to visualize the distribution and detect outliers.
sns.boxplot(x='day', y='total_bill', data=sns.load_dataset('tips'))
plt.title("Box Plot of Total Bill by Day")
plt.show()
B. Violin Plot
Combines boxplot and KDE (kernel density estimation). Shows distribution and
probability density.
sns.violinplot(x='day', y='total_bill', data=sns.load_dataset('tips'))
plt.title("Violin Plot of Total Bill by Day")
plt.show()
C. Pair Plot
Used to visualize pairwise relationships in a dataset.
sns.pairplot(sns.load_dataset('iris'), hue='species')
plt.suptitle("Pair Plot of Iris Dataset", y=1.02) # Title above the whole grid
plt.show()
D. Heatmap
Used for visualizing correlation or any matrix data.
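iris = sns.load_dataset('iris')
# The species column is text, so restrict the correlation to numeric columns
correlation_matrix = iris.corr(numeric_only=True)
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm')
plt.title("Heatmap of Iris Feature Correlation")
plt.show()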