Data Analysis and Python Libraries Guide

The document discusses the significance of data in decision-making and the responsibilities of programmers and data analysts in ensuring data integrity. It covers key concepts in experimental design, Python libraries for data analysis, and provides an introduction to NumPy and Pandas for handling data structures. Additionally, it highlights the importance of data visualization using Matplotlib, including various plot types and customization options.

Uploaded by

kritikant0607

1 Why Data Matters

Data plays a crucial role in shaping

• policy decisions,

• product design,

• business strategies,

• scientific discoveries.

When we analyze or present data, it is essential to keep it unbiased, fair, and transparent.

• Good data leads to good conclusions.

• Bad or biased data leads to misleading results, wrong decisions, and unfair policies.

2 Responsibility of a Programmer/Data Analyst


When designing algorithms that use or display data, one must:

• ensure the data is inclusive and representative;

• avoid misleading visualizations or selective reporting;

• present data in a neutral, transparent manner.

Bias in data can arise due to:

• poor experimental design,

• small or unrepresentative samples,

• incorrect measurement methods,

• misinterpretation or misrepresentation.

3 Importance of Experimental Design


Even excellent programming cannot fix **poorly collected data**. A good experiment must be:

• fair,

• repeatable,

• accurate,

• unbiased.

If the design is biased, the results will inevitably be biased.

4 Key Variables in Experimental Data


4.1 Independent Variable
• The variable that the researcher changes or controls.

• Examples: temperature, time, pressure, voltage, chemical concentration.


4.2 Dependent Variable
• The variable that the researcher measures.

• It depends on the independent variable.

• Examples: reaction rate, wave amplitude, current, stress/strain.

4.3 Control Variables


• The factors that must remain constant for the experiment to be fair.

• They ensure that only the independent variable affects the outcome.

• Examples: room temperature, material type, sample size, measurement instruments.

Python Libraries: Key Points


5 What is a Library?
• A library in Python is a reusable collection of code that provides specific functionality.

• Libraries are made up of smaller components called modules.

• Many Python libraries are open source, meaning they can be freely downloaded and used.

• In data analysis, commonly used libraries include:

– pandas — for data manipulation,


– NumPy — for numerical computations,
– Matplotlib — for visualization.

• Different areas of Python use different specialized libraries:

– GUI frameworks: Kivy, tkinter, PyQt, PySimpleGUI.


– Game development: Pygame, Pyglet.
– Machine learning: TensorFlow (developed by Google).

• These examples highlight the wide variety of tools available for different applications.

6 Installing Libraries
• Some modules (like the math module) are built into Python.

• External libraries must be installed separately.

• External libraries are installed with the pip installer, which has been bundled with Python since version 3.4.

• Pip commands are executed through the Command Prompt or terminal.

• Installation may require administrator permissions depending on:

– system security settings,


– where Python is installed,
– whether the device belongs to you or your organization.

• If the computer belongs to an employer:


– Python paths may need adjustments,
– installation permissions may vary.

• If you have administrator access, you can:

– open Command Prompt as an administrator,


– run pip freely for library installation.
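Before installing anything, it can help to check which libraries the current interpreter can already import. The following is a minimal sketch using only the standard library; the helper name `is_installed` and the list of library names are illustrative assumptions, not part of the original notes.

```python
import importlib.util


def is_installed(name):
    """Return True if a top-level module called `name` can be found."""
    return importlib.util.find_spec(name) is not None


# Names are given as they would appear in an import statement (illustrative list).
for lib in ["math", "numpy", "pandas", "matplotlib"]:
    if is_installed(lib):
        print(f"{lib}: available")
    else:
        # Suggested command only; it may need administrator rights as noted above.
        print(f"{lib}: missing, try: python -m pip install {lib}")
```

Running pip as `python -m pip` is one way to make sure the library is installed for the same interpreter that runs the script.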
NumPy: A Detailed Introduction
NumPy (Numerical Python) is a fundamental Python library used for scientific computing. It provides:

• Efficient multi-dimensional arrays

• Fast mathematical operations

• Linear algebra routines

• Random number tools

1. Creating Arrays
np.array()
Creates an array from a Python list (commas required in the list).

np.array([[1, 2, 3], [4, 5, 6]]) ⇒
[[1 2 3]
 [4 5 6]]

np.zeros()
Creates an array filled with zeros.

np.zeros((2, 3)) ⇒
[[0. 0. 0.]
 [0. 0. 0.]]

np.ones()
Creates an array filled with ones.

np.ones((3, 2)) ⇒
[[1. 1.]
 [1. 1.]
 [1. 1.]]

np.arange()
Creates a range of numbers (like Python’s range but returns an array).

np.arange(0, 10, 2) ⇒ [0, 2, 4, 6, 8]

np.linspace()
Creates evenly spaced values between two numbers.

np.linspace(0, 1, 5) ⇒ [0, 0.25, 0.5, 0.75, 1]
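The creation functions above can be tried together in one short script. This is a sketch rather than a prescribed workflow; the variable names are illustrative.

```python
import numpy as np

from_list = np.array([[1, 2, 3], [4, 5, 6]])  # 2x3 array built from nested lists
zeros = np.zeros((2, 3))                      # the shape is passed as a tuple
ones = np.ones((3, 2))
evens = np.arange(0, 10, 2)                   # the stop value 10 is excluded
grid = np.linspace(0, 1, 5)                   # the stop value 1 is included

print(from_list.shape)   # (2, 3)
print(evens.tolist())    # [0, 2, 4, 6, 8]
print(grid.tolist())     # [0.0, 0.25, 0.5, 0.75, 1.0]
```

The main difference to remember is that np.arange takes a step size and excludes the end point, while np.linspace takes the number of points and includes it.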

2. Array Properties
ndarray.shape
Returns the dimensions as a tuple; for a 2-D array this is (#rows, #columns).

np.array([[1,2,3],[4,5,6]]).shape = (2, 3)
ndarray.ndim
Returns the number of dimensions.

np.zeros((3,4)).ndim = 2

ndarray.dtype
Shows the data type (typically int64 or float64 on 64-bit platforms).

np.array([1,2,3]).dtype = int64

3. Indexing and Slicing


Basic Indexing
arr = np.array([10, 20, 30])

arr[1] ⇒ 20

2D Indexing
A = np.array([[1, 2], [3, 4]])

A[0, 1] ⇒ 2, A[1, 0] ⇒ 3

Slicing
arr = np.array([0,1,2,3,4,5])

arr[1:4] ⇒ [1, 2, 3]
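As a runnable sketch of the indexing rules above (the negative index and the column slice are small additions for illustration):

```python
import numpy as np

arr = np.array([0, 1, 2, 3, 4, 5])
A = np.array([[1, 2], [3, 4]])

middle = arr[1:4]     # start index 1 is included, stop index 4 is excluded
last = arr[-1]        # negative indices count from the end
top_right = A[0, 1]   # row 0, column 1
first_col = A[:, 0]   # every row, column 0

print(middle.tolist())     # [1, 2, 3]
print(last)                # 5
print(top_right)           # 2
print(first_col.tolist())  # [1, 3]
```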

4. Array Operations
Element-wise Operations
If
A = [1, 2, 3], B = [4, 5, 6]
Then:
A + B = [5, 7, 9]
A × B = [4, 10, 18] (element-wise product, written A * B in code)

Scalar Operations
2 × [1, 2, 3] = [2, 4, 6]
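In code, these operations use the ordinary arithmetic operators on arrays. The sketch below also contrasts them with plain Python lists, where `+` joins the lists instead of adding element by element.

```python
import numpy as np

A = np.array([1, 2, 3])
B = np.array([4, 5, 6])

total = A + B      # element-wise sum
product = A * B    # element-wise product
doubled = 2 * A    # the scalar is applied to every element

# With plain lists, + concatenates instead of adding.
joined = [1, 2, 3] + [4, 5, 6]

print(total.tolist())    # [5, 7, 9]
print(product.tolist())  # [4, 10, 18]
print(doubled.tolist())  # [2, 4, 6]
print(joined)            # [1, 2, 3, 4, 5, 6]
```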

5. Matrix Operations
np.dot()
Matrix multiplication.

A = [[1, 2], [3, 4]], B = [[5, 6], [7, 8]]

np.dot(A, B) = [[19, 22], [43, 50]]

A.T
Transpose of matrix A.

A.T = [[1, 3], [2, 4]]
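A short sketch of the matrix operations above. It also shows the `@` operator, which for 2-D arrays gives the same result as np.dot, and the element-wise product for contrast.

```python
import numpy as np

A = np.array([[1, 2], [3, 4]])
B = np.array([[5, 6], [7, 8]])

matrix_product = np.dot(A, B)  # rows of A combined with columns of B
same_product = A @ B           # operator form of matrix multiplication
elementwise = A * B            # NOT matrix multiplication
transposed = A.T               # rows and columns swapped

print(matrix_product.tolist())  # [[19, 22], [43, 50]]
print(elementwise.tolist())     # [[5, 12], [21, 32]]
print(transposed.tolist())      # [[1, 3], [2, 4]]
```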

6. Useful Functions
np.sum()
Adds elements, over the whole array by default or along a given axis.

np.sum([[1,2],[3,4]]) = 10

np.mean()
Computes the average.

np.mean([1,2,3,4]) = 2.5

np.max(), np.min()
Find the maximum and minimum.

np.max([1,9,3]) = 9
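The effect of the axis argument is easiest to see on a small 2-D array. A minimal sketch, with illustrative variable names:

```python
import numpy as np

M = np.array([[1, 2],
              [3, 4]])

total = np.sum(M)               # all elements
col_sums = np.sum(M, axis=0)    # down each column
row_sums = np.sum(M, axis=1)    # across each row
col_means = np.mean(M, axis=0)  # average of each column
row_max = np.max(M, axis=1)     # largest value in each row

print(total)               # 10
print(col_sums.tolist())   # [4, 6]
print(row_sums.tolist())   # [3, 7]
print(col_means.tolist())  # [2.0, 3.0]
print(row_max.tolist())    # [2, 4]
```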

7. Reshaping Arrays
np.reshape()
Changes array shape without changing data.

np.arange(6).reshape(2,3) ⇒
[[0 1 2]
 [3 4 5]]

8. Random Module
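np.random.randn()
Generates random numbers drawn from the standard normal distribution (mean 0, standard deviation 1). The values differ on every run; for example:

np.random.randn(2, 2) ⇒
[[ 0.1 -0.3]
 [ 1.2  0.7]]

np.random.randint()
Random integers in a range; the lower bound is included and the upper bound is excluded. For example:

np.random.randint(1, 10, 5) ⇒ [3, 7, 1, 9, 6]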
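Because the output is random, examples are easier to check when a seed is fixed. The sketch below assumes the use of np.random.seed for reproducibility, which the notes above do not mention; the seed value 0 is arbitrary.

```python
import numpy as np

np.random.seed(0)                     # fix the seed so the run can be repeated
normal = np.random.randn(2, 2)        # 2x2 samples from the standard normal
ints = np.random.randint(1, 10, 5)    # 5 integers from 1 to 9 (10 is excluded)

np.random.seed(0)                     # same seed again
normal_again = np.random.randn(2, 2)  # reproduces the first draw exactly

print(normal.shape)                          # (2, 2)
print(ints.min() >= 1 and ints.max() <= 9)   # True
print(np.array_equal(normal, normal_again))  # True
```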

9. Why NumPy Does Not Show Commas in Output


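Although Python lists require commas:
[1, 2, 3]
NumPy prints arrays without commas:
[1 2 3]
This is only for display purposes.

The commas are part of Python's list syntax, not part of the data. A NumPy array stores only the values, in a contiguous block of memory. print() uses a compact format that resembles a mathematical matrix, while repr() shows the commas: array([1, 2, 3]).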
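The difference between the two display formats can be checked directly. A minimal sketch:

```python
import numpy as np

arr = np.array([1, 2, 3])

as_printed = str(arr)   # the format used by print()
as_repr = repr(arr)     # the format shown in an interactive session
as_list = arr.tolist()  # back to an ordinary Python list

print(as_printed)  # [1 2 3]
print(as_repr)     # array([1, 2, 3])
print(as_list)     # [1, 2, 3]
```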

10. Data Types in NumPy


NumPy uses fixed-size types for speed:
• int64 : 64-bit integer
• float64 : 64-bit floating point
• bool : True/False
• object : Python objects
Example:
np.array([1,2,3]).dtype = int64
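The data type is normally inferred from the values, but it can also be requested or converted. The sketch below is illustrative; int8 is used only to show a smaller fixed-size type and is not discussed in the notes above.

```python
import numpy as np

ints = np.array([1, 2, 3])                   # integer type chosen by NumPy
floats = np.array([1.0, 2, 3])               # one float makes the whole array float64
flags = ints > 1                             # comparisons produce a bool array
as_float = ints.astype(np.float64)           # explicit conversion, returns a new array
small = np.array([1, 2, 3], dtype=np.int8)   # request a specific fixed-size type

print(floats.dtype)    # float64
print(flags.dtype)     # bool
print(as_float.dtype)  # float64
print(small.itemsize)  # 1 (bytes per element)
```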

Conclusion
NumPy provides fast, memory-efficient arrays and powerful mathematical tools. It is the foundation
for scientific computing, machine learning, data science, signal processing, and numerical simulations in
Python.

NumPy Exercises: Array Operations


Exercise 1: Array Creation & Reshaping
import numpy as np

a = np.arange(12)
b = a.reshape(3, 4)
c = b[::2, 1:]
Question: What are the values of a, b, and c?

Exercise 2: Indexing & Slicing


x = np.array([[10, 20, 30],
              [40, 50, 60],
              [70, 80, 90]])

y = x[::2, 1]
z = x[:, ::2]
Question: What are the values of y and z?

Exercise 3: Boolean Masking


p = np.array([3, 6, 9, 12, 15, 18])

mask = p % 4 == 0
q = p[mask]
Question: What are the values of mask and q?
Exercise 4: Broadcasting
m = np.arange(6).reshape(2, 3)
n = np.array([10, 20, 30])
r = m + n

Question: What is the value of r?

CustomerID  Country  State        City       Zip Code
1           USA      Georgia      Atlanta    30332
2           USA      Georgia      Atlanta    30331
3           USA      Florida      Melbourne  30912
4           USA      Florida      Tampa      30123
5           India    Karnataka    Bangalore  560001
6           India    Maharashtra  Mumbai     578234
7           India    Karnataka    Hubli      569823
8           India    Maharashtra  Mumbai     578234
9           Germany  Bavaria      Munich     80331
10          Canada   Ontario      Toronto    M4B 1B3

Table 1: Customer Information Table

What is a DataFrame?
A DataFrame is a two-dimensional, table-like data structure in Pandas, similar to a spreadsheet or SQL
table. It consists of rows and columns and allows efficient data manipulation and analysis.
A DataFrame has the following key components:

• Index: Labels for rows (by default integers starting from 0; labels are usually unique, but this is not enforced).

• Columns: Labeled names for different features or attributes.

• Data: The actual values stored in the DataFrame.
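The three components can be inspected on any DataFrame. A minimal sketch using two columns from Table 1:

```python
import pandas as pd

df = pd.DataFrame({
    "CustomerID": [1, 2, 3],
    "Country": ["USA", "India", "Canada"],
})

row_labels = list(df.index)       # default index: 0, 1, 2
column_names = list(df.columns)   # column labels
values = df.to_numpy().tolist()   # the stored data, row by row

print(row_labels)    # [0, 1, 2]
print(column_names)  # ['CustomerID', 'Country']
print(df.shape)      # (3, 2)
```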

7 Creating DataFrames in Pandas


Creating a DataFrame from a Dictionary
import pandas as pd

data = {
"CustomerID": [1, 2, 3],
"Country": ["USA", "India", "Canada"],
"State": ["Georgia", "Karnataka", "Ontario"],
"City": ["Atlanta", "Bangalore", "Toronto"],
"Zip Code": ["30332", "560001", "M4B 1B3"]
}

df = pd.DataFrame(data)
print(df)
Creating a DataFrame from a List of Lists
data = [[1, "USA", "Georgia", "Atlanta", "30332"],
[2, "India", "Karnataka", "Bangalore", "560001"],
[3, "Canada", "Ontario", "Toronto", "M4B 1B3"]]

columns = ["CustomerID", "Country", "State", "City", "Zip Code"]


df = pd.DataFrame(data, columns=columns)
print(df)

Creating a DataFrame from a CSV File


df = pd.read_csv("[Link]")
print(df)

Creating a DataFrame from an Excel File


df = pd.read_excel("[Link]")
print(df)

Creating a DataFrame from a NumPy Array


import numpy as np

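# A NumPy array holds a single data type, so mixing numbers and text
# stores every value (including CustomerID) as a string.
array = np.array([[1, "USA", "Atlanta"], [2, "India", "Bangalore"], [3, "Canada", "Toronto"]])
df = pd.DataFrame(array, columns=["CustomerID", "Country", "City"])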
print(df)

Creating an Empty DataFrame


df = pd.DataFrame(columns=["CustomerID", "Country", "State", "City", "Zip Code"])
print(df)

8 Common DataFrame Methods


Displaying Data
print(df.head()) # Shows the first 5 rows
print(df.tail(3)) # Shows the last 3 rows
df.info() # Prints column details and data types (info() prints directly and returns None)
print(df.describe()) # Shows statistical summary of numerical columns

Accessing Specific Data


print(df["Country"]) # Accesses a single column
print(df[["Country", "State"]]) # Accesses multiple columns
print(df.loc[1]) # Accesses a row by index label
print(df.iloc[0:2]) # Accesses first two rows using position-based indexing
Filtering Data Using Conditions
usa_customers = df[df["Country"] == "USA"] # Filters rows where Country is USA
georgia_customers = df[(df["Country"] == "USA") & (df["State"] == "Georgia")]
print(georgia_customers)
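Filtering works by building a boolean Series and using it to select rows. The sketch below uses a few rows from Table 1; the `~` (not) operator and `isin` are shown as further options and are not covered in the notes above.

```python
import pandas as pd

df = pd.DataFrame({
    "CustomerID": [1, 3, 5, 9],
    "Country": ["USA", "USA", "India", "Germany"],
    "State": ["Georgia", "Florida", "Karnataka", "Bavaria"],
})

is_usa = df["Country"] == "USA"                    # one True/False value per row
georgia = df[is_usa & (df["State"] == "Georgia")]  # both conditions must hold
not_usa = df[~is_usa]                              # rows where the condition is False
selected = df[df["Country"].isin(["India", "Germany"])]

print(is_usa.tolist())                 # [True, True, False, False]
print(georgia["CustomerID"].tolist())  # [1]
print(not_usa["CustomerID"].tolist())  # [5, 9]
```

Each condition needs its own parentheses when combined with `&` or `|`, because these operators bind more tightly than `==`.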

Adding New Columns


df["Membership"] = ["Gold", "Silver", "Bronze"] # Adds a new column
print(df)

Modifying Values
df.loc[1, "City"] = "Delhi" # Updates a specific value
print(df)

Deleting Columns and Rows


df.drop(columns=["Zip Code"], inplace=True) # Deletes a column
df.drop(index=2, inplace=True) # Deletes a row
print(df)

Sorting Data
sorted_df = df.sort_values(by="State") # Sorts by the 'State' column
print(sorted_df)

Grouping Data
grouped_df = df.groupby("Country").count() # Groups by country and counts entries
print(grouped_df)
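As a worked sketch of grouping, the Country and State columns of Table 1 can be counted per country and per country/state pair. Selecting a single column before `count()` and using `size()` are illustrative choices, not the only way to do this.

```python
import pandas as pd

# Country and State values copied from Table 1.
df = pd.DataFrame({
    "CustomerID": list(range(1, 11)),
    "Country": ["USA"] * 4 + ["India"] * 4 + ["Germany", "Canada"],
    "State": ["Georgia", "Georgia", "Florida", "Florida",
              "Karnataka", "Maharashtra", "Karnataka", "Maharashtra",
              "Bavaria", "Ontario"],
})

per_country = df.groupby("Country")["CustomerID"].count()  # customers per country
per_state = df.groupby(["Country", "State"]).size()        # rows per (country, state)

print(per_country["USA"])                 # 4
print(per_country["Germany"])             # 1
print(per_state[("India", "Karnataka")])  # 2
```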

9 Conclusion
DataFrames are a powerful tool for handling structured data in Python. They allow for easy data manipulation,
filtering, sorting, and analysis. Mastering Pandas DataFrames will help students efficiently work
with large datasets in real-world applications.

10 Introduction to Matplotlib
Matplotlib is a widely used Python library for data visualization. It enables the graphical representation
of numerical data using different types of plots. Along with basic plots, Matplotlib provides several
customization options such as markers, line styles, colors, grids, legends, and figure sizing.
The commonly used module is:
import matplotlib.pyplot as plt
11 Line Plot
A line plot is used to represent trends or functional relationships between two variables.

Basic Line Plot


x = [1, 2, 3, 4, 5]
y = [1, 4, 9, 16, 25]

plt.plot(x, y)
plt.title("Line Plot")
plt.show()

Figure 1: lineplot(without label)


Line Plot with Markers
Markers highlight individual data points on the curve.
x = [1, 2, 3, 4, 5]
y = [1, 4, 9, 16, 25]
plt.plot(x, y, marker='o')
plt.title("Line Plot with Markers")
plt.show()

Figure 2: lineplot(with markers)

Common marker symbols:

• 'o' : Circle

• 's' : Square

• '^' : Triangle

• '*' : Star
Line Styles and Colors
plt.plot(x, y, linestyle='--', color='r', marker='o')
plt.title("Line Style and Color Example")
plt.show()

Common line styles:

• '-' : Solid

• '--' : Dashed

• ':' : Dotted

• '-.' : Dash-dot

Figure 3: lineplot(with linestyle)


12 Multiple Line Plots and Legend
y2 = [2, 6, 12, 20, 30]

plt.plot(x, y, label="y = x^2")
plt.plot(x, y2, label="y = x(x+1)")
plt.xlabel("X")
plt.ylabel("Y")
plt.legend()
plt.grid()
plt.show()

Figure 4: Multiple line plots with a legend
13 Bar Plot
A bar plot is used to compare values across different categories.
categories = ["A", "B", "C", "D"]
values = [3, 7, 5, 9]

plt.bar(categories, values)
plt.xlabel("Category")
plt.ylabel("Value")
plt.title("Bar Plot")
plt.show()

Figure 5: Barplot

Bar plots are suitable for categorical data, and the bars are separated by gaps.
14 Histogram
A histogram represents the frequency distribution of continuous data.
data = [45, 50, 55, 60, 60, 65, 70, 75, 80]

plt.hist(data, bins=5)
plt.xlabel("Value")
plt.ylabel("Frequency")
plt.title("Histogram")
plt.show()

In a histogram, bars touch each other, indicating continuity of data.

Figure 6: Histogram
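To see what `bins=5` does with this data, the bin edges and counts can be computed without drawing anything. This sketch uses np.histogram, whose binning rules Matplotlib's hist function relies on; it is an illustration, not part of the original example.

```python
import numpy as np

data = [45, 50, 55, 60, 60, 65, 70, 75, 80]

# Five equal-width bins between the smallest and largest value.
counts, edges = np.histogram(data, bins=5)

print(edges.tolist())   # [45.0, 52.0, 59.0, 66.0, 73.0, 80.0]
print(counts.tolist())  # [2, 1, 3, 1, 2]
```

Each bin is 7 units wide because the data range (80 - 45 = 35) is split into 5 equal parts, and the last bin includes its right edge so that the value 80 is counted.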
15 Scatter Plot
A scatter plot shows the relationship between two numerical variables using individual points.
x = [1, 2, 3, 4, 5]
y = [1, 4, 9, 16, 25]
plt.scatter(x, y)
plt.xlabel("X")
plt.ylabel("Y")
plt.title("Scatter Plot")
plt.show()

Scatter plots are useful for correlation analysis and experimental data visualization.

Figure 7: scatter plot


16 Pie Chart
A pie chart represents data as proportions of a whole.
labels = ["A", "B", "C", "D"]
sizes = [25, 30, 20, 25]

plt.pie(sizes, labels=labels, autopct='%1.1f%%')


plt.title("Pie Chart")
plt.show()

Pie charts are best used when the number of categories is small.

Figure 8: pie chart

17 Figure Size and Saving Plots


The size of a plot can be controlled using the figure() function.
plt.figure(figsize=(6,4))
plt.plot(x, y)
plt.show()

Plots can be saved as image files.


plt.plot(x, y)
plt.savefig("[Link]")
plt.show()
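Figure sizing and saving can be combined in a script that runs without opening a window. The sketch below makes several assumptions that are not in the notes above: the non-interactive "Agg" backend, the output file name line_plot_example.png in the system's temporary folder, and the dpi value.

```python
import os
import tempfile

import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend, so no display is needed
import matplotlib.pyplot as plt

x = [1, 2, 3, 4, 5]
y = [1, 4, 9, 16, 25]

# Hypothetical output location, chosen only for this example.
out_path = os.path.join(tempfile.gettempdir(), "line_plot_example.png")

fig = plt.figure(figsize=(6, 4))  # width and height in inches
plt.plot(x, y, marker="o")
plt.title("Line Plot")
plt.savefig(out_path, dpi=150)    # save before any call to plt.show()
plt.close(fig)                    # release the figure's memory

print(os.path.exists(out_path))   # True
```

Calling savefig before show matters in interactive use, because the figure may already be cleared once the plot window has been closed.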

18 Comparison of Plot Types


Plot Type     Data Type     Purpose
Line Plot     Continuous    Trend analysis
Bar Plot      Categorical   Comparison
Histogram     Continuous    Distribution
Scatter Plot  Numerical     Relationship
Pie Chart     Proportional  Percentage contribution
19 Conclusion
Matplotlib offers a wide range of plotting techniques along with extensive customization options such as
markers, colors, line styles, legends, grids, and figure sizing. Proper selection of plot type and styling
enhances clarity and interpretation of data in scientific and engineering applications.
