UNIT I
Introduction to Data Science
1. Concept of Data Science
Data Science is an interdisciplinary field that involves extracting useful knowledge and insights
from data using scientific methods, algorithms, and systems. It combines concepts from
statistics, mathematics, programming, machine learning, and data analysis to understand
patterns in large datasets.
In today's digital world, organizations generate massive amounts of data from websites, social
media, mobile applications, sensors, and business transactions. Data Science helps organizations
analyze this data to make better decisions, predict trends, and improve services.
A Data Scientist is a professional who collects, processes, and analyzes data to discover useful
insights. They use tools such as Python, R, SQL, and machine learning algorithms to analyze
large datasets.
Data Science Process
The data science workflow usually follows these steps:
1. Data Collection
Gathering raw data from different sources like databases, websites, sensors, or APIs.
2. Data Cleaning
Removing incorrect, incomplete, or duplicate data to improve data quality.
3. Data Exploration
Understanding patterns and relationships using visualization and statistics.
4. Data Modeling
Applying algorithms or machine learning models to analyze data.
5. Interpretation and Communication
Presenting results through reports, dashboards, or visualizations.
Importance of Data Science
Data science plays a major role in many industries such as:
• Healthcare (disease prediction)
• Finance (fraud detection)
• Marketing (customer behavior analysis)
• E-commerce (recommendation systems)
• Transportation (traffic prediction)
For example, companies like Amazon and Netflix use data science to recommend products or
movies based on user preferences.
Thus, data science helps organizations transform raw data into useful knowledge for decision
making.
2. Traits of Big Data
Big Data refers to extremely large and complex datasets that cannot be processed using
traditional data processing systems.
Big data is characterized by several important features called “Traits of Big Data”, commonly
known as the 5 V’s of Big Data.
1. Volume
Volume refers to the large amount of data generated every second.
Examples include:
• Social media posts
• Online transactions
• Sensor data
• Video uploads
Companies like Facebook and Google generate petabytes of data daily.
Example:
Millions of tweets are posted every day on social media platforms.
2. Velocity
Velocity refers to the speed at which data is generated and processed.
Many applications require real-time data processing.
Examples include:
• Stock market transactions
• Online banking systems
• Real-time traffic monitoring
For example, stock trading systems must process thousands of transactions every second.
3. Variety
Variety refers to the different types of data formats generated from multiple sources.
Data can be classified into:
1. Structured Data
Organized data stored in tables or databases.
2. Semi-Structured Data
Data with some structure such as JSON or XML files.
3. Unstructured Data
Data without a specific format such as images, videos, emails, or social media posts.
Example:
A company may store customer data in databases while also collecting images and videos from
social media.
4. Veracity
Veracity refers to the accuracy and reliability of data.
Since data comes from many sources, it may contain errors, missing values, or inconsistencies.
Example:
• Fake social media accounts
• Incorrect survey responses
• Sensor errors
Data scientists must clean and validate data before analysis to ensure reliable results.
5. Value
Value refers to the useful insights that can be extracted from data.
Even though organizations collect huge amounts of data, the real goal is to generate meaningful
information that improves business decisions.
Example:
A retail company analyzing customer purchase data can identify:
• Popular products
• Customer preferences
• Seasonal sales trends
Thus, the ultimate purpose of big data is to create value for organizations and society.
3. Web Scraping
Web Scraping is the process of automatically extracting data from websites using software tools
or scripts.
Many websites contain valuable information such as:
• Product prices
• News articles
• Weather data
• Social media content
Web scraping allows data scientists to collect this data for analysis.
How Web Scraping Works
The basic steps involved in web scraping are:
1. Sending a request to a website
A program sends a request to access a webpage.
2. Downloading the webpage content
The webpage is returned in HTML format.
3. Parsing the webpage
The program extracts required data from HTML tags.
4. Saving the data
The extracted information is stored in files or databases.
Tools Used for Web Scraping
Common tools used for web scraping include:
• Python libraries (BeautifulSoup, Scrapy)
• Selenium
• APIs
• Data extraction tools
Example:
A company may use web scraping to collect product prices from different e-commerce
websites to compare prices and monitor competitors.
Applications of Web Scraping
Web scraping is widely used in:
• Market research
• Price comparison websites
• News aggregation
• Job portals
• Social media analysis
Ethical Considerations
While web scraping is useful, it must follow certain rules:
• Respect website terms and conditions
• Avoid excessive requests to servers
• Protect user privacy
Responsible web scraping ensures ethical and legal data collection.
4. Analysis vs Reporting
In data science, analysis and reporting are two different ways of using data to support decision
making.
Although both involve working with data, their purpose and approach are different.
Data Reporting
Data reporting refers to presenting data in a structured format such as reports, tables,
dashboards, or charts.
The main goal of reporting is to summarize historical data and show what has already
happened.
Characteristics of Reporting
• Focuses on past data
• Provides summaries and statistics
• Uses dashboards and charts
• Helps monitor business performance
Example:
A company generating a monthly sales report showing:
• Total sales
• Revenue
• Number of customers
Reporting answers questions like:
• What happened last month?
• How many products were sold?
• What was the total revenue?
Data Analysis
Data analysis involves examining data in depth to discover patterns, relationships, and insights.
It focuses not only on past events but also on predicting future trends and identifying hidden
patterns.
Characteristics of Data Analysis
• Uses statistical methods and algorithms
• Identifies patterns and correlations
• Supports decision making
• Helps predict future outcomes
Example:
Analyzing customer purchase data to determine:
• Which products customers buy together
• What factors influence sales
• Which customers are likely to stop buying
Analysis answers questions like:
• Why did sales increase?
• What factors influence customer behavior?
• What will happen in the future?
Difference between Analysis and Reporting
Feature Reporting Analysis
Purpose Summarize past data Discover insights
Focus What happened Why it happened
Complexity Simple summaries Advanced analysis
Tools Dashboards, reports Statistical models
Outcome Information Insights and predictions
Conclusion
Data science is a powerful field that helps organizations extract valuable insights from data. It
combines statistics, programming, and machine learning to analyze large datasets and support
decision making.
Big data plays an important role in data science due to its volume, velocity, variety, veracity,
and value. Techniques such as web scraping allow data scientists to collect information from
the internet for analysis.
Additionally, understanding the difference between data analysis and reporting is essential.
Reporting focuses on summarizing past data, while analysis explores deeper insights and predicts
future trends.
As technology continues to grow, data science will become increasingly important in industries
such as healthcare, finance, business, and research.
UNIT II
Programming Tools for Data Science
1. Introduction to Programming Tools for Data Science
Programming tools are essential in Data Science because they help data scientists collect,
analyze, and visualize large datasets. These tools make it possible to process complex data
efficiently and extract useful insights.
One of the most widely used programming languages in data science is Python. Python is
popular because it is easy to learn, flexible, and has many powerful libraries for data analysis and
machine learning.
Programming tools help perform many important tasks such as:
• Data collection
• Data cleaning
• Data visualization
• Statistical analysis
• Machine learning modeling
Python provides several libraries (toolkits) that simplify these tasks. Some of the most important
Python toolkits used in data science include:
• NumPy
• Matplotlib
• Scikit-learn
• Natural Language Toolkit (NLTK)
These libraries help data scientists perform calculations, visualize data, and build predictive
models.
2. Toolkits using Python
Python provides many libraries that simplify data science tasks. These libraries act as toolkits
that contain pre-built functions for analysis and visualization.
NumPy
NumPy stands for Numerical Python. It is one of the most important libraries used for
numerical and mathematical operations in Python.
Features of NumPy
• Supports large multi-dimensional arrays
• Performs mathematical operations efficiently
• Provides functions for linear algebra and statistics
• Faster than traditional Python lists
Example
A dataset containing student marks can be stored in a NumPy array and analyzed quickly using
mathematical operations.
Example operations include:
• Mean calculation
• Standard deviation
• Matrix multiplication
NumPy forms the foundation for many other data science libraries.
Matplotlib
Matplotlib is a powerful Python library used for data visualization.
It allows users to create graphs and charts to understand patterns in data.
Features
• Creates bar charts, line graphs, and scatter plots
• Supports customization of graphs
• Helps visualize complex datasets
Example:
A company can visualize monthly sales using a line chart to understand growth trends.
Matplotlib is widely used in data analysis and research.
Scikit-learn
Scikit-learn is a machine learning library used for building predictive models.
Features
• Classification algorithms
• Regression models
• Clustering techniques
• Data preprocessing tools
Example Applications
• Spam email detection
• Customer segmentation
• Sales prediction
Scikit-learn allows data scientists to build machine learning models with minimal coding.
NLTK
Natural Language Toolkit (NLTK) is a Python library used for Natural Language Processing
(NLP).
It helps analyze and process human language data such as text and speech.
Features
• Text processing
• Tokenization
• Sentiment analysis
• Language modeling
Example:
NLTK can be used to analyze customer reviews to determine positive or negative sentiment.
It is widely used in applications such as:
• Chatbots
• Text analysis
• Language translation systems
3. Visualizing Data
Data Visualization is the graphical representation of data using charts and graphs. It helps in
understanding patterns, trends, and relationships in datasets.
Visualization makes complex data easier to interpret.
Common types of charts used in data science include:
• Bar charts
• Line charts
• Scatter plots
Bar Charts
A bar chart represents data using rectangular bars.
Each bar shows the value of a specific category.
Example
A bar chart can display the number of students in different departments such as:
• Computer Science
• Mechanical Engineering
• Civil Engineering
Advantages
• Easy comparison between categories
• Simple to understand
Line Charts
A line chart shows trends over time by connecting data points with lines.
Example
A line chart can represent monthly sales of a company over a year.
Advantages
• Shows growth or decline trends
• Useful for time-series data
Scatter Plots
A scatter plot shows the relationship between two variables.
Each point represents a pair of values.
Example
A scatter plot may show the relationship between:
• Hours studied
• Exam scores
Advantages
• Helps detect correlations
• Identifies patterns in data
4. Working with Data
Data scientists must know how to collect and process data before analysis.
Working with data involves several tasks such as:
• Reading files
• Scraping the web
• Using APIs
• Cleaning data
Reading Files
Data is often stored in files such as:
• CSV files
• Excel spreadsheets
• JSON files
Python libraries like Pandas help read these files into data structures for analysis.
Example:
A dataset containing sales records can be read from a CSV file and analyzed.
Scraping the Web
Web scraping is the process of extracting information from websites automatically.
Python libraries like BeautifulSoup and Scrapy are commonly used for this purpose.
Example:
A company may scrape product prices from e-commerce websites to analyze market trends.
Web scraping allows data scientists to collect large datasets from the internet.
Using APIs
An API (Application Programming Interface) allows programs to access data from online
services.
Example: Twitter API
The Twitter API allows developers to collect tweets, user information, and trends from the
**Twitter platform.
Example uses:
• Social media sentiment analysis
• Trend detection
• Public opinion analysis
APIs provide structured and reliable data access.
5. Cleaning and Munging Data
Data Cleaning and Data Munging are processes used to transform raw data into a usable
format.
Data munging involves:
• Removing duplicate records
• Handling missing values
• Correcting incorrect data
• Converting data formats
Example:
If a dataset contains missing values in the "Age" column, those values may be replaced with the
average age.
Clean data ensures accurate analysis and reliable results.
6. Manipulating Data
Data Manipulation involves modifying datasets to prepare them for analysis.
Common operations include:
• Filtering data
• Sorting data
• Grouping data
• Aggregating values
Example:
A dataset containing sales transactions may be grouped by product category to calculate total
sales.
Python libraries such as Pandas provide powerful tools for data manipulation.
7. Rescaling Data
Rescaling is the process of adjusting the scale of numerical data so that variables have similar
ranges.
This is important for many machine learning algorithms.
Two common techniques are:
Normalization
Converts values into a range between 0 and 1.
Standardization
Adjusts data so that it has:
• Mean = 0
• Standard deviation = 1
Rescaling helps algorithms perform better and produce accurate predictions.
8. Dimensionality Reduction
Dimensionality Reduction is the process of reducing the number of variables (features) in a
dataset while keeping important information.
Large datasets often contain many variables, which can make analysis complex.
Dimensionality reduction techniques simplify the dataset.
Advantages
• Reduces computational cost
• Improves model performance
• Removes irrelevant features
Example Method
One common technique is Principal Component Analysis (PCA).
PCA transforms many variables into a smaller set of components that capture most of the data
variation.
Example:
A dataset with 100 variables may be reduced to 10 important components.
Conclusion
Programming tools play a crucial role in data science. Libraries such as NumPy, Matplotlib,
Scikit-learn, and NLTK help data scientists perform complex operations efficiently.
Visualization techniques like bar charts, line charts, and scatter plots help interpret data patterns
clearly.
In addition, working with data involves tasks such as reading files, web scraping, using APIs,
and cleaning datasets. Techniques like data manipulation, rescaling, and dimensionality
reduction help prepare data for machine learning and analysis.
Together, these tools and techniques enable data scientists to transform raw data into valuable
insights for decision-making.
every topic python code example
1. Using NumPy for Data Operations
Using NumPy to perform numerical calculations.
import numpy as np
# Create a NumPy array
data = [Link]([10, 20, 30, 40, 50])
# Calculate statistics
mean = [Link](data)
sum_value = [Link](data)
print("Array:", data)
print("Mean:", mean)
print("Sum:", sum_value)
Output example
Array: [10 20 30 40 50]
Mean: 30.0
Sum: 150
2. Using Matplotlib for Visualization
Creating graphs using Matplotlib.
import [Link] as plt
months = ["Jan", "Feb", "Mar", "Apr"]
sales = [100, 150, 200, 250]
[Link](months, sales)
[Link]("Monthly Sales")
[Link]("Months")
[Link]("Sales")
[Link]()
This creates a line chart of monthly sales.
3. Machine Learning with Scikit-learn
Example using Scikit-learn.
from sklearn.linear_model import LinearRegression
import numpy as np
# Sample data
x = [Link]([[1],[2],[3],[4]])
y = [Link]([2,4,6,8])
model = LinearRegression()
[Link](x, y)
prediction = [Link]([[5]])
print("Predicted value:", prediction)
This predicts values using linear regression.
4. Text Processing with NLTK
Example using Natural Language Toolkit.
import nltk
from [Link] import word_tokenize
text = "Data Science is very interesting"
words = word_tokenize(text)
print(words)
Output
['Data', 'Science', 'is', 'very', 'interesting']
This process is called tokenization.
5. Bar Chart Example
import [Link] as plt
subjects = ["Math", "Science", "English"]
marks = [85, 90, 78]
[Link](subjects, marks)
[Link]("Student Marks")
[Link]()
This creates a bar chart of student marks.
6. Line Chart Example
import [Link] as plt
days = [1,2,3,4,5]
temperature = [30,32,31,33,35]
[Link](days, temperature)
[Link]("Temperature Trend")
[Link]("Day")
[Link]("Temperature")
[Link]()
7. Scatter Plot Example
import [Link] as plt
study_hours = [1,2,3,4,5]
marks = [40,50,60,70,80]
[Link](study_hours, marks)
[Link]("Study Hours")
[Link]("Marks")
[Link]()
This shows relationship between study hours and marks.
8. Reading Files (CSV Data)
import pandas as pd
data = pd.read_csv("[Link]")
print([Link]())
This reads data from a CSV file.
9. Web Scraping Example
Using BeautifulSoup to scrape data.
import requests
from bs4 import BeautifulSoup
url = "[Link]
response = [Link](url)
soup = BeautifulSoup([Link], "[Link]")
print([Link])
This extracts the title of a webpage.
10. Using APIs (Twitter Example)
Using Twitter API.
import requests
url = "[Link]
headers = {
"Authorization": "Bearer YOUR_ACCESS_TOKEN"
}
response = [Link](url, headers=headers)
print([Link]())
This fetches tweets from the Twitter platform.
11. Cleaning Data Example
import pandas as pd
data = pd.read_csv("[Link]")
# Remove missing values
clean_data = [Link]()
print(clean_data)
12. Manipulating Data Example
import pandas as pd
data = {
"Name": ["A", "B", "C"],
"Marks": [80, 90, 70]
}
df = [Link](data)
sorted_data = df.sort_values("Marks")
print(sorted_data)
13. Rescaling Data Example
from [Link] import MinMaxScaler
import numpy as np
data = [Link]([[10],[20],[30]])
scaler = MinMaxScaler()
scaled = scaler.fit_transform(data)
print(scaled)
This converts values between 0 and 1.
14. Dimensionality Reduction Example
Using Principal Component Analysis.
from [Link] import PCA
import numpy as np
data = [Link]([
[1,2,3],
[4,5,6],
[7,8,9]
])
pca = PCA(n_components=2)
reduced = pca.fit_transform(data)
print(reduced)
This reduces 3 features to 2 features.