Understanding Data Science Fundamentals

Data Science Notes

Uploaded by

Sandeep Mazumdar

Data science is the study of data to extract meaningful insights for business. It is a
multidisciplinary approach that combines principles and practices from mathematics, statistics,
artificial intelligence, and computer engineering to analyze large amounts of data.

Data science with examples: Data science is the domain of study that deals with vast volumes of
data, using modern tools and techniques to find unseen patterns, derive meaningful information, and
make business decisions. For example, finance companies can use a customer's banking and
bill-paying history to assess creditworthiness and loan risk.

What is data science useful for? Data science can identify patterns in seemingly unstructured or
unrelated data, which makes inferences and predictions possible. Tech companies that collect user
data can use these techniques to turn what is collected into useful or profitable information.

Data Science Applications

• Fraud and Risk Detection.

• Healthcare.

• Internet Search.

• Targeted Advertising.

• Website Recommendations.

• Advanced Image Recognition.

• Speech Recognition.

• Airline Route Planning.

What is Categorical Data? Categorical data is a type of data that groups information with similar
characteristics, while numerical data expresses information in the form of numbers. Categorical
data classifies an observation as belonging to one or more categories. For example, an item might
be judged as good or bad, or a response to a survey might include categories such as agree,
disagree, or no opinion.

Quantitative data is data that can be counted or measured in numerical values. The two main types
of quantitative data are discrete data and continuous data. Height in feet, age in years, and
weight in pounds are examples of quantitative data. Qualitative data is descriptive data that is
not expressed numerically.

Example: Quantitative information involves a measurable quantity, so numbers are used. Some
examples are length, mass, temperature, and time. Quantitative information is often simply called
data, although data can also be things other than numbers.

A quantitative variable is a variable that reflects a notion of magnitude, that is, the values it
can take are numbers. A quantitative variable thus represents a measure and is numerical.
Quantitative variables are divided into two types: discrete and continuous. Quantitative data are
data about numeric variables (e.g. how many, how much, or how often). Qualitative data are measures
of 'types' and may be represented by a name, a symbol, or a number code. Qualitative data are data
about categorical variables (e.g. what type).

Quantitative variables: variables whose values result from counting or measuring something.
Examples: height, weight, time in the 100-yard dash, number of items sold to a shopper.
Qualitative variables: variables that are not measurement variables.

What measures the spread of a quantitative variable? The variance and the standard deviation
measure the spread of the data around the mean. They summarise how close each observed data value
is to the mean value. The most common measure of variation, or spread, is the standard deviation,
a number that measures how far data values are from their mean.
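
As a rough illustration of these two measures, the sketch below computes the population variance and
standard deviation by hand. The function names and the sample values are assumptions made up for
this example; they do not come from the notes.

```python
import math

def population_variance(values):
    """Average squared distance of each value from the mean."""
    mean = sum(values) / len(values)
    return sum((x - mean) ** 2 for x in values) / len(values)

def population_std(values):
    """Square root of the variance, in the same units as the data."""
    return math.sqrt(population_variance(values))

# Hypothetical sample, chosen only for illustration.
data = [2, 4, 4, 4, 5, 5, 7, 9]
print(population_variance(data))  # 4.0
print(population_std(data))       # 2.0
```

A sample (rather than population) variance would divide by `len(values) - 1` instead.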

What Is a Sampling Distribution? A sampling distribution is the probability distribution of a
statistic obtained through repeated sampling of a specific population. It describes the range of
possible outcomes for a statistic, such as the mean or mode of some variable. A sampling
distribution is created by drawing many random samples of a given size from the same population,
and it helps you understand how a sample statistic varies from sample to sample. The sampling
distribution of a proportion is what you would get by repeating your survey or poll for all
possible samples of the population. For example, instead of asking 1000 cat owners once what cat
food their pet prefers, you could repeat the poll many times with a fresh sample of 1000 owners
each time and look at how the resulting proportions are distributed.
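
The idea of repeated sampling can be approximated by simulation. The sketch below is only an
illustration: the population (the integers 1 to 1000), the sample size, and the helper name
`sample_means` are all assumptions made for the example.

```python
import random
import statistics

def sample_means(population, sample_size, n_samples, seed=0):
    """Draw n_samples random samples and return the mean of each one."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    return [
        statistics.mean(rng.sample(population, sample_size))
        for _ in range(n_samples)
    ]

# Hypothetical population: the integers 1..1000 (mean 500.5).
population = list(range(1, 1001))
means = sample_means(population, sample_size=50, n_samples=500)

# The sample means cluster around the population mean and vary
# less than the individual population values do.
print(statistics.mean(means), statistics.stdev(means))
```

The list `means` is an empirical stand-in for the sampling distribution of the mean.
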

Data preprocessing transforms the data into a format that is more easily and effectively processed
in data mining, machine learning and other data science tasks. The techniques are generally used at
the earliest stages of the machine learning and AI development pipeline to ensure accurate results.
Data preparation and filtering steps can take a considerable amount of processing time. Examples of
data preprocessing include cleaning, instance selection, normalization, one-hot encoding,
transformation, and feature extraction and selection. The product of data preprocessing is the
final training set.

Data cleaning is the process of fixing or removing incorrect, corrupted, incorrectly formatted,
duplicate, or incomplete data within a dataset. When combining multiple data sources, there are
many opportunities for data to be duplicated or mislabeled. In other words, data cleaning organizes
and corrects inaccurate, poorly formatted, or otherwise messy data. For example, if you conduct a
survey and ask people for their phone numbers, people may enter their numbers in different formats.
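
The phone-number case can be sketched in a few lines. This is a minimal example under stated
assumptions: the helper names and the survey answers are hypothetical, and a real pipeline would
also need rules for country codes and extensions.

```python
import re

def normalize_phone(raw):
    """Keep only the digits so differently formatted numbers compare equal."""
    digits = re.sub(r"\D", "", raw)
    return digits or None  # None marks an entry with no usable digits

def clean_phone_column(values):
    """Normalize each entry, drop unusable ones, and remove duplicates."""
    seen = set()
    cleaned = []
    for raw in values:
        phone = normalize_phone(raw)
        if phone is not None and phone not in seen:
            seen.add(phone)
            cleaned.append(phone)
    return cleaned

# Hypothetical survey answers in mixed formats.
responses = ["(555) 123-4567", "555.123.4567", "555 987 6543", "n/a"]
print(clean_phone_column(responses))  # ['5551234567', '5559876543']
```
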

Boxplot: A box and whisker plot, also called a box plot, displays the five-number summary of a set
of data. The five-number summary is the minimum, first quartile, median, third quartile, and
maximum. In a box plot, we draw a box from the first quartile to the third quartile. A vertical
line goes through the box at the median. Example: It is a type of chart that depicts a group of
numerical data through their quartiles. It is a simple way to visualize the shape of the data, and
it makes comparing characteristics of data between categories very easy. First Quartile (Q1): 25%
of the data lies below the first (lower) quartile.
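
The five numbers behind a box plot can be computed with Python's standard library. The sketch below
is one possible way to do it; the function name and data are assumptions for illustration, and
other quartile conventions can give slightly different values for Q1 and Q3.

```python
import statistics

def five_number_summary(values):
    """Return (minimum, Q1, median, Q3, maximum) for a list of numbers."""
    ordered = sorted(values)
    # method="inclusive" treats the data as the whole population;
    # the default "exclusive" method may return different quartiles.
    q1, median, q3 = statistics.quantiles(ordered, n=4, method="inclusive")
    return ordered[0], q1, median, q3, ordered[-1]

# Hypothetical data set.
print(five_number_summary([7, 1, 9, 3, 5, 2, 8, 4, 6]))
```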

What is HDFS? HDFS is a distributed file system that handles large data sets running on commodity
hardware. It is used to scale a single Apache Hadoop cluster to hundreds (and even thousands) of
nodes. HDFS is one of the major components of Apache Hadoop, the others being MapReduce and YARN.

Characteristics of HDFS: HDFS employs a NameNode and DataNode architecture to implement a
distributed file system that provides high-performance access to data across highly scalable
Hadoop clusters.

Features of HDFS

• Data replication.

• Fault tolerance and reliability.

• High availability.

• Scalability.

• High throughput.

• Data locality.

The file allocation table, or FAT, is part of an older type of file system that is still supported
by most operating systems. The table is stored near the start of the volume, after the boot sector
(sector 0) and any other reserved sectors. Advantages

• Uses the whole disk block for data.

• A bad disk block does not cause all successive blocks to be lost.

• Random access is provided, although it is not very fast.

• Only the FAT needs to be traversed in each file operation.
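
To see why only the table has to be traversed, consider a toy model in which each table entry names
the block that follows it in a file. This is a simplified sketch, not the on-disk FAT format: the
dictionary layout and the `END_OF_CHAIN` marker are assumptions made for illustration.

```python
END_OF_CHAIN = -1  # hypothetical marker; real FAT variants use reserved values

def file_blocks(fat, start_block):
    """Follow the table from a file's first block to its last."""
    blocks = []
    current = start_block
    while current != END_OF_CHAIN:
        blocks.append(current)
        current = fat[current]  # the table entry names the next block
    return blocks

# Toy table: the entry at key i holds the block that follows block i.
fat = {2: 5, 5: 3, 3: END_OF_CHAIN, 4: 6, 6: END_OF_CHAIN}
print(file_blocks(fat, 2))  # [2, 5, 3]
```

Reaching the n-th block of a file means walking n table entries, which is why random access works
but is not especially fast.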

Shuffling in MapReduce: The process of transferring data from the mappers to the reducers is known
as shuffling, i.e. the process by which the system performs the sort and transfers the map output
to the reducer as input. Sorting is one of the basic MapReduce algorithms for processing and
analyzing data. MapReduce automatically sorts the output key-value pairs from the mapper by their
keys. Sorting methods are implemented in the mapper class itself.
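
The flow from mapper output through shuffle and sort to the reducer can be imitated in a single
process. The sketch below is a word-count illustration only: it does not use the Hadoop API, and
all function names and input lines are assumptions made for the example.

```python
from collections import defaultdict

def map_phase(line):
    """Mapper: emit a (word, 1) pair for every word in a line."""
    return [(word.lower(), 1) for word in line.split()]

def shuffle_and_sort(mapped_pairs):
    """Group values by key and order the keys, as the shuffle step does."""
    groups = defaultdict(list)
    for key, value in mapped_pairs:
        groups[key].append(value)
    return sorted(groups.items())

def reduce_phase(key, values):
    """Reducer: combine all values that share a key."""
    return key, sum(values)

# Hypothetical input split into two lines.
lines = ["big data big ideas", "data moves to reducers"]
mapped = [pair for line in lines for pair in map_phase(line)]
reduced = [reduce_phase(key, values) for key, values in shuffle_and_sort(mapped)]
print(reduced)
```

In a real cluster the grouped pairs would be sent over the network to separate reducer tasks
instead of being handled in one list.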

Spark architecture consists of four components: the Spark driver, executors, cluster managers, and
worker nodes. It uses Datasets and DataFrames as the fundamental data storage mechanism to optimise
Spark processing and big data computation.

NoSQL databases store data in formats other than relational tables, such as documents. Accordingly,
we classify them as "not only SQL" and subdivide them by a variety of flexible data models. Types
of NoSQL databases include pure document databases, key-value stores, wide-column databases, and
graph databases. For example, document databases like MongoDB are general-purpose databases, while
key-value databases are ideal for large volumes of data with simple lookup queries. NoSQL databases
can be faster than relational database management systems for some workloads because they use
different data structures. Cassandra's data structure, for example, can be faster than a relational
structure for such workloads. NoSQL databases are mainly used in big data and real-time web
applications.

Big data refers to data that is so large, fast or complex that it is difficult or impossible to
process using traditional methods. The act of accessing and storing large amounts of information
for analytics has been around for a long time. Big data also encompasses a wide variety of data
types: structured data, such as transactions and financial records; unstructured data, such as
text, documents and multimedia files; and semistructured data, such as web server logs and
streaming data from sensors. Big data is a collection of data from many different sources and is
often described by five characteristics: volume, value, variety, velocity, and veracity.

The MapReduce paradigm was created in 2003 to enable processing of large data sets in a massively
parallel manner. The goal of the MapReduce model is to simplify the approach to transformation and
analysis of large datasets, as well as to allow developers to focus on algorithms instead of data
management. MapReduce is a programming model used to perform distributed processing in parallel in
a Hadoop cluster, which is what makes Hadoop fast. When you are dealing with big data, serial
processing is no longer practical.

Similarity. In data science, a similarity measure is a way of measuring how data samples are
related or close to each other. Common measures:

• 1) Cosine similarity

• 2) Manhattan distance

• 3) Euclidean distance

• 4) Minkowski distance

• 5) Jaccard similarity

Cosine similarity: Cosine similarity measures the similarity between two vectors of an inner
product space. It is measured by the cosine of the angle between two vectors and determines whether
two vectors are pointing in roughly the same direction. It is often used to measure document
similarity in text analysis.

Manhattan distance: The Manhattan distance, often called Taxicab distance or City Block distance,
calculates the distance between real-valued vectors. Imagine vectors that describe objects on a
uniform grid such as a chessboard. Manhattan distance then refers to the distance between two
vectors if they could only move at right angles.

Euclidean distance: Euclidean distance calculates the straight-line distance between two
real-valued vectors. You are most likely to use Euclidean distance when calculating the distance
between two rows of data that have numerical values, such as floating-point or integer values.

Minkowski distance: It is a generalization of the Euclidean and Manhattan distance measures and
adds a parameter, called the "order" or "p", that allows different distance measures to be
calculated. The Minkowski distance measure is calculated as follows:
MinkowskiDistance = (sum for i = 1 to N of abs(v1[i] - v2[i])^p)^(1/p)
With p = 1 this is the Manhattan distance, and with p = 2 it is the Euclidean distance.

Jaccard similarity: The Jaccard similarity measures the similarity between two sets of data to see
which members are shared and which are distinct. It is calculated by dividing the number of
observations in both sets by the number of observations in either set. In other words, the Jaccard
similarity is the size of the intersection divided by the size of the union of the two sets.
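
The five measures can be written directly from their definitions. The sketch below is illustrative
only: the function names are assumptions, the vectors are assumed to be non-zero and of equal
length, and the sets are assumed not to be both empty.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def minkowski_distance(a, b, p):
    """Sum the p-th powers of the absolute differences, then take the p-th root."""
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1 / p)

def manhattan_distance(a, b):
    return minkowski_distance(a, b, 1)  # order p = 1

def euclidean_distance(a, b):
    return minkowski_distance(a, b, 2)  # order p = 2

def jaccard_similarity(s, t):
    """Size of the intersection divided by the size of the union."""
    s, t = set(s), set(t)
    return len(s & t) / len(s | t)

print(manhattan_distance([0, 0], [3, 4]))         # 7.0
print(euclidean_distance([0, 0], [3, 4]))         # 5.0
print(jaccard_similarity({1, 2, 3}, {2, 3, 4}))   # 0.5
```
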

Streaming data is data that is generated continuously by thousands of data sources, which typically
send in the data records simultaneously and in small sizes (on the order of kilobytes). Examples
include location data, stock prices, IT system monitoring, fraud detection, retail inventory,
sales, customer activity, and more. Companies use data streams such as the following to power their
business activity. Data stream types:

• Sensor readings from machines.

• e-Commerce purchase data.

• Stock exchange data to predict the stock price.

• Credit card transactions for fraud detection.

• Social media sentiment analysis.
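
Because stream records arrive continuously, they are usually processed one at a time rather than
loaded in full. The sketch below shows one simple pattern, a rolling average over the most recent
readings; the function name, window size, and sensor values are assumptions for illustration.

```python
from collections import deque

def rolling_average(stream, window=3):
    """Yield the mean of the latest `window` readings as each one arrives."""
    recent = deque(maxlen=window)  # old readings fall off automatically
    for reading in stream:
        recent.append(reading)
        yield sum(recent) / len(recent)

# Hypothetical sensor readings arriving one at a time.
readings = iter([10, 12, 11, 50, 13])
for average in rolling_average(readings):
    print(average)
```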

Transaction processing is a style of computing, typically performed by large server computers, that
supports interactive applications. In transaction processing, work is divided into individual,
indivisible operations, called transactions. As the term suggests, transactional data is data
related to the transactions of an organization. For example, when a product is purchased or sold,
the details of that event are captured at the same time; this record is the transactional data for
that product.
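
The "indivisible" property can be demonstrated with Python's built-in `sqlite3` module. This is a
minimal sketch under stated assumptions: the `accounts` table, the account names, and the
`transfer` helper are hypothetical and exist only for this example.

```python
import sqlite3

def transfer(conn, source, target, amount):
    """Move money between accounts as one all-or-nothing transaction."""
    try:
        with conn:  # commits on success, rolls back if an exception escapes
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE name = ?",
                (amount, target),
            )
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE name = ?",
                (amount, source),
            )
        return True
    except sqlite3.IntegrityError:
        return False  # the CHECK constraint rejected an overdraft

def balances(conn):
    """Return the current balances as a {name: balance} dictionary."""
    return dict(conn.execute("SELECT name, balance FROM accounts"))

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE accounts (name TEXT PRIMARY KEY, "
    "balance INTEGER CHECK (balance >= 0))"
)
conn.executemany(
    "INSERT INTO accounts VALUES (?, ?)", [("alice", 100), ("bob", 20)]
)
conn.commit()

print(transfer(conn, "alice", "bob", 30), balances(conn))
print(transfer(conn, "alice", "bob", 500), balances(conn))
```

In the second call the credit to the target succeeds first, but the failed debit rolls the whole
transaction back, so neither balance changes.
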

Data classification in data science is the process of tagging and organizing data into categories
so that it can be better understood and analyzed, and so that it is easy to retrieve, sort and
store for future use. A well-planned data classification system makes essential data easy to find
and retrieve.

Collaborative filtering recommends items to a user based on the past preferences of other users
with similar interests. The similarity between two users is calculated from each user's past scores
on items. Most recommendation systems use collaborative filtering to find similar patterns or
information among users; the technique can filter out items that a user is likely to like on the
basis of ratings or reactions by similar users. It applies to movies, news, applications, and many
other items.

Examples of collaborative filtering applications: Today, collaborative filtering is in widespread
use across different industries, and the two most famous examples of CF applications are Amazon and
Netflix. Amazon uses CF to match products to customers based on their past purchases.
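
A small user-based example may make the idea concrete. The sketch below is one possible
formulation, not how any particular company implements it: it assumes positive ratings, uses cosine
similarity over co-rated items, and the ratings data and function names are invented for
illustration.

```python
import math

def user_similarity(ratings_a, ratings_b):
    """Cosine similarity over the items both users have rated."""
    shared = set(ratings_a) & set(ratings_b)
    if not shared:
        return 0.0
    dot = sum(ratings_a[i] * ratings_b[i] for i in shared)
    norm_a = math.sqrt(sum(ratings_a[i] ** 2 for i in shared))
    norm_b = math.sqrt(sum(ratings_b[i] ** 2 for i in shared))
    return dot / (norm_a * norm_b)

def recommend(ratings, user, top_n=1):
    """Rank items the user has not rated by similarity-weighted ratings."""
    scores = {}
    for other, other_ratings in ratings.items():
        if other == user:
            continue
        weight = user_similarity(ratings[user], other_ratings)
        for item, rating in other_ratings.items():
            if item not in ratings[user]:
                scores[item] = scores.get(item, 0.0) + weight * rating
    return sorted(scores, key=scores.get, reverse=True)[:top_n]

# Hypothetical ratings on a 1-5 scale.
ratings = {
    "ann": {"Alien": 5, "Heat": 1},
    "ben": {"Alien": 5, "Heat": 1, "Blade Runner": 5},
    "cat": {"Alien": 1, "Heat": 5, "Notting Hill": 5},
}
print(recommend(ratings, "ann"))  # ['Blade Runner']
```

Ann's ratings match Ben's far more closely than Cat's, so the item Ben liked is ranked first.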

What is a Bloom Filter? A Bloom filter is a space-efficient probabilistic data structure that is
used to test whether an element is a member of a set. For example, checking the availability of a
username is a set membership problem, where the set is the list of all registered usernames. A
Bloom filter is based on hashing. It is extremely space efficient and is typically used to add
elements to a set and test whether an element is in the set. The elements themselves are not
stored, however; instead, hashes of each element determine which bits are set in the filter. As a
result, a Bloom filter can report false positives but never false negatives.
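
A compact version of the structure might look like the sketch below. The class name, the default
sizes, and the choice to derive several hash functions by salting SHA-256 are assumptions made for
illustration; production filters choose the bit-array size and hash count from the expected number
of elements and the acceptable false-positive rate.

```python
import hashlib

class BloomFilter:
    def __init__(self, size=1024, num_hashes=3):
        self.size = size
        self.num_hashes = num_hashes
        self.bits = [False] * size  # only bits are stored, never the items

    def _positions(self, item):
        """Derive num_hashes bit positions by salting one hash function."""
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode()).hexdigest()
            yield int(digest, 16) % self.size

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = True

    def might_contain(self, item):
        """False means definitely absent; True means possibly present."""
        return all(self.bits[pos] for pos in self._positions(item))

# Hypothetical registered usernames.
taken = BloomFilter()
for name in ["alice", "bob"]:
    taken.add(name)
print(taken.might_contain("alice"))  # True
```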

The PageRank algorithm measures the importance of each node within a graph, based on the number of
incoming relationships and the importance of the corresponding source nodes. The underlying
assumption, roughly speaking, is that a page is only as important as the pages that link to it. The
PageRank algorithm begins by converting every URL from the database into a number. The next phase
is to save each hyperlink in a database, using the integer IDs to identify the web pages. The
iteration is initiated after sorting the link structure by parent ID and removing dangling links.
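
The iteration itself can be sketched as a simple power iteration. This is an illustration under
stated assumptions: the damping factor of 0.85, the iteration count, and the four-page graph are
choices made for the example, every link target is assumed to appear as a key in `links`, and a
dangling page has its rank spread evenly rather than being removed as the text describes.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Power iteration over a dict mapping each page to the pages it links to."""
    pages = list(links)
    n = len(pages)
    ranks = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        new_ranks = {page: (1.0 - damping) / n for page in pages}
        for page, targets in links.items():
            if targets:
                # Each outgoing link passes on an equal share of the rank.
                share = damping * ranks[page] / len(targets)
                for target in targets:
                    new_ranks[target] += share
            else:
                # Dangling page: spread its rank evenly over all pages.
                for target in pages:
                    new_ranks[target] += damping * ranks[page] / n
        ranks = new_ranks
    return ranks

# Hypothetical link graph: page C receives the most incoming links.
links = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
ranks = pagerank(links)
print(sorted(ranks, key=ranks.get, reverse=True))
```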

Social media mining is the process of obtaining big data from user-generated content on social
media sites and mobile apps in order to extract actionable patterns, form conclusions about users,
and act upon the information, often for the purpose of advertising to users or conducting research.
Social media data mining is used to uncover hidden patterns and trends from social media platforms
like Twitter, LinkedIn, Facebook, and others. This is typically done through machine learning,
mathematics, and statistical techniques. Social media mining faces grand challenges such as the big
data paradox, obtaining sufficient samples, the noise removal fallacy, and the evaluation dilemma.
Social media mining represents the virtual world of social media in a computable way, measures it,
and designs models that can help us understand its interactions.

Bar Graph. A bar graph is a pictorial representation of grouped data using horizontal or vertical
bars, where the length of each bar represents the value of the data shown on the axis. Bar graphs
are usually used to display information belonging to 'categorical data', i.e. data that fit into
some category. A bar chart is used to compare frequencies, sums, and other values across the
different categories of data. A bar chart that uses vertical bars is also called a column chart.

What is linear regression? Linear regression analysis is used to predict the
value of a variable based on the value of another variable. The variable you
want to predict is called the dependent variable. The variable you are using to
predict the other variable's value is called the independent variable. Linear
regression is commonly used for predictive analysis and modeling. For example, it can be used to
quantify the relative impacts of age, gender, and diet (the predictor variables) on height (the
outcome variable).
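
For a single predictor, the fitted line can be computed with the ordinary least squares formulas.
The sketch below is a minimal illustration; the function names and the data points are assumptions
made for the example.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

def predict(x, slope, intercept):
    """Predict the dependent variable for a new value of the independent one."""
    return slope * x + intercept

# Hypothetical data lying exactly on y = 2x + 1.
slope, intercept = fit_line([1, 2, 3, 4], [3, 5, 7, 9])
print(slope, intercept)              # 2.0 1.0
print(predict(5, slope, intercept))  # 11.0
```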

Clustering is a popular unsupervised method and an essential tool for big data analysis. Clustering
can be used either as a pre-processing step to reduce data dimensionality before running the
learning algorithm, or as a statistical tool to discover useful patterns within a dataset.
Clustering is used to identify groups of similar objects in datasets with two or more variable
quantities. In practice, this data may be collected from marketing, biomedical, or geospatial
databases, among many other places.
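
The notes do not name a specific algorithm, so the sketch below uses k-means (Lloyd's algorithm) as
one common choice. The function name, the starting centroids, and the points are assumptions for
illustration; starting centroids are passed in explicitly to keep the run deterministic.

```python
import math

def kmeans(points, centroids, iterations=10):
    """Assign points to the nearest centroid, then move each centroid to its mean."""
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for point in points:
            nearest = min(
                range(len(centroids)),
                key=lambda i: math.dist(point, centroids[i]),
            )
            clusters[nearest].append(point)
        centroids = [
            tuple(sum(coord) / len(cluster) for coord in zip(*cluster))
            if cluster else centroid  # keep the centroid of an empty cluster
            for cluster, centroid in zip(clusters, centroids)
        ]
    return centroids, clusters

# Hypothetical 2-D points forming two visible groups.
points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
centroids, clusters = kmeans(points, centroids=[(0, 0), (10, 10)])
print(centroids)
print(clusters)
```
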

A confusion matrix is a table that is often used to describe the performance of a classification
model (or "classifier") on a set of test data for which the true values are known. The confusion
matrix itself is relatively simple to understand, but the related terminology can be confusing.
Why use a confusion matrix? A confusion matrix presents a table layout of the different outcomes of
the predictions and results of a classification problem and helps visualize its outcomes. It
tabulates all the predicted and actual values of a classifier.
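
For a binary classifier the table has four cells, which can be counted directly. The sketch below
is illustrative only; the function name, the label encoding (1 for the positive class), and the
example labels are assumptions.

```python
from collections import Counter

def confusion_matrix(actual, predicted, positive=1):
    """Count true/false positives and negatives for a binary classifier."""
    counts = Counter()
    for a, p in zip(actual, predicted):
        if p == positive:
            counts["TP" if a == positive else "FP"] += 1
        else:
            counts["FN" if a == positive else "TN"] += 1
    return {cell: counts[cell] for cell in ("TP", "FP", "FN", "TN")}

# Hypothetical true labels and model predictions.
actual    = [1, 0, 1, 1, 0, 0, 1, 0]
predicted = [1, 0, 0, 1, 0, 1, 1, 0]
matrix = confusion_matrix(actual, predicted)
accuracy = (matrix["TP"] + matrix["TN"]) / len(actual)
print(matrix)    # {'TP': 3, 'FP': 1, 'FN': 1, 'TN': 3}
print(accuracy)  # 0.75
```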
