Understanding Data Science Fundamentals

Data Science Notes

Uploaded by

Sandeep Mazumdar
Copyright
© All Rights Reserved

Data science is the study of data to extract meaningful insights for business. It is a
multidisciplinary approach that combines principles and practices from the fields of mathematics,
statistics, artificial intelligence, and computer engineering to analyze large amounts of data.

What is data science, with examples? Data science is the domain of study that deals with vast
volumes of data, using modern tools and techniques to find unseen patterns, derive meaningful
information, and make business decisions. For example, finance companies can use a customer's
banking and bill-paying history to assess creditworthiness and loan risk.

What is data science useful for? Data science can identify patterns in seemingly unstructured or
unrelated data, which makes inferences and predictions possible. Tech companies that collect user
data can use these techniques to turn what they collect into sources of useful or profitable
information.

Data Science Applications

• Fraud and Risk Detection.

• Healthcare.

• Internet Search.

• Targeted Advertising.

• Website Recommendations.

• Advanced Image Recognition.

• Speech Recognition.

• Airline Route Planning.


What is categorical data? Categorical data is a type of data used to group information with
similar characteristics, while numerical data expresses information in the form of numbers.
Categorical data classifies an observation as belonging to one or more categories. For example, an
item might be judged as good or bad, or a response to a survey might include categories such as
agree, disagree, or no opinion.

Quantitative data is data that can be counted or measured in numerical values. The two main types
of quantitative data are discrete data and continuous data. Height in feet, age in years, and
weight in pounds are examples of quantitative data. Qualitative data is descriptive data that is
not expressed numerically.

Example: quantitative information involves a measurable quantity, so numbers are used. Some
examples are length, mass, temperature, and time. Quantitative information is often simply called
data, although data can also be things other than numbers.

A quantitative variable reflects a notion of magnitude: the values it can take are numbers. It
therefore represents a measure and is numerical. Quantitative variables are divided into two
types, discrete and continuous. Quantitative data are data about numeric variables (e.g. how many,
how much, or how often). Qualitative data are measures of 'types' and may be represented by a
name, symbol, or number code; they are data about categorical variables (e.g. what type).

Quantitative variables: variables whose values result from counting or measuring something.
Examples: height, weight, time in the 100-yard dash, number of items sold to a shopper.
Qualitative variables: variables that are not measurement variables.

What measures the spread of a quantitative variable? The variance and the standard deviation
measure the spread of the data around the mean. They summarise how close each observed data value
is to the mean value. The standard deviation, the most common measure of variation or spread, is a
number that measures how far data values are from their mean.
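
The two measures can be computed with Python's standard `statistics` module. This is a minimal sketch; the sample values are made up, and the population versions of the formulas (`pvariance`, `pstdev`) are used as an assumption about what the notes intend.

```python
import statistics

heights_cm = [150, 160, 170, 180, 190]  # made-up sample values

mean = statistics.mean(heights_cm)
# Population variance: the average squared distance from the mean.
variance = statistics.pvariance(heights_cm)
# Standard deviation: square root of the variance, in the data's own units.
std_dev = statistics.pstdev(heights_cm)

print(mean, variance, round(std_dev, 2))
```

If the values are a sample rather than the whole population, `statistics.variance` and `statistics.stdev` divide by n - 1 instead of n.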

What is a sampling distribution? A sampling distribution is the probability distribution of a
statistic obtained through repeated sampling of a specific population. It describes the range of
possible outcomes for a statistic, such as the mean or mode of some variable. It is created by
drawing many random samples of a given size from the same population, and it helps you understand
how a sample statistic varies from sample to sample. The sampling distribution of a proportion is
what you get when you repeat your survey or poll for all possible samples of the population. For
example, instead of polling 1000 cat owners once about which cat food their pet prefers, you could
repeat the poll many times and look at how the result varies.
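
Repeated sampling is easy to simulate. The sketch below assumes a made-up population of 10,000 uniformly distributed values and approximates the sampling distribution of the mean with 1,000 random samples of size 50; all of these numbers are illustrative choices, not values from the notes.

```python
import random
import statistics

random.seed(0)  # fixed seed so the sketch is repeatable

# Hypothetical population: 10,000 values drawn uniformly from 0 to 100.
population = [random.uniform(0, 100) for _ in range(10_000)]

def sample_mean(size):
    """Draw one random sample without replacement and return its mean."""
    return statistics.mean(random.sample(population, size))

# Repeat the "poll" 1,000 times with samples of 50.
sample_means = [sample_mean(50) for _ in range(1_000)]

population_mean = statistics.mean(population)
mean_of_means = statistics.mean(sample_means)
spread_of_means = statistics.stdev(sample_means)

print(population_mean, mean_of_means, spread_of_means)
```

The sample means cluster around the population mean, and their spread is much smaller than the spread of the individual values.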

Data preprocessing transforms the data into a format that is more easily and effectively
processed in data mining, machine learning and other data science tasks. The techniques are
generally used at the earliest stages of the machine learning and AI development pipeline to
ensure accurate results. Data preparation and filtering steps can take a considerable amount of
processing time. Examples of data preprocessing include cleaning, instance selection,
normalization, one-hot encoding, transformation, and feature extraction and selection. The product
of data preprocessing is the final training set.

Data cleaning is the process of fixing or removing incorrect, corrupted, incorrectly formatted,
duplicate, or incomplete data within a dataset. When combining multiple data sources, there are
many opportunities for data to be duplicated or mislabeled. Cleaning organizes and corrects
inaccurate, poorly formatted, or otherwise messy data. For example, if you conduct a survey and
ask people for their phone numbers, people may enter their numbers in different formats.
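
A few of the preprocessing steps named above can be sketched in plain Python. The survey rows, field layout, and the helper `clean_phone` are hypothetical, and min-max scaling is assumed as the normalization method.

```python
import re

# Hypothetical raw survey rows: (name, phone, age, plan).
raw_rows = [
    ("Ana", "(555) 123-4567", 34, "basic"),
    ("Ben", "555.987.6543", 51, "premium"),
    ("Ana", "555-123-4567", 34, "basic"),  # same person, different phone format
    ("Cy", "", 27, "basic"),               # incomplete: missing phone
]

def clean_phone(phone):
    """Keep digits only, so differently formatted numbers compare equal."""
    return re.sub(r"\D", "", phone)

# Cleaning: normalize phones, drop incomplete rows, drop duplicates.
seen = set()
cleaned = []
for name, phone, age, plan in raw_rows:
    row = (name, clean_phone(phone), age, plan)
    if not row[1] or row in seen:
        continue
    seen.add(row)
    cleaned.append(row)

# Normalization: rescale ages to the range 0..1 (min-max scaling).
ages = [row[2] for row in cleaned]
low, high = min(ages), max(ages)
scaled_ages = [(age - low) / (high - low) for age in ages]

# One-hot encoding: one 0/1 column per category value.
plans = sorted({row[3] for row in cleaned})
one_hot = [[int(row[3] == plan) for plan in plans] for row in cleaned]

print(cleaned, scaled_ages, one_hot)
```

In practice a library such as pandas or scikit-learn would usually handle these steps; the sketch only shows what each step does to the data.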

Box plot. A box and whisker plot, also called a box plot, displays the five-number summary of a
set of data: the minimum, first quartile, median, third quartile, and maximum. In a box plot, we
draw a box from the first quartile to the third quartile, and a vertical line goes through the box
at the median. Example: a box plot depicts a group of numerical data through its quartiles. It is a
simple way to visualize the shape of the data, and it makes comparing the characteristics of data
between categories very easy. First quartile (Q1): 25% of the data lies below the first (lower)
quartile.
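
The five-number summary behind a box plot can be computed with the standard library. This is a small sketch on made-up data; note that several quartile conventions exist, and the "inclusive" method of `statistics.quantiles` is just one of them.

```python
import statistics

values = [1, 2, 3, 4, 5, 6, 7, 8, 9]  # made-up data

# With n=4, statistics.quantiles returns the three quartile cut points.
q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
five_number_summary = (min(values), q1, median, q3, max(values))

print(five_number_summary)  # (1, 3.0, 5.0, 7.0, 9)
```

The box would span from 3 to 7 with the median line at 5, and the whiskers would reach out to 1 and 9.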

What is HDFS? HDFS (Hadoop Distributed File System) is a distributed file system that handles
large data sets running on commodity hardware. It is used to scale a single Apache Hadoop cluster
to hundreds (and even thousands) of nodes. HDFS is one of the major components of Apache Hadoop,
the others being MapReduce and YARN.

What are the characteristics of HDFS? HDFS employs a NameNode and DataNode architecture to
implement a distributed file system that provides high-performance access to data across highly
scalable Hadoop clusters.

Features of HDFS

• Data replication.

• Fault tolerance and reliability.

• High availability.

• Scalability.

• High throughput.

• Data locality.
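
The NameNode and DataNode split, together with replication, can be illustrated with a toy model. This is not the HDFS API: the function names and node names are hypothetical, and the round-robin placement is a simplifying assumption (real HDFS placement is rack-aware). The 128 MB block size and replication factor of 3 are the commonly cited defaults.

```python
MB = 1024 * 1024

def place_blocks(file_size, block_size, data_nodes, replication=3):
    """Toy NameNode metadata: block index -> DataNodes that hold a replica."""
    if replication > len(data_nodes):
        raise ValueError("need at least as many DataNodes as replicas")
    n_blocks = -(-file_size // block_size)  # ceiling division
    placement = {}
    for block in range(n_blocks):
        # Round-robin placement is an assumption made for this sketch.
        placement[block] = [
            data_nodes[(block + k) % len(data_nodes)] for k in range(replication)
        ]
    return placement

def readable_without(placement, failed_node):
    """True if every block still has a replica when one DataNode is down."""
    return all(
        any(node != failed_node for node in nodes) for nodes in placement.values()
    )

nodes = ["dn1", "dn2", "dn3", "dn4"]  # hypothetical DataNode names
placement = place_blocks(300 * MB, 128 * MB, nodes)

print(placement)
print(readable_without(placement, "dn2"))
```

A 300 MB file is split into three blocks, and because each block lives on three different nodes, losing any single node leaves the file readable, which is the idea behind the fault tolerance listed above.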


The file allocation table, or FAT, is part of an older type of file system storage which is still
supported by most operating systems. The table is stored near the start of the volume, after the
boot sector (sector 0) and any reserved sectors.

Advantages

• Uses the whole disk block for data.

• A bad disk block does not cause all successive blocks to be lost.

• Random access is provided, although it is not very fast.

• Only FAT needs to be traversed in each file operation.
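
The last two points follow from how a file's blocks are found: the directory stores only the first block, and the table stores the link to each next block. The sketch below is a toy model; the table contents, file names, and the `END_OF_CHAIN` marker are hypothetical (real FAT variants use reserved values for end-of-chain).

```python
END_OF_CHAIN = -1  # hypothetical marker for "no next block"

# Toy table: key = block number, value = next block of the same file.
fat = {2: 5, 5: 3, 3: END_OF_CHAIN, 4: 7, 7: END_OF_CHAIN}

# Directory entries store only each file's first block.
directory = {"notes.txt": 2, "photo.jpg": 4}

def blocks_of(filename):
    """Follow the chain in the table to list a file's blocks in order."""
    chain = []
    block = directory[filename]
    while block != END_OF_CHAIN:
        chain.append(block)
        block = fat[block]
    return chain

print(blocks_of("notes.txt"))  # [2, 5, 3]
```

Reaching the n-th block of a file means walking n links, but only in the table, never through the data blocks themselves.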


Shuffling in MapReduce. The process of transferring data from the mappers to the reducers is
known as shuffling: the system sorts the map output and transfers it to the reducers as input.
Sorting is one of the basic MapReduce algorithms used to process and analyze data. MapReduce
applies a sorting algorithm to automatically sort the output key-value pairs from the mapper by
their keys. Sorting methods are implemented in the mapper class itself.
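
The map, shuffle and sort, and reduce phases can be imitated in a single Python process. This is only a sketch of the data flow using a word count; the function names are illustrative and nothing here is the Hadoop API, where these phases run distributed across a cluster.

```python
from collections import defaultdict

documents = ["big data big ideas", "data beats ideas"]  # made-up input

def mapper(text):
    """Map: emit a (word, 1) pair for every word."""
    return [(word, 1) for word in text.split()]

def shuffle_and_sort(mapped_pairs):
    """Shuffle: group values by key. Sort: order the groups by key."""
    groups = defaultdict(list)
    for key, value in mapped_pairs:
        groups[key].append(value)
    return sorted(groups.items())

def reducer(key, values):
    """Reduce: combine all values that share a key."""
    return key, sum(values)

mapped = [pair for doc in documents for pair in mapper(doc)]
shuffled = shuffle_and_sort(mapped)
word_counts = [reducer(key, values) for key, values in shuffled]

print(word_counts)
```

Each reducer call sees one key together with every value emitted for it, which is exactly what the shuffle step guarantees.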

Spark architecture consists of four components: the Spark driver, executors, the cluster manager,
and worker nodes. It uses Datasets and DataFrames as the fundamental data storage mechanism to
optimise the Spark process and big data computation.

NoSQL databases store data in formats other than relational tables, such as documents.
Accordingly, they are described as “not only SQL” and are subdivided by a variety of flexible data
models. Types of NoSQL databases include pure document databases, key-value stores, wide-column
databases, and graph databases. For example, document databases like MongoDB are general-purpose
databases, while key-value databases are ideal for large volumes of data with simple lookup
queries. NoSQL can be faster than a relational database management system for some workloads
because it uses different data structures; Cassandra's data structure, for example, can be faster
than a relational one for such workloads. NoSQL databases are mainly used in big data and
real-time web applications.

Big data refers to data that is so large, fast or complex that it is difficult or impossible to
process using traditional methods. The act of accessing and storing large amounts of information
for analytics has been around for a long time. Big data also encompasses a wide variety of data
types, including structured data, such as transactions and financial records; unstructured data,
such as text, documents and multimedia files; and semistructured data, such as web server logs and
streaming data from sensors. Big data is a collection of data from many different sources and is
often described by five characteristics: volume, value, variety, velocity, and veracity.

The MapReduce paradigm was created in 2003 to enable processing of large data sets in a massively
parallel manner. The goal of the MapReduce model is to simplify the approach to transformation and
analysis of large datasets, as well as to allow developers to focus on algorithms instead of data
management. MapReduce is a programming model used to perform distributed processing in parallel in
a Hadoop cluster, which is what makes Hadoop so fast. When you are dealing with big data, serial
processing is no longer practical.

Similarity. In data science, a similarity measure is a way of measuring how closely data samples
are related to each other. Common measures are:

• 1) Cosine similarity: Cosine similarity measures the similarity between two vectors of an inner
product space. It is the cosine of the angle between the two vectors and indicates whether they
point in roughly the same direction. It is often used to measure document similarity in text
analysis.

• 2) Manhattan distance: The Manhattan distance, often called Taxicab distance or City Block
distance, calculates the distance between real-valued vectors. Imagine vectors that describe
objects on a uniform grid such as a chessboard. Manhattan distance is then the distance between
two points when movement is only allowed at right angles, along the grid.

• 3) Euclidean distance: Euclidean distance calculates the straight-line distance between two
real-valued vectors. You are most likely to use it when calculating the distance between two rows
of data that have numerical values, such as floating point or integer values.

• 4) Minkowski distance: This is a generalization of the Euclidean and Manhattan distance
measures. It adds a parameter, called the “order” or “p”, that allows different distance measures
to be calculated:
MinkowskiDistance = (sum for i = 1 to N of abs(v1[i] - v2[i])^p)^(1/p)
With p = 1 this is the Manhattan distance, and with p = 2 it is the Euclidean distance.

• 5) Jaccard similarity: The Jaccard similarity measures the similarity between two sets of data
to see which members are shared and which are distinct. It is calculated by dividing the number of
observations in both sets by the number of observations in either set. In other words, the Jaccard
similarity is the size of the intersection divided by the size of the union of the two sets.
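
All five measures fit in a few lines of Python. This is a minimal sketch using only the standard library; the function names are illustrative, vectors are plain lists of equal length, and the cosine function assumes neither vector is all zeros.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between a and b (both must be non-zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

def minkowski_distance(a, b, p):
    """(sum of |a[i] - b[i]|^p)^(1/p)."""
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1 / p)

def manhattan_distance(a, b):
    return minkowski_distance(a, b, 1)  # Minkowski with p = 1

def euclidean_distance(a, b):
    return minkowski_distance(a, b, 2)  # Minkowski with p = 2

def jaccard_similarity(s, t):
    """Size of the intersection divided by the size of the union."""
    s, t = set(s), set(t)
    return len(s & t) / len(s | t)

v1, v2 = [0.0, 0.0], [3.0, 4.0]
print(manhattan_distance(v1, v2))   # 7.0
print(euclidean_distance(v1, v2))   # 5.0
print(jaccard_similarity({"a", "b", "c"}, {"b", "c", "d"}))  # 0.5
```

Defining Manhattan and Euclidean distance through the Minkowski function makes the "generalization" in the text concrete: only the order p changes.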

Streaming data is data that is generated continuously by thousands of data sources, which
typically send in the data records simultaneously and in small sizes (on the order of kilobytes).
Examples include location data, stock prices, IT system monitoring, fraud detection, retail
inventory, sales, customer activity, and more. Companies use these kinds of data to power their
business activity.

Data Stream Types

• Sensor readings from machines.

• e-Commerce purchase data.

• Stock exchange data to predict the stock price.

• Credit card transactions for fraud detection.

• Social media sentiment analysis.

Transaction processing is a style of computing, typically performed by large server computers,
that supports interactive applications. In transaction processing, work is divided into
individual, indivisible operations, called transactions. As the term suggests, transactional data
is data related to the transactions of the organization. For example, when a product is purchased
or sold, the event is captured at the moment it happens; that record is the transactional data for
that product.
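
What "indivisible" means can be shown with Python's built-in `sqlite3` module. The sketch below assumes a hypothetical two-table schema (`stock` and `sales`) and a helper `sell`: recording a sale and reducing the stock either both happen or neither does.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE stock (product TEXT PRIMARY KEY, "
    "quantity INTEGER CHECK (quantity >= 0))"
)
conn.execute("CREATE TABLE sales (product TEXT, quantity INTEGER)")
conn.execute("INSERT INTO stock VALUES ('widget', 5)")
conn.commit()

def sell(product, quantity):
    """Record a sale and reduce stock as one transaction."""
    try:
        with conn:  # commits on success, rolls back if an exception is raised
            conn.execute("INSERT INTO sales VALUES (?, ?)", (product, quantity))
            conn.execute(
                "UPDATE stock SET quantity = quantity - ? WHERE product = ?",
                (quantity, product),
            )
        return True
    except sqlite3.IntegrityError:
        return False  # stock would go negative, so nothing was recorded

first = sell("widget", 3)    # succeeds: stock goes from 5 to 2
second = sell("widget", 10)  # fails the CHECK, so the sale row is rolled back

print(first, second)
```

After the failed call the `sales` table still holds only the first sale, because the rollback also undid the insert that preceded the failing update.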

Data classification in data science is the process of tagging and categorizing any kind of data so
that it can be better understood and analyzed. It organizes data into categories that make it easy
to retrieve, sort and store for future use. A well-planned data classification system makes
essential data easy to find and retrieve.

Collaborative filtering recommends items to a user based on the past preferences of other users
with similar interests. The similarity between two users is calculated from the scores each has
given to the same items. Collaborative filtering is used by most recommendation systems to find
similar patterns among users; the technique can pick out items a user is likely to enjoy, such as
movies, news, or applications, on the basis of the ratings or reactions of similar users.

Examples of collaborative filtering applications

Today, collaborative filtering is in widespread use across different industries, and the two most
famous examples of CF applications are Amazon and Netflix. Amazon uses CF to match products to
customers based on their past purchases.
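
A user-based version of this idea can be sketched in a few lines. The ratings are made up, and two simplifications are assumed: similarity is the cosine of the scores on items both users rated, and unseen items are ranked by similarity-weighted scores. Production recommenders are considerably more elaborate.

```python
import math

# Hypothetical ratings: user -> {item: score from 1 to 5}.
ratings = {
    "ann": {"Alien": 5, "Brazil": 4, "Casablanca": 1},
    "bob": {"Alien": 5, "Brazil": 5, "Dune": 4},
    "cat": {"Alien": 1, "Casablanca": 5, "Emma": 5},
}

def similarity(u, v):
    """Cosine similarity over the items both users have rated."""
    shared = ratings[u].keys() & ratings[v].keys()
    if not shared:
        return 0.0
    dot = sum(ratings[u][i] * ratings[v][i] for i in shared)
    norm_u = math.sqrt(sum(ratings[u][i] ** 2 for i in shared))
    norm_v = math.sqrt(sum(ratings[v][i] ** 2 for i in shared))
    return dot / (norm_u * norm_v)

def recommend(user):
    """Rank items the user has not rated by similarity-weighted scores."""
    scores = {}
    for other in ratings:
        if other == user:
            continue
        weight = similarity(user, other)
        for item, score in ratings[other].items():
            if item not in ratings[user]:
                scores[item] = scores.get(item, 0.0) + weight * score
    return sorted(scores, key=scores.get, reverse=True)

print(recommend("ann"))  # ['Dune', 'Emma']
```

Ann's scores resemble Bob's far more than Cat's, so the item Bob liked is ranked first.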

What is a Bloom filter? A Bloom filter is a space-efficient probabilistic data structure that is
used to test whether an element is a member of a set. For example, checking the availability of a
username is a set membership problem, where the set is the list of all registered usernames.

Example: a Bloom filter is a probabilistic data structure based on hashing. It is extremely space
efficient and is typically used to add elements to a set and test whether an element is in the
set. The elements themselves are not stored, however. Instead, each element is hashed and the bits
at the resulting positions are set. As a result, a lookup can return a false positive but never a
false negative.
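
A minimal Bloom filter needs only a bit array and a few hash functions. In this sketch the class name, the default sizes, and the trick of deriving several hashes by prefixing a seed to a SHA-256 input are all illustrative assumptions; real implementations usually pick faster hashes and size the array from the expected number of items.

```python
import hashlib

class BloomFilter:
    def __init__(self, size=1024, num_hashes=3):
        self.size = size
        self.num_hashes = num_hashes
        self.bits = [False] * size

    def _positions(self, item):
        """Derive num_hashes bit positions for an item."""
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode()).hexdigest()
            yield int(digest, 16) % self.size

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = True

    def might_contain(self, item):
        """False means definitely absent; True means probably present."""
        return all(self.bits[pos] for pos in self._positions(item))

taken = BloomFilter()
for username in ["alice", "bob"]:
    taken.add(username)

print(taken.might_contain("alice"))  # True
```

A `True` answer for a name that was never added is possible when other names happen to set the same bits, which is why the method is called `might_contain`.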

The PageRank algorithm measures the importance of each node within a graph, based on the number
of incoming relationships and the importance of the corresponding source nodes. The underlying
assumption, roughly speaking, is that a page is only as important as the pages that link to it.
The PageRank algorithm begins by converting every URL in the database into a number. The next
phase is to save each hyperlink in a database, using the integer IDs to identify the Web pages.
The iteration is initiated after sorting the link structure by the parent ID and removing dangling
links.
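
The iteration itself can be sketched on a tiny graph. Several things here are assumptions rather than part of the notes: the damping factor of 0.85 is the conventionally used value, the link graph is made up, and pages without outgoing links have their rank spread evenly over all pages instead of being removed beforehand.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Iteratively compute a rank for every page in a link graph."""
    pages = sorted(set(links) | {t for targets in links.values() for t in targets})
    rank = {page: 1 / len(pages) for page in pages}
    for _ in range(iterations):
        new_rank = {page: (1 - damping) / len(pages) for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page passes its rank on in equal shares to the pages it links to.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # Dangling page: spread its rank evenly over all pages.
                for target in pages:
                    new_rank[target] += damping * rank[page] / len(pages)
        rank = new_rank
    return rank

# Hypothetical link graph: page -> pages it links to.
links = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
ranks = pagerank(links)

print(max(ranks, key=ranks.get))  # c
```

Page c ends up on top because three pages link to it, and page a comes second because it is linked from the highly ranked c, which illustrates the "only as important as the pages that link to it" idea.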

Social media mining is the process of obtaining big data from user-generated content on social
media sites and mobile apps in order to extract actionable patterns, form conclusions about users,
and act upon the information, often for the purpose of advertising to users or conducting research.
Social media data mining is used to uncover hidden patterns and trends from social media
platforms like Twitter, LinkedIn, Facebook, and others. This is typically done through machine
learning, mathematics, and statistical techniques.

Social media mining faces grand challenges such as the big data paradox, obtaining sufficient
samples, the noise removal fallacy, and the evaluation dilemma. It represents the virtual world of
social media in a computable way, measures it, and designs models that can help us understand its
interactions.

Bar graph. A bar graph is a pictorial representation of grouped data using horizontal or vertical
bars, where the length of each bar represents the value shown on the axis. Bar graphs are usually
used to display categorical data, i.e. data that fits into categories. A bar chart is used to
compare frequencies, sums, and other values across the different categories of data. Bar charts
with vertical bars are also called column charts.
What is linear regression? Linear regression analysis is used to predict the
value of a variable based on the value of another variable. The variable you
want to predict is called the dependent variable. The variable you are using to
predict the other variable's value is called the independent variable. Linear
regression is commonly used for predictive analysis and modeling. For example, it can be used to
quantify the relative impacts of age, gender, and diet (the predictor variables) on height (the
outcome variable).
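
For a single predictor, the best-fitting line has a closed-form least-squares solution. The sketch below assumes made-up age and height values and an illustrative function name `fit_line`; with several predictors, as in the age, gender, and diet example, a library such as scikit-learn or statsmodels would normally be used.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

ages = [2, 4, 6, 8]               # independent variable (made-up values)
heights_cm = [86, 102, 114, 130]  # dependent variable (made-up values)

slope, intercept = fit_line(ages, heights_cm)
predicted_at_10 = intercept + slope * 10

print(slope, intercept, predicted_at_10)
```

Here the slope says that, in this toy data, each extra year of age goes with about 7.2 cm of extra height.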

Clustering is a popular unsupervised method and an essential tool for big data analysis.
Clustering can be used either as a pre-processing step to reduce data dimensionality before
running the learning algorithm, or as a statistical tool to discover useful patterns within a
dataset. Clustering is used to identify groups of similar objects in datasets with two or more
variable quantities. In practice, this data may be collected from marketing, biomedical, or
geospatial databases, among many other places.
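
The notes do not name a specific algorithm, so the sketch below uses k-means, one common clustering method, as an assumed example. The points and the starting centroids are made up, and the number of iterations is fixed for simplicity.

```python
import math

def k_means(points, centroids, iterations=10):
    """Alternate between assigning points and moving centroids."""
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(
                range(len(centroids)), key=lambda i: math.dist(p, centroids[i])
            )
            clusters[nearest].append(p)
        # Update step: each centroid moves to the mean of its cluster.
        centroids = [
            tuple(sum(c) / len(cluster) for c in zip(*cluster)) if cluster else old
            for cluster, old in zip(clusters, centroids)
        ]
    return centroids, clusters

points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]  # made-up data
centroids, clusters = k_means(points, centroids=[(0, 0), (10, 10)])

print(clusters)
```

No labels are supplied anywhere, which is what makes the method unsupervised: the two groups emerge from the distances alone.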

A confusion matrix is a table that is often used to describe the performance of a classification
model (or "classifier") on a set of test data for which the true values are known. The confusion
matrix itself is relatively simple to understand, but the related terminology can be confusing.

Why use a confusion matrix? A confusion matrix presents a table layout of the different outcomes
of the predictions of a classification problem and helps visualize them. It tabulates all the
predicted and actual values of a classifier.
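
Building the table only requires counting (actual, predicted) pairs. The labels and predictions below are made up; rows are actual classes and columns are predicted classes, which is one common layout but not the only one.

```python
from collections import Counter

actual    = ["spam", "spam", "ham", "ham",  "spam", "ham"]
predicted = ["spam", "ham",  "ham", "spam", "spam", "ham"]

labels = ["spam", "ham"]
counts = Counter(zip(actual, predicted))

# matrix[i][j] = number of items of actual class i predicted as class j.
matrix = [[counts[(a, p)] for p in labels] for a in labels]

# Accuracy is the share of items on the diagonal (correct predictions).
correct = sum(matrix[i][i] for i in range(len(labels)))
accuracy = correct / len(actual)

print(matrix)    # [[2, 1], [1, 2]]
print(accuracy)
```

The off-diagonal cells show the two kinds of mistake separately: one spam message was let through as ham, and one ham message was flagged as spam.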
