Data Science Notes Mcs

Data science is a multidisciplinary field that utilizes mathematics, statistics, computer science, and domain expertise to extract insights from data for various industries including healthcare, finance, and retail. It involves different types of data (structured, semi-structured, unstructured) and methods of analysis such as descriptive, inferential, and predictive analysis. The data science life cycle consists of phases like problem definition, data collection, analysis, modeling, and deployment to create data-driven solutions.

Uploaded by

2200813526.neha

Data science is a multidisciplinary field that combines principles and practices
from mathematics, statistics, computer science, and domain expertise to
extract valuable insights from data.

It involves collecting, cleaning, analyzing, and visualizing data to identify
patterns, trends, and relationships, and to make predictions. These insights
can be used to inform business decisions, solve complex problems, and
improve operations.

Data science integrates the principles of computer science and
mathematics with domain knowledge to create mathematical models that
show relationships among data attributes. In addition, data
science uses data to perform predictive analysis.

Data science is widely used across many industries:


1.​ Healthcare: Predicting disease outbreaks, personalized treatments, and
improving patient care.
2.​ Finance: Fraud detection, managing risks, algorithmic trading, and customer
segmentation.
3.​ Retail: Managing inventory, recommendation systems, and analyzing
shopping patterns.
4.​ Manufacturing: Predicting maintenance needs, quality control, and
improving supply chains.
5.​ Transportation: Optimizing routes, forecasting demand, and supporting
self-driving vehicles.

TYPES OF DATA
1.​ Structured Data
2.​ Semi-Structured Data
3.​ Unstructured data
4.​ Data Streams

Structured Data: Structured data refers to data that is organized and designed
in a specific way to make it easily readable and understandable by both humans
and machines. This is typically achieved through the use of a well-defined
schema or data model, which provides a structure for the data.

Figure 2 shows a sample structure of data that may be stored in a
relational database system. One of the key characteristics of structured data
is that it can be associated with a schema. In addition, each schema element
may be related to a specific data type.

Customer (custID, custName, custPhone, custAddress, custCategory, custPAN, custAadhar)
Account (AccountNumber, custIDoffirstaccountholder, AccountType, AccountBalance)
JointHolders (AccountNumber, custID)
Transaction (transDate, transType, AccountNumber, Amountoftransaction)
Figure 2: A sample schema of structured data
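
As a rough sketch of how a schema ties each element to a data type, the first two relations of Figure 2 could be declared in SQLite as shown here. The column types, the key constraints, the shortened column name custID in Account (standing in for the first account holder), and the sample rows are all assumptions made for illustration; the figure lists attribute names only.

```python
import sqlite3

# Column types and constraints are assumptions; Figure 2 gives only names.
SCHEMA = """
CREATE TABLE Customer (
    custID       INTEGER PRIMARY KEY,
    custName     TEXT NOT NULL,
    custPhone    TEXT,
    custAddress  TEXT,
    custCategory TEXT,
    custPAN      TEXT,
    custAadhar   TEXT
);
CREATE TABLE Account (
    AccountNumber  INTEGER PRIMARY KEY,
    custID         INTEGER REFERENCES Customer(custID),  -- first account holder
    AccountType    TEXT,
    AccountBalance REAL
);
"""

conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)

# Made-up sample rows, only to show that the structure can be queried.
conn.execute(
    "INSERT INTO Customer (custID, custName) VALUES (?, ?)",
    (1, "Example Customer"),
)
conn.execute(
    "INSERT INTO Account VALUES (?, ?, ?, ?)",
    (1001, 1, "Savings", 2500.0),
)

# Because both tables follow a schema, they can be joined on custID.
row = conn.execute(
    "SELECT c.custName, a.AccountBalance "
    "FROM Customer c JOIN Account a ON a.custID = c.custID"
).fetchone()
print(row)
```

The JointHolders and Transaction relations could be declared in the same way.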

Semi-structured Data: As the name suggests, semi-structured data has some
structure in it. This structure comes from the use of tags
or key/value pairs. Common forms of semi-structured data include
XML, JSON objects, server
logs, EDI data, etc. An example of semi-structured data is shown in
Figure 3.
<Book>
<title>Data Science and Big Data</title>
<author>R Raman</author>
<author>C V Shekhar</author>
<yearofpublication>2020</yearofpublication>
</Book>

{
"Book": {
"Title": "Data Science",
"Price": 5000,
"Year": 2020
}
}
Figure 3: Sample semi-structured data
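
The tags and key/value pairs are what make such data machine-readable without a fixed table layout. A minimal sketch using Python's standard xml.etree.ElementTree and json modules is given here; the JSON fragment from Figure 3 is wrapped in an outer pair of braces so that it forms a complete JSON document, and the variable names are illustrative.

```python
import json
import xml.etree.ElementTree as ET

xml_text = """<Book>
<title>Data Science and Big Data</title>
<author>R Raman</author>
<author>C V Shekhar</author>
<yearofpublication>2020</yearofpublication>
</Book>"""

json_text = '{"Book": {"Title": "Data Science", "Price": 5000, "Year": 2020}}'

# XML: elements are located by tag name; a tag may repeat (two authors).
book = ET.fromstring(xml_text)
title = book.findtext("title")
authors = [a.text for a in book.findall("author")]
year = int(book.findtext("yearofpublication"))

# JSON: values are located by key; numbers keep their numeric type.
record = json.loads(json_text)["Book"]

print(title, authors, year)
print(record["Title"], record["Price"], record["Year"])
```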

Unstructured Data:
Unstructured data does not follow any schema definition. For example,
written text such as the content of this Unit is unstructured. You may add certain
headings or metadata to unstructured data.

Data Streams
A data stream is a sequence of data generated over a period of time.
Such data may be structured, semi-structured or unstructured, but it is
generated continuously. For example, IoT devices such as weather sensors
generate a data stream of pressure, temperature, wind direction, wind speed,
humidity, etc. for the place where they are installed. Such data is huge, and
many applications require it to be processed in real time. In general, not
all stream data needs to be stored; it is processed over a specific duration
of time.
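
One common way to process a stream for a limited duration without storing all of it is a sliding window. The sketch here keeps only the most recent readings and reports their running average; the function name rolling_mean, the window size, and the temperature values are hypothetical.

```python
from collections import deque


def rolling_mean(readings, window_size):
    """Yield the mean of the last `window_size` readings as each one arrives."""
    window = deque(maxlen=window_size)  # older readings are discarded automatically
    for value in readings:
        window.append(value)
        yield sum(window) / len(window)


# Made-up temperature readings standing in for a live sensor feed.
temps = [30.0, 31.0, 32.0, 35.0]
means = list(rolling_mean(temps, window_size=2))
print(means)
```

Because the deque has a fixed maximum length, memory use stays constant however long the stream runs.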

Statistical Data Types:


There are two distinct types of data that can be used in statistical analysis:
categorical data and quantitative data.

Categorical Data:
Categorical data provides descriptive information about qualitative
attributes, whereas quantitative data offers numerical values for measuring
and analyzing quantities.

Quantitative Data: Quantitative data is numeric data, which can be
used to define different scales of data. Quantitative data is of
two basic types: discrete, which represents distinct numbers like 2, 3,
5, …, and continuous, which represents continuous values of a given
variable. For example, your height can be measured using a continuous
scale.

Measurement Scales of Data:

The measurement scales of data refer to how data values are categorized,
ranked, or quantified. They are the basis for choosing the right analysis
method.

1. Nominal Scale (Categorical - No Order)

●​ Definition: Labels or names without any numeric value or order.​

●​ Examples: Gender (Male/Female), Blood Type (A, B, AB, O), Colors.​

●​ Operations: Only counting or mode; no sorting or calculations.​

●​ ✅ Used for classification only.​

2. Ordinal Scale (Categorical - With Order)

●​ Definition: Data with a natural order, but intervals between values are
not known.​

● Examples: Rank in a competition (1st, 2nd, 3rd), Satisfaction level
(High, Medium, Low).

●​ Operations: Median and mode are valid; mean is not.​

●​ ✅ Used for ranking or preferences.​

3. Interval Scale (Quantitative - Equal Intervals, No True Zero)


●​ Definition: Numeric scale with equal intervals but no absolute zero.​

●​ Examples: Temperature in Celsius or Fahrenheit, IQ scores.​

● Operations: Addition/subtraction valid; ratios are not meaningful (e.g.,
20°C is not twice as hot as 10°C).

●​ ✅ Used for comparison of differences.​

4. Ratio Scale (Quantitative - Equal Intervals, True Zero)

●​ Definition: Same as interval scale but with a true zero point.​

●​ Examples: Age, Weight, Height, Income, Distance.​

●​ Operations: All arithmetic operations allowed.​

●​ ✅ Used for true comparisons and ratios (e.g., 20 kg is twice 10 kg).​
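
The operations permitted on each scale can be illustrated with a short sketch. The sample values and the numeric ordering assigned to the satisfaction levels are assumptions made for this example.

```python
from statistics import mode

blood_types = ["A", "O", "B", "O", "AB"]                      # nominal
satisfaction = ["Low", "High", "Medium", "High", "Medium"]    # ordinal
celsius = [10.0, 20.0]                                        # interval
weights_kg = [10.0, 20.0]                                     # ratio

# Nominal: only counting and the mode make sense.
most_common_type = mode(blood_types)

# Ordinal: values can be sorted, so the median (middle value) is valid.
order = {"Low": 0, "Medium": 1, "High": 2}
ranked = sorted(satisfaction, key=order.get)
median_level = ranked[len(ranked) // 2]

# Interval: differences are meaningful, ratios are not. Converting to
# kelvin (a scale with a true zero) shows 20°C is not "twice" 10°C.
temp_diff = celsius[1] - celsius[0]
kelvin_ratio = (celsius[1] + 273.15) / (celsius[0] + 273.15)

# Ratio: a true zero makes ratios meaningful.
weight_ratio = weights_kg[1] / weights_kg[0]

print(most_common_type, median_level, temp_diff, round(kelvin_ratio, 3), weight_ratio)
```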

BASIC METHODS OF DATA ANALYSIS:


The data for data science is obtained from several data sources. This data is
first cleaned of errors and duplication, aggregated, and then presented in a form
that can be analysed by various methods. In this section, we define some of
the basic methods used for analysing data.
These are: descriptive analysis, inferential analysis, exploratory data
analysis, predictive analysis, diagnostic analysis and prescriptive analysis.

Descriptive Analysis​

●​ Summarizes and describes features of a dataset.​

● Includes measures like mean, median, mode, standard deviation, and
frequency distribution.

●​ Often visualized using charts, graphs, and tables.​
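
The measures listed above can be computed with Python's standard statistics module. This is a small sketch on made-up values; stdev here is the sample standard deviation.

```python
from collections import Counter
from statistics import mean, median, mode, stdev

scores = [2, 3, 3, 5, 7]  # made-up observations

summary = {
    "mean": mean(scores),
    "median": median(scores),
    "mode": mode(scores),
    "stdev": stdev(scores),  # sample standard deviation (divides by n - 1)
}
frequency = Counter(scores)  # frequency distribution: value -> count

print(summary)
print(dict(frequency))
```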

Inferential Analysis​

●​ Draws conclusions about a population based on a sample.​

● Uses statistical techniques like hypothesis testing, confidence
intervals, and regression analysis.
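
As one illustration of inference from a sample, the sketch here estimates a confidence interval for a population mean. It assumes a normal approximation (a z-interval); for small samples a t-distribution would usually be preferred. The function name and the sample values are hypothetical.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev


def mean_confidence_interval(sample, confidence=0.95):
    """Approximate confidence interval for the population mean (z-based)."""
    n = len(sample)
    centre = mean(sample)
    std_error = stdev(sample) / sqrt(n)
    # Two-sided critical value, about 1.96 for 95% confidence.
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    return centre - z * std_error, centre + z * std_error


# Made-up sample drawn from some larger population.
sample = [48.0, 52.0, 50.0, 47.0, 53.0, 51.0, 49.0, 50.0]
low, high = mean_confidence_interval(sample)
print(f"sample mean = {mean(sample):.2f}, 95% CI = ({low:.2f}, {high:.2f})")
```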

Exploratory Data Analysis (EDA)​

●​ Focuses on discovering patterns, trends, and relationships within data.​

● Uses visualizations (e.g., scatter plots, histograms) and summary
statistics.

●​ Often the first step in data analysis.​
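
A histogram is one of the simplest exploratory tools. The sketch here groups made-up ages into fixed-width bins and prints a text histogram; the function name histogram_counts and the bin width are illustrative choices, and a plotting library would normally be used instead.

```python
from collections import Counter


def histogram_counts(values, bin_width):
    """Count how many values fall into each bin, keyed by the bin's start."""
    bins = Counter((v // bin_width) * bin_width for v in values)
    return {start: bins[start] for start in sorted(bins)}


ages = [21, 23, 25, 31, 34, 38, 39, 45]  # made-up data
hist = histogram_counts(ages, bin_width=10)

# Print one row of '#' marks per bin to show the shape of the distribution.
for start, count in hist.items():
    print(f"{start}-{start + 9}: {'#' * count}")
```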

Predictive Analysis​
●​ Uses historical data to make predictions about future outcomes.​

● Involves machine learning and statistical models like linear
regression, decision trees, etc.
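
As a minimal sketch of prediction from historical data, a straight line can be fitted by ordinary least squares and extended one step ahead. The function name fit_line and the monthly sales figures are made up for illustration.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Made-up historical data: sales for months 1 to 4.
months = [1, 2, 3, 4]
sales = [12.0, 14.0, 16.0, 18.0]

slope, intercept = fit_line(months, sales)
forecast = slope * 5 + intercept  # predicted sales for month 5
print(slope, intercept, forecast)
```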

Diagnostic Analysis​

●​ Investigates why something happened in the data.​

●​ Often includes drill-downs, data mining, and correlation analysis.​

Prescriptive Analysis​

●​ Suggests actionable steps based on data insights.​

● Often used in decision-making systems with optimization algorithms or
simulations.
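
A very small example of prescriptive analysis is choosing the order quantity with the highest expected profit over a few demand scenarios. The demand figures, cost, price, candidate quantities, and the function name are all hypothetical, and the search is a plain brute-force comparison rather than a real optimization solver.

```python
def best_order_quantity(demand_scenarios, unit_cost, unit_price, candidates):
    """Return the candidate quantity with the highest average profit."""

    def expected_profit(quantity):
        # Units sold cannot exceed either the stock or the demand.
        profits = [
            unit_price * min(quantity, demand) - unit_cost * quantity
            for demand in demand_scenarios
        ]
        return sum(profits) / len(profits)

    return max(candidates, key=expected_profit)


# Made-up, equally likely demand scenarios.
demand = [80, 100, 120]
best = best_order_quantity(
    demand, unit_cost=4, unit_price=10, candidates=range(60, 141, 20)
)
print(best)
```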

Common Misconceptions in Data Analysis


Data Analysis Is Only About Numbers​
Misconception: Data analysis is solely a numerical or statistical process.​
Clarification: While quantitative analysis is a significant component, data
analysis also involves understanding the context, patterns, and qualitative
aspects of data.​

More Data Equals Better Results​


Misconception: Having more data automatically leads to better insights.​
Clarification: The quality of data is more important than quantity. Large
volumes of poor-quality or irrelevant data can lead to misleading
conclusions.​

Correlation Implies Causation​


Misconception: A correlation between two variables indicates that one
causes the other.​
Clarification: Correlation does not establish causation. Two variables may be
correlated due to coincidence or the presence of a third influencing factor.
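
The third-factor case can be shown with made-up numbers. In this sketch, two quantities are each computed from temperature alone, so neither causes the other, yet their Pearson correlation is as high as it can be. The variable names and formulas are purely illustrative.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Temperature is the hidden third factor driving both series.
temperature = [20, 25, 30, 35]
ice_cream_sales = [2 * t + 5 for t in temperature]
sunburn_cases = [3 * t - 10 for t in temperature]

r = pearson(ice_cream_sales, sunburn_cases)
print(round(r, 3))
```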

APPLICATIONS OF DATA SCIENCE:
Data science has widespread applications across many fields.
In healthcare, it is used for disease prediction, drug discovery, personalized
treatments, and medical imaging analysis.
In finance, it powers fraud detection, risk management, customer
segmentation, and algorithmic trading.
Retail and e-commerce use data science for recommendation systems,
inventory forecasting, and customer sentiment analysis.
Transportation and logistics apply it for route optimization, self-driving
vehicles, and demand prediction.
In entertainment, data science drives personalized content recommendation,
audience analysis, and ad targeting.
Manufacturing benefits through predictive maintenance, quality control, and
supply chain optimization.
Education uses data science for personalized learning, predicting student
performance, and curriculum development.
In agriculture, it enables precision farming, crop yield prediction, and early
pest/disease detection.

DATA SCIENCE LIFE CYCLE:


The data science life cycle is a structured approach to developing and
deploying data-driven solutions. It typically involves six key phases: problem
definition, data acquisition and exploration, research and development,
validation, delivery, and monitoring. Each phase has iterative steps, ensuring
a thorough and systematic process.

Data Science Project Requirements Analysis Phase


The first step in a data science project is to identify its objectives.
This identification of objectives is coupled with a study of the benefits,
resource requirements and cost of the project. In addition, you need to make a
project plan, which includes the project deliverables and the associated time
frame. The data required for the project is also decided in this phase. This
phase is similar to requirements study, project planning and scheduling.

Data Collection and Preparation Phase

In this phase, all the data sources are first identified, followed by the design of the
data collection process. It may be noted that data collection may be a continuous
process. Once the data sources are identified, the data is checked for duplication,
consistency, missing values, and availability timeline. In addition,
data may be integrated, aggregated or transformed to produce data for a defined set of
attributes, which are identified in the requirements phase.
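
The checks described in this phase can be sketched on a handful of made-up records: duplicates and records with missing values are dropped, and the remainder is aggregated per customer. Treating two identical records as duplicates is an assumption of this example; real transaction data would normally carry a unique identifier. The field names and the function name prepare are illustrative.

```python
from collections import defaultdict

# Made-up raw records as they might arrive from a data source.
raw = [
    {"custID": 1, "amount": 100.0},
    {"custID": 1, "amount": 100.0},  # exact duplicate
    {"custID": 2, "amount": None},   # missing value
    {"custID": 2, "amount": 40.0},
    {"custID": 1, "amount": 60.0},
]


def prepare(records):
    """Drop duplicates and missing values, then total the amount per customer."""
    seen, cleaned = set(), []
    for rec in records:
        key = (rec["custID"], rec["amount"])
        if rec["amount"] is None or key in seen:
            continue
        seen.add(key)
        cleaned.append(rec)

    totals = defaultdict(float)
    for rec in cleaned:
        totals[rec["custID"]] += rec["amount"]
    return dict(totals)


totals = prepare(raw)
print(totals)
```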

Descriptive Data Analysis

Next, the data is analysed using univariate and bivariate analysis techniques. This will
generate descriptive information about the data. This phase can also be used to
establish the suitability and validity of the data as per the requirements of data analysis.
This is a good time to review your project requirements vis-à-vis the characteristics
of the collected data.

Data Modelling and Model Testing

Next, a number of data models based on the data are developed. All these data models
are then tested for their validity with test data. The accuracy of the various models is
compared and contrasted, and a final model is proposed for data analysis.
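
Comparing candidate models on held-out test data might look like the sketch here. Both models, the data points, and the choice of mean absolute error as the measure are assumptions made for illustration; the "linear" model is simply a hand-written line that matches the training points.

```python
def mean_abs_error(model, data):
    """Average absolute difference between predictions and actual values."""
    return sum(abs(model(x) - y) for x, y in data) / len(data)


# Made-up data, split into a training part and a test part.
train = [(1, 3.0), (2, 5.0), (3, 7.0)]
test = [(4, 9.0), (5, 11.0)]

train_mean = sum(y for _, y in train) / len(train)

candidates = {
    "constant": lambda x: train_mean,  # always predicts the training mean
    "linear": lambda x: 2 * x + 1,     # line matching the training points
}

# Evaluate every candidate on the test data only, then pick the best.
errors = {name: mean_abs_error(model, test) for name, model in candidates.items()}
best_name = min(errors, key=errors.get)
print(errors, best_name)
```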

Model Deployment and Refinement

The best tested model is used to address the data science problem. However, this
model must be constantly refined, as the decision-making environment keeps
changing and the data sets and attributes may change with time. The refinement
process goes through all the previous steps again.
