Key Concepts in Data Science and Analysis
What is CSV format?
CSV (Comma Separated Values) format is a simple file format used to store tabular data
(numbers and text) in plain text, where each line is a data record and each field within
the record is separated by a comma (or another delimiter).
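As a small illustration, the sketch below reads CSV text with Python's standard csv module. The sample records (names, ages, cities) are invented for the example, and every field is read as a string.

```python
import csv
import io

# Hypothetical CSV content: the first line is the header,
# each later line is one data record with comma-separated fields.
csv_text = "name,age,city\nAsha,21,Pune\nRavi,23,Mumbai\n"

# DictReader maps every record to a dictionary keyed by the header fields.
reader = csv.DictReader(io.StringIO(csv_text))
records = list(reader)

for row in records:
    print(row["name"], row["age"], row["city"])
```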
Define Data source.
A Data source is the location, format, and structure from which data is collected or
retrieved (e.g., a database, a spreadsheet, a website, or a set of sensor readings).
Define Hypothesis Testing.
Hypothesis Testing is a statistical method used to determine whether there is enough
evidence in a sample of data to infer that a certain condition or hypothesis holds true
for the entire population.
Define Data cleaning.
Data cleaning is the process of detecting and correcting (or removing) corrupt,
inaccurate, incomplete, or irrelevant records from a dataset.
Define volume characteristic of data in reference to data science.
Volume refers to the immense amount of data generated every second, which is a
defining characteristic of Big Data, often measured in terabytes (TB), petabytes (PB),
and beyond.
Give examples of semistructured data.
Examples of semistructured data include XML files and JSON (JavaScript Object Notation)
files.
What is a quartile?
A quartile is one of the three points ($Q_1$, $Q_2$, $Q_3$) that divide a ranked dataset
into four equal parts, each containing 25% of the data. $Q_2$ is the median.
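A minimal sketch of the three quartiles, assuming Python's standard statistics module and an invented dataset. Several conventions exist for interpolating quartiles; this example uses the module's default ("exclusive") method, so other tools may give slightly different values on other data.

```python
import statistics

data = [1, 2, 3, 4, 5, 6, 7]  # illustrative values, already ranked

# quantiles(..., n=4) returns the three cut points Q1, Q2, Q3.
q1, q2, q3 = statistics.quantiles(data, n=4)

print(q1, q2, q3)                      # 2.0 4.0 6.0
print(q2 == statistics.median(data))   # True: Q2 is the median
```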
List different types of attributes.
Nominal (e.g., color)
Ordinal (e.g., size: small, medium, large)
Interval (e.g., temperature in Celsius)
Ratio (e.g., height, weight)
Define Data object.
A Data object (or record, sample, observation, or entity) is an entity that is described by
a set of attributes, representing a single instance or row of data in a dataset.
Write the tools used for geospatial data.
Tools used for geospatial data include GIS (Geographic Information Systems) software
like QGIS or ArcGIS.
State the methods of feature selection.
Methods of feature selection include Filter methods, Wrapper methods, and Embedded
methods.
List any two libraries used in Python for data analysis.
Pandas, NumPy, Matplotlib
Explain different applications of Data Science.
Data Science is applied across almost every industry to extract value and drive
decision-making. Two key applications are:
Recommendation Systems: Used by e-commerce platforms (Amazon), streaming
services (Netflix, Spotify), and social media to predict user preferences and suggest
products, movies, or music. This drives engagement and sales.
Fraud Detection: Financial institutions use Data Science models (often supervised
learning) to analyze transaction patterns in real-time. The models identify anomalous
behavior that deviates from normal activity, flagging potentially fraudulent transactions.
What do you mean by Noisy data? Explain any two causes of noisy data.
Noisy data is data that contains meaningless or distorted values, that is, random error
or variance in the recorded measurements. Such data is hard for standard statistical or
machine learning methods to interpret correctly, and it reduces the accuracy of the
model.
Two causes of noisy data are:
Faulty Data Collection Instruments: Malfunctioning sensors or instruments used during
data gathering can introduce random errors or inaccuracies into the collected data.
Data Entry Errors: Human mistakes during manual input or recording of data (e.g., typos,
misplaced decimal points) can lead to spurious or incorrect values.
What is data visualization, and what is its purpose? Give examples of any two data
visualization libraries in Python.
Data Visualization is the graphical representation of data. Its purpose is to efficiently and
clearly communicate insights, trends, and patterns hidden in the data. It transforms
complex datasets into accessible visual formats (charts, graphs, maps) that facilitate
understanding and decision-making for a wide audience.
Two data visualization libraries in Python are:
Matplotlib and Seaborn
Explain 3V's of data science.
The 3 V's are core characteristics used to describe Big Data:
Volume: Refers to the massive quantity of data generated daily from various sources.
The sheer size of the data requires unique processing and storage solutions (e.g., data
measured in terabytes or petabytes).
Velocity: Refers to the speed at which data is generated, collected, and processed.
High-velocity data often requires real-time or near real-time processing (e.g., streaming
data from social media feeds or sensor networks).
Variety: Refers to the many different types of data structures, including structured
(relational databases), semi-structured (XML, JSON), and unstructured data (text, video,
images).
What do you understand by structured and unstructured data? Differentiate them.
Structured Data: Data that is highly organized and follows a predefined format or model.
It typically resides in relational databases (RDBMS) where it is stored in tables with rows
and columns. Examples include names, dates, and transaction amounts.
Unstructured Data: Data that does not have a predefined format or structure. It is
challenging to process and analyze using traditional methods. Examples include social
media posts, emails, videos, and images.
Feature | Structured Data | Unstructured Data
Organization | Highly organized (tabular format) | No predefined organization
Schema | Fixed and defined schema | No fixed schema
Querying | Easy using SQL | Difficult; requires specialized tools
Storage | Relational databases | Data Lakes, NoSQL databases
Explain tools in data scientist tool box.
The data scientist's toolbox is a collection of software, languages, and frameworks
necessary for the entire data science workflow:
Programming & Libraries: Tools like Python and R provide the environment, while
libraries like Pandas (for data manipulation), NumPy (for numerical computing), and
Scikit-learn (for machine learning) provide specialized functions.
Visualization Tools: Tools like Tableau or Matplotlib/Seaborn allow for the creation of
charts and dashboards to understand and communicate data insights effectively.
1. Explain data science life cycle with suitable diagram.
The Data Science Life Cycle outlines the steps or phases that data science projects
typically follow from start to finish. It provides a framework for carrying out each
phase well until the project's completion.
The typical steps in the data science life cycle include:
Step 1: Setting the Goal
The entire cycle revolves around a business or research goal. It is essential to clearly
understand the business objective to set a specific goal for the analysis.
Step 2: Data Understanding
This step involves the collection of all available data. It requires describing and
understanding the data, including its structure, relevance, and data types.
Step 3: Data Preparation
This is often the most time-consuming yet important step in the entire life cycle. It
includes selecting relevant data, integrating (merging) datasets, cleaning data (treating
missing or erroneous values, handling outliers), and constructing new data or deriving
new features.
Step 4: Exploratory Data Analysis (EDA)
This is the process of getting an idea about the solution before building the actual
model. It involves exploring the distribution and relationships within the data using
graphical representations such as bar graphs, scatter plots, and heat maps.
Step 5: Data Modeling
This is the heart of data analysis. A model takes the prepared data as input and provides
the desired output.
Step 6: Model Evaluation
Evaluation is a crucial step in which the model's performance is rigorously tested before
deployment.
Step 7: Model Deployment
This is the final step, in which the evaluated model is deployed in the desired format and
channel. One of the goals is to convince the business that the findings will change the
business process as expected.
2. Explain 3 V's of data science with diagram.
The 3 V's of data science—Volume, Velocity, and Variety—epitomize the expansion of
data that has occurred at the turn of the twenty-first century. These three specific
characteristics define the challenges and opportunities of Big Data.
Volume
Refers to the increasing size and scope of the data.
Explanation: Data collections today are massive, which requires complex and powerful
algorithms to handle, process, and analyze.
Velocity
Refers to the speed at which data is accumulated or acquired.
Explanation: Data is generated at an extremely fast and unprecedented speed, requiring
efficient processing capabilities.
Variety
Refers to the diverse types of data that are available.
Explanation: Data comes in a massive array of types, including structured, unstructured
(like text, images, and video), and semi-structured formats.
3. Explain concept and use of data visualisation.
Concept of Data Visualization
Data Visualization is the graphical representation of information and data. It is a process
that transforms the representation of raw data into meaningful insights in a visual
format. The representation can be viewed as a mapping between the original, usually
numerical, data and graphic elements like lines or points in a chart.
Data visualization is essential because the human brain processes visual content better
than plain textual information. By using visual elements like charts, graphs, and maps,
visualization tools provide an accessible way to understand complex data.
Uses (Advantages) of Data Visualization
Data visualization provides numerous advantages in data analysis, including:
Identifying Patterns and Trends: It makes it easier for humans to detect trends,
patterns, correlations, and outliers in a group of data. Trend and time-series analysis is
in high demand, for instance, for studying the stock market.
Facilitating Decision-Making: A simple visualization, built with credible data, can help
businesses or organizations make quick business decisions.
Understanding Big Data: It helps humans understand the big picture of massive
datasets using a small, impactful visualization.
Detecting Outliers: With the help of visualization, outliers can be easily detected and
removed from the dataset to prevent misleading analysis and incorrect results.
Improving Response Time: Visualization gives a quick glance of the entire data, which
allows analysts to quickly identify issues and thus improve response time.
4. What are the components of data scientist tool box? Explain two of them in detail.
The Data Scientist's Toolbox consists of various tools, techniques, and skills required by
a data scientist to extract, manipulate, pre-process, and generate predictions from data.
The components can be broadly categorized as:
Programming Languages (e.g., R, Python)
Statistical and Mathematical Tools (e.g., SAS)
Data Visualization Software (e.g., Tableau)
Big Data Processing Frameworks (e.g., Apache Spark)
Here are two components/tools explained in detail:
1. R Programming
R is an open-source programming language.
Detail: It is frequently used in data science for data handling and manipulation. It
provides a powerful environment for statistical computing and graphics.
2. Tableau
Tableau is a powerful data visualization software.
Detail: It is a widely used data science tool due to its simplicity and the ability to provide
an easy interpretation of complex data analytical tasks.
5. Explain different data formats in brief.
Data Format | Brief Explanation
CSV Files | Comma Separated Values. A text file where the values in the columns are separated by a comma. Reading from this file type is a fundamental necessity in data science.
JSON Files | JavaScript Object Notation. A lightweight, text-based open standard designed for exchanging data over the web. It stores data as text in a human-readable format.
XML Files | Extensible Markup Language. An example of semi-structured data used to store and transport data, often on the web.
Tar Files | Derived from "tape archive." Used for collecting many files into one single archive file.
Zip/GZip Files | These are two very popular methods of compressing files. They are used to save space or to reduce the time needed to transmit files across a network or the internet.
Image Files | A standard way to organize and store image data. They can be in formats like rasterized or vectorized.
Integers/Floats/Text Data | Basic data types representing whole numbers, decimal numbers, and character-based information, respectively.
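To make two of these formats concrete, here is a minimal sketch using Python's standard json and gzip modules. The record and its field names are invented for illustration; the text is repeated 100 times only so that the effect of compression is visible.

```python
import gzip
import json

# Hypothetical record; the field names are made up for this example.
record = {"id": 1, "name": "Asha", "scores": [82, 91]}

# JSON: serialize to human-readable text, then parse it back.
json_text = json.dumps(record)
restored = json.loads(json_text)

# GZip: compress the bytes of the (repeated) text, then decompress them again.
raw_bytes = (json_text * 100).encode("utf-8")
compressed = gzip.compress(raw_bytes)
decompressed = gzip.decompress(compressed)

print(restored == record)                # the round trip preserves the data
print(len(raw_bytes), len(compressed))   # compressed size is smaller
```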
8. Write a short note on different methods for measuring data similarity and
dissimilarity.
Similarity and Dissimilarity (Proximity Measures) are fundamental to data science
applications like clustering and nearest-neighbor classification, as they quantify how
alike or unlike data objects are.
Similarity Measure: Measures how related or close data samples are to each other. A
value of 1 indicates complete similarity.
Dissimilarity Measure: Measures how distinct data objects are. It is also known as a
distance measure. A value of 0 indicates the objects are identical.
Different methods are used based on the type of data attribute:
Dissimilarity of Numeric Data (Distance Measures): These methods measure the
"distance" between two data points ($x_i$ and $x_j$) in a multi-dimensional space, which
corresponds to the degree of dissimilarity.
Euclidean Distance: The most common straight-line distance (L2 norm).
Manhattan Distance: Also known as city-block distance (L1 norm), calculated by
summing the absolute differences of the coordinates.
Minkowski Distance: A generalization of both Euclidean and Manhattan distances.
Proximity Measures for Binary Attributes: These methods are used for attributes with
only two states (0 or 1). Special measures are used depending on whether the attributes
are symmetric (both 0 and 1 are equally important, e.g., gender) or asymmetric (one
state is much more significant, often the '1' state, e.g., presence of a specific test result).
Proximity Measures for Nominal Attributes: Measures the similarity between
categorical values where the order does not matter (e.g., matching coefficient).
Proximity Measures for Ordinal Attributes: These measures typically use the ranks of
the ordinal values instead of the values themselves to calculate the proximity (e.g., rank
correlation).
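The numeric distance measures above can be sketched in a few lines of plain Python. This is an illustrative sketch, not a library API: the function names and sample points are assumptions, and the last function shows one common mismatch-ratio form of the matching coefficient for nominal attributes.

```python
import math

def euclidean_distance(x, y):
    """Straight-line (L2) distance between two equal-length numeric points."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def manhattan_distance(x, y):
    """City-block (L1) distance: sum of absolute coordinate differences."""
    return sum(abs(a - b) for a, b in zip(x, y))

def minkowski_distance(x, y, p):
    """Generalization: p=1 gives Manhattan, p=2 gives Euclidean."""
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1 / p)

def simple_matching_dissimilarity(x, y):
    """Fraction of attributes on which two nominal objects disagree."""
    mismatches = sum(1 for a, b in zip(x, y) if a != b)
    return mismatches / len(x)

xi, xj = (1, 2), (4, 6)  # illustrative points
print(euclidean_distance(xi, xj))     # 5.0
print(manhattan_distance(xi, xj))     # 7
print(minkowski_distance(xi, xj, 2))  # 5.0
```

A dissimilarity of 0 from any of these functions means the two objects are identical on the compared attributes.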
9. Write a short note on feature extraction.
Feature Extraction is a crucial part of Dimensionality Reduction, a process used to
reduce the number of variables (features) in a dataset.
Concept and Purpose: Feature extraction is the process of reducing the data from a
high-dimensional space to a lower-dimensional space. Its goal is to create a new,
smaller set of features that retains the most useful information from the original large
set of features. Unlike feature selection, which chooses a subset of existing features,
feature extraction creates new features by combining or transforming the old ones.
Methods: There are several methods used for feature extraction, including:
Principal Component Analysis (PCA): A popular unsupervised method that identifies the
directions (principal components) along which the data varies the most, projecting the
data onto a smaller dimensional subspace.
Linear Discriminant Analysis (LDA): A supervised method that attempts to find a feature
subspace that maximizes the separation between different classes.
Generalized Discriminant Analysis (GDA).
10. What are the different methods for measuring the data dispersion?
Measures of Dispersion (or Variability) indicate the degree to which individual data
values scatter or vary around the average (central tendency). They are necessary to
assess how representative the average value is.
The different methods for measuring data dispersion are:
Range: The simplest measure of dispersion, calculated as the difference between the
maximum and minimum values in a dataset.
Formula (in practice): Range = Max value - Min value.
Variance: Measures the average of the squared differences from the mean. It
provides a measure of how far the numbers in a set are spread out from their average
value. Calculation: Sum the squared deviations from the mean and divide by the number
of values (or by n - 1 for a sample). It equals the square of the standard deviation.
Standard Deviation (SD): The square root of the variance. It is the most widely used
measure, as it is expressed in the same units as the original data, making it easier to
interpret. It quantifies the amount of variation or dispersion of a set of data values.
Interquartile Range (IQR): Measures the spread of the middle 50% of the data. It is
the difference between the third quartile ($Q_3$) and the first quartile ($Q_1$).
Calculation: IQR = $Q_3 - Q_1$. It is a resistant measure, meaning it is less affected by
outliers than the Range or Standard Deviation.
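The four measures can be computed with Python's standard statistics module, as in the sketch below. The dataset is invented, the population forms of variance and standard deviation are used (dividing by n), and the quartiles follow the module's default interpolation, which is only one of several conventions.

```python
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]  # illustrative values

data_range = max(data) - min(data)       # Range = Max value - Min value
variance = statistics.pvariance(data)    # mean of squared deviations from the mean
std_dev = statistics.pstdev(data)        # square root of the variance
q1, _, q3 = statistics.quantiles(data, n=4)
iqr = q3 - q1                            # spread of the middle 50% of the data

print(data_range, variance, std_dev, iqr)
```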
11. Explain any four data Visualization tools?
Data visualization tools are applications that help users create visual representations of
data (charts, graphs, maps, etc.) to communicate insights effectively.
Here are four common data visualization tools:
1. Tableau: A powerful, highly popular business intelligence (BI) tool known for its
ease of use and ability to create interactive and visually appealing dashboards and
visualizations without requiring coding.
2. Microsoft Power BI: A suite of business analytics tools provided by Microsoft. It is
excellent for data modeling, preparation, and creating interactive reports and
dashboards, especially within the Microsoft ecosystem.
3. R: Although R is a programming language, it is a primary tool for statistical computing
and graphics. The `ggplot2` library within R is one of the most respected visualization
packages, enabling the creation of complex and aesthetically pleasing plots based on
the grammar of graphics.
4. Google Charts: A free, web-based tool provided by Google that allows developers
to create a wide variety of interactive charts and graphs to embed on their websites. It
uses a JavaScript API and relies on HTML5 and SVG.
12. What do you mean by data Transformation? State and explain 4 data
transformation techniques/strategies.
Data Transformation
Data transformation is the process of converting raw data into a format or structure
that is more suitable for data analysis. This process is important because it directly
affects the quality of the final analysis results.
Data Transformation Strategies
Four Data Transformation Techniques/Strategies:
Rescaling - Rescaling means transforming the data so that it fits within a specific scale,
typically a range like 0-100 or 0-1.
It allows scaling all data values to lie between a specified minimum and maximum value
(e.g., between 0 and 1).
This technique helps to compare different variables on equal footing, especially when
attributes have varying scales.
Normalizing - Normalization is a technique that can be applied to improve the
identification of outliers or invalid values for numerical data.
It addresses issues where the use of different measurement units (e.g., meters vs.
inches for height) can affect data analysis and lead to very different results.
In a specific type of normalization, the data is rescaled in such a way that each row
(observation) has a length of 1 (called a unit norm in linear algebra).
Binarizing - Binarizing is the process of converting data to either 0 or 1 based on a
threshold value.
Label Encoding - Label encoding is used to convert textual labels (categorical features)
into a numeric form so that they can be used in a machine-readable format.
The categories are assigned a value from 0 to (n-1), where 'n' is the number of distinct
values for that particular categorical feature.
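The four strategies above can be sketched in plain Python as follows. The function names and sample values are illustrative assumptions, not a library API. The sketch assumes the values to rescale are not all equal and the row to normalize is not all zeros, and it assigns label codes in alphabetical order, which is only one possible convention.

```python
import math

def rescale(values, new_min=0.0, new_max=1.0):
    """Min-max rescaling into [new_min, new_max]; assumes max(values) > min(values)."""
    lo, hi = min(values), max(values)
    return [new_min + (v - lo) * (new_max - new_min) / (hi - lo) for v in values]

def normalize_row(row):
    """Scale one observation so its Euclidean length is 1; assumes a non-zero row."""
    length = math.sqrt(sum(v * v for v in row))
    return [v / length for v in row]

def binarize(values, threshold):
    """Map each value to 1 if it is strictly above the threshold, else 0."""
    return [1 if v > threshold else 0 for v in values]

def label_encode(labels):
    """Assign codes 0..n-1 to the n distinct labels (alphabetical order assumed)."""
    mapping = {category: code for code, category in enumerate(sorted(set(labels)))}
    return [mapping[label] for label in labels], mapping

print(rescale([10, 20, 30]))                           # [0.0, 0.5, 1.0]
print(binarize([0.2, 0.7, 0.5], threshold=0.5))        # [0, 1, 0]
print(label_encode(["red", "green", "red", "blue"]))   # ([2, 1, 2, 0], {...})
```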
_________________________________________________________________
13. Explain data visualization libraries in Python.
Python's ecosystem provides several powerful and widely used libraries for data
visualization, enabling users to create static, interactive, and animated plots.
1. Matplotlib:
The foundational and most widely used plotting library for Python. It is used to generate
figures, plots, histograms, power spectra, bar charts, error charts, and scatter plots. It
offers fine-grained control over every element of a figure but can sometimes be verbose
for complex plots.
2. Seaborn:
Built on top of Matplotlib, Seaborn is specialized for creating statistical graphics. It
provides a high-level interface for drawing attractive and informative statistical graphics,
often with a more sophisticated visual appeal and fewer lines of code compared to
Matplotlib.
3. Plotly:
A library used to create interactive and high-quality visualizations that can be displayed
in web browsers. It supports various chart types, including 3D plots, statistical charts,
and financial charts. It is excellent for creating shareable, interactive dashboards.
4. Bokeh:
Another powerful library for creating interactive visualizations for modern web
browsers. It aims to provide elegant, concise construction of versatile graphics, and is
particularly good for creating large-scale data dashboards.
14. What is Outlier? Explain types/detection methods of outliers.
Outlier - An outlier is a data point that differs significantly from other observations.
They are extreme values that deviate from other observations and may indicate
variability in a measurement, experimental errors, or a novelty.
Detection Methods of Outliers
Outliers can be detected using various statistical, visual, or algorithmic methods:
1. Statistical Methods (Z-score and IQR):
Z-score Method: Measures how many standard deviations a data point is away from the
mean. Data points whose Z-score lies beyond a chosen threshold (e.g., above 3 or
below -3) are considered outliers.
Interquartile Range (IQR) Method: An observation is identified as an outlier if it falls
more than 1.5 times the IQR below the first quartile ($Q_1$) or more than 1.5 times the
IQR above the third quartile ($Q_3$). This method is visually represented by a Box Plot.
2. Visualization Methods:
Box Plots: A visual representation of data distribution that clearly shows the quartiles,
median, and marks points outside the whiskers as potential outliers.
Scatter Plots: Data points that lie far away from the general cluster of points can be
visually identified as outliers.
3. Distance-based Methods:
Identifies outliers as those data points that are far away from their nearest neighbors.
For instance, a point could be classified as an outlier if fewer than a given fraction of all
points are within a specified distance from it.
4. Model-based Methods (Clustering):
Uses clustering algorithms (like K-Means) where data points that do not belong to any
cluster or are significantly far from the cluster centroid are flagged as potential outliers.
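The two statistical methods (Z-score and IQR) can be sketched with Python's standard statistics module. The readings below are invented, the thresholds 3 and 1.5 are the conventional defaults mentioned above, and the Z-score here uses the population standard deviation, which is an assumption of this sketch.

```python
import statistics

def zscore_outliers(values, threshold=3.0):
    """Return values whose absolute Z-score exceeds the threshold."""
    mean = statistics.mean(values)
    sd = statistics.pstdev(values)  # population standard deviation (assumed)
    return [v for v in values if abs(v - mean) / sd > threshold]

def iqr_outliers(values, k=1.5):
    """Return values below Q1 - k*IQR or above Q3 + k*IQR."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < lower or v > upper]

readings = [10, 12, 11, 13, 12, 11, 10, 12, 13, 11, 12, 100]  # illustrative
print(zscore_outliers(readings))  # [100]
print(iqr_outliers(readings))     # [100]
```

On small samples the Z-score method can miss outliers, because a single extreme value inflates the standard deviation it is measured against; the IQR method is less affected by this.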
_____________________________________________________________________
15. What is data cleaning? Explain any two data cleaning methods.
Data Cleaning, also known as data cleansing or scrubbing, is the process of correcting or
removing incorrect, incomplete, or duplicate data within a dataset. It is done to handle
irrelevant or missing data. The process involves filling in missing values, smoothing any
noisy data, identifying and removing outliers, and resolving inconsistencies.
Data Cleaning Methods/Operations:
Filling in Missing Values (Imputation):
This technique addresses missing values in the dataset, which is known as imputation.
Missing values can be filled in by guessing them from other data fields or by simply
using the mean of all non-null values for that attribute. For example, the null value of an
"age" attribute for a customer might be replaced with the mean age of all customers in
the same category.
Identifying and Removing Outliers (Smoothing Noisy Data):
Real-world data is often noisy and may contain outliers, some of which are errors.
Data cleaning involves identifying these outliers and removing the erroneous ones to
ensure the quality of the data. An outlier is an extreme value that significantly deviates
from other observations.
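As a small illustration of the first method, mean imputation can be sketched as follows. Missing values are represented by None and the ages are invented; both are assumptions of this example.

```python
import statistics

def impute_with_mean(values):
    """Replace missing entries (None) with the mean of the non-missing entries."""
    known = [v for v in values if v is not None]
    mean_value = statistics.mean(known)
    return [mean_value if v is None else v for v in values]

ages = [25, None, 35, 30, None]   # hypothetical "age" attribute with two nulls
filled = impute_with_mean(ages)
print(filled)                     # [25, 30, 35, 30, 30]
```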
_______________________________________
16. What do you mean by Data Reduction? Explain any two Data Reduction
strategies.
Data Reduction is an essential phase in the data preprocessing pipeline. Its primary goal
is to obtain a reduced representation of the data set, meaning a much smaller volume
of data, while ensuring that the integrity of the original data's information content is
preserved as much as possible.
Data reduction helps in reducing storage cost and volume, and in increasing the speed of
complex query execution and data mining tasks.
Explain any two Data Reduction strategies:
i. Dimensionality Reduction
This strategy aims to reduce the number of attributes (random variables or features)
under consideration. It reduces the number of unwanted variables by obtaining a
smaller set of principal variables, which are often the most important features. It
includes methods like Feature Selection (selecting a subset of the original features) and
Feature Extraction (transforming data into a new set of dimensions, such as using
Principal Component Analysis - PCA).
Benefit: Reducing the number of dimensions helps to combat the curse of
dimensionality and can significantly improve the performance of machine learning
models.
ii. Data Cube Aggregation
This technique involves aggregating or summarizing data, which inherently provides a
much smaller data representation. A data cube stores pre-computed and summarized
multidimensional information. For example, instead of storing individual sales
transactions, the data is aggregated to show total sales per month or per year, across
different regions and products.
Benefit: This eases multidimensional analysis and speeds up data access for data mining,
achieving reduction without significant loss of analytical insight.
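The aggregation idea can be sketched in plain Python: individual transactions are summarized into total sales per month and region. The transactions, field layout, and function name are invented for illustration; a real data cube would typically be built by a database or OLAP tool.

```python
from collections import defaultdict

# Hypothetical transactions: (date as YYYY-MM-DD, region, amount)
transactions = [
    ("2024-01-05", "North", 120.0),
    ("2024-01-17", "North", 80.0),
    ("2024-01-20", "South", 50.0),
    ("2024-02-02", "North", 200.0),
    ("2024-02-11", "South", 70.0),
    ("2024-02-25", "South", 30.0),
]

def aggregate_sales(rows):
    """Sum amounts per (month, region) cell."""
    totals = defaultdict(float)
    for date, region, amount in rows:
        month = date[:7]  # keep only YYYY-MM
        totals[(month, region)] += amount
    return dict(totals)

monthly_sales = aggregate_sales(transactions)
print(len(transactions), "rows reduced to", len(monthly_sales), "cells")
print(monthly_sales[("2024-02", "South")])  # 100.0
```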
17. What do you mean by hypothesis testing? Explain null and alternate hypothesis.
Hypothesis Testing is a core inferential statistical technique used in data analysis. It is a
formal process used to determine whether there is sufficient statistical evidence in a
sample of data to conclude that a specific condition or relationship holds for the entire
population. Essentially, it's a method for testing a claim or an idea about a population
using sample data.
Explain Null and Alternate Hypothesis:
i. Null Hypothesis (H0)
The Null Hypothesis is the statement of no effect, no difference, or no relationship. It is
the hypothesis that the analyst is trying to disprove or reject. It generally states that a
population parameter is equal to a specified value. Example: H0: The average height
of all students in Pune University is 5 feet 8 inches.
ii. Alternative Hypothesis (Ha or H1)
The Alternative Hypothesis is the claim about the population that contradicts the null
hypothesis. It is the statement that is accepted if the null hypothesis is rejected based
on the statistical evidence. It represents the researcher's new claim or finding, stating
that a difference, effect, or relationship exists in the population.
Example: Ha: The average height of all students in Pune University is not 5 feet 8
inches.
The goal of the testing process is to calculate the probability (the p-value) of observing
sample data at least as extreme as the data actually observed, assuming H0 is true. If
this probability is very low (below a chosen significance level such as 0.05), the analyst
rejects H0 in favor of Ha.
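The height example can be sketched as a two-sided one-sample z-test. This is a sketch under stated assumptions: the sample mean, sample size, and population standard deviation are all invented, the standard deviation is treated as known (otherwise a t-test would be the usual choice), and 5 feet 8 inches is expressed as 68 inches.

```python
import math
from statistics import NormalDist

def one_sample_z_test(sample_mean, n, mu0, sigma, alpha=0.05):
    """Two-sided z-test of H0: population mean == mu0, with sigma assumed known."""
    standard_error = sigma / math.sqrt(n)
    z = (sample_mean - mu0) / standard_error
    # Probability of a result at least this extreme if H0 is true.
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value, p_value < alpha

# Hypothetical sample: 36 students with mean height 70 inches, sigma = 3 inches.
z, p_value, reject_h0 = one_sample_z_test(sample_mean=70.0, n=36, mu0=68.0, sigma=3.0)
print(z, p_value, reject_h0)
```

With these invented numbers the test statistic is z = 4, the p-value is far below 0.05, and H0 is rejected. A sample mean of 68.5 inches with the same n and sigma gives z = 1, which is not enough evidence to reject H0.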
_____________________________________________________
18. How to visualize geospatial data? Explain in detail.
Geospatial data (also called spatial data or geodata) consists of numeric data that
denotes a geographic coordinate system, primarily latitude and longitude, which
identifies the location of a physical object. The primary method of visualization is
through maps. Visualizing this data involves placing data points, regions, or movements
within their geographical context, allowing for the identification of spatial patterns,
clusters, and relationships.
Common Geospatial Visualization Techniques
i. Choropleth Maps - A choropleth map divides a geographic area (like a country, state,
or county) into predefined regions and uses different colors or shades to represent the
value or range of a specific variable within each region. It is ideal for visualizing
aggregate data like population density, average income, or election results across
distinct administrative boundaries.
ii. Heat Maps (Density Maps) - Heat maps are used to represent the concentration or
intensity of continuous data over a geographic area. Unlike choropleth maps, the colors
in a heat map do not strictly correspond to fixed geographical boundaries; instead, they
are determined by the density of data points. Areas with high concentration or intensity
of the variable appear as "hot spots" (often with intense colors like red/yellow), while
areas with lower values appear "cooler" (often blue/green). Useful for visualizing large
datasets with overlapping points, such as crime density, Wi-Fi usage, or regions of high
seismic activity.
iii. Point Maps and Cluster Maps - These use individual markers to represent discrete
locations or events.
Point Map: Places a single dot/point at the exact latitude and longitude of an event or
object.
Cluster Map: When data points are too numerous and overlap, a cluster map groups
nearby points into a single symbol, which is often sized or labeled with the count of
grouped points, reducing clutter.
--------------------------------------------------------------------------------------------------------
19. Explain Exploratory Data Analysis (EDA) in detail.
Exploratory Data Analysis (EDA) is the crucial initial process of investigating a dataset to
summarize its main characteristics, often using visual methods and descriptive statistics,
before formal modeling begins.
EDA is performed to:
Understand Data Quality: Identify missing values, erroneous data, and inconsistencies.
Discover Patterns and Anomalies: Detect important patterns, underlying trends, and
anomalies (outliers) in the data.
Form Hypotheses: Generate initial hypotheses and ideas about the solution and factors
affecting it.
Relationships: Understand the relationships between various features or entities within
the dataset.
Techniques Used in EDA
i. Descriptive Statistics - This involves generating summary statistics to describe the
features of the data.
Central Tendency: Calculating mean, median, and mode.
Variability/Spread: Calculating standard deviation, variance, and range.
ii. Data Visualization - This is essential for graphical exploration, allowing patterns to be
captured easily.
Bar Graphs/Histograms: Used to visualize the distribution of individual features.
Scatter Plots: Used to capture the relationship between two numerical variables.
20. What are the various types of data? Explain in detail with examples.
Classification by Structure
Data Type | Explanation | Example
Structured Data | Data that is well-organized in a pre-defined manner or fixed format. It typically resides in a relational database management system (RDBMS). | Customer information stored in a SQL table with columns like CustomerID, Name, Address.
Unstructured Data | Data that has no pre-defined structure, format, or sequence. It is the majority of data generated today. | Emails, images, audio files, video files, and social media posts.
Semi-structured Data | Data that does not conform to a formal RDBMS structure but contains tags or markers to separate and organize elements, making it easier to analyze than unstructured data. | JSON, XML, and HTML documents.
Classification by Measurement
Data Type | Explanation | Example
Discrete Data | Data that can be counted and has a finite, or countably infinite, number of possible values. It often results from counting. | The number of students in a class or the number of cars passing an intersection.
Continuous Data | Data that can take any value within a specified range. It is usually measured, not counted. | The time taken by athletes to complete a race, temperature, or a person's height or weight.
21. What are the measures of central tendency? Explain any two of them in brief.
Measures of Central Tendency are single values that attempt to describe a set of data
by identifying the central position within that set of data. They are also known as
measures of central location and are a key part of Descriptive Statistics, which aims to
summarize data to make it easier to understand.
The three most common measures of central tendency are:
Mean, Median, Mode
1. Mean (Arithmetic Mean): The mean, or average, is calculated by summing up all the
values in the data set and then dividing the sum by the total number of values. It is the
most commonly used measure of central tendency because it includes every value in
the data set in its calculation. The mean is best used for data that is roughly symmetric,
such as normally distributed data.
A major disadvantage is that the mean is sensitive to extreme values (outliers), which
can pull it upward or downward, making it a poor representative of the typical
value in a skewed distribution.
2. Median: The median is the middle value in a dataset when the values are arranged
in ascending or descending order. It divides the data distribution into two halves, with
50% of the observations on either side of the median value. The median is less affected
by outliers and skewed data than the mean and is therefore often the preferred
measure of central tendency when the distribution is not symmetrical.
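A short sketch with Python's standard statistics module shows how differently the two measures react to a single extreme value. The salary figures are invented for illustration.

```python
import statistics

salaries = [30, 32, 35, 38, 40]     # illustrative values, in thousands
with_outlier = salaries + [500]     # add one extreme value

print(statistics.mean(salaries), statistics.median(salaries))          # 35 35
print(statistics.mean(with_outlier), statistics.median(with_outlier))  # 112.5 36.5
```

The single outlier more than triples the mean, while the median moves only from 35 to 36.5.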
----------------------------------------------------------------------------------------------------------------
22. What do you mean by data discretization? Explain discretization by Histogram
analysis.
Data discretization is a data preprocessing technique characterized as a method of
translating attribute values of continuous data into a finite set of intervals (or bins) with
minimal information loss. It simplifies the original continuous data by substituting
interval marks for the actual numeric values. This transformation is crucial for
algorithms that only accept categorical attributes or to improve the performance of
other algorithms by reducing the number of values for a continuous attribute. The
technique is used to divide the continuous attributes into data with intervals.
Discretization by Histogram Analysis
Histogram analysis is an unsupervised discretization technique because it does not use
class (label) information to partition the data.
- Partitioning: In this method, a histogram distributes an attribute's observed values into
disjoint ranges called buckets or bins.
- Binning Rules: Various partitioning rules can be used to define the buckets:
Equal-width: The width of each bucket range is uniform (e.g., all bins span an
interval of 10 units).
Equal-frequency (or Equal-depth): The buckets are created such that, roughly, the
frequency (count of data samples) of each bucket is constant. This means each bucket
contains approximately the same number of contiguous data samples.
- Concept Hierarchy: The histogram analysis algorithm can be applied recursively to
each partition to automatically generate a multilevel concept hierarchy.
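The two binning rules can be sketched in plain Python as follows. The function names and the sample values are illustrative assumptions, and the equal-width version assumes the values are not all identical.

```python
def equal_width_bins(values, k):
    """Assign each value the index (0..k-1) of its equal-width bin."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k
    # The maximum value would fall into bin k, so it is clamped into the last bin.
    return [min(int((v - lo) / width), k - 1) for v in values]

def equal_frequency_bins(values, k):
    """Split the sorted values into k buckets with (nearly) equal counts."""
    ordered = sorted(values)
    n = len(ordered)
    return [ordered[i * n // k:(i + 1) * n // k] for i in range(k)]

ages = [5, 10, 11, 13, 15, 35, 50, 55, 72, 92, 204, 215]  # illustrative values

print(equal_width_bins(ages, 3))
# [0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 2, 2]
print(equal_frequency_bins(ages, 3))
# [[5, 10, 11, 13], [15, 35, 50, 55], [72, 92, 204, 215]]
```

On skewed data like this, equal-width binning puts most values into the first bin, whereas equal-frequency binning keeps the buckets balanced.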