
Key Concepts in Data Science and Analysis

The document provides definitions and explanations of various data science concepts, including primary data, data quality, outliers, data visualization, and data preprocessing. It highlights the importance of data cleaning, the role of statistics in data science, and the characteristics of structured and unstructured data. Additionally, it discusses tools and applications of data science, such as recommendation systems and fraud detection.

Uploaded by

raghvendra92
Copyright
© All Rights Reserved

What do you mean by primary data?

Primary data is original data collected for the first time by the researcher or
investigator for a specific purpose.
What do you mean by Data Quality?
Data Quality refers to the overall assessment of data's fitness to serve its purpose,
considering aspects like accuracy, completeness, consistency, reliability, and timeliness.
Define outlier. An outlier is an observation point or data value that lies an
abnormal distance from other values in a random sample from a population.
Define interquartile range. The interquartile range (IQR) is a measure of statistical
dispersion, calculated as the difference between the third quartile ($Q_3$) and the first
quartile ($Q_1$). ($IQR = Q_3 - Q_1$).
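As a minimal sketch of the outlier and IQR definitions above, the following Python snippet computes the quartiles of a small made-up sample and flags values that lie far from the rest. The sample values, the variable names, and the 1.5 × IQR fences (a common convention, not something stated in these notes) are assumptions for illustration; other quartile conventions can give slightly different cut points.

```python
import statistics

# Hypothetical sample, chosen only for illustration.
data = [2, 4, 4, 5, 7, 9, 10, 12, 40]

# quantiles(n=4) returns the three cut points Q1, Q2 (median), and Q3.
# method="inclusive" treats the smallest and largest values as the 0th
# and 100th percentiles of the data.
q1, q2, q3 = statistics.quantiles(data, n=4, method="inclusive")
iqr = q3 - q1  # IQR = Q3 - Q1

# Assumed convention: values more than 1.5 * IQR beyond the quartiles
# are treated as potential outliers.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = [x for x in data if x < lower_fence or x > upper_fence]

print("Q1:", q1, "Q3:", q3, "IQR:", iqr)
print("Outliers:", outliers)
```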
What do you mean by missing values?
Missing values are data points where the value for a variable is not stored in an
observation, often represented by NaN (Not a Number) or null.
What are the uses of Zip files?
Zip files are used to compress one or more files to reduce their total size and to bundle
multiple files together into a single file for easier portability and storage.
What do you mean by XML files data format?
XML (Extensible Markup Language) files are a data format that defines a set of rules for
encoding documents in a format that is both human-readable and machine-readable,
using tags to define elements.
Define data discretization. Data discretization is a process of converting continuous
data attributes into a finite set of intervals or categories (discrete values).
What is tag/word cloud? A tag or word cloud is a visual representation of text data,
where the importance or frequency of a word is shown by its size and color.
What is visual encoding? Visual encoding is the process of mapping data attributes
(like values, categories, etc.) to visual properties (such as position, size, shape, color,
and orientation) of graphical elements (marks) in a visualization.
Define Data Science. Data Science is an interdisciplinary field that uses scientific
methods, processes, algorithms, and systems to extract knowledge and insights from
structured and unstructured data.
Define data visualization. Data visualization is the graphic representation of data,
involving the creation of visual images (like charts, graphs, and maps) to communicate
information clearly and efficiently to users.
What is the purpose of data Visualization? The main purpose of data visualization
is to communicate insights, help users explore and understand complex data, identify
patterns, trends, and outliers, and support decision-making.
List any two data visualization tools. List the visualization libraries in Python.
Tools: Tableau, Power BI
Python libraries: Matplotlib, Seaborn
List any two applications of data Science.
Recommendation Systems (e.g., Netflix, Amazon)
Fraud Detection
What do you mean by data transformation?
Data transformation is the process of converting data from one format or structure into
another format or structure, often done to prepare data for mining or analysis (e.g.,
normalization, aggregation).
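Normalization, mentioned above as one kind of transformation, can be sketched in a few lines of Python. This example uses min-max scaling, which is only one of several normalization schemes; the function name and the sample values are hypothetical.

```python
def min_max_normalize(values):
    """Rescale values linearly so the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values are identical; avoid dividing by zero.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

incomes = [10, 20, 30, 40, 50]  # hypothetical values
normalized = min_max_normalize(incomes)
print(normalized)
```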
Define variance.
Variance is the average of the squared differences from the mean, measuring how far a
set of numbers is spread out from their average value.
What is nominal attribute?
A nominal attribute is a categorical attribute whose values are names or labels, where
the order among the values has no significance (e.g., hair color, marital status).
What is one hot encoding?
One-hot encoding is a process that converts categorical variables into a numerical
format that machine learning algorithms can understand. Each category value is
converted into a new binary column (0 or 1).
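A minimal pure-Python sketch of one-hot encoding follows. The function name and the sample column are hypothetical, and sorting the categories alphabetically is just one possible column order; libraries used in practice may order columns differently.

```python
def one_hot_encode(values):
    """Return (categories, rows) with one binary column per distinct category."""
    categories = sorted(set(values))
    rows = [[1 if v == c else 0 for c in categories] for v in values]
    return categories, rows

colors = ["red", "green", "blue", "green"]  # hypothetical categorical column
categories, rows = one_hot_encode(colors)

print(categories)
for color, row in zip(colors, rows):
    print(color, row)
```

Each row contains exactly one 1, in the column that matches the original category value.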
Define Bubble plot?
A Bubble plot is a variation of a scatter plot where the data points are replaced with
bubbles, and an additional third dimension of data is represented by the size of the
bubbles.
Define Standard deviation?
Standard deviation is a measure of the amount of variation or dispersion of a set of
values, calculated as the square root of the variance.
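The definitions of variance and standard deviation given in these notes can be checked with a short Python sketch. The data values are made up, and the example uses the population form (dividing by $n$), which matches "the average of the squared differences"; the sample form divides by $n - 1$ instead.

```python
import math
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical values

mean = statistics.mean(data)

# Population variance: the average of the squared differences from the mean.
variance = sum((x - mean) ** 2 for x in data) / len(data)

# Standard deviation: the square root of the variance.
std_dev = math.sqrt(variance)

# The standard library's population functions should agree with the manual result.
agrees = (math.isclose(variance, statistics.pvariance(data))
          and math.isclose(std_dev, statistics.pstdev(data)))

print(mean, variance, std_dev, agrees)
```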
List any two Social media data sources.
Twitter (now X)
Facebook
What is Open Data Source?
An Open Data Source refers to data that is freely available to everyone to use and
republish as they wish, without restrictions from copyright, patents, or other
mechanisms of control (e.g., government data portals).
Define Data wrangling.
Data wrangling (or Data Munging) is the process of transforming and mapping data
from a "raw" data form into another format with the intent of making it more
appropriate and valuable for a variety of downstream purposes such as analytics.
Why is data cleaning an important operation in data preprocessing?
Data cleaning is important because it corrects errors, handles missing values, smooths
noisy data, and resolves inconsistencies, ensuring the quality of the data used for
analysis, which directly impacts the accuracy and reliability of the results.
Define Statistical data analysis?
Statistical data analysis involves collecting, interpreting, and validating quantitative and
qualitative data to discover patterns, trends, and relationships, allowing for meaningful
conclusions and predictions.

What is CSV format?
CSV (Comma Separated Values) format is a simple file format used to store tabular data
(numbers and text) in plain text, where each line is a data record and each field within
the record is separated by a comma (or another delimiter).
Define Data source?
A Data source is the location, format, and structure from which data is collected or
retrieved (e.g., a database, a spreadsheet, a website, or a set of sensor readings).
Define Hypothesis Testing?
Hypothesis Testing is a statistical method used to determine whether there is enough
evidence in a sample of data to infer that a certain condition or hypothesis holds true
for the entire population.
Define Data cleaning?
Data cleaning is the process of detecting and correcting (or removing) corrupt,
inaccurate, incomplete, or irrelevant records from a dataset.
Define volume characteristic of data in reference to data science.
Volume refers to the immense amount of data generated every second, which is a
defining characteristic of Big Data, often measured in terabytes (TB), petabytes (PB),
and beyond.
Give examples of semistructured data.
Examples of semistructured data include XML files and JSON (JavaScript Object Notation)
files.
What is a quartile?
A quartile is one of the three points ($Q_1$, $Q_2$, $Q_3$) that divide a ranked dataset
into four equal parts, each containing 25% of the data. $Q_2$ is the median.
List different types of attributes.
Nominal (e.g., color)
Ordinal (e.g., size: small, medium, large)
Interval (e.g., temperature in Celsius)
Ratio (e.g., height, weight)
Define Data object.
A Data object (or record, sample, observation, or entity) is an entity that is described by
a set of attributes, representing a single instance or row of data in a dataset.
Write the tools used for geospatial data.
Tools used for geospatial data include GIS (Geographic Information Systems) software
like QGIS or ArcGIS.
State the methods of feature selection.
Methods of feature selection include Filter methods, Wrapper methods, and Embedded
methods.
List any two libraries used in Python for data analysis.
Pandas, NumPy, Matplotlib

Explain different applications of Data Science.
Data Science is applied across almost every industry to extract value and drive
decision-making. Two key applications are:
Recommendation Systems: Used by e-commerce platforms (Amazon), streaming
services (Netflix, Spotify), and social media to predict user preferences and suggest
products, movies, or music. This drives engagement and sales.
Fraud Detection: Financial institutions use Data Science models (often supervised
learning) to analyze transaction patterns in real-time. The models identify anomalous
behavior that deviates from normal activity, flagging potentially fraudulent transactions.

Explain null and alternate hypothesis.


In statistical hypothesis testing:
Null Hypothesis ($H_0$): This is the default or status-quo assumption. It states that
there is no statistically significant relationship between two variables or that there is no
difference between two population means. We try to find evidence against the null
hypothesis.
Alternate Hypothesis ($H_1$ or $H_a$): This is the claim or statement that the
researcher is trying to prove. It states that there is a statistically significant relationship
or difference between the variables. Rejecting the null hypothesis supports the
alternate hypothesis.
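To make the two hypotheses concrete, here is a hedged sketch of an exact two-sided test for a coin. The scenario (60 heads in 100 flips), the significance level of 0.05, and the function name are all assumptions for illustration. Here $H_0$ is "the coin is fair" and $H_1$ is "the coin is not fair".

```python
from math import comb

def two_sided_p_value_fair_coin(heads, flips):
    """Exact two-sided p-value under H0: P(heads) = 0.5.

    Sums the probability of every outcome at least as far from the
    expected count as the observed one (valid because the distribution
    is symmetric when p = 0.5).
    """
    expected = flips / 2
    distance = abs(heads - expected)
    extreme = [k for k in range(flips + 1) if abs(k - expected) >= distance]
    return sum(comb(flips, k) for k in extreme) / 2 ** flips

alpha = 0.05  # assumed significance level
p_value = two_sided_p_value_fair_coin(heads=60, flips=100)
reject_null = p_value < alpha

print(f"p-value = {p_value:.4f}, reject H0: {reject_null}")
```

If the p-value is below the chosen significance level, the null hypothesis is rejected in favour of the alternate hypothesis; otherwise there is not enough evidence to reject it.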

What do you mean by Noisy data? Explain any two causes of noisy data.
Noisy data is data that has been distorted or that contains random error or variance
in a measured variable. Such data is difficult for standard statistical or machine
learning methods to interpret correctly, and it reduces the accuracy of the model.
Two causes of noisy data are:
Faulty Data Collection Instruments: Malfunctioning sensors or instruments used during
data gathering can introduce random errors or inaccuracies into the collected data.
Data Entry Errors: Human mistakes during manual input or recording of data (e.g., typos,
misplaced decimal points) can lead to spurious or incorrect values.

What do you mean by data visualization, and what is its purpose? Give examples of any
two data visualization libraries in Python.
Data Visualization is the graphic representation of data. Its purpose is to efficiently
and clearly communicate insights, trends, and patterns hidden in the data. It transforms
complex datasets into accessible visual formats (charts, graphs, maps) that facilitate
understanding and decision-making for a wide audience.
Two data visualization libraries in Python are:
Matplotlib and Seaborn

Explain 3V's of data science.
The 3 V's are core characteristics used to describe Big Data:
Volume: Refers to the massive quantity of data generated daily from various sources.
The sheer size of the data requires unique processing and storage solutions (e.g., data
measured in terabytes or petabytes).
Velocity: Refers to the speed at which data is generated, collected, and processed.
High-velocity data often requires real-time or near real-time processing (e.g., streaming
data from social media feeds or sensor networks).
Variety: Refers to the many different types of data structures, including structured
(relational databases), semi-structured (XML, JSON), and unstructured data (text, video,
images).
What do you understand by structured and unstructured data? Differentiate them.
Structured Data: Data that is highly organized and follows a predefined format or model.
It typically resides in relational databases (RDBMS) where it is stored in tables with rows
and columns. Examples include names, dates, and transaction amounts.
Unstructured Data: Data that does not have a predefined format or structure. It is
challenging to process and analyze using traditional methods. Examples include social
media posts, emails, videos, and images.
Feature | Structured Data | Unstructured Data
Organization | Highly organized (tabular format) | No predefined organization
Schema | Fixed and defined schema | No fixed schema
Querying | Easy using SQL | Difficult; requires specialized tools
Storage | Relational databases | Data Lakes, NoSQL databases

What is inferential statistics?


Inferential statistics is a branch of statistics that uses a sample of data to draw
conclusions or make inferences about the larger population from which the sample was
drawn. It involves techniques like hypothesis testing, confidence intervals, and
regression analysis to estimate population parameters and test theories.

Define data discretization.


Data discretization is the process of dividing the range of a continuous numerical
attribute into a set of contiguous intervals or categories (bins). This simplifies the data
by replacing numeric values with interval labels, making the data easier to understand
and use in certain types of algorithms.
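The binning described above can be sketched as follows. The cut points, the interval labels, and the sample ages are assumptions chosen for illustration; real projects often pick bins by equal width, equal frequency, or domain knowledge.

```python
from bisect import bisect_right

# Assumed intervals: [0, 18) child, [18, 65) adult, 65 and above senior.
cut_points = [18, 65]
labels = ["child", "adult", "senior"]

def discretize(value):
    """Map a continuous value to the label of the interval that contains it."""
    return labels[bisect_right(cut_points, value)]

ages = [3, 17, 18, 41, 64, 80]  # hypothetical continuous attribute
binned = [discretize(a) for a in ages]
print(binned)
```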
Define statistical data analysis?
Statistical data analysis is the process of applying statistical methods and techniques to
a set of data to interpret, examine, summarize, and draw meaningful conclusions. It
involves descriptive statistics (summarizing data using mean, median, etc.) and
inferential statistics (making predictions or generalizations about a population based on
a sample).
What is visual encoding?
Visual encoding is the process of mapping data variables (attributes) to visual properties
of graphical elements (called marks, like points, lines, or areas) to create a
visualization. These visual properties, such as position, size, color, shape, and texture,
serve as the visual language for representing the data and its relationships.

What do you mean by data preprocessing? Give its importance.


Data preprocessing is a crucial stage in the data science pipeline that involves preparing
and transforming raw data into an understandable, clean, and suitable format for
analysis or machine learning.
Importance:
Data preprocessing is vital because real-world data is often incomplete, inconsistent,
and noisy. Using raw, messy data leads to poor model performance and unreliable
results. Preprocessing techniques like cleaning, integration, transformation, and
reduction ensure data quality, which directly impacts the accuracy, efficiency, and
validity of the final data analysis or model predictions.

What are the uses of XML files?


XML (Extensible Markup Language) files are used primarily for:
Data Transmission: They act as a common format for exchanging structured data
between different systems, applications, and organizations over the internet, serving a
similar purpose to JSON.
Data Structuring and Storage: XML uses customizable tags to define the structure of the
data, making it both human-readable and machine-readable for complex, hierarchical
data storage.
Configuration Files: Many software applications use XML files to store settings and
configuration parameters due to their structured and readable nature.
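As an illustration of XML being machine-readable, the sketch below parses a small document with Python's standard `xml.etree.ElementTree` module. The document itself, including the `library`, `book`, `title`, and `year` tag names, is invented for this example.

```python
import xml.etree.ElementTree as ET

# Hypothetical XML document; the tag and attribute names are made up.
xml_text = """
<library>
  <book id="1"><title>Data Science Basics</title><year>2020</year></book>
  <book id="2"><title>Statistics Primer</title><year>2018</year></book>
</library>
""".strip()

root = ET.fromstring(xml_text)

# Walk the hierarchy: each <book> element becomes one Python dictionary.
books = []
for book in root.findall("book"):
    books.append({
        "id": book.get("id"),                 # attribute value
        "title": book.findtext("title"),      # text of a child element
        "year": int(book.findtext("year")),   # element text is always a string
    })

print(books)
```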

List the tools for a data scientist.


The tools used by a data scientist can be broadly categorized as:
Programming Languages: Python (with libraries like Pandas, NumPy, Scikit-learn), R.
Databases and Storage: SQL (for relational databases), NoSQL databases (MongoDB),
Cloud Storage (AWS S3, Google Cloud Storage).
Machine Learning/Deep Learning Frameworks: TensorFlow, PyTorch, Scikit-learn.
Data Visualization Tools: Tableau, Power BI, Python/R libraries (Matplotlib, ggplot2).
Big Data Platforms: Apache Spark, Hadoop.
What is data cube?
A data cube is a multi-dimensional structure used in OLAP (Online Analytical Processing)
systems and data warehouses. It allows for data to be viewed and analyzed along
multiple dimensions simultaneously. For example, a sales data cube might allow
analysis by product, region, and time, with aggregated values (like total sales amount)
stored at the intersection of these dimensions.
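The sales example above can be sketched in plain Python. This is only a toy model of a cube, not how an OLAP engine stores data; the sales figures and the `aggregate` helper are hypothetical.

```python
from collections import defaultdict

# Hypothetical sales facts: (product, region, quarter, amount).
sales = [
    ("laptop", "north", "Q1", 1000),
    ("laptop", "south", "Q1", 1500),
    ("phone", "north", "Q1", 700),
    ("laptop", "north", "Q2", 1200),
    ("phone", "south", "Q2", 900),
]

def aggregate(facts, dims):
    """Total the amount over the chosen dimensions (0=product, 1=region, 2=quarter)."""
    totals = defaultdict(int)
    for fact in facts:
        key = tuple(fact[d] for d in dims)
        totals[key] += fact[3]
    return dict(totals)

cell_totals = aggregate(sales, dims=(0, 1, 2))      # one value per cube cell
by_product = aggregate(sales, dims=(0,))            # roll up region and time
by_region_quarter = aggregate(sales, dims=(1, 2))   # roll up product

print(by_product)
print(by_region_quarter)
```

Choosing fewer dimensions corresponds to rolling the cube up to a coarser level of aggregation.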

Explain any two ways in which data is stored in files.


CSV (Comma Separated Values) Format: Data is stored as plain text in a tabular format.
Each line in the file is a data record, and fields (attributes) within a record are separated
by a delimiter, most commonly a comma. This format is simple, widely supported, and
excellent for basic data exchange.
JSON (JavaScript Object Notation) Format: Data is stored in a structured,
human-readable format that uses key-value pairs and ordered lists (arrays) to represent
hierarchical data. It is a language-independent format widely used for transmitting data
between a server and web application.
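The two formats can be compared side by side with Python's standard `csv` and `json` modules. The records below are made up, and the text is read from in-memory strings rather than real files to keep the sketch self-contained.

```python
import csv
import io
import json

# The same two hypothetical records, once as CSV text and once as JSON text.
csv_text = "name,age\nAsha,30\nRavi,25\n"
json_text = '[{"name": "Asha", "age": 30}, {"name": "Ravi", "age": 25}]'

# CSV: every field is read as a string, so numbers need explicit conversion.
csv_rows = [
    {"name": row["name"], "age": int(row["age"])}
    for row in csv.DictReader(io.StringIO(csv_text))
]

# JSON: key-value pairs and arrays map directly to dicts and lists,
# and numbers keep their type.
json_rows = json.loads(json_text)

print(csv_rows == json_rows)
```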

Explain role of statistics in data science.


Statistics forms the foundation of Data Science. Its role includes:
Exploratory Data Analysis (EDA): Using descriptive statistics (mean, variance, quartiles)
to summarize and understand the main characteristics of a dataset, identify anomalies,
and guide feature engineering.
Model Building and Evaluation: Providing the mathematical basis for machine learning
algorithms (e.g., linear regression, hypothesis testing, probability distributions).
Statistics is also crucial for evaluating model performance and determining if results are
statistically significant.

Explain two methods of data cleaning for missing values.


Missing values must be addressed as they can lead to biased or inefficient analysis.
Two methods are:
Imputation (Filling the Missing Value): This involves replacing the missing value with a
calculated value. Common imputation techniques include:
Mean/Median Imputation: Replacing a missing value with the mean (for symmetrically
distributed data) or median (for skewed data) of that attribute.
Ignoring or Deleting the Tuples/Records: If the number of missing values is very small
and the data records (rows) with missing values are not essential, the analyst may
simply delete those records. This is only advisable when the deletion won't lead to a
significant loss of information or bias in the remaining data.
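Both methods can be sketched on a single column. The ages below are hypothetical, and missing entries are represented as `None` for simplicity; in practice they often appear as NaN.

```python
import statistics

# Hypothetical column with missing entries represented as None.
ages = [25, None, 31, 28, None, 90]

observed = [a for a in ages if a is not None]

# Method 1a: mean imputation (pulled upward here by the extreme value 90).
mean_value = statistics.mean(observed)
mean_imputed = [a if a is not None else mean_value for a in ages]

# Method 1b: median imputation (more robust for skewed data).
median_value = statistics.median(observed)
median_imputed = [a if a is not None else median_value for a in ages]

# Method 2: delete the records that contain a missing value.
deleted = [a for a in ages if a is not None]

print(mean_imputed)
print(median_imputed)
print(deleted)
```

In this sample the mean and median differ noticeably, which illustrates why the median is preferred when the attribute is skewed.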

Explain the tools in a data scientist's toolbox.
The data scientist's toolbox is a collection of software, languages, and frameworks
necessary for the entire data science workflow:
Programming & Libraries: Tools like Python and R provide the environment, while
libraries like Pandas (for data manipulation), NumPy (for numerical computing), and
Scikit-learn (for machine learning) provide specialized functions.
Visualization Tools: Tools like Tableau or Matplotlib/Seaborn allow for the creation of
charts and dashboards to understand and communicate data insights effectively.

Write a short note on wordclouds.


A word cloud (or tag cloud) is a simple visual representation of text data. It displays the
frequency of words in a body of text where the size and often the color of each word
are proportional to its frequency or importance. The larger a word appears, the more
often it was used in the source text. Word clouds are a quick and effective tool used in
text mining and analytics for visualizing term prominence and giving an immediate
impression of the central themes of a document or set of documents.
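The core computation behind a word cloud, counting words and scaling their display size, can be sketched without any plotting library. The text, the tiny stop-word list, and the font-size range are assumptions; the snippet only computes sizes and does not draw the cloud.

```python
import re
from collections import Counter

text = "data science uses data to find patterns in data and science"  # made-up text
stop_words = {"to", "in", "and"}  # tiny illustrative stop-word list

# Tokenize, lowercase, and drop stop words before counting.
words = [w for w in re.findall(r"[a-z]+", text.lower()) if w not in stop_words]
counts = Counter(words)

# Scale each word's font size linearly with its frequency,
# so the most frequent word gets the assumed maximum size.
min_size, max_size = 10, 40
top = max(counts.values())
font_sizes = {
    word: min_size + (max_size - min_size) * count / top
    for word, count in counts.items()
}

print(counts.most_common(3))
print(font_sizes)
```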

List different types of data attributes with examples.

Attributes (variables) can be classified into two main types:

Type | Sub-Type | Description | Example
Categorical (Qualitative) | Nominal | Values are names or labels; order does not matter. | Hair Color (Black, Brown, Blonde)
Categorical (Qualitative) | Ordinal | Values have a meaningful order, but the intervals between them are not fixed. | Educational Level (High School, Bachelor's, Master's)
Numerical (Quantitative) | Interval | Order is meaningful, intervals are fixed, but zero is arbitrary (no true zero). | Temperature in Celsius ($0^\circ C$ is not absence of heat)
Numerical (Quantitative) | Ratio | Order and intervals are fixed, and zero represents the complete absence of the quantity. | Height (A height of 0 means no height)

1. Explain data science life cycle with suitable diagram.
The Data Science Life Cycle outlines the steps or phases that data science projects
typically follow from start to finish. It provides a framework for carrying out each
phase well, through to the project's completion.
The typical steps in the data science life cycle include:
Step 1: Setting Goal
The entire cycle revolves around a business or research goal. It is essential to clearly
understand the business objective to set a specific goal for the analysis.
Step 2: Data Understanding
This step involves the collection of all available data. It requires describing and
understanding the data, including its structure, relevance, and data type.
Step 3: Data Preparation
This is often the most time-consuming yet important step in the entire life cycle. It
includes selecting relevant data, integrating (merging) datasets, cleaning data (treating
missing or erroneous values, handling outliers), and constructing new data or deriving
new features.
Step 4: Exploratory Data Analysis (EDA)
The process of getting an idea about the solution before building the actual model. It
involves exploring the distribution and relationships within the data using graphical
representations such as bar graphs, scatter plots, and heat maps.
Step 5: Data Modeling
This is the heart of data analysis. A model takes the prepared data as input and
provides the desired output.
Step 6: Model Evaluation
Evaluation is a crucial step in which the model's performance is rigorously tested
before deployment.
Step 7: Model Deployment
The final step, where the evaluated model is deployed in the desired format and channel.
One of the goals is to convince the business that the findings will change the business
process as expected.
2. Explain 3 V's of data science with diagram.
The 3 V's of data science—Volume, Velocity, and Variety—epitomize the expansion of
data that has occurred at the turn of the twenty-first century. These three specific
characteristics define the challenges and opportunities of Big Data.
Volume
Refers to the increasing size and scope of the data.
Explanation: Data collections today are massive, which requires complex and powerful
algorithms to handle, process, and analyze.
Velocity
Refers to the speed at which data is accumulated or acquired.
Explanation: Data is generated at an extremely fast and unprecedented speed, requiring
efficient processing capabilities.
Variety
Refers to the diverse types of data that are available.
Explanation: Data comes in a massive array of types, including structured, unstructured
(like text, images, and video), and semi-structured formats.
3. Explain concept and use of data visualisation.
Concept of Data Visualization
Data Visualization is the graphical representation of information and data. It is a process
that transforms the representation of raw data into meaningful insights in a visual
format. The representation can be viewed as a mapping between the original, usually
numerical, data and graphic elements like lines or points in a chart.
Data visualization is essential because the human brain processes visual content better
than plain textual information. By using visual elements like charts, graphs, and maps,
visualization tools provide an accessible way to understand complex data.
Uses (Advantages) of Data Visualization
Data visualization provides numerous advantages in data analysis, including:
Identifying Patterns and Trends: It makes it easier for humans to detect trends,
patterns, correlations, and outliers in a group of data. Trend and time-series analysis
is in high demand, for instance in studying the stock market.
Facilitating Decision-Making: A simple visualization, built with credible data, can help
businesses or organizations make quick business decisions.
Understanding Big Data: It helps humans understand the big picture of massive
datasets using a small, impactful visualization.
Detecting Outliers: With the help of visualization, outliers can be easily detected and
removed from the dataset to prevent misleading analysis and incorrect results.
Improving Response Time: Visualization gives a quick glance of the entire data, which
allows analysts to quickly identify issues and thus improve response time.
4. What are the components of data scientist tool box? Explain two of them in detail.
The Data Scientist's Toolbox consists of various tools, techniques, and skills required by
a data scientist to extract, manipulate, pre-process, and generate predictions from data.
The components can be broadly categorized as:
Programming Languages (e.g., R, Python)
Statistical and Mathematical Tools (e.g., SAS)
Data Visualization Software (e.g., Tableau)
Big Data Processing Frameworks (e.g., Apache Spark)
Here are two components/tools explained in detail:
1. R Programming
R is an open-source programming tool.
Detail: It is frequently used in data science for data handling and manipulation. It
provides a powerful environment for statistical computing and graphics.
2. Tableau
Tableau is a powerful data visualization software.
Detail: It is a widely used data science tool due to its simplicity and the ability to provide
an easy interpretation of complex data analytical tasks.

5. Explain different data formats in brief.
Data Format | Brief Explanation
CSV Files | Comma Separated Values. A text file where the values in the columns are separated by a comma. Reading from this file type is a fundamental necessity in data science.
JSON Files | JavaScript Object Notation. A lightweight, text-based open standard designed for exchanging data over the web. It stores data as text in a human-readable format.
XML Files | Extensible Markup Language. An example of semi-structured data used to store and transport data, often on the web.
Tar Files | Derived from "tape archive." Used for collecting many files into one single archive file.
Zip/GZip Files | These are two very popular methods of compressing files. They are used to save space or to reduce the time needed to transmit files across a network or the internet.
Image Files | A standard way to organize and store image data. They can be in formats like Rasterized or Vectorized.
Integers/Floats/Text Data | Basic data types representing whole numbers, decimal numbers, and character-based information, respectively.

6. What is data quality? Which factors affect data quality?


Data Quality refers to the degree to which data is fit for use and meets the
requirements of the task. Good data quality is characterized by data that is accurate,
consistent, and complete, the opposite of the factors that degrade it. The purpose of
data preprocessing is to ensure high data quality. The key factors (or problems) that
negatively affect data quality are:
Inaccuracy (Noisy Data):
Data is incorrect or corrupted. This can manifest as errors or outliers (e.g., an age value
of 80 when the range should be 22-45 years).
Factors/Reasons: Faulty data collection instruments, human or computer errors during
data entry, users purposely submitting incorrect data (known as disguised missing
data), errors in data transmission, or technology limitations.
Inconsistency:
Data contains incorrect and redundant entries due to non-uniformity.
Factors/Reasons: Inconsistencies in naming conventions or data codes, inconsistent
formats for input fields (e.g., date formats), and the presence of duplicate tuples.
Incompleteness (Missing Values):
The data lacks certain stored values, which are known as missing values.
Factors/Reasons: This occurs when data records are missing entries or NULLs for
required fields.
7. What do you mean by Data attributes? Explain types of attributes with example.
Data Attributes (Variables)
An attribute is a property or characteristic of an object. A data attribute is a single value
descriptor for a data object. It is also known as a variable, field, characteristic, or feature.
Example: For a person (the data object), the attributes might be their eye color or
name.
Types of Attributes
Attributes are broadly categorized based on the types of values they can hold and the
operations that can be performed on them.
Nominal Attributes:
Values are categories or names, where order does not matter (no meaningful ranking).
Only the operation of Equality is supported.
Example: Hair color (black, brown, blonde) or Marital Status (single, married, divorced).
Binary Attributes:
A nominal attribute with only two possible states (0 or 1, true or false, yes or no).
Example: Smoker (Yes/No) or Gender (Male/Female). They can be symmetric (both
states equally important) or asymmetric (one state is more important, e.g., presence of
a disease).
Ordinal Attributes:
Values have a meaningful order or rank, but the difference (magnitude) between
successive values is unknown or inconsistent. Supports Equality and Order operations.
Example: Drink size (Small, Medium, Large) or military rank.
Numeric Attributes:
Attributes where values are real numbers and have quantitative meaning. They
support all arithmetic operations. The syllabus also breaks them down into:
Discrete Attributes: Have a finite or countably infinite set of values, often
represented as integers. Example: The number of cars a person owns, or the values 0
and 1 for binary attributes.
Continuous Attributes: Have a potentially infinite number of values within a given
range. They are real-valued. Example: Height, weight, or temperature.

8. Write a short note on different methods for measuring data similarity and
dissimilarity.
Similarity and Dissimilarity (Proximity Measures) are fundamental to data science
applications like clustering and nearest-neighbor classification, as they quantify how
alike or unlike data objects are.
Similarity Measure: Measures how related or close data samples are to each other. A
value of 1 indicates complete similarity.
Dissimilarity Measure: Measures how distinct data objects are. It is also known as a
distance measure. A value of 0 indicates the objects are identical.
Different methods are used based on the type of data attribute:
Dissimilarity of Numeric Data (Distance Measures): These methods measure the
"distance" between two data points (xi and xj) in a multi-dimensional space, which
corresponds to the degree of dissimilarity.
Euclidean Distance: The most common straight-line distance (L2 norm).
Manhattan Distance: Also known as city-block distance (L1 norm), calculated by
summing the absolute differences of the coordinates.
Minkowski Distance: A generalization of both Euclidean and Manhattan distances.
Proximity Measures for Binary Attributes: These methods are used for attributes with
only two states (0 or 1). Special measures are used depending on whether the attributes
are symmetric (both 0 and 1 are equally important, e.g., gender) or asymmetric (one
state is much more significant, often the '1' state, e.g., presence of a specific test result).
Proximity Measures for Nominal Attributes: Measures the similarity between
categorical values where the order does not matter (e.g., matching coefficient).
Proximity Measures for Ordinal Attributes: These measures typically use the ranks of
the ordinal values instead of the values themselves to calculate the proximity (e.g., rank
correlation).
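The distance measures for numeric data and the matching coefficient for nominal data described above can be sketched in Python as follows. This is a minimal illustration, not a reference implementation: the function names and the sample points are assumptions made for this example, and the matching coefficient is taken here as the fraction of attributes on which two objects agree.

```python
def minkowski_distance(x, y, p):
    """Minkowski distance of order p between two equal-length numeric points."""
    if len(x) != len(y):
        raise ValueError("points must have the same number of coordinates")
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1 / p)


def manhattan_distance(x, y):
    """City-block (L1) distance: Minkowski distance with p = 1."""
    return minkowski_distance(x, y, 1)


def euclidean_distance(x, y):
    """Straight-line (L2) distance: Minkowski distance with p = 2."""
    return minkowski_distance(x, y, 2)


def simple_matching_similarity(x, y):
    """Fraction of attributes on which two nominal (or symmetric binary) objects agree."""
    if len(x) != len(y):
        raise ValueError("objects must have the same number of attributes")
    matches = sum(1 for a, b in zip(x, y) if a == b)
    return matches / len(x)


# Hypothetical sample points used only for illustration.
xi = (1, 2)
xj = (4, 6)

print(manhattan_distance(xi, xj))
print(euclidean_distance(xi, xj))
print(simple_matching_similarity(("red", "small", "round"), ("red", "large", "round")))
```

For the points (1, 2) and (4, 6), the coordinate differences are 3 and 4, so the Manhattan distance is 7 and the Euclidean distance is 5.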
9. Write a short note on feature extraction.
Feature Extraction is a crucial part of Dimensionality Reduction, a process used to
reduce the number of variables (features) in a dataset.
Concept and Purpose: Feature extraction is the process of reducing the data from a
high-dimensional space to a lower-dimensional space. Its goal is to create a new,
smaller set of features that retains the most useful information from the original large
set of features. Unlike feature selection, which chooses a subset of existing features,
feature extraction creates new features by combining or transforming the old ones.
Methods: There are several methods used for feature extraction, including:
Principal Component Analysis (PCA): A popular unsupervised method that identifies the
directions (principal components) along which the data varies the most, projecting the
data onto a smaller dimensional subspace.
Linear Discriminant Analysis (LDA): A supervised method that attempts to find a feature
subspace that maximizes the separation between different classes.
Generalized Discriminant Analysis (GDA): A nonlinear extension of discriminant analysis
that uses kernel functions.
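As a rough sketch of the idea behind PCA, the code below finds the first principal component of a small dataset and projects the data onto it. It assumes a tiny hypothetical dataset and uses power iteration on the covariance matrix for simplicity; library implementations typically rely on other numerical methods (such as singular value decomposition), and the function name is illustrative only.

```python
import math


def first_principal_component(rows, iterations=200):
    """Return (unit direction of maximum variance, centered rows)."""
    n = len(rows)
    d = len(rows[0])
    # Center each feature by subtracting its mean.
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    # Sample covariance matrix of the centered data.
    cov = [
        [sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(d)]
        for i in range(d)
    ]
    # Power iteration: repeatedly multiply by the covariance matrix and
    # renormalize; the vector approaches the dominant eigenvector.
    v = [1.0] * d
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return v, centered


# Hypothetical two-feature dataset whose points lie close to a straight line.
data = [(2.0, 1.9), (4.0, 4.1), (6.0, 6.2), (8.0, 7.8)]

component, centered = first_principal_component(data)
# Project each 2-D point onto the component, giving one new feature per row.
projected = [sum(c[j] * component[j] for j in range(len(component))) for c in centered]
projected_variance = sum(p * p for p in projected) / (len(projected) - 1)

print(component)
print(projected)
```

Here two original features are replaced by a single new feature (the projection), which is a combination of the old ones rather than a subset of them.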
10. What are the different methods for measuring the data dispersion?
Measures of Dispersion (or Variability) indicate the degree to which individual data
values scatter or vary around the average (central tendency). They are necessary to
assess how representative the average value is.
The different methods for measuring data dispersion are:
Range: The simplest measure of dispersion, calculated as the difference between the
maximum and minimum values in a dataset.
Formula (in practice): Range = Max value - Min value.
Variance: Measures the average of the squared differences from the mean. It
provides a measure of how far the numbers in a set are spread out from their average
value. Calculation: Subtract the mean from each value, square each difference, and
average the squared differences. Equivalently, the variance is the square of the
standard deviation.
Standard Deviation (SD): The square root of the variance. It is the most widely used
measure, as it is expressed in the same units as the original data, making it easier to
interpret. It quantifies the amount of variation or dispersion of a set of data values.
Interquartile Range (IQR): Measures the spread of the middle 50% of the data. It is
the difference between the third quartile (Q3) and the first quartile (Q1).
Calculation: IQR = Q3 - Q1. It is a resistant measure, meaning it is less affected by
outliers than the Range or Standard Deviation.
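The four measures above can be computed with Python's standard `statistics` module, as in this short sketch. The sample values are hypothetical, the variance shown is the population variance (the plain average of squared differences), and the quartiles use the module's default method; other quartile conventions can give slightly different IQR values.

```python
import statistics

# Hypothetical sample values used only for illustration.
values = [4, 8, 15, 16, 23, 42]

# Range: difference between the largest and smallest value.
data_range = max(values) - min(values)

# Population variance: average of squared differences from the mean.
variance = statistics.pvariance(values)

# Standard deviation: square root of the variance, in the data's own units.
std_dev = statistics.pstdev(values)

# Quartiles (default method of statistics.quantiles) and the IQR.
q1, q2, q3 = statistics.quantiles(values, n=4)
iqr = q3 - q1

print(data_range, variance, std_dev, iqr)
```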
11. Explain any four data Visualization tools?
Data visualization tools are applications that help users create visual representations of
data (charts, graphs, maps, etc.) to communicate insights effectively.
Here are four common data visualization tools:
1. Tableau: A powerful, highly popular business intelligence (BI) tool known for its
ease of use and ability to create interactive and visually appealing dashboards and
visualizations without requiring coding.
2. Microsoft Power BI: A suite of business analytics tools provided by Microsoft. It is
excellent for data modeling, preparation, and creating interactive reports and
dashboards, especially within the Microsoft ecosystem.
3. R: Although R is a programming language, it is also a primary tool for statistical
computing and graphics. The `ggplot2` library within R is one of the most respected
visualization packages, enabling the creation of complex and aesthetically pleasing plots
based on the grammar of graphics.
4. Google Charts: A free, web-based tool provided by Google that allows developers
to create a wide variety of interactive charts and graphs to embed on their websites. It
uses a JavaScript API and relies on HTML5 and SVG.

12. What do you mean by data Transformation? State and explain 4 data
transformation techniques/strategies.
Data Transformation
Data transformation is the process of converting raw data into a format or structure
that is more suitable for data analysis. This process is important because it directly
affects the quality of the final analysis results.
Four Data Transformation Techniques/Strategies:
Rescaling - Rescaling means transforming the data so that all values lie between a
specified minimum and maximum value, typically a range like 0-100 or 0-1.
This technique helps to compare different variables on equal footing, especially when
attributes have varying scales.
Normalizing - Normalization is a technique that brings numerical data onto a common
scale, which can also make outliers or invalid values easier to identify.
It addresses issues where the use of different measurement units (e.g., meters vs.
inches for height) can affect data analysis and lead to very different results.
In one specific type of normalization, the data is rescaled so that each row (observation)
has a length of 1 (called a unit norm in linear algebra).
Binarizing - Binarizing is the process of converting data to either 0 or 1 based on a
threshold value.
Label Encoding - Label encoding is used to convert textual labels (categorical features)
into a numeric form so that they can be used in a machine-readable format.
The categories are assigned a value from 0 to (n-1), where 'n' is the number of distinct
values for that particular categorical feature.
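The four techniques above can be sketched in plain Python as follows. This is an illustrative sketch only: the function names, the sample inputs, the choice of "greater than the threshold" for binarizing, and the alphabetical ordering of labels are all assumptions made for this example.

```python
import math


def rescale(values, new_min=0.0, new_max=1.0):
    """Min-max rescaling: map values linearly into [new_min, new_max]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        raise ValueError("cannot rescale when all values are equal")
    return [new_min + (v - lo) * (new_max - new_min) / (hi - lo) for v in values]


def normalize_row(row):
    """Scale one observation (row) so that its Euclidean length is 1."""
    length = math.sqrt(sum(v * v for v in row))
    if length == 0:
        raise ValueError("cannot normalize a row of zeros")
    return [v / length for v in row]


def binarize(values, threshold):
    """Map each value to 1 if it is above the threshold, otherwise 0."""
    return [1 if v > threshold else 0 for v in values]


def label_encode(labels):
    """Assign codes 0..n-1 to the n distinct labels (alphabetical order assumed)."""
    mapping = {label: code for code, label in enumerate(sorted(set(labels)))}
    return [mapping[label] for label in labels], mapping


print(rescale([10, 20, 30]))
print(normalize_row([3, 4]))
print(binarize([0.2, 0.7, 0.5], threshold=0.5))
print(label_encode(["red", "green", "blue", "green"]))
```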
_________________________________________________________________
13. Explain data visualization libraries in Python.
Python's ecosystem provides several powerful and widely used libraries for data
visualization, enabling users to create static, interactive, and animated plots.
1. Matplotlib:
The foundational and most widely used plotting library for Python. It is used to generate
figures, plots, histograms, power spectra, bar charts, error charts, and scatter plots. It
offers fine-grained control over every element of a figure but can sometimes be verbose
for complex plots.
2. Seaborn:
Built on top of Matplotlib, Seaborn is specialized for creating statistical graphics. It
provides a high-level interface for drawing attractive and informative statistical graphics,
often with a more sophisticated visual appeal and fewer lines of code compared to
Matplotlib.
3. Plotly:
A library used to create interactive and high-quality visualizations that can be displayed
in web browsers. It supports various chart types, including 3D plots, statistical charts,
and financial charts. It is excellent for creating shareable, interactive dashboards.
4. Bokeh:
Another powerful library for creating interactive visualizations for modern web
browsers. It aims to provide elegant, concise construction of versatile graphics, and is
particularly good for creating large-scale data dashboards.
14. What is Outlier? Explain types/detection methods of outliers.
Outlier - An outlier is a data point that differs significantly from other observations.
They are extreme values that deviate from other observations and may indicate
variability in a measurement, experimental errors, or a novelty.
Detection Methods of Outliers
Outliers can be detected using various statistical, visual, or algorithmic methods:
1. Statistical Methods (Z-score and IQR):
Z-score Method: Measures how many standard deviations a data point is away from the
mean. Data points whose absolute Z-score exceeds a chosen threshold (e.g., a Z-score
above 3 or below -3) are considered outliers.
Interquartile Range (IQR) Method: An observation is identified as an outlier if it falls
more than 1.5 times the IQR below the first quartile (Q1) or above the third quartile (Q3),
that is, below Q1 - 1.5 × IQR or above Q3 + 1.5 × IQR. This method is visually
represented by a Box Plot.
2. Visualization Methods:
Box Plots: A visual representation of data distribution that clearly shows the quartiles,
median, and marks points outside the whiskers as potential outliers.
Scatter Plots: Data points that lie far away from the general cluster of points can be
visually identified as outliers.
3. Distance-based Methods:
Identifies outliers as those data points that are far away from their nearest neighbors.
For instance, a point could be classified as an outlier if fewer than a specified fraction
of all points lie within a specified distance from it.
4. Model-based Methods (Clustering):
Uses clustering algorithms (like K-Means) where data points that are significantly far
from their cluster centroid, or that do not belong to any cluster in algorithms that allow
unassigned points, are flagged as potential outliers.
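The two statistical detection methods above (Z-score and IQR) can be sketched with the standard `statistics` module. The readings are hypothetical, the thresholds 3 and 1.5 are the commonly used values mentioned above, and the function names are illustrative; quartiles use the module's default method.

```python
import statistics


def zscore_outliers(values, threshold=3.0):
    """Return values whose absolute Z-score exceeds the threshold."""
    mean = statistics.mean(values)
    sd = statistics.pstdev(values)
    if sd == 0:
        return []  # all values are identical, so nothing can be an outlier
    return [v for v in values if abs((v - mean) / sd) > threshold]


def iqr_outliers(values, k=1.5):
    """Return values below Q1 - k*IQR or above Q3 + k*IQR."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < lower or v > upper]


# Hypothetical sensor readings with one extreme value.
readings = [10, 12, 11, 13, 12, 11, 10, 12, 13, 11, 12, 95]

print(zscore_outliers(readings))
print(iqr_outliers(readings))
```

Both methods flag the reading 95 in this sample. On small samples the Z-score method can miss outliers, because the extreme value itself inflates the standard deviation.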
_____________________________________________________________________
15. What is data cleaning? Explain any two data cleaning methods.
Data Cleaning, also known as data cleansing or scrubbing, is the process of correcting or
removing incorrect, incomplete, or duplicate data within a dataset. It is done to handle
irrelevant or missing data. The process involves filling in missing values, smoothing any
noisy data, identifying and removing outliers, and resolving inconsistencies.
Data Cleaning Methods/Operations:
Filling in Missing Values (Imputation):

This technique, known as imputation, fills in missing values in the dataset.
Missing values can be filled in by estimating them from other data fields or by simply
using the mean of all non-null values for that attribute. For example, the null value of an
"age" attribute for a customer might be replaced with the mean age of all customers in
the same category.
Identifying and Removing Outliers (Smoothing Noisy Data):
Real-world data is often noisy and may contain outliers, some of which are errors.
Data cleaning involves identifying these outliers and removing those that are erroneous
to ensure the quality of the data. An outlier is an extreme value that significantly
deviates from other observations.
_______________________________________
16. What do you mean by Data Reduction? Explain any two Data Reduction
strategies.
Data Reduction is an essential phase in the data preprocessing pipeline. Its primary goal
is to obtain a reduced representation of the data set, meaning a much smaller volume
of data, while ensuring that the integrity of the original data's information content is
preserved as much as possible.
Data reduction helps in reducing storage cost and volume, and in increasing the speed of
complex query execution and data mining tasks.
Explain any two Data Reduction strategies:
i. Dimensionality Reduction
This strategy aims to reduce the number of attributes (random variables or features)
under consideration. It reduces the number of unwanted variables by obtaining a
smaller set of principal variables, which are often the most important features. It
includes methods like Feature Selection (selecting a subset of the original features) and
Feature Extraction (transforming data into a new set of dimensions, such as using
Principal Component Analysis - PCA).
Benefit: Reducing the number of dimensions helps to combat the curse of
dimensionality and can significantly improve the performance of machine learning
models.
ii. Data Cube Aggregation
This technique involves aggregating or summarizing data, which inherently provides a
much smaller data representation. A data cube stores pre-computed and summarized
multidimensional information. For example, instead of storing individual sales
transactions, the data is aggregated to show total sales per month or per year, across
different regions and products.
Benefit: This eases multidimensional analysis and speeds up data access for data mining,
achieving reduction without significant loss of analytical insight.
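The aggregation idea behind a data cube can be illustrated with a small sketch that rolls individual transactions up into totals per month and region, and then per month only. The records, field names, and function name are hypothetical and chosen only for this example; real data cubes are usually built by a database or OLAP engine.

```python
from collections import defaultdict

# Hypothetical transaction-level records (illustrative values only).
transactions = [
    {"month": "2024-01", "region": "North", "amount": 100.0},
    {"month": "2024-01", "region": "North", "amount": 50.0},
    {"month": "2024-01", "region": "South", "amount": 70.0},
    {"month": "2024-02", "region": "North", "amount": 30.0},
    {"month": "2024-02", "region": "South", "amount": 20.0},
]


def aggregate_sales(rows, dimensions):
    """Sum the 'amount' field for every combination of the given dimensions."""
    totals = defaultdict(float)
    for row in rows:
        key = tuple(row[d] for d in dimensions)
        totals[key] += row["amount"]
    return dict(totals)


# Five transactions reduce to four (month, region) cells, then to two monthly totals.
sales_by_month_region = aggregate_sales(transactions, ["month", "region"])
sales_by_month = aggregate_sales(transactions, ["month"])

print(sales_by_month_region)
print(sales_by_month)
```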

17. What do you mean by hypothesis testing? Explain null and alternate hypothesis.
Hypothesis Testing is a core inferential statistical technique used in data analysis. It is a
formal process used to determine whether there is sufficient statistical evidence in a
sample of data to conclude that a specific condition or relationship holds for the entire
population. Essentially, it's a method for testing a claim or an idea about a population
using sample data.
Explain Null and Alternate Hypothesis:
i. Null Hypothesis (H0)
The Null Hypothesis is the statement of no effect, no difference, or no relationship. It is
the hypothesis that the analyst is trying to disprove or reject. It generally states that a
population parameter is equal to a specified value. Example: H0: The average height
of all students in Pune University is 5 feet 8 inches.
ii. Alternative Hypothesis (Ha or H1)
The Alternative Hypothesis is the claim about the population that contradicts the null
hypothesis. It is the statement that is accepted if the null hypothesis is rejected based
on the statistical evidence. It represents the researcher's new claim or finding, stating
that a difference, effect, or relationship exists in the population.
Example: Ha: The average height of all students in Pune University is not 5 feet 8
inches.
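The goal of the testing process is to calculate the probability of observing sample data
at least as extreme as the data actually observed, assuming H0 is true (this probability
is called the p-value). If the p-value is below a chosen significance level (e.g., 0.05),
the analyst rejects H0 in favor of Ha; otherwise, the analyst fails to reject H0.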
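The height example can be worked through with a simple two-sided one-sample z-test, sketched below. Several things here are assumptions made purely for illustration: the sample heights are invented, the population standard deviation is taken as known (3 inches), the significance level is 0.05, and 5 feet 8 inches is written as 68 inches. When the population standard deviation is unknown, a t-test is normally used instead.

```python
import math
from statistics import NormalDist, mean


def two_sided_z_test(sample, mu0, sigma):
    """Return (z statistic, two-sided p-value) for H0: population mean == mu0."""
    n = len(sample)
    # Standardize the gap between the sample mean and the hypothesized mean.
    z = (mean(sample) - mu0) / (sigma / math.sqrt(n))
    # Probability of a result at least this extreme in either direction under H0.
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value


# Hypothetical sample of student heights in inches.
heights = [70, 71, 69, 72, 70, 71, 73, 70, 69, 71]

z, p_value = two_sided_z_test(heights, mu0=68, sigma=3)
reject_h0 = p_value < 0.05  # assumed significance level

print(z, p_value, reject_h0)
```

With this invented sample the p-value falls below 0.05, so H0 would be rejected in favor of Ha; a different sample could just as easily lead to failing to reject H0.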
_____________________________________________________
18. How to visualize geospatial data? Explain in detail.
Geospatial data (also called spatial data or geodata) consists of data tied to positions
in a geographic coordinate system, primarily latitude and longitude, which
identify the location of a physical object. The primary method of visualization is
through maps. Visualizing this data involves placing data points, regions, or movements
within their geographical context, allowing for the identification of spatial patterns,
clusters, and relationships.
Common Geospatial Visualization Techniques
i. Choropleth Maps - A choropleth map divides a geographic area (like a country, state,
or county) into predefined regions and uses different colors or shades to represent the
value or range of a specific variable within each region. It is ideal for visualizing
aggregate data like population density, average income, or election results across
distinct administrative boundaries.
ii. Heat Maps (Density Maps) - Heat maps are used to represent the concentration or
intensity of continuous data over a geographic area. Unlike choropleth maps, the colors
in a heat map do not strictly correspond to fixed geographical boundaries; instead, they
are determined by the density of data points. Areas with high concentration or intensity
of the variable appear as "hot spots" (often with intense colors like red/yellow), while
areas with lower values appear "cooler" (often blue/green). Useful for visualizing large
datasets with overlapping points, such as crime density, Wi-Fi usage, or regions of high
seismic activity.
iii. Point Maps and Cluster Maps - These use individual markers to represent discrete
locations or events.
Point Map: Places a single dot/point at the exact latitude and longitude of an event or
object.
Cluster Map: When data points are too numerous and overlap, a cluster map groups
nearby points into a single symbol, which is often sized or labeled with the count of
grouped points, reducing clutter.
--------------------------------------------------------------------------------------------------------
19. Explain Exploratory Data Analysis (EDA) in detail.
Exploratory Data Analysis (EDA) is the crucial initial process of investigating a dataset to
summarize its main characteristics, often using visual methods and descriptive statistics,
before formal modeling begins.
EDA is performed to:
Understand Data Quality: Identify missing values, erroneous data, and inconsistencies.
Discover Patterns and Anomalies: Detect important patterns, underlying trends, and
anomalies (outliers) in the data.
Form Hypotheses: Generate initial hypotheses and ideas about the solution and factors
affecting it.
Understand Relationships: Understand the relationships between various features or
entities within the dataset.
Techniques Used in EDA
i. Descriptive Statistics - This involves generating summary statistics to describe the
features of the data.
Central Tendency: Calculating mean, median, and mode.
Variability/Spread: Calculating standard deviation, variance, and range.
ii. Data Visualization - This is essential for graphical exploration, allowing patterns to be
captured easily.
Bar Graphs/Histograms: Used to visualize the distribution of individual features.
Scatter Plots: Used to capture the relationship between two numerical variables.

20. What are the various types of data? Explain in detail with examples.
Classification by Structure
Data Type | Explanation | Example
Structured Data | Data that is well-organized in a pre-defined manner or fixed format. It typically resides in a relational database management system (RDBMS). | Customer information stored in a SQL table with columns like CustomerID, Name, Address.
Unstructured Data | Data that has no pre-defined structure, format, or sequence. It is the majority of data generated today. | Emails, images, audio files, video files, and social media posts.
Semi-structured Data | Data that does not conform to a formal RDBMS structure but contains tags or markers to separate and organize elements, making it easier to analyze than unstructured data. | JSON, XML, and HTML documents.
Classification by Measurement
Data Type | Explanation | Example
Discrete Data | Data that can be counted and has a finite, or countably infinite, number of possible values. It often results from counting. | The number of students in a class or the number of cars passing an intersection.
Continuous Data | Data that can take any value within a specified range. It is usually measured, not counted. | The time taken by athletes to complete a race, temperature, or a person's height or weight.

21. What are the measures of central tendency? Explain any two of them in brief.
Measures of Central Tendency are single values that attempt to describe a set of data
by identifying the central position within that set of data. They are also known as
measures of central location and are a key part of Descriptive Statistics, which aims to
summarize data to make it easier to understand.
The three most common measures of central tendency are:
Mean, Median, Mode
1. Mean (Arithmetic Mean) - The mean, or average, is calculated by summing up all the
values in the data set and then dividing the sum by the total number of values. It is the
most commonly used measure of central tendency because it includes every value in
the data set in its calculation. The mean is best used for data that is roughly symmetric,
such as normally distributed data.
A major disadvantage is that the mean is sensitive to extreme values (outliers), which
can pull the mean upward or downward, making it a poor representative of the typical
value in a skewed distribution.

2. Median - The median is the middle value in a dataset when the values are arranged
in ascending or descending order (with an even number of values, it is the average of
the two middle values). It divides the data distribution into two halves, with
50% of the observations on either side of the median value. The median is less affected
by outliers and skewed data than the mean and is therefore often the preferred
measure of central tendency when the distribution is not symmetrical.
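The different sensitivity of the mean and the median to outliers can be seen in a short sketch. The salary figures are hypothetical and chosen only to make the effect visible.

```python
import statistics

# Hypothetical salaries (in thousands); the last list adds one extreme value.
salaries = [30, 32, 35, 38, 40]
salaries_with_outlier = salaries + [400]

mean_before = statistics.mean(salaries)
median_before = statistics.median(salaries)

mean_after = statistics.mean(salaries_with_outlier)
median_after = statistics.median(salaries_with_outlier)

print(mean_before, median_before)
print(mean_after, median_after)
```

Adding the single extreme value moves the mean from 35 to more than 95, while the median only moves from 35 to 36.5.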
----------------------------------------------------------------------------------------------------------------
22. What do you mean by data discretization? Explain discretization by Histogram
analysis.
Data discretization is a data preprocessing technique that translates the values of a
continuous attribute into a finite set of intervals (or bins) with
minimal information loss. It simplifies the original continuous data by substituting
interval labels for the actual numeric values. This transformation is crucial for
algorithms that only accept categorical attributes, and it can improve the performance of
other algorithms by reducing the number of values for a continuous attribute.
Discretization by Histogram Analysis
Histogram analysis is an unsupervised discretization technique because it does not use
class (label) information to partition the data.
- Partitioning: In this method, a histogram distributes an attribute's observed values into
disjoint ranges called buckets or bins.
- Binning Rules: Various partitioning rules can be used to define the buckets:
Equal-width: The width of each bucket range is uniform (e.g., all bins span an
interval of 10 units).
Equal-frequency (or Equal-depth): The buckets are created such that, roughly, the
frequency (count of data samples) of each bucket is constant. This means each bucket
contains approximately the same number of contiguous data samples.
- Concept Hierarchy: The histogram analysis algorithm can be applied recursively to
each partition to automatically generate a multilevel concept hierarchy.
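The two partitioning rules above can be sketched as follows. The sample values and function names are assumptions made for this illustration, and the bins are returned as plain lists of values rather than as interval labels to keep the example short.

```python
def equal_width_bins(values, k):
    """Split values into k buckets whose value ranges all have the same width."""
    lo, hi = min(values), max(values)
    if hi == lo:
        raise ValueError("cannot build equal-width bins when all values are equal")
    width = (hi - lo) / k
    bins = [[] for _ in range(k)]
    for v in values:
        # The maximum value would land in bucket k, so clamp it into the last bucket.
        index = min(int((v - lo) / width), k - 1)
        bins[index].append(v)
    return bins


def equal_frequency_bins(values, k):
    """Split sorted values into k buckets holding roughly the same number of samples."""
    ordered = sorted(values)
    n = len(ordered)
    return [ordered[i * n // k:(i + 1) * n // k] for i in range(k)]


# Hypothetical attribute values used only for illustration.
prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]

print(equal_width_bins(prices, 3))
print(equal_frequency_bins(prices, 3))
```

With three bins, the equal-width rule uses ranges of width 10 (so the buckets hold 2, 3, and 4 values), while the equal-frequency rule puts exactly 3 values in each bucket.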
