Structured Data                   Unstructured Data
Organized in rows and columns     Not organized in a predefined format
Easy to store and analyze         Difficult to process and analyze
Stored in databases (SQL)         Stored in files, text, media
Examples: tables, spreadsheets    Examples: images, videos, audio, emails, PDFs

1. Write and explain data visualization libraries in Python
Data visualization is the graphical representation of data using charts and graphs to understand trends and patterns. Python provides several visualization libraries.
Matplotlib is the basic library used for creating simple graphs like line, bar, and scatter plots. It gives full control over the graph design.
Seaborn is built on Matplotlib and provides more attractive statistical visualizations like heatmaps and pairplots.
Plotly is used for interactive visualizations, while Pandas also offers quick built-in plotting functions.

2. Outliers: Define "outlier" and explain outlier detection methods
An outlier is a value that lies far away from the rest of the data and does not follow the common trend. Outliers can be detected using several methods. The Z-score method marks a value as an outlier if its Z-score is greater than +3 or less than –3. The IQR method uses quartiles and considers values below Q1 – 1.5×IQR or above Q3 + 1.5×IQR as outliers, where IQR = Q3 – Q1. Visual methods like boxplots and scatter plots also help identify points far from the main data cluster. Advanced techniques such as Isolation Forest can also be used for outlier detection.

3. 3 V's of Data (Big Data)
1. Volume
Volume refers to the huge amount of data generated. Big Data deals with terabytes, petabytes, and exabytes of data. Example: social media posts, sensor data, transaction data.
2. Velocity
Velocity means the speed at which data is generated, processed, and analyzed. Big Data systems must handle real-time or near real-time data. Example: live streaming data, online transactions, GPS signals.
3. Variety
Variety refers to the different types and formats of data. Big Data includes structured, semi-structured, and unstructured data. Example: text, audio, video, images, logs, emails, sensor readings.

9. One-Hot Encoding: One-hot encoding converts categorical data into binary vectors, with 1 for the present category and 0 for the others. Example: Red → [1,0,0], Green → [0,1,0], Blue → [0,0,1].

10. Data Cube Aggregation: Data cube aggregation is a data reduction and summarization technique used in data warehousing and OLAP (Online Analytical Processing). A data cube organizes data into multiple dimensions such as time, location, and product. Aggregation summarizes detailed data by applying operations like roll-up, drill-down, slice, and dice.
Roll-up converts detailed data into higher-level summaries (e.g., daily to monthly sales).
Drill-down is the reverse, moving from summary to detailed levels.
Slice fixes a single value on one dimension, while dice selects a sub-cube by restricting multiple dimensions.
The purpose of data cube aggregation is to reduce dataset size, improve query performance, simplify analysis, and support decision-making.

11. Visual encoding represents data using visual features like color, size, shape, or position so that patterns are easy to understand.

12. A nominal attribute is a categorical variable with no order among its values. Example: Gender, Blood group.
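The Z-score and IQR rules described for outlier detection in item 2 can be sketched in a few lines of standard-library Python. This is only an illustration: the function names, the sample data, and the choice of the population standard deviation are assumptions made here, and the quartiles come from `statistics.quantiles` with its default method, so other tools may place Q1 and Q3 slightly differently.

```python
import statistics


def zscore_outliers(values, threshold=3.0):
    """Return values whose Z-score is above +threshold or below -threshold."""
    mean = statistics.mean(values)
    std = statistics.pstdev(values)  # population std dev (an assumption of this sketch)
    if std == 0:
        return []  # all values are identical, so nothing can be an outlier
    return [v for v in values if abs((v - mean) / std) > threshold]


def iqr_outliers(values, k=1.5):
    """Return values below Q1 - k*IQR or above Q3 + k*IQR."""
    q1, _median, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < lower or v > upper]


# Hypothetical sample: eleven ordinary readings and one extreme value.
data = [10, 12, 11, 13, 12, 11, 10, 12, 13, 11, 12, 95]
print(zscore_outliers(data))
print(iqr_outliers(data))
```

With very small samples the Z-score rule can miss outliers, because a single extreme value also inflates the standard deviation; the IQR rule is less sensitive to this.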
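One-hot encoding (item 9) is a common way to turn a nominal attribute such as a color or blood group into numbers without implying any order. The sketch below reproduces the Red/Green/Blue example by hand; the function name `one_hot_encode` and the rule that categories are numbered in first-seen order are assumptions of this illustration, not a fixed standard.

```python
def one_hot_encode(values, categories=None):
    """Map each categorical value to a binary vector with a single 1."""
    if categories is None:
        # Assumption: keep categories in the order they first appear.
        categories = list(dict.fromkeys(values))
    position = {category: i for i, category in enumerate(categories)}
    vectors = []
    for value in values:
        vector = [0] * len(categories)
        vector[position[value]] = 1  # 1 marks the category that is present
        vectors.append(vector)
    return categories, vectors


categories, vectors = one_hot_encode(["Red", "Green", "Blue", "Green"])
for value, vector in zip(["Red", "Green", "Blue", "Green"], vectors):
    print(value, vector)
```

In practice a library routine would normally be used for this; the hand-written version is shown only to make the mapping explicit.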
4. Data Cleaning + Missing Value Handling
Data cleaning refers to the process of detecting and correcting missing, incorrect, or inconsistent data. Missing values can be handled in several ways. Deletion removes rows or columns containing missing values. Imputation methods fill missing values using the mean, median, or mode, or by using techniques like forward/backward fill or KNN. Sometimes default values such as zero or "unknown" are used, and advanced methods use predictive models to estimate the missing values.

5. Data Transformation
Data transformation is the process of converting data from one format or structure to another, using techniques such as normalization, scaling, encoding, and aggregation.
Data Discretization
Data discretization is the process of converting continuous data into discrete intervals or categories.

13. A bubble plot shows three variables: the X-axis position, the Y-axis position, and the bubble size (or color) as an extra dimension.

14. Inferential Statistics: Inferential statistics uses sample data to draw conclusions or make predictions about a population. Example: estimating a city's average income from a survey of 100 people.

15. Data transformation: Data transformation is the process of converting raw data into a suitable format for analysis. It includes techniques like normalization, standardization, smoothing, aggregation, discretization, encoding, and attribute construction.
(1) Normalization: scales numerical data into a smaller and consistent range (usually 0–1). It reduces the effect of different measurement scales and improves the performance of data mining algorithms. Types include Min–Max normalization, Z-score normalization, and Decimal scaling.
(2) Aggregation: combines multiple values into a single summary value. It converts detailed data into higher-level summaries such as daily to monthly sales or hourly to daily traffic counts. Aggregation reduces data volume and makes analysis faster and clearer.
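The three normalization types named in item 15 and the discretization described in item 5 can be sketched as follows. This is a minimal standard-library illustration; the function names, the `ages` sample, the use of the population standard deviation, and the choice of equal-width bins for discretization are all assumptions made for the example.

```python
import statistics


def min_max_normalize(values, new_min=0.0, new_max=1.0):
    """Rescale values linearly so the smallest maps to new_min and the largest to new_max."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [new_min for _ in values]  # constant column: no spread to rescale
    scale = (new_max - new_min) / (hi - lo)
    return [new_min + (v - lo) * scale for v in values]


def z_score_normalize(values):
    """Center on the mean and divide by the (population) standard deviation."""
    mean = statistics.mean(values)
    std = statistics.pstdev(values)
    if std == 0:
        return [0.0 for _ in values]
    return [(v - mean) / std for v in values]


def decimal_scale(values):
    """Divide by 10**j, with j the smallest integer that makes every |value| < 1."""
    max_abs = max(abs(v) for v in values)
    j = 0
    while max_abs / (10 ** j) >= 1:
        j += 1
    return [v / (10 ** j) for v in values]


def equal_width_bins(values, n_bins):
    """Discretize continuous values into n_bins intervals of equal width (bin index 0..n_bins-1)."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    if width == 0:
        return [0 for _ in values]
    # The maximum value would land in bin n_bins, so clamp it into the last bin.
    return [min(int((v - lo) / width), n_bins - 1) for v in values]


ages = [18, 25, 40, 62, 90]  # hypothetical sample
print(min_max_normalize(ages))
print(z_score_normalize(ages))
print(decimal_scale(ages))
print(equal_width_bins(ages, 3))
```

Min–Max keeps the shape of the distribution but is sensitive to extreme values, while Z-score normalization is usually preferred when the minimum and maximum are not known in advance.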
16. Geospatial data is location-based data containing coordinates like latitude and longitude. To visualize this data, maps and GIS tools are used.
1) Map-Based Visualization: Different types of maps help show spatial patterns:
Choropleth Map: Uses colors to show values (e.g., population density).
Heat Map: Shows intensity of data using color.
Dot/Point Map: Dots represent locations or counts (e.g., crime spots).
Symbol Map: Bigger symbols = higher values (e.g., sales).
2) GIS Tools: Tools like QGIS, ArcGIS, Google Earth Engine, and Mapbox help load data, add layers (roads, boundaries, rivers), apply styling, and analyze patterns.
3) Interactive Visualization: Tools like [Link], Google Maps API, and Tableau allow zooming, filtering, and interactive viewing.

6. Exploratory Data Analysis (EDA)
Exploratory Data Analysis (EDA) is a technique used to understand the patterns, structure, and relationships in a dataset before building a model. It involves examining data types, identifying missing values, detecting outliers, and calculating summary statistics such as mean, median, and correlation. Visualization methods like histograms, scatter plots, boxplots, and heatmaps are used to explore distributions and relationships. EDA helps identify data quality issues and guides further data preprocessing and modeling.

7. Variance / Standard Deviation: Variance is the average of the squared deviations of the data values from the mean; standard deviation is the square root of the variance, which expresses the spread of the data in its original units.
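The variance and standard deviation definitions in item 7 can be checked with a short calculation. The sketch below assumes the population form (dividing by n); the sample form divides by n - 1 instead and gives a slightly larger value. The `scores` list and the function name are made up for the illustration.

```python
import math
import statistics


def population_variance(values):
    """Average of the squared deviations from the mean (divides by n)."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)


scores = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical data, mean = 5

variance = population_variance(scores)
std_dev = math.sqrt(variance)  # back in the original units of the data

# Cross-check against the standard library's population variance.
library_variance = statistics.pvariance(scores)

print(variance, std_dev, library_variance)
```

Because the deviations are squared, variance is expressed in squared units (for example cm² for heights in cm), which is why the standard deviation is usually the figure that gets reported.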