Data Analytics Visualization Expanded Answers

The document outlines various aspects of data analytics, including the data analytics lifecycle, text mining, regression techniques, and data visualization methods in R and Python. It discusses key roles in data analytics, the importance of analytic sandboxes, and methods for detecting dirty data. Additionally, it covers concepts such as time series analysis, sentiment analysis methods, and differences between various libraries and techniques.

Uploaded by

raj.224346101
Copyright
© All Rights Reserved

Data Analytics and Visualization - Expanded Answers

Q1) 1) List and explain different phases in data analytics lifecycle.

Ans: The data analytics lifecycle consists of six main phases:

1. Discovery - Understand business objectives, identify data sources, and form hypotheses.

2. Data Preparation - Clean, transform, and prepare data for analysis.

3. Model Planning - Decide on the techniques and tools to use, such as regression or clustering.

4. Model Building - Apply statistical and machine learning models.

5. Communicate Results - Share insights using visualizations and reports.

6. Operationalize - Deploy models and monitor their performance over time.

Q2) 2) What is text mining? Enlist and explain seven practice areas of text analytics.

Ans: Text mining refers to extracting meaningful information from unstructured text. The seven practice areas of text analytics are:

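1. Search and Information Retrieval - Store and retrieve text documents, as in search engines and keyword search.

2. Document Clustering - Group similar documents, terms, or passages using clustering methods.

3. Document Classification - Assign documents to known categories using models trained on labeled examples.

4. Web Mining - Mine text and link structure on the web.

5. Information Extraction - Identify entities, facts, and relationships in unstructured text and convert them to structured data.

6. Natural Language Processing - Low-level language tasks such as part-of-speech tagging and parsing.

7. Concept Extraction - Group words and phrases into semantically similar groups.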

Q3) 3) What is stepwise regression? State and explain different types of stepwise regression.

Ans: Stepwise regression is used to select a subset of variables for a regression model by adding or removing predictors. The main types are:

1. Forward Selection - Starts with no variables and adds the most significant one at each step.

2. Backward Elimination - Starts with all variables and removes the least significant one.

3. Bidirectional Elimination - Combines both forward selection and backward elimination.
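
Forward selection can be sketched in Python as follows. This is a minimal illustration, not a full statistical procedure: instead of a significance test, it adds the predictor that lowers the residual sum of squares (RSS) the most and stops when the relative gain falls below a made-up threshold `min_gain`. The function names, the threshold, and the toy data are assumptions for this example, and NumPy is assumed to be installed.

```python
import numpy as np

def rss(X, y):
    """Residual sum of squares of a least-squares fit of y on X plus an intercept."""
    A = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    return float(resid @ resid)

def forward_selection(X, y, min_gain=0.05):
    """Greedily add the predictor that lowers the RSS the most.

    Stops when the best candidate lowers the RSS by less than
    min_gain, expressed as a fraction of the current RSS.
    """
    selected, remaining = [], list(range(X.shape[1]))
    current = rss(X[:, selected], y)  # intercept-only model
    while remaining:
        best_rss, best_j = min((rss(X[:, selected + [j]], y), j) for j in remaining)
        if current - best_rss < min_gain * current:
            break
        selected.append(best_j)
        remaining.remove(best_j)
        current = best_rss
    return selected

# Toy data (illustrative): y depends on columns 0 and 1 but not on column 2.
x0 = np.array([1, -1, 1, -1, 1, -1, 1, -1], dtype=float)
x1 = np.array([1, 1, -1, -1, 1, 1, -1, -1], dtype=float)
x2 = np.array([1, 1, 1, 1, -1, -1, -1, -1], dtype=float)
noise = 0.1 * x0 * x1  # small deterministic disturbance
X = np.column_stack([x0, x1, x2])
y = 3 * x0 - 2 * x1 + noise

chosen = forward_selection(X, y)
print(chosen)  # indices of the selected columns
```

Backward elimination would run the same loop in reverse, starting from all columns and dropping the one whose removal raises the RSS the least.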


Q4) 4) Explain different types of data visualization in R and Python programming language.

Ans: In R, ggplot2 is widely used for data visualization using the grammar of graphics. In Python, libraries like Matplotlib and Seaborn are popular.

Types of visualizations include:

- Line plots for trends.

- Bar charts for category comparison.

- Scatter plots for relationships between variables.

- Histograms for distribution.

- Box plots for identifying outliers.
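
The five plot types above can be drawn with Matplotlib roughly as follows. This is a sketch under assumptions: all data values are made up for illustration, Matplotlib is assumed to be installed, and the `Agg` backend is chosen only so the script runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line to show windows
import matplotlib.pyplot as plt

# Illustrative data, invented for this sketch
months = [1, 2, 3, 4, 5, 6]
sales = [10, 12, 15, 14, 18, 21]
categories = ["A", "B", "C"]
counts = [5, 9, 3]
heights = [150, 160, 165, 170, 180]
weights = [50, 58, 63, 70, 82]
scores = [52, 55, 58, 60, 61, 63, 64, 66, 70, 98]

fig, axes = plt.subplots(1, 5, figsize=(20, 4))

axes[0].plot(months, sales)          # trend over time
axes[0].set_title("Line plot")

axes[1].bar(categories, counts)      # comparison across categories
axes[1].set_title("Bar chart")

axes[2].scatter(heights, weights)    # relationship between two variables
axes[2].set_title("Scatter plot")

axes[3].hist(scores, bins=5)         # distribution of one variable
axes[3].set_title("Histogram")

axes[4].boxplot(scores)              # the value 98 appears as an outlier point
axes[4].set_title("Box plot")

fig.tight_layout()
```

Calling `fig.savefig("plots.png")` afterwards would write the figure to a file; the file name is only an example.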

Q5) 5) Show how logistic regression can be used as a classifier.

Ans: Logistic regression is used for binary classification problems. It predicts the probability that an observation belongs to the positive class.

Steps:

1. Compute a weighted sum of the input features and pass it through the logistic (sigmoid) function to get a probability between 0 and 1.

2. Set a threshold (e.g., 0.5): predict class 1 if the probability is at or above it, otherwise class 0.

Example: Predicting whether an email is spam (1) or not spam (0).
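
A minimal sketch of the spam example in plain Python is given below. The single feature (number of suspicious words per email), the training data, the learning rate, and the function names are all assumptions made for illustration; a real classifier would use many features and a library implementation.

```python
import math

def sigmoid(z):
    """Logistic function: maps any real number to the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def train(xs, ys, lr=0.1, epochs=5000):
    """Fit weight w and bias b by gradient descent on the log loss."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        errors = [sigmoid(w * x + b) - y for x, y in zip(xs, ys)]
        grad_w = sum(e * x for e, x in zip(errors, xs)) / n
        grad_b = sum(errors) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

def predict_proba(x, w, b):
    """Probability that the email is spam (class 1)."""
    return sigmoid(w * x + b)

def classify(x, w, b, threshold=0.5):
    """Apply the threshold to turn a probability into a class label."""
    return 1 if predict_proba(x, w, b) >= threshold else 0

# Hypothetical training data: suspicious-word counts and spam labels
xs = [0, 1, 2, 3, 6, 7, 8, 9]
ys = [0, 0, 0, 0, 1, 1, 1, 1]
w, b = train(xs, ys)

print(classify(1, w, b))  # few suspicious words
print(classify(8, w, b))  # many suspicious words
```

Raising the threshold above 0.5 makes the classifier label fewer emails as spam, which trades missed spam for fewer false alarms.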

Q6) 6) List and explain the steps in text analysis.

Ans: Steps in text analysis include:

1. Data Collection - Gather text data from sources like social media, reviews, etc.

2. Preprocessing - Tokenization, stop word removal, stemming, and lemmatization.

3. Feature Extraction - Convert text into numerical format using TF-IDF or Bag-of-Words.

4. Model Training - Use ML models for classification or sentiment analysis.

5. Evaluation - Assess model performance using metrics such as accuracy, precision, recall, and F1-score.
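
Steps 2 and 3 can be sketched in plain Python as below. The stop-word list and the two review sentences are made up for illustration, and stemming and lemmatization are left out to keep the example short.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list; real lists are much longer
STOP_WORDS = {"the", "is", "a", "and", "was", "but", "it"}

def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z']+", text.lower())

def preprocess(text):
    """Tokenize and drop stop words."""
    return [tok for tok in tokenize(text) if tok not in STOP_WORDS]

def bag_of_words(docs):
    """Return the sorted vocabulary and one count vector per document."""
    vocab = sorted({tok for doc in docs for tok in preprocess(doc)})
    vectors = []
    for doc in docs:
        counts = Counter(preprocess(doc))
        vectors.append([counts[word] for word in vocab])
    return vocab, vectors

reviews = [
    "The movie was great, great acting!",
    "The plot is boring and the acting was bad.",
]
vocab, vectors = bag_of_words(reviews)
print(vocab)
print(vectors)
```

Each row of `vectors` is the numerical representation of one review and could be passed to a classifier in step 4.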

Q7) 7) Explain AR, MA, ARMA and ARIMA model in detail.

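Ans: AR (AutoRegressive): Models the current value as a linear combination of its own past values (order p).

MA (Moving Average): Models the current value as a linear combination of past forecast errors (order q).

ARMA: Combines AR and MA terms; it assumes the series is stationary.

ARIMA: Adds differencing (order d) to ARMA so that a non-stationary series, such as one with a trend, becomes stationary before the model is fitted.

All four models are used for time series forecasting. ARIMA handles trend through differencing; seasonal patterns need a seasonal extension such as SARIMA.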
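
The four models can be written compactly as below. The symbols are standard textbook notation and do not come from the notes: y_t is the series, ε_t the error (white noise) term, φ_i and θ_j the coefficients, c and μ constants, and B the backshift operator.

```latex
\begin{align*}
\text{AR}(p):\quad & y_t = c + \sum_{i=1}^{p} \phi_i\, y_{t-i} + \varepsilon_t \\
\text{MA}(q):\quad & y_t = \mu + \varepsilon_t + \sum_{j=1}^{q} \theta_j\, \varepsilon_{t-j} \\
\text{ARMA}(p,q):\quad & y_t = c + \sum_{i=1}^{p} \phi_i\, y_{t-i} + \varepsilon_t + \sum_{j=1}^{q} \theta_j\, \varepsilon_{t-j} \\
\text{ARIMA}(p,d,q):\quad & \text{ARMA}(p,q) \text{ applied to } w_t = (1 - B)^d\, y_t, \qquad B\, y_t = y_{t-1}
\end{align*}
```

For d = 1 the differenced series is w_t = y_t - y_{t-1}, which removes a linear trend.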

Q8) 8) Explain Box-Jenkins intervention analysis.

Ans: Box-Jenkins intervention analysis is used when a time series is affected by an external event (intervention). It models the series using ARIMA and adjusts for the intervention.

Steps:

1. Identify the intervention and the time at which it occurred.

2. Fit an ARIMA model to the data observed before the intervention.

3. Add an intervention term (for example, a step or pulse indicator) and estimate its effect on the time series.

Example: Measuring the effect of a new policy on sales data.

Q9) 9) What is regression? What is simple linear regression? What is logistic regression?

Ans: Regression predicts a dependent variable based on one or more independent variables.

- Simple Linear Regression: One predictor and a straight-line relationship.

- Logistic Regression: Used for binary classification; outputs class probabilities using the sigmoid function.

Q10) 10) List and explain methods that can be used in sentiment analysis.

Ans: Methods used in sentiment analysis include:

1. Lexicon-based - Use dictionaries of positive and negative words.

2. Machine Learning - Train classifiers (e.g., Naive Bayes, SVM) on labeled data.

3. Deep Learning - Use models like LSTM and CNN for context-based understanding.

Each method has its strengths depending on the complexity of the data.
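
The lexicon-based method is the simplest to sketch. In the example below, the two word lists are tiny hypothetical lexicons invented for illustration; real lexicons are far larger and often weight each word.

```python
import re

# Hypothetical mini-lexicons, for illustration only
POSITIVE = {"good", "great", "excellent", "love", "happy"}
NEGATIVE = {"bad", "poor", "terrible", "hate", "boring"}

def sentiment_score(text):
    """Count positive words minus negative words in the text."""
    tokens = re.findall(r"[a-z']+", text.lower())
    positives = sum(tok in POSITIVE for tok in tokens)
    negatives = sum(tok in NEGATIVE for tok in tokens)
    return positives - negatives

def sentiment_label(text):
    """Map the score to a positive, negative, or neutral label."""
    score = sentiment_score(text)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

print(sentiment_label("I love this phone, the camera is great"))
print(sentiment_label("Terrible battery and poor screen"))
print(sentiment_label("It arrived on Monday"))
```

A plain word count like this ignores negation ("not good") and sarcasm, which is one reason the machine learning and deep learning methods are used on harder data.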

Q11) 11) Explain with a suitable example how TF, DF, and IDF are used in information retrieval.

Ans: TF (Term Frequency): Number of times a term appears in a document.

DF (Document Frequency): Number of documents containing the term.

IDF (Inverse Document Frequency): Measures how rare a term is across the collection. IDF = log(Total docs / DF).

TF-IDF = TF x IDF. It gives more weight to terms that are frequent in a document but rare across the collection, so retrieved documents can be ranked by the TF-IDF of the query terms.
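
A small worked example, using the formula IDF = log(Total docs / DF) from above. The three documents are made up, the natural logarithm is an assumption (the notes do not state a base), and TF is taken as a raw count.

```python
import math
from collections import Counter

# Illustrative three-document collection
docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "cats and dogs",
]
corpus = [doc.split() for doc in docs]

def tf(term, tokens):
    """Term frequency: raw count of the term in one document."""
    return Counter(tokens)[term]

def df(term, corpus):
    """Document frequency: number of documents containing the term."""
    return sum(term in tokens for tokens in corpus)

def idf(term, corpus):
    """Inverse document frequency; assumes the term occurs in at least one document."""
    return math.log(len(corpus) / df(term, corpus))

def tf_idf(term, tokens, corpus):
    return tf(term, tokens) * idf(term, corpus)

first = corpus[0]
print(tf("the", first), df("the", corpus), tf_idf("the", first, corpus))
print(tf("cat", first), df("cat", corpus), tf_idf("cat", first, corpus))
```

In the first document "the" occurs twice but appears in two of the three documents, while "cat" occurs once and appears in only one document, so "cat" ends up with the higher TF-IDF weight.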

Q12) 12) How is exploratory data analysis performed in R?

Ans: EDA in R involves:

1. Data Summarization: Using functions like summary(), head(), etc.

2. Visualization: Using ggplot2 or base R plots to understand data distribution.

3. Missing Value Detection: Using is.na() and visualizations.

4. Outlier Detection: Boxplots and scatterplots help identify unusual values.

Q13) 13) What is time series analysis? Explain components of time series?

Ans: Time series analysis involves studying data points collected over time.

Components:

1. Trend - Long-term movement.

2. Seasonality - Patterns that repeat at a fixed, known period (e.g., every week or every year).

3. Cyclic - Rises and falls without a fixed period, often linked to business or economic cycles.

4. Irregular - Random noise.

Used in forecasting stock prices, sales, etc.
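
The trend and seasonal components can be separated with a centered moving average, as in the sketch below. It assumes an additive series (value = trend + seasonal effect) with an even period and no cyclic or irregular part; the series itself is synthetic and the function name is made up for this example.

```python
def centered_moving_average(series, period):
    """Estimate the trend with a centered moving average (even period).

    The two end points of each window get half weight so that the
    window covers exactly one full period.
    """
    half = period // 2
    trend = [None] * len(series)
    for t in range(half, len(series) - half):
        window = series[t - half : t + half + 1]
        total = 0.5 * window[0] + sum(window[1:-1]) + 0.5 * window[-1]
        trend[t] = total / period
    return trend

# Synthetic quarterly series: linear trend plus a seasonal pattern that sums to zero
seasonal_pattern = [3, -1, -4, 2]
series = [10 + 2 * t + seasonal_pattern[t % 4] for t in range(16)]

trend = centered_moving_average(series, period=4)

# Subtracting the trend leaves the seasonal effect (where the trend is defined)
detrended = [y - m for y, m in zip(series, trend) if m is not None]

print(trend[2:6])
print(detrended[:4])
```

The trend estimate is undefined for the first and last two points because the window does not fit there.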

Q14) 14) How is data exploration different from presentation? Explain with suitable examples.

Ans: Data exploration is the process of examining datasets to summarize their main characteristics.

Example: Using histograms to understand distributions.

Data presentation involves visualizing processed data for stakeholders using dashboards, reports, and visualizations to aid decision making.

Example: A labeled bar chart of quarterly sales in a management report.

Q15) 15) What is Pandas? Explain features of Pandas.

Ans: Pandas is a Python library for data manipulation and analysis.


Key features:

- DataFrame and Series structures.

- Handling missing data.

- Data filtering, grouping, and merging.

- Integration with visualization and statistical tools.
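
The features above can be seen in a short Pandas session. The two small tables and their column names are made up for illustration, and Pandas is assumed to be installed.

```python
import pandas as pd

# Illustrative sales table with one missing value
sales = pd.DataFrame({
    "region": ["North", "South", "North", "East"],
    "units": [10, None, 5, 8],
})

# Handling missing data: replace the missing unit count with 0
sales["units"] = sales["units"].fillna(0)

# Filtering: keep rows with at least 8 units
big = sales[sales["units"] >= 8]

# Grouping: total units per region
totals = sales.groupby("region", as_index=False)["units"].sum()

# Merging: attach a (hypothetical) manager to each region
managers = pd.DataFrame({
    "region": ["North", "South", "East"],
    "manager": ["A", "B", "C"],
})
report = totals.merge(managers, on="region", how="left")

print(big)
print(report)
```

Each column of a DataFrame, such as `sales["units"]`, is a Series.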

Q16) 16) List and explain different key roles for successful data analytics?

Ans: Key roles in data analytics include:

1. Data Analyst - Explores and visualizes data.

2. Data Scientist - Builds models and algorithms.

3. Data Engineer - Manages data pipelines.

4. Business Analyst - Bridges technical team and business.

5. Project Manager - Oversees timelines and deliverables.

Q17) 17) What is an analytic sandbox, and why is it important?

Ans: An analytic sandbox is a secure environment for data scientists to access and explore data without affecting live systems.

Importance:

- Safe experimentation.

- Promotes innovation.

- Supports reproducible research and collaboration.

Q18) 18) Explain how dirty data can be detected in the data exploration phase with visualizations.

Ans: Dirty data includes incorrect, duplicate, or missing data. Detection methods include:

- Visual tools like box plots for outliers.

- Histograms for unexpected distributions.

- Heatmaps for missing data.


Cleaning involves imputation, transformation, and removal of noisy records.
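
A box plot flags outliers with the 1.5 x IQR rule, and the same check can be done numerically before or alongside plotting. The sketch below uses made-up readings; note that quartile conventions differ slightly between libraries, so the fences may not match a plotted box exactly.

```python
import statistics

# Illustrative readings: None marks a missing value, 95 is a suspicious entry
readings = [12, 14, None, 15, 15, 16, 17, None, 18, 19, 20, 95]

# Missing data: count the gaps (a heatmap would show these as blank cells)
missing = sum(value is None for value in readings)

# Outliers: apply the box-plot rule to the values that are present
present = sorted(value for value in readings if value is not None)
q1, _, q3 = statistics.quantiles(present, n=4)
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = [value for value in present if value < lower_fence or value > upper_fence]

print(missing)
print(outliers)
```

Whether a flagged value is an error or a genuine extreme still needs a domain check before it is removed.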

Q19) 19) Differentiate between the following:

i) Matplotlib and seaborn library

ii) Linear and logistic regression

iii) Extractive and abstractive summarization

iv) Pandas and NumPy

Ans: i) Matplotlib vs Seaborn: Matplotlib is low-level and flexible. Seaborn builds on Matplotlib with a simpler syntax and better aesthetics.

ii) Linear vs Logistic Regression: Linear predicts continuous outputs; logistic is for classification.

iii) Extractive vs Abstractive Summarization: Extractive picks sentences from text; abstractive generates summaries in new words.

iv) Pandas vs NumPy: Pandas handles structured data (DataFrames); NumPy handles numerical arrays and mathematical operations.

Q20) 20) Write a short note on the following:

i) Generalized linear model

ii) Pandas library

iii) Data import and export in R

iv) Regression plot

v) Seaborn Library

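Ans: i) Generalized Linear Model: Extends linear regression to response variables with non-normal distributions by relating the mean of the response to a linear predictor through a link function (e.g., logistic regression for binary outcomes, Poisson regression for counts).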

ii) Pandas: Python library for data manipulation using Series and DataFrame.

iii) Data Import/Export in R: Use functions like read.csv() and write.csv(), and packages such as readxl for Excel files.

iv) Regression Plot: Shows relationship between variables and model fit (e.g., using Seaborn's regplot).

v) Seaborn Library: Built on Matplotlib; used for statistical visualizations with fewer lines of code.

Q21) 21) Numerical based on Regression.

Ans: Numerical questions typically involve fitting regression lines, calculating coefficients using least squares, interpreting R-squared, etc. Refer to textbook exercises for specific solved examples.
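
A typical numerical can be checked with a few lines of Python. The five data points below are made up for illustration; the formulas are the usual least-squares ones, slope = Sxy / Sxx and intercept = mean(y) - slope x mean(x).

```python
def fit_line(xs, ys):
    """Least-squares slope and intercept for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

def r_squared(xs, ys, slope, intercept):
    """Share of the variation in y explained by the fitted line."""
    mean_y = sum(ys) / len(ys)
    ss_res = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1 - ss_res / ss_tot

# Illustrative data
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

slope, intercept = fit_line(xs, ys)
r2 = r_squared(xs, ys, slope, intercept)
print(slope, intercept, r2)
```

By hand: mean(x) = 3, mean(y) = 4, Sxy = 6, Sxx = 10, so the slope is 0.6 and the intercept is 4 - 0.6 x 3 = 2.2. The residual sum of squares is 2.4 and the total sum of squares is 6, giving R-squared = 1 - 2.4/6 = 0.6.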
