7-Mark Answers: Data Analytics & R Questions
Concept of Data Science and its use in Business
Data Science is the process of collecting, organizing, analyzing, and interpreting data to
extract useful insights. It combines statistics, programming, machine learning, and business
knowledge.
Uses in Business:
1. Helps in decision-making using data.
2. Improves customer experience through recommendations.
3. Predicts sales and market trends.
4. Detects fraud in banking and finance.
5. Optimizes inventory and supply chain.
6. Supports targeted marketing.
7. Increases efficiency and profit.
Example: Amazon uses data science to recommend products to customers.
Data Analytics vs Data Analysis
Data Analysis focuses on examining data to find conclusions. Data Analytics is broader and
includes data collection, processing, prediction, and decision-making.
Differences:
1. Data Analysis studies past data.
2. Data Analytics uses tools and models for future prediction.
3. Analysis is a part of Analytics.
4. Analysis gives insights; Analytics supports business strategy.
5. Analytics includes machine learning and forecasting.
Example: Checking a sales report is analysis, while predicting future sales is analytics.
Types of Analytics and real-life uses
1. Descriptive Analytics – explains what happened.
Example: Monthly sales reports.
2. Diagnostic Analytics – explains why it happened.
Example: Finding reasons for decrease in sales.
3. Predictive Analytics – predicts future outcomes.
Example: Weather forecasting or stock prediction.
4. Prescriptive Analytics – suggests actions.
Example: Google Maps suggesting fastest route.
These analytics help businesses make better decisions.
Define Big Data, characteristics, applications and challenges
Big Data refers to extremely large and complex data that cannot be managed using
traditional methods.
Characteristics (5Vs):
1. Volume – huge amount of data.
2. Velocity – fast speed of generation.
3. Variety – different data types.
4. Veracity – accuracy and trustworthiness of data.
5. Value – usefulness of data.
Applications:
- Healthcare
- Banking
- E-commerce
- Social media
- Education
Challenges:
- Data security
- Storage issues
- Data quality
- Processing speed
- Privacy concerns
How does classification of analytics help in decision-making?
Classification of analytics helps organizations understand past performance, identify
problems, predict future trends, and choose the best actions.
1. Descriptive analytics gives past information.
2. Diagnostic analytics identifies causes.
3. Predictive analytics forecasts future outcomes.
4. Prescriptive analytics recommends solutions.
Benefits:
- Better planning
- Faster decisions
- Reduced risks
- Improved business performance
Example: Companies predict customer demand before launching products.
Process of data preparation and cleaning in spreadsheet
Data preparation and cleaning means organizing and correcting raw data before analysis.
Steps:
1. Remove duplicate records.
2. Handle missing values.
3. Correct spelling and formatting errors.
4. Standardize data format.
5. Remove unnecessary columns.
6. Check outliers.
7. Validate data accuracy.
Importance:
- Improves data quality.
- Gives accurate results.
- Reduces errors in analysis.
- Saves time during reporting.
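
Illustrative sketch: the same cleaning steps (fix formatting, standardize, remove duplicates) written in R instead of a spreadsheet. The data frame and its column names are made up for this example.

```r
# Hypothetical raw data with stray spaces, mixed case, and repeated records
raw <- data.frame(
  name = c(" Asha", "Ravi ", "ravi", " Asha"),
  city = c("Pune", "DELHI", "Delhi", "Pune"),
  stringsAsFactors = FALSE
)

clean <- raw
clean$name <- tolower(trimws(clean$name))   # remove stray spaces, standardize case
clean$city <- tolower(trimws(clean$city))
clean <- clean[!duplicated(clean), ]        # drop duplicate records

nrow(clean)   # number of rows left after cleaning
```

Standardizing before removing duplicates matters here: "Ravi " and "ravi" only count as the same record once the format is consistent.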
How can outliers be identified in a dataset using spreadsheet?
Outliers are unusual values that are very different from other data.
Methods:
1. Sort data to find extreme values.
2. Use conditional formatting.
3. Create box plots or charts.
4. Use mean and standard deviation (e.g., flag values more than 3 standard deviations from the mean).
5. Apply IQR method.
Steps in spreadsheet:
- Calculate Q1 and Q3.
- Find IQR = Q3 – Q1.
- Values below Q1 – 1.5 × IQR or above Q3 + 1.5 × IQR are outliers.
Outlier detection improves accuracy of analysis.
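
Illustrative sketch of the IQR method in R, using a made-up sample. R's default quartile calculation is assumed here; a spreadsheet's quartile function may give slightly different Q1 and Q3 values depending on the method it uses.

```r
values <- c(12, 14, 15, 15, 16, 17, 18, 19, 20, 95)   # hypothetical sample

q1  <- unname(quantile(values, 0.25))   # first quartile
q3  <- unname(quantile(values, 0.75))   # third quartile
iqr <- q3 - q1                          # IQR = Q3 - Q1

lower <- q1 - 1.5 * iqr                 # lower fence
upper <- q3 + 1.5 * iqr                 # upper fence

outliers <- values[values < lower | values > upper]
outliers
```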
Use of Pivot Table and Pivot Charts
Pivot Tables summarize large data quickly.
Uses:
1. Group data easily.
2. Calculate totals, averages, and counts.
3. Filter information.
4. Compare categories.
5. Create reports quickly.
Pivot Charts visually represent Pivot Table data using graphs.
Benefits:
- Easy visualization
- Better understanding
- Faster analysis
- Supports decision-making
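
Illustrative sketch: a Pivot Table style summary can be approximated in R with aggregate() and xtabs(). The sales data and column names below are hypothetical.

```r
sales <- data.frame(
  region  = c("North", "South", "North", "South", "North"),
  product = c("A", "A", "B", "B", "A"),
  amount  = c(100, 150, 200, 50, 300)
)

# Total amount per region (region in the rows, sum of amount as the value)
totals <- aggregate(amount ~ region, data = sales, FUN = sum)

# Region by product cross-tab of summed amounts (rows and columns of a pivot)
pivot <- xtabs(amount ~ region + product, data = sales)

totals
pivot
# barplot(pivot, beside = TRUE, legend.text = TRUE) would draw a chart of the same summary
```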
Handling missing data in spreadsheet
Missing data refers to blank or unavailable values in a dataset.
Methods:
1. Remove incomplete rows.
2. Replace with mean or median.
3. Use previous values (forward fill).
4. Fill manually if possible.
5. Use formulas or interpolation.
Implications of missing data:
- Incorrect analysis
- Biased results
- Reduced accuracy
- Wrong business decisions
Proper handling improves data quality.
Interactive dashboard in spreadsheet
An interactive dashboard is a visual display of key information using charts, tables, and
filters.
Steps:
1. Organize data.
2. Create Pivot Tables.
3. Add charts and graphs.
4. Use slicers and filters.
5. Apply conditional formatting.
6. Design clear layout.
Benefits:
- Real-time insights
- Easy monitoring
- Better decision-making
- User-friendly visualization
Role of scatter plots, line charts and histograms
Scatter Plot:
Shows relationship between two variables.
Line Chart:
Shows trends over time.
Histogram:
Shows frequency distribution of data.
Importance:
- Identifies patterns
- Detects trends
- Helps comparison
- Makes data easy to understand
- Supports analysis and forecasting
Techniques available in spreadsheet for data visualization
1. Bar charts
2. Pie charts
3. Line charts
4. Histograms
5. Scatter plots
6. Pivot charts
7. Conditional formatting
Importance:
- Improves understanding
- Highlights trends
- Detects errors
- Helps accurate reporting
- Supports decision-making
What is R? How do we install an R package?
R is a programming language used for statistics, data analysis, and visualization.
Features:
- Open-source
- Statistical computing
- Graphical tools
- Data analysis support
Installing package:
Use command:
install.packages("package_name")
Loading package:
library(package_name)
Example:
install.packages("ggplot2")
Features of R
1. Open-source software
2. Supports statistical analysis
3. Powerful data visualization
4. Large package library
5. Cross-platform support
6. Supports machine learning
7. Easy data manipulation
R is widely used in research and business analytics.
Difference between setwd() and getwd()
setwd():
Used to set/change the current working directory.
Example:
setwd("C:/Data")
getwd():
Used to display the current working directory.
Example:
getwd()
Difference:
setwd() changes location, while getwd() shows location.
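
Illustrative sketch of the two functions used together. The temporary folder from tempdir() stands in for a real project folder, so the example runs on any machine.

```r
old_dir <- getwd()      # remember the current working directory
setwd(tempdir())        # change to a temporary folder
new_dir <- getwd()      # now reports the temporary folder
setwd(old_dir)          # go back to where we started
```

Saving the old location first is a common habit, since setwd() affects every relative file path used afterwards.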
How do you remove NA values from a data frame?
NA values represent missing data in R.
Methods:
1. na.omit(dataframe)
2. complete.cases()
3. Replace NA with mean/median.
Example:
data <- na.omit(data)
Benefits:
- Improves accuracy
- Prevents calculation errors
- Makes analysis reliable
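
Illustrative sketch of the three methods on a small made-up data frame. na.omit(), complete.cases(), and is.na() are base R functions; the column names are hypothetical.

```r
df <- data.frame(id = 1:4, score = c(80, NA, 90, 70))

rows_dropped <- na.omit(df)                 # drop every row that contains an NA
rows_kept    <- df[complete.cases(df), ]    # same rows, chosen with a logical filter

filled <- df
filled$score[is.na(filled$score)] <- mean(df$score, na.rm = TRUE)   # replace NA with the mean
filled
```

Dropping rows loses data, while replacing with the mean keeps the row but adds an estimated value; which is better depends on how much data is missing.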
Logical operators in R
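Logical operators (&, |, !) combine or negate conditions. Relational operators (==, !=, >, <, >=, <=) compare values and return TRUE or FALSE.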
Operators:
1. & → AND
2. | → OR
3. ! → NOT
4. == → Equal to
5. != → Not equal to
6. >, <, >=, <= → Greater than, less than, greater than or equal to, less than or equal to
Example:
x > 5 & y < 10
Used in filtering and decision-making.
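
Illustrative sketch of these operators on two made-up vectors, including their use for filtering.

```r
x <- c(3, 6, 8, 12)
y <- c(4, 9, 15, 2)

both    <- x > 5 & y < 10       # TRUE only where both conditions hold
either  <- x > 5 | y < 10       # TRUE where at least one condition holds
neither <- !(x > 5 | y < 10)    # NOT reverses the result

x[both]                         # keep only the x values where both conditions hold
```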
What does a histogram represent?
A histogram represents frequency distribution of continuous data.
Features:
- Bars touch each other.
- Shows spread of data.
- Displays patterns and distribution.
Uses:
- Detects skewness
- Identifies outliers
- Understands data distribution
Example: Marks distribution of students.
Difference between bar chart and histogram
Bar Chart:
- Used for categorical data.
- Bars are separated.
- Compares categories.
Histogram:
- Used for continuous data.
- Bars touch each other.
- Shows frequency distribution.
Example:
Bar chart for subjects, histogram for marks distribution.
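
Illustrative sketch of the difference in R, with made-up marks and subjects. hist() groups continuous values into intervals, while table() counts separate categories for a bar chart. The class width of 10 is an assumption for this example.

```r
marks <- c(35, 42, 48, 51, 55, 58, 62, 67, 71, 88)    # continuous data
h <- hist(marks, breaks = seq(30, 90, by = 10), plot = FALSE)
h$counts                                              # frequency in each interval

subjects <- c("Maths", "Science", "Maths", "English", "Maths")   # categorical data
subject_counts <- table(subjects)
subject_counts                                        # count for each category

# plot(h) would draw the histogram; barplot(subject_counts) would draw the bar chart
```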
When is a line graph most appropriate?
A line graph is best used to show trends and changes over time.
Uses:
- Stock prices
- Temperature changes
- Monthly sales
- Population growth
Advantages:
- Easy trend analysis
- Shows increase/decrease clearly
- Useful for forecasting
Correlation vs Covariance
Correlation measures strength and direction of relationship between variables.
Covariance measures how two variables vary together.
Differences:
1. Correlation ranges from -1 to +1.
2. Covariance has no fixed range.
3. Correlation is standardized.
4. Covariance depends on units.
Correlation is easier to interpret.
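
Illustrative sketch of differences 3 and 4 with made-up height and weight values: changing the unit of height changes the covariance but not the correlation.

```r
height_cm <- c(150, 160, 170, 180, 190)
weight_kg <- c(50, 58, 65, 75, 82)

cov_cm <- cov(height_cm, weight_kg)          # covariance with height in centimetres
cov_m  <- cov(height_cm / 100, weight_kg)    # same data, height in metres

cor_cm <- cor(height_cm, weight_kg)          # correlation with height in centimetres
cor_m  <- cor(height_cm / 100, weight_kg)    # unchanged by the change of unit

c(cov_cm = cov_cm, cov_m = cov_m, cor_cm = cor_cm, cor_m = cor_m)
```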
Linear Regression Model
Linear Regression models the straight-line relationship between a dependent variable and an independent variable.
Equation:
Y = a + bX
Where:
Y = dependent variable
X = independent variable
a = intercept
b = slope
Uses:
- Sales prediction
- Trend analysis
- Forecasting
Advantages:
- Simple
- Easy interpretation
- Useful for prediction
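
Illustrative sketch of fitting Y = a + bX with lm(). The advertising and sales figures are made up for this example.

```r
ads   <- c(1, 2, 3, 4, 5)                  # X: hypothetical advertising spend
sales <- c(5.1, 6.9, 9.2, 10.8, 13.0)      # Y: hypothetical sales

fit <- lm(sales ~ ads)
coef(fit)                                  # intercept (a) and slope (b)

predict(fit, newdata = data.frame(ads = 6))   # predicted sales when ads = 6
```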
Multiple Regression
Multiple Regression uses two or more independent variables to predict one dependent
variable.
Equation:
Y = a + b1X1 + b2X2 + ...
Example:
Predicting house price using size, location, and age.
Advantages:
- Often better accuracy than a single-variable model
- Handles multiple factors
- Useful in business forecasting
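
Illustrative sketch with two independent variables. The prices are built from a known rule (10 + 2 × size – 1 × age) so that the fitted coefficients can be checked against it. Location is left out because it is a category, not a number, and the data are hypothetical.

```r
houses <- data.frame(
  size = c(50, 60, 70, 80, 90, 100),
  age  = c(10, 5, 20, 15, 2, 8)
)
houses$price <- 10 + 2 * houses$size - 1 * houses$age   # price built from a known rule

fit2 <- lm(price ~ size + age, data = houses)
coef(fit2)    # a (intercept), b1 (size), b2 (age)
```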
Multicollinearity in Regression
Multicollinearity occurs when independent variables are highly correlated.
Effects:
- Makes coefficient estimates unstable
- Difficult coefficient interpretation
- Inflates standard errors of coefficients
Detection:
- Correlation matrix
- VIF (Variance Inflation Factor)
Solution:
- Remove related variables
- Use feature selection
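
Illustrative sketch of VIF computed by hand in base R as 1 / (1 – R²), where R² comes from regressing one independent variable on the others. The helper name vif and the data are assumptions for this example. A VIF above about 5 or 10 is often treated as a warning sign.

```r
x1 <- c(1, 2, 3, 4, 5, 6, 7, 8)
x2 <- c(2.1, 3.9, 6.2, 7.8, 10.1, 12.2, 13.8, 16.1)   # nearly 2 * x1
x3 <- c(5, 3, 8, 1, 9, 2, 7, 4)                       # not closely related to x1

# Hypothetical helper: VIF of one predictor given the other predictors
vif <- function(target, others) {
  d  <- data.frame(target = target, others)
  r2 <- summary(lm(target ~ ., data = d))$r.squared
  1 / (1 - r2)
}

vif_x1 <- vif(x1, data.frame(x2, x3))   # very large: x1 is almost a copy of x2
vif_x3 <- vif(x3, data.frame(x1, x2))   # small: x3 carries separate information
c(vif_x1 = vif_x1, vif_x3 = vif_x3)

cor(cbind(x1, x2, x3))                  # the correlation matrix shows the same problem
```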
Heteroscedasticity in Regression
Heteroscedasticity occurs when error variance is not constant.
Effects:
- Unreliable standard errors and prediction intervals
- Incorrect statistical tests
Causes:
- Outliers
- Improper data (e.g., omitted variables or values on very different scales)
Detection:
- Residual plots
- Statistical tests
Solutions:
- Transform data
- Remove outliers
- Use weighted regression
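
Illustrative sketch with data constructed so that the error grows with x. It compares the spread of residuals for small and large x and then fits a weighted regression. The weights 1 / x^2 assume the error variance is proportional to x^2, which is true here only because the data were built that way.

```r
x <- 1:20
noise <- 0.5 * x * rep(c(1, -1), length.out = 20)   # error size grows with x
y <- 2 + 3 * x + noise

fit_h <- lm(y ~ x)
res   <- residuals(fit_h)

spread_low  <- sd(res[x <= 10])    # residual spread for small x
spread_high <- sd(res[x > 10])     # residual spread for large x
c(spread_low = spread_low, spread_high = spread_high)

# plot(fitted(fit_h), res) would show the residuals fanning out

fit_w <- lm(y ~ x, weights = 1 / x^2)   # weighted regression: less weight on noisy points
coef(fit_w)
```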
Textual Data Analysis
Textual Data Analysis means analyzing text data to extract useful information.
Steps:
1. Data collection
2. Cleaning text
3. Tokenization
4. Sentiment analysis
5. Interpretation
Applications:
- Social media analysis
- Customer feedback
- Review analysis
Role of Residuals in Regression
Residuals are differences between actual and predicted values.
Formula:
Residual = Actual – Predicted
Importance:
- Measures prediction error
- Checks model accuracy
- Detects outliers
- Helps validate regression assumptions
Good models have small residuals scattered randomly around zero.
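
Illustrative sketch of the formula on made-up data: the residuals computed by hand match those returned by residuals(), and for a model with an intercept they add up to zero.

```r
x_val  <- 1:5
actual <- c(5.1, 6.9, 9.2, 10.8, 13.0)     # hypothetical observed values

fit_r     <- lm(actual ~ x_val)
predicted <- fitted(fit_r)                 # values predicted by the model

manual_res <- actual - predicted           # Residual = Actual - Predicted
manual_res
sum(manual_res)                            # practically zero

# plot(predicted, manual_res) is the usual residual plot for checking assumptions
```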
Purpose of Confidence Interval in Regression
Confidence Interval gives a range within which the true parameter value (such as a regression coefficient) is expected to lie at a chosen confidence level.
Importance:
- Measures reliability
- Shows uncertainty
- Helps statistical inference
Example:
A 95% confidence interval means that if the study were repeated many times, about 95% of
the intervals built this way would contain the true value.
Prediction Interval in Regression
Prediction Interval estimates the range within which a single future observation is expected to fall.
Features:
- Wider than confidence interval
- Includes uncertainty in the fitted line and random variation of individual observations
Uses:
- Forecasting
- Future value estimation
Example:
Predicting the range of next month's sales.
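
Illustrative sketch comparing the two kinds of interval with predict() on made-up monthly sales. confint() gives confidence intervals for the coefficients themselves.

```r
month <- 1:10
sales <- c(12, 15, 14, 18, 21, 20, 24, 27, 26, 30)   # hypothetical monthly sales

fit_s <- lm(sales ~ month)
confint(fit_s, level = 0.95)      # confidence intervals for intercept and slope

next_month <- data.frame(month = 11)
conf_int <- predict(fit_s, next_month, interval = "confidence", level = 0.95)
pred_int <- predict(fit_s, next_month, interval = "prediction", level = 0.95)

conf_width <- conf_int[1, "upr"] - conf_int[1, "lwr"]   # range for the average sales
pred_width <- pred_int[1, "upr"] - pred_int[1, "lwr"]   # range for one future month
c(conf_width = conf_width, pred_width = pred_width)
```

Both intervals are centred on the same predicted value; the prediction interval is wider because a single month also varies around the average.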
Importance of checking statistical significance of coefficients
Statistical significance checks whether a variable's effect on the outcome is unlikely to be due to chance.
Methods:
- p-value
- t-test
Importance:
1. Identifies useful variables.
2. Improves model accuracy.
3. Removes unnecessary variables.
4. Supports reliable conclusions.
Usually a p-value below 0.05 indicates significance at the 5% level.
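
Illustrative sketch of reading t-values and p-values from summary() of a fitted model. The data are made up: y is built to follow x closely, and z is an arbitrary extra variable whose p-value is left for the output to show.

```r
x <- 1:10
y <- c(2.1, 3.9, 6.2, 7.8, 10.1, 12.2, 13.8, 16.1, 18.0, 19.9)   # close to 2 * x
z <- c(5, 3, 8, 1, 9, 2, 7, 4, 6, 10)                            # arbitrary extra variable

fit_p <- lm(y ~ x + z)
coefs <- summary(fit_p)$coefficients    # Estimate, Std. Error, t value, Pr(>|t|)

p_values    <- coefs[, "Pr(>|t|)"]
significant <- p_values < 0.05          # TRUE where the coefficient is significant at 5%
data.frame(p_value = p_values, significant = significant)
```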
Goal of Textual Data Analysis
Goals:
1. Extract useful information from text.
2. Identify sentiments and opinions.
3. Classify documents.
4. Detect patterns and trends.
5. Support business decisions.
Applications:
- Chatbots
- Review analysis
- Social media monitoring
Difference between Structured and Unstructured Data
Structured Data:
- Organized in rows and columns.
- Easy to store and analyze.
Example: Excel tables.
Unstructured Data:
- No fixed format.
- Difficult to process.
Example: Images, videos, emails.
Structured data is easier for analysis.
Tokenization in Text Analysis
Tokenization is the process of breaking text into smaller units called tokens.
Tokens may be:
- Words
- Sentences
- Characters
Example:
“I love data science” → [I, love, data, science]
Importance:
- Text processing
- Sentiment analysis
- Machine learning
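
Illustrative sketch of word and character tokenization with base R's strsplit(). Splitting on a single space is a simplifying assumption; real text usually also needs punctuation handling.

```r
text <- "I love data science"

words <- strsplit(text, " ", fixed = TRUE)[[1]]   # word tokens
chars <- strsplit(text, "")[[1]]                  # character tokens (spaces included)

words
length(chars)
```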
Text Mining, Text Categorization and Sentiment Analysis
Text Mining:
Extracting useful patterns from text data.
Text Categorization:
Classifying text into categories.
Sentiment Analysis:
Finding emotions/opinions from text.
Applications:
- Product review analysis
- Spam detection
- Social media monitoring
- Customer feedback analysis
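
Illustrative sketch of a very simple word-list approach to sentiment analysis in base R. The word lists, the function name sentiment_score, and the reviews are all hypothetical; real tools use much larger lexicons or trained models.

```r
positive_words <- c("good", "great", "love", "excellent")   # hypothetical lexicon
negative_words <- c("bad", "poor", "hate", "terrible")

# Score = number of positive words minus number of negative words
sentiment_score <- function(review) {
  cleaned <- tolower(gsub("[[:punct:]]", "", review))   # clean the text
  tokens  <- strsplit(cleaned, "\\s+")[[1]]             # tokenize into words
  sum(tokens %in% positive_words) - sum(tokens %in% negative_words)
}

reviews <- c("Great phone, I love it!",
             "Terrible battery and poor screen.",
             "It arrived on Monday.")

scores <- vapply(reviews, sentiment_score, integer(1), USE.NAMES = FALSE)
labels <- ifelse(scores > 0, "positive", ifelse(scores < 0, "negative", "neutral"))
data.frame(review = reviews, score = scores, label = labels)
```

Assigning each review a label from its score is also a small case of text categorization.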