Unit – V
Exploratory Analysis with base graphics tools in R (box plots, bar charts, line plots,
heat map, etc.) Customize plot axes, labels, add legends, and add colours - Data Analysis
Descriptive Statistics - Spotting problems with Data and Visualization.
Exploratory Data Analysis (EDA)
Exploratory Data Analysis (EDA) is a critical step in the data analysis process that involves
summarizing the main characteristics of a dataset, often using visual methods. EDA helps
uncover patterns, spot anomalies, test hypotheses, and check assumptions. Here’s a structured
approach to EDA, including techniques, visualizations, and best practices.
Objectives of EDA
1. Understand the Data: Gain insights into the data's structure, variables, and
distributions.
2. Identify Patterns: Discover relationships and trends in the data.
3. Spot Anomalies: Identify outliers and unusual data points that may affect analyses.
4. Check Assumptions: Verify assumptions necessary for statistical modeling.
Key Steps in EDA
1. Data Collection: Gather relevant data from various sources.
2. Data Cleaning: Handle missing values, remove duplicates, and address
inconsistencies.
3. Data Transformation: Normalize or standardize data if necessary.
4. Descriptive Statistics: Calculate summary statistics to understand distributions.
5. Data Visualization: Use visual methods to explore relationships and patterns.
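The steps above can be sketched in a few lines of base R. This is a minimal illustration, assuming the built-in airquality dataset (chosen here because it contains real missing values; it is not part of the original notes) and using row deletion as one simple cleaning strategy among several.

```r
# Data collection: load a built-in dataset that contains missing values
data(airquality)

# Inspect structure: variable names, types, and first values
str(airquality)

# Data cleaning: count missing values per column and duplicated rows
na_counts <- colSums(is.na(airquality))
n_duplicates <- sum(duplicated(airquality))
print(na_counts)
print(n_duplicates)

# Keep only complete rows (one simple strategy; imputation is another)
aq_clean <- na.omit(airquality)

# Data transformation: standardize the measurements to mean 0, sd 1
measure_cols <- c("Ozone", "Solar.R", "Wind", "Temp")
aq_scaled <- scale(aq_clean[, measure_cols])

# Descriptive statistics: five-number summary and mean per column
print(summary(aq_clean))

# Data visualization: pairwise scatter plots of the measurements
pairs(aq_clean[, measure_cols],
      main = "Pairwise Relationships in airquality")
```

Whether to drop or impute incomplete rows depends on how much data is missing and why; na.omit() is used here only to keep the sketch short.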
1. Box Plots
Box plots are useful for visualizing the distribution of numerical data and spotting outliers.
Example:
# Load the built-in mtcars dataset (no extra packages are needed)
data(mtcars)
# Create a box plot for 'mpg' by 'cyl' (number of cylinders)
boxplot(mpg ~ cyl, data = mtcars,
main = "Box Plot of MPG by Number of Cylinders",
xlab = "Number of Cylinders",
ylab = "Miles Per Gallon (MPG)",
col = "lightblue",
border = "darkblue")
# Add grid
grid()
Explanation:
mpg ~ cyl: This formula groups 'mpg' by 'cyl', so one box is drawn for each cylinder count.
main, xlab, and ylab: Customize the title and axis labels.
col: Specifies the color of the boxes.
border: Sets the color of the box borders.
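A box plot shows outliers as points, but it does not say which observations they are. One possible way to list them is boxplot.stats(), which applies the same 1.5 × IQR whisker rule that boxplot() uses by default. The choice of the 'hp' column here is only for illustration.

```r
data(mtcars)

# Compute the statistics that boxplot() would draw for horsepower
hp_stats <- boxplot.stats(mtcars$hp)

# Lower whisker, lower hinge, median, upper hinge, upper whisker
print(hp_stats$stats)

# Values that fall beyond the whiskers
outliers <- hp_stats$out
print(outliers)

# Look up the full rows for those values
outlier_rows <- mtcars[mtcars$hp %in% outliers, c("hp", "mpg", "cyl")]
print(outlier_rows)
```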
2. Bar Charts
Bar charts are used for comparing categorical data.
Example:
# Create a bar chart of the number of cars for each cylinder count
barplot(table(mtcars$cyl),
main = "Number of Cars by Cylinder Count",
xlab = "Number of Cylinders",
ylab = "Count",
col = c("lightgreen", "orange", "lightblue"),
beside = TRUE)
# Add legend
legend("topright", legend = levels(as.factor(mtcars$cyl)), fill =
c("lightgreen", "orange", "lightblue"))
Explanation:
table(mtcars$cyl): Creates a frequency table of 'cyl'.
beside: If set to TRUE, bars for a matrix or two-way table are placed side by side instead of stacked; it has no visible effect on a one-dimensional table like this one.
legend: Adds a legend for clarity.
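The beside argument matters once the input has two dimensions. The sketch below cross-tabulates gear count by cylinder count (the 'gear' variable is an illustrative choice, not part of the example above) and draws grouped bars, using barplot()'s own legend.text and args.legend arguments for the legend.

```r
data(mtcars)

# Two-way frequency table: rows are gear counts, columns are cylinder counts
counts <- table(mtcars$gear, mtcars$cyl)
bar_cols <- c("lightgreen", "orange", "lightblue")  # one colour per gear count

# beside = TRUE draws one bar per gear count within each cylinder group
barplot(counts,
        beside = TRUE,
        col = bar_cols,
        main = "Cars by Cylinder and Gear Count",
        xlab = "Number of Cylinders",
        ylab = "Count",
        legend.text = paste(rownames(counts), "gears"),
        args.legend = list(x = "topleft"))
```

With beside = FALSE (the default), the same table would be drawn as stacked bars, which makes totals easier to compare and individual groups harder.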
3. Line Plots
Line plots are excellent for showing trends over time or across any other ordered variable.
Example:
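# Create a line plot of 'mpg' vs. 'hp' (horsepower)
# Sort by horsepower first so the line runs left to right instead of zigzagging
cars_by_hp <- mtcars[order(mtcars$hp), ]
plot(cars_by_hp$hp, cars_by_hp$mpg, type = "o", col = "blue",
main = "MPG vs Horsepower",
xlab = "Horsepower",
ylab = "Miles Per Gallon (MPG)")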
# Add grid
grid()
Explanation:
type = "o": This indicates both points and lines should be drawn.
The col parameter sets the color of the lines and points.
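For a trend over time, a dataset with a time index is a more natural fit than mtcars. The following sketch assumes the built-in Nile series (annual river flow, a choice made here for illustration) and adds a dashed reference line at the overall mean.

```r
# Built-in annual time series; extract the time index and values as plain vectors
years <- as.numeric(time(Nile))
flow <- as.numeric(Nile)

# type = "l" draws lines only, which suits a long, evenly spaced series
plot(years, flow, type = "l", col = "blue",
     main = "Annual Flow of the Nile",
     xlab = "Year",
     ylab = "Flow")

# Dashed horizontal line at the overall mean for reference
abline(h = mean(flow), lty = 2, col = "darkgray")
grid()
```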
4. Heatmaps
Heatmaps visualize data through color gradients and can highlight correlations or patterns.
Example:
# Create a correlation matrix and visualize it with a heatmap
cor_matrix <- cor(mtcars)
heatmap(cor_matrix,
main = "Correlation Heatmap",
col = heat.colors(256),
scale = "none", # correlations already share a -1 to 1 scale
margins = c(5, 5))
Explanation:
cor(mtcars): Computes the correlation matrix for the dataset.
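heat.colors(256): Generates a palette of 256 colors.
scale: Centers and scales the values by "row" or "column" before colors are assigned; "none" leaves them unchanged, which suits a correlation matrix because its values already share a -1 to 1 scale.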
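Colors give an overview, but exact values are easier to read from the matrix itself. As a small complement to the heatmap, the sketch below pulls out the correlations of every variable with 'mpg' and sorts them. Focusing on 'mpg' is an illustrative choice.

```r
data(mtcars)
cor_matrix <- cor(mtcars)

# Correlations of each other variable with mpg, sorted from most negative to most positive
mpg_cor <- sort(cor_matrix["mpg", colnames(cor_matrix) != "mpg"])
print(round(mpg_cor, 2))
```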
Customizing Axes, Labels, and Colors
You can further customize plots with various parameters:
Axis Limits: Use xlim and ylim to set limits.
Font Size: Control with cex for points or cex.lab for labels.
Colors: Use color palettes like rainbow(), heat.colors(), etc.
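These parameters can be combined in a single call. The sketch below is one possible arrangement, assuming a scatter plot of weight against MPG coloured by cylinder count; the specific limits and sizes are illustrative values, not recommendations.

```r
data(mtcars)

# One colour per cylinder count, taken from a built-in palette
cyl_factor <- factor(mtcars$cyl)
palette_cols <- rainbow(nlevels(cyl_factor))

plot(mtcars$wt, mtcars$mpg,
     xlim = c(1, 6), ylim = c(10, 35),       # axis limits
     pch = 19, cex = 1.2,                    # filled points, slightly enlarged
     cex.lab = 1.3, cex.axis = 0.9,          # axis label and tick label sizes
     cex.main = 1.4,                         # title size
     col = palette_cols[as.integer(cyl_factor)],
     main = "MPG vs Weight by Cylinder Count",
     xlab = "Weight (1000 lbs)",
     ylab = "Miles Per Gallon (MPG)")

# Legend that maps each colour back to its cylinder count
legend("topright", legend = levels(cyl_factor),
       col = palette_cols, pch = 19, title = "Cylinders")
```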
Descriptive Statistics: Spotting Problems with Data and Visualization
Descriptive statistics summarize the main features of a dataset, allowing for a
better understanding of the data's characteristics. They include measures such as mean, median,
mode, variance, standard deviation, and range. Additionally, visualizations like histograms,
box plots, and scatter plots can help identify anomalies or issues in the data.
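As a rough illustration of how the three plot types expose problems, the sketch below looks at the Ozone column of the built-in airquality dataset (an assumed example dataset). The summary reports missing values, the histogram shows skew, the box plot flags extreme readings, and the scatter plot shows how Ozone relates to temperature.

```r
data(airquality)

# Missing values are a common problem; summary() reports them as NA's
n_missing <- sum(is.na(airquality$Ozone))
print(n_missing)
print(summary(airquality$Ozone))

# Draw three plots side by side, remembering the previous layout
old_par <- par(mfrow = c(1, 3))

hist(airquality$Ozone,
     main = "Histogram of Ozone", xlab = "Ozone", col = "lightblue")

boxplot(airquality$Ozone,
        main = "Box Plot of Ozone", ylab = "Ozone", col = "lightgreen")

plot(airquality$Temp, airquality$Ozone,
     main = "Ozone vs Temperature", xlab = "Temperature", ylab = "Ozone",
     pch = 19, col = "orange")

# Restore the previous plotting layout
par(old_par)
```

Note that all three plotting functions silently drop missing values, so the count from is.na() is worth checking before trusting the pictures.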
Key Descriptive Statistics
1. Mean: The average value.
2. Median: The middle value when data is sorted.
3. Mode: The most frequently occurring value.
4. Variance: The average squared deviation from the mean (R's var() computes the sample version, dividing by n - 1).
5. Standard Deviation: The square root of variance; it provides insight into data spread.
6. Range: The difference between the maximum and minimum values.
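Each of these measures can be computed in base R. The sketch below uses the 'mpg' column of mtcars. Base R has no built-in function for the statistical mode, so stat_mode is a hypothetical helper written for this example; it returns every value tied for the highest frequency.

```r
data(mtcars)
x <- mtcars$mpg

# Hypothetical helper: most frequent value(s) of a numeric vector
stat_mode <- function(v) {
  freq <- table(v)
  as.numeric(names(freq)[freq == max(freq)])
}

desc_stats <- list(
  mean     = mean(x),
  median   = median(x),
  mode     = stat_mode(x),
  variance = var(x),          # sample variance (n - 1 denominator)
  sd       = sd(x),           # square root of the sample variance
  range    = diff(range(x))   # range() returns c(min, max); diff() gives the spread
)

str(desc_stats)
```

Passing na.rm = TRUE to mean(), median(), var(), sd(), and range() is needed when the data contains missing values; otherwise these functions return NA.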