
Data Science with R: mtcars Analysis

The document summarizes the mtcars dataset in R. It displays the first 6 rows, provides summary statistics of the variables, and explores relationships between variables through plots and linear regression. Key details include there being 32 observations across 11 variables, with mpg ranging from 10.4 to 33.9 and summary plots exploring the distribution of mpg and its negative correlation with wt.

Uploaded by

PARIDHI DEVAL
Copyright
© All Rights Reserved

> data(mtcars)

> #view first six rows of mtcars dataset


> head(mtcars)
                   mpg cyl disp  hp drat    wt  qsec vs am gear carb
Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1
> #summarize mtcars dataset
> summary(mtcars)
      mpg             cyl             disp             hp             drat
 Min.   :10.40   Min.   :4.000   Min.   : 71.1   Min.   : 52.0   Min.   :2.760
 1st Qu.:15.43   1st Qu.:4.000   1st Qu.:120.8   1st Qu.: 96.5   1st Qu.:3.080
 Median :19.20   Median :6.000   Median :196.3   Median :123.0   Median :3.695
 Mean   :20.09   Mean   :6.188   Mean   :230.7   Mean   :146.7   Mean   :3.597
 3rd Qu.:22.80   3rd Qu.:8.000   3rd Qu.:326.0   3rd Qu.:180.0   3rd Qu.:3.920
 Max.   :33.90   Max.   :8.000   Max.   :472.0   Max.   :335.0   Max.   :4.930
       wt             qsec             vs               am              gear
 Min.   :1.513   Min.   :14.50   Min.   :0.0000   Min.   :0.0000   Min.   :3.000
 1st Qu.:2.581   1st Qu.:16.89   1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:3.000
 Median :3.325   Median :17.71   Median :0.0000   Median :0.0000   Median :4.000
 Mean   :3.217   Mean   :17.85   Mean   :0.4375   Mean   :0.4062   Mean   :3.688
 3rd Qu.:3.610   3rd Qu.:18.90   3rd Qu.:1.0000   3rd Qu.:1.0000   3rd Qu.:4.000
 Max.   :5.424   Max.   :22.90   Max.   :1.0000   Max.   :1.0000   Max.   :5.000
      carb
 Min.   :1.000
 1st Qu.:2.000
 Median :2.000
 Mean   :2.812
 3rd Qu.:4.000
 Max.   :8.000
> #display rows and columns
> dim(mtcars)
[1] 32 11
> #display column names
> names(mtcars)
[1] "mpg" "cyl" "disp" "hp" "drat" "wt" "qsec" "vs" "am" "gear" "carb"
> #create histogram of values for mpg
> hist(mtcars$mpg,
+ col='steelblue',
+ main='Histogram',
+ xlab='mpg',
+ ylab='Frequency')
> #create boxplot of values for mpg
> boxplot(mtcars$mpg,
+ main='Distribution of mpg values',
+ ylab='mpg',
+ col='steelblue',
+ border='black')
> #create scatterplot of mpg vs. wt
> plot(mtcars$mpg, mtcars$wt,
+ col='steelblue',
+ main='Scatterplot',
+ xlab='mpg',
+ ylab='wt',
+ pch=19)
> # Number of rows (observations)
> nrow(mtcars)
[1] 32
> # Number of columns (variables)
> ncol(mtcars)
[1] 11
> plot(mpg ~ wt, data = mtcars, col=2)
> fit <- lm(mpg ~ wt, data = mtcars)
> summary(fit)

Call:
lm(formula = mpg ~ wt, data = mtcars)

Residuals:
    Min      1Q  Median      3Q     Max
-4.5432 -2.3647 -0.1252  1.4096  6.8727

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  37.2851     1.8776  19.858  < 2e-16 ***
wt           -5.3445     0.5591  -9.559 1.29e-10 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 3.046 on 30 degrees of freedom
Multiple R-squared: 0.7528, Adjusted R-squared: 0.7446
F-statistic: 91.38 on 1 and 30 DF, p-value: 1.294e-10

> abline(fit,col=3,lwd=2)
> mtext(lmlab, 3, line=-2)
Error in as.graphicsAnnot(text) : object 'lmlab' not found
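The call above fails because no object named lmlab exists in the session. The transcript does not show how the label was meant to be built, so the following is only a sketch under an assumption: that lmlab was intended to hold the fitted line written out as text. The label format is hypothetical.

```r
# Refit the model from the transcript so this block is self-contained.
fit <- lm(mpg ~ wt, data = mtcars)

# Assumed label text: the fitted line with coefficients rounded to 2 decimals.
b <- round(coef(fit), 2)
lmlab <- sprintf("mpg = %.2f %+.2f * wt", b[1], b[2])

# Redraw the scatterplot, add the regression line, then place the label
# just inside the top edge of the plot region (side = 3, negative line).
plot(mpg ~ wt, data = mtcars, col = 2)
abline(fit, col = 3, lwd = 2)
mtext(lmlab, side = 3, line = -2)
```

With lmlab defined before the mtext() call, the annotation step should run without the "object not found" error.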
> # Create a vector.
> x <- c(12,7,3,4.2,18,2,54,-21,8,-5)
> # Find Mean.
> result.mean <- mean(x)
> print(result.mean)
[1] 8.22
> # Create the function.
> getmode <- function(v) {
+ uniqv <- unique(v)
+ uniqv[which.max(tabulate(match(v, uniqv)))]
+ }
> # Create the vector with numbers.
> v <- c(2,1,2,3,1,2,3,4,1,5,5,3,2,3)
> # Calculate the mode using the user function.
> result <- getmode(v)
> print(result)
[1] 2
> # Create the vector with characters.
> charv <- c("o","it","the","it","it")
> # Calculate the mode using the user function.
> result <- getmode(charv)
> print(result)
[1] "it"
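One caveat about the function above: which.max() returns only the first maximum, so getmode() reports a single value even when several values tie. In the numeric vector used here, 2 and 3 each occur four times. A minimal sketch of a tie-aware variant follows; the name getmodes is hypothetical and not part of the original session.

```r
# Hypothetical tie-aware variant: return every value that attains the
# maximum count, in order of first appearance.
getmodes <- function(v) {
  uniqv <- unique(v)
  counts <- tabulate(match(v, uniqv))  # occurrences of each unique value
  uniqv[counts == max(counts)]
}

v <- c(2, 1, 2, 3, 1, 2, 3, 4, 1, 5, 5, 3, 2, 3)
print(getmodes(v))                          # both 2 and 3 occur four times
print(getmodes(c("o", "it", "the", "it", "it")))
```

Whether a single mode or all tied modes is wanted depends on the analysis; the original getmode() is fine when ties are impossible or irrelevant.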
> x <- c(151, 174, 138, 186, 128, 136, 179, 163, 152, 131)
> y <- c(63, 81, 56, 91, 47, 57, 76, 72, 62, 48)
> # Apply the lm() function.
> relation <- lm(y~x)
> print(relation)

Call:
lm(formula = y ~ x)

Coefficients:
(Intercept) x
-38.4551 0.6746
> # Apply the lm() function.
> relation <- lm(y~x)
> print(summary(relation))

Call:
lm(formula = y ~ x)

Residuals:
    Min      1Q  Median      3Q     Max
-6.3002 -1.6629  0.0412  1.8944  3.9775

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) -38.45509    8.04901  -4.778  0.00139 **
x             0.67461    0.05191  12.997 1.16e-06 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 3.253 on 8 degrees of freedom
Multiple R-squared: 0.9548, Adjusted R-squared: 0.9491
F-statistic: 168.9 on 1 and 8 DF, p-value: 1.164e-06
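A fitted model like the one above is typically used for prediction. The sketch below is an illustration only: the new height of 170 and the name new_height are assumptions, not values from the original session.

```r
# Height (x) and weight (y) vectors as entered in the transcript.
x <- c(151, 174, 138, 186, 128, 136, 179, 163, 152, 131)
y <- c(63, 81, 56, 91, 47, 57, 76, 72, 62, 48)
relation <- lm(y ~ x)

# predict() expects a data frame whose column name matches the
# predictor name used in the formula (here: x).
new_height <- data.frame(x = 170)  # hypothetical new observation
pred <- predict(relation, newdata = new_height)
print(pred)
```

With the coefficients printed above (-38.45509 and 0.67461), the prediction for a height of 170 works out to roughly 76.2.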

> # Create a sequence of numbers from 32 to 44.


> print(seq(32,44))
[1] 32 33 34 35 36 37 38 39 40 41 42 43 44
> # Find mean of numbers from 25 to 82.
> print(mean(25:82))
[1] 53.5
> # Find sum of numbers from 41 to 68.
> print(sum(41:68))
[1] 1526
> # Create a function to print squares of numbers in sequence.
> new.function <- function(a) {
+ for(i in 1:a) {
+ b <- i^2
+ print(b)
+ }
+ }
> # Call the function new.function supplying 6 as an argument.
> new.function(6)
[1] 1
[1] 4
[1] 9
[1] 16
[1] 25
[1] 36
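The loop above prints one square per iteration. As a hedged alternative, the same values can be produced in one vectorized step and returned rather than printed; the function name squares_upto is hypothetical.

```r
# Hypothetical helper: return the squares 1^2 .. a^2 as a numeric vector.
squares_upto <- function(a) {
  # seq_len(a) is empty when a is 0, whereas 1:a would count down (1, 0).
  seq_len(a)^2
}

print(squares_upto(6))
```

Returning a vector makes the result reusable, for example in sum() or plot(), and seq_len() avoids the surprising behaviour of 1:a when a is zero.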
> # Create a vector.
> apple <- c('red','green',"yellow")
> print(apple)
[1] "red" "green" "yellow"
> # Get the class of the vector.
> print(class(apple))
[1] "character"
> # Create a list.
> list1 <- list(c(2,5,3),21.3,sin)
> # Print the list.
> print(list1)
[[1]]
[1] 2 5 3

[[2]]
[1] 21.3

[[3]]
function (x) .Primitive("sin")

> # Create a matrix.
> M = matrix( c('a','a','b','c','b','a'), nrow = 2, ncol = 3, byrow = TRUE)
> print(M)
     [,1] [,2] [,3]
[1,] "a"  "a"  "b"
[2,] "c"  "b"  "a"
> # Create an array.
> a <- array(c('green','yellow'),dim = c(3,3,2))
> print(a)
, , 1

     [,1]     [,2]     [,3]
[1,] "green"  "yellow" "green"
[2,] "yellow" "green"  "yellow"
[3,] "green"  "yellow" "green"

, , 2

     [,1]     [,2]     [,3]
[1,] "yellow" "green"  "yellow"
[2,] "green"  "yellow" "green"
[3,] "yellow" "green"  "yellow"

> # Create a vector.
> apple_colors <- c('green','green','yellow','red','red','red','green')
> # Create a factor object.
> factor_apple <- factor(apple_colors)
> # Print the factor.
> print(factor_apple)
[1] green green yellow red red red green
Levels: green red yellow
> print(nlevels(factor_apple))
[1] 3
> # Create the data frame.
> BMI <- data.frame(
+ gender = c("Male", "Male","Female"),
+ height = c(152, 171.5, 165),
+ weight = c(81,93, 78),
+ Age = c(42,38,26)
+ )
> print(BMI)
gender height weight Age
1 Male 152.0 81 42
2 Male 171.5 93 38
3 Female 165.0 78 26
> # Assignment using equal operator.
> var.1 = c(0,1,2,3)
> # Assignment using leftward operator.
> var.2 <- c("learn","R")
> # Assignment using rightward operator.
> c(TRUE,1) -> var.3
> print(var.1)
[1] 0 1 2 3
> cat ("var.1 is ", var.1 ,"\n")
var.1 is 0 1 2 3
> cat ("var.2 is ", var.2 ,"\n")
var.2 is learn R
> cat ("var.3 is ", var.3 ,"\n")
var.3 is 1 1
> a <- "Hello"
> b <- 'How'
> c <- "are you? "
> print(paste(a,b,c))
[1] "Hello How are you? "
> print(paste(a,b,c, sep = "-"))
[1] "Hello-How-are you? "
> print(paste(a,b,c, sep = "", collapse = ""))
[1] "HelloHoware you? "
> # Total number of digits displayed. Last digit rounded off.
> result <- format(23.123456789, digits = 9)
> print(result)
[1] "23.1234568"
> # Display numbers in scientific notation.
> result <- format(c(6, 13.14521), scientific = TRUE)
> print(result)
[1] "6.000000e+00" "1.314521e+01"
> # The minimum number of digits to the right of the decimal point.
> result <- format(23.47, nsmall = 5)
> print(result)
[1] "23.47000"
> # Format treats everything as a string.
> result <- format(6)
> print(result)
[1] "6"
> # Numbers are padded with blanks at the beginning to reach the width.
> result <- format(13.7, width = 6)
> print(result)
[1] "  13.7"
> # Left justify strings.
> result <- format("Hello", width = 8, justify = "l")
> print(result)
[1] "Hello   "
> # Justify string with center.
> result <- format("Hello", width = 8, justify = "c")
> print(result)
[1] " Hello  "
>
> # Accessing vector elements using position.
> t <- c("Sun","Mon","Tue","Wed","Thurs","Fri","Sat")
> u <- t[c(2,3,6)]
> print(u)
[1] "Mon" "Tue" "Fri"
> # Accessing vector elements using logical indexing.
> v <- t[c(TRUE,FALSE,FALSE,FALSE,FALSE,TRUE,FALSE)]
> print(v)
[1] "Sun" "Fri"
> # Accessing vector elements using negative indexing.
> x <- t[c(-2,-5)]
> print(x)
[1] "Sun" "Tue" "Wed" "Fri" "Sat"
> # Accessing vector elements using 0/1 indexing.
> y <- t[c(0,0,0,0,0,0,1)]
> print(y)
[1] "Sun"
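The last result can look surprising: a numeric 0/1 vector is treated as positions, not as a mask. Zeros select nothing and the single 1 selects the first element, which is why the output is "Sun" rather than "Sat". A small sketch of the difference, using the hypothetical names days and zero_one:

```r
days <- c("Sun", "Mon", "Tue", "Wed", "Thurs", "Fri", "Sat")
zero_one <- c(0, 0, 0, 0, 0, 0, 1)

# Numeric index: zeros are dropped, 1 means "first element".
by_position <- days[zero_one]

# Logical mask: TRUE in the 7th slot selects the 7th element.
by_mask <- days[as.logical(zero_one)]

print(by_position)
print(by_mask)
```

Converting with as.logical() (or comparing with zero_one == 1) is the usual way to get mask semantics from 0/1 data.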
> # Create a list containing a vector, a matrix and a list.
> list_data <- list(c("Jan","Feb","Mar"),
+ matrix(c(3,9,5,1,-2,8), nrow = 2),
+ list("green",12.3))
> # Give names to the elements in the list.
> names(list_data) <- c("1st Quarter", "A_Matrix", "A Inner list")
> # Access the first element of the list.
> print(list_data[1])
$`1st Quarter`
[1] "Jan" "Feb" "Mar"

> # Access the third element. As it is also a list, all its elements will be printed.
> print(list_data[3])
$`A Inner list`
$`A Inner list`[[1]]
[1] "green"

$`A Inner list`[[2]]
[1] 12.3

> # Access the list element using the name of the element.
> print(list_data$A_Matrix)
[,1] [,2] [,3]
[1,] 3 5 -2
[2,] 9 1 8
> # Define the column and row names.
> rownames = c("row1", "row2", "row3", "row4")
> colnames = c("col1", "col2", "col3")
> # Create the matrix.
> P <- matrix(c(3:14), nrow = 4, byrow = TRUE, dimnames = list(rownames, colnames))
> # Access the element at 3rd column and 1st row.
> print(P[1,3])
[1] 5
> # Access the element at 2nd column and 4th row.
> print(P[4,2])
[1] 13
> # Access only the 2nd row.
> print(P[2,])
col1 col2 col3
6 7 8
> # Access only the 3rd column.
> print(P[,3])
row1 row2 row3 row4
5 8 11 14
> # Create two vectors of different lengths.
> vector1 <- c(5,9,3)
> vector2 <- c(10,11,12,13,14,15)
> # Take these vectors as input to the array.
> array1 <- array(c(vector1,vector2),dim = c(3,3,2))
> # Create two more vectors of different lengths.
> vector3 <- c(9,1,0)
> vector4 <- c(6,0,11,3,14,1,2,6,9)
> array2 <- array(c(vector3,vector4),dim = c(3,3,2))
> # Create matrices from the second slice of each array.
> matrix1 <- array1[,,2]
> matrix2 <- array2[,,2]
> # Add the matrices.
> result <- matrix1+matrix2
> print(result)
     [,1] [,2] [,3]
[1,]    7   19   19
[2,]   15   12   14
[3,]   12   12   26
> # Create the vectors for data frame.
> height <- c(132,151,162,139,166,147,122)
> weight <- c(48,49,66,53,67,52,40)
> gender <- c("male","male","female","female","male","female","male")
> # Create the data frame.
> input_data <- data.frame(height,weight,gender)
> print(input_data)
height weight gender
1 132 48 male
2 151 49 male
3 162 66 female
4 139 53 female
5 166 67 male
6 147 52 female
7 122 40 male
> # Test if the gender column is a factor.
> print(is.factor(input_data$gender))
[1] FALSE
> # Print the gender column to see its values.
> print(input_data$gender)
[1] "male" "male" "female" "female" "male" "female" "male"
> # Create the data frame.
> emp.data <- data.frame(
+ emp_id = c (1:5),
+ emp_name = c("Rick","Dan","Michelle","Ryan","Gary"),
+ salary = c(623.3,515.2,611.0,729.0,843.25),
+ start_date = as.Date(c("2012-01-01", "2013-09-23", "2014-11-15", "2014-05-11",
+ "2015-03-27")),
+ stringsAsFactors = FALSE
+ )
> # Print the summary.
> print(summary(emp.data))
     emp_id    emp_name             salary        start_date
 Min.   :1   Length:5           Min.   :515.2   Min.   :2012-01-01
 1st Qu.:2   Class :character   1st Qu.:611.0   1st Qu.:2013-09-23
 Median :3   Mode  :character   Median :623.3   Median :2014-05-11
 Mean   :3                      Mean   :664.4   Mean   :2014-01-14
 3rd Qu.:4                      3rd Qu.:729.0   3rd Qu.:2014-11-15
 Max.   :5                      Max.   :843.2   Max.   :2015-03-27
>
> # Create vector objects.
> city <- c("Tampa","Seattle","Hartford","Denver")
> state <- c("FL","WA","CT","CO")
> zipcode <- c(33602,98104,06161,80294)
> # Combine the three vectors column-wise; cbind() returns a character matrix, not a data frame.
> addresses <- cbind(city,state,zipcode)
> # Print a header.
> cat("# # # # The First data frame\n")
# # # # The First data frame
> # Print the data frame.
> print(addresses)
city state zipcode
[1,] "Tampa" "FL" "33602"
[2,] "Seattle" "WA" "98104"
[3,] "Hartford" "CT" "6161"
[4,] "Denver" "CO" "80294"
> # Create another data frame with similar columns.
> new.address <- data.frame(
+ city = c("Lowry","Charlotte"),
+ state = c("CO","FL"),
+ zipcode = c("80230","33949"),
+ stringsAsFactors = FALSE
+ )
> # Print a header.
> cat("# # # The Second data frame\n")
# # # The Second data frame
> # Print the data frame.
> print(new.address)
       city state zipcode
1     Lowry    CO   80230
2 Charlotte    FL   33949
> # Combine rows from both the data frames.
> all.addresses <- rbind(addresses,new.address)
> # Print a header.
> cat("# # # The combined data frame\n")
# # # The combined data frame
> # Print the result.
> print(all.addresses)
city state zipcode
1 Tampa FL 33602
2 Seattle WA 98104
3 Hartford CT 6161
4 Denver CO 80294
5 Lowry CO 80230
6 Charlotte FL 33949
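Note that the Hartford zip code prints as 6161: the numeric literal 06161 loses its leading zero as soon as R parses it. A minimal sketch of two ways around this, under the assumption that zip codes should be kept as five-character strings; the names zipcode_num, zipcode_chr, repaired and addresses_df are hypothetical.

```r
city <- c("Tampa", "Seattle", "Hartford", "Denver")
state <- c("FL", "WA", "CT", "CO")

# Option 1: enter zip codes as character strings from the start.
zipcode_chr <- c("33602", "98104", "06161", "80294")

# Option 2: repair numeric zip codes by zero-padding to width 5.
zipcode_num <- c(33602, 98104, 6161, 80294)  # how R stores 06161
repaired <- sprintf("%05d", as.integer(zipcode_num))

# Building a real data frame (rather than a cbind() matrix) keeps
# each column's own type.
addresses_df <- data.frame(city, state, zipcode = zipcode_chr,
                           stringsAsFactors = FALSE)
print(addresses_df)
```

Either option preserves the leading zero; storing identifiers such as zip codes as character data is the more common convention.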

Common questions


The transcript demonstrates four ways to index a vector: by position, by logical mask, by negative index (which excludes the listed positions), and by a numeric 0/1 index. The 0/1 case is positional rather than a mask: zeros select nothing and 1 selects the first element, which is why t[c(0,0,0,0,0,0,1)] returns "Sun". Knowing which form is in play matters for correct data selection and retrieval.

The fit of a linear model is often assessed through its residuals. For the 'mpg' vs 'wt' regression, residuals range from -4.54 to 6.87 mpg with a residual standard error of 3.046, so individual predictions can miss by several mpg even though the R-squared of 0.7528 indicates a fairly strong fit. In the height-to-weight regression, residuals range from -6.30 to 3.98 and the R-squared is 0.9548. Both models are statistically significant, as shown by their low p-values, though significance alone does not guarantee accurate predictions on new data.

Arrays, data frames, and lists serve different purposes in handling multivariate data. Arrays store values of a single type in a fixed number of dimensions; in the transcript they hold both numbers and character strings. Data frames store columns of different types side by side, which suits statistical modeling and analysis, as seen with 'emp.data'. Lists can hold arbitrary objects, such as a vector, a matrix, and another list, within a single structure.

Imagine processing a dataset of sales figures across regions. Using the assignment operators shown in the transcript, you could store each region's sales in its own vector and then summarize performance with functions such as mean() or the custom getmode() function. Combined with logical indexing to select subsets, these few building blocks cover most basic summarization tasks.

Data frames are the standard structure for tabular data in R. Their advantage lies in allowing a different data type in each column while keeping the values in each row together as one record. For example, the 'emp.data' data frame combines employee attributes such as name, salary, and start date, so summary() can report a suitable summary for each column. This handling of mixed-type data is what most practical datasets require.
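As a brief illustration of row-wise coherence, the sketch below rebuilds 'emp.data' from the transcript and filters it with a logical condition. The salary threshold of 700 and the name high_paid are assumptions made for the example.

```r
emp.data <- data.frame(
  emp_id = 1:5,
  emp_name = c("Rick", "Dan", "Michelle", "Ryan", "Gary"),
  salary = c(623.3, 515.2, 611.0, 729.0, 843.25),
  start_date = as.Date(c("2012-01-01", "2013-09-23", "2014-11-15",
                         "2014-05-11", "2015-03-27")),
  stringsAsFactors = FALSE
)

# Keep only rows above a hypothetical salary threshold, and two columns.
high_paid <- emp.data[emp.data$salary > 700, c("emp_name", "salary")]
print(high_paid)
```

The row filter and the column selection happen in one indexing step, and each remaining row still carries matching name and salary values.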

The 'mpg' attribute is negatively correlated with the 'wt' attribute, as shown by the linear model fit to the mtcars dataset. The slope estimate indicates that each one-unit increase in weight is associated with a decrease of 5.3445 in mpg. The R-squared of 0.7528 means weight explains about 75% of the variance in mpg; the negative direction comes from the sign of the slope, not from R-squared itself. The p-value of 1.29e-10 indicates the relationship is statistically significant.
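For a simple regression with one predictor, R-squared equals the squared Pearson correlation, so the sign and strength can be checked directly. A small sketch using the built-in mtcars data; the variable names r and r2 are illustrative.

```r
# Pearson correlation between mpg and wt: its sign gives the direction.
r <- cor(mtcars$mpg, mtcars$wt)

# R-squared from the same simple linear model as in the transcript.
fit <- lm(mpg ~ wt, data = mtcars)
r2 <- summary(fit)$r.squared

print(r)     # negative: heavier cars tend to have lower mpg
print(r^2)   # matches the model's R-squared
print(r2)
```

This identity holds only for models with a single predictor and an intercept; with several predictors, R-squared is no longer the square of any one pairwise correlation.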

Matrices organize values of one type into a two-dimensional grid of rows and columns. The transcript builds them in two ways: directly with matrix(), and by slicing a three-dimensional array created with array(), where the input vectors are concatenated and recycled to fill the requested dimensions. Matrices of equal dimensions can then be combined element by element, as in the matrix addition example.

The mode of a dataset is the value that appears most frequently. The transcript computes it for numeric and character data with a custom function. For the numeric vector the function returns 2, although 3 occurs equally often (four times each); the function reports only the first value reaching the maximum count. For the character vector the mode is 'it'.

The linear regression of weight on height fits these ten observations well. The R-squared of 0.9548 means the regression line explains about 95% of the variance in weight, and the p-value of 1.16e-06 for the slope is statistically significant. With so few observations, however, the model assumptions should be checked before relying on it.

Factor levels define the distinct categories that values of a categorical variable can take. In the 'apple_colors' example the levels are 'green', 'red', and 'yellow', sorted alphabetically by default. Storing the data as a factor makes it straightforward to count, compare, and summarize observations per category, and nlevels() reports how many categories exist.
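A short sketch of per-category counting with the factor from the transcript; table() is used here as one common way to tabulate levels, and the name counts is illustrative.

```r
apple_colors <- c('green', 'green', 'yellow', 'red', 'red', 'red', 'green')
factor_apple <- factor(apple_colors)

# Levels are sorted alphabetically by default.
print(levels(factor_apple))

# Count how many observations fall into each level.
counts <- table(factor_apple)
print(counts)
```

The counts come back in level order, so they line up with levels(factor_apple).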
