DSR Exam Guide: R Programming Essentials

Q: Describe how the modulo operator (%%) is used to identify even and odd numbers in R and its significance in algorithm development and data manipulation.

In R, the modulo operator is written %% (a single % is not an operator). It returns the remainder of a division, so the remainder after dividing by 2 tells whether a number is even or odd: if n %% 2 == 0 the number is even; otherwise it is odd. Applied to a vector, vec %% 2 == 0 returns one TRUE/FALSE value per element. This simple operation is useful in algorithm development for filtering, alternating operations, and pattern generation. It is particularly relevant for conditional processing or grouping, and it underlies more complex data manipulation tasks and conditional algorithms.
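As a small illustration of the grouping and alternating-pattern uses mentioned above, the following sketch splits a vector by parity and picks every second element. The variable names and values are illustrative only.

```r
# Illustrative values (not from the original guide)
vals <- c(10, 15, 22, 7, 4)

# %% returns the remainder; a remainder of 0 after dividing by 2 means even
is_even <- vals %% 2 == 0

# Group the values by parity with split()
groups <- split(vals, ifelse(is_even, "even", "odd"))
print(groups)

# Pick the elements at even positions (2, 4, ...) with the same operator
every_second <- vals[seq_along(vals) %% 2 == 0]
print(every_second)
```

Because %% is vectorized, no explicit loop is needed in either step.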

Q: What are the steps and functions involved in removing rows or columns with NA values in matrices, and why is this important for data analysis in R?

In R, rows or columns of a matrix that contain NA values can be removed by combining apply() with is.na() and all() to test each row or column. For instance, mat[!apply(is.na(mat), 1, all), ] removes the rows that contain only NA values, while mat[complete.cases(mat), ] keeps only the rows that have no missing values at all. Removing NAs is critical for data integrity and accuracy in statistical analysis, because many computations return NA or fail on incomplete data unless missing values are handled explicitly.

Q: Explain how factor variables are used in R and why they are considered memory efficient for categorical data?

Factor variables in R are used to store categorical data, both nominal and ordinal. Internally a factor is a vector of integer codes plus a table of unique levels: each observation stores only a code that indexes into the levels, and each level string is stored once. This keeps the representation compact when a large dataset repeats a small number of categories, although the actual saving depends on the data and the R version.
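The storage layout described above can be inspected directly. The sketch below uses made-up size categories; the reported object sizes will vary by R version and platform, so no specific numbers are claimed.

```r
# Illustrative data: a small set of categories repeated many times
sizes_chr <- rep(c("small", "medium", "large"), times = 1000)
sizes_fct <- factor(sizes_chr, levels = c("small", "medium", "large"))

# A factor is an integer vector of codes plus a 'levels' attribute
print(typeof(sizes_fct))            # "integer"
print(levels(sizes_fct))            # the unique categories, stored once
print(head(as.integer(sizes_fct)))  # the codes that index into the levels

# Compare the in-memory size of the two representations
print(object.size(sizes_chr))
print(object.size(sizes_fct))

# Ordinal data: an ordered factor supports comparisons between levels
sizes_ord <- factor(c("small", "large", "medium"),
                    levels = c("small", "medium", "large"),
                    ordered = TRUE)
print(sizes_ord < "large")
```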

Q: Discuss how RStudio complements the R programming language and its role in data analysis projects.

RStudio is an Integrated Development Environment (IDE) for R that enhances the productivity of R programmers by providing a user-friendly interface. It includes features like syntax highlighting, code completion, debugging tools, and integrated graphics, which facilitate efficient code writing, testing, and visualization. RStudio plays a key role in data analysis projects by streamlining the entire workflow from data import and cleaning to visualization and reporting, making it a valuable tool for both beginners and experienced data analysts.

Q: Describe the str() function usage in R for understanding data frames and mention what specific aspects it reveals about data frames.

The str() function in R provides a compact, human-readable summary of an R object's structure and is particularly useful for data frames. It reveals the type of object, the number of observations and variables, and a preview of the data. For data frames, it shows the data type of each column and the first few values, including the levels of factor variables and the format of date variables, which gives a quick check of the data's layout and types.

Q: R's data frame and matrix data structures offer distinct features. Compare these structures and discuss scenarios in which each is more appropriately used.

Data frames in R allow columns to contain different data types, akin to a table in a database, and they provide flexibility in data manipulation, making them more suitable for datasets where varying types of data need to be analyzed together. In contrast, matrices require all elements to be of the same data type and are used where uniform operations are performed across elements, as is typical in mathematical computations and array-based analyses. For data analysis tasks that involve heterogeneous data and require flexibility, data frames are preferable, whereas for numerical computations requiring uniform data types and efficiency, matrices are more appropriate.
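The practical consequence of the single-type rule can be seen in a short sketch. The names and ages below are invented for illustration.

```r
# Mixing types in a matrix coerces everything to one type (here: character)
m <- cbind(name = c("Ann", "Raj"), age = c(31, 45))
print(typeof(m))          # "character": the ages are now strings

# A data frame keeps each column's own type
df <- data.frame(name = c("Ann", "Raj"), age = c(31, 45))
print(sapply(df, class))

# Arithmetic works on the numeric data frame column
print(mean(df$age))
```

Calling mean() on the matrix column m[, "age"] would instead return NA with a warning, because the values are character strings.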

Q: How do the max(), min(), and range() functions differ in finding extreme values in R, and what are their computational efficiencies?

The max() and min() functions in R return the highest and lowest values in a vector, respectively, while range() returns both in a single call, which is convenient when both are needed. By default none of these functions skips missing values: if the vector contains an NA, the result is NA. Pass na.rm = TRUE to ignore missing values. Using range() saves a call when both extremes are needed, but it should be treated as a convenience rather than a guaranteed speed-up over calling max() and min() separately.
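A brief sketch of how these functions treat missing values, using an illustrative vector:

```r
x <- c(15, 8, NA, 42)   # illustrative vector with one missing value

print(max(x))                  # NA: missing values propagate by default
print(max(x, na.rm = TRUE))    # 42
print(min(x, na.rm = TRUE))    # 8
print(range(x, na.rm = TRUE))  # 8 42
```

The na.rm argument is part of base R's max(), min(), and range(); setting it explicitly makes the handling of missing data visible in the code.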

Q: What role does the 'order of a matrix' play in R, and how does it influence matrix operations and data structuring?

The order of a matrix in R, expressed as m×n, defines its dimensions: m rows and n columns. The order determines which operations are valid, because element-wise operations such as addition require identical dimensions and matrix multiplication requires the inner dimensions to agree. Misunderstood dimensions lead to errors or unintended results, so knowing the order is essential for correctly structuring and manipulating matrix data.
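The following sketch shows how the order can be checked with dim() and what happens when dimensions do not match. The matrices are illustrative.

```r
A <- matrix(1:6, nrow = 2, ncol = 3)   # order 2 x 3
B <- matrix(1:6, nrow = 3, ncol = 2)   # order 3 x 2

print(dim(A))   # 2 3
print(dim(B))   # 3 2

# Element-wise addition needs identical dimensions, so A + B fails;
# tryCatch() captures the error message instead of stopping the script
result <- tryCatch(A + B, error = function(e) conditionMessage(e))
print(result)

# Transposing B gives it the same order as A, so addition works
print(A + t(B))
```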

Q: How can different matrices be created in R using matrix(), rbind(), and cbind() methods, and what are the implications of using each method?

In R, the matrix() function creates a matrix from data together with the number of rows and columns. For instance, matrix(1:12, nrow=3, ncol=4) creates a 3x4 matrix with the numbers 1 to 12 filled by column. The rbind() and cbind() functions bind vectors or matrices by row and by column, respectively. Using rbind(c(1,2,3), c(4,5,6)) binds two vectors into a 2x3 matrix by rows, while cbind() with the same vectors produces a 3x2 matrix. The choice of method affects how the data is laid out: matrix() is typically used to create a new matrix from a single vector, while rbind() and cbind() are useful for building matrices from existing vectors or matrices.
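A short sketch of the layout differences described above. The byrow argument of matrix() is part of base R; the variable names are illustrative.

```r
# Default fill is by column; byrow = TRUE fills row by row instead
by_col <- matrix(1:6, nrow = 2)
by_row <- matrix(1:6, nrow = 2, byrow = TRUE)
print(by_col)
print(by_row)

# The same two vectors bound as rows (2 x 3) and as columns (3 x 2)
r <- rbind(c(1, 2, 3), c(4, 5, 6))
k <- cbind(c(1, 2, 3), c(4, 5, 6))
print(dim(r))
print(dim(k))
```

Here k is the transpose of r, which is a quick way to check that the two bindings hold the same data in different layouts.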

Q: What are some advantages and disadvantages of using R for statistical computing and data analysis, and how do these impact its use in academia and industry?

Advantages of R include being free and open source, offering comprehensive statistical capabilities, having a large active community, being extensible through thousands of packages, and running on multiple operating systems. Its disadvantages include memory constraints (data is held in RAM), a steep learning curve for beginners, slower execution than compiled languages, limitations in basic plotting, and inconsistent documentation across packages. These trade-offs make R well suited to academia and to tasks that benefit from a flexible and extensive statistical environment, while it can be a harder fit for tasks that demand high-performance execution or for users who are new to programming.

Uploaded by

misoxyash2205

DSR Exam Study Guide - R Programming

SHORT ANSWER QUESTIONS

1. Define R Programming
Answer: R is a free, open-source programming language and statistical computing environment
designed for data analysis, statistical modeling, and data visualization. It was developed by Ross Ihaka
and Robert Gentleman.

2. List out any five features of R

Answer:

Open Source: Free to use and modify

Statistical Analysis: Built-in statistical functions
Data Visualization: Excellent graphical capabilities

Cross-platform: Runs on Windows, Mac, Linux

Extensible: Thousands of packages available

3. What are the applications of R?

Answer:

Statistical analysis and modeling

Data mining and machine learning

Bioinformatics and genetics

Financial analysis

Market research and social sciences

4. What are the different data types in R?

Answer:

Numeric: Decimal numbers (3.14)

Integer: Whole numbers (5L)
Character: Text strings ("Hello")

Logical: TRUE/FALSE values

Complex: Complex numbers (3+2i)
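The types listed above can be checked with class() and typeof(). This is a minimal sketch using the same example values:

```r
print(class(3.14))     # "numeric"
print(class(5L))       # "integer"
print(class("Hello"))  # "character"
print(class(TRUE))     # "logical"
print(class(3+2i))     # "complex"

# Without the L suffix a whole number is stored as a double
print(typeof(5))       # "double"
print(is.integer(5))   # FALSE
```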

5. Demonstrate the simple 3X3 matrix

Code:

# Creating a 3x3 matrix

matrix1 <- matrix(1:9, nrow=3, ncol=3)
print(matrix1)

6. Define order of a Matrix

Answer: The order of a matrix is expressed as m×n, where m is the number of rows and n is the number
of columns. For example, a 3×4 matrix has 3 rows and 4 columns.

7. What are the 7 measures of central tendency?

Answer:

Mean (average)

Median (middle value)

Mode (most frequent value)

Geometric mean

Harmonic mean
Weighted mean

Trimmed mean
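Each of the measures listed above can be computed in a line or two of base R. The sketch below uses illustrative data and weights. Base R has no built-in function for the statistical mode (mode() reports the storage mode instead), so stat_mode is a hypothetical helper written for this example.

```r
x <- c(2, 4, 4, 5, 10)   # illustrative data
w <- c(1, 1, 2, 2, 4)    # illustrative weights

mean_val   <- mean(x)      # arithmetic mean
median_val <- median(x)    # middle value

# Hypothetical helper: returns the first most frequent value
stat_mode <- function(v) {
  u <- unique(v)
  u[which.max(tabulate(match(v, u)))]
}
mode_val <- stat_mode(x)

geo_mean     <- exp(mean(log(x)))       # requires positive values
harm_mean    <- 1 / mean(1 / x)         # requires non-zero values
weighted_val <- weighted.mean(x, w)     # weights each value by w
trimmed_val  <- mean(x, trim = 0.2)     # drops 20% from each end first

print(c(mean = mean_val, median = median_val, mode = mode_val,
        geometric = geo_mean, harmonic = harm_mean,
        weighted = weighted_val, trimmed = trimmed_val))
```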

8. Define Transpose of a matrix

Answer: Matrix transpose is an operation that flips a matrix over its diagonal, switching rows and
columns. If A is m×n, then A^T is n×m.

Code:

mat <- matrix(1:6, nrow=2, ncol=3)

transpose_mat <- t(mat)
print(transpose_mat)

9. Explain factor variable

Answer: Factor is a data type used to store categorical data (both nominal and ordinal). It stores data as
levels and is memory efficient for repeated categorical values.

Code:

colors <- factor(c("red", "blue", "red", "green", "blue"))
print(colors)
levels(colors)

10. List out the characteristics of a data frame

Answer:

Rectangular structure (rows and columns)

Different data types in different columns

Column names are required

Row names are optional

All columns must have same length

11. Define Vector with an example

Answer: Vector is a basic data structure in R that contains elements of the same data type arranged in a
sequence.

Code:

# Numeric vector
num_vec <- c(1, 2, 3, 4, 5)
# Character vector
char_vec <- c("apple", "banana", "cherry")
print(num_vec)

12. What are the different values that can be assigned to a numeric data type in R?
Answer:

Positive numbers (5, 3.14)

Negative numbers (-2, -7.5)

Zero (0)

Infinity (Inf, -Inf)

Not a Number (NaN)

Missing values (NA)
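The special values in this list arise from ordinary arithmetic and can be tested with is.na(), is.nan(), and is.finite(). A minimal sketch:

```r
print(1 / 0)        # Inf
print(-1 / 0)       # -Inf
print(0 / 0)        # NaN

# NaN counts as missing for is.na(), but NA is not NaN
print(is.na(NaN))   # TRUE
print(is.nan(NA))   # FALSE

# NA_real_ is the missing value of numeric (double) type
print(typeof(NA_real_))               # "double"
print(is.finite(c(1, Inf, NA, NaN)))  # TRUE FALSE FALSE FALSE
```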

13. Explain RStudio

Answer: RStudio is an Integrated Development Environment (IDE) for R programming. It provides a user-
friendly interface with features like syntax highlighting, code completion, debugging tools, and integrated
graphics.

14. Write an R program to reverse the order of given vector

Code:

# Original vector
vec <- c(1, 2, 3, 4, 5)
print("Original vector:")
print(vec)

# Reversed vector
reversed_vec <- rev(vec)
print("Reversed vector:")
print(reversed_vec)

15. How to create a Matrix in R?

Answer: Use the matrix() function with data, nrow, and ncol parameters.

Code:

# Method 1: Using matrix() function

mat1 <- matrix(1:12, nrow=3, ncol=4)

# Method 2: Using rbind() or cbind()

mat2 <- rbind(c(1,2,3), c(4,5,6))
print(mat1)

16. Define R Array

Answer: Array is a multi-dimensional data structure that can store data in more than two dimensions. It's
an extension of matrices.

Code:

# Creating a 2x3x2 array

arr <- array(1:12, dim=c(2,3,2))
print(arr)

17. Difference between data frame and a matrix in R

Answer:

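Data Frame                               Matrix
---------------------------------------  ---------------------------------------
Different data types in columns          Same data type only
More flexible                            Less flexible
Supports $ and [ ] for column access     Uses [ ] for access ($ is not valid)
Always has column names                  Column names are optional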

18. Explain the use of length() function

Answer: The length() function returns the number of elements in a vector or list.

Code:

vec <- c(1, 2, 3, 4, 5)

len <- length(vec)
print(paste("Length of vector:", len))

19. Define the structure of a data frame using str() function

Answer: The str() function displays the internal structure of a data frame, showing data types,
dimensions, and sample values.

Code:

df <- data.frame(name=c("John", "Jane"), age=c(25, 30))

str(df)

20. Explain Argument matching

Answer: R uses three types of argument matching:

Exact matching: Arguments matched by exact name

Partial matching: Arguments matched by partial name

Positional matching: Arguments matched by position
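The three kinds of matching can be seen in one small example. The function describe below is hypothetical and exists only to illustrate the calls; R tries exact names first, then partial names, then position.

```r
# Hypothetical function used only to illustrate argument matching
describe <- function(value, digits = 2, prefix = "x =") {
  paste(prefix, round(value, digits))
}

# Positional: value = 3.14159, digits = 3
by_position <- describe(3.14159, 3)

# Exact names: the order of the arguments no longer matters
by_name <- describe(digits = 1, value = 3.14159)

# Partial: 'dig' uniquely identifies 'digits'
by_partial <- describe(3.14159, dig = 4)

print(by_position)
print(by_name)
print(by_partial)
```

Partial matching is convenient interactively, but spelling argument names out in full is usually safer in scripts, since an abbreviation can become ambiguous if a function gains new arguments.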

LONG ANSWER QUESTIONS

1. Summarize the advantages and disadvantages of R

Theory: R is widely used for statistical computing and data analysis in academia and industry. Its popularity stems from its comprehensive statistical capabilities and active community support. However, like any programming language, R has both strengths and limitations that users should understand. For most statistical and data analysis tasks the advantages outweigh the disadvantages, which makes R a strong choice for data scientists and statisticians.

Advantages:

Free and Open Source: No licensing costs

Comprehensive: Vast statistical capabilities

Active Community: Large user base and support

Extensible: Thousands of packages

Platform Independent: Works on multiple OS

Disadvantages:

Memory Management: Stores data in RAM

Learning Curve: Steep for beginners

Speed: Slower than compiled languages

Graphics: Basic plotting can be limited

Documentation: Inconsistent package documentation

2. Write an R program to find maximum and minimum value of a given vector

Theory: Finding maximum and minimum values is a fundamental operation in statistical analysis and
data exploration. The max() and min() functions are built-in R functions that return the extreme values
of a vector. By default they return NA if the vector contains any NA; pass na.rm = TRUE to ignore
missing values when working with real-world data. The range() function returns both the minimum and
the maximum in a single call, which is convenient when both are needed.

Code:

# Create a vector
numbers <- c(15, 8, 23, 4, 16, 42, 7)
print("Original vector:")
print(numbers)

# Find maximum value

max_val <- max(numbers)
print(paste("Maximum value:", max_val))

# Find minimum value

min_val <- min(numbers)
print(paste("Minimum value:", min_val))

# Using range() to get both

range_val <- range(numbers)
print("Range (min, max):")
print(range_val)

3. Illustrate R program to create two 2x3 matrices and perform operations

Theory: Matrix operations are fundamental in linear algebra and statistical computing, forming the
backbone of many machine learning algorithms. Element-wise operations (addition, subtraction,
multiplication, division) perform calculations between corresponding elements of matrices with the same
dimensions. These operations are vectorized in R, making them highly efficient for large datasets. Matrix
arithmetic is essential for data transformations, statistical calculations, and scientific computing
applications.

Code:

# Create two 2x3 matrices
mat1 <- matrix(c(1,2,3,4,5,6), nrow=2, ncol=3)
mat2 <- matrix(c(7,8,9,10,11,12), nrow=2, ncol=3)

print("Matrix 1:")
print(mat1)
print("Matrix 2:")
print(mat2)

# Addition
add_result <- mat1 + mat2
print("Addition:")
print(add_result)

# Subtraction
sub_result <- mat1 - mat2
print("Subtraction:")
print(sub_result)

# Element-wise multiplication
mult_result <- mat1 * mat2
print("Multiplication:")
print(mult_result)

# Division
div_result <- mat1 / mat2
print("Division:")
print(div_result)
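
Note that * multiplies element by element. True matrix multiplication uses the %*% operator and requires the number of columns of the first matrix to equal the number of rows of the second. The sketch below contrasts the two on illustrative matrices of the same shape as the example above, transposing the second matrix so that the product is defined.

```r
A <- matrix(1:6, nrow = 2, ncol = 3)    # 2 x 3
B <- matrix(7:12, nrow = 2, ncol = 3)   # 2 x 3

# Element-wise: each entry is A[i, j] * B[i, j], result is 2 x 3
elementwise <- A * B
print(elementwise)

# Matrix product: (2 x 3) %*% (3 x 2) gives a 2 x 2 result
product <- A %*% t(B)
print(product)
```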

4. Write R program to check if a given number is Even or Odd

Theory: Even-odd checking is a fundamental programming concept that uses the modulo operator (%%)
to determine divisibility. This concept is widely used in algorithms, data filtering, and conditional
processing in statistical analysis. The modulo operation returns the remainder after division, so a
remainder of 0 means the number is divisible by the divisor. Understanding this logic is useful for data
manipulation tasks like creating alternating patterns or grouping data based on numeric properties.

Code:

# Function to check even or odd
check_even_odd <- function(n) {
  if (n %% 2 == 0) {
    return(paste(n, "is Even"))
  } else {
    return(paste(n, "is Odd"))
  }
}

# Test with different numbers

num1 <- 15
num2 <- 24
print(check_even_odd(num1))
print(check_even_odd(num2))

# Using ifelse for multiple numbers

numbers <- c(10, 15, 22, 7)
result <- ifelse(numbers %% 2 == 0, "Even", "Odd")
print(data.frame(Number = numbers, Type = result))

5. Write an R program to add 3 to each element of the first vector

Theory: Vectorization is one of R's most powerful features, allowing operations to be performed on
entire vectors without explicit loops. This concept makes R code more concise, readable, and
computationally efficient compared to traditional programming approaches. Vector arithmetic operations
are automatically applied element-wise, which is fundamental to R's design philosophy. Understanding
vectorization is crucial for efficient data manipulation and statistical computations in R programming.

Code:

# Create original vector

original_vec <- c(5, 10, 15, 20, 25)
print("Original vector:")
print(original_vec)

# Add 3 to each element

new_vec <- original_vec + 3
print("New vector (after adding 3):")
print(new_vec)

# Display both vectors

print("Comparison:")
print(data.frame(Original = original_vec, New = new_vec))

6. Explain in detail about data frame with example

Theory: Data frames are the most important data structure in R for real-world data analysis, representing
the equivalent of spreadsheets or database tables. They provide the flexibility to store different types of
data (numeric, character, logical) in different columns while maintaining structural integrity. Data frames
are the standard format for most statistical functions and are essential for data import/export operations.
Understanding data frames is crucial because most real-world datasets are naturally tabular and require
this flexible structure for effective analysis.

Key Features:

Rectangular structure

Mixed data types allowed

Column names mandatory

Functions: data.frame(), str(), summary()

Code:

# Create a data frame
students <- data.frame(
  Name = c("Alice", "Bob", "Charlie", "Diana"),
  Age = c(20, 22, 21, 23),
  Grade = c("A", "B", "A", "B"),
  Passed = c(TRUE, TRUE, TRUE, FALSE),
  stringsAsFactors = FALSE
)

print("Student Data Frame:")

print(students)

# Structure of data frame

print("Structure:")
str(students)

# Access specific column

print("Names only:")
print(students$Name)

# Access specific row

print("First student:")
print(students[1, ])

# Summary statistics
print("Summary:")
summary(students)

7. Explain scalar and vector with example

Theory: Understanding the distinction between scalars and vectors is fundamental to R programming
and statistical computing. Scalars represent single values and are the building blocks of more complex
data structures, while vectors represent collections of related data points. In R, even a single number is
technically a vector of length 1, which demonstrates R's vector-oriented design philosophy. This
distinction is important for understanding how R functions operate and how data is stored and
manipulated in memory.

Code:

# Scalar examples
scalar_num <- 42
scalar_char <- "Hello"
scalar_logical <- TRUE

print("Scalars:")
print(paste("Number:", scalar_num))
print(paste("Character:", scalar_char))
print(paste("Logical:", scalar_logical))

# Vector examples
num_vector <- c(1, 2, 3, 4, 5)
char_vector <- c("apple", "banana", "cherry")
logical_vector <- c(TRUE, FALSE, TRUE)

print("Vectors:")
print(paste("Numeric vector:", toString(num_vector)))
print(paste("Character vector:", toString(char_vector)))
print(paste("Logical vector:", toString(logical_vector)))

# Check if scalar or vector

print(paste("scalar_num length:", length(scalar_num)))
print(paste("num_vector length:", length(num_vector)))

8. Write R program to check if vector elements are greater than 10

Theory: Logical operations and conditional checking are essential skills in data analysis for filtering,
subsetting, and data validation. The comparison operators in R return logical vectors, which can be used
for indexing and filtering operations. This approach is fundamental to data cleaning, exploratory data
analysis, and creating conditional summaries. Understanding logical indexing allows for efficient data
manipulation without explicit loops, leveraging R's vectorized operations for better performance.

Code:

# Create a vector
numbers <- c(5, 15, 8, 12, 3, 20, 7)
print("Original vector:")
print(numbers)

# Check which elements are greater than 10

result <- numbers > 10
print("Elements > 10 (TRUE/FALSE):")
print(result)

# Get actual values greater than 10

greater_than_10 <- numbers[numbers > 10]
print("Values greater than 10:")
print(greater_than_10)

# Create a data frame for better visualization

comparison <- data.frame(
  Value = numbers,
  Greater_than_10 = result
)
print("Comparison table:")
print(comparison)

9. What is the use of subset() function? With example

Theory: The subset() function is a convenient tool for data filtering and extraction, providing an intuitive
way to select rows and columns based on logical conditions. It's particularly useful for exploratory data
analysis where you need to examine specific portions of your dataset. Rows for which the condition
evaluates to NA are dropped rather than returned as NA rows, and the syntax is cleaner than bracket
notation for complex filtering operations. Mastering subset operations is essential for data cleaning,
analysis, and generating targeted insights from large datasets.

Code:

# Create a data frame
employees <- data.frame(
  Name = c("John", "Jane", "Mike", "Sara", "Tom"),
  Age = c(25, 30, 35, 28, 32),
  Salary = c(50000, 60000, 70000, 55000, 65000),
  Department = c("IT", "HR", "IT", "Finance", "HR")
)

print("Original data:")
print(employees)

# Subset employees with age > 30

older_employees <- subset(employees, Age > 30)
print("Employees with age > 30:")
print(older_employees)

# Subset IT department employees

it_employees <- subset(employees, Department == "IT")
print("IT Department employees:")
print(it_employees)

# Multiple conditions
high_earners <- subset(employees, Age > 25 & Salary > 55000)
print("High earners older than 25:")
print(high_earners)
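
The theory above mentions selecting columns as well as rows and the handling of missing values. The sketch below illustrates both with subset()'s select argument, using a small hypothetical data frame (staff) that contains one missing age.

```r
# Hypothetical data with one missing Age value
staff <- data.frame(
  Name   = c("John", "Jane", "Mike"),
  Age    = c(25, 30, NA),
  Salary = c(50000, 60000, 70000)
)

# 'select' picks columns; the row whose condition is NA is dropped
over_24 <- subset(staff, Age > 24, select = c(Name, Salary))
print(over_24)

# Plain bracket indexing with the same condition keeps an all-NA row
bracket <- staff[staff$Age > 24, ]
print(bracket)
```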

10. Write R code to remove empty rows and columns from a matrix

Theory: Data cleaning is a critical step in any data analysis workflow, and handling missing or empty data
is a common challenge. Removing empty rows and columns helps reduce data size, improve
computational efficiency, and prevent errors in statistical calculations. The apply() function combined
with is.na() and logical operations provides flexible ways to identify and remove incomplete data.
Understanding these techniques is essential for preprocessing real-world datasets that often contain
missing values or incomplete records.

Code:

# Create a matrix with some NA values
mat <- matrix(c(1, 2, NA, 4, 5, 6, NA, NA, 9), nrow=3, ncol=3)
print("Original matrix:")
print(mat)

# Remove rows that are completely NA
# (drop = FALSE keeps the result a matrix even if only one row remains)

mat_no_empty_rows <- mat[!apply(is.na(mat), 1, all), , drop = FALSE]
print("After removing empty rows:")
print(mat_no_empty_rows)

# Remove columns that are completely NA

mat_clean <- mat_no_empty_rows[, !apply(is.na(mat_no_empty_rows), 2, all), drop = FALSE]
print("After removing empty columns:")
print(mat_clean)

# Alternative: keep only the rows that contain no NA at all
# (in this sample every row has at least one NA, so zero rows remain)

mat_complete <- mat[complete.cases(mat), , drop = FALSE]
print("Complete cases only:")
print(mat_complete)
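
The sample matrix above has no row or column that is entirely NA, so the first two steps leave it unchanged. The following sketch applies the same technique to an illustrative matrix in which row 2 and column 2 are completely empty.

```r
# Illustrative matrix (filled by column): row 2 and column 2 hold only NA
m <- matrix(c(1, NA, 3,
              NA, NA, NA,
              7, NA, 9), nrow = 3, ncol = 3)

# TRUE for every row / column in which all entries are NA
empty_rows <- apply(is.na(m), 1, all)
empty_cols <- apply(is.na(m), 2, all)

# drop = FALSE keeps the result a matrix even if one row or column remains
cleaned <- m[!empty_rows, !empty_cols, drop = FALSE]
print(cleaned)
```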

11. Illustrate R code using seq(), paste(), print(), format(), mode(), order()

Theory: These six functions represent core R functionality for data generation, manipulation, and
inspection that every R programmer must master. The seq() function generates sequences for indexing
and creating regular patterns, while paste() handles string operations essential for data labeling and
reporting. The format() function controls data presentation, mode() helps with data type verification, and
order() provides sorting capabilities fundamental to data organization. Together, these functions form
the foundation for most data manipulation tasks in R programming.

Code:

# seq() - Generate sequences
seq1 <- seq(1, 10, by=2)
seq2 <- seq(0, 1, length.out=5)
print("seq() function:")
print(seq1)
print(seq2)

# paste() - Concatenate strings

names <- c("John", "Jane", "Mike")
ages <- c(25, 30, 35)
combined <- paste(names, "is", ages, "years old")
print("paste() function:")
print(combined)

# print() - Display objects

print("print() function:")
print("This is printed using print()")

# format() - Format numbers

numbers <- c(123.456, 78.9, 1234.5678)
formatted <- format(numbers, digits=3, nsmall=2)
print("format() function:")
print(formatted)

# mode() - Check data mode

vec <- c(1, 2, 3, 4, 5)
char_vec <- c("a", "b", "c")
print("mode() function:")
print(paste("Numeric vector mode:", mode(vec)))
print(paste("Character vector mode:", mode(char_vec)))

# order() - Get ordering indices

values <- c(30, 10, 40, 20)
order_indices <- order(values)
ordered_values <- values[order_indices]
print("order() function:")
print(paste("Original:", toString(values)))
print(paste("Order indices:", toString(order_indices)))
print(paste("Ordered values:", toString(ordered_values)))

12. Explain the structure of a data frame using str() function

Theory: The str() function provides a compact display of the structure of any R object, showing data
types, dimensions, and sample values.

Code:
# Create a comprehensive data frame

company_data <- data.frame(
  Employee_ID = 1:5,
  Name = c("Alice", "Bob", "Charlie", "Diana", "Eve"),
  Age = c(25, 30, 35, 28, 32),
  Salary = c(50000, 60000, 70000, 55000, 65000),
  Department = factor(c("IT", "HR", "IT", "Finance", "HR")),
  Full_Time = c(TRUE, TRUE, FALSE, TRUE, TRUE),
  Start_Date = as.Date(c("2020-01-15", "2019-05-20", "2021-03-10", "2020-08-05", "2018-12-01"))
)

print("Data frame content:")

print(company_data)

# cat() interprets "\n" as a line break; print() would show it literally
cat("\nStructure using str():\n")

str(company_data)

cat("\nWhat str() shows us:\n")

cat("- 'data.frame': Object type\n")
cat("- '5 obs. of 7 variables': 5 rows, 7 columns\n")
cat("- Each variable shows: $ variable_name: data_type sample_values\n")
cat("- Factor variables show levels\n")
cat("- Date variables show format\n")

EXAM TIPS:
1. Time Management: Spend more time on long answers (they carry more marks)
2. Code Comments: Always add brief comments to your code
3. Output Display: Use print() statements to show results clearly
4. Error Handling: Mention is.na() and complete.cases() when dealing with missing data
5. Function Syntax: Remember to include function syntax when asked
