
R Programming and Bioinformatics Guide

R is a programming language used for statistical computing and graphics, widely utilized in data analysis and bioinformatics. Bioinformatics combines biology with computational tools to manage and analyze biological data, playing a crucial role in research and applications like genomics and clinical bioinformatics. The document also covers data manipulation in R using packages like dplyr, data structures like vectors, matrices, and data frames, and statistical methods including t-tests.

What is R?

• R is a programming language for statistical computing and graphics.
• Open-source and widely used in data analysis and bioinformatics.
• Supports statistical modeling, data manipulation, and visualization.
• 1. Download R:
[Link]
• 2. Download RStudio:
[Link]
rstudio-desktop/
• RStudio gives you:
- Script editor
- Console
- Environment viewer
• RStudio makes it easy to write, run, and debug R code.
What is Bioinformatics?

Definition: Bioinformatics is the application of computational tools and techniques to understand and manage biological data. It combines biology, computer science, mathematics, and statistics.

Key Components:
• Data Storage: DNA, RNA, Protein sequences
• Data Analysis: Alignment, Annotation, Structure prediction
• Visualization: Graphs, Heatmaps, Molecular Structures
Why is Bioinformatics Important?
Importance of Bioinformatics in Biotechnology
• Manages Big Biological Data
  - Human Genome: 3 billion base pairs!
  - Need computers to store and analyze this data.
• Accelerates Research
  - Drug discovery, personalized medicine, vaccine development
  - Virtual experiments reduce lab costs and time.
• Bridges Lab and Data Science
  - Helps biologists make sense of complex datasets.
Applications of Bioinformatics

Field                     Bioinformatics Application
Genomics                  DNA sequencing, Genome assembly
Transcriptomics           RNA-Seq, Gene expression analysis
Proteomics                Protein structure prediction, Mass spectrometry data analysis
Metagenomics              Study of microbiomes, 16S rRNA analysis
Structural Biology        Protein docking, Molecular simulations
Clinical Bioinformatics   Cancer genomics, Biomarker discovery


Reading Data in R
• Use read.csv() or [Link]() to import data from CSV files.
• Example: data <- read.csv('gene_expression.csv')
• Check data with head(), str(), and summary() functions.
Importing & Cleaning Real Datasets

Import Data:
• data <- read.csv("gene_expression.csv")
Clean Data:
• View first few rows: head(data)
• Check structure: str(data)
• Summary of columns: summary(data)
• Remove missing values: data <- na.omit(data)
• Convert types: data$Group <- as.factor(data$Group)
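The import-and-clean steps above can be sketched end to end as follows. Because gene_expression.csv does not ship with this guide, the example first writes a small made-up table to a temporary file. The column names Gene, Expression, and Group, and all values, are assumptions for illustration only.

```r
# Write a tiny, made-up CSV to a temporary file so the example is self-contained
csv_path <- tempfile(fileext = ".csv")
writeLines(c(
  "Gene,Expression,Group",
  "GeneA,25.4,Control",
  "GeneB,,Treatment",
  "GeneC,45.2,Control",
  "GeneD,28.9,Treatment"
), csv_path)

# Import the data
data <- read.csv(csv_path)

# Inspect it
head(data)     # first rows
str(data)      # column types
summary(data)  # per-column summaries

# Remove rows that contain missing values (GeneB has no Expression value)
data <- na.omit(data)

# Convert the grouping column to a factor
data$Group <- as.factor(data$Group)
```

Note that na.omit() drops every row with at least one missing value, which may discard more data than intended on wide tables.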
dplyr

What is dplyr?
• dplyr is one of the most widely used R packages for data manipulation. It is part of the tidyverse collection.

Why use dplyr?
• It makes data wrangling easier, faster, and more readable
• Works great on data frames and tibbles
• Uses a pipeline (%>%) to chain commands together
Data Manipulation with dplyr

• dplyr is a grammar of data manipulation.
• Key functions: filter(), select(), mutate(), arrange(), summarise()
• Example: data %>% filter(condition == 'treated') %>% summarise(mean(Expression))
Common dplyr Functions
Function      Purpose                              Example
filter()      Select rows that match a condition   filter(Group == "Control")
select()      Pick specific columns                select(Gene, Expression)
mutate()      Create or modify columns             mutate(LogExpr = log(Expression))
arrange()     Sort rows                            arrange(desc(Expression))
summarise()   Create summary statistics            summarise(Mean = mean(Expression))
group_by()    Group data before summarizing        group_by(Group)
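A minimal sketch of how these verbs chain together in one pipeline. The gene_data table and its values are made up for illustration, and the requireNamespace() guard is only there so the snippet does nothing when dplyr is not installed.

```r
if (requireNamespace("dplyr", quietly = TRUE)) {
  library(dplyr)

  # Made-up expression table; column names are illustrative
  gene_data <- data.frame(
    Gene = c("GeneA", "GeneB", "GeneC", "GeneD"),
    Expression = c(25.4, 30.1, 45.2, 28.9),
    Group = c("Control", "Treatment", "Control", "Treatment")
  )

  group_summary <- gene_data %>%
    filter(Expression > 26) %>%                      # keep rows above a threshold
    mutate(LogExpr = log2(Expression)) %>%           # add a log2-transformed column
    group_by(Group) %>%                              # one summary row per group
    summarise(Mean = mean(Expression), N = n()) %>%  # mean and row count per group
    arrange(desc(Mean))                              # highest mean first

  print(group_summary)
}
```

Each step receives the result of the previous one, so the pipeline reads top to bottom in the order the operations happen.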


What is a Vector?

A vector in R is a sequence of elements that are all of the same data type. It is the most basic and fundamental data structure in R.

R is a vectorized language, which means most operations work on vectors without the need for loops.
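As a small illustration of vectorization, the sketch below applies a log2 transform to a whole vector in one expression and compares it with an explicit loop. The values are made up, and the loop is shown only for contrast.

```r
gene_expr <- c(25.4, 30.1, 45.2, 28.9, 35.6)

# Vectorized: one expression transforms every element
log_expr <- log2(gene_expr)

# Equivalent explicit loop, shown only for comparison
log_expr_loop <- numeric(length(gene_expr))
for (i in seq_along(gene_expr)) {
  log_expr_loop[i] <- log2(gene_expr[i])
}

all.equal(log_expr, log_expr_loop)  # the two approaches agree

# Comparisons are vectorized too, returning one logical value per element
gene_expr > 30
```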
Key Characteristics:

Feature           Explanation
Homogeneous       All elements must be of the same type
One-dimensional   Only one row of elements (no columns)
Indexable         Elements can be accessed using position numbers
Nameable          You can optionally assign names to vector elements


How to Create a Vector
# Numeric vector
gene_expr <- c(25.4, 30.1, 45.2, 28.9, 35.6)

# Character vector
genes <- c("GeneA", "GeneB", "GeneC", "GeneD", "GeneE")

# Logical vector
results <- c(TRUE, FALSE, TRUE)
Types of Vectors in R

Type        Example               Use Case
Numeric     c(1, 2, 3.5)          Gene expression values
Integer     c(1L, 2L, 3L)         Sample counts, IDs
Character   c("GeneA", "GeneB")   Gene names, sample labels
Logical     c(TRUE, FALSE)        Quality control pass/fail flags
Complex     c(1+2i, 3+4i)         Rarely used in biology
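The types in the table can be checked with typeof(). The short sketch below also shows what happens when types are mixed in one vector, which is a common source of surprises when importing data.

```r
typeof(c(1, 2, 3.5))          # "double" (R's storage type for numeric values)
typeof(c(1L, 2L, 3L))         # "integer"
typeof(c("GeneA", "GeneB"))   # "character"
typeof(c(TRUE, FALSE))        # "logical"
typeof(c(1+2i, 3+4i))         # "complex"

# Mixing types silently coerces everything to the most general type
mixed <- c(1, "GeneA", TRUE)
typeof(mixed)                 # "character"
mixed                         # "1" "GeneA" "TRUE"
```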


Accessing Vector Elements (Indexing)
gene_expr[1]         # First element (25.4)
gene_expr[2:4]       # 2nd to 4th elements
gene_expr[c(1, 3)]   # First and third elements
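Beyond position numbers, vectors can also be indexed by name, by negative position, and by a logical condition. A brief sketch, reusing the gene_expr values from earlier with gene names attached for illustration:

```r
gene_expr <- c(25.4, 30.1, 45.2, 28.9, 35.6)

# Attach names so elements can be looked up by gene
names(gene_expr) <- c("GeneA", "GeneB", "GeneC", "GeneD", "GeneE")

gene_expr["GeneC"]          # index by name
gene_expr[-1]               # negative index: everything except the first element
gene_expr[gene_expr > 30]   # logical index: values above 30

high <- names(gene_expr)[gene_expr > 30]
high                        # "GeneB" "GeneC" "GeneE"
```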
Matrix
• A matrix in R is a 2-dimensional data structure that stores elements in rows and columns. Unlike data frames, all elements in a matrix must be of the same type (numeric, character, or logical).
• It’s like a table or spreadsheet with a fixed data type.
Key Characteristics:

Property           Explanation
2D Structure       Organized in rows and columns
Homogeneous        All elements must be of the same type
Indexed            Access elements using [row, column]
Numeric-friendly   Supports matrix algebra (addition, multiplication, etc.)
How to Create a Matrix
expr_matrix <- matrix(
c(10, 12, 14, 15, 11, 13), # Data
nrow = 2, # Number of rows
ncol = 3, # Number of columns
byrow = TRUE # Fill row-wise
)
Common Matrix Operations

Operation               Code Example
Add a constant          expr_matrix + 1
Multiply by scalar      expr_matrix * 2
Row-wise mean           rowMeans(expr_matrix)
Column-wise sum         colSums(expr_matrix)
Matrix multiplication   A %*% B (matrix algebra)
Transpose matrix        t(expr_matrix)
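A runnable sketch of the operations in the table, using the same six values as the matrix created earlier. The row and column names (GeneA, GeneB, S1 to S3) are assumptions added for readability.

```r
expr_matrix <- matrix(
  c(10, 12, 14,
    15, 11, 13),
  nrow = 2, ncol = 3, byrow = TRUE,
  dimnames = list(c("GeneA", "GeneB"), c("S1", "S2", "S3"))
)

expr_matrix["GeneA", "S2"]   # single element by [row, column]: 12
expr_matrix[2, ]             # whole second row
rowMeans(expr_matrix)        # GeneA = 12, GeneB = 13
colSums(expr_matrix)         # S1 = 25, S2 = 23, S3 = 27
dim(t(expr_matrix))          # the transpose is 3 x 2

# Matrix multiplication needs conformable shapes: (2 x 3) %*% (3 x 2) gives 2 x 2
gram <- expr_matrix %*% t(expr_matrix)
dim(gram)
```

Note that * multiplies element by element, while %*% performs true matrix multiplication.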
When to Use Matrices in Bioinformatics

Application                          Description
Gene Expression Data                 Rows = genes, Columns = samples
Methylation / SNP Intensity Matrix   For omics arrays (values are all numeric)
Image Data                           Convert pixel intensities to matrix
Distance or Correlation Matrices     Pairwise distances between genes or samples
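As a sketch of the last row of the table, base R's cor() and dist() can build correlation and distance matrices from an expression matrix. The matrix below is made up: GeneB is exactly twice GeneA, and GeneC falls as GeneA rises.

```r
# Made-up matrix: 3 genes (rows) measured in 4 samples (columns)
expr <- matrix(
  c(10, 12, 14, 16,
    20, 24, 28, 32,
    30, 25, 20, 15),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("GeneA", "GeneB", "GeneC"), paste0("S", 1:4))
)

# cor() correlates columns, so transpose to correlate genes with each other
gene_cor <- cor(t(expr))
round(gene_cor, 2)

# dist() computes Euclidean distances between rows (here, between genes)
gene_dist <- dist(expr)
gene_dist
```

Correlation captures the shape of the profile while Euclidean distance also reflects magnitude, so GeneA and GeneB are perfectly correlated yet not at distance zero.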
What is a Data Frame?

A data frame is a 2-dimensional tabular data structure in R that:
• Looks like an Excel spreadsheet
• Can store columns of different data types (numeric, character, logical, etc.)
• Is the most commonly used structure for datasets in R
Key Characteristics

Feature         Description
2D structure    Rows and columns
Mixed types     Columns can be numeric, character, logical, etc.
Column access   Use $, indexing, or column names
Row access      Use row numbers
Ideal for       Real-world data, survey results, lab measurements
How to Create a Data Frame

gene_data <- data.frame(
  Gene = c("GeneA", "GeneB", "GeneC"),
  Expression = c(25.4, 30.1, 45.2),
  Group = c("Control", "Treatment", "Control")
)
Common Operations
Task               Code Example
View top rows      head(gene_data)
View structure     str(gene_data)
Summary stats      summary(gene_data)
Add a new column   gene_data$LogExpr <- log(gene_data$Expression)
Rename columns     colnames(gene_data) <- c("G", "E", "Grp")
Filter data        subset(gene_data, Group == "Control")
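The access patterns and operations above can be tried together in one short script. This sketch rebuilds the same three-gene data frame so that it runs on its own.

```r
gene_data <- data.frame(
  Gene = c("GeneA", "GeneB", "GeneC"),
  Expression = c(25.4, 30.1, 45.2),
  Group = c("Control", "Treatment", "Control")
)

gene_data$Expression          # column access with $
gene_data[["Gene"]]           # column access by name
gene_data[2, ]                # row access by row number
gene_data[gene_data$Expression > 30, c("Gene", "Expression")]  # rows by condition

# Add a derived column, then keep only the control samples
gene_data$LogExpr <- log(gene_data$Expression)
controls <- subset(gene_data, Group == "Control")
nrow(controls)                # 2
```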
Mean

mean(): Calculate the Average

• The mean() function computes the arithmetic average of a numeric vector.
expression <- c(25.4, 30.1, 45.2, 28.9, 35.6)
mean(expression)
Use in Biology

• Average gene expression level across samples
• Average metabolite concentration in a group
• Average age of patients in a clinical study
Standard Deviation

• Standard Deviation (SD) tells how much the values in a dataset vary (spread out) from the mean.
• Low SD → values are close to the mean (consistent)
• High SD → values are spread out (more variable)
Biological Relevance

• Measuring variability in gene expression across replicates
• Assessing how consistent patient biomarker levels are
• Evaluating experimental repeatability
To Calculate SD

# Expression values for a gene across 5 samples
expression <- c(25.4, 30.1, 45.2, 28.9, 35.6)

# Calculate Standard Deviation
std_dev <- sd(expression)

# Print the result
print(paste("Standard Deviation:", round(std_dev, 2)))
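To make the calculation less of a black box, the sketch below recomputes the same quantity by hand. R's sd() returns the sample standard deviation, which divides the summed squared deviations by n - 1 rather than n.

```r
expression <- c(25.4, 30.1, 45.2, 28.9, 35.6)

n <- length(expression)
deviations <- expression - mean(expression)      # distance of each value from the mean
manual_sd <- sqrt(sum(deviations^2) / (n - 1))   # sample SD: divide by n - 1, not n

all.equal(manual_sd, sd(expression))             # TRUE
all.equal(sd(expression)^2, var(expression))     # TRUE: variance is SD squared
```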
What is summary()
The summary() function provides a quick
statistical overview of data.
• For numeric data, it gives:
– Minimum
– 1st Quartile (25%)
– Median (50%)
– Mean
– 3rd Quartile (75%)
– Maximum
• For categorical/factor data, it gives:
– Frequency count of each category
Syntax
summary(x)
Where x can be:
• A whole data frame
• A column in a data frame
• A vector
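A short sketch of summary() applied to each kind of input listed above. The expression values and group labels are made up for illustration.

```r
expression <- c(25.4, 30.1, 45.2, 28.9, 35.6)
group <- factor(c("Control", "Treatment", "Control", "Treatment", "Control"))

summary(expression)   # Min, 1st Qu., Median, Mean, 3rd Qu., Max
summary(group)        # counts per level: Control 3, Treatment 2

# On a data frame, summary() is applied column by column
gene_data <- data.frame(Expression = expression, Group = group)
summary(gene_data)
```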
What is ggplot2?
• ggplot2 is an R package used for data visualization.
• It allows you to create beautiful and customizable plots based on the Grammar of Graphics.
• It is part of the tidyverse collection of packages.
Why use ggplot2?
• Easy to build complex plots from simple layers
• Ideal for scientific and publication-ready figures
• Highly customizable: themes, labels, colors, legends
• Works seamlessly with data frames and dplyr


ggplot2 - Customization
• Add layers: + labs(), + theme(), + scale_x_continuous()
• Plot Faceting: ggplot(data) + facet_wrap(~Group)
• Export plots with ggsave('[Link]')
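A minimal sketch combining labels, a theme, faceting, and export. The plot_data table is made up, the output goes to a temporary PDF chosen only so the example cleans up after itself, and the requireNamespace() guard simply skips the code when ggplot2 is not installed.

```r
if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)

  # Made-up data; Gene, Group, and Expression are illustrative column names
  plot_data <- data.frame(
    Gene = rep(c("GeneA", "GeneB"), each = 4),
    Group = rep(c("Control", "Treatment"), times = 4),
    Expression = c(25.4, 32.5, 28.1, 35.1, 12.0, 11.5, 13.2, 12.8)
  )

  p <- ggplot(plot_data, aes(x = Gene, y = Expression)) +
    geom_point() +                                   # one point per measurement
    facet_wrap(~Group) +                             # one panel per group
    labs(title = "Expression by gene", y = "Expression level") +
    theme_minimal()

  # Write the figure to a temporary PDF; use any file path in practice
  out_file <- tempfile(fileext = ".pdf")
  ggsave(out_file, plot = p, width = 6, height = 4)
}
```

ggsave() picks the output format from the file extension, so changing the extension is enough to switch formats.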
Example Workflow: Gene Expression
1. Load data using read.csv().
2. Use dplyr to clean and manipulate.
3. Calculate mean and sd of expression levels.
4. Plot expression using ggplot2.
5. Interpret biological significance.


What is a t-test?
• A t-test compares the means of two groups and
tells you whether the difference is statistically
significant.
Types of t-tests:
Type                       When to use
Independent (two-sample)   Comparing two different groups (e.g., control vs treatment)
Paired                     Comparing same group before & after (e.g., pre- vs post-treatment)
One-sample                 Compare one group to a known value
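All three variants use the same base R function, t.test(), with different arguments. In the sketch below the data are made up, the paired call treats the two vectors as before/after measurements purely for illustration, and the reference value 30 is arbitrary.

```r
control   <- c(25.4, 28.1, 29.5, 30.2, 27.8)
treatment <- c(32.5, 35.1, 33.9, 34.2, 36.0)

# Independent (two-sample): different subjects in each group.
# R's default is Welch's test, which does not assume equal variances.
independent <- t.test(control, treatment)

# Paired: the same subjects measured twice, matched by position.
paired <- t.test(control, treatment, paired = TRUE)

# One-sample: compare one group against a reference value.
one_sample <- t.test(control, mu = 30)

c(independent = independent$p.value,
  paired      = paired$p.value,
  one_sample  = one_sample$p.value)
```

Choosing paired = TRUE is only valid when each value in the first vector has a genuine partner in the second.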
Example: Independent t-test in R
• Suppose you have gene expression data for a
gene in control and treatment groups.
# Expression values for Control and Treatment groups
control <- c(25.4, 28.1, 29.5, 30.2, 27.8)
treatment <- c(32.5, 35.1, 33.9, 34.2, 36.0)
t.test(control, treatment)
How to Interpret:
Output Part           Meaning
t =                   The t-statistic
df =                  Degrees of freedom
p-value               Probability of seeing a difference at least this large if the true group means were equal
mean of x             Mean of control group
mean of y             Mean of treatment group
confidence interval   Plausible range for the true mean difference (95% by default)
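Each part of the printed output is also available as a named element of the object that t.test() returns, which is handy when results need to be collected across many genes. A sketch using the control and treatment values from the example:

```r
control   <- c(25.4, 28.1, 29.5, 30.2, 27.8)
treatment <- c(32.5, 35.1, 33.9, 34.2, 36.0)

result <- t.test(control, treatment)

result$statistic   # t: negative here because mean(control) < mean(treatment)
result$parameter   # df: fractional, because the default Welch test adjusts it
result$p.value     # p-value
result$estimate    # "mean of x" (control) and "mean of y" (treatment)
result$conf.int    # 95% confidence interval for mean(x) - mean(y)

# A common (but arbitrary) convention is to call p < 0.05 significant
result$p.value < 0.05
```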
Real-life Bio Use Cases:
Application                   Description
Gene expression comparison    Control vs Treatment
Clinical trial measurements   Drug vs Placebo
Proteomics                    Protein intensity across conditions
Metabolomics                  Metabolite abundance differences
Statistical Summary in R
• summary(): Shows min, max, mean, median, etc.
• mean(), sd(), var(): Useful for descriptive statistics.
• t.test(): Compare two groups.
• cor(): Correlation between variables.
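Since cor() is the one function in this list without an earlier example, here is a brief sketch. The two vectors are made-up expression values for two hypothetical genes measured across the same five samples.

```r
# Made-up expression values for two genes across the same five samples
gene_x <- c(25.4, 30.1, 45.2, 28.9, 35.6)
gene_y <- c(12.1, 15.3, 22.8, 14.0, 18.2)

cor(gene_x, gene_y)                        # Pearson correlation (the default)
cor(gene_x, gene_y, method = "spearman")   # rank-based alternative
```

Spearman correlation depends only on the ordering of the values, so it is less sensitive to outliers than Pearson.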
