
Understanding Multiple Sequence Alignment

The document discusses homologous multiple sequence alignment (MSA), emphasizing the importance of sequence identity and similarity in understanding evolutionary relationships among proteins. It outlines various mechanisms of molecular evolution, alignment methods (including progressive and iterative approaches), and the significance of tools like PAM and BLOSUM matrices in analyzing sequence data. Additionally, it highlights the applications of MSA in phylogenetic analysis, protein structure prediction, and identifying conserved sequences.


04-02-2025

Multiple Sequence Alignment
Dr Perugu Shyam

Homologous
◼ Homology is an inference (sequences are either homologous or not).

◼ Identity and similarity are quantities that describe the relatedness of sequences.

◼ The close kinship between human beings and chimpanzees, hinted at by the mutual interest shown by Jane Goodall and a chimpanzee in the photograph, is revealed in the amino acid sequences of myoglobin.

◼ The human sequence (red) differs from the chimpanzee sequence (blue) in only one amino acid in a protein chain of 153 residues.

Sequence similarity to Humans



Similarity

◼ Similarity is a quantitative measure of how two sequences are related to one another.

◼ Similarity is assessed as the total number of identities and conservative substitutions in a pairwise sequence alignment.
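
As a rough illustration of these counts, the sketch below tallies exact matches (identities) and identities plus conservative substitutions (similarity) over two aligned rows. The residue groups used to decide what counts as "conservative" are an assumption made for this example, as are the two example rows; a real analysis would normally take that information from a substitution matrix.

```python
# Hypothetical grouping of amino acids whose exchange is treated as conservative.
# These groups are an assumption for illustration, not a standard definition.
CONSERVATIVE_GROUPS = [
    set("ILVM"),  # aliphatic / hydrophobic
    set("FWY"),   # aromatic
    set("KRH"),   # basic
    set("DE"),    # acidic
    set("STNQ"),  # polar, uncharged
]


def is_conservative(a, b):
    """True if both residues fall in the same (assumed) conservative group."""
    return any(a in group and b in group for group in CONSERVATIVE_GROUPS)


def identity_and_similarity(row_a, row_b, gap="-"):
    """Return (identities, similarity) for two rows of a pairwise alignment.

    identities = number of exact matches
    similarity = identities + conservative substitutions
    Columns containing a gap are skipped.
    """
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have equal length")
    identities = 0
    conservative = 0
    for a, b in zip(row_a, row_b):
        if a == gap or b == gap:
            continue
        if a == b:
            identities += 1
        elif is_conservative(a, b):
            conservative += 1
    return identities, identities + conservative


# Made-up example rows: 3 exact matches plus D/E, F/Y and K/R substitutions.
identities, similarity = identity_and_similarity("ACDEF-K", "ACEEYGR")
print(identities, similarity)
```

Dividing either count by the number of compared columns gives percent identity or percent similarity.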


Identity

◼ A quantitative measure of how related two sequences are to one another.

◼ Identity is assessed as the total number of exact matches in a pairwise sequence alignment.

Mechanisms Involved in Molecular Evolution of Genes/Proteins

Mutation - Stochastic single point changes in the genetic material due to errors in DNA replication during mitosis, radiation exposure, chemical or environmental stressors, or viruses and transposable elements. Slow but constant rate (molecular clock) of 10^-9 to 10^-8 mutations per base per generation. Splicing errors in eukaryotes that retain introns.

Recombination - Exchange of genes or portions of genes between different chromosomes to create new combinations of elements.

Gene duplication - Duplication of a gene or portions of a gene, one of which continues the original function and the other is free to evolve and acquire new functions.

Retrotransposition - Incorporation of mRNA sequences back into DNA, frequently inserting into new locations with different expression patterns.

The mechanisms by which new genes/proteins arise allow sequence analysis to be used to infer functional and structural relationships among different sequences.

PAM (Point Accepted Mutation, also read as Percent Accepted Mutation) matrices

• Are derived from studying global alignments of well-characterized protein families.
• PAM1 = only 1% of residues have changed (i.e., short evolutionary distance).
• Raise this to the 250th power to get PAM250: 250 accepted changes per 100 residues between two sequences (greater evolutionary distance), or about 20% sequence identity.
• Therefore,
a PAM 30 would be used to analyze more closely related proteins,
a PAM 400 is used for finding and analyzing distantly related proteins.
• PAMx = (PAM1)^x

Block substitution matrices (BLOSUM)

Are derived from studying local alignments (blocks) of sequences from related proteins that share no more than X% identity.

1) In other words, one might use the portions of aligned sequences from related proteins that have no more than 62% identity (in the portions or blocks) to derive the BLOSUM 62 scoring matrix.

2) One might use only the blocks that have <80% identity to derive the BLOSUM 80 matrix.

3) BLOSUM and PAM substitution matrices have the opposite effects:

a) The higher the number of the BLOSUM matrix (BLOSUM X), the more closely related proteins you are looking for.

b) The higher the number of the PAM matrix (PAM X), the more distantly related proteins you are looking for.
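
The relation PAMx = (PAM1)^x from the PAM notes can be sketched with a toy example. The 3-letter alphabet and the numbers in `TOY_PAM1` are assumptions chosen only so that each residue has a 99% chance of staying unchanged; real PAM matrices are 20 x 20 mutation-probability matrices that are afterwards converted to log-odds scores.

```python
def mat_mul(a, b):
    """Multiply two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def mat_pow(m, x):
    """Raise a square matrix to the integer power x by repeated multiplication."""
    n = len(m)
    result = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for _ in range(x):
        result = mat_mul(result, m)
    return result


# Toy "PAM1" over a hypothetical 3-letter alphabet: each row sums to 1 and
# each residue stays unchanged with probability 0.99 (1% change).
TOY_PAM1 = [
    [0.990, 0.005, 0.005],
    [0.005, 0.990, 0.005],
    [0.005, 0.005, 0.990],
]

toy_pam30 = mat_pow(TOY_PAM1, 30)
toy_pam250 = mat_pow(TOY_PAM1, 250)

# The probability of a residue remaining unchanged shrinks as x grows.
print(TOY_PAM1[0][0], toy_pam30[0][0], toy_pam250[0][0])
```

The shrinking diagonal is the reason a high-numbered PAM matrix suits distantly related proteins.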

Gap penalties – Intuitively one recognizes that there should be a penalty for introducing (requiring) a gap during identification/alignment of a given sequence. But if two sequences are related, the gaps may well be located in loop regions which are more tolerant of mutational events and probably have little impact on structure. Therefore, a new gap should be penalized, but extending an existing gap should be penalized very little.
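
This "expensive to open, cheap to extend" idea is usually called an affine gap penalty. A minimal sketch follows; the default values for `gap_open` and `gap_extend` are placeholders chosen for illustration, not values taken from these notes.

```python
def affine_gap_penalty(length, gap_open=10.0, gap_extend=0.5):
    """Penalty for one gap of the given length.

    The first gapped position costs gap_open; each further position in the
    same gap costs only gap_extend. Both default values are illustrative.
    """
    if length <= 0:
        return 0.0
    return gap_open + gap_extend * (length - 1)


# One gap of length 4 is far cheaper than four separate gaps of length 1.
one_long_gap = affine_gap_penalty(4)
four_short_gaps = 4 * affine_gap_penalty(1)
print(one_long_gap, four_short_gaps)
```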

Filtering – many proteins and nucleotides contain simple repeats or regions of low sequence complexity. These must be excluded from searches and alignments.

Significance of a “hit” during a search – More important than an arbitrary score is an estimation of the likelihood of finding a hit through pure chance (the lower the value, the more certain the match). Ergo the “Expectation value” or E-value. E-values can be as low as 10^-70.

Multiple Sequence Alignments


◼ Place residues in columns that are derived from a common ancestral residue
◼ MSA can reveal sequence patterns
◼ Demonstration of homology between >2 sequences
◼ Identification of functionally important sites
◼ Protein function prediction
◼ Structure prediction
◼ Search for weak but significant similarities in databases
◼ Design PCR primers for related gene identification
◼ Genome sequencing: contig assembly

CREASE
CREATE
RELAPSE
GREASER

SeqA CREAT--E-
SeqB CREAS--E-
SeqC GREAS--ER
SeqD -RELAPSE-
     123456789

Exhaustive algorithms:
◼ Exhaustive alignment involves examining all possible alignments at once.
◼ A multidimensional search matrix is required to perform multiple sequence alignment using the exhaustive algorithm, similar to the two-dimensional matrix used in dynamic programming for pairwise alignment. This means that to align N sequences, an N-dimensional matrix is required.
◼ Dynamic programming is a powerful method for aligning sequences, but as the number of sequences to be aligned increases, the amount of computational time and memory space also increases. This means that the method becomes computationally impractical for large data sets. As a result, dynamic programming is typically only used for small data sets with fewer than ten short sequences.
◼ Heuristic approaches are typically used for larger data sets to achieve a more efficient alignment.
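
To see why the exhaustive approach stops being practical, it helps to count the cells in the N-dimensional matrix. The sketch below assumes N sequences that all have the same length L, so each dimension has L + 1 positions; the function name `dp_cells` is made up for this example.

```python
def dp_cells(num_sequences, seq_length):
    """Number of cells in the N-dimensional dynamic programming matrix
    for num_sequences sequences that each have seq_length residues."""
    return (seq_length + 1) ** num_sequences


# Growth of the search matrix for sequences of 100 residues.
for n in (2, 3, 10):
    print(n, dp_cells(n, 100))
```

Going from two to ten sequences of 100 residues takes the matrix from about ten thousand cells to more than 10^20.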

Heuristic algorithm :
◼ i. Progressive method
◼ The progressive method, also known as the tree-based algorithm, is
a step-wise assembly of multiple alignments based on pairwise
similarity. This method is called progressive because it aligns
sequences in a step-wise manner.
◼ First, it performs pairwise alignments of all the sequences using the
Needleman–Wunsch global alignment method and records the
similarity scores.
◼ Then, it converts the scores into evolutionary distances to create a
distance matrix. A guide tree is constructed from the distance matrix
using the neighbor-joining method.
◼ The guide tree is used to direct the realignment of sequences based
on their relative positions on the tree, starting with the two most
closely related sequences and adding more distant sequences one at
a time until all sequences are aligned.
◼ Clustal and T-Coffee are two well-known progressive alignment
programs.
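
The first two steps of the progressive method (all-against-all global alignment, then conversion of scores to distances) can be sketched as follows. The scoring values (match +1, mismatch -1, linear gap penalty -2) and the score-to-distance conversion are simplifying assumptions for illustration. Actual programs use substitution matrices, affine gaps and their own distance formulas, and they go on to build a neighbor-joining guide tree, which is not shown here.

```python
from itertools import combinations


def nw_score(a, b, match=1, mismatch=-1, gap=-2):
    """Needleman-Wunsch global alignment score with a linear gap penalty.

    Only the previous row of the dynamic programming matrix is kept,
    because just the final score is needed here.
    """
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr.append(max(diag, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    return prev[-1]


def pairwise_distances(seqs, match=1):
    """Convert all pairwise scores into distances.

    Assumed conversion: 1 - score / (best possible score of the longer
    sequence). This is a stand-in for a proper evolutionary distance.
    """
    dist = {}
    for (name_a, a), (name_b, b) in combinations(seqs.items(), 2):
        score = nw_score(a, b, match=match)
        dist[(name_a, name_b)] = 1.0 - score / (match * max(len(a), len(b)))
    return dist


seqs = {"SeqA": "CREATE", "SeqB": "CREASE", "SeqC": "GREASER", "SeqD": "RELAPSE"}
distances = pairwise_distances(seqs)

# The progressive alignment would start from the two closest sequences.
closest_pair = min(distances, key=distances.get)
print(closest_pair)
```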


◼ ii. Iterative Method

◼ The iterative method involves improving an initial suboptimal solution by repeatedly modifying it until the alignment score no longer improves.
◼ An initial pairwise alignment is conducted to create a tree that provides weights for creating alignments. Aligned regions with gaps are identified and iteratively adjusted to enhance the alignment score. The highest-scoring alignment is used in a new set of calculations to predict a new tree, new weights, and new alignments. The procedure is repeated until there is no more improvement in the alignment score.
◼ PRRN is a web-based program that uses the iterative method of alignment.

◼ iii. Block-based method

◼ The progressive and iterative alignment methods are based on global alignment and may not be effective in identifying conserved domains and motifs in highly divergent sequences of different lengths.
◼ To align such divergent sequences, a local alignment-based approach is needed.
◼ The block-based method is one such method that identifies a block of ungapped alignment that is shared by all sequences.

◼ Applications of sequence alignment:

◼ Sequence alignment can identify unknown sequences by comparing them with already known sequences in databases.
◼ Sequence alignment is also used to identify conserved sequence patterns and motifs, which helps to characterize the functions of the sequences.
◼ Sequence alignment can also produce phylogenetic trees and obtain information about the evolutionary relationship between the sequences aligned.
◼ Sequence alignment can also predict proteins’ secondary and tertiary structures. It can also predict gene locations and new members of gene families.
◼ Sequence alignment can also be used to develop degenerate PCR primers by analyzing multiple related sequences.
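
The block-based idea (an ungapped segment shared by all sequences) can be illustrated with a deliberately simplified sketch that searches for the longest segment occurring identically in every sequence. Requiring exact identity is an assumption made to keep the example short; real block-finding programs score ungapped segments with a substitution matrix and tolerate mismatches. The function name `shared_ungapped_block` is hypothetical.

```python
def shared_ungapped_block(seqs):
    """Return (block, start positions) for the longest segment that occurs,
    without gaps or mismatches, in every sequence. Positions are 0-based.
    Returns ("", []) if the sequences share no residue at all.
    """
    shortest = min(seqs, key=len)
    # Try candidate blocks from the shortest sequence, longest first.
    for size in range(len(shortest), 0, -1):
        for start in range(len(shortest) - size + 1):
            candidate = shortest[start:start + size]
            if all(candidate in s for s in seqs):
                return candidate, [s.index(candidate) for s in seqs]
    return "", []


block, positions = shared_ungapped_block(["CREATE", "CREASE", "GREASER", "RELAPSE"])
print(block, positions)
```

Anchoring each sequence at the reported position lines the shared block up in the same columns, which is the starting point for a local multiple alignment.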


Multiple Sequence alignment


MSA with PILEUP

PILEUP is the MSA program that is part of the Genetics Computer Group
(GCG) sequence analysis package

Sequences are aligned pairwise using dynamic programming algorithm

The scores are used to produce a phylogenetic tree, which is then used to
guide the alignment of the most closely related sequences and groups of
sequences

Resulting alignment is a global alignment produced by the Needleman-Wunsch algorithm

MSA with PILEUP

PILEUP drawbacks:

No recent enhancements such as gap modifications or sequence weighting comparable to those introduced for ClustalW

As with other progressive alignment programs, does not guarantee an optimal alignment

Major problem with progressive alignment programs such as ClustalW and PILEUP is the dependence of the final multiple sequence alignment on the initial pairwise alignments

For closely related sequences, ClustalW is designed to provide an adequate alignment of a large number of sequences

Iterative MSA methods

Attempt to correct initial alignment problems by repeatedly aligning subgroups of the sequences and then by aligning these subgroups into a global alignment of all the sequences

MultAlin – recalculates pair-wise scores during the production of the progressive alignment and uses these scores to recalculate the tree

PRRP – initial alignment is made to predict a tree, the tree is used to produce weights where the sequences are analyzed for the presence of aligned regions that include gaps

SAGA – based on genetic algorithm that is a machine-learning algorithm that attempts to produce alignments by the simulations of evolutionary changes in sequences


Editing and formatting alignments

Sequence editors are used for:
- manual alignment/editing of sequences
- visualization of data
- data management
- import/export of data
- graphical enhancement of data for presentations

Examples:

- CINEMA (Color Interactive Editor for Multiple Alignments) web applet [Link]

- GDE (Genetic Data Environment) - UNIX based [Link]

- GeneDoc - MS Windows [Link]

- MACAW - local multiple sequence alignment program and sequence editing tool available by anonymous FTP from [Link]/pub/schuler/macaw

- BioEdit - sequence alignment editor for MS Windows with web access and accessory applications (BLAST, local BLAST, ClustalW, Phylip and more)

Multiple Sequence Alignment

Clustalw can be run on many websites or downloaded

ClustalX is a graphical form of Clustalw which can be downloaded

Clustalw is a global sequence alignment program; therefore, sequences may need to be edited before alignment

Examples: Clustal W/X, Pileup (GCG), 3D-Coffee, DIALIGN-2, MUSCLE, PROBCONS, MSA, SALIGN.

ClustalW
◼ Based on phylogenetic analysis.
◼ A phylogenetic tree is created using a pairwise distance matrix and the neighbor-joining algorithm.
◼ The most closely-related pairs of sequences are aligned using
dynamic programming.
◼ Each of the alignments is analyzed and a profile of it is created.
◼ Alignment profiles are aligned progressively for a total alignment.
◼ W in ClustalW refers to a weighting of scores depending on how
far a sequence is from the root on the phylogenetic tree (See p.
154 of Bioinformatics by Mount.)


Summary MSA
Definition:
A multiple sequence alignment is an alignment of N > 2 sequences obtained by inserting gaps (“-”) into sequences such that the resulting sequences all have length L and can be arranged in a matrix of N rows and L columns where each column represents a homologous position
Why do we need MSA?
- Formulate & test hypotheses about protein 3-D structure
- MSA can help us to reveal biological facts about proteins
- Crucial for genome sequencing
- To establish homology for phylogenetic analyses
- Identify primers and probes to search for homologous sequences in other organisms
- Most pairwise alignment algorithms are too complex to be used for n-wise alignments
- Alignment algorithms need to be optimized
* use structural information
* use phylogenetic information
* use conserved regions
MSA methods
- Progressive global alignment (starts with the most alike sequences)
* e.g., ClustalW, ClustalX, Pileup
- Iterative methods (initial alignment of groups of sequences that are revised)
* MultAlin, PRRP, SAGA
- Alignments based on locally conserved patterns
Sequence editors
- CINEMA, GDE, GeneDoc, MACAW, BioEdit

Approaches:
◼ Optimal Global Alignments - Dynamic programming
◼ Build matrices with every possible combination and search for optimal solution
◼ Align 10 sequences of 100 aa length
◼ Optimal in the mathematical sense
◼ Global Progressive Alignments - Match most common sequences together
◼ Global Iterative Alignments - Multiple re-building attempts to find best alignment
◼ Local alignments
◼ Profiles, Blocks, Patterns

Progressive Methods
◼ Similar to dynamic programming method in that it uses the first step (i.e., it creates a phylogenetic tree, aligns the most-alike pair, and incrementally adds sequences to the alignment in order of “alikeness” as indicated by the tree.).
◼ Differs from dynamic programming method for MSA in that it doesn’t refine the “first-cut” MSA by doing a full search through the reduced search space. (This is the computationally expensive part of DP MSA in that, even though we’ve cut down the search space, it’s still big when we have many sequences to align.)

Progressive Method
◼ Generally proceeds as follows:
◼ Choose a starting pair of sequences and align them
◼ Align each next sequence to those already aligned, one at a time
◼ Heuristic method – doesn’t guarantee an optimal alignment
◼ Details vary in implementation:
◼ How to choose the first sequence to align?
◼ Align all subsequent sequences cumulatively or in subfamilies?
◼ How to score?


Problems with Progressive Method
◼ MSA depends on pairwise alignments.
◼ If sequences are very distantly related, much higher likelihood of errors.
◼ Highly sensitive to the choice of initial pair to align. If they aren’t very similar, it throws everything off.
◼ It’s not trivial to come up with a suitable scoring matrix or gap penalties.
◼ Other approaches using Bayesian methods such as hidden Markov models

Global Progressive Alignment
◼ A heuristic approach that utilizes phylogenetic information to assist in routing the alignment (clustalw/clustalx).
◼ Feng & Doolittle 1987, Higgins and Sharp 1988.
◼ Most alike sequences are aligned together in order of their similarity (tree-based), a consensus is determined and then aligned to next most similar sequence

Seq1 VMR
Seq2 VMK
Seq3 GMK
Seq4 GMV

VMR + VMK = VMR/K
VMR/K + GMK = V/G M R/K
V/G M R/K + GMV = V/G M V/R/K
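
The consensus-building step in the VMR / VMK / GMK / GMV example can be sketched as below. The sketch assumes the rows are already aligned (equal length, no gaps) and only collects the residues observed in each column; it does not perform the alignment step itself, and the function name `column_consensus` is made up for this illustration.

```python
def column_consensus(aligned_rows):
    """Summarize each column of an alignment as the residues seen in it,
    joined with '/', in order of first appearance."""
    consensus = []
    for column in zip(*aligned_rows):
        seen = []
        for residue in column:
            if residue not in seen:
                seen.append(residue)
        consensus.append("/".join(seen))
    return consensus


# Add the sequences in the order used in the example: closest pair first.
step1 = column_consensus(["VMR", "VMK"])
step2 = column_consensus(["VMR", "VMK", "GMK"])
step3 = column_consensus(["VMR", "VMK", "GMK", "GMV"])
print(step1, step2, step3)
```

Each step reproduces the corresponding consensus line of the example, up to the order in which alternative residues are listed.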

Iterative Multiple Alignment

◼ “Repeatedly re-align subgroups of sequences into a global alignment to improve alignment score” (Mount, 2001)
◼ Start with a progressive alignment and tree
◼ Recalculate pair-wise scores during progressive alignment, use new scores to rebuild the tree, which is used to improve alignments

Initial Progressive Alignment → Build Tree → Weight Based On Subgroup Alignments → Iterate MSA

Iterative Methods for MSA

◼ Get an alignment.
◼ Refine it.
◼ Repeat until one msa doesn’t change significantly from the next.
◼ An example is genetic algorithm approach.


Phylogenetics

Biological Foundations

Evolution is driven by
◼ Inheritance
◼ Variation
◼ Mutations
◼ Phenotype
◼ Genotype
◼ Recombination
◼ Nature selects: survival of the fittest
◼ All organisms share a common ancestry

Terminology

◼ Phylogeny
The evolutionary relationships among organisms, based on a common ancestor
◼ Phylogenetics
Area of research concerned with finding the genetic relationships between species
◼ (Greek: phylon = race and genetic = birth)


Applications of phylogenetic trees
◼ Evolution studies
◼ Systematic biology
◼ Medical research and epidemiology
◼ Ecology

Phylogeny
[Figure: tree relating Orangutan, Gorilla, Chimpanzee and Human]

Phylogenetic Trees

◼ A graph representing the evolutionary history of a sequence
◼ Relationship of one sequence to other sequences
◼ Dissect the order of appearance of insertions, deletions, and mutations
◼ Predict function, observe epidemiology, analyzing changes in viral strains

Tree Shapes

[Figure: simple, rooted and un-rooted trees of taxa A, B, C and D]
Branches intersect at Nodes
Leaves are the topmost branches


Tree Characteristics

◼ Tree Properties
◼ Clade: all the descendants of a common ancestor represented by a node
◼ Distance: number of changes that have taken place along a branch
◼ Tree Types
◼ Cladogram: shows the branching order of nodes
◼ Phylogram: shows branching order and distances

[Figure: phylogram of taxa A, B, C and D with branch lengths .035, .012, .009, .057, .016 and .044]

Tree Building Algorithms

◼ Maximum Parsimony
◼ Distance Methods
◼ UPGMA
◼ Neighbor Joining
◼ Maximum Likelihood

Maximum Parsimony

Alignment (Li, 1991):

Sequence   Site 1   Site 2   Site 3   Site 4   Site 5   Site 6
One        A        A        G        A        G        T
Two        A        G        C        C        G        T
Three      A        G        A        T        A        T
Four       A        G        A        G        A        T

Informative site: Site 5.
Trees: Tree I groups sequences (1,2) and (3,4); Tree II groups (1,3) and (2,4); Tree III groups (1,4) and (2,3). (Select Tree I)

◼ Find the tree that changes one sequence into all of the others by the least number of steps [Focus solely on end product sequences, ignore evolutionary history]
◼ Only informative sites are analyzed (no gaps or conserved positions)
◼ Can be misleading when rates of change vary in different tree branches

Distance Methods

◼ Distance is expressed as the fraction of sites that differ between two sequences in an alignment
◼ Sequences with the smallest number of changes (shortest distance) are “related taxa”
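
The informative-site rule used in the Maximum Parsimony example can be sketched as a short function. It assumes the usual criterion that a site is parsimony-informative when at least two different residues each occur in at least two sequences; columns containing a gap character are skipped, in line with the note that gapped positions are not analyzed. The function name is hypothetical.

```python
from collections import Counter


def informative_sites(alignment, gap="-"):
    """Return the 1-based positions of parsimony-informative columns.

    Assumed criterion: at least two different residues must each appear
    in at least two sequences. Columns containing a gap are skipped.
    """
    sites = []
    for index, column in enumerate(zip(*alignment), start=1):
        counts = Counter(column)
        if gap in counts:
            continue
        shared = [residue for residue, n in counts.items() if n >= 2]
        if len(shared) >= 2:
            sites.append(index)
    return sites


# The four-sequence, six-site alignment from the Maximum Parsimony example.
alignment = [
    "AAGAGT",  # One
    "AGCCGT",  # Two
    "AGATAT",  # Three
    "AGAGAT",  # Four
]
print(informative_sites(alignment))
```

Only site 5 (G, G, A, A) passes the test. It supports the tree that groups sequences One and Two apart from Three and Four.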


Distance Methods - UPGMA

◼ UPGMA (Unweighted Pair-Group Method with Arithmetic mean)
◼ Sequentially find pair of taxa with smallest distance between them, and define branching as midpoint of two
◼ Assumes the tree is additive and that rate of change is constant in all of the branches

Distance Methods - NJ

◼ Neighbor-Joining (NJ): useful when there are different rates of evolution within a tree
◼ Each possible pair-wise alignment is examined. Calculate distance from each sequence to every other sequence
◼ Choose the pair with the lowest distance value and join them to produce the minimal length tree
◼ Update distance matrix where joined node is substituted for two original taxa and then repeat process

[Figure: UPGMA clustering of taxa A to D using distances DAB, D(AB)C and D(ABC)D; NJ tree of taxa A to H built by successive joins]
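
The UPGMA procedure described above (join the closest pair, place the branch point at the midpoint, average the distances, repeat) can be sketched in a few lines. The distance values in `example` are made up for illustration, and the tree is returned as nested tuples rather than in a standard tree format. The helper `p_distance` shows the "fraction of differing sites" measure, assuming two aligned rows of equal length.

```python
from itertools import combinations


def p_distance(row_a, row_b):
    """Fraction of aligned sites that differ between two rows."""
    return sum(x != y for x, y in zip(row_a, row_b)) / len(row_a)


def upgma(names, dist):
    """Cluster taxa with UPGMA.

    names: list of taxon names
    dist:  dict mapping (name_a, name_b) tuples to distances
    Returns (tree, heights): the tree as nested tuples, and the height
    (half the joining distance) of every internal node.
    """
    sizes = {name: 1 for name in names}  # cluster -> number of taxa in it
    heights = {}
    d = {frozenset(pair): value for pair, value in dist.items()}
    while len(sizes) > 1:
        # Find the pair of clusters with the smallest distance.
        a, b = min(combinations(sizes, 2), key=lambda p: d[frozenset(p)])
        merged = (a, b)
        heights[merged] = d[frozenset((a, b))] / 2  # branch point at midpoint
        # Distance to the merged cluster is the size-weighted average.
        for other in sizes:
            if other not in (a, b):
                d[frozenset((merged, other))] = (
                    sizes[a] * d[frozenset((a, other))]
                    + sizes[b] * d[frozenset((b, other))]
                ) / (sizes[a] + sizes[b])
        sizes[merged] = sizes[a] + sizes[b]
        del sizes[a], sizes[b]
    (root,) = sizes
    return root, heights


# Hypothetical distances between four taxa.
example = {
    ("A", "B"): 2.0, ("A", "C"): 6.0, ("B", "C"): 6.0,
    ("A", "D"): 10.0, ("B", "D"): 10.0, ("C", "D"): 10.0,
}
tree, heights = upgma(["A", "B", "C", "D"], example)
print(tree)
print(heights)
```

Because every join is placed at the midpoint, the sketch inherits the constant-rate assumption noted for UPGMA; neighbor-joining drops that assumption and is not shown here.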

Maximum Likelihood

◼ Best accounts for variation in sequences
◼ Establish a probabilistic model with multiple solutions and determine which is most likely
◼ All possible trees are considered, therefore, only suitable for small number of sequences
◼ Maximizes probability of finding optimal tree

Tree Reliability

◼ Probability that the members of a clade are always members of that clade
◼ Sample by Bootstrapping
◼ Sites (columns) of an alignment are randomly sampled with replacement so as to create a dataset the same size as the original. The same analysis as applied to the original data set is performed on the bootstrap dataset
◼ Construct a consensus bootstrap tree and compare to the original tree
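
The resampling step of bootstrapping can be sketched as follows: columns of the alignment are drawn at random, with replacement, until the replicate has as many sites as the original. The example alignment and the fixed seed are assumptions for illustration. Building a tree from each replicate and counting how often each clade reappears is left out.

```python
import random


def bootstrap_alignment(alignment, rng):
    """Return one bootstrap replicate of an alignment.

    Columns are sampled with replacement; every row receives the same
    sampled columns, so the replicate has the same size as the original.
    """
    n_sites = len(alignment[0])
    picks = [rng.randrange(n_sites) for _ in range(n_sites)]
    return ["".join(row[i] for i in picks) for row in alignment]


alignment = ["AAGAGT", "AGCCGT", "AGATAT", "AGAGAT"]
rng = random.Random(0)  # fixed seed so the sketch is repeatable
replicate = bootstrap_alignment(alignment, rng)
print(replicate)
```

In practice many replicates (often hundreds) are generated, and the fraction of replicate trees containing a clade is reported as its bootstrap support.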


Which Method to Use? (Mount, 2001)

Is there strong sequence similarity?
yes: Maximum Parsimony
no: Is there clearly recognizable sequence similarity?
yes: Distance Methods
no: Maximum Likelihood

Analysing the aligned sequence matrix
◼ PHYLIP
◼ POY
◼ PAUP, GCG
◼ And many more... (274 software packages described at one website)

PHYLIP (Phylogeny Inference Package)

[Link]

◼ Available free in Windows/MacOS/Linux systems
◼ Parsimony, distance matrix and likelihood methods (bootstrapping and consensus trees)
◼ Data can be molecular sequences, gene frequencies, restriction sites and fragments, distance matrices and discrete characters


Visualising trees
◼ Treeview
◼ You can change the graphic presentation of a
tree (cladogram, rectangular cladogram, radial
tree, phylogram), but not change the structure
of a tree


POY
(parsimony-based phylogenetic analysis by direct optimization of unaligned sequences)

◼ Cladistic and phylogenetic analysis using sequence and/or morphological data
◼ Finding among all possible trees, those that exhibit
minimal edit costs (minimum number of mutations)
◼ Is able to assess directly the number of DNA
sequence transformations, evolutionary events,
required by a tree topology without the use of
multiple sequence alignment
◼ CSC
