Understanding Multiple Sequence Alignment
Multiple Sequence Alignment
Homologous
◼ Homology is an inference (sequences are homologous or not).
Shyam’s
1
04-02-2025
Similarity
PAM matrices
• Are derived from studying global alignments of well-characterized protein families.
• PAM1 = only 1% of residues have changed (i.e., a short evolutionary distance).
• Raise this to the 250th power to get 250% change between two sequences (a greater evolutionary distance), or about 20% sequence identity.
• PAMx = (PAM1)^x, i.e., the PAM1 matrix multiplied by itself x times.
• Therefore, a PAM 30 would be used to analyze more closely related proteins, and a PAM 400 would be used for finding and analyzing distantly related proteins.
a) The higher the number of the PAM matrix (PAM X), the more distantly related proteins you are looking for.
BLOSUM matrices
• Are derived from studying local alignments (blocks) of sequences from related proteins that differ by no more than X%.
1) In other words, one might use the portions of aligned sequences from related proteins that have no more than 62% identity (in the portions or blocks) to derive the BLOSUM 62 scoring matrix.
2) One might use only the blocks that have <80% identity to derive the BLOSUM 80 matrix.
a) The higher the number of the BLOSUM matrix (BLOSUM X), the more closely related proteins you are looking for.
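The relation PAMx = (PAM1)^x can be illustrated with a small sketch. It assumes a toy 4-letter alphabet in which every residue changes with probability 1% per PAM unit and all replacements are equally likely. Real PAM matrices cover 20 amino acids with unequal replacement rates and are published as log-odds scores, so the function names and numbers here are hypothetical and only show the matrix-power idea.

```python
def toy_pam1(alphabet_size=4, change=0.01):
    """Toy mutation-probability matrix: 1% total change per residue (assumption)."""
    off = change / (alphabet_size - 1)
    return [[1.0 - change if i == j else off for j in range(alphabet_size)]
            for i in range(alphabet_size)]


def mat_mul(a, b):
    """Plain square-matrix product."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def mat_pow(m, x):
    """Raise matrix m to the integer power x by repeated squaring."""
    n = len(m)
    result = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    while x > 0:
        if x & 1:
            result = mat_mul(result, m)
        m = mat_mul(m, m)
        x >>= 1
    return result


pam1 = toy_pam1()
pam30 = mat_pow(pam1, 30)    # shorter evolutionary distance
pam250 = mat_pow(pam1, 250)  # longer evolutionary distance

# The diagonal is the probability that a residue is unchanged.
for name, m in (("PAM1", pam1), ("PAM30", pam30), ("PAM250", pam250)):
    print(name, "P(unchanged) =", round(m[0][0], 3))
```

The probability of a residue staying unchanged shrinks as x grows, which is why higher PAM numbers suit more distantly related proteins.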
Multiple sequence alignment :
◼ Place residues in columns that are derived from a common ancestral residue
◼ MSA can reveal sequence patterns
◼ Demonstration of homology between >2 sequences
◼ Identification of functionally important sites
◼ Protein function prediction
◼ Structure prediction
◼ Search for weak but significant similarities in databases
◼ Design PCR primers for related gene identification
◼ Genome sequencing: contig assembly
Example: the words CREASE, CREATE, RELAPSE and GREASER aligned in nine columns
SeqA CREAT--E-
SeqB CREAS--E-
SeqC GREAS--ER
SeqD -RELAPSE-
     123456789
Exhaustive algorithms :
◼ Exhaustive alignment involves examining all possible alignments at once.
◼ A multidimensional search matrix is required to perform multiple sequence alignment using the exhaustive algorithm, similar to the two-dimensional matrix used in dynamic programming for pairwise alignment. This means that to align N sequences, an N-dimensional matrix is required.
◼ Dynamic programming is a powerful method for aligning sequences, but as the number of sequences to be aligned increases, the amount of computational time and memory space also increases. This means that the method becomes computationally impractical for large data sets. As a result, dynamic programming is typically only used for small data sets with fewer than ten short sequences.
◼ Heuristic approaches are typically used for larger data sets to achieve a more efficient alignment.
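To make the cost of the exhaustive approach concrete, the sketch below counts the cells of the N-dimensional dynamic programming matrix. It assumes N sequences of equal length L, so the matrix has (L + 1)^N cells and each interior cell is filled from 2^N - 1 neighbouring cells. The function names are hypothetical.

```python
def dp_cells(num_seqs, length):
    """Cells in an N-dimensional DP matrix for sequences of equal length."""
    return (length + 1) ** num_seqs


def predecessors_per_cell(num_seqs):
    """Neighbouring cells consulted for each interior cell (all gap/no-gap choices)."""
    return 2 ** num_seqs - 1


for n in (2, 3, 5, 10):
    print(f"N={n:2d}  cells={dp_cells(n, 100):.3e}  "
          f"predecessors={predecessors_per_cell(n)}")
```

Going from 2 to 3 sequences of length 100 already raises the cell count from about ten thousand to about a million, which is why exhaustive alignment is limited to a handful of short sequences.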
Heuristic algorithm :
◼ i. Progressive method
◼ The progressive method, also known as the tree-based algorithm, is
a step-wise assembly of multiple alignments based on pairwise
similarity. This method is called progressive because it aligns
sequences in a step-wise manner.
◼ First, it performs pairwise alignments of all the sequences using the
Needleman–Wunsch global alignment method and records the
similarity scores.
◼ Then, it converts the scores into evolutionary distances to create a
distance matrix. A guide tree is constructed from the distance matrix
using the neighbor-joining method.
◼ The guide tree is used to direct the realignment of sequences based
on their relative positions on the tree, starting with the two most
closely related sequences and adding more distant sequences one at
a time until all sequences are aligned.
◼ Clustal and T-Coffee are two well-known progressive alignment
programs.
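The first two steps of the progressive method described above (all-against-all pairwise scores, then a distance matrix) can be sketched as follows. The scoring values (match +1, mismatch -1, gap -1) and the score-to-distance conversion are assumptions chosen for illustration, not the parameters used by Clustal or T-Coffee, and guide-tree construction by neighbor joining is left out.

```python
from itertools import combinations


def nw_score(a, b, match=1, mismatch=-1, gap=-1):
    """Needleman-Wunsch global alignment score (score only, no traceback)."""
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr.append(max(diag, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    return prev[-1]


def score_to_distance(a, b):
    """Toy conversion (assumption): 1 - score / length of the shorter sequence."""
    return 1.0 - nw_score(a, b) / min(len(a), len(b))


seqs = {"SeqA": "CREATE", "SeqB": "CREASE", "SeqC": "GREASER", "SeqD": "RELAPSE"}

# Distance matrix stored as {(name1, name2): distance} for every pair.
distances = {(x, y): score_to_distance(seqs[x], seqs[y])
             for x, y in combinations(seqs, 2)}

# The most closely related pair is the one a progressive method aligns first.
closest_pair = min(distances, key=distances.get)
print(closest_pair, round(distances[closest_pair], 3))
```

In a full implementation the distance matrix would feed a tree-building step, and the tree would decide the order in which the remaining sequences are added.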
PILEUP is the MSA program that is part of the Genetics Computer Group (GCG) sequence analysis package
The pairwise alignment scores are used to produce a phylogenetic tree, which is then used to guide the alignment of the most closely related sequences and groups of sequences
PILEUP drawbacks:
No recent enhancements such as gap modifications or sequence weighting comparable to those introduced for ClustalW
As with other progressive alignment programs, does not guarantee an optimal alignment
Major problem with progressive alignment programs such as ClustalW and PILEUP is the dependence of the final multiple sequence alignment on the initial pairwise alignments
Iterative methods
Attempt to correct initial alignment problems by repeatedly aligning subgroups of the sequences and then by aligning these subgroups into a global alignment of all the sequences
MultAlin – recalculates pair-wise scores during the production of the progressive alignment and uses these scores to recalculate the tree
PRRP – initial alignment is made to predict a tree, the tree is used to produce weights where the sequences are analyzed for the presence of aligned regions that include gaps
SAGA – based on a genetic algorithm, a machine-learning algorithm that attempts to produce alignments by simulating evolutionary changes in sequences
Sequence editors
- data management
- import/export of data
- graphical enhancement of data for presentations
Examples:
- MACAW - local multiple sequence alignment program and sequence editing tool available by anonymous FTP from [Link]/pub/schuler/macaw
- BioEdit - sequence alignment editor for MS Windows with web access and accessory applications (BLAST, local BLAST, ClustalW, Phylip and more)
ClustalX is a graphical form of ClustalW which can be downloaded
ClustalW is a global sequence alignment program
Examples of MSA programs: Clustal W/X, Pileup (GCG), 3D-Coffee, DIALIGN-2, MUSCLE, PROBCONS, MSA, SALIGN.
ClustalW
◼ Based on phylogenetic analysis.
◼ A phylogenetic tree is created using a pairwise distance matrix and
the neighbor-joining algorithm.
◼ The most closely-related pairs of sequences are aligned using
dynamic programming.
◼ Each of the alignments is analyzed and a profile of it is created.
◼ Alignment profiles are aligned progressively for a total alignment.
◼ W in ClustalW refers to a weighting of scores depending on how
far a sequence is from the root on the phylogenetic tree (See p.
154 of Bioinformatics by Mount.)
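The profile step in the ClustalW outline above can be sketched as a table of per-column residue frequencies. This is a minimal, unweighted illustration: it counts every sequence equally and treats the gap character as an ordinary symbol, whereas ClustalW applies sequence weights and position-specific gap penalties. The function name `build_profile` is hypothetical.

```python
from collections import Counter


def build_profile(alignment):
    """Return one {symbol: frequency} dict per alignment column (unweighted)."""
    n = len(alignment)
    profile = []
    for column in zip(*alignment):  # iterate over columns, not rows
        counts = Counter(column)
        profile.append({symbol: c / n for symbol, c in counts.items()})
    return profile


alignment = ["CREAT--E-",
             "CREAS--E-",
             "GREAS--ER",
             "-RELAPSE-"]

profile = build_profile(alignment)
for position, freqs in enumerate(profile, start=1):
    print(position, freqs)
```

Aligning a new sequence against such a profile means scoring each residue against the whole column distribution instead of against a single sequence.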
Summary MSA
Definition:
A multiple sequence alignment is an alignment of N > 2 sequences obtained by inserting gaps (“-”) into the sequences such that the resulting sequences all have length L and can be arranged in a matrix of N rows and L columns where each column represents a homologous position
Why do we need MSA?
- Formulate & test hypotheses about protein 3-D structure
- MSA can help us to reveal biological facts about proteins
- Crucial for genome sequencing
- To establish homology for phylogenetic analyses
- Identify primers and probes to search for homologous sequences in other organisms
- Most pairwise alignment algorithms are too complex to be used for n-wise alignments
- Alignment algorithms need to be optimized
  * use structural information
  * use phylogenetic information
  * use conserved regions
MSA methods
- Progressive global alignment (starts with the most alike sequences)
  * e.g., ClustalW, ClustalX, Pileup
- Iterative methods (initial alignment of groups of sequences that are revised)
  * MultAlin, PRRP, SAGA
- Alignments based on locally conserved patterns
Sequence editors
- CINEMA, GDE, GeneDoc, MACAW, BioEdit
Approaches:
◼ Optimal Global Alignments - Dynamic programming
  ◼ Build matrices with every possible combination and search for optimal solution
  ◼ Align 10 sequences of 100 aa length
  ◼ Optimal in the mathematical sense
◼ Global Progressive Alignments - Match most common sequences together
◼ Global Iterative Alignments - Multiple re-building attempts to find best alignment
◼ Local alignments
  ◼ Profiles, Blocks, Patterns
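The formal parts of the MSA definition in the summary above can be checked mechanically: more than two rows, one common length L, and rows that give back the original sequences once the gaps are removed. The sketch below does only that; whether each column is truly homologous is a biological judgment that code like this cannot verify. The function name `is_valid_msa` is hypothetical.

```python
def is_valid_msa(aligned, originals, gap="-"):
    """Check the structural conditions of an MSA (not column homology)."""
    # N > 2 sequences, one aligned row per original sequence.
    if len(aligned) <= 2 or len(aligned) != len(originals):
        return False
    # All rows must share the same length L.
    length = len(aligned[0])
    if any(len(row) != length for row in aligned):
        return False
    # Removing gaps must recover each original sequence unchanged.
    return all(row.replace(gap, "") == seq
               for row, seq in zip(aligned, originals))


originals = ["CREATE", "CREASE", "GREASER", "RELAPSE"]
aligned = ["CREAT--E-", "CREAS--E-", "GREAS--ER", "-RELAPSE-"]

print(is_valid_msa(aligned, originals))
```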
Progressive Methods
◼ Similar to dynamic programming method in that it uses the first step (i.e., it creates a phylogenetic tree, aligns the most-alike pair, and incrementally adds sequences to the alignment in order of “alikeness” as indicated by the tree).
◼ Differs from dynamic programming method for MSA in that it doesn’t refine the “first-cut” MSA by doing a full search through the reduced search space. (This is the computationally expensive part of DP MSA in that, even though we’ve cut down the search space, it’s still big when we have many sequences to align.)
Progressive Method
◼ Generally proceeds as follows:
  ◼ Choose a starting pair of sequences and align them
  ◼ Align each next sequence to those already aligned, one at a time
◼ Heuristic method – doesn’t guarantee an optimal alignment
◼ Details vary in implementation:
  ◼ How to choose the first sequence to align?
  ◼ Align all subsequent sequences cumulatively or in subfamilies?
  ◼ How to score?
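One possible answer to the implementation questions above is sketched here: start with the closest pair, then add the remaining sequences cumulatively, always taking the one with the smallest average distance to those already included. Both the rule and the distance values are assumptions made for illustration. Real programs usually follow a guide tree and may align subfamilies first.

```python
def progressive_order(names, dist):
    """Order in which sequences join the alignment under a simple greedy rule."""
    def d(x, y):
        # Distances are stored once per unordered pair.
        return dist[(x, y)] if (x, y) in dist else dist[(y, x)]

    pairs = [(x, y) for i, x in enumerate(names) for y in names[i + 1:]]
    order = list(min(pairs, key=lambda p: d(*p)))  # starting pair
    remaining = [n for n in names if n not in order]
    while remaining:
        # Next sequence: smallest mean distance to everything already aligned.
        nxt = min(remaining,
                  key=lambda n: sum(d(n, m) for m in order) / len(order))
        order.append(nxt)
        remaining.remove(nxt)
    return order


# Hypothetical pairwise distances between four sequences.
toy_dist = {("A", "B"): 0.2, ("A", "C"): 0.5, ("B", "C"): 0.4,
            ("A", "D"): 0.9, ("B", "D"): 0.8, ("C", "D"): 0.7}

print(progressive_order(["A", "B", "C", "D"], toy_dist))
```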
Phylogenetics
Maximum parsimony
◼ Find the tree that changes one sequence into all of the others by the least number of steps [focuses solely on end product sequences, ignores evolutionary history]
◼ Only informative sites are analyzed (no gaps or conserved positions)
◼ Can be misleading when rates of change vary in different tree branches
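The restriction to informative sites can be sketched in a few lines. The sketch uses the usual parsimony criterion: a column is informative when at least two different characters each occur in at least two sequences. Following the note above, columns containing gaps are skipped. The toy alignment and function names are hypothetical.

```python
from collections import Counter


def is_informative(column, gap="-"):
    """True if a column is parsimony-informative and contains no gaps."""
    if gap in column:
        return False
    counts = Counter(column)
    # At least two character states, each seen in two or more sequences.
    return sum(1 for c in counts.values() if c >= 2) >= 2


def informative_sites(alignment):
    """Zero-based indices of the informative columns."""
    return [i for i, col in enumerate(zip(*alignment)) if is_informative(col)]


toy_alignment = ["AGCTA",
                 "AGCTT",
                 "ATCAA",
                 "ATC-T"]

print(informative_sites(toy_alignment))
```

Columns 0 and 2 are fully conserved and column 3 contains a gap, so only columns 1 and 4 would contribute to the parsimony score in this toy case.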
Visualising trees
◼ Treeview
◼ You can change the graphic presentation of a
tree (cladogram, rectangular cladogram, radial
tree, phylogram), but not change the structure
of a tree
POY
PAUP (Phylogenetic Analysis Using Parsimony)