Chapter 5:
Multiple Sequence Alignment
Tamiru Oljira, Ph.D.
From Bioinformatics & Functional Genomics
(Wiley-Liss, 3rd edition, 2015) and H3BioNet website
Learning objectives
• Explain the three main stages by which ClustalW performs multiple
sequence alignment (MSA)
• Describe several alternative programs for MSA (such as MUSCLE,
ProbCons, & TCoffee).
• Explain how they work & contrast them with ClustalW.
• Explain the significance of performing benchmarking studies &
describe several of their basic conclusions for MSA.
• Explain the issues surrounding MSA of genomic regions
Outline
• Introduction; definition of MSA; typical uses
• Five main approaches to multiple sequence alignment
• Exact approaches
• Progressive sequence alignment
• Iterative approaches
• Consistency-based approaches
• Structure-based methods
• Benchmarking studies: approaches, findings, challenges
• Databases of Multiple Sequence Alignments
• Pfam: Protein Family Database of Profile HMMs
• SMART
• Conserved Domain Database
• Integrated multiple sequence alignment resources
• MSA database curation: manual versus automated
• Multiple sequence alignments of genomic regions
• UCSC, Galaxy, Ensembl, Alignathon
Multiple sequence alignment:
definition
• Multiple Sequence Alignment (MSA) is generally the
alignment of three or more biological sequences (protein or
nucleic acid) of similar length that are partially or completely
aligned.
• From the output, homology can be inferred and the
evolutionary relationships between the sequences
studied.
• Homologous residues are aligned in columns across the
length of the sequences.
• Residues are homologous in
• an evolutionary sense
• a structural sense
Example: Alignments of 5 globins
• Examples of MSAs are presented on the next few slides
for five globin proteins using five prominent MSA
programs for comparison:
o ClustalW (Nearly replaced by Clustal Omega
now)
o Praline
o MUSCLE (MUltiple Sequence Comparison by Log-
Expectation)
o ProbCons (Probabilistic Consistency-based Multiple
Alignment of Amino Acid Sequences)
o Tcoffee (Tree-based Consistency Objective Function
For alignment Evaluation)
• See that each program offers unique strengths.
Example: Alignments of 5 globins
• The five sequences are:
• Beta globin, myoglobin, neuroglobin, soybean globin, rice globin
• In the aligned results we will focus on a histidine (H) residue that has a critical
role in binding oxygen in globins; it should fall in the same column in all five sequences.
• Often it does not, & the five programs give different
answers.
• Our conclusion will be that there is no single best approach to MSA.
• Dozens of new programs have been introduced in recent years.
• You can read about other well-known newer programs at:
[Link]
• MAFFT
• Clustal Omega
• Kalign
• MView
• WebPRANK
• COBALT
• You will use this and three other programs in your exercises.
Result for ClustalW
• Note how the region of a conserved histidine (▼) varies
depending on which of five prominent algorithms is used
• Clustal Omega is now recommended for protein MSA.
Result for Praline
Note also the changing pattern of gaps within the boxed
region in these five different alignments.
Result for MUSCLE
• MUSCLE is claimed to achieve both better average accuracy & better
speed than ClustalW2 or T-Coffee, depending on the chosen options.
• This tool can align up to 500 sequences or a maximum file size of 1 MB.
ProbCons
Tcoffee
Multiple sequence alignment: features
• Not necessarily one “correct” alignment of a protein family
• Protein sequences evolve.
• The corresponding 3-D structures of proteins also evolve
• May be impossible to identify amino acid residues that align
properly (structurally) throughout a MSA.
• For two proteins sharing 30% amino acid identity, about 50%
of the individual amino acids are superposable in the two
structures.
Multiple sequence alignment: features
o Some aligned residues, such as cysteines that form
disulfide bridges, may be highly conserved.
o There may be conserved motifs such as a
transmembrane domain.
o There may be conserved secondary structure
features
o There may be regions with consistent patterns of
insertions or deletions (indels)
Multiple sequence alignment: uses
• MSA is more sensitive than pairwise alignment to detect
homologs.
• BLAST output can take the form of a MSA, & can reveal
conserved residues or motifs.
• A single query can be searched against a database of MSAs (e.g.
PFAM)
• Regulatory regions of genes may have consensus sequences
identifiable by MSA.
Five prominent approaches to MSA of proteins:
• Exact approaches
• Progressive
• Iterative
• Consistency-based
• Structure-based
1. Exact methods to MSA
• Exact methods of MSA use dynamic programming
& are guaranteed to find optimal solutions.
• The matrix is multidimensional rather than two-
dimensional.
• The goal is to maximize the summed alignment score of
each pair of sequences.
• But they are not feasible for more than a few
sequences: run time & memory grow exponentially with the number of sequences.
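To make the cost argument concrete, the sketch below scores one alignment by the sum-of-pairs criterion and counts the cells an exact dynamic-programming matrix would need. The match, mismatch, and gap values and the toy sequences are illustrative assumptions, not values from any particular program.

```python
from itertools import combinations


def sum_of_pairs(alignment, match=1, mismatch=-1, gap=-2):
    """Sum the pairwise scores over every column of an alignment.

    Scoring values are illustrative assumptions; a gap facing a gap scores 0.
    """
    total = 0
    for column in zip(*alignment):
        for a, b in combinations(column, 2):
            if a == "-" and b == "-":
                continue
            if a == "-" or b == "-":
                total += gap
            elif a == b:
                total += match
            else:
                total += mismatch
    return total


def lattice_cells(seq_length, n_sequences):
    """Cells in the N-dimensional matrix for N sequences of equal length."""
    return (seq_length + 1) ** n_sequences


toy = ["AC-T", "ACGT", "A-GT"]  # hypothetical toy alignment
print("sum-of-pairs score:", sum_of_pairs(toy))
for n in (2, 3, 5):
    print(n, "sequences of length 100 ->", lattice_cells(100, n), "cells")
```

The cell count grows as a power of the number of sequences, which is why exact methods stop being practical after a handful of sequences.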
2. Progressive methods to MSA
• Progressive methods:
• build the MSA stepwise: each step is applied to the
result of the previous one, typically as a
means of obtaining successively closer
approximations to the solution of a problem.
• use a guide tree (related to a phylogenetic tree) to
determine how to combine pairwise alignments
one-by-one to create a MSA.
• First proposed by Feng & Doolittle (1987)
• Examples of programs using the progressive
algorithms:
• ClustalW
• MUSCLE
Progressive MSA occurs in 3 stages
1. Do a set of global pairwise
alignments (Needleman & Wunsch’s
dynamic programming algorithm)
2. Create a guide tree
3. Progressively align the sequences
Example of MSA using clustalW: Progressive
• Do the exercise of MSA using ClustalW by visiting
[Link] (May not be working now!!)
• Refer to web documents 6-3 & 6-4 in the book at
the website.
• Two data sets:
• Five distantly related globins (human to plant)
• Five closely related beta globins
• Obtain your sequences in the FASTA format!
• You can save them in a Word document or text
editor.
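For readers who prefer to check their FASTA files programmatically, here is a minimal sketch of a FASTA reader. The record names and sequences are placeholders invented for illustration, not real globin entries.

```python
def parse_fasta(text):
    """Return {header: sequence} from FASTA-formatted text."""
    records = {}
    header = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            header = line[1:]
            records[header] = []
        elif header is not None:
            records[header].append(line)
    return {name: "".join(parts) for name, parts in records.items()}


# Placeholder records (hypothetical, for illustration only).
example = """>toy_seq_1 placeholder record
MKVLAAGT
WQ
>toy_seq_2 placeholder record
MKILAAGS
"""
records = parse_fasta(example)
for name, seq in records.items():
    print(name, len(seq))
```

If you keep sequences in a word processor, save the file as plain text before submitting it, since formatting characters can confuse alignment programs.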
Use ClustalW to do a progressive MSA
[Link]
ClustalW stage 1: series of pairwise global alignments
(Figure: table of pairwise alignment scores; the best score marks the pair with the highest percent pairwise identity)
Number of pairwise alignments needed
o For n sequences, (n-1)(n) / 2
o For 5 sequences, (4)(5) / 2 = 10
o For 200 sequences, (199)(200) / 2 = 19,900
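The counts above can be reproduced with a one-line function; the function name is chosen here for illustration.

```python
def pairwise_alignment_count(n):
    """Number of distinct sequence pairs among n sequences: n(n-1)/2."""
    return n * (n - 1) // 2


for n in (5, 200):
    print(n, "sequences ->", pairwise_alignment_count(n), "pairwise alignments")
# 5 sequences -> 10 pairwise alignments
# 200 sequences -> 19900 pairwise alignments
```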
Progressive; stage 2: guide tree
• Convert similarity scores to distance scores
• A tree shows the distance between objects
• Use UPGMA (defined in the phylogeny
chapter) or other options.
• ClustalW provides a syntax to describe
the tree
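A minimal sketch of stage 2, under stated assumptions: percent identities are converted to distances as 1 - identity/100, clusters are joined by UPGMA, and the tree is written in parenthesized Newick-style notation (assumed here to be the kind of tree syntax meant above). The labels and identity values are hypothetical.

```python
def upgma_newick(names, distances):
    """Join clusters by UPGMA and return the tree as a Newick-style string.

    distances maps (name_a, name_b) tuples to a distance.
    """
    # label -> (tree text, number of leaves, height of the cluster node)
    clusters = {name: (name, 1, 0.0) for name in names}
    d = {frozenset(pair): value for pair, value in distances.items()}
    while len(clusters) > 1:
        # Pick the closest pair of current clusters.
        a, b = min(
            ((x, y) for x in clusters for y in clusters if x < y),
            key=lambda p: d[frozenset(p)],
        )
        text_a, size_a, height_a = clusters.pop(a)
        text_b, size_b, height_b = clusters.pop(b)
        height = d[frozenset((a, b))] / 2
        merged = f"({a}+{b})"
        tree = (f"({text_a}:{height - height_a:.3f},"
                f"{text_b}:{height - height_b:.3f})")
        # Distance to the new cluster is the size-weighted average.
        for other in clusters:
            d[frozenset((merged, other))] = (
                size_a * d[frozenset((a, other))]
                + size_b * d[frozenset((b, other))]
            ) / (size_a + size_b)
        clusters[merged] = (tree, size_a + size_b, height)
    tree, _, _ = next(iter(clusters.values()))
    return tree + ";"


# Hypothetical percent identities for three sequences.
identity = {("A", "B"): 80.0, ("A", "C"): 40.0, ("B", "C"): 40.0}
distances = {pair: 1 - pct / 100 for pair, pct in identity.items()}
print(upgma_newick(["A", "B", "C"], distances))
```

The most similar pair (A and B) is joined first and ends up with the shortest branches, which is the pattern to look for in a real guide tree.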
ClustalW stage 1: series of pairwise alignments
ClustalW stage 2: create a guide tree
(Figure label: best score = highest percent pairwise identity)
Note that the two proteins with the highest percent pairwise
identity (soybean & rice globin) also have the shortest connecting
branch lengths in the tree
Progressive; stage 2:
Generate a guide tree calculated from the
distance matrix (example for 5 distantly related globins)
A guide tree is built by neighbor-joining (NJ) from the score matrix, with branch lengths
proportional to the score of each pair.
The NJ method is used to build the unrooted & rooted trees (see chapter on Phylogenetics)
(Figure: the same analysis for 5 closely related globins)
Progressive; stage 3: progressive alignment
• Multiple alignment is carried out by
• starting with the most closely related sequence pair,
aligning them &
• then including more distant sequences progressively
according to the branching order in the guide tree; i.e. adding
the next closest sequence & continuing until all sequences
are added to the MSA
• Use the rule: “once a gap, always a gap.”
Why “once a gap, always a gap”?
• There are many possible ways to make a MSA
• Where gaps are added is a critical question
• Gaps are often added to the first two (closest) sequences
• To change the initial gap choices later on would be
to give more weight to distantly related sequences.
• To maintain the initial gap choices is to trust
that those gaps are most believable.
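The rule can be illustrated with a small sketch: when a later sequence needs an extra column, a gap is opened in every row of the already aligned group, and the gaps placed earlier are left untouched. The sequences and the insertion position are hypothetical toy values.

```python
def insert_gap_column(aligned_group, position):
    """Open a gap at `position` in every row of an already aligned group."""
    return [row[:position] + "-" + row[position:] for row in aligned_group]


# Toy data: the two closest sequences were aligned first.
group = ["AC-T", "ACGT"]
# A more distant sequence needs one extra column before the final T.
new_sequence = "ACGGT"
merged = insert_gap_column(group, 3) + [new_sequence]
for row in merged:
    print(row)
```

The gap chosen in the first pairwise step is still present in the final alignment; only new gap columns were added.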
ClustalW alignment of five distantly related beta globin
orthologs
ClustalW alignment of five closely related beta globin orthologs
Additional features of ClustalW improve
its ability to generate accurate MSAs
• Individual weights are assigned to sequences;
• very closely related sequences are given less weight,
• distantly related sequences are given more weight
• Scoring matrices are varied dependent on the presence
of conserved or divergent sequences, e.g.:
PAM20 80-100% id
PAM60 60-80% id
PAM120 40-60% id
PAM350 0-40% id
• Residue-specific gap penalties are applied
See Thompson et al. (1994) for an explanation of the three
stages of progressive alignment implemented in ClustalW
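As a rough sketch, the matrix table above can be read as a lookup by percent identity. The ranges on the slide share their end points (for example 80%), so assigning each boundary value to the higher-identity matrix is an assumption made here.

```python
def choose_pam_matrix(percent_identity):
    """Pick a PAM matrix from the percent-identity ranges listed above.

    Boundary values are assigned to the higher-identity range (assumption).
    """
    if percent_identity >= 80:
        return "PAM20"
    if percent_identity >= 60:
        return "PAM60"
    if percent_identity >= 40:
        return "PAM120"
    return "PAM350"


for pct in (95, 70, 50, 20):
    print(pct, "% identity ->", choose_pam_matrix(pct))
```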
ClustalW workflow (flowchart):
• Pairwise alignment: calculate distance matrix
• Unrooted neighbor-joining tree
• Rooted neighbor-joining tree (guide tree) & sequence weights
• Progressive alignment: align following the guide tree
3. Iterative methods of MSAs
• Compute a sub-optimal solution &
• keep modifying that intelligently using dynamic
programming or other methods until the solution
converges.
Examples of programs using iterative algorithms:
o MUSCLE
o IterAlign
o Praline
o MAFFT
Iterative approaches: MAFFT
o MAFFT (Multiple Alignment using Fast Fourier Transform,
Katoh et al., 2005)
o Uses Fast Fourier Transform to speed up profile
alignment
o Uses fast two-stage method for building alignments using k-mer
frequencies
o Offers many different scoring & aligning techniques
o One of the more accurate programs available
o Available as standalone or web interface
o Many output formats, including interactive phylogenetic trees
Iterative approaches: MAFFT
Has about 1000
advanced settings!
Iterative method of MAFFT: Steps
The initial alignment is a progressive alignment
MAFFT: Iterative
MUSCLE:
Iterative
ProbCons:
Iterative
T-COFFEE:
Iterative
MUSCLE: next-generation progressive MSA
Three steps:
[1] Build a draft progressive alignment
[2] Improve the progressive alignment
[3] Refine the MSA
[1] Build a draft progressive alignment
• Determine pairwise similarity through k-mer counting
(not by alignment)
• Compute distance (triangular distance) matrix
• Construct tree using UPGMA
• Construct draft progressive alignment following tree
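The k-mer counting idea can be sketched as follows: two sequences are compared by the number of k-mers they share, with no alignment performed. The choice of k = 3, the plain amino acid alphabet, and the toy sequences are assumptions for illustration; the program's own settings may differ.

```python
from collections import Counter


def kmer_counts(seq, k=3):
    """Count every overlapping k-mer in a sequence."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def kmer_similarity(a, b, k=3):
    """Shared k-mers (with multiplicity) divided by the k-mers in the shorter sequence."""
    shared = sum((kmer_counts(a, k) & kmer_counts(b, k)).values())
    possible = min(len(a), len(b)) - k + 1
    return shared / possible if possible > 0 else 0.0


# Hypothetical toy sequences differing at one position.
print(kmer_similarity("MKVLAAGT", "MKVLSAGT"))  # 0.5
```

A distance for the triangular matrix can then be taken as 1 minus this similarity, which is again an assumption of this sketch.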
MUSCLE: next-generation progressive MSA
[2] Improve the progressive alignment
• Compute pairwise identity through current
MSA
• Construct new tree with Kimura distance
measures
• Compare new & old trees:
• if improved, repeat this step,
• if not improved, then we’re done.
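A sketch of the distance used in this step, assuming the commonly quoted form of Kimura's protein distance, d = -ln(1 - D - D²/5), where D is the fraction of differing residues taken from the current MSA. Treat the exact formula as an assumption and check it against the program's documentation.

```python
import math


def kimura_distance(fractional_identity):
    """Kimura-style corrected distance from fractional identity (0 to 1)."""
    d = 1.0 - fractional_identity
    inner = 1.0 - d - d * d / 5.0
    if inner <= 0:
        raise ValueError("sequences too divergent for this correction")
    return -math.log(inner)


for identity in (1.0, 0.8, 0.5):
    print(identity, round(kimura_distance(identity), 4))
```

The corrected distance is always at least as large as the raw difference D, reflecting multiple substitutions at the same site.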
MUSCLE: next-generation progressive MSA
[3] Refine the MSA
• Split tree in half by deleting one edge
• Make profiles of each half of the tree
• Re-align the profiles
• Accept/reject the new alignment
Access to MUSCLE at EBI: [Link]
4. Consistency-based approaches to MSA
Consistency-based algorithms:
• Generally use a database of both local high-scoring alignments
& long-range global alignments to create a final alignment.
• These are very powerful, very fast, & very accurate methods
Examples of MSA programs that implement consistency-based
methods:
o T-COFFEE
o Prrp
o DiAlign
o ProbCons
ProbCons (Probabilistic Consistency-based) approach
• Combines iterative & progressive approaches with a unique
probabilistic model.
• Uses Hidden Markov Models (HMMs) to calculate
probability matrices for matching residues,
• uses this to construct a guide tree.
• Progressive alignment done hierarchically along guide tree
• Post-processing & iterative refinement (a little like MUSCLE)
ProbCons: consistency-based approach
If we align three sequences:
Sequence x xi
Sequence y yj
Sequence z zk
• If xi aligns with zk & zk aligns with yj then xi should align with yj
• ProbCons incorporates evidence from multiple sequences to guide
the creation of a pairwise alignment.
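The consistency idea can be sketched numerically. Below, the match probabilities for x and y are re-estimated by averaging the direct evidence with the evidence routed through z. The averaging form (direct evidence counted twice, divided by three for three sequences) is an assumption about how such a transformation is usually written, and all probabilities are made-up toy values.

```python
def matmul(a, b):
    """Plain matrix product of two lists of lists."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def consistency_update(p_xy, p_xz, p_zy):
    """Combine direct x-y match probabilities with support routed via z."""
    via_z = matmul(p_xz, p_zy)  # sum over k of P(xi~zk) * P(zk~yj)
    return [
        [(2 * p_xy[i][j] + via_z[i][j]) / 3 for j in range(len(p_xy[0]))]
        for i in range(len(p_xy))
    ]


# Toy values: x and y are weakly matched, but both match z strongly.
p_xy = [[0.5, 0.1], [0.1, 0.5]]
p_xz = [[0.9, 0.0], [0.0, 0.9]]
p_zy = [[0.9, 0.0], [0.0, 0.9]]
updated = consistency_update(p_xy, p_xz, p_zy)
for row in updated:
    print([round(v, 3) for v in row])
```

Support for the matches that agree with z rises, while the inconsistent off-diagonal matches lose support.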
ProbCons output for the same alignment:
consistency iteration helps
Tree-based Consistency Objective Function
For alignment Evaluation
Access to TCoffee:
[Link]
o Make a MSA
o MSA w. structural data
o Compare MSA methods
o Make an RNA MSA
o Combine MSA methods
o Consistency-based
o Structure-based
APDB (“Analyze alignments with Protein Data Bank”)
ClustalW output:
TCoffee can incorporate structural information into a MSA
Protein Data Bank accession numbers
Benchmarking studies: approaches, findings, challenges
How do we know which program to use for MSA?
Strategy for assessment of alternative
MSA Algorithms
1. Create or obtain a database of protein sequences for which the 3D
structure is known.
• Helps to define “true” homologs using structural criteria.
• Helps to judge your alignments.
• For example, try Expresso at [Link]
2. Try making MSAs with many different sets of proteins:
o closely related, very distant, few gaps, many gaps, insertions, outliers
• There are benchmarking multiple alignment datasets that have been aligned
painstakingly by:
• hand,
• structural similarity, or
• extremely time- & memory-intensive automated exact algorithms.
e.g. BAliBASE (Benchmark Alignment dataBASE)
3. Compare the answers
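One simple way to compare the answers is to ask what fraction of the residue pairs aligned in a trusted reference are also aligned by the program under test. The sketch below is a simplified version of such a sum-of-pairs comparison; the alignments are hypothetical toy data, and real benchmark suites typically score only selected core regions.

```python
from itertools import combinations


def aligned_pairs(alignment):
    """Set of residue pairs ((seq, pos), (seq, pos)) that share a column."""
    counters = [0] * len(alignment)
    pairs = set()
    for column in zip(*alignment):
        present = []
        for s, ch in enumerate(column):
            if ch != "-":
                present.append((s, counters[s]))
                counters[s] += 1
        pairs.update(combinations(present, 2))
    return pairs


def sum_of_pairs_score(test, reference):
    """Fraction of reference residue pairs reproduced by the test alignment."""
    ref = aligned_pairs(reference)
    return len(ref & aligned_pairs(test)) / len(ref) if ref else 0.0


reference = ["AC-GT", "ACTGT"]  # hypothetical trusted alignment
test = ["ACG-T", "ACTGT"]       # hypothetical program output
print(sum_of_pairs_score(test, reference))  # 0.75
```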
Benchmarking studies: approaches, findings, challenges
BAliBASE (Benchmark Alignment dataBASE)
• used for comparison of MSA programs
• provides reference alignments against which programs are benchmarked
• Some programs have interfaces that are more user-friendly
than others, & most programs are excellent, so the choice
depends on your preference.
Benchmarking studies: approaches, findings, challenges
• Benchmarking tests on BAliBASE suggest:
• ProbCons, a consistency-based/progressive algorithm,
performs the best.
• However, MUSCLE, a progressive alignment package, is an
extremely fast & accurate program.
• ClustalW has been the most popular program.
• It has a nice interface (especially with ClustalX) &
• is easy to use.
• But several programs perform better than it.
• There is no single best program to use, & your answers
will certainly differ between programs (especially if you align divergent protein or
DNA sequences)
Databases of MSAs for Proteins
• Pfam: Protein Family
Database of Profile
HMMs.
• SMART (Simple
Modular Architecture
Research Tool)
• Conserved Domain
Database (CDD)
• IMSAR (Integrated
Multiple Sequence
Alignment Resources)
• MSA database curation:
manual vs. automated
Pfam alignment retrieved in the JalView Java viewer
Databases on which InterPro (release 51.0) is based (it is now at
release 71.0)
[Link]
Multiple sequence alignment of
genomic DNA
• Aligning more species improves accuracy.
• Alignment of divergent sequences often reveals islands of
conservation (providing “anchors” for alignment).
• Chromosomes are subject to inversions, duplications,
deletions, & translocations (often involving millions of base
pairs).
• E.g. human chromosome 2 is derived from the fusion
of two acrocentric chromosomes.
• There are no benchmark datasets available for genomic
data.
Online resources
for Genomic MSA
• UCSC
• Galaxy: NGS
analyses
• Ensembl
• Alignathon:
whole genome
Alignment
Analyzing multiple sequence alignments at Ensembl
Interpreting Your
Multiple Sequence
Alignment
Interpret your multiple sequence alignment
o The interpretation of a multiple alignment
depends very much on its appearance.
o Some tools on the Net can help you make
sense of your multiple alignments by
extracting blocks or singling out special
positions.
- Interpreting an alignment is a bit of an art.
- Multiple alignments have no equivalent of E-values (the scores
that tell you how reliable your database search is).
- That means deciding whether your
alignment is correct still involves
some educated guesswork.
- DNA alignments are by far the most
difficult to interpret.
- If you’re analyzing this type of sequence, you
want a very high level of conservation,
knowing that single conserved columns are
likely to be meaningless.
- A DNA block is only informative when it
contains several identical columns in a cluster.
- Even with the DNA of closely related
sequences, obtaining such an alignment is
still difficult.
- This is why most biologists prefer protein
alignments.
Recognizing the good parts in a protein
alignment
- The most convincing evaluative grid we have for
a protein multiple alignment stems from our
knowledge of protein structures.
Claverie J, Notredame C (2007). Bioinformatics for Dummies (2nd Edn). Wiley publishing, Inc. 436 pp.
- We know that structures contain
surface loops that evolve rapidly.
(Loops are softer portions of the
protein that connect its more rigid
portions).
Protein structures also contain core regions
that act as support walls for the protein.
These support walls evolve less rapidly than
the loops on the surface.
In your multiple alignment, you can expect to
find nice, gap-free blocks that correspond to
the core regions — and gap-rich regions that
correspond to the loops.
Cabalistic signs
• The last line of a ClustalW, MUSCLE, or Tcoffee alignment
contains seemingly cabalistic signs such as (*), (:), or (.).
• (*) A star indicates an entirely conserved column.
• (:) A colon indicates columns where all the residues
have roughly the same size and the same
hydropathy.
• (.) A period indicates columns where the size OR the
hydropathy has been preserved in the course of
evolution.
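The three signs can be reproduced with a short sketch. The "strong" and "weak" residue groups below are the groups as remembered from the Clustal documentation and should be treated as an assumption to verify; the toy alignment is hypothetical.

```python
# Residue groups assumed from the Clustal documentation (verify before relying on them).
STRONG = ["STA", "NEQK", "NHQK", "NDEQ", "QHRK", "MILV", "MILF", "HY", "FYW"]
WEAK = ["CSA", "ATV", "SAG", "STNK", "STPA", "SGND", "SNDEQK",
        "NDEQHK", "NEQHRK", "FVLIM", "HFY"]


def conservation_line(alignment):
    """Return one symbol per column: '*', ':', '.', or a space."""
    symbols = []
    for column in zip(*alignment):
        residues = set(column)
        if "-" in residues:
            symbols.append(" ")
        elif len(residues) == 1:
            symbols.append("*")  # entirely conserved column
        elif any(residues <= set(group) for group in STRONG):
            symbols.append(":")  # strongly similar residues
        elif any(residues <= set(group) for group in WEAK):
            symbols.append(".")  # weakly similar residues
        else:
            symbols.append(" ")
    return "".join(symbols)


toy = ["MKLSW", "MRIAW", "MKVGF"]  # hypothetical toy alignment
for row in toy:
    print(row)
print(conservation_line(toy))
```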
The average good block is:
- A unit at least 10–30 amino acids long, exhibiting at
least one to three stars (*), a few more colons (:)
close to the stars, and several periods (.)
scattered along the MSA result.
• The magic thing about multiple sequence alignments is
that 4 or 5 conserved positions over 50 amino acids can
be enough to convince us that we’re looking at a genuine
signal. This is no more than 10 percent identity!
• You have to remember that we require at least 25 percent
identity to consider a pairwise alignment convincing.
Conserved columns in a multiple
sequence alignment are meaningful
only when the surrounding columns
are not conserved
Another criterion
for a useful multiple alignment is
knowing the type of amino acids
you can expect to see conserved.
Amino acids are not equal and they all
have very characteristic patterns of
mutation/conservation in a multiple
sequence alignment.
Patterns of Conservation in Multiple Sequence Alignments
W (tryptophan), F (phenylalanine), Y (tyrosine)
It is common to find conserved tryptophans. Tryptophan is a large
hydrophobic residue that sits deep in the core of proteins. It plays an
important role in their stability and is therefore difficult to mutate.
When tryptophan mutates, it is usually replaced by another
aromatic amino acid, such as phenylalanine or tyrosine.
Patterns of conserved aromatic amino acids constitute the most
common signatures for recognizing protein domains.
G (glycine), P (proline)
• It is common to find conserved columns with a glycine or a proline in a
multiple alignment. These two amino acids often coincide with the
extremities of well-structured beta strands or alpha helices.
C (cysteines)
Cysteines are famous for making C-C (disulphide)
bridges. Conserved columns of cysteines are rather
common and usually indicate such bridges.
Columns of conserved cysteines with a specific
distance provide a useful signature for recognizing
protein domains and folds.
H (Histidine), S (Serine)
Histidine and serine are often involved in
catalytic sites, especially those of proteases.
A conserved histidine or a conserved serine is a
good candidate for being part of an active
site.
K (Lysine), R (Arginine), D (Aspartic Acid), E (Glutamic Acid)
These charged amino acids are often
involved in ligand binding. Highly
conserved columns can also indicate a
salt bridge inside the core of the protein.
L (Leucines)
Leucines are rarely very conserved unless
they’re involved in protein-protein interactions
such as a leucine zipper.
Summary: multiple sequence alignment (MSA)
• Many dozens of MSA programs have been introduced in recent
years. None is optimal. Each offers unique strengths &
weaknesses.
• Key methods include
• Consistency-based MSA
• Iterative-based MSA
• Structure-based MSA
• Alignment of genomic DNA presents specialized challenges &
different sets of tools.
• MSAs are readily available through genome browsers such as
Ensembl, UCSC, & NCBI.