Understanding Multiple Sequence Alignment

Biotechnology and bioinformatics that highlights multiple sequence alignments

Chapter 5:

Multiple Sequence Alignment

Tamiru Oljira, Ph.D.

From Bioinformatics & Functional Genomics
(Wiley-Liss, 3rd edition, 2015) and H3BioNet website
Learning objectives

• Explain the three main stages by which ClustalW performs multiple sequence alignment (MSA).
• Describe several alternative programs for MSA (such as MUSCLE, ProbCons, & T-Coffee).
• Explain how they work & contrast them with ClustalW.
• Explain the significance of performing benchmarking studies & describe several of their basic conclusions for MSA.
• Explain the issues surrounding MSA of genomic regions.

Outline
• Introduction; definition of MSA; typical uses
• Five main approaches to multiple sequence alignment
• Exact approaches
• Progressive sequence alignment
• Iterative approaches
• Consistency-based approaches
• Structure-based methods
• Benchmarking studies: approaches, findings, challenges
• Databases of Multiple Sequence Alignments
• Pfam: Protein Family Database of Profile HMMs
• SMART
• Conserved Domain Database
• Integrated multiple sequence alignment resources
• MSA database curation: manual versus automated
• Multiple sequence alignments of genomic regions: UCSC, Galaxy, Ensembl, Alignathon
Multiple sequence alignment:
definition

• A multiple sequence alignment (MSA) is generally the alignment of three or more biological sequences (protein or nucleic acid) of similar length that are partially or completely aligned.
• From the output, homology can be inferred & the evolutionary relationships between the sequences studied.
• Homologous residues are aligned in columns across the length of the sequences.
• Residues are homologous in
• an evolutionary sense
• a structural sense
Example: Alignments of 5 globins
• Examples of MSAs are presented on the next few slides for five globin sequences, using five prominent MSA programs for comparison:
o ClustalW (now largely replaced by Clustal Omega)
o Praline
o MUSCLE (MUltiple Sequence Comparison by Log-Expectation)
o ProbCons (Probabilistic Consistency-based Multiple Alignment of Amino Acid Sequences)
o T-Coffee (Tree-based Consistency Objective Function For alignment Evaluation)
• Each program offers unique strengths.
Example: Alignments of 5 globins
• The five sequences are:
• Beta globin, myoglobin, neuroglobin, soybean globin, rice globin
• In the aligned results we will focus on a histidine (H) residue that has a critical role in binding oxygen in globins; it should be aligned in all five sequences.
• But often it is not aligned, & the five programs give different answers.
• Our conclusion will be that there is no single best approach to MSA.
• Dozens of new programs have been introduced in recent years.
• You can read about other well-known programs at: [Link]
• MAFFT
• Clustal Omega
• Kalign
• MView
• WebPRANK
• COBALT
• You will use this and three other programs in your exercises.
Result for ClustalW

• Note how the region of a conserved histidine (▼) varies depending on which of the five prominent algorithms is used.
• Clustal Omega is now recommended for protein MSA.
Result for Praline

• Note also the changing pattern of gaps within the boxed region in these five different alignments.
Result for MUSCLE

• MUSCLE is claimed to achieve both better average accuracy & better speed than ClustalW2 or T-Coffee, depending on the chosen options.
• This tool can align up to 500 sequences or a maximum file size of 1 MB.
ProbCons
Tcoffee
Multiple sequence alignment: features

• There is not necessarily one “correct” alignment of a protein family.
• Protein sequences evolve.
• The corresponding 3-D structures of proteins also evolve.
• It may be impossible to identify amino acid residues that align properly (structurally) throughout a MSA.
• For two proteins sharing 30% amino acid identity, about 50% of the individual amino acids are superposable in the two structures.
Multiple sequence alignment: features

o Some aligned residues, such as cysteines that form disulfide bridges, may be highly conserved.
o There may be conserved motifs such as a transmembrane domain.
o There may be conserved secondary structure features.
o There may be regions with consistent patterns of insertions or deletions (indels).
Multiple sequence alignment: uses

• MSA is more sensitive than pairwise alignment for detecting homologs.
• BLAST output can take the form of a MSA, & can reveal conserved residues or motifs.
• A single query can be searched against a database of MSAs (e.g. Pfam).
• Regulatory regions of genes may have consensus sequences identifiable by MSA.
Five prominent approaches to MSA of proteins (diagram): exact, progressive, iterative, consistency-based & structure-based.
1. Exact methods to MSA

• Exact methods of MSA use dynamic programming & are guaranteed to find optimal solutions.
• The matrix is multidimensional rather than two-dimensional.
• The goal is to maximize the summed alignment score of each pair of sequences.
• But they are not feasible for more than a few sequences: time & memory requirements grow exponentially with the number of sequences.
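
The "summed alignment score of each pair" and the size of the multidimensional matrix can be sketched in a few lines of Python. This is only an illustration: the match, mismatch & gap values are arbitrary assumptions (real programs use substitution matrices such as PAM or BLOSUM), and the function names are hypothetical.

```python
from itertools import combinations

def sum_of_pairs_column(column, match=1, mismatch=-1, gap=-2):
    """Score one alignment column by summing over every pair of rows."""
    score = 0
    for a, b in combinations(column, 2):
        if a == "-" and b == "-":
            continue  # a gap facing a gap is conventionally not scored
        if a == "-" or b == "-":
            score += gap
        elif a == b:
            score += match
        else:
            score += mismatch
    return score

def sum_of_pairs(alignment, **scores):
    """Total sum-of-pairs score of an alignment given as equal-length rows."""
    return sum(sum_of_pairs_column(col, **scores) for col in zip(*alignment))

def lattice_cells(lengths):
    """Number of cells in the multidimensional dynamic-programming matrix."""
    cells = 1
    for n in lengths:
        cells *= n + 1
    return cells

# Toy alignment (made-up sequences, not real globins).
toy_alignment = ["ACG", "A-G", "ACT"]
toy_score = sum_of_pairs(toy_alignment)

# Two sequences of length 150 need about 2.3e4 cells; five need about 7.9e10.
cells_for_two = lattice_cells([150, 150])
cells_for_five = lattice_cells([150] * 5)
```

The cell count multiplies by the sequence length for every sequence added, which is why exact methods stop being practical after a handful of sequences.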
2. Progressive methods to MSA
• Progressive methods:
• work stepwise: each step is applied to the result of the previous one, giving successively closer approximations to the final alignment.
• use a guide tree (related to a phylogenetic tree) to determine how to combine pairwise alignments one-by-one to create a MSA.
• First proposed by Feng & Doolittle
• Examples of programs using progressive algorithms:
• ClustalW
• MUSCLE
Progressive MSA occurs in 3 stages

1. Do a set of global pairwise alignments (Needleman & Wunsch’s dynamic programming algorithm)
2. Create a guide tree
3. Progressively align the sequences
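
Stage 1 relies on global pairwise alignment. The following is a minimal sketch of the Needleman & Wunsch dynamic programming idea, assuming a simple match/mismatch score and a linear gap penalty (assumed values; ClustalW itself uses substitution matrices and separate gap opening & extension penalties).

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-2):
    """Return (score, aligned_a, aligned_b) for a global alignment of a and b."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    # First row and column: aligning a prefix against nothing costs only gaps.
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap
    # Fill the matrix: each cell is the best of diagonal, up and left moves.
    for i in range(1, rows):
        for j in range(1, cols):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)
    # Trace back from the bottom-right corner to recover one optimal alignment.
    out_a, out_b = [], []
    i, j = len(a), len(b)
    while i > 0 or j > 0:
        sub = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub:
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1]); j -= 1
    return score[-1][-1], "".join(reversed(out_a)), "".join(reversed(out_b))

# Toy example with made-up sequences.
best, row_a, row_b = needleman_wunsch("ACGT", "AGT")
```

Running this for every pair of input sequences gives the table of pairwise scores from which the guide tree is built.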

Example of MSA using clustalW: Progressive

• Do the MSA exercise using ClustalW by visiting [Link] (may no longer be working).
• Refer to web documents 6-3 & 6-4 in the book at the website.
• Two data sets:
• Five distantly related globins (human to plant)
• Five closely related beta globins
• Obtain your sequences in FASTA format.
• You can save them in a Word document or text editor.
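
If you prefer to check your FASTA file before pasting it into a web form, a few lines of Python are enough. This is a minimal sketch, not a full parser; the record names and sequences below are made-up toy data, and `parse_fasta` is a hypothetical helper name.

```python
def parse_fasta(text):
    """Return {record name: sequence} from FASTA-formatted text."""
    records = {}
    name = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            # The name is the first word after '>'; the rest is a description.
            fields = line[1:].split()
            name = fields[0] if fields else "unnamed_{}".format(len(records) + 1)
            records[name] = []
        elif name is not None:
            records[name].append(line)  # sequence lines may be wrapped
    return {key: "".join(parts) for key, parts in records.items()}

# Toy input (made-up sequences).
toy = """>seq_a toy record
MVLSPADK
TNVKAAW
>seq_b
MVHLTPEE
"""
toy_records = parse_fasta(toy)
```

Each record starts with a `>` header line, and the sequence may be wrapped over several lines, which the sketch joins back together.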
Use ClustalW to do a progressive MSA

[Link]
ClustalW stage 1: series of pairwise global alignments
(Figure: table of pairwise alignment scores; the best score marks the pair with the highest percent identity.)
Number of pairwise alignments needed:
o For n sequences, (n-1)(n) / 2
o For 5 sequences, (4)(5) / 2 = 10
o For 200 sequences, (199)(200) / 2 = 19,900
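
The counts above can be checked directly. A small sketch, using the five globin names from the example as labels only:

```python
from itertools import combinations

def pairwise_count(n):
    """Number of distinct pairs among n sequences: n(n-1)/2."""
    return n * (n - 1) // 2

names = ["beta globin", "myoglobin", "neuroglobin", "soybean globin", "rice globin"]
# Every unordered pair of sequences is aligned exactly once in stage 1.
pairs = list(combinations(names, 2))
```

The quadratic growth is one reason later programs such as MUSCLE avoid full pairwise alignment in their first pass.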

Progressive; stage 2: guide tree

• Convert similarity scores to distance scores
• A tree shows the distance between objects
• Use UPGMA (defined in the phylogeny chapter) or other options.
• ClustalW provides a syntax to describe the tree
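
A minimal sketch of stage 2, under stated assumptions: similarity is taken to be percent identity, distance is simply 1 minus the fractional identity, the tree is built by UPGMA, and the result is written in parenthesized Newick notation (assumed here to be the tree syntax meant above). The labels and distances are toy values, not globin results.

```python
from itertools import combinations

def identity_to_distance(percent_identity):
    """Turn a percent identity (0-100) into a simple distance (0-1)."""
    return 1.0 - percent_identity / 100.0

def upgma(names, dist):
    """Build a UPGMA tree; dist maps (a, b) label pairs to distances."""
    # Each cluster key maps to (newick text, number of leaves, node height).
    clusters = {name: (name, 1, 0.0) for name in names}
    d = {frozenset(pair): value for pair, value in dist.items()}
    while len(clusters) > 1:
        # Join the two closest clusters.
        a, b = min(combinations(sorted(clusters), 2),
                   key=lambda pair: d[frozenset(pair)])
        newick_a, size_a, height_a = clusters.pop(a)
        newick_b, size_b, height_b = clusters.pop(b)
        height = d[frozenset((a, b))] / 2.0
        merged = "({}:{:.3f},{}:{:.3f})".format(
            newick_a, height - height_a, newick_b, height - height_b)
        key = a + "+" + b
        # Distance to the new cluster is the size-weighted average.
        for other in clusters:
            d[frozenset((key, other))] = (
                size_a * d[frozenset((a, other))]
                + size_b * d[frozenset((b, other))]) / (size_a + size_b)
        clusters[key] = (merged, size_a + size_b, height)
    return next(iter(clusters.values()))[0] + ";"

# Toy distances for three hypothetical sequences A, B and C.
toy_tree = upgma(["A", "B", "C"],
                 {("A", "B"): 0.2, ("A", "C"): 0.6, ("B", "C"): 0.6})
```

The closest pair (A and B) is joined first and therefore ends up on the shortest branches, which is the same pattern the globin guide tree shows.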
ClustalW stage 1: series of pairwise alignments
(Figure: the best score marks the pair with the highest percent identity.)

ClustalW stage 2: create a guide tree
Note that the two proteins with the highest percent pairwise identity (soybean & rice globin) also have the shortest connecting branch lengths in the tree.
Progressive; stage 2:
Generate a guide tree calculated from the distance matrix (examples for 5 distantly related globins & 5 closely related globins).
A guide tree is built from the score matrix by neighbor-joining (NJ), with branch lengths proportional to the pairwise distance scores.
The NJ method is used to build the unrooted & rooted trees (see chapter on Phylogenetics).
Progressive; stage 3: progressive alignment

• Multiple alignment is carried out by
• starting with the most closely related sequence pair & aligning it,
• then including more distant sequences progressively according to the branching order in the guide tree, i.e. adding the next closest sequence & continuing until all sequences are added to the MSA.
• Use the rule: “once a gap, always a gap.”
Why “once a gap, always a gap”?

• There are many possible ways to make a MSA.
• Where gaps are added is a critical question.
• Gaps are often added to the first two (closest) sequences.
• To change the initial gap choices later on would be to give more weight to distantly related sequences.
• To maintain the initial gap choices is to trust that those gaps are most believable.
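
One way to picture the rule: when a new sequence needs extra room, the already aligned group receives a whole gap column, and none of its existing gaps is moved or removed. A toy sketch with made-up sequences and a hypothetical helper name:

```python
def insert_gap_column(alignment, position):
    """Insert a gap at the same column in every row of an aligned group."""
    return [row[:position] + "-" + row[position:] for row in alignment]

# An already aligned pair; the gap in the first row was chosen in an earlier step.
group = ["AC-GT", "ACTGT"]
# A later, more distant sequence needs one extra column before the last residue.
widened = insert_gap_column(group, 4)
```

After the call the early gap is still where it was, and each row still spells out its original sequence once gaps are stripped.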
ClustalW alignment of five distantly related globins
ClustalW alignment of five closely related beta globin orthologs
Additional features of ClustalW improve its ability to generate accurate MSAs

• Individual weights are assigned to sequences:
• very closely related sequences are given less weight,
• distantly related sequences are given more weight.
• Scoring matrices are varied depending on the presence of conserved or divergent sequences, e.g.:

Matrix    Percent identity
PAM20     80-100%
PAM60     60-80%
PAM120    40-60%
PAM350    0-40%

• Residue-specific gap penalties are applied.

See Thompson et al. (1994) for an explanation of the three stages of progressive alignment implemented in ClustalW.
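
The matrix table above amounts to a simple lookup. A sketch follows; because the listed ranges share their end points, the choice to give a boundary value (e.g. exactly 80%) to the matrix for the more similar range is an assumption, and `choose_pam_matrix` is a hypothetical name, not a ClustalW function.

```python
def choose_pam_matrix(percent_identity):
    """Pick a PAM matrix name from the percent identity of the sequences."""
    if not 0 <= percent_identity <= 100:
        raise ValueError("percent identity must be between 0 and 100")
    if percent_identity >= 80:
        return "PAM20"   # very similar sequences
    if percent_identity >= 60:
        return "PAM60"
    if percent_identity >= 40:
        return "PAM120"
    return "PAM350"      # divergent sequences

chosen = [choose_pam_matrix(p) for p in (95, 70, 50, 10)]
```

The point of the switch is that a matrix tuned for close relatives penalizes substitutions that are entirely expected between distant ones.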
Figure: stages of the alignment procedure
1. Pairwise alignment: calculate distance matrix
2. Unrooted neighbor-joining tree
3. Rooted neighbor-joining tree (guide tree) & sequence weights
4. Progressive alignment: align following the guide tree
3. Iterative methods of MSAs
• Compute a sub-optimal solution &
• keep modifying it intelligently using dynamic programming or other methods until the solution converges.

Examples of programs using iterative algorithms:

o MUSCLE
o IterAlign
o Praline
o MAFFT
Iterative approaches: MAFFT
o MAFFT (Multiple Alignment using Fast Fourier Transform, Katoh et al., 2005)
o Uses Fast Fourier Transform to speed up profile alignment
o Uses fast two-stage method for building alignments using k-mer frequencies
o Offers many different scoring & aligning techniques
o One of the more accurate programs available
o Available as standalone or web interface
o Many output formats, including interactive phylogenetic trees
Iterative approaches: MAFFT

Has about 1000 advanced settings!
Iterative method of MAFFT: Steps
The initial alignment is a progressive alignment
Iterative refinement (figure panels): MAFFT, MUSCLE, ProbCons, T-COFFEE.
MUSCLE: next-generation progressive MSA

Three steps:
[1] Build a draft progressive alignment
[2] Improve the progressive alignment
[3] Refine the MSA
[1] Build a draft progressive alignment
• Determine pairwise similarity through k-mer counting
(not by alignment)
• Compute a (triangular) distance matrix
• Construct tree using UPGMA
• Construct draft progressive alignment following tree
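
The k-mer counting in step [1] can be sketched as follows. The exact measure is an assumption here: shared k-mers divided by the number of k-mers in the shorter sequence is one common definition, but the slides do not specify it. The sequences are toy data and k = 3 is an arbitrary choice.

```python
from collections import Counter

def kmer_counts(seq, k):
    """Count every overlapping word of length k in seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))

def kmer_similarity(x, y, k=3):
    """Fraction of k-mers shared by x and y, relative to the shorter sequence."""
    shortest = min(len(x), len(y))
    if shortest < k:
        raise ValueError("both sequences must be at least k residues long")
    counts_x, counts_y = kmer_counts(x, k), kmer_counts(y, k)
    # A k-mer missing from y has count 0 in a Counter, so it adds nothing.
    shared = sum(min(counts_x[kmer], counts_y[kmer]) for kmer in counts_x)
    return shared / (shortest - k + 1)

def kmer_distance(x, y, k=3):
    """Distance used to fill the (triangular) matrix: 1 - similarity."""
    return 1.0 - kmer_similarity(x, y, k)

same = kmer_similarity("ACDEFG", "ACDEFG")
one_change = kmer_similarity("ACDEFG", "ACDXFG")
```

No alignment is computed at any point, which is what makes this first pass fast.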
MUSCLE: next-generation progressive MSA

[2] Improve the progressive alignment

• Compute pairwise identity from the current MSA
• Construct a new tree with Kimura distance measures
• Compare new & old trees:
• if improved, repeat this step,
• if not improved, then we’re done.
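
A sketch of step [2]'s distance calculation, assuming the Kimura protein distance correction d = -ln(1 - D - D²/5), where D is the fraction of differing residues. The slides do not give the formula, so treat it as an assumption; the aligned rows are toy data.

```python
import math

def fractional_identity(row_a, row_b):
    """Identity over the columns where neither aligned row has a gap."""
    pairs = [(a, b) for a, b in zip(row_a, row_b) if a != "-" and b != "-"]
    if not pairs:
        raise ValueError("no ungapped columns to compare")
    return sum(a == b for a, b in pairs) / len(pairs)

def kimura_distance(identity):
    """Correct observed differences for multiple substitutions at one site."""
    d = 1.0 - identity
    inner = 1.0 - d - d * d / 5.0
    if inner <= 0:
        raise ValueError("sequences too divergent for this correction")
    return -math.log(inner)

toy_identity = fractional_identity("AC-GT", "ACTGA")
toy_distance = kimura_distance(toy_identity)
```

The corrected distance is always at least as large as the raw fraction of differences, and the gap widens as sequences diverge.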
MUSCLE: next-generation progressive MSA

[3] Refine the MSA

• Split tree in half by deleting one edge
• Make profiles of each half of the tree
• Re-align the profiles
• Accept/reject the new alignment
Access to MUSCLE at EBI: [Link]
4. Consistency-based approaches to MSA

Consistency-based algorithms:
• Generally use a database of both local high-scoring alignments & long-range global alignments to create a final alignment.
• These are very powerful & very accurate methods, though often slower than purely progressive ones.

Examples of MSA programs that implement consistency-based methods:
o T-COFFEE
o Prrp
o DiAlign
o ProbCons
ProbCons (Probabilistic Consistency-based) approach

• Combines iterative & progressive approaches with a unique probabilistic model.
• Uses Hidden Markov Models (HMMs) to calculate probability matrices for matching residues,
• uses these to construct a guide tree.
• Progressive alignment is done hierarchically along the guide tree.
• Post-processing & iterative refinement (a little like MUSCLE)
ProbCons: consistency-based approach

If we align three sequences:

Sequence x: residue x_i
Sequence y: residue y_j
Sequence z: residue z_k

• If x_i aligns with z_k & z_k aligns with y_j, then x_i should align with y_j.
• ProbCons incorporates evidence from multiple sequences to guide the creation of a pairwise alignment.
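
The transitivity idea can be sketched with small probability tables. This is a simplified illustration, not the ProbCons implementation: the support for x_i aligning with y_j through z is taken as the sum over k of P(x_i~z_k) × P(z_k~y_j), and the plain average with the direct evidence is an assumed weighting. All numbers are made up.

```python
def consistency_through(p_xz, p_zy):
    """Indirect support for x_i ~ y_j routed through every residue z_k."""
    n_z, n_y = len(p_zy), len(p_zy[0])
    return [[sum(row_x[k] * p_zy[k][j] for k in range(n_z))
             for j in range(n_y)]
            for row_x in p_xz]

def relax(p_xy, via):
    """Average direct evidence with evidence from each third sequence.

    via is a list of (p_xz, p_zy) table pairs, one per third sequence z.
    """
    total = [row[:] for row in p_xy]
    for p_xz, p_zy in via:
        indirect = consistency_through(p_xz, p_zy)
        for i, row in enumerate(indirect):
            for j, value in enumerate(row):
                total[i][j] += value
    n = len(via) + 1
    return [[value / n for value in row] for row in total]

# Toy tables for sequences of two residues each.
p_xy = [[0.5, 0.5], [0.5, 0.5]]   # direct evidence is uninformative
p_xz = [[0.9, 0.1], [0.1, 0.9]]   # x clearly matches z position by position
p_zy = [[0.8, 0.2], [0.2, 0.8]]   # z clearly matches y position by position
updated = relax(p_xy, [(p_xz, p_zy)])
```

Here the direct comparison of x & y cannot decide between the two pairings, but the third sequence tips the balance toward the diagonal.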
ProbCons output for the same alignment:
consistency iteration helps
T-Coffee: Tree-based Consistency Objective Function For alignment Evaluation

Access to T-Coffee: [Link]

o Make a MSA
o MSA w. structural data
o Compare MSA methods
o Make an RNA MSA
o Combine MSA methods
o Consistency-based
o Structure-based
APDB (“Analyze alignments with Protein Data Bank”)
ClustalW output:
TCoffee can incorporate structural information into a MSA

Protein Data Bank accession numbers

Benchmarking studies: approaches, findings, challenges

How do we know which program to use for MSA?

Strategy for assessment of alternative MSA algorithms
1. Create or obtain a database of protein sequences for which the 3D structure is known.
• Helps to define “true” homologs using structural criteria.
• Helps to judge your alignments.
• For example, try Expresso at [Link]
2. Try making MSAs with many different sets of proteins:
o closely related, very distant, few gaps, many gaps, insertions, outliers
• There are benchmark multiple alignment datasets that have been aligned painstakingly by:
• hand,
• structural similarity, or
• extremely time- & memory-intensive automated exact algorithms.
e.g. BAliBASE (Benchmark Alignment dataBASE)
3. Compare the answers
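
Step 3 needs a number. One common choice, assumed here rather than taken from the slides, is a sum-of-pairs benchmark score: the fraction of residue pairs aligned in the reference alignment that the test alignment also places in the same column. The alignments below are toy data.

```python
from itertools import combinations

def aligned_pairs(alignment):
    """Set of residue pairs that share a column, as ((seq, pos), (seq, pos))."""
    pairs = set()
    position = [0] * len(alignment)  # next residue index in each sequence
    for column in zip(*alignment):
        residues = []
        for seq_index, char in enumerate(column):
            if char != "-":
                residues.append((seq_index, position[seq_index]))
                position[seq_index] += 1
        pairs.update(combinations(residues, 2))
    return pairs

def sp_score(test, reference):
    """Fraction of reference residue pairs reproduced by the test alignment."""
    reference_pairs = aligned_pairs(reference)
    if not reference_pairs:
        raise ValueError("reference alignment has no aligned residue pairs")
    return len(aligned_pairs(test) & reference_pairs) / len(reference_pairs)

reference = ["AC-GT", "ACTGT"]
shifted_gap = ["ACG-T", "ACTGT"]   # same sequences, gap placed one column later
perfect = sp_score(reference, reference)
partial = sp_score(shifted_gap, reference)
```

Moving a single gap by one column loses one of the four reference pairs, so the score drops from 1.0 to 0.75.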
Benchmarking studies: approaches, findings, challenges
BAliBASE (Benchmark Alignment dataBASE)
• is used for comparison of MSA programs
• provides reference alignments for benchmarking tests
• Some programs have interfaces that are more user-friendly than others, & most programs are excellent, so the choice depends on your preference.
Benchmarking studies: approaches, findings, challenges

• Benchmarking tests on BAliBASE suggest:
• ProbCons, a consistency-based/progressive algorithm, performs the best.
• However, MUSCLE, a progressive alignment package, is an extremely fast & accurate program.
• ClustalW has been the most popular program.
• It has a nice interface (especially with ClustalX) & is easy to use.
• But several programs perform better than it.
• There is no single best program to use, & your answers will certainly differ (especially if you align divergent protein or DNA sequences).
Databases of MSAs for Proteins

• Pfam: Protein Family Database of Profile HMMs
• SMART (Simple Modular Architecture Research Tool)
• Conserved Domain Database (CDD)
• IMSAR (Integrated Multiple Sequence Alignment Resources)
• MSA database curation: manual vs. automated
Pfam alignment retrieved in the JalView Java viewer
Databases on which InterPro (release 51.0) is based (now at release 71.0)

[Link]
Multiple sequence alignment of
genomic DNA
• Aligning more species improves accuracy.
• Alignment of divergent sequences often reveals islands of conservation (providing “anchors” for alignment).
• Chromosomes are subject to inversions, duplications, deletions, & translocations (often involving millions of base pairs).
• E.g. human chromosome 2 is derived from the fusion of two acrocentric chromosomes.
• There are no benchmark datasets available for genomic data.
Online resources for Genomic MSA
• UCSC
• Galaxy: NGS analyses
• Ensembl
• Alignathon: whole genome alignment
Analyzing multiple sequence alignments at Ensembl
Interpreting Your
Multiple Sequence
Alignment
Interpret your multiple sequence alignment
o The interpretation of a multiple alignment depends very much on its appearance.
o Some tools on the Net can help you make sense of your multiple alignments by extracting blocks or singling out special positions.
- Interpreting an alignment is a bit of an art.
- Multiple alignments have no equivalent of E-values (the scores that tell you how reliable your database search is). That means deciding whether your alignment is correct still involves some educated guesswork.
- DNA alignments are by far the most difficult to interpret.
- If you’re analyzing this type of sequence, you want a very high level of conservation, knowing that single conserved columns are likely to be meaningless.
- A DNA block is only informative when it contains several identical columns in a cluster.
- Even with the DNA of closely related sequences, obtaining such an alignment is still difficult.
- This is why most biologists prefer protein alignments.
Recognizing the good parts in a protein
alignment

- The most convincing evaluative grid we have for a protein multiple alignment stems from our knowledge of protein structures.

Claverie J, Notredame C (2007). Bioinformatics for Dummies (2nd Edn). Wiley Publishing, Inc. 436 pp.
- We know that structures contain surface loops that evolve rapidly. (Loops are softer portions of the protein that connect its more rigid portions.)
- Protein structures also contain core regions that act as support walls for the protein. These support walls evolve less rapidly than the loops on the surface.
- In your multiple alignment, you can expect to find nice, gap-free blocks that correspond to the core regions, and gap-rich regions that correspond to the loops.
Cabalistic signs
• The last line of a ClustalW, MUSCLE, or T-Coffee alignment contains seemingly cabalistic signs such as (*), (:), or (.).
• (*) A star indicates an entirely conserved column.
• (:) A colon indicates columns where all the residues have roughly the same size and the same hydropathy.
• (.) A period indicates columns where the size OR the hydropathy has been preserved in the course of evolution.
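
A sketch of how such a line can be produced. The residue groups below are an assumption: they follow the "strong" and "weak" conservation groups described in the Clustal documentation and should be checked against it before being relied on. The alignment is toy data, and upper-case one-letter codes are assumed.

```python
# Assumed Clustal-style groups: every residue of a column must fall in one group.
STRONG_GROUPS = ["STA", "NEQK", "NHQK", "NDEQ", "QHRK",
                 "MILV", "MILF", "HY", "FYW"]
WEAK_GROUPS = ["CSA", "ATV", "SAG", "STNK", "STPA", "SGND",
               "SNDEQK", "NDEQHK", "NEQHRK", "FVLIM", "HFY"]

def column_symbol(column):
    """Return '*', ':', '.' or ' ' for one alignment column."""
    residues = set(column)
    if "-" in residues:
        return " "   # columns containing gaps get no symbol
    if len(residues) == 1:
        return "*"   # entirely conserved
    if any(residues <= set(group) for group in STRONG_GROUPS):
        return ":"   # strongly similar residues
    if any(residues <= set(group) for group in WEAK_GROUPS):
        return "."   # weakly similar residues
    return " "

def conservation_line(alignment):
    """Build the symbol line printed under an alignment block."""
    return "".join(column_symbol(column) for column in zip(*alignment))

toy_alignment = ["HKSW", "HRGD", "HKSW"]
toy_line = conservation_line(toy_alignment)
```

In the toy alignment the first column is identical (star), the second mixes K & R (colon), the third mixes S & G (period), and the last column gets no symbol.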
The average good block is:
- A unit at least 10–30 amino acids long, exhibiting at least one to three stars (*), a few more colons (:) close to the stars, and several periods (.) scattered along the block.
• The magic thing about multiple sequence alignments is
that 4 or 5 conserved positions over 50 amino acids can
be enough to convince us that we’re looking at a genuine
signal. This is less than 10 percent identity!
• You have to remember that we require at least 25 percent
identity to consider a pairwise alignment
Conserved columns in a multiple
sequence alignment are meaningful
only when the surrounding columns
are not conserved
Another criterion
for a useful multiple alignment is
knowing the type of amino acids
you can expect to see conserved.
Amino acids are not equal and they all
have very characteristic patterns of
mutation/conservation in a multiple
sequence alignment.
Patterns of Conservation in Multiple Sequence Alignments
W (tryptophan), F (phenylalanine), Y (tyrosine)
It is common to find conserved tryptophans. Tryptophan is a large
hydrophobic residue that sits deep in the core of proteins. It plays an
important role in their stability and is therefore difficult to mutate.
When tryptophan mutates, it is usually replaced by another
aromatic amino acid, such as phenylalanine or tyrosine.

Patterns of conserved aromatic amino acids constitute the most

common signatures for recognizing protein domains.
G (glycine), P (proline)
• It is common to find conserved columns with a glycine or a proline in a
multiple alignment. These two amino acids often coincide with the
extremities of well-structured beta strands or alpha helices.

C (cysteines)
Cysteines are famous for making C-C (disulphide)
bridges. Conserved columns of cysteines are rather
common and usually indicate such bridges.
Columns of conserved cysteines with a specific
distance provide a useful signature for recognizing
protein domains and folds.
H (histidine), S (serine)
Histidine and serine are often involved in catalytic sites, especially those of proteases. A conserved histidine or a conserved serine is a good candidate for being part of an active site.
K (lysine), R (arginine), D (aspartic acid), E (glutamic acid)

These charged amino acids are often involved in ligand binding. Highly conserved columns can also indicate a salt bridge inside the core of the protein.
L (Leucines)
Leucines are rarely very conserved unless
they’re involved in protein-protein interactions
such as a leucine zipper.
Summary: multiple sequence alignment (MSA)

• Many dozens of MSA programs have been introduced in recent years. None is optimal. Each offers unique strengths & weaknesses.
• Key methods include
• Consistency-based MSA
• Iterative MSA
• Structure-based MSA
• Alignment of genomic DNA presents specialized challenges & different sets of tools.
• MSAs are readily available through genome browsers such as Ensembl, UCSC, & NCBI.
