Sequence Alignment and BLOSUM Matrix Overview
Sequence Alignment and BLOSUM Matrix Overview
UNIT-III
GIVE A DETAIL ACCOUNT ON SEQUENCE ALIGNMENT BASED ON MATRICES (5 marks)
In bioinformatics, a sequence alignment is a way of arranging the sequences of DNA, RNA or protein to
identify regions of similarity that may be a consequence of functional, structural, or evolutionary relationships
between the sequences.
Aligned sequences of nucleotide or aminoacid residues are typically represented as rows within a matrix.
Gaps are inserted between the residues so that identical or similar characters are aligned in successive
columns
A sequence alignment, produced by ClustalW, of two human zinc finger proteins, identified on the left by GenBank
accession number. Sequence alignments are also used for non-biological sequences, such as those present in natural
language or in financial data.
If two sequences in an alignment share a common ancestor, mismatches can be interpreted as point
mutations and gaps as indels (that is, insertion or deletion mutations) introduced in one or both
lineages in the time since they diverged from one another.
In sequence alignments of proteins, the degree of similarity between amino acids occupying a
particular position in the sequence can be interpreted as a rough measure of how conserved a
particular region or sequence motif is among lineages.
The absence of substitutions, or the presence of only very conservative substitutions (that is, the
substitution of amino acids whose side chains have similar biochemical properties) in a particular
region of the sequence, suggest that this region has structural or functional importance.
Although DNA and RNA nucleotide bases are more similar to each other than are amino acids, the
conservation of base pairs can indicate a similar functional or structural role.
Scores within a BLOSUM are log-odds scores that measure, in an alignment, the logarithm for the
ratio of the likelihood of two amino acids appearing with a biological sense and the likelihood of the
same amino acids appearing by chance.
The matrices are based on the minimum percentage identity of the aligned protein sequence used in
calculating them.
Every possible identity or substitution is assigned a score based on its observed frequencies in the
alignment of related proteins. A positive score is given to the more likely substitutions while a
negative score is given to the less likely substitutions.
Substitution matrix
In bioinformatics and evolutionary biology, a substitution matrix describes the rate at which one
character in a sequence changes to other character states over time.
Substitution matrices are usually seen in the context of amino acid or DNA sequence alignments,
where the similarity between sequences depends on their divergence time and the substitution rates as
represented in the matrix. In the process of evolution, from one generation to the next the amino acid
sequences of an organism's proteins are gradually altered through the action of DNA mutations.
For example, the sequence
ALEIRYLRD
Could mutate into the sequence
ALEINYLRD
In one generation and possibly
AQEINYQRD
Over a longer period of evolutionary time.
Each amino acid is more or less likely to mutate into various other amino acids. For instance, a
hydrophilic residue such as arginine is more likely to be replaced another hydrophilic residue such as
glutamine, than it is to be mutated into a hydrophobic residue such as leucine.
This is primarily due to redundancy in the genetic code, which translates similar codons into similar
amino acids. Furthermore, mutating an amino acid to a residue with significantly different properties
could affect the folding and/or activity of the protein. There is therefore usually strong selective pressure
to remove such mutations quickly from a population.
If we have two amino acid sequences in front of us, we should be able to say something about how likely
they are to be derived from a common ancestor, or homologous. If we can line up the two sequences
using a sequence alignment algorithm such that the mutations required transforming a hypothetical
ancestor sequence into both of the current sequences would be evolutionarily plausible, then we'd like to
assign a high score to the comparison of the sequences.
3
To this end, we will construct a 20x20 matrix where the (i,j)th entry is equal to the probability of the ith
amino acid being transformed into the jth amino acid in a certain amount of evolutionary time.
There are many different ways to construct such a matrix, called a substitution matrix. Here are the
most commonly used ones:
Identity matrix
The simplest possible substitution matrix would be one in which each amino acid is considered maximally
similar to itself, but not able to transform into any other amino acid. This matrix would look like:
.
This identity matrix will succeed in the alignment of very similar amino acid sequences but will be
miserable at aligning two distantly related sequences
Log-odds matrices
We express the probabilities of transformation in what are called log-odds scores. The scores matrix
S is defined as
Where Mi,j is the probability that amino acid i transforms into amino acid j and pi is the frequency
of amino acid i. The base of the logarithm is not important, and you will often see the same substitution
matrix expressed in different bases.
Different levels of the BLOSUM matrix can be created by differentially weighting the degree of
similarity between sequences. For example, a BLOSUM62 matrix is calculated from protein blocks such
that if two sequences are more than 62% identical, then the contribution of these sequences is weighted to
sum to one. In this way the contributions of multiple entries of closely related sequences is reduced.
The BLOSUM62 matrix is given in Table 2. If the BLOSUM62 matrix is compared to PAM160 (it's
closest equivalent) then it is found that the BLOSUM matrix is less tolerant of substitutions to or from
hydrophilic amino acids, while more tolerant of hydrophobic changes and of cysteine and tryptophan
mismatches.
4
PAM
One of the first amino acid substitution matrices, the PAM (Point Accepted Mutation) matrix was
developed by Margaret Dayhoff in the 1970s. This matrix is calculated by observing the differences in
closely related proteins.
The PAM1 matrix estimates what rate of substitution would be expected if 1% of the amino acids
had changed. The PAM1 matrix is used as the basis for calculating other matrices by assuming that repeated
mutations would follow the same pattern as those in the PAM1 matrix, and multiple substitutions can occur
at the same site.
Using this logic, Dayhoff derived matrices as high as PAM250. Usually the PAM 30 and the PAM70
are used. A matrix for divergent sequences can be calculated from a matrix for closely related sequences by
taking the second matrix to a power. For instance, we can roughly approximate the WIKI2 matrix from the
WIKI1 matrix by saying where W1 is WIKI1 and W2 is WIKI2. This is how the PAM250
matrix is calculated.
numerator), and the other that the change occurred because of random sequence variation of non
biological significance (the denominator).
ALIGNMENT ALGORITHMS
To find out a best optimal alignment for a pair of sequences, a scoring system or matrices is
required.
If both sequences are of same length the only possibility is global alignment but when you
have sequences which differ in length then alignment becomes complicated as gaps has to be
inserted for good alignment and gaps can be inserted in any position, they we go for local
alignments between sequences
NEEDLEMAN-WUNSCH GLOBAL ALIGNMENT ALGORITHM (10 MARKS)
To obtain the optimal global alignment between two sequences, allowing gaps, that is all of the x has
to be aligned with all of y.
The dynamic programming algorithms for solving this problem are known as Needleman-Wunsch
Global Alignment Algorithm.
In a gapped alignment, a nucleic acid in x is matched either by an nucleic acid in y or by a gap and
vice-versa e.g. the sequences GAATTC AND GATTA has to be aligned according to the following
scoring rules starting
score=0,
match=2,
mismatch= -1,
gap= -2
We construct a matrix F (i, j) indexed by I and j, one index for each sequences, where the value of
F(i,j) is the score of the best alignment between x1…j of x, and y 1…j of y.
We begin the matrix by initializing F (0, 0) = 0, then we proceed to fill the matrix from top left to
bottom right.
Fill the path-graph according to scoring rules.
G A A T T C
0 -2 -2 -2 -2 -2 -2
G -2 2 -1 -1 -1 -1 -1
A -2 -1 2 2 -1 -1 -1
T -2 -1 -1 -1 2 2 -1
T -2 -1 -1 -1 2 2 -1
A -2 -1 2 2 -1 -1 -1
2. Make a path to ach cell that maximizes the score for that cell.
There are three possible ways to obtain best score for F (i, j) of an alignment.
If F(i-1,j-1), F(i-1,j) and F(i,j-1) are known we can calculate F(i,j).
To do this, start from the upper left hand corner and fill each cell according to the following rule:
Where, F (i, j) = the entry in the ith row and jth column of the path-graph.
s (xi, yi) = the score for the residues being aligned
d= gap penalty
F (i-1, j-1) + s (xi, yi) represent the diagonal move on the path graph, in which two residues x i aligned
with yi
F (i, j-1) + d represents a horizontal move on the path graph, in which a residue in the first
sequence (xi) is aligned with a gap in the second sequence
6
F (i-1,j) + d represents a vertical move on the path graph, in which a residue in the second sequence
(yi) is aligned with a gap in the first sequence
The best score of (i, j) will be largest of these three options.
This equation is repeatedly applied to fill the matrix of F (i, j) values, calculating value in bottom
right hand corner of each square of four cells from any one of the three values (left, above left, or
above) as in the figure.
G A A T T C
0 -2 -4 -6 -8 -10 -12
G -2 2 0 -2 -4 -6 -8
A -4 0 4 2 0 -2 -4
T -6 -2 2 -1 4 2 0
T -8 -4 0 -1 5 6 4
A -10 -6 -2 2 3 4 5
As we fill the F (i, j) values, a pointer is kept in each cell back form which F (i, j) was derived.
We also have to deal with some special boundary conditions to complete this algorithm.
Along the top row where j=0, the values F(i-1,j-1) and F(i,j-1) are not defined so the values F(i, 0)
represents alignments of a prefix of x to all gaps in y, hence we can define F(i, 0)=id similarly for the left
first column F(0,j)= jd.
3. The value in the final cell of the matrix is by definition the best score for an alignment of x 1…n and y
1…m which is the score of best global alignment of x to y. the path that leads from the first cell to the last cell
,
that gives this score, corresponds to the maximum-scoring alignment.
To discover this path, the trace back, a path from the last cell to the first cell, using only transitions that
maximize the score.
G A A T T C
0 -2 -4 -6 -8 -10 -12
G -2 2 0 -2 -4 -6 -8
A -4 0 4 2 0 -2 -4
T -6 -2 2 -1 4 2 0
T -8 -4 0 -1 5 6 4
A -10 -6 -2 2 3 4 5
Note: there can be more than one path though the path graph, corresponding to more than one optimal alignment.
4. Trace the path or paths forward to give the optimal alignment. Trace back of the path of choices is done that
has led to the final value.
This is done, by building the alignment in the reverse direction that is starting from the final cell and
following the pointers path.
7
In each trace back step, from the current cell (i, j) have to move back to one of the cells (i-1, j-1), (i, j-1)
and (i-1, j) from which the F (i, j) was derived.
Simultaneously we add a pair of symbols on to the current alignment during this trace back process.
xi and yj if step was traced back to (i-1, j-1), xi and gap character ‘-‘if step was to (i-1, j) or ‘-‘ and y j if
the step was (i, j-1) .
At the end we have reached at the start of matrix, i=j=0.
This trace back procedure finds out just one alignment with optimal score.
The main reason that this algorithm works is that the score is made of sum of scores of best aligned
independent pieces.
G A A T T C
0 -2 -4 -6 -8 -10 -12
G -2 2 0 -2 -4 -6 -8
A -4 0 4 2 0 -2 -4
T -6 -2 2 -1 4 2 0
T -8 -4 0 -1 5 6 4
A -10 -6 -2 2 3 4 5
G A A T T C
G A A T T C
G A A T T C
G A - T T C
1. The first step of the Smith-Waterman algorithm is to generate a path graph identical to the Needleman-
Wunsch algorithm
2. In the Smith-Waterman algorithm the minimum allowed final value of a cell is zero-no negative values are
allowed. This rule is expressed formally by including 0 as an alternative in the expression for F (i, j):
T 0 0 2 3 4 2 0
T 0 0 0 1 5 6 4
A 0 0 0 0 3 4 5
8
An uninformative alignment - - - - - - - g c t g a a c g
ctataatc-------
Sequences that are quite similar and approximately the same length are suitable candidates for
global alignments. In local alignments, stretches of sequences with highest density of matches are aligned,
thus generating one or more islands of matches or sub alignments in the aligned sequences.
Local alignments are more suitable for aligning sequences that are similar along some of their
lengths but dissimilar in others, sequences that differ in length or sequences that share a conserved region or
domain.
Global alignment:
For the two hypothetical protein sequence fragments, the global alignment is stretched over the entire
sequence length to include as many matching amino acids as possible up to and including the sequence ends.
Vertical bars between the sequences indicate the presence of identical amino acids.
Although there is an obvious region of identity in this example (the sequence GKG preceded by a
commonly observed substitution of T for A), a global alignment may not align such regions so that more
amino acids along the entire sequence lengths can be matched.
Local alignment:
In a local alignment, the alignment stops at the ends of regions of identity or strong similarity, and a
much higher priority is given to finding these local regions, than to extending the alignment to include more
neighboring amino acid pairs. Dashes indicate sequence not included in the alignment. This type of
alignment favors finding conserved nucleotide patterns, DNA sequences, or amino acid patterns in protein
sequences
Paralogs are homologous sequences that arose by a mechanism such as gene duplication.
OVERVIEW:
When the user initiates a new job from a BLAST form, BLAST immediately presents the Job
Running page, which reports the status of a running job and an estimate of how long it will take to complete.
The formatting parameters for a BLAST job may be changed on the Format Control page as the job runs,
since formatting only occurs after search and alignment.
When the job completes, BLAST presents the BLAST Report. From the Report, the user may now
re-format the current job, run another BLAST job using the same parameters as a starting point, or navigate
to one of the other application pages.
The Recent Results page shows the status and some of the parameters of the user’s unexpired
BLAST jobs, and links directly to the BLAST Report for each job.
BLAST screen flow map. Each box represents a different page in the BLAST web application. A
user will normally enter through the ‘Home’ page and from there select a ‘BLAST form’ to submit a search.
After the search is submitted the ‘Job running’ page is shown until the search is done, after which the
‘Report’ page is shown. From the ‘Report’ page the user may reformat, modify the current search and
resubmit, or save the search strategy in My NCBI.
APPLICATION PAGES:
BLAST form:
The Enter Query Sequence section at the top of the form provides a place to enter one or more query
sequences, either by accession or gi number, or as IUPAC sequence in FASTA format.
11
The optional Query Sub range boxes limit the search to a subrange of the query sequence. As an
alternative to cut/pasting sequence into a text box, you may also upload the query sequence(s) from a local
disk file.
The new Job Title is the job name that appears in Saved Strategies and Recent Results, as well as at
the top of every BLAST report.
When the input sequence is an accession or gi number, the BLAST web interface automatically looks
up the definition line in GenBank without reloading the page.
If multiple sequences are present, an appropriate descriptive title is generated (e.g. ‘5 nucleotide
sequences’).
The Choose Search Set section of the BLAST form selects the BLAST database to be searched and
applies limiting criteria, such as organism or Entrez query.
Searches may be limited to a specific organism (species or taxonomic group) by typing the scientific
name, common name or taxid (the integer id for the taxon in the NCBI Taxonomy database).
The Program Selection section of the BLAST form selects the algorithm used for search and alignment.
For nucleotide searches, the choices are megablast (default), discontiguous megablast and blastn. For protein
searches, the options are blastp (default), PSI-BLAST and PHIBLAST.
Job running
The user submits a new BLAST job by pressing the BLAST form button. BLAST immediately presents the
12
Job Running page, which reports some statistics about the job, and provides an estimate of completion time.
The Job Running view periodically refreshes itself, effectively polling the server while the job runs. BLAST
automatically displays the BLAST report when the job completes.
Format control:
Limit controls (i.e. the Descriptions, Graphical Overview and Alignments counts; the Organism and
Entrez limits; and the expect value range) limit the items shown on the report for a completed job, rather
than limiting the search set, as they do on the BLAST form.
The Format Control form has a text input for the Request ID (RID), allowing the user to format the
current job, or any other known RID.
Report page
The current BLAST report pages are basically the same as the previous design, with a reformatted
header and some new features. To the right of the breadcrumbs are three links:
(1) Reformat these results leads to the Format Control page,
(2) Edit and Resubmit leads to the original BLAST form, with the current parameters selected
and
(3) Save Search Strategy saves the search parameters for the job so the user can run the same
job again later with identical parameters. This option is available only if the user is signed in
to My NCBI, since saved strategies are user-specific.
USES OF BLAST:
Genomic BLAST pages
o Human
o Mouse
o Flies
o Nematodes
o Rat
o Fugu rubripes
o Zebrafish
o Plants
o Yeasts
o Malaria
o Other eukaryotes
o Microbial
Publicity available tools:
Sequence similarity tools
BLAST
TBLASTN
BLASTX
13
TBLASTX
GAP PENALTY: GAP open penalty for the first residue in the gap (-12 by default for proteins, -16 for
DNA), Gap Extension penalty for additional residues in a gap (-2 by default for proteins, -4 for DNA).
Therefore it is given high value which has to be deducted from the similarity score. This is called as gap
penalty.
The common formulation for gap penalty is to gibe a certain high value for introduction of gap and
additional value for extension of gaps. There are two parameters like the gap opening penalty G and gap
extending penalty L.
14
The total gap penalty to be deducted from the alignment score is as follows: G + Ln. G are the gap opening
penalty and Ln is the length of the gap that follows. Since gap opening in the sequence is considered to be a
rare event, the penalty for G is high (usually 10-15 in the context of BLOSUM 62).
Additional gap formation is considered to be not so very rare, if already a gap existed. The L value is around
1 or 2.
HISTOGRAM:
This displays the search histogram of the expected frequency of chance occurrence of the database matches
found.
E-VALUE
E (Expectation) value is used for the evaluation of statistical significance. The E-value for a given
alignment depends upon its score as well as the lengths of both the query sequence and the database
searched sequence. Smaller E-value indicates more statistical significance of the match.
The upper E-value limit for score and alignment display by defaults are 10.0 for FASTA with protein
searches, 5.0 for translated DNA/PROTEN comparisons, and 2.0 for DNA/DNA searches.
In the lower E-value limit a value of 1e-6 prevents library sequences with E 0 value lower than 1e-6 from
being displayed. This allows the use to focus on more distant relationships.