Sequence Alignment and BLOSUM Matrix Overview

Uploaded by

nivedharshinitamil55


UNIT-III
GIVE A DETAIL ACCOUNT ON SEQUENCE ALIGNMENT BASED ON MATRICES (5 marks)
 In bioinformatics, a sequence alignment is a way of arranging the sequences of DNA, RNA or protein to
identify regions of similarity that may be a consequence of functional, structural, or evolutionary relationships
between the sequences.
 Aligned sequences of nucleotide or amino acid residues are typically represented as rows within a matrix.
 Gaps are inserted between the residues so that identical or similar characters are aligned in successive
columns.

A sequence alignment, produced by ClustalW, of two human zinc finger proteins, identified on the left by GenBank
accession number. Sequence alignments are also used for non-biological sequences, such as those present in natural
language or in financial data.
 If two sequences in an alignment share a common ancestor, mismatches can be interpreted as point
mutations and gaps as indels (that is, insertion or deletion mutations) introduced in one or both
lineages in the time since they diverged from one another.
 In sequence alignments of proteins, the degree of similarity between amino acids occupying a
particular position in the sequence can be interpreted as a rough measure of how conserved a
particular region or sequence motif is among lineages.
 The absence of substitutions, or the presence of only very conservative substitutions (that is, the
substitution of amino acids whose side chains have similar biochemical properties) in a particular
region of the sequence, suggests that this region has structural or functional importance.
 Although DNA and RNA nucleotide bases are more similar to each other than are amino acids, the
conservation of base pairs can indicate a similar functional or structural role.

BLOSUM MATRIX (BLOcks SUbstitution Matrix)

 BLOSUM (BLOcks of Amino Acid SUbstitution Matrix) is a substitution matrix used for sequence
alignment of proteins.
 BLOSUM matrices are used to score alignments between evolutionarily divergent protein sequences.
They are based on local alignments and were first introduced in a paper by Henikoff and
Henikoff.
 They scanned the BLOCKS database for highly conserved regions of protein families (regions that do
not have gaps in the sequence alignment) and then counted the relative frequencies of amino acids and
their substitution probabilities.
 Then, they calculated a log-odds score for each of the 210 possible substitution pairs of the 20 standard
amino acids. All BLOSUM matrices are based on observed alignments; unlike the PAM matrices, they
are not extrapolated from comparisons of closely related proteins.
 Several sets of BLOSUM matrices exist, built from different alignment databases and named with
numbers. BLOSUM matrices with high numbers are designed for comparing closely related sequences,
while those with low numbers are designed for comparing distantly related sequences.
 For example, BLOSUM80 is used for less divergent alignments, and BLOSUM45 is used for more
divergent alignments.
 The matrices were created by merging (clustering) all sequences that were more similar than a given
percentage into one single sequence and then comparing those sequences (that were all more
divergent than the given percentage value) only; thus reducing the contribution of closely related
sequences.
 The percentage used was appended to the name, giving BLOSUM80 for example where sequences
that were more than 80% identical were clustered.
 Scores within a BLOSUM are log-odds scores that measure, in an alignment, the logarithm for the
ratio of the likelihood of two amino acids appearing with a biological sense and the likelihood of the
same amino acids appearing by chance.
 The matrices are based on the minimum percentage identity of the aligned protein sequence used in
calculating them.
 Every possible identity or substitution is assigned a score based on its observed frequencies in the
alignment of related proteins. A positive score is given to the more likely substitutions while a
negative score is given to the less likely substitutions.
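
The counting and log-odds steps described above can be sketched in a few lines. This is a minimal illustration only: the block of sequences is hypothetical, no clustering by percent identity is applied, and the scaling (2 x log base 2, rounded, giving half-bit units) is stated here as an assumption rather than taken from these notes.

```python
from collections import Counter
from itertools import combinations
from math import log2

def blosum_style_scores(block):
    """Toy log-odds scores from one ungapped block of aligned sequences."""
    pair_counts = Counter()
    for column in zip(*block):                    # walk the block column by column
        for a, b in combinations(column, 2):      # every pair of residues in the column
            pair_counts[tuple(sorted((a, b)))] += 1
    total = sum(pair_counts.values())
    q = {pair: n / total for pair, n in pair_counts.items()}  # observed pair frequencies

    p = Counter()                                 # background frequency of each residue
    for (a, b), freq in q.items():
        if a == b:
            p[a] += freq
        else:
            p[a] += freq / 2
            p[b] += freq / 2

    scores = {}
    for (a, b), freq in q.items():
        # expected frequency of the pair if residues were paired by chance
        expected = p[a] * p[b] if a == b else 2 * p[a] * p[b]
        scores[(a, b)] = round(2 * log2(freq / expected))  # assumed half-bit scaling
    return scores

# Hypothetical block: 5 sequences, 3 ungapped columns
block = ["AAI", "SAL", "TAL", "TAV", "AAL"]
scores = blosum_style_scores(block)
for pair in sorted(scores):
    print(pair, scores[pair])
```

Pairs that never occur in the block get no score in this sketch; a real matrix is built from many blocks so that every pair is observed.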

Substitution matrix
In bioinformatics and evolutionary biology, a substitution matrix describes the rate at which one
character in a sequence changes to other character states over time.
Substitution matrices are usually seen in the context of amino acid or DNA sequence alignments,
where the similarity between sequences depends on their divergence time and the substitution rates as
represented in the matrix. In the process of evolution, from one generation to the next the amino acid
sequences of an organism's proteins are gradually altered through the action of DNA mutations.
For example, the sequence
ALEIRYLRD
Could mutate into the sequence
ALEINYLRD
In one generation and possibly
AQEINYQRD
Over a longer period of evolutionary time.
 Each amino acid is more or less likely to mutate into various other amino acids. For instance, a
hydrophilic residue such as arginine is more likely to be replaced by another hydrophilic residue such as
glutamine than it is to be mutated into a hydrophobic residue such as leucine.
 This is primarily due to redundancy in the genetic code, which translates similar codons into similar
amino acids. Furthermore, mutating an amino acid to a residue with significantly different properties
could affect the folding and/or activity of the protein. There is therefore usually strong selective pressure
to remove such mutations quickly from a population.
 If we have two amino acid sequences in front of us, we should be able to say something about how likely
they are to be derived from a common ancestor, that is, to be homologous. If we can line up the two
sequences using a sequence alignment algorithm such that the mutations required to transform a
hypothetical ancestor sequence into both of the current sequences would be evolutionarily plausible, then
we would like to assign a high score to the comparison of the sequences.
 To this end, we will construct a 20x20 matrix where the (i,j)th entry is equal to the probability of the ith
amino acid being transformed into the jth amino acid in a certain amount of evolutionary time.
 There are many different ways to construct such a matrix, called a substitution matrix. Here are the
most commonly used ones:
Identity matrix
The simplest possible substitution matrix would be one in which each amino acid is considered maximally
similar to itself, but not able to transform into any other amino acid. This matrix would look like:

$$S_{i,j} = \begin{cases} 1 & \text{if } i = j \\ 0 & \text{if } i \neq j \end{cases}$$

This identity matrix will succeed in the alignment of very similar amino acid sequences but will
perform poorly when aligning two distantly related sequences.
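
As a small illustration, the identity matrix can be applied to the example sequences given earlier. The function names below are hypothetical, and the sketch assumes the two sequences are already aligned without gaps.

```python
def identity_score(a, b):
    """Identity substitution matrix: 1 for identical residues, 0 otherwise."""
    return 1 if a == b else 0

def score_ungapped(seq1, seq2, score=identity_score):
    """Sum the per-column scores of two equal-length, ungapped sequences."""
    if len(seq1) != len(seq2):
        raise ValueError("sequences must have the same length")
    return sum(score(a, b) for a, b in zip(seq1, seq2))

print(score_ungapped("ALEIRYLRD", "ALEINYLRD"))  # one substitution  -> 8
print(score_ungapped("ALEIRYLRD", "AQEINYQRD"))  # three substitutions -> 6
```

Every substitution costs the same here, whether it is conservative or not, which is why this matrix carries so little information for distantly related sequences.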
Log-odds matrices
We express the probabilities of transformation in what are called log-odds scores. The scores matrix
S is defined as

$$S_{i,j} = \log \frac{p_i \, M_{i,j}}{p_i \, p_j} = \log \frac{M_{i,j}}{p_j}$$

where M_{i,j} is the probability that amino acid i transforms into amino acid j, and p_i and p_j are the
frequencies of amino acids i and j. The base of the logarithm is not important, and you will often see the
same substitution matrix expressed in different bases.
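
A short check of why the base does not matter, assuming the usual log-odds form S = log(M / p), with M the transformation probability and p the background frequency of the target residue. The two numbers are hypothetical.

```python
from math import log, log2

def log_odds(m_ij, p_j, logarithm=log2):
    """Log-odds score log(M_ij / p_j) in the base given by `logarithm`."""
    return logarithm(m_ij / p_j)

m_ij, p_j = 0.02, 0.05          # hypothetical probability and background frequency
bits = log_odds(m_ij, p_j, log2)  # base 2
nats = log_odds(m_ij, p_j, log)   # natural logarithm

# Changing the base multiplies every score by the same constant (here ln 2),
# so the ranking of alignments is unchanged.
print(bits, nats, nats / bits)
```

Because the ratio M / p is below 1 in this example, the score is negative in either base: the substitution is seen less often than chance would predict.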
Different levels of the BLOSUM matrix can be created by differentially weighting the degree of
similarity between sequences. For example, a BLOSUM62 matrix is calculated from protein blocks such
that if two sequences are more than 62% identical, then the contribution of these sequences is weighted to
sum to one. In this way the contributions of multiple entries of closely related sequences are reduced.
The BLOSUM62 matrix is given in Table 2. If the BLOSUM62 matrix is compared to PAM160 (its
closest equivalent), it is found that the BLOSUM matrix is less tolerant of substitutions to or from
hydrophilic amino acids, while more tolerant of hydrophobic changes and of cysteine and tryptophan
mismatches.
PAM
One of the first amino acid substitution matrices, the PAM (Point Accepted Mutation) matrix was
developed by Margaret Dayhoff in the 1970s. This matrix is calculated by observing the differences in
closely related proteins.
The PAM1 matrix estimates what rate of substitution would be expected if 1% of the amino acids
had changed. The PAM1 matrix is used as the basis for calculating other matrices by assuming that repeated
mutations would follow the same pattern as those in the PAM1 matrix, and multiple substitutions can occur
at the same site.
Using this logic, Dayhoff derived matrices as high as PAM250. Usually the PAM30 and the PAM70
are used. A matrix for divergent sequences can be calculated from a matrix for closely related sequences by
raising the latter to a power. For instance, we can roughly approximate the WIKI2 matrix from the
WIKI1 matrix by saying $W_2 = W_1^2$, where W1 is WIKI1 and W2 is WIKI2. This is how the PAM250
matrix is calculated.
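
The idea of deriving a matrix for divergent sequences by repeated multiplication can be sketched as follows. The 3-state matrix is a hypothetical stand-in for the 20 x 20 PAM1 probability matrix, and the helper names are illustrative only.

```python
def mat_mul(a, b):
    """Multiply two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def mat_pow(m, power):
    """Raise a square matrix to a non-negative integer power by repeated squaring."""
    n = len(m)
    result = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    base = [row[:] for row in m]
    while power > 0:
        if power & 1:
            result = mat_mul(result, base)
        base = mat_mul(base, base)
        power >>= 1
    return result

# Hypothetical one-step mutation probabilities for a 3-letter alphabet (rows sum to 1)
w1 = [[0.99, 0.01, 0.00],
      [0.01, 0.98, 0.01],
      [0.00, 0.01, 0.99]]

w2 = mat_pow(w1, 2)      # two steps: W2 = W1 * W1
w250 = mat_pow(w1, 250)  # the same construction as PAM250 from PAM1

print(w2[0])
print(w250[0])
```

After two steps a residue can already reach a state that was unreachable in one step (entry [0][2] becomes non-zero), and after 250 steps the probability of remaining unchanged has dropped well below its one-step value.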

Differences between PAM and BLOSUM

 PAM matrices are based on an explicit evolutionary model (i.e. replacements are counted on the
branches of a phylogenetic tree), whereas the BLOSUM matrices are based on an implicit model of
evolution.
 The PAM matrices are based on mutations observed throughout a global alignment, which includes
both highly conserved and highly mutable regions. The BLOSUM matrices are based only on highly
conserved regions in series of alignments forbidden to contain gaps. The method used to count the
replacements is also different: unlike the PAM matrix, the BLOSUM procedure uses groups of sequences
within which not all mutations are counted the same.
 Higher numbers in the PAM matrix naming scheme denote larger evolutionary distance, while larger
numbers in the BLOSUM matrix naming scheme denote higher sequence similarity and therefore
smaller evolutionary distance. Example: PAM150 is used for more distant sequences than PAM100;
BLOSUM62 is used for closer sequences than BLOSUM50.

DAYHOFF AMINO ACID SUBSTITUTION MATRICES (PAM MUTATIONS) (10 marks)

 This family of matrices lists the change from one amino acid to another in homologous protein
sequences during evolution.
 Because no other type of scoring matrix is based on such explicit evolutionary principles, the PAM
matrices remain a useful tool for sequence alignment, even though they were originally based on a
relatively small data set.
 Each matrix gives the changes expected for a given period of evolutionary time, evidenced by
decreased sequence similarity as genes encoding the same protein diverge with increased
evolutionary time.
 Thus one matrix gives the changes expected in homologous proteins that have diverged only a small
amount from each other in a relatively short period of time so that they are still 50% or more similar.
 Another gives the changes expected of proteins that have diverged over a much longer period,
leaving only 20% similarity.
 These predicted changes are used to produce optimal alignments between two protein sequences and
to score the alignment.
 PAM matrices are usually converted into another form called log odds matrices.
 The log odds score represents the ratio of the chance of an amino acid substitution under two different
hypotheses: one, that the change actually represents an authentic evolutionary variation at that site (the
numerator), and the other, that the change occurred because of random sequence variation of no
biological significance (the denominator).
ALIGNMENT ALGORITHMS
 To find the optimal alignment for a pair of sequences, a scoring system (scoring matrices) is
required.
 If both sequences are of the same length, a global alignment is the natural choice. When the
sequences differ in length, alignment becomes more complicated, because gaps have to be
inserted for a good alignment and they can be inserted at any position; in that case we go for local
alignments between the sequences.
NEEDLEMAN-WUNSCH GLOBAL ALIGNMENT ALGORITHM (10 MARKS)
 The goal is to obtain the optimal global alignment between two sequences, allowing gaps; that is, all of
x has to be aligned with all of y.
 The dynamic programming algorithm for solving this problem is known as the Needleman-Wunsch
global alignment algorithm.
 In a gapped alignment, a residue in x is matched either by a residue in y or by a gap, and
vice versa. For example, the sequences GAATTC and GATTA are to be aligned according to the following
scoring rules:
 starting score = 0,
 match = 2,
 mismatch = -1,
 gap = -2
 We construct a matrix F(i, j) indexed by i and j, one index for each sequence, where the value of
F(i, j) is the score of the best alignment between the prefix x1…i of x and the prefix y1…j of y. In the
worked example, x = GATTA is written down the side (rows i) and y = GAATTC across the top (columns j).
 We begin the matrix by initializing F(0, 0) = 0, then we proceed to fill the matrix from top left to
bottom right.
1. Fill the path-graph according to the scoring rules.

          G    A    A    T    T    C
     0   -2   -2   -2   -2   -2   -2
G   -2    2   -1   -1   -1   -1   -1
A   -2   -1    2    2   -1   -1   -1
T   -2   -1   -1   -1    2    2   -1
T   -2   -1   -1   -1    2    2   -1
A   -2   -1    2    2   -1   -1   -1

2. Make a path to each cell that maximizes the score for that cell.
 There are three possible ways to obtain the best score F(i, j) of an alignment.
 If F(i-1, j-1), F(i-1, j) and F(i, j-1) are known, we can calculate F(i, j).
 To do this, start from the upper left-hand corner and fill each cell according to the following rule:

$$F(i,j) = \max\left[\, F(i-1,j-1) + s(x_i, y_j),\; F(i,j-1) + d,\; F(i-1,j) + d \,\right]$$

Where, F(i, j) = the entry in the ith row and jth column of the path-graph,
s(xi, yj) = the score for the residues being aligned,
d = the gap penalty (here d = -2).
F(i-1, j-1) + s(xi, yj) represents a diagonal move on the path graph, in which residue xi is aligned
with residue yj.
F(i, j-1) + d represents a horizontal move on the path graph, in which a residue of the sequence
across the top (yj) is aligned with a gap in the sequence down the side.
F(i-1, j) + d represents a vertical move on the path graph, in which a residue of the sequence down
the side (xi) is aligned with a gap in the sequence across the top.
 The best score F(i, j) will be the largest of these three options.
 This equation is repeatedly applied to fill the matrix of F(i, j) values, calculating the value in the bottom
right-hand corner of each square of four cells from one of the three other values (left, above left, or
above), as in the figure.

          G    A    A    T    T    C
     0   -2   -4   -6   -8  -10  -12
G   -2    2    0   -2   -4   -6   -8
A   -4    0    4    2    0   -2   -4
T   -6   -2    2    3    4    2    0
T   -8   -4    0    1    5    6    4
A  -10   -6   -2    2    3    4    5

 As we fill in the F(i, j) values, a pointer is kept in each cell back to the cell from which F(i, j) was derived.
 We also have to deal with some special boundary conditions to complete this algorithm.
 Along the top row, where i = 0, the values F(i-1, j-1) and F(i-1, j) are not defined, so the values F(0, j)
represent alignments of a prefix of y to all gaps in x; hence we define F(0, j) = jd. Similarly, for the
left-hand column, F(i, 0) = id.
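
The fill step, including the boundary conditions, can be sketched as below. The function name is hypothetical; x is taken to be the sequence written down the side (rows i) and y the sequence across the top (columns j), and the gap penalty d is passed as a negative number that is added, as in the recurrence above.

```python
def nw_fill(x, y, match=2, mismatch=-1, gap=-2):
    """Fill the Needleman-Wunsch matrix F for x (rows) against y (columns)."""
    n, m = len(x), len(y)
    F = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        F[i][0] = i * gap              # prefix of x aligned to gaps only
    for j in range(1, m + 1):
        F[0][j] = j * gap              # prefix of y aligned to gaps only
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if x[i - 1] == y[j - 1] else mismatch
            F[i][j] = max(F[i - 1][j - 1] + s,   # diagonal: x_i aligned with y_j
                          F[i][j - 1] + gap,     # horizontal: y_j aligned with a gap
                          F[i - 1][j] + gap)     # vertical: x_i aligned with a gap
    return F

F = nw_fill("GATTA", "GAATTC")
for row in F:
    print(row)
```

The bottom right-hand entry, F[5][6], holds the score of the best global alignment of the two example sequences.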

3. The value in the final cell of the matrix is by definition the best score for an alignment of x1…n and
y1…m, which is the score of the best global alignment of x to y. The path that leads from the first cell to
the last cell and gives this score corresponds to the maximum-scoring alignment.
To discover this path we trace back from the last cell to the first cell, using only transitions that
maximize the score.


Note: there can be more than one path through the path graph, corresponding to more than one optimal alignment.
4. Trace the path or paths back to give the optimal alignment. The traceback recovers the sequence of
choices that led to the final value.
This is done by building the alignment in the reverse direction, starting from the final cell and
following the pointers.
 In each traceback step, we move back from the current cell (i, j) to whichever of the cells (i-1, j-1),
(i, j-1) and (i-1, j) the value F(i, j) was derived from.
 At the same time we add a pair of symbols to the front of the current alignment: xi and yj if the step
was to (i-1, j-1), xi and the gap character '-' if the step was to (i-1, j), or '-' and yj if the step was to
(i, j-1).
 The traceback ends when we reach the start of the matrix, i = j = 0.
 This traceback procedure finds just one alignment with the optimal score.
 The main reason that this algorithm works is that the score is a sum of independent pieces, so the best
score up to any point in the alignment is the best score up to the previous point plus the incremental
score of the new step.

Tracing back from the final cell (score 5) gives two equally good global alignments:

GAATTC
GA-TTA

GAATTC
G-ATTA

Each has four matches, one mismatch and one gap: 4(2) + (-1) + (-2) = 5.
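
Fill and traceback together can be sketched as one function. This is an illustration under the scoring rules of the worked example, not a reference implementation: the function name is hypothetical, ties are broken in favour of the diagonal move, and only one of the co-optimal alignments is returned.

```python
def needleman_wunsch(x, y, match=2, mismatch=-1, gap=-2):
    """Return (score, aligned_x, aligned_y) for one optimal global alignment."""
    n, m = len(x), len(y)
    F = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        F[i][0] = i * gap
    for j in range(1, m + 1):
        F[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if x[i - 1] == y[j - 1] else mismatch
            F[i][j] = max(F[i - 1][j - 1] + s, F[i][j - 1] + gap, F[i - 1][j] + gap)

    # Traceback: instead of stored pointers, re-derive which move produced each cell.
    ax, ay = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            s = match if x[i - 1] == y[j - 1] else mismatch
            if F[i][j] == F[i - 1][j - 1] + s:      # diagonal move
                ax.append(x[i - 1]); ay.append(y[j - 1])
                i -= 1; j -= 1
                continue
        if i > 0 and F[i][j] == F[i - 1][j] + gap:  # vertical move: gap in y
            ax.append(x[i - 1]); ay.append("-")
            i -= 1
        else:                                       # horizontal move: gap in x
            ax.append("-"); ay.append(y[j - 1])
            j -= 1
    return F[n][m], "".join(reversed(ax)), "".join(reversed(ay))

score, aligned_x, aligned_y = needleman_wunsch("GATTA", "GAATTC")
print(score)
print(aligned_y)
print(aligned_x)
```

Re-deriving the move from the neighbouring cells saves the memory of a separate pointer matrix; storing pointers during the fill, as described in the text, gives the same path.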

SMITH-WATERMAN LOCAL ALIGNMENT ALGORITHM (10 marks)

Here a subsequence of x must be aligned with a subsequence of y. This arises when it is suspected that two
sequences may share common sub-regions (for example, domains in proteins).
Local alignment is usually the most sensitive method for detecting similarity between two very highly
diverged sequences, even when they have a shared evolutionary origin.
In such cases often only small portions have been conserved, while the rest of the sequence has accumulated
so many mutations that it is no longer alignable.
The highest scoring alignment of subsequences of x and y is called the best local alignment. Local alignments
are often more meaningful than global alignments because they pick out the patterns that are conserved in
the sequences.

1. The first step of the Smith-Waterman algorithm is to generate a path graph as in the Needleman-
Wunsch algorithm.
2. In the Smith-Waterman algorithm the minimum allowed final value of a cell is zero; no negative values are
allowed. This rule is expressed formally by including 0 as an alternative in the expression for F(i, j):

$$F(i,j) = \max\left[\, F(i-1,j-1) + s(x_i, y_j),\; F(i,j-1) + d,\; F(i-1,j) + d,\; 0 \,\right]$$

The effect of this rule is to reset to zero any cell whose score would otherwise become negative; this
terminates the alignment at that point and allows a new one to start. The top row and left-hand column are
filled with zeros. The best local alignment ends at the cell with the highest score (here 6) and is traced
back until a cell with score 0 is reached.

          G    A    A    T    T    C
     0    0    0    0    0    0    0
G    0    2    0    0    0    0    0
A    0    0    4    2    0    0    0
T    0    0    2    3    4    2    0
T    0    0    0    1    5    6    4
A    0    0    2    2    3    4    5

Position (i, j) may be reached:

From nowhere, with score 0, because we can always start a new local alignment;
From (i-1, j-1) with a match or mismatch, adding the score s(xi, yj);
From (i-1, j) with a gap in y, adding the (negative) gap penalty d; or
From (i, j-1) with a gap in x, adding the (negative) gap penalty d.
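
A minimal sketch of the local alignment recurrence, using the same example sequences and scoring rules (match 2, mismatch -1, gap -2, added as a negative number). The function names are hypothetical, and only one best-scoring local alignment is reported.

```python
def sw_fill(x, y, match=2, mismatch=-1, gap=-2):
    """Fill the Smith-Waterman matrix; no cell is allowed to go below zero."""
    n, m = len(x), len(y)
    F = [[0] * (m + 1) for _ in range(n + 1)]   # first row and column stay 0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if x[i - 1] == y[j - 1] else mismatch
            F[i][j] = max(0,
                          F[i - 1][j - 1] + s,
                          F[i][j - 1] + gap,
                          F[i - 1][j] + gap)
    return F

def smith_waterman(x, y, match=2, mismatch=-1, gap=-2):
    """Return (score, aligned_x, aligned_y) for one best local alignment."""
    F = sw_fill(x, y, match, mismatch, gap)
    best, bi, bj = 0, 0, 0
    for i in range(len(x) + 1):
        for j in range(len(y) + 1):
            if F[i][j] > best:                  # remember the highest-scoring cell
                best, bi, bj = F[i][j], i, j
    ax, ay = [], []
    i, j = bi, bj
    while i > 0 and j > 0 and F[i][j] > 0:      # stop as soon as a zero is reached
        s = match if x[i - 1] == y[j - 1] else mismatch
        if F[i][j] == F[i - 1][j - 1] + s:
            ax.append(x[i - 1]); ay.append(y[j - 1]); i -= 1; j -= 1
        elif F[i][j] == F[i - 1][j] + gap:
            ax.append(x[i - 1]); ay.append("-"); i -= 1
        else:
            ax.append("-"); ay.append(y[j - 1]); j -= 1
    return best, "".join(reversed(ax)), "".join(reversed(ay))

F = sw_fill("GATTA", "GAATTC")
result = smith_waterman("GATTA", "GAATTC")
print(result)
```

Unlike the global case, the traceback starts at the highest-scoring cell rather than the bottom right-hand corner, so the unaligned ends of both sequences are simply left out.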

TOOLS FOR SEQUENCE ALIGNMENT

Sequence alignment is a common tool in bioinformatics and comparative genomics. It is generally
assumed that multiple sequence alignment yields better results than pair wise sequence alignment, but this
assumption has rarely been tested, and never with the control provided by simulation analysis.

Definition of sequence alignment: (2 marks)

Sequence alignment is the identification of residue-residue correspondences. It is the basic tool of
bioinformatics. Any assignment of correspondences that preserves the order of the residues within the
sequences is an alignment
Sequence alignment is the procedure of comparing two (pair-wise alignment) or more (multiple
sequence alignment) sequences by searching for a series of individual characters or character patterns that
are in the same order in the sequences.
Two sequences are aligned in two rows. Identical or similar characters are placed in the same
column, and nonidentical characters can either be placed in the same column as a mismatch or opposite a
gap in the other sequence.
In an optimal alignment, nonidentical characters and gaps are placed to bring as many identical or
similar characters as possible into vertical register. Sequences that can be readily aligned in this manner are
said to be similar.
We must define criteria so that an algorithm can choose the best alignment. For the sequences
gctgaacg and ctataatc:

An uninformative alignment:
-------gctgaacg
ctataatc-------

An alignment without gaps:
gctgaacg
ctataatc

An alignment with gaps:
gctga-a--cg
--ct-ataatc
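
One possible criterion is to count matches, mismatches and gap columns and combine them into a score. The sketch below is only an illustration: the function names are hypothetical, and the weights (match 2, mismatch -1, gap -2) are borrowed from the Needleman-Wunsch example as an assumption.

```python
def alignment_stats(row1, row2):
    """Count (matches, mismatches, gap columns) of two aligned rows."""
    if len(row1) != len(row2):
        raise ValueError("aligned rows must have the same length")
    matches = mismatches = gaps = 0
    for a, b in zip(row1, row2):
        if a == "-" or b == "-":
            gaps += 1
        elif a == b:
            matches += 1
        else:
            mismatches += 1
    return matches, mismatches, gaps

def alignment_score(row1, row2, match=2, mismatch=-1, gap=-2):
    """Score an alignment as a weighted sum of the three column types."""
    matches, mismatches, gaps = alignment_stats(row1, row2)
    return matches * match + mismatches * mismatch + gaps * gap

print(alignment_stats("gctgaacg", "ctataatc"))                # (2, 6, 0)
print(alignment_score("gctgaacg", "ctataatc"))                # -2
print(alignment_score("-------gctgaacg", "ctataatc-------"))  # -29
```

With these weights the ungapped alignment scores far better than the uninformative one, which is the kind of ranking an alignment algorithm needs.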

EXPLAIN IN DETAIL ABOUT THE TYPES OF SEQUENCE ALIGNMENT (5 marks)

There are two types of sequence alignment,
-Local and
-Global alignment
In global alignment, an attempt is made to align the entire sequence, using as many characters as
possible, up to both ends of each sequence.
Sequences that are quite similar and approximately the same length are suitable candidates for
global alignments. In local alignments, stretches of sequences with the highest density of matches are
aligned, thus generating one or more islands of matches or sub-alignments in the aligned sequences.
Local alignments are more suitable for aligning sequences that are similar along some of their
lengths but dissimilar in others, sequences that differ in length, or sequences that share a conserved
region or domain.

Global alignment:
For the two hypothetical protein sequence fragments, the global alignment is stretched over the entire
sequence length to include as many matching amino acids as possible up to and including the sequence ends.
Vertical bars between the sequences indicate the presence of identical amino acids.
Although there is an obvious region of identity in this example (the sequence GKG preceded by a
commonly observed substitution of T for A), a global alignment may not align such regions so that more
amino acids along the entire sequence lengths can be matched.
Local alignment:
In a local alignment, the alignment stops at the ends of regions of identity or strong similarity, and a
much higher priority is given to finding these local regions than to extending the alignment to include more
neighboring amino acid pairs. Dashes indicate sequence not included in the alignment. This type of
alignment favors finding conserved nucleotide patterns in DNA sequences or amino acid patterns in protein
sequences.

Definitions: Homology, Similarity, Identity:

Homologous sequences are derived from a common ancestor. Homology is an inference: sequences
are either homologous or not. Identity and similarity are quantities that describe the relatedness of sequences.
Proteins that are homologous may be orthologous or paralogous.
Orthologs are homologous sequences in different species that arose from a common ancestral
gene during speciation.
Paralogs are homologous sequences that arose by a mechanism such as gene duplication.

What is algorithm and program? (2 MARKS)

An algorithm is a procedure that is structured in a computer program. For example, there are many
algorithms used for pairwise alignment. A computer program is a set of instructions that uses an algorithm
(or multiple algorithms) to solve a task. For example, the BLAST program uses a set of algorithms to
perform sequence alignments.

BLAST :( BASIC LOCAL ALIGNMENT SEARCH TOOL) (10 MARKS)

INTRODUCTION:
Basic Local Alignment Search Tool (BLAST) is a sequence similarity search program that can be
used via a web interface or as a stand-alone tool to compare a user's query to a database of sequences.
Several variants of BLAST compare all combinations of nucleotide or protein queries with nucleotide or
protein databases.
BLAST is a heuristic that finds short matches between two sequences and attempts to start
alignments from these 'hot spots'. BLAST provides statistical information about an alignment; this is the
'expect' value or false-positive rate.

OVERVIEW:
When the user initiates a new job from a BLAST form, BLAST immediately presents the Job
Running page, which reports the status of a running job and an estimate of how long it will take to complete.
The formatting parameters for a BLAST job may be changed on the Format Control page as the job runs,
since formatting only occurs after search and alignment.
When the job completes, BLAST presents the BLAST Report. From the Report, the user may now
re-format the current job, run another BLAST job using the same parameters as a starting point, or navigate
to one of the other application pages.
The Recent Results page shows the status and some of the parameters of the user’s unexpired
BLAST jobs, and links directly to the BLAST Report for each job.

BLAST screen flow map. Each box represents a different page in the BLAST web application. A
user will normally enter through the ‘Home’ page and from there select a ‘BLAST form’ to submit a search.
After the search is submitted the ‘Job running’ page is shown until the search is done, after which the
‘Report’ page is shown. From the ‘Report’ page the user may reformat, modify the current search and
resubmit, or save the search strategy in My NCBI.

APPLICATION PAGES:
BLAST form:
The Enter Query Sequence section at the top of the form provides a place to enter one or more query
sequences, either by accession or gi number, or as IUPAC sequence in FASTA format.
The optional Query Subrange boxes limit the search to a subrange of the query sequence. As an
alternative to cutting and pasting sequence into a text box, you may also upload the query sequence(s) from
a local disk file.
The new Job Title is the job name that appears in Saved Strategies and Recent Results, as well as at
the top of every BLAST report.
When the input sequence is an accession or gi number, the BLAST web interface automatically looks
up the definition line in GenBank without reloading the page.
If multiple sequences are present, an appropriate descriptive title is generated (e.g. ‘5 nucleotide
sequences’).

The Choose Search Set section of the BLAST form selects the BLAST database to be searched and
applies limiting criteria, such as organism or Entrez query.
Searches may be limited to a specific organism (species or taxonomic group) by typing the scientific
name, common name or taxid (the integer id for the taxon in the NCBI Taxonomy database).

The Program Selection section of the BLAST form selects the algorithm used for search and alignment.
For nucleotide searches, the choices are megablast (default), discontiguous megablast and blastn. For protein
searches, the options are blastp (default), PSI-BLAST and PHI-BLAST.

Job running
The user submits a new BLAST job by pressing the BLAST form button. BLAST immediately presents the
Job Running page, which reports some statistics about the job, and provides an estimate of completion time.
The Job Running view periodically refreshes itself, effectively polling the server while the job runs. BLAST
automatically displays the BLAST report when the job completes.

Format control:
Limit controls (i.e. the Descriptions, Graphical Overview and Alignments counts; the Organism and
Entrez limits; and the expect value range) limit the items shown on the report for a completed job, rather
than limiting the search set, as they do on the BLAST form.
The Format Control form has a text input for the Request ID (RID), allowing the user to format the
current job, or any other known RID.
Report page
The current BLAST report pages are basically the same as the previous design, with a reformatted
header and some new features. To the right of the breadcrumbs are three links:
(1) Reformat these results leads to the Format Control page,
(2) Edit and Resubmit leads to the original BLAST form, with the current parameters selected
and
(3) Save Search Strategy saves the search parameters for the job so the user can run the same
job again later with identical parameters. This option is available only if the user is signed in
to My NCBI, since saved strategies are user-specific.

The Report Page is divided into four sections:

(1) The Summary section provides links to alternate report formats: the taxonomy report (hits
clustered taxonomically), the link to the MapViewer's 'Genome View' (hits shown on a genomic sequence
map), and a new tree view (hits clustered by similarity).
(2) The Graphical Overview section presents a graphic of the regions of the result set that aligned to
the query (called 'high-scoring pairs', or HSPs), plotted against the query sequence. The graphic is
unchanged from the previous design.
(3) The Descriptions section is a table of the sequences that matched the query, sorted by increasing
expect value. When the ‘Advanced view’ box is checked on the Format Control form, the Descriptions table
can be resorted by clicking the header columns and more of each result sequence definition line is visible.
(4) The Alignments section presents the alignments of the HSPs, either as a series of pairwise
alignments (default), or as a single block of all HSPs anchored to the query. Web log analysis has shown
that the links from subject sequences to other databases, particularly to Gene, are underutilized, so now each
alignment contains an informative link to Gene, where such a link exists.

USES OF BLAST:
Genomic BLAST pages
o Human
o Mouse
o Flies
o Nematodes
o Rat
o Fugu rubripes
o Zebrafish
o Plants
o Yeasts
o Malaria
o Other eukaryotes
o Microbial
Publicly available tools:
Sequence similarity tools
BLAST
TBLASTN
BLASTX
TBLASTX

SEARCHING FOR SIMILARITY (5 marks)

 Similarity searching methods involve matching of the query sequence to the sequences deposited in
the database.
 A similarity score is calculated by measuring the closeness between the residues. The closeness is
nothing but the number of nucleotide bases or amino acid residues that are similar between the
compared sequences.
 The two sequences should be aligned properly before a comparison. An alignment mistake of even a
single residue can end up as a mismatch along the entire sequence.
 The early sequence alignment tools developed by Needleman and Wunsch in 1970 and Sellers in
1974 relied upon searching for global similarity between the sequences.
 The most widely used tools are BLAST and FASTA.
 These two algorithms function in a similar fashion. They differ only in ranking the similarity or the
differences between the sequences. The FASTA program is sensitive for searching DNA against DNA.
 BLAST is faster than FASTA and can be used for DNA as well as proteins and also for other
combinations.
 Both BLAST and FASTA are designed to read either strand of DNA while searching. The query
DNA sequence may be a noncoding or a coding strand, so it is essential to search both strands of
the database DNA sequence for complementarity.
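
Searching both strands amounts to searching with the query and with its reverse complement. The sketch below uses exact string matching as a deliberate simplification (BLAST and FASTA tolerate mismatches and gaps); the function names and the toy sequences are hypothetical.

```python
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")

def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]

def find_on_both_strands(query, target):
    """List (strand, start) for exact occurrences of query on either strand of target."""
    hits = []
    for strand, q in (("+", query), ("-", reverse_complement(query))):
        start = target.find(q)
        while start != -1:
            hits.append((strand, start))
            start = target.find(q, start + 1)
    return hits

print(reverse_complement("ATGC"))                 # GCAT
print(find_on_both_strands("ATGC", "TTGCATAA"))   # [('-', 2)]
```

In the usage example the query is absent from the strand as written but its reverse complement occurs at position 2, so a search of one strand only would have missed it.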

FASTA (10 marks)

FASTA is another sequence analysis tool, very similar to BLAST. It was originally
developed by [Link] and Lipman, and the algorithm can be accessed from the EBI site. FASTA gives
better results for nucleotide sequences than for protein sequences. The FASTA program searches the database
files to find a number of sequences related to the query sequence and displays a pairwise alignment between
them.

Salient features of FASTA

FASTA is used for nucleic acids and FASTP for proteins. They find regions of similarity by first breaking the
sequence into short subsequences (words or ktups), then searching for the diagonals with the highest density
of matching words. Gaps can be allowed. FASTA can be better than BLASTN for nucleic acid comparisons,
but is usually no better for proteins, so BLASTP is preferable there because it is much faster. The following
are the programs available in FASTA:
 FASTA3: scans a protein or DNA sequence library for similar sequences.
 FASTx/y3: compares a translated DNA sequence in forward and reverse frame against a protein
sequence database
 tFASTx/y3: compares a query protein sequence to a translated DNA databank.
 FASTs3: compares linked peptides to a protein databank.
 FASTf3: compares mixed peptides to a protein databank.

PARAMETERS USED IN FASTA

Ktup determines how many consecutive identities should be present for a word to match (word length). If
the word length is 2, then the program searches only those regions in the database sequence that have at
least 2 adjacent identical residues.
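
The word lookup and diagonal counting can be sketched as follows. This covers only the first stage of a FASTA-style search, under simplifying assumptions: exact word matches, no rescoring with a substitution matrix, and hypothetical function names.

```python
from collections import Counter, defaultdict

def word_index(seq, ktup):
    """Map every word of length ktup in seq to the positions where it starts."""
    index = defaultdict(list)
    for pos in range(len(seq) - ktup + 1):
        index[seq[pos:pos + ktup]].append(pos)
    return index

def diagonal_hits(query, subject, ktup=2):
    """Count matching words per diagonal (subject position minus query position)."""
    index = word_index(query, ktup)
    diagonals = Counter()
    for s_pos in range(len(subject) - ktup + 1):
        word = subject[s_pos:s_pos + ktup]
        for q_pos in index.get(word, ()):
            diagonals[s_pos - q_pos] += 1
    return diagonals

hits = diagonal_hits("GAATTC", "GATTA", ktup=2)
print(hits.most_common())   # densest diagonal first
```

For the two example sequences the words AT and TT both fall on diagonal -1, so that diagonal has the highest density of matches; a larger ktup gives fewer, more specific hits and a faster but less sensitive search.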

MATRIX: The default matrix for FASTA is Blosum62.

GAP PENALTY: The gap open penalty applies to the first residue in a gap (-12 by default for proteins, -16 for
DNA), and the gap extension penalty applies to each additional residue in the gap (-2 by default for proteins,
-4 for DNA). Introducing a gap is treated as an unlikely event, so it is assigned a high value that is deducted
from the similarity score. This value is called the gap penalty.
The common formulation of the gap penalty is to give a high value for the introduction of a gap and an
additional, smaller value for each extension of the gap. There are two parameters: the gap opening penalty G
and the gap extension penalty L.

The total gap penalty to be deducted from the alignment score is G + Ln, where G is the gap opening
penalty, L is the gap extension penalty, and n is the length of the gap. Since opening a gap in a sequence is
considered a rare event, the value of G is high (usually 10-15 in the context of BLOSUM62).
Extending a gap that already exists is considered less rare, so the value of L is around
1 or 2.
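
A small worked example of the G + Ln formula is sketched below, with n taken as the number of residues in the gap. The function name gap_penalty is illustrative, and the default values 12 and 2 are simply the magnitudes of the FASTA protein defaults quoted above. Note that this follows the G + Ln form literally; the convention in which the open penalty already covers the first residue would give a slightly different total.

```python
def gap_penalty(n: int, gap_open: float = 12, gap_extend: float = 2) -> float:
    """Total penalty G + L*n for a gap of n residues (0 when there is no gap)."""
    if n < 0:
        raise ValueError("gap length must be non-negative")
    if n == 0:
        return 0
    return gap_open + gap_extend * n


print(gap_penalty(3))      # 18: one gap of three residues
print(3 * gap_penalty(1))  # 42: three separate gaps of one residue each
```

The comparison shows why this scheme favours one long gap over several short ones: each new gap pays the high opening cost again.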

HISTOGRAM:
This displays the search histogram of the expected frequency of chance occurrence of the database matches
found.

E-VALUE
The E (Expectation) value is used to evaluate statistical significance. The E-value for a given
alignment depends on its score as well as on the lengths of both the query sequence and the database
searched. A smaller E-value indicates a more statistically significant match.
By default, the upper E-value limit for score and alignment display is 10.0 for FASTA protein
searches, 5.0 for translated DNA/protein comparisons, and 2.0 for DNA/DNA searches.
Setting a lower E-value limit of 1e-6 prevents library sequences with an E() value lower than 1e-6 from
being displayed. This allows the user to focus on more distant relationships.
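
The effect of the two limits can be illustrated with a short sketch that filters a list of hits. The Hit record, the hit names, the E-values, and the function filter_hits are all made up for illustration and are not FASTA output or part of any FASTA interface; treating both limits as inclusive is also an assumption.

```python
from typing import NamedTuple, Optional


class Hit(NamedTuple):
    name: str      # hypothetical library sequence name
    evalue: float  # expectation value reported for the match


def filter_hits(hits, upper: float = 10.0, lower: Optional[float] = None):
    """Keep hits with lower <= E <= upper, sorted from most to least significant."""
    kept = [
        h for h in hits
        if h.evalue <= upper and (lower is None or h.evalue >= lower)
    ]
    return sorted(kept, key=lambda h: h.evalue)


hits = [Hit("seqA", 1e-50), Hit("seqB", 1e-3), Hit("seqC", 4.0), Hit("seqD", 25.0)]
for hit in filter_hits(hits, upper=10.0, lower=1e-6):
    print(hit.name, hit.evalue)
```

With an upper limit of 10.0 and a lower limit of 1e-6, the very strong match seqA and the insignificant match seqD are both hidden, leaving only the more distant candidates seqB and seqC.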
