Global and Local Sequence Alignment Techniques

The document discusses global alignment and local alignment algorithms. It describes the Needleman-Wunsch algorithm as the first algorithm for global sequence alignment using dynamic programming to find the optimal alignment between entire sequences. The Smith-Waterman algorithm is presented as the method for local alignment to find locally similar regions between divergent or variably sized sequences. Key steps of the Needleman-Wunsch algorithm including setting up a scoring matrix and performing a trace-back procedure are outlined.

Uploaded by

Raj Lonkar
Copyright
© All Rights Reserved

Global alignment

A global alignment contains the entire sequence of each protein or DNA molecule; that is, it tries to align the sequences over their full length.

• One of the first and most important algorithms for aligning two protein sequences was described by Needleman and Wunsch (1970).

• The Needleman-Wunsch algorithm is an example of dynamic programming.

• In global alignment, the two sequences to be aligned are assumed to be generally similar over their entire length.

• Alignment is carried out from the beginning to the end of both sequences to find the best possible alignment across the entire length of the two sequences.

• This method is more applicable for aligning two closely related sequences of roughly the same length.

• For divergent sequences and sequences of variable lengths, this method may not be able to generate optimal results because it fails to recognize highly similar local regions between the two sequences.

• This algorithm is important because it produces an optimal alignment of protein or DNA sequences, even allowing the introduction of gaps.

• The Needleman-Wunsch approach to global sequence alignment can be described in three steps:

(1) Setting up a matrix.

• The first step is the comparison of the two sequences in a two-dimensional matrix.
• The first sequence is listed horizontally along the matrix and the second sequence is listed vertically along the matrix.
• For two sequences of lengths m and n, a matrix of dimensions m + 1 by n + 1 is then built; the extra row and column allow for gaps at the start of either sequence.
• A perfect alignment between two identical sequences would simply be represented by a diagonal line extending from the top left to the bottom right.
• Any mismatches between the two sequences would still be represented on this diagonal path.
• Gaps are represented in this matrix by horizontal or vertical steps in the path.
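
The set-up step could be sketched in Python roughly as follows. The function name `init_matrix`, the example sequences, and the choice to pre-fill the first row and column with cumulative gap penalties are assumptions made for illustration (the text gives only the matrix dimensions); the gap penalty of -2 is the value quoted in the trace-back step.

```python
def init_matrix(seq1, seq2, gap=-2):
    """Build an (m + 1) x (n + 1) matrix for two sequences.

    seq1 runs horizontally (columns) and seq2 vertically (rows).
    Assumption: the first row and column hold cumulative gap penalties,
    which is the usual convention for global alignment.
    """
    n, m = len(seq1), len(seq2)
    matrix = [[0] * (n + 1) for _ in range(m + 1)]
    for j in range(1, n + 1):
        matrix[0][j] = j * gap  # leading gaps in the vertical sequence
    for i in range(1, m + 1):
        matrix[i][0] = i * gap  # leading gaps in the horizontal sequence
    return matrix


# Hypothetical example sequences, both of length 7.
grid = init_matrix("GATTACA", "GCATGCU")
print(len(grid), len(grid[0]))  # 8 rows and 8 columns
```

Only the border cells are filled at this stage; the interior is computed in the scoring step.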

(2) Scoring the matrix.

• The goal of this algorithm is to identify an optimal alignment.
• Finding an optimal alignment means determining the path through the matrix that maximizes the score.
• There are four possible occurrences at each position:
  • the two residues may be perfectly matched;
  • they may be mismatched;
  • a gap may be introduced in the first sequence;
  • a gap may be introduced in the second sequence.
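
A minimal sketch of the scoring step, assuming the rewards and penalties given in the trace-back step (match +1, mismatch -1, gap -2). The function name `fill_matrix` and the example sequences are hypothetical. Each cell takes the best of three moves: a diagonal move (match or mismatch), a vertical move, or a horizontal move (a gap in one sequence or the other).

```python
MATCH, MISMATCH, GAP = 1, -1, -2  # values quoted in the text


def fill_matrix(seq1, seq2):
    """Fill a global-alignment score matrix (seq1 horizontal, seq2 vertical)."""
    n, m = len(seq1), len(seq2)
    score = [[0] * (n + 1) for _ in range(m + 1)]
    for j in range(1, n + 1):
        score[0][j] = j * GAP
    for i in range(1, m + 1):
        score[i][0] = i * GAP
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            same = seq1[j - 1] == seq2[i - 1]
            diagonal = score[i - 1][j - 1] + (MATCH if same else MISMATCH)
            vertical = score[i - 1][j] + GAP    # gap in the horizontal sequence
            horizontal = score[i][j - 1] + GAP  # gap in the vertical sequence
            score[i][j] = max(diagonal, vertical, horizontal)
    return score


# The bottom-right cell holds the score of the best global alignment.
print(fill_matrix("ACGT", "AGT")[-1][-1])
```

Two identical sequences of length k score k under this scheme, because the optimal path is the main diagonal.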

(3) Identifying the optimal alignment.

• After the matrix is filled, the alignment is determined by a trace-back procedure.
• There are rewards and penalties: match +1, mismatch -1, and gap -2.
• When a cell is reached by a diagonal step, its value (the lower-right cell) is larger than the value of its upper-left diagonal neighbour if the residues match, and smaller than that value if they are mismatched.
• Starting from the bottom-right cell, move diagonally if there is a match; if not, move to the neighbouring cell from which the current value was derived (in simple cases, the neighbour with the highest value). A horizontal or vertical move is represented as a gap in the alignment.
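
The three steps can be put together in one short sketch. This is an illustrative implementation under the scoring scheme quoted above, not necessarily the exact procedure the notes have in mind: the function name `needleman_wunsch` is hypothetical, and the trace-back follows the neighbour each score was actually derived from, which is a slightly stricter rule than "go to the highest neighbour".

```python
def needleman_wunsch(seq1, seq2, match=1, mismatch=-1, gap=-2):
    """Return (aligned_seq1, aligned_seq2, score) for a global alignment."""
    n, m = len(seq1), len(seq2)
    score = [[0] * (n + 1) for _ in range(m + 1)]
    for j in range(1, n + 1):
        score[0][j] = j * gap
    for i in range(1, m + 1):
        score[i][0] = i * gap
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            s = match if seq1[j - 1] == seq2[i - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + s,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)

    # Trace back from the bottom-right cell to the top-left cell.
    i, j = m, n
    out1, out2 = [], []
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            s = match if seq1[j - 1] == seq2[i - 1] else mismatch
            if score[i][j] == score[i - 1][j - 1] + s:
                out1.append(seq1[j - 1])  # diagonal move: match or mismatch
                out2.append(seq2[i - 1])
                i, j = i - 1, j - 1
                continue
        if i > 0 and score[i][j] == score[i - 1][j] + gap:
            out1.append("-")              # vertical move: gap in seq1
            out2.append(seq2[i - 1])
            i -= 1
        else:
            out1.append(seq1[j - 1])      # horizontal move: gap in seq2
            out2.append("-")
            j -= 1
    return "".join(reversed(out1)), "".join(reversed(out2)), score[m][n]


top, bottom, total = needleman_wunsch("ACGT", "AGT")
print(top)
print(bottom)
print(total)
```

For these two hypothetical sequences the only optimal alignment places a single gap opposite the C, giving three matches and one gap.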


Local alignment
• Local alignment does not assume that the two sequences in question are similar over their entire length.

• It only finds the local regions with the highest level of similarity between the two sequences and aligns these regions only.

• Stretches of sequence with the highest density of matches are aligned.

• This approach can be used for aligning partially similar, different-length, or more divergent sequences, with the goal of searching for conserved patterns in DNA or protein sequences.

• The two sequences to be aligned can be of different lengths; a substring of the target is aligned with a substring of the query.

• This approach is more appropriate for aligning divergent biological sequences containing only modules that are similar, which are referred to as domains or motifs.

• The general local alignment method used is the Smith-Waterman algorithm, which is an example of dynamic programming.
• The Smith-Waterman method is very similar to the Needleman-Wunsch method of global alignment; the main difference is that values that would be negative in the Needleman-Wunsch matrix are replaced by zero.
• The trace-back step is simpler and more straightforward than in global alignment: start at the highest value in the matrix and move back until a zero is reached. This gives a conserved pattern present in both sequences.
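
A minimal sketch of the local method described above, assuming the same match, mismatch, and gap values as in the global example (the text does not state scores for local alignment). The function name `smith_waterman` and the example sequences are hypothetical. Negative cell values are clamped to zero, and the trace-back starts at the highest cell and stops at the first zero.

```python
def smith_waterman(seq1, seq2, match=1, mismatch=-1, gap=-2):
    """Return (aligned_sub1, aligned_sub2, score) for the best local alignment."""
    n, m = len(seq1), len(seq2)
    score = [[0] * (n + 1) for _ in range(m + 1)]  # borders stay at zero
    best, best_pos = 0, (0, 0)
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            s = match if seq1[j - 1] == seq2[i - 1] else mismatch
            score[i][j] = max(0,  # negative values are replaced by zero
                              score[i - 1][j - 1] + s,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)
            if score[i][j] > best:
                best, best_pos = score[i][j], (i, j)

    # Trace back from the highest cell until a zero is reached.
    i, j = best_pos
    out1, out2 = [], []
    while i > 0 and j > 0 and score[i][j] > 0:
        s = match if seq1[j - 1] == seq2[i - 1] else mismatch
        if score[i][j] == score[i - 1][j - 1] + s:
            out1.append(seq1[j - 1])
            out2.append(seq2[i - 1])
            i, j = i - 1, j - 1
        elif score[i][j] == score[i - 1][j] + gap:
            out1.append("-")
            out2.append(seq2[i - 1])
            i -= 1
        else:
            out1.append(seq1[j - 1])
            out2.append("-")
            j -= 1
    return "".join(reversed(out1)), "".join(reversed(out2)), best


# Two sequences that share only a central module.
print(smith_waterman("TTTGATTACATTT", "CCCGATTACACCC"))
```

The flanking regions are ignored because extending the alignment into them would only add mismatches and lower the score.
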
Applications of bioinformatics:

Databases
• A database is a computerized archive used to store and organize data in such a way that information can be retrieved easily via a variety of search criteria.
• Databases are composed of computer hardware and software for data management.
• The chief objective of the development of a database is to organize data in a set of structured records to enable easy retrieval of information.
• To retrieve a particular record from the database, a user can specify a particular piece of information, called a value, to be found in a particular field and expect the computer to retrieve the whole data record. This process is called making a query.
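
The idea of a query (a value looked up in a field, returning whole records) can be illustrated with Python's built-in `sqlite3` module. This is only a toy sketch: the table name, field names, and the three records are invented for the example and do not come from any real biological database.

```python
import sqlite3

# Toy in-memory relational database; all names and records are invented.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sequence_records ("
    "accession TEXT PRIMARY KEY, organism TEXT, residues TEXT)"
)
conn.executemany(
    "INSERT INTO sequence_records VALUES (?, ?, ?)",
    [
        ("REC001", "Homo sapiens", "GATTACA"),
        ("REC002", "Mus musculus", "GATCACA"),
        ("REC003", "Homo sapiens", "TTGACCA"),
    ],
)


def query_by_organism(connection, value):
    """Return every whole record whose 'organism' field equals the value."""
    cursor = connection.execute(
        "SELECT accession, organism, residues FROM sequence_records "
        "WHERE organism = ? ORDER BY accession",
        (value,),
    )
    return cursor.fetchall()


records = query_by_organism(conn, "Homo sapiens")
for record in records:
    print(record)
```

Here "Homo sapiens" is the value, `organism` is the field, and each returned tuple is a complete data record.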

• Biological databases:
• A biological database is a collection of biological information or data that is organised so that it can be easily accessed, managed, and updated.
• The kinds of data include DNA sequences of genes or full genomes, protein sequences, and 3D structures of proteins, nucleic acids, and protein-nucleic acid complexes.
• Current biological databases use all three types of database structures: flat files, relational, and object-oriented.
• Based on their contents, biological databases can be roughly divided into three categories: primary databases, secondary databases, and specialized databases.
Similarity and identity

• An important concept in sequence analysis is sequence homology.
• When two sequences are descended from a common evolutionary origin, they are said to have a homologous relationship or share homology.
• A related but different term is sequence similarity, which is the percentage of aligned residues that are similar in physicochemical properties such as size, charge, and hydrophobicity.
• To be clear, sequence homology is an inference or a conclusion about a common ancestral relationship drawn from sequence similarity comparison when the two sequences share a high enough degree of similarity.
• On the other hand, similarity is a direct result of observation from the sequence alignment.
• Sequence similarity can be quantified using percentages; homology is a qualitative statement.
• In a protein sequence alignment, sequence identity refers to the percentage of matches of the same amino acid residues between two aligned sequences.
• Sequence similarity and sequence identity are synonymous for nucleotide sequences, but differ for protein sequences, where identity means the percentage of exact matches between two aligned sequences and similarity means the percentage of aligned residues that share characteristics.

• Both identity and similarity are used to deduce homology. Homology has a specific definition: having a common evolutionary ancestor.
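
The difference between identity and similarity for proteins can be made concrete with a small sketch. Two things here are assumptions, not taken from the text: the residue groups in `SIMILAR_GROUPS` are one illustrative grouping by physicochemical character (real analyses usually rely on a substitution matrix), and both percentages are computed over the full alignment length, gaps included. The function name and example peptides are hypothetical.

```python
# Illustrative grouping only (assumption): residues in the same set are
# treated as sharing physicochemical characteristics.
SIMILAR_GROUPS = [
    set("AVLIM"),  # aliphatic / hydrophobic
    set("FWY"),    # aromatic
    set("KRH"),    # positively charged
    set("DE"),     # negatively charged
    set("STNQ"),   # polar, uncharged
]


def identity_and_similarity(aligned1, aligned2):
    """Return (percent identity, percent similarity) for two aligned strings.

    Assumption: the denominator is the alignment length, gaps included.
    """
    if len(aligned1) != len(aligned2) or not aligned1:
        raise ValueError("aligned sequences must be non-empty and equal in length")
    identical = similar = 0
    for x, y in zip(aligned1, aligned2):
        if x == "-" or y == "-":
            continue  # a gap column counts as neither identical nor similar
        if x == y:
            identical += 1
            similar += 1  # identical residues are also similar
        elif any(x in group and y in group for group in SIMILAR_GROUPS):
            similar += 1
    length = len(aligned1)
    return 100.0 * identical / length, 100.0 * similar / length


print(identity_and_similarity("KVLD", "RVLE"))
```

In this example two of the four columns are exact matches, while the other two pair residues of the same charge class, so similarity comes out higher than identity.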

Homology
• Homologs are two or more sequences that descend from a common ancestral sequence.
• Homologs are the result of divergent evolution.
• Two sequences are homologous if they share a common evolutionary ancestry.
• There are no degrees of homology; sequences are either homologous or not.
• Homologous proteins almost always share a significantly related three-dimensional structure.
• Proteins that are homologous may be orthologous or paralogous.
• Orthologs are homologous sequences in different species that arose from a common ancestral gene during speciation; they are the result of speciation events.
• Paralogs are homologous sequences that arose by a mechanism such as gene duplication; they are the result of gene duplication.
• Xenologs are the result of horizontal gene transfer.
• Gametologs: genes on sex chromosomes that have not recombined.
• Homoeologs: genes that were separated by a speciation event and later brought back together in the same genome by hybridisation.
