Sequence Analysis Dr. P. Saha
Sequence Analysis
Sequence analysis refers to the process of subjecting a DNA, RNA or peptide sequence to
any of a wide range of analytical methods to understand its features, function, structure, or
evolution. Methodologies used include sequence alignment, searches against biological
databases, and others. Since the development of high-throughput methods for producing gene
and protein sequences, the rate at which new sequences are added to the databases has
increased exponentially. Such a collection of sequences does not, by itself, increase the
scientist's understanding of the biology of organisms. However, comparing these new
sequences to those with known functions is a key way of understanding the biology of the
organism from which a new sequence comes. Thus, sequence analysis can be used to assign
function to genes and proteins by studying the similarities between the compared sequences.
Many tools and techniques are now available to perform sequence comparisons (sequence
alignment) and to analyze the resulting alignments for biological insight.
Sequence analysis in molecular biology includes a very wide range of relevant topics:
1. Comparison of sequences to find similarity, often to infer whether they are related
(homologous)
2. Identification of intrinsic features of the sequence such as active sites,
post-translational modification sites, gene structures, reading frames, distributions of
introns and exons, and regulatory elements
3. Identification of sequence differences and variations, such as point mutations and
single nucleotide polymorphisms (SNPs), in order to obtain genetic markers
4. Revealing the evolution and genetic diversity of sequences and organisms
5. Identification of molecular structure from sequence alone
There are millions of protein and nucleotide sequences known. These sequences fall into
many groups of related sequences known as protein families or gene families. Relationships
between these sequences are usually discovered by aligning them together and assigning this
alignment a score. There are two main types of sequence alignment. Pair-wise sequence
alignment compares only two sequences at a time, whereas multiple sequence alignment
compares many sequences at once. Two important algorithms for aligning pairs of sequences
are the Needleman-Wunsch algorithm (global alignment) and the Smith-Waterman algorithm
(local alignment). Popular tools for sequence alignment include:
• Pair-wise alignment - BLAST
• Multiple alignment - ClustalW, PROBCONS, MUSCLE, MAFFT, and T-Coffee.
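As an illustration of how pair-wise alignment by dynamic programming works, the following is
a minimal sketch of Needleman-Wunsch global alignment. The function name and the scoring
values (match = 1, mismatch = -1, linear gap = -2) are assumptions chosen for the example;
real tools normally use substitution matrices and affine gap penalties.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-2):
    """Global alignment of a and b; returns (score, aligned_a, aligned_b).

    Scoring values are illustrative assumptions, not recommended defaults.
    """
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]

    # First row and column: aligning a prefix against nothing costs only gaps.
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap

    def sub(i, j):
        # Substitution score for aligning a[i-1] with b[j-1].
        return match if a[i - 1] == b[j - 1] else mismatch

    # Fill the table: each cell is the best of diagonal, up, and left moves.
    for i in range(1, rows):
        for j in range(1, cols):
            score[i][j] = max(
                score[i - 1][j - 1] + sub(i, j),  # align two residues
                score[i - 1][j] + gap,            # gap in b
                score[i][j - 1] + gap,            # gap in a
            )

    # Trace back from the bottom-right corner to recover one optimal alignment.
    out_a, out_b = [], []
    i, j = len(a), len(b)
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub(i, j):
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1

    return score[len(a)][len(b)], "".join(reversed(out_a)), "".join(reversed(out_b))


if __name__ == "__main__":
    print(needleman_wunsch("ACGT", "AGT"))
```

Smith-Waterman local alignment uses the same table but never lets a cell fall below zero and
starts the traceback from the highest-scoring cell rather than from the corner.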
A common use for pair-wise sequence alignment is to take a sequence of interest and compare
it to all known sequences in a database to identify homologous sequences. In general, the
matches in the database are ordered so that the most closely related sequences appear first,
followed by sequences with diminishing similarity. These matches are usually reported with a
measure of statistical significance such as an expectation value (E-value).
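For local alignment scores, the expectation value is commonly estimated with the
Karlin-Altschul formula E = K * m * n * exp(-lambda * S), where S is the raw score, m the
query length and n the database length. The sketch below simply evaluates that formula. The
function names are hypothetical, and lambda and K depend on the scoring system, so the
values in the usage lines are placeholders rather than parameters of any particular tool.

```python
import math


def expect_value(raw_score, query_len, db_len, lam, k):
    """Expected number of chance hits scoring at least raw_score.

    Evaluates E = K * m * n * exp(-lambda * S). lam and k must come from
    the scoring system in use; they are not derived here.
    """
    return k * query_len * db_len * math.exp(-lam * raw_score)


def bit_score(raw_score, lam, k):
    """Normalized score S' = (lambda * S - ln K) / ln 2."""
    return (lam * raw_score - math.log(k)) / math.log(2)


if __name__ == "__main__":
    # Placeholder parameters for illustration only.
    lam, k = 0.3, 0.1
    for s in (30, 50, 70):
        print(s, expect_value(s, query_len=100, db_len=1_000_000, lam=lam, k=k))
```

A lower E-value means the match is less likely to have arisen by chance, and for a fixed
score the E-value grows in proportion to the size of the database searched.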
Profile comparison
In 1987 Michael Gribskov, Andrew McLachlan and David Eisenberg introduced the method
of profile comparison for identifying distant similarities between proteins. Rather than using
a single sequence, profile methods use a multiple sequence alignment to encode a profile
which contains information about the conservation level of each residue. These profiles can
then be used to search collections of sequences to find sequences that are related. Profiles are
also known as Position Specific Scoring Matrices (PSSMs). In 1993 a probabilistic
interpretation of profiles was introduced by David Haussler and colleagues using hidden
Markov models. These models have become known as profile-HMMs. In recent years
methods have been developed that allow the comparison of profiles directly to each other.
These are known as profile-profile comparison methods.
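To make the idea of a profile concrete, the following minimal sketch builds a log-odds PSSM
from a small gap-free DNA alignment and slides it along a sequence. The function names, the
uniform background frequencies, the pseudocount of 1 and the toy alignment are all
assumptions made for illustration; the method of Gribskov and colleagues and profile-HMMs
are considerably more elaborate.

```python
import math
from collections import Counter


def build_pssm(alignment, alphabet="ACGT", pseudocount=1.0):
    """Build a log-odds PSSM (one dict per column) from aligned sequences.

    Assumes equal-length sequences and a uniform background distribution.
    Characters outside the alphabet (for example gaps) are ignored.
    """
    length = len(alignment[0])
    if any(len(seq) != length for seq in alignment):
        raise ValueError("all aligned sequences must have the same length")

    background = 1.0 / len(alphabet)
    pssm = []
    for col in range(length):
        counts = Counter(seq[col] for seq in alignment)
        total = sum(counts[ch] for ch in alphabet) + pseudocount * len(alphabet)
        column = {}
        for ch in alphabet:
            freq = (counts[ch] + pseudocount) / total
            column[ch] = math.log2(freq / background)
        pssm.append(column)
    return pssm


def score_window(pssm, window):
    """Sum the per-column scores for a window as wide as the profile."""
    return sum(column[ch] for column, ch in zip(pssm, window))


def best_hit(pssm, sequence):
    """Return (score, start) of the best-scoring window in sequence.

    The sequence must be at least as long as the profile.
    """
    width = len(pssm)
    return max(
        (score_window(pssm, sequence[i:i + width]), i)
        for i in range(len(sequence) - width + 1)
    )


if __name__ == "__main__":
    toy_alignment = ["TATAAT", "TATAAT", "TATGAT", "TACAAT"]
    profile = build_pssm(toy_alignment)
    print(best_hit(profile, "GGCGTATAATGCC"))
```

Conserved columns contribute large positive scores for the conserved residue and negative
scores for the others, which is what lets a profile detect more distant relatives than a
single query sequence can.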
Sequence assembly
Sequence assembly refers to the reconstruction of a DNA sequence by aligning and merging
small DNA fragments. It is an integral part of modern DNA sequencing. Since presently
available DNA sequencing technologies are ill-suited for reading long sequences, large pieces
of DNA (such as genomes) are often sequenced by (1) cutting the DNA into small pieces, (2)
reading the small fragments, and (3) reconstituting the original DNA sequence by merging the
information from the various fragments.
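Step (3) can be sketched with a greedy overlap-merge: repeatedly find the pair of fragments
with the longest suffix-prefix overlap and join them. This is a toy illustration under
strong assumptions (error-free reads, a single strand, no repeats, no fragment contained
inside another); the function names and the minimum overlap of 3 are made up for the
example, and production assemblers use far more robust graph-based methods.

```python
def overlap(a, b, min_len=3):
    """Length of the longest suffix of a that equals a prefix of b.

    Overlaps shorter than min_len are reported as 0.
    """
    for size in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:size]):
            return size
    return 0


def greedy_assemble(fragments, min_len=3):
    """Merge fragments greedily by largest overlap; returns remaining contigs."""
    frags = list(dict.fromkeys(fragments))  # drop exact duplicates, keep order
    while len(frags) > 1:
        best_size, best_i, best_j = 0, None, None
        for i, a in enumerate(frags):
            for j, b in enumerate(frags):
                if i == j:
                    continue
                size = overlap(a, b, min_len)
                if size > best_size:
                    best_size, best_i, best_j = size, i, j
        if best_size == 0:
            break  # nothing left to merge
        merged = frags[best_i] + frags[best_j][best_size:]
        frags = [f for k, f in enumerate(frags) if k not in (best_i, best_j)]
        frags.append(merged)
    return frags


if __name__ == "__main__":
    reads = ["ACGTTAGC", "ATGGCGTA", "GCGTACGT"]
    print(greedy_assemble(reads))
```

If no pair of fragments overlaps by at least min_len, the function returns several contigs
instead of forcing a merge.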
Gene prediction
Gene prediction or gene finding refers to the process of identifying the regions of genomic
DNA that encode genes. This includes protein-coding genes as well as RNA genes, but may
also include prediction of other functional elements such as regulatory regions. Gene finding
is one of the first and most important steps in understanding the genome of a species once it
has been sequenced. In general, the prediction of bacterial genes is significantly simpler
and more accurate than the prediction of genes in eukaryotic species, which usually have
complex intron/exon patterns.
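The simplest form of gene finding for intron-free (bacterial-like) DNA is an open reading
frame scan: look in all six reading frames for an ATG followed in frame by a stop codon. The
sketch below assumes the standard stop codons, ATG as the only start codon, and a
hypothetical min_codons threshold; real gene finders also model codon usage, ribosome
binding sites and other signals.

```python
START = "ATG"
STOPS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Return the reverse complement of an upper-case DNA string."""
    return dna.translate(COMPLEMENT)[::-1]


def find_orfs(dna, min_codons=3):
    """Return (strand, frame, start, end, orf) for each ORF found.

    An ORF runs from the first ATG in a frame to the next in-frame stop
    codon, inclusive. min_codons counts the start and stop codons.
    Coordinates on the "-" strand refer to the reverse-complemented
    sequence, not to the input.
    """
    dna = dna.upper()
    results = []
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for frame in range(3):
            start = None
            for pos in range(frame, len(seq) - 2, 3):
                codon = seq[pos:pos + 3]
                if start is None and codon == START:
                    start = pos
                elif start is not None and codon in STOPS:
                    end = pos + 3
                    if (end - start) // 3 >= min_codons:
                        results.append((strand, frame, start, end, seq[start:end]))
                    start = None
    return results


if __name__ == "__main__":
    for orf in find_orfs("CCATGAAATTTGGGTAACC"):
        print(orf)
```

This approach does not carry over to eukaryotic genes, because introns interrupt the reading
frame and the coding sequence is no longer one contiguous stretch of codons.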
Protein Structure Prediction
Figure: Target protein structure (3dsm, shown in ribbons), with C-alpha backbones (in gray)
of 354 predicted models for it submitted in the CASP8 structure-prediction experiment.
The 3D structures of molecules are of great importance to their functions in nature. Since
structural prediction of large molecules at an atomic level is a largely intractable problem,
biologists have introduced ways to predict 3D structure from the primary sequence. This includes
biochemical or statistical analysis of amino acid residues in local regions and structural
inference from homologs (or other potentially related proteins) with known 3D structures.
A large number of diverse approaches have been applied to the structure prediction problem.
To determine which methods are most effective, a structure prediction competition called
CASP (Critical Assessment of Structure Prediction) was founded.
Methodology
The tasks that fall within sequence analysis are often non-trivial and require relatively
complex approaches. Of the many types of methods used in practice, the most popular
include:
• Dynamic programming
• Artificial Neural Network
• Hidden Markov Model
• Support Vector Machine
• Clustering
• Bayesian Network
• Regression Analysis
Dot-matrix methods
A dot-matrix plot provides a global picture of the local similarities between two sequences.
Dot-matrix plots are appropriate:
• for comparing large sequences (several thousand residues)
• if one does not know in advance whether two sequences share detectable similarity or
which parts of the sequences are related to each other.
They are useful for:
• detection of repeats within protein sequences
• detection of shared domains between protein sequences
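A dot matrix can be sketched in a few lines: place one sequence along the rows and the other
along the columns, and mark a dot wherever the windows starting at the two positions match.
The function names and the window and min_matches parameters below are assumptions for
illustration; a larger window with a matching threshold filters out chance matches.

```python
def dot_matrix(seq_a, seq_b, window=1, min_matches=None):
    """Return a list of rows of booleans for the dot plot of seq_a vs seq_b.

    Cell [i][j] is True when the windows starting at seq_a[i] and seq_b[j]
    agree in at least min_matches positions (default: every position).
    """
    if min_matches is None:
        min_matches = window
    rows = len(seq_a) - window + 1
    cols = len(seq_b) - window + 1
    matrix = []
    for i in range(rows):
        row = []
        for j in range(cols):
            matches = sum(seq_a[i + k] == seq_b[j + k] for k in range(window))
            row.append(matches >= min_matches)
        matrix.append(row)
    return matrix


def render(matrix):
    """Draw the matrix as text, with '*' for a dot and '.' for no dot."""
    return "\n".join(
        "".join("*" if cell else "." for cell in row) for row in matrix
    )


if __name__ == "__main__":
    # Comparing a sequence with itself: off-diagonal lines reveal repeats.
    print(render(dot_matrix("GATTACAGATTACA", "GATTACAGATTACA", window=3)))
```

In the rendered plot, a run of dots along a diagonal marks a region of similarity. When a
sequence is compared with itself, diagonals parallel to the main diagonal indicate repeats.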