Pairwise Sequence Alignment
CS 838
[Link]/~craven/[Link]
Mark Craven
craven@[Link]
January 2001
Announcements
• New optional, but recommended, reading on the
course web page: Molecular Biology for Computer
Scientists by Larry Hunter
1
Pairwise Alignment:
Task Definition
• Given
– a pair of sequences (DNA or protein)
– a method for scoring the similarity of a pair of
characters
• Do
– determine the correspondences between
substrings in the sequences such that the
similarity score is maximized
Motivation
• comparing sequences to gain information
about the structure/function of a query
sequence
• putting together a set of sequenced
fragments (fragment assembly)
• comparing a segment sequenced by two
different labs
2
The Role of Homology
• homology: similarity due to descent from a
common ancestor
• often we can infer homology from similarity
• thus we can sometimes infer
structure/function from sequence similarity
Homology
• homologous sequences can be divided into
two groups
– orthologous sequences: sequences that differ
because they are found in different species
(e.g. human α-globin and mouse α-globin)
– paralogous sequences: sequences that differ
because of a gene duplication event
(e.g. human α-globin and human β-globin,
various versions of both )
3
Issues in Sequence Alignment
• the sequences we’re comparing probably differ in
length
• there may be only a relatively small region in the
sequences that matches
• we want to allow partial matches (i.e. some amino
acid pairs are more substitutable than others)
• variable length regions may have been
inserted/deleted from the common ancestral
sequence
Gaps
• sequences may have diverged from a
common ancestor through various types of
mutations:
– substitutions (ACGA AGGA)
– insertions (ACGA ACCGA)
– deletions (ACGA AGA)
• the latter two will result in gaps in
alignments
4
Insertions/Deletions and
Protein Structure
loop structures: insertions/deletions
here not so significant
Example Alignment
GSAQVKGHGKKVADALTNAVAHV---D--DMPNALSALSDLHAHKL
++ ++++H+ KV + +A ++ +L+ L+++H+ K
NNPELQAHAGKVFKLVYEAAIQLQVTGVVVTDATLKNLGSVHVSKG
• gaps depicted with –
• middle line shows matches
– identical matches shown with letters
– similar amino acids shown with +
– dissimilar amino acids/gaps indicated by space
5
Alignments in the Olden Days:
Dot Plots
G A C G G A T T A G
G n n n n
A n n n
T n n
C n
G n n n n
G n n n n
A n n n
A n n n
T n n
A n n n
G n n n n
Types of Alignment
• global: find best match of both sequences in their
entirety
• local: find best subsequence match
• semi-global: find best match without penalizing
gaps on the ends of the alignment
6
Pairwise Alignment Via Dynamic
Programming
• Needleman & Wunsch, Journal of Molecular
Biology, 1970
• dynamic programming: solve an instance of a
problem by taking advantage of computed
solutions for smaller subparts of the problem
• determine alignment of two sequences by
determining alignment of all prefixes of the
sequences
Scoring Scheme Components
• substitution matrix
– s(a,b) indicates score of aligning character a
with character b
• gap penalty function
– w(k) indicates cost of a gap of length k
7
Linear Gap Penalty Function
• different gap penalty functions require
somewhat different DP algorithms
• the simplest case is when a linear gap
function is used
w(k ) = gk
where g is a constant
• we’ll start by considering this case
Dynamic Programming Idea
• consider last step in computing alignment of AAAC with
AGC
• three possible options; in each we’ll choose a different
pairing for end of alignment, and add this to best alignment
of previous characters
AAA C AAAC -
AG C AG C
AAA C consider best score of
AGC -
alignment of + aligning
these prefixes this pair
8
Dynamic Programming Idea
• given an n-character sequence x, and an m-
character sequence y
• construct an (n+1) x (m+1) matrix F
• F [ i, j ] = score of the best alignment of
x[1…i ] with y[1…j ]
Dynamic Programming Idea
F[i-1, j-1] F[i, j-1]
+g
+ s(x[i],y[j])
F[i-1, j] F[i, j]
+g
9
Dynamic Programming Idea
• in extending an alignment, we have 3 choices:
– align x[ 1… i-1] with y[ 1… j-1] and match x[ i ]
with y[ i ]
– align x[1… i ] with y[ 1… j-1 ] and match a gap
with y[ j ]
– align x[ 1…i-1 ] with y[ 1… j ] and match a gap
with x[ i ]
• choose highest scoring choice to fill in F [ i, j ]
DP Algorithm for Global Alignment
with Linear Gap Penalty
• one way to specify the DP is in terms of its
recurrence relation:
F (i − 1, j − 1) +s ( xi, yj )
F (i, j ) = max F (i − 1, j ) + g
F (i, j − 1) + g
10
Initializing Matrix: Global
Alignment with Linear Gap Penalty
A G C
0 g 2g 3g
A g
A 2g
A 3g
C 4g
DP Algorithm Sketch
• initialize first row and column of matrix
• fill in rest of matrix from top to bottom, left
to right
• for each F [ i, j ], save pointer(s) to cell(s)
that resulted in best score
• F [m, n] holds the optimal alignment score;
trace pointers back from F [m, n] to F [0, 0]
to recover alignment
11
DP Algorithm Example
• suppose we choose the following scoring scheme:
s(x[i], y[j]) =
+1 when x[i] = y[j]
-1 when x[i] <> y[j]
g (penalty for aligning with a gap) = -2
DP Algorithm Example
A G C
0 -2 -4 -6
A one optimal alignment
-2 1 -1 -3
x: A A A C
A y: A G - C
-4 -1 0 -2
A -6 -3 -2 -1
C -8 -5 -4 -1
12
DP Comments
• works for either DNA or protein sequences,
although the substitution matrices used
differ
• finds an optimal alignment
• the exact algorithm (and computational
complexity) depends on gap penalty
function (we’ll come back to this issue)
Equally Optimal Alignments
• many optimal alignments may exist for a given
pair of sequences
• can use preference ordering over paths when
doing traceback
highroad 1 lowroad 3
2 2
3 1
• highroad and loadroad alignments show the two
most different optimal alignments
13
Highroad & Lowroad Alignments
A G C
highroad alignment
0 -2 -4 -6
x: A A A C
A y: A G - C
-2 1 -1 -3
A -4 -1 0 -2 lowroad alignment
x: A A A C
A -6 -3 -2 -1 y: - A G C
C -8 -5 -4 -1
Dynamic Programming Analysis
• there are
2n (2n)! 2 2 n
= ≈
n ( n! ) 2
πn
possible alignments of length n
• e.g. two sequences of length 1000 have ≈ 10
600
possible alignments
• but the DP approach finds an optimal alignment
efficiently
14
Computational Complexity
• initialization: O(m), O(n)
• filling in rest of matrix: O(mn)
• traceback: O(m + n)
• hence, if sequences have nearly same
length, the computational complexity is
O (n 2 )
Local Alignment
• so far we have discussed global alignment,
where we are looking for best match
between sequences from one end to the
other.
• more commonly, we will want a local
alignment, the best match between
subsequences of x and y.
15
Local Alignment Motivation
• useful for comparing protein sequences that
share a common domain but differ
elsewhere
• useful for comparing against genomic
sequences (long stretches of
uncharacterized sequence)
• more sensitive when comparing highly
diverged sequences
Local Alignment DP Algorithm
• original formulation: Smith & Waterman,
Journal of Molecular Biology, 1981
• interpretation of array values is somewhat
different
– F [ i, j ] = score of the best alignment of a
suffix of x[1…i ] and a suffix of y[1…j ]
16
Local Alignment DP Algorithm
• the recurrence relation is slightly different than for
global algorithm
F (i − 1, j − 1) +s( xi, yj )
F (i − 1, j ) + g
F (i, j ) = max
F (i, j − 1) + g
0
Local Alignment DP Algorithm
• initialization: first row and first column initialized
with 0’s
• traceback:
– find maximum value of F(i, j); can be anywhere
in matrix
– stop when we get to a cell with value 0
17
Local Alignment Example
A A G A
0 0 0 0 0
T 0 0 0 0 0
T 0 0 0 0 0
A 0 1 1 0 1
A 0 1 2 0 1
G 0 0 0 3 1
x: A A G
y: A A G
18