String Alignment using Dynamic Programming


by Gina M. Cannarozzi

     Dynamic programming is an algorithm in which an optimization problem is solved by saving the optimal scores for the solution of every subproblem instead of recalculating them. In this biorecipe, we will use the dynamic programming algorithm to calculate the optimal score and to find the optimal alignment between two strings. First, we will compute the optimal alignment for every substring and save those scores in a matrix. For two strings, s of length m and t of length n, D[i,j] is defined to be the best score of aligning the two substrings s[1..j] and t[1..i].

    Several different kinds of string alignment can be done with the dynamic programming algorithm. For global alignment, the conditions are set such that we compute the best score and find the best alignment of two complete strings, while for local alignment, the conditions are such that we find the highest possible scoring substrings. For global alignment with cost-free end gaps we want to compute the best score for aligning two complete strings with no penalty for opening gaps on the ends. These three approaches differ in the base conditions, the methods of calculating the score and in the location of the endpoint. For global alignment, which we will consider in this biorecipe, the best score for the alignment is precisely D[m,n], the last value in the table. We will compute D[m,n] by computing D[i,j] for all values of i and j where i ranges from 0 to m and j ranges from 0 to n. These scores, the D[i,j] are the optimal scores for aligning every substring, s[1..j] and t[1..i].

    This is the standard dynamic programming approach. In general, dynamic programming has two phases- the forward phase and the backward phase. In the forward phase, we compute the optimal cost for each subproblem. In the backward phase, we reconstruct the solution that gives the optimal cost. For our string matching problem, specifically, we will:

(i)use the recurrence relation to do the tabular computation of all D[i,j] (forward phase), and
(ii)do a traceback to find the optimal alignment.

    The recurrence relation establishes a relationship between D[i,j] and values of D with indices smaller than i and j. When there are no smaller values of i and j then the values of D[i,j] must be stated explicitly in the base conditions. The base conditions are used to calculate the scores in the first row and column where there are no matrix elements above and to the left. This corresponds to aligning with strings of gaps. After the base conditions have been established, the recurrence relation is used to compute all values of D[i,j] in the table.

    The recurrence relation is:

   D[i,j] = max{ D[i-1,j-1] + sim_mat[s[j],t[i]],
                 D[i-1,j] + gap_score,
                 D[i,j-1] + gap_score }

    In other words, if you have an optimal alignment up to D[i-1,j-1] there are only three possibilities of what could happen next:

1. the characters for s[i] and t[j] match,
2. a gap is introduced in t and
3. a gap is introduced in s

    It is not possible to add a gap to both substrings. The maximum of the three scores will be chosen as the optimal score and is written in matrix element D[i,j].

    The second phase is to construct the optimal alignment by tracing back in the matrix any path from D[m,n] to D[0,0] that led to the highest score. More than one possibility can exist because it is possible that there are more than one ways of achieving the optimal score in each matrix element D[i,j].

Initialization

      Now we will run an example on two strings of DNA, ACGGTAG and CCTAAG. First, the two strings are assigned to variables, s and t and their lengths to n and m, respectively.

s := 'ACGGTAG'; t := 'CCTAAG';
s := ACGGTAG
t := CCTAAG
n := length(s); m := length(t);
n := 7
m := 6

    A matrix D is created to save the optimal dynamic programming scores for every pair of substrings up to that point. This matrix has dimensions (m+1) by (n+1). At the beginning the matrix looks like this:

  ACGGTAC
 0       
C        
C        
T        
A        
A        
G        
D := CreateArray(1..m+1, 1..n+1 ,0):

Create a Scoring Matrix

    A scoring matrix for scoring substitutions, matches, and gap creation is needed. This scoring matrix is chosen arbitrarily but of course it is preferable if the scoring has some basis in reality. Because the strings in this example are DNA, the scoring matrix used is appropriate for DNA.

    The four DNA bases are of two types, purines (A and G) and pyrimidines (T and C). The purines are chemically similar to each other and the pyrimidines are chemically similar to each other. Therefore, we will penalize substitutions between a purine and a purine or between a pyrimidine and a pyrimidine (transitions) less heavily than substitutions between purines and pyrimidines (transversions). We will use the following matrix for substitutions and matchings. The score is 2 for a match, 1 for a purine with a purine or a pyrimidine with a pyrimidine, and -1 for a purine with a pyrimidine.

A C G T
A 2-1 1-1
C-1 2-1 1
G 1-1 2-1
T-1 1-1 2

    For more information on scoring matrices, see the following biorecipe: Nigerian Prince Scam.

     Create an array called sim_mat with these values and assign the score for introducing a gap. In this case, a constant gap penalty of -2 is the score of aligning a character of either sequence with a gap. The variable gap_score is assigned with the gap penalty of -2.

sim_mat := [ 
[2, -1, 1, -1], 
[-1, 2, -1, 1],
[1, -1, 2, -1],
[-1, 1, -1, 2]]: 

gap_score := -2:

Initialize the dynamic programming calculation using base conditions

     The first element of the matrix that is filled in is the D[1,1] which is assigned 0. There is no penalty or score of aligning nothing with nothing. Recall that to calculate matrix element D[i,j], the values of D[i-1,j-1], D[i,j-1] and D[i-1,j] are needed. The first row and column of the dynamic programming matrix are the scores of aligning the subsequences s[1..j] and t[1..i] against a string of gaps of length j or i, respectively. Knowing this, the scores of the first row and column can be calculated.

D[1,1] := 0;
D[1,1] := 0

     The recurrence relation for the first row and column is:

     D[1,j] = D[1,j-1] + gap_score;

     D[i,1] = D[i-1,1] + gap_score;

     Put these into loops and initialize the first row and column.

for j to n do
  D[1,j+1] := gap_score*j:
od:

for i to m do
  D[i+1,1] := gap_score*i:
od:

print(D);
   0  -2  -4  -6  -8 -10 -12 -14
  -2   0   0   0   0   0   0   0
  -4   0   0   0   0   0   0   0
  -6   0   0   0   0   0   0   0
  -8   0   0   0   0   0   0   0
 -10   0   0   0   0   0   0   0
 -12   0   0   0   0   0   0   0

     Each score is the highest score of aligning up to that point. Matrix element D[1,3] is -4 which is the best score of aligning AC against two gaps at that point.

Calculate all D[i,j]

    As stated earlier, the recurrence relation for the rest of the table is:

    D[i,j] = max { D[i-1,j-1]+ sim_mat[s[j],t[i]], D[i-1,j] + gap_score, D[i,j-1] + gap_score };

    1 If the highest score comes from the element D[i-1,j-1] then characters s[i] and t[j] match.

    2 If the highest score comes from the element D[i-1,j] then a gap in string s is matched with a character in string t.

    3 If the highest score comes from the element D[i,j-1] then a gap in string t is matched with a character in string s.

    Loop over i and j and calculate the values for the 3 possibilities and then assign the maximum value to D[i,j]. Once the table is filled in, the score of the Global alignment is in matrix element D[m + 1,n + 1];

for i from 2 to m + 1 do
  for j from 2 to n + 1 do
    match := D[i-1,j-1] + sim_mat[BToInt(s[j-1]),BToInt(t[i-1])];
    gaps := D[i,j-1] + gap_score;
    gapt := D[i-1,j] + gap_score;
    D[i,j] := max(match,gaps,gapt);
  od;
od;

print(D);
   0  -2  -4  -6  -8 -10 -12 -14
  -2  -1   0  -2  -4  -6  -8 -10
  -4  -3   1  -1  -3  -3  -5  -7
  -6  -5  -1   0  -2  -1  -3  -5
  -8  -4  -3   0   1  -1   1  -1
 -10  -6  -5  -2   1   0   1   2
 -12  -8  -7  -3   0   0   1   3
score := D[m+1,n+1];
score := 3

Do the Traceback to create the alignment

    To create the traceback, reconstruct the path from position D[m+1,n+1] to D[1,1] that led to the highest score.

     Start at matrix element D[m + 1,n + 1] which contains the best score of the global alignment. Initialize two variable s_aln and t_aln to empty strings and then concatenate the aligned characters to the beginning of the string as we do the traceback. The characters are concatenated to the beginning because we start at the end of the strings, element D[m+1,n+1], and work backwards through the strings to the starting element D[1,1]. It is possible that there is not one unique path backwards through the scoring matrix, i.e. some of the highest scores for matching two substrings, s[j] and t[i], could have come from more than one matrix element. Only one arbitrary path of the set of possible paths that led to the highest score is traced back through the D matrix. Although, other equivalent paths may exist only one is returned.

    Assign variables.

i := m + 1; j := n + 1; 
i := 7
j := 8

    Assign empty strings to hold the final alignment.

t_aln := '': s_aln := '': 

    To traceback the path that led to the optimal score, we start at the end, matrix element D[m,n] and redo the calculations in the backwards direction to find out from where we came. In other words, from box D[i,j] we recalculate the scores in boxes D[i-1,j-1], D[i-1,j], and D[i,j-1]. By doing this we can reconstruct the path from which we came. If we came from the D[i-1,j-1] then we add the next character from s and t to s_aln and t_aln. If we came from D[i,j-1] then we add a gap to t_aln and the next character in s to s_aln. If we came from D[i-1,j], then we will add a gap to s_lan and the next character in t to t_aln. As this loop is written, our priority is to take the path through the diagonal first (if it exists). Adding a gap to t second priority and adding a gap to s third. Again, this choice of priority is completely arbitrary.

while i > 1 and j > 1 do
  if D[i,j] - sim_mat[BToInt(s[j-1]),BToInt(t[i-1])] = D[i-1,j-1] then
    t_aln := t[i-1].t_aln:
    s_aln := s[j-1].s_aln:
    i := i-1: j := j-1:
  elif D[i,j] - gap_score = D[i,j-1] then
    s_aln := s[j-1].s_aln:
    t_aln := '_'.t_aln:
    j := j-1:
  elif D[i,j] - gap_score = D[i-1,j] then
    s_aln := '_'.s_aln:
    t_aln := t[i-1].t_aln:
    i := i-1:
  else
    error('should not happen'):
  fi:
od: 

    At this point either i or j (or both) should be one. Finish to the ends by adding gaps and the rest of the remaining string until both counters, i and j, are equal to one.

if j > 1 then  
  while j > 1 do
    s_aln := s[j-1].s_aln:
    t_aln := '_'.t_aln:
    j := j-1; od;
elif i > 1 then 
  while i > 1 do
    s_aln := '_'.s_aln;
    t_aln := t[i-1].t_aln;
    i := i-1; od;
fi;

    Print the alignment.

printf('%s\n%s\n',s_aln,t_aln);
ACGGTAG
CCTA_AG

    For a local alignment, negative scores are not allowed. For this reason the first row and column are filled with zeros and if a negative number is the maximum score for a D[i,j] then it is replaced by a zero. At the end, the traceback is done from the highest score backwards until the first zero is reached.

    For a global alignment with cost free end gaps, the base condition is that each element in the first row and column is assigned the value zero, then the scores are computed as in the global alignment but the traceback is done from the highest score in the last row or column backwards to the first row or column.

© 2011 by Gina Cannarozzi, Informatik, ETH Zurich

Index of bio-recipes

Last updated on Thu Oct 13 13:35:24 2011 by zo

!!! Dieses Dokument stammt aus dem ETH Web-Archiv und wird nicht mehr gepflegt !!!
!!! This document is stored in the ETH Web archive and is no longer maintained !!!