COP 6936/CAP 6990 Homework 2

Due: October 17, 2002.

Part A: Multiple Alignment.

CLUSTALW is a widely used multiple sequence alignment tool. This assignment will expose you to the features and capabilities of this program.

  1. Go to Entrez
  2. Download the U1A protein from four different organisms (human, mouse, Xenopus laevis, and Drosophila melanogaster). For some of these organisms, Entrez may return several proteins. Pick the ones with the following accession numbers: human - 2554638; mouse - 543325; Xenopus laevis - 65181; Drosophila melanogaster - 1173325. You can verify that you have the correct sequences by looking at the GenPept Report for each sequence you picked. For each of the four sequences, open the FASTA report, and copy all four into one file.
  3. Go to CLUSTALW.
  4. Type in your e-mail address. Either upload your file or paste your sequences into the appropriate box. Now run CLUSTALW.
  5. Study the output, which should arrive within a few moments.
  6. Try the "JalView" option.
  7. Go back to the CLUSTAL page, and try it again after changing some of the settings. Try a different substitution matrix. Also try different gap penalties.
  8. JalView has an option to mail yourself the postscript version of the alignment. Try this option.
  9. Go back to the CLUSTAL page, and change the "TREE TYPE" to "phylip". Look at the tree that is output.
  10. Pick 4 more (new) sequences that are related to the U1A proteins. Now align all 8 of the sequences using the "TREE TYPE" of "phylip".
How long are the 4 sequences? What are the pairwise alignment scores?
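
As a quick check on sequence lengths, the combined FASTA file can be parsed with a few lines of Python. This is only a sketch: the function name read_fasta and the toy records below are made up for illustration and are not real U1A data.

```python
def read_fasta(text):
    """Parse FASTA-formatted text into a dict mapping header -> sequence."""
    records = {}
    header = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new record starts; everything after '>' is the header.
            header = line[1:]
            records[header] = []
        elif header is not None:
            # Sequence data may be wrapped over several lines.
            records[header].append(line)
    return {h: "".join(parts) for h, parts in records.items()}


# Hypothetical toy input, standing in for the file built in step 2.
example = """>toy1
MAVPE
TRPN
>toy2
MAIPE
"""

lengths = {h: len(s) for h, s in read_fasta(example).items()}
print(lengths)  # {'toy1': 9, 'toy2': 5}
```

To use it on your own file, replace the toy string with the contents of the file you assembled from the FASTA reports.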

Q1: What do the "*", ":", and the "." in the alignment indicate? Consult the substitution matrix values, if necessary. Were there any differences in the alignment when you tried two different substitution matrices (PAM and BLOSUM)?

Q2: What sequence formats are supported by CLUSTALW?

Q3: How can the tree information be interpreted? Can you draw the tree that you obtained when you used the "TREE TYPE" of "phylip" with the 8 sequences that you aligned? What do the numbers in the tree information mean?
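
The "phylip" tree output is written in the parenthesized Newick format, where each leaf name is followed by a colon and a number. The sketch below pulls those name/number pairs out of a tree string so they are easier to read before drawing the tree by hand. The tree string and leaf names are hypothetical, and the regular expression is a simplification that assumes plain alphanumeric leaf names.

```python
import re

# Hypothetical Newick string; a real one comes from the CLUSTALW tree output.
newick = "((toyA:0.1,toyB:0.2):0.05,toyC:0.3);"

# Match "name:number" pairs. Numbers attached to internal nodes follow a ")"
# rather than a name, so they are not matched by this pattern.
leaf_pattern = re.compile(r"([A-Za-z_][\w.]*):([0-9.]+)")

leaf_values = {name: float(value) for name, value in leaf_pattern.findall(newick)}
print(leaf_values)  # {'toyA': 0.1, 'toyB': 0.2, 'toyC': 0.3}
```

The nesting of the parentheses gives the grouping of the leaves; this sketch deliberately ignores it and only lists the per-leaf numbers.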

Q4: Find the "center" and "consensus" sequences (as defined in class) for the multiple alignment that you found.
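
The definitions given in class are not reproduced in this handout, so the following sketch relies on two common conventions, which are assumptions: the consensus takes the most frequent symbol in each column, and the center is the aligned row with the smallest total number of mismatches to all other rows (gaps counted as ordinary symbols). If the class definitions differ, adjust accordingly. The toy alignment is hypothetical.

```python
from collections import Counter


def consensus(rows):
    """Most frequent symbol in each column of the alignment."""
    return "".join(Counter(col).most_common(1)[0][0] for col in zip(*rows))


def mismatches(a, b):
    """Number of aligned positions where two rows differ."""
    return sum(1 for x, y in zip(a, b) if x != y)


def center(rows):
    """Row with the smallest total mismatch count to all other rows."""
    return min(rows, key=lambda r: sum(mismatches(r, other) for other in rows))


# Hypothetical toy alignment (all rows have the same length).
toy = ["ACGT", "ACGA", "TCGT", "AC-T"]

print(consensus(toy))  # ACGT
print(center(toy))     # ACGT
```

In this toy case the two coincide; in general the consensus need not be one of the input sequences, while the center always is.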


Part B: Hidden Markov Models.

Here is a small alignment of 12 members of a DNA sequence family.
(column:  1234)
seq1      GATC
seq2      CTAG
seq3      GATC
seq4      CC-G
seq5      GATC
seq6      CC-G
seq7      GTAC
seq8      CG-G
seq9      GCGC
seq10     CTAG
seq11     GATC
seq12     CTAG
Suppose you were to build a profile HMM of this alignment. The profile has four match states; match state 1 is assigned to the symbols in column 1, etc.

Q1: Draw a profile HMM in terms of states (circles) and state transitions (arrows). You need to use the "Learning Algorithm" we discussed in class for HMMs. Note that unless you remove states that have no probability of being reached from the "Begin" state, you will be unable to work out this problem by hand.

Q2: Calculate the emission probability parameters for A,C,G,T in match state 1 (column 1). Use a maximum likelihood estimate, i.e., the ratio of the frequency of that character being emitted to the sum of the frequencies of all the characters.
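
A hand calculation of the maximum likelihood estimate can be checked with a short script. The sketch below counts the symbols in one column of the alignment above and divides by the column total; the function name emission_probs is made up for illustration, and gap symbols are assumed not to count as emissions.

```python
from collections import Counter

# The 12-sequence alignment from Part B, seq1 through seq12.
ALIGNMENT = [
    "GATC", "CTAG", "GATC", "CC-G", "GATC", "CC-G",
    "GTAC", "CG-G", "GCGC", "CTAG", "GATC", "CTAG",
]


def emission_probs(alignment, column, alphabet="ACGT"):
    """Maximum likelihood emission probabilities for one match state.

    `column` is 0-based, so match state 1 is column 0.
    """
    residues = [row[column] for row in alignment if row[column] != "-"]
    counts = Counter(residues)
    total = len(residues)
    return {symbol: counts[symbol] / total for symbol in alphabet}


probs = emission_probs(ALIGNMENT, 0)
print(probs)  # {'A': 0.0, 'C': 0.5, 'G': 0.5, 'T': 0.0}
```

Note that symbols never seen in a column get probability zero under this estimate.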

Q3: Using the above answer, calculate the "log odds scores" (equal to the log of the ratio of its emission probability to its background frequency) for A,C,G,T in match state 1. Assume that the expected background frequencies of A,C,G,T are each 0.25. Use log base two so your scores are in units of bits.
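
The log odds conversion can be sketched as follows. The script recomputes the column 1 frequencies so that it runs on its own; the name log_odds_bits is hypothetical, and representing a zero-probability symbol as negative infinity is a choice made here, not something the assignment specifies.

```python
import math
from collections import Counter

COLUMN_1 = "GCGCGCGCGCGC"  # column 1 of the alignment, seq1 through seq12
BACKGROUND = 0.25          # expected background frequency of each nucleotide


def log_odds_bits(column, alphabet="ACGT", background=BACKGROUND):
    """log2(emission probability / background frequency) for each symbol."""
    counts = Counter(column)
    total = len(column)
    scores = {}
    for symbol in alphabet:
        p = counts[symbol] / total
        # log2(0) is undefined; treat an unseen symbol as a score of -infinity.
        scores[symbol] = math.log2(p / background) if p > 0 else float("-inf")
    return scores


scores = log_odds_bits(COLUMN_1)
print(scores)  # {'A': -inf, 'C': 1.0, 'G': 1.0, 'T': -inf}
```

A positive score means the symbol is more likely under the match state than under the background; a score of negative infinity means the model assigns it no probability at all.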

Q4: Column 3 has gap symbols which would be assigned to delete state 3. Calculate the scores (log_2 probabilities) for the match_2 -> match_3 state transition and the match_2 -> delete_3 state transition.
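
The transition counts can be tallied the same way as emissions. In this sketch, every sequence that has a residue in column 2 contributes one transition out of match state 2: to match state 3 if it has a residue in column 3, or to delete state 3 if it has a gap there. The variable names are made up for illustration.

```python
import math

# The 12-sequence alignment from Part B, seq1 through seq12.
ALIGNMENT = [
    "GATC", "CTAG", "GATC", "CC-G", "GATC", "CC-G",
    "GTAC", "CG-G", "GCGC", "CTAG", "GATC", "CTAG",
]

# Rows that pass through match state 2 (0-based column 1 holds a residue).
from_match_2 = [row for row in ALIGNMENT if row[1] != "-"]

to_match_3 = sum(1 for row in from_match_2 if row[2] != "-")
to_delete_3 = sum(1 for row in from_match_2 if row[2] == "-")

p_mm = to_match_3 / len(from_match_2)   # 9 / 12
p_md = to_delete_3 / len(from_match_2)  # 3 / 12

score_mm = math.log2(p_mm)
score_md = math.log2(p_md)
print(p_mm, p_md)          # 0.75 0.25
print(score_mm, score_md)  # about -0.415 and exactly -2.0
```

Unlike the emission scores, these are plain log probabilities with no background term, which is why both values are negative.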

Q5: Calculate the HMM log odds score (in bits) for the sequence

 GAAG 
and the sequence
 GATC
Notice that columns 1-4 and 2-3 covary as if they are Watson-Crick base pairs. It would therefore seem that the sequence GAAG should not be a true member of the sequence family. (Hint: the score will be the sum of four emission log-odds probabilities and one state transition log probability, since all other state transitions have probability one in this case. Also, make the Viterbi assumption that the obvious alignment of the four symbols to the four match states is correct, so you do not need to sum over all possible paths.) Now recall the discussions we had in class about the disadvantages of HMMs for the next question.
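
The scoring recipe in the hint can be sketched in a few lines, which is handy for checking hand arithmetic. The sketch assumes the all-match path described in the hint, maximum likelihood parameters with no pseudocounts, and a background frequency of 0.25; the function names are made up for illustration.

```python
import math
from collections import Counter

# The 12-sequence alignment from Part B, seq1 through seq12.
ALIGNMENT = [
    "GATC", "CTAG", "GATC", "CC-G", "GATC", "CC-G",
    "GTAC", "CG-G", "GCGC", "CTAG", "GATC", "CTAG",
]
BACKGROUND = 0.25


def emission(column):
    """Maximum likelihood emission probabilities for a 0-based column."""
    residues = [row[column] for row in ALIGNMENT if row[column] != "-"]
    counts = Counter(residues)
    return {s: counts[s] / len(residues) for s in "ACGT"}


def match_to_match(column):
    """P(match at column + 1 | match at column), estimated from the rows."""
    occupied = [row for row in ALIGNMENT if row[column] != "-"]
    return sum(1 for row in occupied if row[column + 1] != "-") / len(occupied)


def log_odds_score(seq):
    """Score (in bits) of aligning seq symbol-by-symbol to the match states."""
    score = 0.0
    for i, symbol in enumerate(seq):
        p = emission(i)[symbol]
        if p == 0:
            return float("-inf")  # the model cannot emit this symbol here
        score += math.log2(p / BACKGROUND)
        if i + 1 < len(seq):
            score += math.log2(match_to_match(i))
    return score


score_gaag = log_odds_score("GAAG")
score_gatc = log_odds_score("GATC")
print(score_gaag, score_gatc)
```

Comparing the two printed values is a useful starting point for the next question.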

Q6: Is the HMM a good model of the pairwise correlations? Comment on the limitations of the HMM model.

Q7: [Extra Credit] How can you modify the HMM model so that it recognizes the correlation between locations? It may help to first ignore the correlation between locations 2-3 and only assume that locations 1-4 have a correlation.