COT 6936 / CAP 6990: Homework #3

Due: November 14, 2002

Part A: Hidden Markov Models Parameters and Scoring (using HMMER)

You can use HMMs to build a profile of any alignment. Suppose you have many examples of a motif, together with a multiple alignment of those examples. An HMM trained on the alignment gives you a way to detect other instances of the motif, along with a score for how strong each prediction is. We will use the HMMER software to try this on a well-known motif called the Helix-Turn-Helix motif. These are DNA-binding motifs, and are present in a large number of proteins. They can also be detected using GCG software. However, we are going to build our own detection method here using HMMs.

  1. Familiarize yourself with the simple features of HMMER.
  2. Download the set of examples of Helix-Turn-Helix motifs (alignFile) that are present in known protein sequences. You will notice that these examples are already aligned and only contain the motif region. The file contains 88 examples.
  3. Train your HMM with these 88 examples, and build a profile. You need to use the following command:
    hmmbuild HMMParametersFile alignFile
  4. Download a new protein sequence (try Mu MOR protein with accession number 138783) in FASTA format.
  5. Align this sequence to the HMM Profile you built above. Use the command:
    hmmalign HMMParametersFile seqFile
  6. What does it say about the location of the Helix-Turn-Helix Motif in Mu MOR protein?
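
To get some intuition for what a profile scores, here is a minimal sketch in Python. It is NOT what HMMER does internally: a real profile HMM also has insert and delete states and carefully chosen priors. The sketch only builds a gap-free log-odds matrix from an ungapped alignment and slides it along a sequence. The function names, the uniform background frequency of 1/20, the pseudocount of 1, and the toy alignment are all assumptions made for illustration.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_log_odds(alignment, pseudocount=1.0):
    """Return one {residue: log2-odds score} dict per alignment column."""
    background = 1.0 / len(AMINO_ACIDS)  # assumption: uniform background
    length = len(alignment[0])
    if any(len(row) != length for row in alignment):
        raise ValueError("all aligned motifs must have the same length")
    profile = []
    for col in range(length):
        counts = {aa: pseudocount for aa in AMINO_ACIDS}
        for row in alignment:
            if row[col] in counts:  # ignore gaps and unknown symbols
                counts[row[col]] += 1
        total = sum(counts.values())
        profile.append(
            {aa: math.log2(counts[aa] / total / background) for aa in AMINO_ACIDS}
        )
    return profile


def best_window(profile, sequence):
    """Slide the profile along the sequence.

    Returns (start, score) for the best-scoring window, with a 0-based
    start, or None if the sequence is shorter than the profile.
    """
    width = len(profile)
    best = None
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        score = sum(col.get(aa, 0.0) for col, aa in zip(profile, window))
        if best is None or score > best[1]:
            best = (start, score)
    return best


# Toy data, made up for illustration (not the real Helix-Turn-Helix alignment).
toy_alignment = ["QRELA", "QKELA", "QRDLA"]
toy_profile = build_log_odds(toy_alignment)
toy_hit = best_window(toy_profile, "MMMMQRELAGGG")
```

Here toy_hit holds the start position and log-odds score of the window that looks most like the aligned examples. HMMER reports the analogous information for the real profile.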

Part B: Understanding Protein Structures

CASP is a contest held every other year in which "target" protein sequences with no known structure are proposed, and experts worldwide are asked to use their programs to predict the structures of these target sequences. Later, the structures are solved by X-ray crystallography or other experimental means, and everyone meets to evaluate how well the computational prediction methods did. The results for the best predictions appear in a special issue of the journal Proteins.

In the fold recognition category of CASP, the participants are asked to find very, very remote homologies. They are judged not so much on the detailed accuracy of the three-dimensional coordinates that they predict (nobody using currently known methods would be judged as very successful if that were the case), but rather on whether they predict the overall fold of the protein correctly or not. Participants can do this by searching the PDB database of protein sequences whose structure is known, and predicting which sequences in PDB, if any, have a structure similar to that of the target sequence, sometimes without constructing an explicit three-dimensional model for the structure of the target sequence. These methods are very useful for practicing molecular biologists who want to get a rough idea of what a new protein they have discovered might look like or what its function might be.

Go to the CASP4 homepage. Find the link to the page describing the target sequences that you are supposed to predict structure for. Go there. Click on the target labeled T0106. That is the one we will try. This protein is the secreted frizzled protein 3, from mouse (Reference: Dann, CE; Hsieh, JC; Rattner, A; Sharma, D; Nathans, J; Leahy, DJ; (2001) Insights Into Wnt Binding and Signaling from the Structures of Two Frizzled Cysteine-Rich Domains. Nature, 412:86-90.) You get a page giving information about this protein. This is what the CASP contestants were given as information about T0106. At the bottom, click on "Template Sequence file" and save this file as "T0106.seq". This is a FASTA format file containing the amino acid sequence of T0106. The protein is also said to have the GenBank accession number AAC53147.1. Be warned, however, that if you look at the protein with accession number AAC53147.1, its amino acid sequence is different from the one provided to you in T0106.seq (it seems to have some extra residues at the start and end of the protein, and also differs at location 17).

Go to the BLAST database search page, paste the T0106 sequence in, and do a BLAST search, making sure that you restrict your search to "pdb" instead of "nr". (Feel free to try any of the more advanced versions of BLAST as well. PSI-BLAST is generally the best, and is the most popular.) You should definitely find 1IJX. It gets a great E-value. Look at the alignment and you will see that this is T0106 itself. Since the time of the contest, the structure of T0106 has been solved and deposited in the PDB database. You should get one other "hit", for 1IJY. Look at its E-value and its alignment.

Q1: Define E-values. List the other "hit" for the above search and its E-value. Do you think it is a real homology or a chance match?
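
As a numerical aid for thinking about E-values, here is a minimal sketch of the Karlin-Altschul relationship between a raw alignment score, its bit score, and the expected number of chance hits. The function names and the example numbers are made up for illustration. BLAST itself uses "effective" query and database lengths, which this sketch ignores, so do not expect it to reproduce the values BLAST reports exactly.

```python
import math


def bit_score(raw_score, lam, k):
    """Convert a raw score S to bits: S' = (lambda * S - ln K) / ln 2."""
    return (lam * raw_score - math.log(k)) / math.log(2)


def e_value(bits, query_len, db_len):
    """Expected number of chance hits scoring at least `bits`: m * n * 2^(-S')."""
    return query_len * db_len * 2.0 ** (-bits)


# Made-up numbers: a 100-residue query against a 1,000,000-residue database.
example_e = e_value(30.0, 100, 1_000_000)
```

Note that every additional bit of score halves the E-value, and that the same score becomes less significant as the database grows.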

Also, the BLAST search should point out that the protein is essentially one domain called the frizzled domain (labeled "FRI"). If you click on the red band labeled FRI, it will do a "Conserved Domain Search" for the FRI domain. The search returns two entries: one is an entry from the SMART database for close relatives of the protein 1IJX; the other is for the protein 1IJY.

Q2: Print out the alignments that you get from following the 2 links from the Conserved Domain Search corresponding to PSSM IDs 3814 and 6437. What are the Scores and E-values of the 2 alignments?

Go to the FSSP database of protein structure classification. Type in 1IJX and hit "return". FSSP uses a special, nonredundant subset of PDB that contains only "representative" protein chains, no two of which are identical or almost identical. This page shows that 1ijxA (chain A of the PDB entry 1IJX) is chosen by FSSP as the representative sequence for all 8 PDB sequences: itself, 1ijxB through 1ijxF (the other chains in the PDB entry 1IJX, which are all identical to 1ijxA), 1ijyA and 1ijyB. Now click on 1ijxA. You get a list of representative PDB sequences that are structurally related to 1IJX. For each PDB structure you see its 4-letter PDB identifier (and possibly another letter to identify a particular chain), a "Z" score indicating how similar its structure is to that of 1IJX (a high Z score, say Z > 7, indicates a close structural relationship), the length of the part of both sequences that can be aligned ("LALI"), the total length of the second sequence ("LSEQ2"), the percentage of aligned amino acids in one sequence that are the same as the corresponding amino acids in the other sequence ("%IDE"), and the common protein name for the second sequence.
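
To make the LALI and %IDE columns concrete, here is a minimal sketch of the arithmetic on an alignment that is already given. FSSP derives its alignments from structural superposition, which this sketch does not attempt. The function name and the two toy aligned strings are made up for illustration.

```python
def alignment_stats(aligned_a, aligned_b, gap="-"):
    """Return (aligned_length, percent_identity) for two gapped strings.

    Only columns where neither sequence has a gap count as aligned.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned strings must have the same length")
    pairs = [(a, b) for a, b in zip(aligned_a, aligned_b) if a != gap and b != gap]
    aligned_length = len(pairs)
    identical = sum(1 for a, b in pairs if a == b)
    percent = 100.0 * identical / aligned_length if aligned_length else 0.0
    return aligned_length, percent


# Toy alignment: 6 aligned columns, 5 of which are identical.
toy_lali, toy_ide = alignment_stats("ACDEF-G", "ACEEFKG")
```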

Notice that several new proteins are listed along with the Z score corresponding to a structure-structure alignment with 1IJX. These structure-structure comparisons are performed by the Dali search engine, written by Liisa Holm.

Q3: Read the FSSP and DALI web sites and define Z-scores. Based on the Z scores, which proteins are predicted to be structurally and functionally similar to 1IJX? There is another column called RMSD. What does it refer to? Explain.
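
For reference while reading those web sites, here is a minimal sketch of the two generic quantities involved. The rmsd function assumes the two coordinate lists are already superposed and paired atom by atom; it does not search for the optimal superposition, which structure comparison programs must do first. The z_score function is the textbook definition. How DALI chooses the mean and standard deviation it compares against is described on its web site and is not reproduced here. All names and numbers are made up for illustration.

```python
import math


def rmsd(coords_a, coords_b):
    """Root-mean-square deviation between paired, already superposed atoms."""
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("need two non-empty coordinate lists of equal length")
    squared = sum(
        (xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
        for (xa, ya, za), (xb, yb, zb) in zip(coords_a, coords_b)
    )
    return math.sqrt(squared / len(coords_a))


def z_score(value, mean, std_dev):
    """Number of standard deviations by which value exceeds the mean."""
    return (value - mean) / std_dev


# Two atoms: one coincides, the other is displaced by 2 units along z.
toy_rmsd = rmsd([(0, 0, 0), (1, 0, 0)], [(0, 0, 0), (1, 0, 2)])
```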

Now look at the list of structural neighbors of 1IJX. There are 10 proteins listed besides 1IJX. Select all the neighbors by clicking on the box to the left of each one. Now try "Multiple alignment (narrow)" on these proteins. You can also try "Multiple families (narrow)" on these proteins with, say, 1ijxA and 1aa7A. However, note that the resulting output is quite large. Finally, try a "3-d Superposition" on 1ijx and 1ijy (see Q5). Also try the pair 1ijx and 1aa7A.

Q4: Comment on the results returned.

One way to view protein 3-d structures from PDB is to download RasMol or Chime, free software for molecular visualization. Both are easy to install on a PC running Windows. Simply follow the instructions at Download Info.

There is a better solution: use Protein Explorer. The advantage is that you do not need to install RasMol on your machine. But you will need to download the Chime plug-in to be able to view all the cool stuff! Until recently, Protein Explorer could only be accessed using Netscape (and not IE. These guys must hate Microsoft!). But now, they say something much milder: "Although PE works satisfactorily for most purposes in IE, a few functions of PE work better in Netscape 4.7x/4.8x than in IE. If you have a choice and are comfortable in Netscape, we recommend that you return to this site in Netscape 4.7x or 4.8x."

View the protein "1ijx" with Protein Explorer. Try "toggling" the spinning and turning the molecule manually with your mouse. Try to hide the water molecules. Click on various locations on the image, and read out what amino acid residues you are clicking on. Click on "Explore More". Try various "Display" (especially Cartoon) options. Try various "Color" options (especially Structure). Read the notes on these options. Click on one of the points of the image. If you clicked on a residue of the figure, the message frame tells you what location on the protein you clicked on and what the residue was.

Q5: Why do you have 6 disconnected images? Locate at least one disulfide bond and write down the location and the amino acid residues at its two ends. How many atoms are there in the structure including the water molecules, and excluding the water molecules? When you hide the water molecules, there is one small structure with some red atoms that are still shown on the image. Click on it and write down what it is and where it is located. What is its nearest amino acid and what chain is it on? How many alpha helices and beta strands can you identify in the protein? After clicking on "Explore More" choose the "Structure" option under "Color" to see the alpha helices and beta strands clearly. 

Now explore one more structure. Go to the Protein Explorer homepage and type in the PDB ID code "1LCD". This is not simply a protein, but a complex with the protein (Lac repressor) binding to a DNA fragment at the helix-turn-helix motif of the protein. Identify the protein molecule, and the two strands of the DNA molecule. Locate the sodium atom (Na), which should be a large blue colored atom (it would help to hide the water molecules). Locate the helix that is inserted into the major groove of the DNA molecule (this is where the binding of the protein and the DNA takes place). In homework #1, we looked at 1TSR, which was also a protein-DNA complex. I encourage you to view 1TSR using Protein Explorer. Can you locate the DNA strand and the 3 zinc atoms in the complex?

At the bottom of the Protein Explorer page, you will find a link to the Protein Comparator. It helps to compare 2 PDB structures. Go to the Comparator page and try out all the features with the proteins "1agj" and "5ptp".

Finally, let's study a PDB entry. Go to PDB and find the structure for "1ijx". "Explore" the "1ijx" link. Note the people responsible for figuring out this structure. This structure was determined using X-ray crystallography, instead of using NMR. Click on "View Structure" on the panel on the left, and then click on one of the 4 still images for the structure. Then click on "Download/Display file", and click on the PDB text or html option for viewing with full coordinates. Study it carefully and answer the following questions:

Q6: What is the resolution of the 1ijx crystal structure (also provided in the summary page)? How many chains are there in the protein and what are their lengths? How many atoms are there in the protein? How many solvent atoms and heterogen atoms are provided in this structure? How many residues were not located by the experiment? How many helices are present in chain A of the protein and what are their locations? How many disulfide bonds are there in this structure? Are any of the disulfide bonds between atoms of different chains? Write down the 3-dimensional coordinates of the central carbon atom (marked CA) of the fourth residue in chain B. Can you locate the entries for the 5 atoms of the sulfate ion? There are coordinates for the oxygen atoms of the water molecules of this complex. How many such oxygen atoms are listed in this structure? What do the entries labeled "CONECT" signify? Choose one of the CONECT entries and explain it.
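
If you would rather count records with a short script than by eye, here is a minimal sketch of a reader for the fixed-column ATOM, HETATM and CONECT records of a PDB text file. The column positions follow the PDB file format. The function names are hypothetical, and the sample records below are made up for illustration; they are NOT taken from 1ijx. Check any count you get this way against the file itself.

```python
def parse_pdb(lines):
    """Return (atoms, conect) parsed from the lines of a PDB text file."""
    atoms = []
    conect = {}
    for line in lines:
        record = line[0:6].strip()
        if record in ("ATOM", "HETATM"):
            atoms.append({
                "record": record,
                "serial": int(line[6:11]),
                "name": line[12:16].strip(),      # e.g. CA, N, O
                "res_name": line[17:20].strip(),  # e.g. CYS, HOH, SO4
                "chain": line[21],
                "res_seq": int(line[22:26]),
                "xyz": (float(line[30:38]), float(line[38:46]), float(line[46:54])),
            })
        elif record == "CONECT":
            # First serial is the atom; the rest are the atoms bonded to it.
            fields = [line[i:i + 5].strip() for i in range(6, 31, 5)]
            serials = [int(f) for f in fields if f]
            if serials:
                conect.setdefault(serials[0], []).extend(serials[1:])
    return atoms, conect


def count_waters(atoms):
    """Count water oxygens (residue name HOH)."""
    return sum(1 for a in atoms if a["res_name"] == "HOH")


def ca_coordinates(atoms, chain):
    """List (residue number, xyz) for each CA atom of a chain, in file order."""
    return [
        (a["res_seq"], a["xyz"])
        for a in atoms
        if a["record"] == "ATOM" and a["chain"] == chain and a["name"] == "CA"
    ]


# Made-up sample records (illustration only, not from a real entry).
SAMPLE = [
    "ATOM      1  N   CYS B   4      11.104   6.134  -6.504  1.00 20.00           N",
    "ATOM      2  CA  CYS B   4      11.639   6.071  -5.147  1.00 20.00           C",
    "HETATM  900  S   SO4 B 301       1.000   2.000   3.000  1.00 30.00           S",
    "HETATM  901  O1  SO4 B 301       2.000   2.000   3.000  1.00 30.00           O",
    "HETATM 1001  O   HOH A 201      10.000  11.000  12.000  1.00 25.00           O",
    "CONECT  900  901",
]
sample_atoms, sample_conect = parse_pdb(SAMPLE)
```

To run it on the real entry, save the PDB text file and pass its lines to parse_pdb, for example with open("1ijx.pdb") as handle: atoms, conect = parse_pdb(handle). The file name here is an assumption about where you saved the download.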

Click on "Geometry" and look at the list of "Dihedral Angles", "Common Bond Angles", and "Bond Lengths". Note which dihedral angle displays the largest deviation from the average. The scale for the deviations is given at the bottom of the page with all the dihedral angles. Follow the links for "Ramachandran Plot" to see how the software from "Sting Millennium" has integrated all the information into a nice graphics-based package. Follow the "Motif Summary" link to identify all the motifs in the protein. Study other links to understand all the information that is available for a PDB entry.
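
If the notion of a dihedral angle is unfamiliar: it is the angle between the plane through the first three of four consecutive atoms and the plane through the last three. Here is a minimal sketch of the standard computation from four sets of coordinates. The function names and the test coordinates are made up for illustration, and the sketch says nothing about which four atoms the PDB pages use for each listed angle.

```python
import math


def _sub(a, b):
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])


def _dot(a, b):
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]


def _cross(a, b):
    return (
        a[1] * b[2] - a[2] * b[1],
        a[2] * b[0] - a[0] * b[2],
        a[0] * b[1] - a[1] * b[0],
    )


def dihedral(p0, p1, p2, p3):
    """Signed dihedral angle in degrees defined by four points.

    Returns 0 when p0 and p3 are on the same side of the p1-p2 bond (cis)
    and 180 (or -180) when they are on opposite sides (trans).
    """
    b1 = _sub(p1, p0)
    b2 = _sub(p2, p1)
    b3 = _sub(p3, p2)
    n1 = _cross(b1, b2)  # normal to the plane through p0, p1, p2
    n2 = _cross(b2, b3)  # normal to the plane through p1, p2, p3
    b2_len = math.sqrt(_dot(b2, b2))
    x = _dot(n1, n2)
    y = _dot(b2, _cross(n1, n2)) / b2_len
    return math.degrees(math.atan2(y, x))


# Made-up coordinates: the two end atoms are a quarter turn apart.
toy_angle = dihedral((1, 0, 0), (0, 0, 0), (0, 0, 1), (0, 1, 1))
```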

Go back to the PDB homepage and type in the code "1cop". This structure was determined using NMR methods and not using X-ray crystallography. On the left panel, click on "NMR Restraints". Save the compressed file on your computer and open it. Study this file and make sure you understand the format of this file. For those of you who attended the talk on October 11 by Prof. Narsingh Deo, this data file should make sense. After some header information, the distance restraints for the protein Lambda CRO are provided. It is in dimeric form, which means that it has two chains. The distance restraints are followed by the dihedral angle constraints. Make sure you understand why there are two dihedral angle constraints per amino acid residue. They then provide 20 different models that satisfy these constraints. 

Once again, Part B of this homework is inspired in part by a homework designed by David Haussler for his course on Bioinformatics at University of California, Santa Cruz.