BSC 4934 Homework 2

Due: July 26, 2010.

Part A: SwissPROT, UniPROT and PDB DATABASE.

We will now explore Swiss-Prot (now part of UniProt), the best-curated protein sequence database. A related database is TrEMBL, the automatically annotated, unreviewed counterpart of Swiss-Prot. Go to the SWISSPROT database. Search for P53. Confine your search to UniProtKB only. Since you already know its length, it should be easy to locate P53_HUMAN. Click on it and go to the entry for the protein. After the list of references on the protein, the comments fields tell us, among other things, that p53 acts as a tumor suppressor, and that its normal function is to stop cells from growing or to make them die at the right time (apoptosis). When something goes wrong with p53, cells can grow in an uncontrolled manner, a hallmark of cancer. Find the section titled "Natural Variations", and you will find a list of alternative splice variants and many mutations of the p53 gene that place a different amino acid at some position in the protein, making the person prone to getting cancer. These are usually SNPs (single nucleotide polymorphisms) that cause a substitution of one amino acid for another. Find the tumor-causing substitutions of R (arginine) at position 110.

Q1: What amino acid substitutions of the R at position 110 in the p53 protein are listed as involved in cancers? What SNPs might cause these?

To answer the above question, you will need to go back and find the three nucleotides in the p53 coding sequence that form the codon for the R at position 110. Then you will have to look in a table of the genetic code to see which codons code for the substituted amino acids, and which of those codons differ from the arginine codon by a single nucleotide. You can find the genetic code by clicking here.
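The codon lookup described above can be sketched in Python. This is only an illustration, assuming the standard genetic code; the function name `single_nucleotide_variants` and the example codon are hypothetical placeholders, so substitute the codon you actually find for position 110.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, with codons enumerated in T, C, A, G order at each position.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def single_nucleotide_variants(codon):
    """Return {mutant_codon: amino_acid} for every one-base change of `codon`."""
    codon = codon.upper().replace("U", "T")
    variants = {}
    for pos, original in enumerate(codon):
        for base in BASES:
            if base != original:
                mutant = codon[:pos] + base + codon[pos + 1:]
                variants[mutant] = CODON_TABLE[mutant]
    return variants


if __name__ == "__main__":
    example_codon = "CGA"  # placeholder arginine codon, NOT taken from the p53 record
    for mutant, aa in sorted(single_nucleotide_variants(example_codon).items()):
        print(example_codon, "->", mutant, aa)
```

Each codon has nine single-nucleotide neighbors; an amino acid substitution that appears among them can be explained by one SNP, while any other substitution would need two or more changes.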

SWISSPROT gives extensive cross-references to other databases, including GenBank and its mirror at EMBL (the European Molecular Biology Laboratory), PIR (the Protein Information Resource), and PDB, the Protein Data Bank, a database of three-dimensional protein structures. We'll look at PDB later in the assignment. The fact that P53 has PDB links implies that the protein structure has been determined by crystallography or NMR methods.

For now, find the PFAM entry in the p53 SWISSPROT record and click on it. PFAM is a database of multiple alignments of related protein sequences. Sets of protein sequences that have evolved from a common ancestor are very useful in understanding and predicting aspects of protein structure and function. Read the description of the P53 family. Click on "Alignments". (Under "View Options" you can select "Jalview", a Java tool for viewing the multiple alignment, if you want to explore this further.) You see a "seed" alignment of 7 protein sequences. You can also look at the "full" alignment of 288 proteins from the family. Scan across, and note that some regions of the protein are more highly conserved than others.

In the early days, Amos Bairoch, the designer of SWISSPROT, and his collaborators put a lot of effort into developing generalized "signature" motifs that allow particular substitutions in particular places in the motif, in hopes of finding motifs that would have no false positives or false negatives for a given protein family. The motif database they produced is called PROSITE.

Go back to the SWISSPROT p53 page and click on the PROSITE link. Study the entry. The PROSITE entry proposes the completely conserved motif M-C-N-S-S-C-[MV]-G-G-M-N-R-R as a signature motif for the p53 family, and they tested this pattern at the time this work was done, concluding that it found no false positives or false negatives. However, the database has grown considerably since then, as has our ability to locate likely orthologs. We now know that hidden Markov models (HMMs) are a better way to define characteristic "patterns" that are present in protein families, which can then be used to find new members of that family. Feel free to explore this aspect of PFAM further. Also explore the links for "HMM Logo" to obtain a signature pattern in Logo format.
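A PROSITE pattern such as the one above maps directly onto a regular expression, which makes it easy to scan a sequence for the motif. The converter below is a minimal sketch covering only the common pattern elements (residues, `x`, `[...]`, `{...}`, repeat counts, and the `<` and `>` anchors); the function name `prosite_to_regex` and the toy sequence are made up for illustration and are not part of PROSITE.

```python
import re


def prosite_to_regex(pattern):
    """Translate a PROSITE-style pattern into a Python regular expression."""
    parts = []
    for element in pattern.rstrip(".").split("-"):
        prefix = suffix = ""
        if element.startswith("<"):      # anchored at the N-terminus
            prefix, element = "^", element[1:]
        if element.endswith(">"):        # anchored at the C-terminus
            suffix, element = "$", element[:-1]
        # Separate the residue specification from an optional (n) or (n,m) repeat.
        core, low, high = re.fullmatch(
            r"(.+?)(?:\((\d+)(?:,(\d+))?\))?", element
        ).groups()
        if core == "x":                  # any residue
            core = "."
        elif core.startswith("{"):       # any residue except those listed
            core = "[^" + core[1:-1] + "]"
        repeat = ""
        if low and high:
            repeat = "{%s,%s}" % (low, high)
        elif low:
            repeat = "{%s}" % low
        parts.append(prefix + core + repeat + suffix)
    return "".join(parts)


if __name__ == "__main__":
    signature = "M-C-N-S-S-C-[MV]-G-G-M-N-R-R"
    regex = prosite_to_regex(signature)
    toy_sequence = "AAAMCNSSCVGGMNRRAAA"  # made-up sequence for illustration
    hit = re.search(regex, toy_sequence)
    print(regex, hit.start() if hit else None)
```

A fixed pattern like this either matches or does not, which is why a single unexpected substitution in a true family member produces a false negative; profile HMMs score each position probabilistically instead.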

Now go back to the SWISSPROT record for human p53 and find the link for "Structures". Click on the link for "1tsr". Follow the "RCSB PDB" link. This gives access to information about the structure of the p53 protein from PDB. PDB is the Protein Data Bank, a repository of protein structures solved by X-ray crystallography or by NMR. Each solved structure has a 4-character identifier. This is the PDB record for 1TSR. Notice that 1TSR is the structure of the core DNA-binding domain of the protein (here defined as residues 102-292) bound to a piece of DNA. p53 is a DNA-binding protein that can influence another protein by binding upstream of (i.e., on the 5' side of) its gene and thereby altering the way the gene for that other protein is transcribed, in this case by causing more copies of the protein to be made. In this structure, p53 is "caught in the act", so to speak.

Part B: BLAST and PSI-BLAST.

As before, find the sex-determining region Y (SRY) gene from the human genome in the "Gene" database from the NCBI Entrez portal. Again, as before, click on "reference sequence details" and find the protein sequence for this gene (click on the ID that starts with "NP_"). Print out this sequence in FASTA format.

Q2: Print the SRY protein in FASTA format. Write down the RefSeq ID and the GI number of this protein. Now run it through BLASTP (there is a convenient link on the right of the GenBank page to do BLAST) and answer the following questions. Make sure you confine your search to the RefSeq database.

  1. What scoring matrix and gap penalties were used?
  2. What values of K and λ (lambda) were used for calculating the Expect scores for the gapped alignment? (You can find these by clicking on "Search Summary".) Where do these values come from?
  3. The score shown in the program output is in units of "normalized bits" = [(λ × raw score) - ln K] / ln 2. The raw score is shown in parentheses. What are the units of the raw score (those of the BLOSUM62 matrix)? Calculate the raw score in bits from the "normalized bits" for the top three hits.
  4. How many database sequences were searched?
  5. The best hit must be a "self-hit". Is the alignment of the next highest scoring sequence (from an organism different from human) significant and why? What condition should be tested to decide significance?
  6. What was the lowest reported score in this search, and is this score significant? Why?
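The relationship between raw scores, normalized bit scores, and Expect values in the questions above can be checked numerically. The sketch below assumes the standard Karlin-Altschul relations, bits = (λ × raw - ln K) / ln 2 and E = m × n × 2^(-bits), where m and n are the effective query and database lengths. The values of λ and K in the example are placeholders; use the ones reported in your own "Search Summary".

```python
import math


def bit_score(raw_score, lam, k):
    """Convert a raw alignment score to a normalized bit score."""
    return (lam * raw_score - math.log(k)) / math.log(2)


def raw_score_from_bits(bits, lam, k):
    """Invert bit_score: recover the raw score from a normalized bit score."""
    return (bits * math.log(2) + math.log(k)) / lam


def expect_value(bits, query_len, db_len):
    """Expected number of chance hits scoring at least `bits`.

    query_len and db_len are the effective lengths of the query and database.
    """
    return query_len * db_len * 2.0 ** (-bits)


if __name__ == "__main__":
    lam, k = 0.267, 0.041  # placeholder values; read yours from the Search Summary
    raw = 100              # hypothetical raw score
    bits = bit_score(raw, lam, k)
    print(round(bits, 1), round(raw_score_from_bits(bits, lam, k), 1))
```

Because the bit score already folds in λ and K, bit scores from searches run with different scoring systems can be compared directly, whereas raw scores cannot.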

We are now going to try PSI-BLAST. Read about it by going to the following tutorial (Click here). PSI-BLAST is a version of the BLAST algorithm that uses the results from an initial search for similar protein sequences to construct a position-specific scoring matrix that can then be used for additional rounds of searches, called iterations. The variability found in each column of the scoring matrix allows additional sequences that have different combinations of amino acids in the sequence positions to be found. The algorithm provides a rapid but less precise search than other methods because the scoring matrix produced is only approximate and includes most of the original query sequence. (Caution: The iterations can lead to more sequences being added that do not share a region in common with the original query sequence, but instead share a totally different region with some of the added sequences; i.e., these new sequences are not true family members but foreigners.) The process will stop when no more sequences are found. The user can control the number of sequences to be included at each iteration or else use the score cutoff recommended by the program. The method is often used to perform a rapid and preliminary search for members of a sequence family. The found sequences can then be multiply aligned by other better-defined methods.

Go to the NCBI BLAST page and click on "PSI-BLAST". Cut and paste the SRY protein into the PSI-BLAST form. Make sure you confine your search to the RefSeq database and not to the nr database. It will find you a large number of hits labeled "Results of PSI-BLAST iteration 1". After studying this page, click on "Run PSI-BLAST iteration 2". Inspect the results and click to run more iterations. In each iteration it spreads its net wider and finds close relatives of the close relatives.

Q3: After iteration 1, how many hits were "Sequences with E-value WORSE than threshold"? How many "New" hits did you get in iteration 2? Continue up to iteration 4 and, in each case, find the number of "New" hits. What was the E-value of the worst hit in each iteration? PSI-BLAST was designed by Altschul, Madden, Schäffer, and others. Go to the Entrez database browser and look for their publication on this topic. Instead of searching in the protein, nucleotide, or genome databases of Entrez, search in the publications database (PubMed). Type in the authors' names and perform the search. The authors appear to have at least 3 publications together, of which two are on PSI-BLAST. Access these publications and repeat one of the experiments reported in the second NAR paper from 2001 (say, the one using sequence GI:4982166). You do not need to report this in your homework.


Part C: Multiple Alignment.

CLUSTALW is a widely used multiple sequence alignment tool. This assignment will expose you to the features and capabilities of this program.

  1. Go to Entrez
  2. You need to download the protein U1A from four different organisms (human, mouse, Xenopus laevis, and Drosophila melanogaster). For some of these organisms, Entrez may return several proteins; pick the one that corresponds to U1A. For each of the four sequences, open the FASTA report and copy them all into one file.
  3. Go to CLUSTALW.
  4. Type in your e-mail address. Either upload your file or paste your sequences into the appropriate box. Now run CLUSTALW.
  5. Study the output that you get within a few moments.
  6. Try the "JalView" option.
  7. JalView has an option to mail yourself the postscript version of the alignment. Try this option.
How long are the 4 sequences? What are the pairwise alignment scores?
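Before submitting the combined file to CLUSTALW, it can help to confirm that it contains exactly four records and to note each sequence length. The following is a minimal FASTA reader written for that purpose; the function name `parse_fasta`, the file name `u1a.fasta`, and the record names in the usage note are hypothetical.

```python
def parse_fasta(text):
    """Parse FASTA-formatted text into {record_id: sequence}.

    The record id is the first whitespace-delimited word after '>'.
    """
    chunks = {}
    current = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            current = line[1:].split()[0]
            chunks[current] = []
        elif current is not None:
            chunks[current].append(line)
    return {name: "".join(parts) for name, parts in chunks.items()}


def sequence_lengths(text):
    """Return {record_id: length} for every record in the FASTA text."""
    return {name: len(seq) for name, seq in parse_fasta(text).items()}


if __name__ == "__main__":
    import sys
    # Usage (hypothetical file name): python fasta_lengths.py u1a.fasta
    if len(sys.argv) > 1:
        with open(sys.argv[1]) as handle:
            for name, length in sequence_lengths(handle.read()).items():
                print(name, length)
```

If the script reports more or fewer than four records, or two records with the same id, fix the file before running the alignment.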

Q4: What do the "*", ":", and the "." in the alignment indicate? Consult the substitution matrix values, if necessary. What sequence formats are supported by CLUSTALW? You can download your own version of CLUSTAL (called ClustalX) from ftp://ftp-igbmc.u-strasbg.fr/pub/ClustalX/ Download the appropriate version for your machine. Versions for other operating systems also exist.


Part D: Hidden Markov Models.

Here is a small alignment of 12 members of a DNA sequence family.
(column:  1234)
seq1      GATC
seq2      CTAG
seq3      GATC
seq4      CC-G
seq5      GATC
seq6      CC-G
seq7      GTAC
seq8      CG-G
seq9      GCGC
seq10     CTAG
seq11     GATC
seq12     CTAG
Build a profile HMM of this alignment. I suggest you start with a standard model consisting of four match states, four insert states, and four delete states; match state 1 is assigned to the symbols in column 1, etc.

Q5: Draw a profile HMM in terms of states (circles) and state transitions (arrows). Make sure you remove all edges with zero transition probabilities.

Q6: Calculate the emission probability parameters for A, C, G, T in match state 1 by looking at the characters in column 1. Column 3 has gap symbols, which would be assigned to delete state 3. Calculate the scores (log_2 probabilities) for the match_2 -> match_3 state transition and the match_2 -> delete_3 state transition.
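The counting behind these parameters can be sketched in a few lines of Python. This is one possible approach, assuming that every alignment column is treated as a match column and that probabilities are plain maximum-likelihood frequencies with no pseudocounts; if the method discussed in class adds pseudocounts, the numbers will differ. The function names are illustrative only.

```python
import math
from collections import Counter

ALIGNMENT = [
    "GATC", "CTAG", "GATC", "CC-G", "GATC", "CC-G",
    "GTAC", "CG-G", "GCGC", "CTAG", "GATC", "CTAG",
]


def match_emissions(alignment, column, alphabet="ACGT"):
    """Emission frequencies for the match state of a 0-based column."""
    residues = [seq[column] for seq in alignment if seq[column] != "-"]
    counts = Counter(residues)
    return {symbol: counts[symbol] / len(residues) for symbol in alphabet}


def transitions_into(alignment, column):
    """Transition frequencies from match state `column - 1` into `column`.

    Only sequences with a residue in the previous column start in its match state.
    """
    from_match = [seq for seq in alignment if seq[column - 1] != "-"]
    to_delete = sum(1 for seq in from_match if seq[column] == "-")
    total = len(from_match)
    return {"match": (total - to_delete) / total, "delete": to_delete / total}


def log2_score(probability):
    """Log-base-2 score; a zero probability maps to negative infinity."""
    return math.log2(probability) if probability > 0 else float("-inf")


if __name__ == "__main__":
    print(match_emissions(ALIGNMENT, 0))          # match state 1
    for target, p in transitions_into(ALIGNMENT, 2).items():
        print("match_2 ->", target + "_3", log2_score(p))
```

Transitions whose frequency comes out as zero are the edges that Q5 asks you to remove from the drawing.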


Part E: Logo images of Transcription Factor Binding Sites

The following are potential binding sites for the transcription factor AlgU in the organism Pseudomonas aeruginosa PAO1.

CTGAACTTGT osmC operon
GTGAACTTTG phuR operon
AAGAACTTTG oprF operon
TGGAACTTCA lptA operon
TGGAACTTGG ycfJ operon
TGGAACTTTC betT2 operon
CCGAACTTTG dksA operon
TGGAACTTCT tal operon
GAGAACTTTT algU operon
TGGAACTTTC algU operon
GGGCACTTTT algR operon
AGGAACTTAT rpoH operon
CGGAACTTCC algD-algA operon

Q7: Go to weblogo and create a "logo" for this binding site. Paste the resulting image into the document you submit for this homework. Explain in a few sentences how to interpret this image. Next build a profile matrix (PSSM) for this alignment as discussed in class.
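To connect the logo with the underlying numbers, the sketch below computes per-column base counts, the information content of each column (the total stack height in a logo, in bits), and a log-odds profile matrix. It assumes a uniform background of 0.25 per base and a pseudocount of 1, and it omits the small-sample correction that logo software may apply; the method discussed in class may use different choices. All function names are illustrative.

```python
import math

SITES = [
    "CTGAACTTGT", "GTGAACTTTG", "AAGAACTTTG", "TGGAACTTCA", "TGGAACTTGG",
    "TGGAACTTTC", "CCGAACTTTG", "TGGAACTTCT", "GAGAACTTTT", "TGGAACTTTC",
    "GGGCACTTTT", "AGGAACTTAT", "CGGAACTTCC",
]


def count_matrix(sites, alphabet="ACGT"):
    """Per-column base counts: a list with one {base: count} dict per position."""
    length = len(sites[0])
    return [
        {base: sum(site[i] == base for site in sites) for base in alphabet}
        for i in range(length)
    ]


def information_content(counts):
    """Column information in bits: log2(alphabet size) minus Shannon entropy."""
    total = sum(counts.values())
    entropy = -sum(
        (c / total) * math.log2(c / total) for c in counts.values() if c > 0
    )
    return math.log2(len(counts)) - entropy


def pssm(sites, pseudocount=1.0, background=0.25, alphabet="ACGT"):
    """Log-odds (base 2) scores of each base at each position versus background."""
    matrix = []
    for counts in count_matrix(sites, alphabet):
        total = sum(counts.values()) + pseudocount * len(alphabet)
        matrix.append({
            base: math.log2((counts[base] + pseudocount) / total / background)
            for base in alphabet
        })
    return matrix


if __name__ == "__main__":
    for position, counts in enumerate(count_matrix(SITES), start=1):
        print(position, counts, round(information_content(counts), 2))
```

Columns where one base dominates carry close to 2 bits and appear as tall single letters in the logo, while columns with mixed bases carry little information and appear as short stacks.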


Part F: UCSC Browser

Go to the UCSC genome browser. Click on "Genome Browser" from the menu on the left column. Type in "p53" in the box titled "position or search term" in the human genome, and click on "Submit". You will get a lot of hits for the search. Make sure you pick the right one and click on it. You should see a complicated browser view with navigation tools on top and several tracks in the main part of the figure.

Q8: Which chromosome is this gene on? Write down its coordinates according to the UCSC browser. Is it on the positive strand or the reverse strand?

Get used to the UCSC genome browser by navigating left and right and by zooming in and out. You can also change the view by clicking on "Configure" and deciding what things you want to or do not want to see on your browser.