CAP 5510 / CGS 5166 Homework 2

Due: September 12, 2018

Using Entrez, GenBank, SwissPROT, Pfam, PROSITE, and BLAST.

Go to the Entrez database browser at the National Center for Biotechnology Information (NCBI). NCBI is a division of the National Library of Medicine (NLM) at the National Institutes of Health (NIH). This page will soon become our portal of choice, our default start point for any search and exploration. You may want to bookmark it so that you can go there easily. It should take you to a webpage with the title "GQuery: Global Cross-database NCBI search - NCBI". It is your gateway to Entrez, which has been referred to as "The Life Sciences Search Engine". Study this page and browse through the various databases that are available from this portal. The databases are grouped into Literature, Health, Genomes, Gene, Protein, and Chemical.

p53 is a tumor protein associated with the regulation of cell growth. It is found to be mutated or inactivated in 60% of hereditary cancers. In this assignment we will use some of the key bioinformatics tools and databases on the web to explore p53.

You will search for the protein "p53". Make sure you search in the protein database by clicking on "Protein". On Aug 29, 2018, this gave me 75146 hits. (The same search three years ago had only 33590 hits.) Modify your search to look for "p53 human" and you still get 10530 hits. Obviously, humans do not carry 10,000+ copies of p53. However, the database contains many updates, mutants and partials. Now modify your search as follows: Delete the phrase "p53 human" you typed earlier for the search. Then click on "Advanced"; in the Builder, click on "Protein Name" and type p53; then click on "Organism", type "human", and click on "AND". This should enter the phrase "(p53[Protein Name]) AND Human[Organism]" for the search. Now click on "Search" to launch the search. I still get 59 hits. All except one (58th hit) are "partials". Figure out what that means. We could have done this differently. Go back to the Entrez database browser and click on "Gene", taking you to Entrez Gene. This is a searchable database of genes from RefSeq Genomes. This time you should only get one hit. Before we continue our investigations into p53, I want you to read up on RefSeq. On a different tab, read about RefSeq. Accession formats are well specified for the RefSeq database. Later you can figure out the RefSeq identifiers for p53 nucleotide and protein sequences.
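The fielded query built in the Advanced search page can also be written by hand. The following is a minimal sketch, assuming you want to send the same query to NCBI's E-utilities ESearch service instead of the web form; the helper names `entrez_term` and `esearch_url` are made up for illustration, and the block only builds the URL without contacting NCBI.

```python
from urllib.parse import urlencode

# Base URL of the NCBI E-utilities ESearch service.
ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"

def entrez_term(pairs):
    """Join (value, field) pairs with AND, e.g. p53[Protein Name] AND Human[Organism]."""
    return " AND ".join(f"{value}[{field}]" for value, field in pairs)

def esearch_url(db, term, retmax=20):
    """Build (but do not send) an ESearch request for the given database and query."""
    return ESEARCH + "?" + urlencode({"db": db, "term": term, "retmax": retmax})

term = entrez_term([("p53", "Protein Name"), ("Human", "Organism")])
url = esearch_url("protein", term)
print(term)
print(url)
```

Pasting the printed URL into a browser returns an XML list of matching record IDs. The hit count will differ from the numbers quoted above because the database keeps growing.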

Q1: What is RefSeq and what are its distinguishing features? All RefSeq nucleotide and protein records start with two specific characters. What are they? Make sure you pay attention to the basic prefixes NC, NM, and NP.

Going back to our search for p53, try the same kind of advanced query as before, this time on the Gene Name field: "p53[Gene Name] AND Human[Organism]" at Entrez Gene. There is only one hit. In fact, it takes you directly to the page for the human gene. Here it is called TP53. Read the summary information on this gene and make sure you understand it. Now look at the section titled "Genomic context". Look at the genes adjacent to it on the chromosome. Genes can lie on the forward strand or the reverse strand, as indicated by the direction of the arrow.

Q2: What is its Official Symbol and Full name? What is HGNC (which provides the symbol and name information)? What chromosome is this gene on? What are its alternative names? What strand does p53 lie on? Write down the names of the genes adjacent to it along with the strand that the neighbors are on.

You can view a graphic of the Genomic Context in Full View or in the abbreviated version. Next look at the section titled "Genomic regions, transcripts, and products" and follow the link for "NCBI Reference Sequences". Notice the identifiers for the nucleotide and protein sequences of p53. As you saw earlier, these are typical of RefSeq identifiers. The graphic under the "Genomic context" link shows the location of the gene and its neighborhood. Under "Genomic regions, transcripts, and products", you can see the intron-exon structure of this gene. The graphic is small, and the details of the structure are hard to see clearly. We are going to navigate and study this in excruciating detail next. Before we do that, quickly scroll down this page and look at all the sections listed on this page. The different topics are summarized in the "Table of Contents" on the right side of the page. Before you finish this assignment, you need to understand the nature of the information on this page (regardless of whether or not I have explicit questions directing you).

In the section titled "Genomic regions, transcripts, and products", click on "reference sequence details". This is how we access RefSeq information on p53. The RefSeq genomic entry NG_017013.2 provides links to the RefSeq entry and can be viewed as a GenBank, FASTA, or Graphics entry. Click on "GenBank" and study the GenBank entry. The Pevsner text has explanations of its contents. A GenBank entry gives you sequence details, tells you where the database submission came from, gives information about related sequences, and has citations to the scientific literature. Locate its "GI" number. After some preliminary information about the entry, there are some references cited for this entry. This gene/protein has clearly been researched extensively. Many more references are available from PubMed and we will inspect them later. Find the section titled "COMMENT". This provides a summary of information on the p53 gene. This is followed by a section on "FEATURES". Study this carefully. In the "CDS" (coding sequence) section, you will also find the link to the amino acid sequence of the corresponding protein (e.g., protein product NP_000537.3). Try displaying the sequences in FASTA format. You can also try out the XML and Graphics formats. Note that the actual sequence of nucleotides is at the bottom of the entry.
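The two identifiers quoted above, NG_017013.2 and NP_000537.3, share a common shape: a two-letter prefix, an underscore, a number, and a version after the dot. As a rough sketch of that structure (the function name `parse_refseq` is hypothetical, and the small prefix table lists only the two prefixes this paragraph mentions; the rest are left for Q1):

```python
import re

# Only the two prefixes that appear in the text above; extend it yourself for Q1.
KNOWN_PREFIXES = {
    "NG": "genomic region",
    "NP": "protein",
}

# Two capital letters, underscore, digits, dot, version digits.
ACCESSION = re.compile(r"^([A-Z]{2})_(\d+)\.(\d+)$")

def parse_refseq(accession):
    """Split a versioned RefSeq-style accession into its parts."""
    match = ACCESSION.match(accession)
    if match is None:
        raise ValueError(f"not a versioned RefSeq-style accession: {accession!r}")
    prefix, number, version = match.groups()
    return {
        "prefix": prefix,
        "number": number,
        "version": int(version),
        "molecule": KNOWN_PREFIXES.get(prefix, "not in this small table"),
    }

for acc in ("NG_017013.2", "NP_000537.3"):
    print(acc, parse_refseq(acc))
```

The version number increases whenever the sequence of a record changes, which is why the same accession can appear with different suffixes in older references.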

Q3: Explain the following terms in about one sentence each: LOCUS NAME, ACCESSION NUMBER, GI NUMBER, RefSeqID. How many nucleotides does the mRNA sequence for human p53 have? What is its GI number? How many residues does the protein sequence for human p53 have? What is its GI number? What were the start and stop codons in the mRNA sequence?

Go to the "Display Settings" and set it to the "Graphics" option. Eukaryotic genes consist of exons and introns. These are best viewed in the graphic display. The thick grey line shows the region of the chromosome being considered here. The thick green line represents the entire GenBank entry for this gene. The other lines show the locations of key features in this gene. There is also a navigational icon on the right-hand side to adjust the "Zoom level". Play with this to see the gene at different zoom levels. The dark blue rectangles are the exons in the mRNA sequence made by this gene. The magenta rectangles are coding exons. Only a part of the mRNA of this gene is translated by the ribosome into amino acid residues. Identify the portions of exons 1 and 2 that are not translated. These correspond to the 5' untranslated region (UTR) of the mRNA. Identify the portion of the last exon that is not translated (this is part of the 3' UTR). If you navigate to the coding exons, they even have the amino acid sequence written underneath them.
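The relationship between mRNA exons, coding exons, and UTRs described above comes down to interval arithmetic. Below is a small sketch, assuming 1-based inclusive coordinates (the convention in GenBank feature tables) listed in the direction of transcription. The exon and CDS coordinates are invented for illustration and are not those of TP53; the function names are hypothetical.

```python
def span_length(start, end):
    """Length of a 1-based, inclusive interval such as 201..300."""
    return end - start + 1

def split_exon(exon, cds):
    """Split one exon into (5' UTR, coding, 3' UTR) parts.

    Each part is a (start, end) tuple, or None when that part is empty.
    """
    (s, e), (cs, ce) = exon, cds
    utr5 = (s, min(e, cs - 1)) if s < cs else None
    coding = (max(s, cs), min(e, ce)) if max(s, cs) <= min(e, ce) else None
    utr3 = (max(s, ce + 1), e) if e > ce else None
    return utr5, coding, utr3

# Made-up example: three mRNA exons; translation starts inside exon 2
# and stops inside exon 3.
exons = [(1, 100), (201, 300), (401, 550)]
cds = (251, 452)

coding_total = 0
for number, exon in enumerate(exons, start=1):
    utr5, coding, utr3 = split_exon(exon, cds)
    if coding is not None:
        coding_total += span_length(*coding)
    print(f"exon {number}: length {span_length(*exon)}, "
          f"5' UTR {utr5}, coding {coding}, 3' UTR {utr3}")

print("total coding length:", coding_total)
```

In this toy example the first exon is entirely untranslated, which mirrors the situation described above where the 5' UTR spans exon 1 and part of exon 2.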

Q4: How many coding exons are there in this GenBank entry for the human p53 gene? What are their coordinates and lengths? How many (mRNA) exons? Write down their coordinates and lengths. Write down the coordinates of the untranslated 5' and 3' regions of this gene. Write down the amino acid sequence produced by the first coding exon (i.e., translated part of exon 2).

On the right side of the TP53 page, you will find a long list of Links. Click on "SNP: GeneView". SNP stands for single nucleotide polymorphisms. These are single nucleotide mutations or changes between different versions of the same gene. There is a table of all these SNPs and a graphic summary. Make sure you understand the color legend of the graphic summary. Read about the difference between a "synonymous" and a "non-synonymous" mutation.
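The synonymous versus non-synonymous distinction can be checked mechanically against the standard genetic code. The following is a minimal sketch; the function name `classify_snp` and the example codons are chosen for illustration only and are not taken from the TP53 SNP table.

```python
from itertools import product

# Standard genetic code, written in the conventional T, C, A, G order
# (TTT, TTC, TTA, TTG, TCT, ...). '*' marks a stop codon.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

def classify_snp(codon, position, new_base):
    """Classify a single-base change at a 0-based position within a codon."""
    mutated = codon[:position] + new_base + codon[position + 1:]
    if CODON_TABLE[mutated] == CODON_TABLE[codon]:
        return "synonymous"
    return "non-synonymous"

# GGT and GGC both encode glycine, so a third-position change is silent here.
print(classify_snp("GGT", 2, "C"))
# GAA encodes glutamate; GTA encodes valine.
print(classify_snp("GAA", 1, "T"))
```

A change that creates a stop codon is reported here as non-synonymous; many resources list such changes separately as nonsense mutations.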

Back to the TP53 page. Go to the "Bibliography" section and click on the "PubMed" link. This will take you to the PubMed page (read about PubMed) and will give you a link to each of the over 6750 publications that report on the p53 gene or protein. (There were only about 4300 three years ago.) Clicking on an article lets you read the abstract of the publication. Clicking on articles that have green or orange strips on them lets you download or read the corresponding publication. All of this indicates the critical role played by p53 and the amount of research that has gone into it. (Optional reading.) To find out more about p53, you are encouraged to look at some of the papers cited on this protein. Abstracts can be retrieved via the PubMed links in Entrez, but there are thousands of them!

The GeneRIF portion lists functions that p53 may be involved in. Click on "What's GeneRIF" and read more about it. Under the "Phenotypes" section, you can find out how p53 is linked to a variety of diseases. Follow the link to one of them (say, Breast Cancer) and find out how p53 may be involved in the disease. Under Pathways information, you can find various processes in which p53 is involved. Go to the "Gene Ontology" section. This annotation is provided by the Gene Ontology Consortium. Later this semester, we will learn more about GO.

We will now explore UniProtKB/Swiss-Prot, the best-curated protein sequence database. A related database is TrEMBL, which is the uncurated version of Swiss-Prot. Click on UniProtKB/SwissPROT. Go to the Advanced search page and search for P53 with "Organism" set to Human. Since you already know its length, it should be easy to locate P53_HUMAN. Click on it and go to the entry for the protein. The Function field tells us, among other things, that p53 acts as a tumor suppressor, and its normal function is to stop cells from growing or to make them die at the right time (apoptosis). When something goes wrong with p53, cells can grow in an uncontrolled manner, a hallmark of cancer. p53 has 5 main sites, one for interaction with DNA and 4 zinc-binding sites. Scan to near the bottom of the record, and you will find a list of many mutations of the p53 gene that cause it to make a different amino acid at some position in the protein, making the person prone to getting cancer. These are usually SNPs (single nucleotide polymorphisms) that cause a substitution of one amino acid for another. Find the tumor-causing substitutions of R (arginine) at position 110.

Q5: What amino acid substitutions of the R at position 110 in the p53 protein are listed as involved in cancers? What SNPs might cause these?

To answer the above question, you will need to go back and find the three nucleotides in the p53 coding sequence that form the codon for the R at position 110. Then you will have to look in a table of the genetic code to see which codons code for the substituted amino acids. You can find the genetic code by clicking here.
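The lookup described above can be organized as a search over all single-base changes of a codon. Here is a sketch of that procedure, assuming the standard genetic code. The codon CGA is only a placeholder (it is one of several arginine codons), so you still need to find the actual codon for position 110 yourself; the function name is hypothetical.

```python
from itertools import product

# Standard genetic code in T, C, A, G order; '*' marks a stop codon.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

def single_base_neighbors(codon):
    """Map each amino acid reachable by one base change to the codons that give it."""
    reachable = {}
    for position in range(3):
        for base in BASES:
            if base == codon[position]:
                continue
            mutated = codon[:position] + base + codon[position + 1:]
            reachable.setdefault(CODON_TABLE[mutated], []).append(mutated)
    return reachable

# Placeholder codon: CGA encodes arginine, but it may not be the codon used in p53.
neighbors = single_base_neighbors("CGA")
for amino_acid, codons in sorted(neighbors.items()):
    print(amino_acid, codons)
```

An amino acid substitution that does not appear in the output for the real codon would need more than one nucleotide change, which makes it a less likely result of a single SNP.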

UniProt gives extensive cross-references to other databases, including GenBank and the mirror site at EMBL (European Molecular Biology Laboratory), PIR (Protein Information Resource), and PDB, the Protein Data Bank, a database of three-dimensional protein structures. We'll look at this later in the assignment. The fact that P53 has PDB links implies that the protein structure has been determined by crystallography or NMR methods.

For now, find the PFAM entry in the p53 SWISSPROT record and click on it. PFAM is a database of multiple alignments of related protein sequences. Sets of protein sequences that have evolved from a common ancestor are very useful in understanding and predicting aspects of protein structure and function. Read the description of the P53 family. Click on "Get alignment". (The default "Colored Alignment" view works quite well. "Jalview" is a Java tool to look at the multiple alignment, if you want to explore this further.) You see a "seed" alignment of 9 protein sequences. P53_HUMAN is not on this list; the list contains similar proteins. If you look at the "full" alignment, you will find P53_HUMAN, although it shows only a small portion of it (318-359). Dashes are inserted so that the corresponding amino acids from all twelve organisms line up in columns. Scan across, and note that some regions of the protein are more highly conserved than others. Multiple alignments will be discussed in class soon.
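To make the idea of "more highly conserved" columns concrete, here is a toy sketch that scores each column of a gapped alignment by the fraction of sequences sharing its most common residue. The three short sequences are invented and are not taken from the PFAM alignment; this simple fraction is only one of many possible conservation measures.

```python
from collections import Counter

def column_conservation(alignment):
    """Per-column fraction of sequences that carry the most common residue.

    Gap characters ('-') are never counted as the common residue.
    """
    if len({len(row) for row in alignment}) != 1:
        raise ValueError("all aligned sequences must have the same length")
    scores = []
    for column in zip(*alignment):
        residues = [c for c in column if c != "-"]
        if not residues:
            scores.append(0.0)
            continue
        most_common_count = Counter(residues).most_common(1)[0][1]
        scores.append(most_common_count / len(column))
    return scores

# Invented example with three aligned sequences of equal length.
toy_alignment = ["MEEPQ", "MDEP-", "MEDPQ"]
scores = column_conservation(toy_alignment)
for index, score in enumerate(scores, start=1):
    print(f"column {index}: {score:.2f}")
```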

In the early days, Amos Bairoch, the designer of SWISSPROT, and his collaborators put a lot of effort into developing generalized "signature" motifs that allow particular substitutions in particular places in the motif, in hopes of finding motifs that would have no false positives or false negatives for a given protein family. The motif database they produced is called PROSITE.

Go back to the SWISSPROT p53 page and click on the PROSITE link. Study the entry. The PROSITE entry proposes the completely conserved motif M-C-N-S-S-C-[MV]-G-G-M-N-R-R as a signature motif for the p53 family, and they tested this pattern at the time this work was done, concluding that it found no false positives or false negatives. However, the database has grown considerably since then, as has our ability to locate likely orthologs. Later we'll see how hidden Markov models (HMMs) are a better way to define characteristic "patterns" that are present in protein families, which can then be used to find new members of that family. That is what the HMMs in PFAM are used for. Feel free to explore this aspect of PFAM further; we'll return to this in later assignments.
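A PROSITE pattern such as the one above is close to a regular expression. The sketch below converts the basic PROSITE syntax (single residues, `x` for any residue, `[..]` for allowed sets, `{..}` for forbidden sets, `(n)` or `(n,m)` repeats, and `<` or `>` anchors) into a Python regex. It is a simplified converter written for this handout, not the tool PROSITE itself uses, and the test sequence is made up.

```python
import re

# One PROSITE element: optional '<', a core, an optional repeat, optional '>'.
ELEMENT = re.compile(
    r"(<?)(x|[A-Z]|\[[A-Z]+\]|\{[A-Z]+\})(?:\((\d+)(?:,(\d+))?\))?(>?)"
)

def prosite_to_regex(pattern):
    """Translate a basic PROSITE pattern into a Python regular expression."""
    parts = []
    for element in pattern.rstrip(".").split("-"):
        match = ELEMENT.fullmatch(element)
        if match is None:
            raise ValueError(f"unsupported PROSITE element: {element!r}")
        start, core, low, high, end = match.groups()
        if core == "x":
            core = "."                         # any residue
        elif core.startswith("{"):
            core = "[^" + core[1:-1] + "]"     # forbidden residues
        repeat = ""
        if low:
            repeat = "{" + low + ("," + high if high else "") + "}"
        parts.append(("^" if start else "") + core + repeat + ("$" if end else ""))
    return "".join(parts)

signature = prosite_to_regex("M-C-N-S-S-C-[MV]-G-G-M-N-R-R")
print(signature)

# Made-up sequence containing the motif, just to show how a scan would work.
hit = re.search(signature, "AAMCNSSCVGGMNRRPP")
print(hit.start(), hit.group())
```

Scanning a whole database with such a pattern gives a yes or no answer per sequence, which is why the false positive and false negative counts mentioned above are the natural way to judge a PROSITE signature.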

Now go back to the SWISSPROT record for human p53 and find the list of PDB entries. Click on the ExPASy link for 1TSR. This gives access to information about the structure of the p53 protein from PDB. PDB is the Protein Data Bank, a repository of protein structures solved by X-ray crystallography or by NMR. Each solved structure has a 4-character identifier. This is the PDB record for 1TSR. This particular structure is p53 bound to DNA. Notice that 1TSR is the structure for the core DNA-binding domain of the protein (here defined as residues 102-292) bound to a piece of DNA. p53 is a DNA-binding protein that can influence another protein by binding in front of its gene (i.e., on the 5' side) and thereby altering the way the gene for that other protein is transcribed, in this case by causing it to make more copies of the protein. In this structure, p53 is "caught in the act", so to speak.

Follow the ExPASy PDB link for 1TSR. (If you follow the MMDB entry and then the PubMed link, it leads you to the 1996 Science paper that describes this structure. You can read the abstract online or go to the library to look at a copy of the paper.) Click on "Still image" to see a picture of the structure of p53.

We now move on to BLAST. Go to the BLAST homepage. Follow the link for Blasting 2 sequences (bl2seq). Choose the "blastp" program. Align the 2 proteins with Swiss-Prot IDs P53_HUMAN and P53_XENLA.

Q6: Print out the alignment output by BLAST.

Play around with the various options that BLAST offers and see how they affect the output. Go to the Standard Protein-Protein Blast page. Cut and paste a FASTA formatted version of p53 (gi 3041867) into the "Search" box. Then "Blast" it. When it returns, format it so that the Number of Descriptions is limited to 15, and the alignment view is "Pairwise". Study the pairwise alignment shown at the bottom of the results page. Now let's do the regular protein BLAST. Go to the BLAST homepage. Follow the link for protein-protein BLAST (blastp). Cut and paste the FASTA version of p53 (gi 3041867 or Swiss-Prot ID P53_HUMAN) in the box denoted "Search". For "Choose database", click on swissprot. Make sure the "Low Complexity Filter" is turned on (this is the default). Study the other default parameters used by blastp. Now click on "BLAST!". The search may take a few minutes to return. Which of the hits are significant, and why? The first hit on the results page corresponds to the original sequence itself. Pick one of the other hits and study the pairwise alignment. Read the BLAST tutorial pages and find answers to the following questions. There is no need to write down your answers. This is merely for your benefit. Find the definitions of the following terms: Score (bits), E-Value, Identities, Positives, and Gaps. Figure out how Score and E-Values are computed (see Mount's book). Figure out how E-Value is different from P-Value of an alignment.
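As a starting point for the Score and E-Value questions, the sketch below uses the standard Karlin-Altschul relations: bit score S' = (lambda * S - ln K) / ln 2, E = m * n * 2^(-S'), and P = 1 - e^(-E). The values lambda = 0.267 and K = 0.041 are assumed here as typical of gapped BLOSUM62 searches with gap costs 11 and 1; read the actual values from the summary at the bottom of your own BLAST report. The raw score and the two lengths are invented, and BLAST itself uses length-corrected "effective" values for m and n.

```python
import math

def bit_score(raw_score, lam, k):
    """Normalize a raw alignment score into bits."""
    return (lam * raw_score - math.log(k)) / math.log(2)

def e_value(bits, query_len, db_len):
    """Expected number of chance alignments scoring at least this many bits."""
    return query_len * db_len * 2.0 ** (-bits)

def p_value(e):
    """Probability of at least one chance alignment, given the expected count."""
    return -math.expm1(-e)  # equals 1 - exp(-e), accurate for small e

# Assumed statistical parameters and made-up search-space sizes.
LAMBDA, K = 0.267, 0.041
bits = bit_score(raw_score=200, lam=LAMBDA, k=K)
expect = e_value(bits, query_len=393, db_len=200_000_000)
print(f"bits = {bits:.1f}, E = {expect:.3g}, P = {p_value(expect):.3g}")
```

For small values E and P are nearly identical, while for large E the P-value saturates at 1. That saturation is the practical difference between the two measures.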

Now scroll down to the bottom of the results page. You will see a summary of your search. Figure out what the information down there means.

Q7 (do not submit): Repeat questions 2 through 4 for the human protein "insulin". Also study some of the interesting SNPs. This is just an exercise and is not for submission.

Q8: Run the Needleman-Wunsch global sequence alignment algorithm on the sequences "VEPPLSQETFSDLWKLLPENNVLSPL" and "MDPPLSQETFEDLWSLLPDPL" using the BLOSUM62 substitution matrix, with gap open and gap extension penalties of 11 and 1, respectively. Write down the optimal alignment and the alignment score.
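If you want to check your understanding of the recurrence before running a tool, the following is a minimal sketch of global alignment scoring with affine gap penalties (the Gotoh variant of Needleman-Wunsch). It computes only the optimal score, with no traceback. It assumes a gap of length k costs gap_open + k * gap_extend; some programs instead charge gap_open for the first gapped position, so check the convention of the tool you use. The `toy_score` function is a stand-in and is not BLOSUM62, which you would need to supply yourself.

```python
NEG_INF = float("-inf")

def global_affine_score(a, b, score, gap_open, gap_extend):
    """Optimal global alignment score; a gap of length k costs gap_open + k * gap_extend."""
    n, m = len(a), len(b)
    # M: a[i-1] aligned to b[j-1]; X: a[i-1] aligned to a gap; Y: b[j-1] aligned to a gap.
    M = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    X = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    Y = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    M[0][0] = 0
    for i in range(1, n + 1):
        X[i][0] = -(gap_open + i * gap_extend)
    for j in range(1, m + 1):
        Y[0][j] = -(gap_open + j * gap_extend)
    first = gap_open + gap_extend  # cost of the first position of a new gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            M[i][j] = score(a[i - 1], b[j - 1]) + max(
                M[i - 1][j - 1], X[i - 1][j - 1], Y[i - 1][j - 1])
            X[i][j] = max(M[i - 1][j] - first,
                          X[i - 1][j] - gap_extend,
                          Y[i - 1][j] - first)
            Y[i][j] = max(M[i][j - 1] - first,
                          Y[i][j - 1] - gap_extend,
                          X[i][j - 1] - first)
    return max(M[n][m], X[n][m], Y[n][m])

def toy_score(x, y):
    """Stand-in scoring: +1 for a match, -1 for a mismatch (not BLOSUM62)."""
    return 1 if x == y else -1

print(global_affine_score("ACGT", "AGT", toy_score, gap_open=2, gap_extend=1))
```

To recover the alignment itself, you would additionally record which of the three options produced each maximum and trace back from the final cell.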