Implement TEIRESIAS

Group Member:  Chengjun Zhan, Na Zhao
Nov. 19th, 2002
 

bullet

Project Idea

bullet Our Presentation
bullet

Introduction

bullet Background Definitions
bullet

TEIRESIAS Algorithm
Scan Phase
Convolution Phase  

bullet

Result & Conclution

bullet

References

Project Idea:

Implement TEIRESIAS: GYM is based on supervised pattern discovery techniques. TEIRESIAS is its unsupervised counterpart. The first part of TEIRESIAS has been implemented. Implement the second part of this.

The idea comes from: http://www.cs.fiu.edu/~giri/teach/6936/F02/ProjectIdeas2.pdf

Introduction

In the analysis of biological sequences, one of the important works is to locate the similarities between different sequences. Currently, several computational algorithms exist for the discovery of similar patterns in different gene or protein sequences. One such algorithm is the Teiresias algorithm proposed by Rigoutsos and Floratos [5]. This algorithm does not use alignment of sequences. Instead, it tries to discover patterns shared by the sequences. Hence it can locate the local similarities of sequences. Generally speaking, the Teiresias algorithm can be divided into two phases: a scan phase and a convolution phase [9]. Some former students have already finished the coding for the scan phase. So in this term project, only the second phasethe convolution phase is implemented and the program follows the style of the code of the scan phase. The program is implemented using Visual C++ 6.0.

The organization of this report is as follows. In section 2, some basic definitions of pattern discovery and Teiresias algorithm will be given. In section 3, two phases of Teiresias algorithms will be explained and the corresponding implementation methods (including the flow charts) will be given.  A sequences example and the related pattern discovery results will be shown in section 4 and a conclusion is also given in this section. Finally, in section 5, by comparing the Teiresias algorithm with another motif detection algorithm: GYM, some possible future works are discussed.

Background Definitions

In the research of the protein sequences, sometimes we may believe some proteins have similar biological properties on structural or functional feature. How to analyze and prove this kind of problems? Firstly, we need to group together these protein sequences. Then discover a set of common sub-sequences in this set of sequences which we have grouped. Finally, by studying these sub-sequences we can observe correlation within biological properties of the collection of sequences (typically a family, super family or sub family). This procedure is called “Pattern/Motif Discovery in Protein Sequences”. In contrast to pattern matching (solvable in polynomial time), Pattern/Motif discovery (an NP-hard problem) is a challenging problem. In order to facilitate the understanding of this problem, here we would like to give some basic definitions:

Pattern/Motif: a region or portion of a protein sequence. It has specific structure and functionally significant [10]. Protein family should have similar motifs so they can be characterized. That is to say, specific motifs may help to classify a protein.

Motif/Pattern Discovery: Detect motifs/patterns from known protein sequences. The result can be used to detect unknown protein sequences.

Pattern Alphabets: {S (Basic alphabet set); P (Character/Equivalency group alphabets); x | . (which stands for Wild card or Don’t care Alphabet)}

∑ (Basic alphabet set): The amino-acid names can be abbreviated as the listed symbols in alphabetical order (Please check reference [11] for detail information.) [11]:

∑={A,C,D,E,F,G,H,I,K,L,M,N,P,Q,R,S,T,V,W,Y}

P (Character/Equivalency group alphabets): is a character corresponding to a subset of Σ.  For example:          A-[CJ]-G

x (Wild-card or don’t care): is a special kind of ambiguous character that matches any character in ∑. For example, X in protein sequences and N in nucleotide

(L, W) pattern:  Pattern P is a (L, W) pattern if and only if P is a string of characters from Σ and wild cards ‘.’. P starts and ends with a character from Σ. Characters in Σ are called residues. Any sub pattern of P (i.e subsequence starting and ending with a character from Σ) containing exactly L non-wildcard characters (residues) has length of at most W [9]. For example, L=3 and W=5, “CD..E” is a (3, 5) pattern.

To a collection of sequences, if a certain pattern (common sub sequence) is characteristic, it may be important for the function or structure of the family. These kinds of patterns can be used for the prediction of the function or structure of unknown sequences. Based on their features, the hypothesis about new sub-families may be made. IBM Bioinformatics and Pattern Discovery Group have the corresponding implementation on the Teiresias algorithm [2] [3] [4].

Teiresias Algorithm

The basic idea of Teiresias algorithm is: If a pattern P is a (L, W) pattern occurring in at least K sequences, then its sub patterns are also (L, W) patterns occurring in at least K sequences [5][6]. (K>=2) That is to say, pattern P is more specific than pattern Q if we can get Q from P by removing several characters from P and/or replacing several non-wildcard characters with wildcard characters. For example: pattern “A.BC” is more specific than pattern “A..C”. Definitely, we can convert this idea and get the algorithm’s framework: First use the scan phase to discover the elementary patterns, and then extend them to get the final results.

The Teiresias algorithm operates in two phases [9]: “Scan” and “Convolution”. Firstly, scan the input sequences, identify all the elementary patterns and then select the elementary patterns which satisfy the minimum support; secondly sort the element patterns and extract useful prefixes and suffixes. Finally in the convolution phase, by convolving the elementary patterns, the maximal length and composition of the motifs will be detected.

 
  • Scan Phase

    The output for the scan phase is a set of “elementary patterns”. The elementary pattern should contain L residues, has a maximum length of W, and appear in at least K (K>=2) of the sequences in the population under study [8].

    In the following example, there are four sequences, we give the parameters as: K=2; L=3; W=5. The string “AST” can be found in three sequences (>K); its length is 3 (=L). So it can be an elementary pattern.

     

    Figure 2 shows the flow chart of the scan phase. An “Extend” procedure is designed here to extend the elementary patterns step by step. First empty the stack of elementary patterns. Then for each letter in the alphabet, count how many sequences contain this letter. If less than K sequences contain this letter, ignore it. Or the program will extend it until it is ignored or it is accepted.

    In the extend procedure, the given string will be added by all possible suffixes consisting a set of dot and then followed by a letter. If there are enough supports on the extended sequence, then accept or continue extending; otherwise it will be ignored.

    In our implementation, the patterns are represented as objects. Each pattern consists two parts, one is the pattern string itself, the other is an offset list which contains all pairs (x, y) such that string x matches pattern p at position y [8].

     

  • Convolution Phase

    The term “convolution” is used here to represent the operation of the second phase of Teiresias algorithm. The principle here is quite clear: “Given any pattern with at least L residues, the prefix of p is the unique substring containing the first L – 1 residues of p. Similarly the suffix of p is the unique substring containing the last L –1 residues of p. Given two patterns p and q, p convolve q is the empty pattern if the suffix of p is not the prefix of q, otherwise it is the pattern p followed by the part of q which remains when its prefix is removed. The offset list of p convolve q is simply computed from the offset lists of p and q separately. Note that the prefix of t = p convolve q is the prefix of p and the suffix of t is the suffix of q.” [8] The following figure shows an example of extending the elementary patterns.

    Based on the algorithm, the convolution phase can also be divided into two sub-phases. Firstly, we need to sort all the elementary patterns by pair-wise < and then pick up their prefixes and suffixes. This can be called as “Pre-Convolution Phase”.

    3.2.1 Pre-Convolution Phase

    The flow chart and a sample result of pre-convolution phase are shown in figure 4 and figure 5 respectively.  The main focus in this phase is to sort the elementary patterns. In our implementation, we used insertion sorting in generally. More over, we also designed a function big (string a, string b) to compare two elementary patterns.

    After sorting the elementary patterns, two structures are also produced: Left (prefixes) and Right (suffixes). Left [i] returns all the elementary patterns which can be convolved to the left of elementary pattern EP [i], and similarly Right [i] contains all the elementary patterns which could be convolved to the right of the elementary pattern EP [i]. “Each of the structures provide O(1) access to the needed patterns. However, Floratos implies that structures equivalent to Left and Right can be constructed in O(|EP|log|EP|) time, whereas the current implementation requires O(|EP|^2) time.” [8]

     

    3.2.2 Convolution Phase

    The main target of the operation of convolution is: For each elementary pattern P, try to extend the pattern with other elementary patterns by convolving them. The flow of the procedure can be summarized as:

    bullet Find a short pattern that appears in K input sequences
    bullet Extend them until the support go below K
    bullet Once we find pattern that cannot be extended further, the patterns in maximal and can be written to output.

    How to convolve and extend the patterns?  Here are the steps to follow:

    If there exists an elementary pattern Q can be glued to the left side of P.

             {Take such Q which is largest in suffix ordering.

               Let R be the pattern resulting from gluing Q to the left side of P

                If ((R’s occurrences >= K) && (R’s occurrence is maximal with respect to the set of already reported patterns))   

                          {Try to extend pattern R with other elementary patterns.}

                 If (R’s occurrences == P’s occurrences)

                           {P is not maximal;

                             Do not need to search for other extensions of P. }

                  Else

                            {P is not a significant pattern.}

               }

    Repeat the same process for the elementary patterns which can be glued to the right side of P (starting with the largest pattern in prefix ordering).

    Report pattern P.

    Figure 6 shows the flow chart of the whole convolution phase.

     

    Result & Conclusion

    For the previous example, we get the final result which is shown in figure 7. Also we run our program on a complex data set, the original file is here, the result file is here.

     

    Generally speaking, Teiresias do a good job when solving the pattern discovery problem of non-aligned sequences.

    Teiresias and GYM [10] are counterparts in motif detection area. The Teiresias algorithm is designed for non-aligned sequences. On the other hand, the GYM algorithm is efficient to process aligned motif sequences. The output pattern of GYM algorithm is like <amino acid, position>. Compared to GYM, Teiresias has a significant drawback: it can’t find out the repeated pattern in one sequence. That is to say, different numbers of “ABC” mean different pattern. Also, in Teriesias, different number of “don’t care” means different patterns.

    References:

    1.        Brona Brejova, Chrysanne DiMarco, Tomas Vinar, Sandra Romero Hidalgo, Gina Holguin, Cheryl Patten; “Finding Patterns in Biological Sequences”, 2000.

    2.        IBM Bioinformatics and Pattern Discovery Group / TEIRESIAS: Sequence Pattern Discovery. http://cbcsrv.watson.ibm.com/Ttspd.html

    3.        Bioinformatics & Pattern Discovery @ IBM / The Home of TEIRESIAS. http://cbcsrv.watson.ibm.com/Help/aboutTspd.htm

    4.        Rurgard, A., Moore, G., Maranas, C., ‘Web Site Review- Review of the TEIRESIAS-Based Tools of the IBM Bioinformatics and Pattern Discovery Group’, Metab. Eng. 3, 285-288, 2001

    5.        Rigoutsos, I., and Floratos, A., “Combinatorial Pattern Discovery in Biological Sequences: The TEIRESIAS Algorithm” , Bioinformatics, 14(1), 55-67, 1998

    6.        Rıgoutsos, I., Floratos, A., Parida, L., Gao, Y.,  and Platt, D., ‘The Emergence of Pattern Discovery Techniques in Computational Biology’, Metab. Eng. 2, 159-177, 2000

    7.        Murray Wolinsky, ‘An Experimental Implementation of the Teiresias Algorithm’, Stanford University Biochemistry 218 Autumn 1999.

    8.        Rigoutsos, I., and Floratos, A., “Research Report: Combinatorial Motif Discovery in Biological Sequences Using the Teiresias Algorithm”.

    9.        http://www.cs.njit.edu/~bcohen/teaching/786/studentProjects/patternRecognition/patternRecognition.ppt

    10.     Y. Gao, K. Mathee, G. Narasimhan and X. Wang, “Motif Detection in Protein Sequence” IEEE, 1999.

    11.     http://www.chem.qmw.ac.uk/iupac/AminoAcid/A2021.html#AA21