## A class of edit kernels for SVMs to predict translation initiation sites in eukaryotic mRNAs (2004)


Venue: Journal of Computational Biology

Citations: 17 (0 self)

### BibTeX

@ARTICLE{Li04aclass,
    author = {Haifeng Li and Tao Jiang},
    title = {A class of edit kernels for SVMs to predict translation initiation sites in eukaryotic mRNAs},
    journal = {Journal of Computational Biology},
    year = {2004},
    volume = {12},
    pages = {718}
}

### Abstract

The prediction of translation initiation sites (TISs) in eukaryotic mRNAs has been a challenging problem in computational molecular biology. In this paper, we present a new algorithm that recognizes TISs with very high accuracy. Our algorithm includes two novel ideas. First, we introduce a class of new sequence-similarity kernels based on string edit, called the edit kernels, for use with support vector machines (SVMs) in a discriminative approach to predicting TISs. The edit kernels are simple and have significant biological and probabilistic interpretations. Although the edit kernels are not positive definite, it is easy to make the kernel matrix positive definite by adjusting the parameters. Second, we convert the region of an input mRNA sequence downstream of a putative TIS into an amino acid sequence before applying SVMs, to avoid the high redundancy in the genetic code. The algorithm has been implemented and tested on previously published data. Our experimental results on real mRNA data show that both ideas greatly improve the prediction accuracy and that our method performs significantly better than those based on neural networks and on SVMs with polynomial kernels or the Salzberg kernel.
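The two ideas in the abstract can be illustrated with a minimal Python sketch. It assumes the edit kernel has the exponential form k(x, y) = exp(−γ · edit(x, y)) with unit edit costs, which the abstract itself does not spell out. The function names, the default value of `gamma`, and the treatment of stop codons and unknown codons are hypothetical choices made for illustration, not the authors' implementation.

```python
import math

# Standard genetic code, codons enumerated in TCAG order; '*' marks a stop codon.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def translate_downstream(mrna, atg_pos, max_codons=None):
    """Translate the complete codons that follow a putative start codon.

    Assumption (not from the paper): stop codons are kept as '*' and
    codons containing unknown bases are mapped to 'X'.
    """
    seq = mrna.upper().replace("U", "T")
    start = atg_pos + 3  # skip the putative ATG itself
    n_codons = (len(seq) - start) // 3
    if max_codons is not None:
        n_codons = min(n_codons, max_codons)
    return "".join(
        CODON_TABLE.get(seq[start + 3 * t: start + 3 * t + 3], "X")
        for t in range(n_codons)
    )


def edit_distance(x, y):
    """Unit-cost edit distance by dynamic programming, O(len(x) * len(y))."""
    prev = list(range(len(y) + 1))
    for i, a in enumerate(x, start=1):
        curr = [i]
        for j, b in enumerate(y, start=1):
            curr.append(min(
                prev[j] + 1,             # delete a
                curr[j - 1] + 1,         # insert b
                prev[j - 1] + (a != b),  # substitute, free on a match
            ))
        prev = curr
    return prev[-1]


def edit_kernel(x, y, gamma=0.1):
    """Assumed kernel form exp(-gamma * edit(x, y)); gamma is a free parameter."""
    return math.exp(-gamma * edit_distance(x, y))


if __name__ == "__main__":
    # Two toy mRNA fragments with a putative ATG at index 6.
    p = translate_downstream("GCCACCATGGCTAGCTGGAAA", 6)
    q = translate_downstream("GCCACCATGGCTAGGTGGAAA", 6)
    print(p, q, edit_kernel(p, q, gamma=0.5))
```

Working on the translated sequence means that synonymous codon changes leave the kernel value untouched, which is one way to read the abstract's remark about redundancy in the genetic code. A weighted cost matrix would replace the `(a != b)` term and the two `+ 1` terms.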

### Citations

9805 | Statistical Learning Theory
- Vapnik
- 1998
Citation Context ...the SVM is the optimal hyperplane y = sign(〈w,x〉 + b) that maximizes the margin 1/‖w‖₂ between the classes, which is the minimum distance from positive/negative samples to the separation hyperplane [40,41]. The reason to maximize the margin is that hyperplanes with a larger margin have a smaller capacity (actually a smaller upper bound on the VC-dimension) [40,41]. In this way, the overfitting problem ...

2112 | Data Mining: Concepts and Techniques
- Han, Kamber
- 2000
Citation Context ...5%. The specificity is slightly improved to 96.0%, while the sensitivity is significantly improved to 98.0%. (Footnote 9: Zien et al. used a different definition, TP/(TP + FP), which is usually called precision [15].) Table 5: Comparison of the accuracies among the edit kernels on small data sets of size 500, 1000, and 2000. Edit kernel III employs the SCM250 cost matrix. The accuracy is estimated by six-fold ...

1330 | Binary codes capable of correcting deletions, insertions, and reversals
- Levenshtein
- 1966
Citation Context ...are prevalent, the Hamming distance between two sequences is often an exaggerated over-estimation of the true dissimilarity. On the other hand, the edit distance (also known as the Levenshtein metric [27] and evolutionary distance [37]) is a more general and accurate measure of sequence dissimilarities. The (basic, unweighted) edit distance between two sequences denotes the minimum number of edit oper...

1102 | Fast training of support vector machines using sequential minimal optimization
- Platt
- 1999
Citation Context ...xity of the kernel function. If the sequential minimal optimization (SMO) is employed to train the SVM, the number of iterations ranges somewhere between linear and quadratic in the training set size [32], depending on the actual data and kernel function. Table 7 lists the average numbers of the iterations used in training the SVMs with different edit kernels in the above six-fold cross validation exp...

1060 | A simple method for estimating evolutionary rate of base substitutions through comparative studies of nucleotide sequences
- Kimura
- 1980
Citation Context ... as discussed in the previous section. To obtain the cost matrices for nucleotides, we use the 1-PAM matrix from [30], which is based on Kimura’s two-parameter (K2P) model of nucleotide substitutions [20]. Specifically, the probability of a transition for each nucleotide is 0.006 and that of a transversion is 0.002. The cost matrix with p = 250 is listed in Table 1. Note that the matrix is symmetric. ...

908 | Biological Sequence Analysis : Probabilistic Models of Proteins and Nucleic Acids
- Durbin
- 1999
Citation Context ...NAs, which are linear and unstructured. The cost scheme can be easily extended to accommodate more general probabilistic models such as affine gap costs. (Footnote 6: A similar model is the log-odds ratio model [3,13]. However, we think that the log probability interpretation is more direct and it perhaps deals with indels better.) However, we can still use the edit kernel in support vector machines a...

832 | Theory of Reproducing kernels
- Aronszajn
Citation Context ...ause all information that we need to supply to an SVM are the dot products 〈Φ(xi), Φ(xj)〉 in the feature space F, which can be computed through a positive definite kernel k(·, ·) in the input data space [2,4]: k(xi,xj) = 〈Φ(xi), Φ(xj)〉 (6). The positive definite kernel (also known as Mercer kernel) is formally defined as follows [5]: Definition 1. Let X be a nonempty set. A function k(·, ·) : X × X → R ...

828 | Amino acid substitution matrices from protein blocks
- Henikoff, Henikoff
Citation Context .... A general edit cost matrix can be defined for nucleotides based on some fixed transversion/transition ratio. The most widely used (similarity) score matrices for amino acids are PAM [12] and BLOSUM [17] matrices. PAM matrices are based on the Dayhoff model of evolutionary rates. Using an alternative approach, BLOSUM matrices were derived from about 2000 blocks of aligned sequence segments characteri...

826 | Prediction of complete gene structures in human genomic
- Burge, Karlin
- 1997
Citation Context ...find the components of a gene, which include translation initiation sites (TISs), exon-intron splice sites, promoters, poly-adenylation signals, and CpG islands. Although the algorithms (e.g. GENSCAN [8]) for finding the internal coding exons of a gene have reached a high degree of sophistication and accuracy, finding translation initiation sites that encode the start of protein translation, still re...

796 | Neuro-Dynamic Programming. Athena Scientific
- Bertsekas, Tsitsiklis
- 1996
Citation Context ...ant. Solving this quadratic programming problem, we get w, b, and thus the optimal hyperplane y = sign(Σ_i αi yi 〈x,xi〉 + b). According to the well-known Karush-Kuhn-Tucker (KKT) necessary conditions [7], the solution w and b to the above quadratic programming problem must satisfy αi(yi(〈w,xi〉 + b) − 1) = 0 (5). It means that only the points xi on the hyperplanes 〈w,xi〉 + b = ±1 have nonzero Lagrange mu...

636 | Biological Sequence Analysis
- Durbin, Eddy, et al.
- 1998
Citation Context ...inear and unstructured. The cost scheme can be easily extended to accommodate more general probabilistic models such as affine gap costs. (Footnote 6: A similar model is the log-odds ratio model (Altschul, 1991; Durbin et al., 1998). However, we think that the log probability interpretation is more direct and it perhaps deals with indels better.) ... achieved by adjusting the parameter γ. It is well known that a sy...

380 | A model of evolutionary change in proteins
- Dayhoff, Schwartz, et al.
- 1978
Citation Context ... edit kernel III. A general edit cost matrix can be defined for nucleotides based on some fixed transversion/transition ratio. The most widely used (similarity) score matrices for amino acids are PAM [12] and BLOSUM [17] matrices. PAM matrices are based on the Dayhoff model of evolutionary rates. Using an alternative approach, BLOSUM matrices were derived from about 2000 blocks of aligned sequence seg...

331 | Fast text searching allowing errors
- Wu, Manber
- 1992
Citation Context ...ithm runs in O(s ∗ n) time for instances of length n and edit distance s. Ukkonen’s algorithm, however, works only for the unit cost model. Although Ukkonen’s algorithm has been improved by Wu et al. [43,44] and Berghel and Roach [6], the improved algorithms still depend on the unit cost model. Therefore, Ukkonen’s algorithm and its improvements are not suitable for edit kernel III. Finding a fast (appr...

298 | An analysis of 5’-noncoding sequences from 699 vertebrate messenger RNAs
- Kozak
- 1987
Citation Context ...o the mRNA, resume scanning, and reinitiate at a downstream ATG codon. This procedure is called reinitiation [26]. To predict TISs, Kozak developed a weight matrix from an extended collection of data [21]. Statistical methods have also been developed to predict TISs. In 1997, Salzberg developed a positional conditional probability matrix that takes into account the dependency between adjacent bases [3...

297 | Theoretical foundations of the potential function method in pattern recognition learning
- Aizerman, Braverman, et al.
- 1964
Citation Context ...ause all information that we need to supply to an SVM are the dot products 〈Φ(xi), Φ(xj)〉 in the feature space F, which can be computed through a positive definite kernel k(·, ·) in the input data space [2,4]: k(xi,xj) = 〈Φ(xi), Φ(xj)〉 (6). The positive definite kernel (also known as Mercer kernel) is formally defined as follows [5]: Definition 1. Let X be a nonempty set. A function k(·, ·) : X × X → R ...

296 | RefSeq and LocusLink: NCBI gene-centered resources
- Pruitt, Maglott
- 2001
Citation Context ... does not assume that the input sequence must have a TIS. Section 6, Prediction on the Human mRNAs: We downloaded all human mRNAs with the status code REVIEWED from the NCBI Reference Sequence (RefSeq) database [33]. These sequences have been reviewed by NCBI staff or their collaborators and we may assume that they are of high quality. The dataset contains 8824 sequences. After deleting the sequences whose upst...

282 | The scanning model for translation: an update
- Kozak
- 1989
Citation Context ... mRNA. However, sometimes a downstream ATG is selected due to leaky scanning, reinitiation, and internal initiation of translation (this happens only for some viral mRNAs), etc. [25,26]. According to [24,45], downstream ATGs are used as start codons in less than 10% of investigated eukaryotic mRNAs. So, it seems that we could easily obtain an accuracy of more than 90% in the prediction of TISs by simply ...

250 | On the theory and computation of evolutionary distance
- Sellers
- 1974
Citation Context ...ance between two sequences is often an exaggerated over-estimation of the true dissimilarity. On the other hand, the edit distance (also known as the Levenshtein metric [27] and evolutionary distance [37]) is a more general and accurate measure of sequence dissimilarities. The (basic, unweighted) edit distance between two sequences denotes the minimum number of edit operations that transform one seque...

226 | Molecular evolution. Sinauer Associates
- Li
- 1997
Citation Context ...e processes of DNA replication and evolution, errors such as insertions and deletions (i.e. indels) of nucleotides are common. In particular, short tandem repeats (STR) are often hotspots for indels [28]. When indels are prevalent, the Hamming distance between two sequences is often an exaggerated over-estimation of the true dissimilarity. On the other hand, the edit distance (also known as the Lev...

198 | Algorithms for approximate string matching
- Ukkonen
- 1985
Citation Context ...kernels have time complexity O(n²) based on dynamic programming. To improve the time complexity, one may attempt to use some fast algorithm to compute the edit distance, such as Ukkonen’s algorithm [39]. Ukkonen’s algorithm runs in O(s ∗ n) time for instances of length n and edit distance s. Ukkonen’s algorithm, however, works only for the unit cost model. Although Ukkonen’s algorithm has been impro...

162 | Amino acid substitution matrices from an information theoretic perspective
- Altschul
- 1991
Citation Context ...NAs, which are linear and unstructured. The cost scheme can be easily extended to accommodate more general probabilistic models such as affine gap costs. (Footnote 6: A similar model is the log-odds ratio model [3,13]. However, we think that the log probability interpretation is more direct and it perhaps deals with indels better.) However, we can still use the edit kernel in support vector machines a...

161 | Solutions of incorrectly formulated problems and the regularization method
- Tikhonov
- 1963
Citation Context ...chniques in applications. More precisely, the capacity control capability makes SVMs free of the overfitting problem [40, 41]. SVMs can also be interpreted in the framework of regularization theory [38], which is a general approach to handle ill-posed problems. The small number of support vectors used in an SVM also has natural interpretations in the context of algorithmic complexity and minimum des...

155 | Initiation of translation in prokaryotes and eukaryotes
- Kozak
- 1999
Citation Context ... first ATG codon in an mRNA. However, sometimes a downstream ATG is selected due to leaky scanning, reinitiation, and internal initiation of translation (this happens only for some viral mRNAs), etc. [25,26]. According to [24,45], downstream ATGs are used as start codons in less than 10% of investigated eukaryotic mRNAs. So, it seems that we could easily obtain an accuracy of more than 90% in the predict...

154 | Bioinformatics. Sequence and Genome Analysis
- Mount
- 2003
Citation Context ...edit distance definition by taking the average of edit(x,y) and edit(y,x) in the kernel as discussed in the previous section. To obtain the cost matrices for nucleotides, we use the 1-PAM matrix from [30], which is based on Kimura’s two-parameter (K2P) model of nucleotide substitutions [20]. Specifically, the probability of a transition for each nucleotide is 0.006 and that of a transversion is 0.002. ...

146 | Harmonic Analysis in Semigroups
- Berg, Ressel, et al.
- 1984
Citation Context ...d through a positive definite kernel k(·, ·) in the input data space [2,4]: k(xi,xj) = 〈Φ(xi), Φ(xj)〉 (6). The positive definite kernel (also known as Mercer kernel) is formally defined as follows [5]: Definition 1. Let X be a nonempty set. A function k(·, ·) : X × X → R is called a positive definite kernel if k(·, ·) is symmetric (i.e. k(x,y) = k(y,x) for all x,y ∈ X) and Σ_{i=1}^{n} Σ_{j=1}^{n} ci cj k(xi,xj) ...

136 | Support Vector Learning
- Schölkopf
- 1997
Citation Context ...robability interpretation is more direct and it perhaps deals with indels better. However, we can still use the edit kernel in support vector machines according to the following theorem [36]. Theorem 2. Suppose the data x1,...,xℓ and the kernel k(·, ·) are such that the matrix Kij = k(xi,xj) (12) is positive. Then it is possible to construct a map Φ into a feature space F such that k(xi,x...

109 | Engineering support vector machine kernels that recognize translation initiation sites
- Zien, Rätsch, et al.
- 2000
Citation Context ...mployed linear discriminant analysis for the final scoring [34]. In 2000, Zien et al. used support vector machines (SVMs) to predict TISs and achieved an 88.6% accuracy on Pedersen and Nielsen’s data [46]. Recently, Hatzigeorgiou achieved a 94% accuracy on 475 cDNA sequences [16]. Her system includes two modules (both based on neural networks), one sensitive to the conserved motif and the other sensit...

101 | Theory of Pattern Recognition
- Chervonenkis
- 1974
Citation Context ...port vectors. It is also known that the expectation of the number of learned support vectors from a training set of size ℓ, divided by ℓ−1, is an upper bound on the expected probability of test error [42]. Thus, the very small number of support vectors required by edit kernel III with SCM250 is an assurance of the SVM’s good performance. Table 7: Comparison of the average numbers of iterations amon...

99 | At least six nucleotides preceding the AUG initiator codon enhance translation in mammalian cells
- Kozak
- 1987
Citation Context ...ively studied by using biological approaches, machine learning, and statistical models. In 1987, Kozak found that an ATG codon in a very weak context is not likely to be the start site of translation [22]. The optimal context for initiation of translation in vertebrate mRNA is GCCACCatgG. Within this consensus motif, nucleotides in two highly conserved positions exert the strongest effect: a G residue...

75 | Interpreting cDNA sequences: some insights from studies on translation
- Kozak
- 1996
Citation Context ... first ATG codon in an mRNA. However, sometimes a downstream ATG is selected due to leaky scanning, reinitiation, and internal initiation of translation (this happens only for some viral mRNAs), etc. [25,26]. According to [24,45], downstream ATGs are used as start codons in less than 10% of investigated eukaryotic mRNAs. So, it seems that we could easily obtain an accuracy of more than 90% in the predict...

70 | Computational identification of promoters and first exons in the human genome. Nat Genet 29
- Davuluri, Grosse, et al.
- 2001
Citation Context ...ngless to compare the performance between Hatzigeorgiou’s approach and Pedersen and Nielsen’s approach here since they were tested on different data. Finding TISs was also addressed indirectly in [11] in terms of finding the first exon of a gene contained in a genomic sequence. In [11], Davuluri et al. developed the program FirstEF based on a decision tree consisting of quadratic discriminant func...

63 | Neural network prediction of translation initiation sites in eukaryotes: perspectives for EST and genome analysis
- Pedersen, Nielsen
- 1997
Citation Context ...error-free mRNA sequences. However, it has been reported that, in the GenBank nucleotide data that are annotated as being equivalent to mature mRNAs, almost 40% of the sequences contain upstream ATGs [31]. This problem is enhanced when using unannotated genomic data and when analyzing expressed sequence tags (ESTs), which are single-pass partial sequences derived from cDNAs and are usually error-prone...

59 | ESTScan: a program for detecting, evaluating, and reconstructing potential coding regions
- Iseli, Jongeneel, et al.
- 1999

54 | A method for identifying splice sites and translational start sites in eukaryotic mRNA
- Salzberg
- 1997
Citation Context ...1]. Statistical methods have also been developed to predict TISs. In 1997, Salzberg developed a positional conditional probability matrix that takes into account the dependency between adjacent bases [35]. In 1998, Agarwal and Bafna developed the so-called generalized second-order profiles that consider dependencies between non-adjacent bases [1]. However, both methods suffer from high rates of false p...

52 | Effects of intercistronic length on the efficiency of reinitiation by eucaryotic ribosomes
- Kozak
- 1987
Citation Context ...ns in the 84 and 165 nucleotide downstream regions, respectively. Besides, reinitiation in eukaryotes is most efficient when the upORF terminates at some distance before the start of the next cistron [23]. The reason is that the 40S ribosomal subunit requires time (distance) to reacquire Met-tRNAi·eIF-2, without which the downstream ATG codon cannot be recognized [18]. According to the study of Luukko...

32 | An O(NP) sequence comparison algorithm
- Wu, Manber, Myers, et al.
- 1990
Citation Context ...ithm runs in O(s ∗ n) time for instances of length n and edit distance s. Ukkonen’s algorithm, however, works only for the unit cost model. Although Ukkonen’s algorithm has been improved by Wu et al. [43,44] and Berghel and Roach [6], the improved algorithms still depend on the unit cost model. Therefore, Ukkonen’s algorithm and its improvements are not suitable for edit kernel III. Finding a fast (appr...

30 | Rational kernels
- Cortes, Haffner, et al.
- 2003
Citation Context ... the symmetry property. To use the edit kernel in support vector machines, we would hope that it is positive definite. Unfortunately, it has been shown that the edit kernel is not positive definite [9,10]. Footnote 4: The cost is restricted to 1 or 0 in this basic string edit model. However, it will be relaxed to arbitrary nonnegative numbers in the next section. Footnote 5: The additive cost scheme corresponds to the as...

24 | Assessing protein coding region integrity in cDNA sequencing projects
- Salamov, Nishikawa, et al.
- 1998
Citation Context ...ses, the codon GTG is used. ...sequences [31]. Salamov et al. used six characteristics to analyze the area around a putative start codon and employed linear discriminant analysis for the final scoring [34]. In 2000, Zien et al. used support vector machines (SVMs) to predict TISs and achieved an 88.6% accuracy on Pedersen and Nielsen’s data [46]. Recently, Hatzigeorgiou achieved a 94% accuracy on 475 cD...

23 | Translation Initiation Start Prediction in Human cDNAs with High Accuracy
- Hatzigeorgiou
- 2002
Citation Context ...en et al. used support vector machines (SVMs) to predict TISs and achieved an 88.6% accuracy on Pedersen and Nielsen’s data [46]. Recently, Hatzigeorgiou achieved a 94% accuracy on 475 cDNA sequences [16]. Her system includes two modules (both based on neural networks), one sensitive to the conserved motif and the other sensitive to the coding/non-coding potential around the start codon. The program l...

22 | Positive Definite Rational Kernels
- Cortes, Haffner, et al.
- 2003
Citation Context ... the symmetry property. To use the edit kernel in support vector machines, we would hope that it is positive definite. Unfortunately, it has been shown that the edit kernel is not positive definite [9,10]. Footnote 4: The cost is restricted to 1 or 0 in this basic string edit model. However, it will be relaxed to arbitrary nonnegative numbers in the next section. Footnote 5: The additive cost scheme corresponds to the as...

22 | Efficiency of reinitiation of translation on human immunodeficiency virus type 1 mRNAs is determined by the length of the upstream open reading frame and by intercistronic distance
- Luukkonen, Tan, et al.
- 1995
Citation Context ...downstream translation initiation is inhibited in 50% of the cases by an upORF of 84 nucleotides and should be entirely abrogated by an ORF longer than 165 nucleotides (predicted by extrapolation) [29]. In our 173 cases, 51 (29.5%) and 74 (42.8%) of the false TISs have stop codons in the 84 and 165 nucleotide downstream regions, respectively. Besides, reinitiation in eukaryotes is most efficient wh...

10 | Detecting non-adjoining correlations with signals in DNA
- Agarwal, Bafna
- 1998
Citation Context ...akes into account the dependency between adjacent bases [35]. In 1998, Agarwal and Bafna developed the so-called generalized second-order profiles that consider dependencies between non-adjacent bases [1]. However, both methods suffer from high rates of false positives. Since 1997, the machine learning approach has been applied to find TISs. With a neural network, Pedersen and Nielsen achieved an 84.6%...

9 | Control of translation initiation in Saccharomyces cerevisiae
- Yoon, Donahue
- 1992
Citation Context ... mRNA. However, sometimes a downstream ATG is selected due to leaky scanning, reinitiation, and internal initiation of translation (this happens only for some viral mRNAs), etc. [25,26]. According to [24,45], downstream ATGs are used as start codons in less than 10% of investigated eukaryotic mRNAs. So, it seems that we could easily obtain an accuracy of more than 90% in the prediction of TISs by simply ...

8 | Translational regulation of yeast GCN4
- Hinnebusch
- 1997
Citation Context ...ore the start of the next cistron [23]. The reason is that the 40S ribosomal subunit requires time (distance) to reacquire Met-tRNAi·eIF-2, without which the downstream ATG codon cannot be recognized [18]. According to the study of Luukkonen et al., an intercistronic distance shorter than 37 nucleotides appears to negatively affect initiation frequency at downstream ATGs [29]. In the 74 false TISs tha...

5 | An extension of Ukkonen’s enhanced dynamic programming ASM algorithm
- Berghel, Roach
- 1996
Citation Context ... instances of length n and edit distance s. Ukkonen’s algorithm, however, works only for the unit cost model. Although Ukkonen’s algorithm has been improved by Wu et al. [43,44] and Berghel and Roach [6], the improved algorithms still depend on the unit cost model. Therefore, Ukkonen’s algorithm and its improvements are not suitable for edit kernel III. Finding a fast (approximate) algorithm for edi...

3 | The Nature of Statistical Learning Theory
- Vapnik
- 1995
Citation Context ...the SVM is the optimal hyperplane y = sign(〈w,x〉 + b) that maximizes the margin 1/‖w‖₂ between the classes, which is the minimum distance from positive/negative samples to the separation hyperplane [40,41]. The reason to maximize the margin is that hyperplanes with a larger margin have a smaller capacity (actually a smaller upper bound on the VC-dimension) [40,41]. In this way, the overfitting problem ...