Publications
A decision tree system for finding genes in DNA
Salzberg S, Delcher AL, Fasman KH, Henderson J
PMID: 10072083
Abstract
MORGAN is an integrated system for finding genes in vertebrate DNA sequences. MORGAN uses a variety of techniques to accomplish this task, the most distinctive of which is a decision tree classifier. The decision tree system is combined with new methods for identifying start codons, donor sites, and acceptor sites, and these are brought together in a frame-sensitive dynamic programming algorithm that finds the optimal segmentation of a DNA sequence into coding and noncoding regions (exons and introns). The optimal segmentation is dependent on a separate scoring function that takes a subsequence and assigns to it a score reflecting the probability that the sequence is an exon. The scoring functions in MORGAN are sets of decision trees that are combined to give a probability estimate. Experimental results on a database of 570 vertebrate DNA sequences show that MORGAN has excellent performance by many different measures. On a separate test set, it achieves an overall accuracy of 95 %, with a correlation coefficient of 0.78, and a sensitivity and specificity for coding bases of 83 % and 79%. In addition, MORGAN identifies 58% of coding exons exactly; i.e., both the beginning and end of the coding regions are predicted correctly. This paper describes the MORGAN system, including its decision tree routines and the algorithms for site recognition, and its performance on a benchmark database of vertebrate DNA.