Publications
Comparing protein sequence-based and predicted secondary structure-based methods for identification of remote homologs
Geetha V, Di Francesco V, Garnier J, Munson PJ
PMID: 10436078
Abstract
We have compared a novel sequence-structure matching technique, FORESST, for detecting remote homologs to three existing sequence based methods, including local amino acid sequence similarity by BLASTP, hidden Markov models (HMMs) of sequences of protein families using SAM, HMMs based on sequence motifs identified using meta-MEME. FORESST compares predicted secondary structures to a library of structural families of proteins, using HMMs. Altogether 45 proteins from nine structural families in the database CATH were used in a cross-validated test of the fold assignment accuracy of each method. Local sequence similarity of a query sequence to a protein family is measured by the highest segment pair (HSP) score. Each of the HMM-based approaches (FORESST, MEME, amino acid sequence-based HMM) yielded log-odds score for the query sequence. In order to make a fair comparison among these methods, the scores for each method were converted to Z-scores in a uniform way by comparing the raw scores of a query protein with the corresponding scores for a set of unrelated proteins. Z-Scores were analyzed as a function of the maximum pairwise sequence identity (MPSID) of the query sequence to sequences used in training the model. For MPSID above 20%, the Z-scores increase linearly with MPSID for the sequence-based methods but remain roughly constant for FORESST. Below 15%, average Z-scores are close to zero for the sequence-based methods, whereas the FORESST method yielded average Z-scores of 1.8 and 1.1, using observed and predicted secondary structures, respectively. This demonstrates the advantage of the sequence-structure method for detecting remote homologs.