An improved hidden Markov model for transmembrane protein detection and topology prediction and its applications to complete genomes

RY Kahsay, G Gao, L Liao - Bioinformatics, 2005 - academic.oup.com
Abstract
Motivation: Knowledge of the transmembrane helical topology can help identify binding sites and infer functions for membrane proteins. However, because membrane proteins are hard to solubilize and purify, only a very small number of them have experimentally determined structure and topology. This has motivated various computational methods for predicting the topology of membrane proteins.
Results: We present an improved hidden Markov model, TMMOD, for the identification and topology prediction of transmembrane proteins. Our model uses TMHMM as a prototype, but differs from TMHMM in the architecture of the submodels for loops on both sides of the membrane and in the model training procedure. In cross-validation experiments using a set of 83 transmembrane proteins with known topology, TMMOD outperformed TMHMM and other existing methods, with an accuracy of 89% for both topology and helix locations. In another experiment using a separate set of 160 transmembrane proteins, TMMOD achieved an accuracy of 84% for topology and 89% for locations. When used to discriminate transmembrane proteins from non-transmembrane proteins, particularly those with signal peptides, TMMOD consistently yields fewer false positives than TMHMM. Application of TMMOD to a collection of complete genomes shows that predicted membrane proteins account for ∼20–30% of all genes in those genomes, and that the topology in which both the N- and C-termini are in the cytoplasm is dominant in these organisms except for Caenorhabditis elegans.
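To make the idea of HMM-based topology prediction concrete, the following is a minimal toy sketch of Viterbi decoding over a four-state model (inside loop, helix crossing inside-to-outside, outside loop, helix crossing outside-to-inside). It is not the TMMOD or TMHMM architecture: the state set, the three residue classes, and every probability below are hypothetical values chosen only for illustration, and real models use many more states and trained parameters.

```python
import math

# Hypothetical residue classes for a toy model (not from the paper).
HYDROPHOBIC = set("AILMFVWC")
POSITIVE = set("KR")

# Two helix states force loops to alternate sides of the membrane.
STATES = ["i", "Mio", "o", "Moi"]
LABEL = {"i": "i", "Mio": "M", "o": "o", "Moi": "M"}

# Hypothetical transition and start probabilities.
TRANS = {
    "i":   {"i": 0.9, "Mio": 0.1},
    "Mio": {"Mio": 0.9, "o": 0.1},
    "o":   {"o": 0.9, "Moi": 0.1},
    "Moi": {"Moi": 0.9, "i": 0.1},
}
START = {"i": 0.5, "o": 0.5}

# Hypothetical emission probabilities per residue class:
# h = hydrophobic, p = positively charged, x = everything else.
EMIT = {
    "M": {"h": 0.80, "p": 0.02, "x": 0.18},
    "i": {"h": 0.20, "p": 0.30, "x": 0.50},
    "o": {"h": 0.20, "p": 0.10, "x": 0.70},
}


def log(p):
    """Natural log that maps probability 0 to -inf instead of raising."""
    return math.log(p) if p > 0 else float("-inf")


def emit(state, aa):
    """Log emission probability of residue aa in the given state."""
    cls = "h" if aa in HYDROPHOBIC else "p" if aa in POSITIVE else "x"
    return log(EMIT[LABEL[state]][cls])


def viterbi(seq):
    """Return the most probable label string (i, M, o) for seq."""
    score = [{s: log(START.get(s, 0.0)) + emit(s, seq[0]) for s in STATES}]
    back = []
    for aa in seq[1:]:
        prev, cur, ptr = score[-1], {}, {}
        for s in STATES:
            # Best predecessor state for landing in s at this position.
            best = max(STATES, key=lambda r: prev[r] + log(TRANS[r].get(s, 0.0)))
            cur[s] = prev[best] + log(TRANS[best].get(s, 0.0)) + emit(s, aa)
            ptr[s] = best
        score.append(cur)
        back.append(ptr)
    # Trace back from the best final state.
    state = max(STATES, key=lambda s: score[-1][s])
    path = [state]
    for ptr in reversed(back):
        state = ptr[state]
        path.append(state)
    return "".join(LABEL[s] for s in reversed(path))


# Made-up sequence: charged N-terminus, hydrophobic stretch, polar tail.
print(viterbi("KRKRK" + "L" * 20 + "SGSGS"))
```

With these toy parameters, the hydrophobic stretch is labeled as a membrane helix and the positively charged flank is placed on the inside, reflecting the general idea that loop composition carries the topology signal.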
Availability:  http://liao.cis.udel.edu/website/servers/TMMOD/
Contact:  lliao@cis.udel.edu
Oxford University Press