Construction of a Membrane Protein Database and an Evaluation of Several Prediction Methods of Transmembrane Segments

Toshio Shimizu[1] (slsimi@si.hirosaki-u.ac.jp)
Kenta Nakai[2] (nakai@nibb.ac.jp)

[1] Faculty of Science, Hirosaki University
3 Bunkyo-cho, Hirosaki 036 Japan
[2] National Institute for Basic Biology
Myodaiji, Okazaki 444 Japan

Abstract

How reliable and useful are predictions of the transmembrane segments (TMSs) of membrane proteins from their amino acid sequences? This question remains under debate. Kyte and Doolittle proposed a simple scheme for the prediction of TMSs [1]. It is based on the hydropathy plot and is widely accepted as a basic, standard method. Since then, a large number of more sophisticated predictive algorithms have been proposed, most of which are improved variants of the Kyte-Doolittle approach. Although these methods have been considered to give rather good results, they are still not able to predict the number and positions of TMSs precisely; in particular, they often give entirely different predictions for proteins with many TMSs [2, 3]. One reason for this situation may be the low quality of the information on TMSs in general amino acid sequence databases. The information in the SWISS-PROT database, for example, is mostly based not on experimental evidence but on predicted models, and databases often do not state explicitly whether the data come from experiments or from calculations. Higher-quality information on TMSs, derived solely from experimental evidence, is essential for evaluating existing prediction methods more precisely and for developing an algorithm that overcomes their problems.
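As a rough illustration of the hydropathy-plot idea, the following Python sketch averages the Kyte-Doolittle hydropathy index over a sliding window and reports stretches whose average exceeds a cut-off. The scale values are those of the Kyte-Doolittle index; the window length of 19, the cut-off of 1.6, and the function names are assumptions chosen for this example, not the parameters of any method evaluated here.

```python
# Kyte-Doolittle hydropathy index for the 20 standard amino acids.
KD = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def hydropathy_profile(seq, window=19):
    """Mean hydropathy of every full window; entry i covers seq[i:i+window]."""
    scores = [KD[aa] for aa in seq]
    if len(scores) < window:
        return []
    total = sum(scores[:window])
    profile = [total / window]
    for i in range(window, len(scores)):
        # Slide the window one residue to the right.
        total += scores[i] - scores[i - window]
        profile.append(total / window)
    return profile


def candidate_segments(seq, window=19, threshold=1.6):
    """Return (start, end) residue ranges, 0-based and inclusive, covered by
    runs of consecutive windows whose mean hydropathy is >= threshold.
    The window and threshold defaults are assumptions for this sketch."""
    profile = hydropathy_profile(seq, window)
    segments = []
    start = None
    for i, value in enumerate(profile):
        if value >= threshold:
            if start is None:
                start = i
        elif start is not None:
            # The last qualifying window started at i - 1.
            segments.append((start, i + window - 2))
            start = None
    if start is not None:
        segments.append((start, len(profile) + window - 2))
    return segments


if __name__ == "__main__":
    toy = "D" * 10 + "L" * 20 + "D" * 10  # artificial sequence, not real data
    print(candidate_segments(toy))
```

Because each reported range spans every residue covered by a qualifying window, it is wider than the hydrophobic core itself; later methods refine both the scale and the way segment boundaries are chosen.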
We have collected 128 references reporting the membrane topology of proteins, and are continuing our efforts to triple this number. From them, we selected 54 topology models that are based, at least partially, on experimental evidence. Combining these data with the sequence information from the SWISS-PROT database, we have constructed a membrane protein database in the form of a relational database. The current version includes 54 proteins, which are classified into 3 groups (eukaryotic proteins, prokaryotic proteins, and proteins with non-helical segments), as shown in Figure 1. Using this database, we evaluated the predictive ability of the algorithms of the following authors: Eisenberg [4]; Klein, Kanehisa and DeLisi (KKD method) [5]; von Heijne (TopPred method) [6]; and Persson and Argos [7]. The KKD method and the TopPred method predicted the exact number of TMSs for 59% and 67% of the proteins in our database, respectively. These values could be increased to 63% and 74% by optimizing the respective parameter values. The KKD method tends to predict fewer TMSs than the correct number, while the TopPred method shows the opposite tendency. We are now testing our earlier idea of using different cut-off parameters in the KKD method for proteins with one TMS and for proteins with multiple TMSs, and we are also trying to develop a new predictive algorithm that takes more precise position-dependent information on TMSs into account.
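The kind of tally behind figures such as "exact number of TMSs" and the under- or over-prediction tendency can be sketched as follows. The function name and the toy counts are hypothetical and serve only to show the bookkeeping; they are not data from the database described above.

```python
def summarize_tms_counts(pairs):
    """Summarize (predicted, observed) TMS counts, one pair per protein.

    Returns the fraction of proteins whose predicted number of TMSs is
    exactly right, too low ("under"), or too high ("over").
    """
    pairs = list(pairs)
    if not pairs:
        raise ValueError("at least one protein is required")
    n = len(pairs)
    exact = sum(1 for predicted, observed in pairs if predicted == observed)
    under = sum(1 for predicted, observed in pairs if predicted < observed)
    over = sum(1 for predicted, observed in pairs if predicted > observed)
    return {"exact": exact / n, "under": under / n, "over": over / n}


if __name__ == "__main__":
    # Hypothetical toy counts, invented purely for illustration.
    toy_pairs = [(7, 7), (1, 1), (5, 6), (12, 11)]
    print(summarize_tms_counts(toy_pairs))
```

Comparing the "under" and "over" fractions for each method is one simple way to express the opposite tendencies noted for the KKD and TopPred methods.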