Assignment of Certainty-Factor Parameters with a Given Reasoning Tree for the Prediction of Protein Localization Sites

Kenta Nakai[1] (nakai@nibb.ac.jp)
Ayumi Shinohara[2] (ayumi@rifis.kyushu-u.ac.jp)
Satoru Miyano[2] (miyano@rifis.kyushu-u.ac.jp)

[1] National Institute for Basic Biology
Myodaiji, Okazaki 444 Japan
[2] Research Institute of Fundamental Information Science, Kyushu University
6-10-1 Hakosaki, Higashi-ku, Fukuoka 812 Japan

Abstract

In this age of large-scale sequencing, we have many ``potentially expressed'' amino acid sequences of unknown function. Characterizing such sequences by computer is undoubtedly useful for guiding further experimental analyses. We have developed a knowledge-based system, PSORT, for characterizing the various sorting signals potentially encoded in amino acid sequences and for predicting their final localization sites in cells [1, 2]. The system calculates the probability (certainty factor) that an input protein is localized at each candidate site. One difficulty with the system is that it has many adjustable parameters, which makes them hard to optimize for a given set of training data. As a result, incorporating recent knowledge into the system has not been easy. We present here a simple scheme for assigning certainty-factor parameters with a given reasoning tree.
Since the training data, i.e., sequences with known localization sites, are not large in most cases, we must keep the number of parameters as small as possible. For this purpose, it is advantageous to use our knowledge of the reasoning flow. Such a flow can be organized into a reasoning tree, in which an input flux is divided step by step into thinner flows according to characteristic values calculated from the input sequence (Fig. 1). The final outputs are the flows corresponding to the candidate localization sites, and the amount of each flow can be interpreted as the corresponding certainty factor. Thus, the problem is how to find appropriate functions that transform the characteristic value at each step into a division of the flow such that the classification performance on the training data is optimized. We used the following formula for that function:
\[F_p(x_p(i)) = \frac{1}{1 + \exp(-10 \times (x_p(i) - b_p))} \] where $x_p(i)$ is the characteristic value of sequence $i$ at step $p$, e.g., the propensity that input sequence $i$ encodes a membrane protein, and $b_p$ is a threshold value chosen so that the training data at step $p$ are classified with the fewest mistakes. The certainty factor for localization at a candidate site is then calculated as the probability of choosing the corresponding path; e.g., in Fig. 1 the certainty factor for protein $i$ to localize at site #3 is $F_1(x_1(i)) \times F_2(x_2(i)) \times (1 - F_4(x_4(i)))$.
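The calculation described above can be illustrated with a minimal Python sketch. It is not the PSORT implementation: the function names, the path encoding as (step, take_high) pairs, and the midpoint search used to pick $b_p$ are all assumptions made for illustration. Only the sigmoid with slope 10, the "fewest mistakes" criterion, and the product for site #3 come from the text.

```python
import math

STEEPNESS = 10.0  # fixed slope constant taken from the formula for F_p


def gate(x, b):
    """F_p(x) = 1 / (1 + exp(-10 * (x - b))): fraction of flow sent to the 'high' branch."""
    return 1.0 / (1.0 + math.exp(-STEEPNESS * (x - b)))


def fit_threshold(values, labels):
    """Choose b so that the rule 'x > b means high branch' makes the fewest mistakes.

    Assumption: candidate thresholds are the midpoints between adjacent
    distinct training values, plus one candidate beyond each end.
    labels[k] is True when training sequence k should take the high branch.
    """
    xs = sorted(set(values))
    candidates = [xs[0] - 1.0]
    candidates += [(lo + hi) / 2.0 for lo, hi in zip(xs, xs[1:])]
    candidates.append(xs[-1] + 1.0)

    def mistakes(b):
        return sum((x > b) != y for x, y in zip(values, labels))

    return min(candidates, key=mistakes)


def certainty_factor(path, x, b):
    """Multiply the branch fractions along one root-to-leaf path.

    path: list of (step, take_high); take_high=True uses F_p, False uses 1 - F_p.
    x, b: dicts mapping step number to characteristic value and threshold.
    """
    cf = 1.0
    for step, take_high in path:
        f = gate(x[step], b[step])
        cf *= f if take_high else 1.0 - f
    return cf


# Hypothetical encoding of the path to site #3: F_1 * F_2 * (1 - F_4).
SITE3_PATH = [(1, True), (2, True), (4, False)]

if __name__ == "__main__":
    # Made-up characteristic values and labels, for illustration only.
    b1 = fit_threshold([0.1, 0.2, 0.8, 0.9], [False, False, True, True])
    x = {1: 0.9, 2: 0.8, 4: 0.1}
    b = {1: b1, 2: 0.5, 4: 0.5}
    print(b1, certainty_factor(SITE3_PATH, x, b))
```

Because each step splits its incoming flow into $F_p$ and $1 - F_p$, the certainty factors of all leaves of a binary tree sum to one under this scheme, and the only fitted parameter per step is the threshold $b_p$.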
To test the validity of our model, we prepared 156 sequences of Bacillus subtilis whose localization sites were taken from the prediction results of PSORT. A cross-validation test gave fairly good results. Thus, although there is no theoretical proof that our model always gives good results, we hope that it will be used for the future improvement of PSORT. Moreover, because of its simplicity, the method may be generally applicable to interpreting unknown sequence data with the latest knowledge of molecular cell biology.