Automatic Labeling of Protein Structures by Hidden Markov Models

Kiyoshi Asai[1] (asai@etl.go.jp)
Kentaro Onizuka[2] (onizuka@mrit.mei.co.jp)
Masayuki Akahoshi[3] (akahoshi@icot.or.jp)
Hidetoshi Tanaka[3] (htanaka@icot.or.jp)
Katunobu Itou[1] (kito@etl.go.jp)

[1] Electrotechnical Laboratory (ETL), 1-1-4 Umezono, Tsukuba, 305 Japan
[2] Matsushita Research Institute, 3-10-1 Higashimita, Tama-ku, Kawasaki, 214 Japan
[3] ICOT, Mita Kokusai Bldg. 21F, 1-4-28 Mita, Minato-ku, Tokyo, 108 Japan


Abstract

In this research, local structures of proteins are labeled by Hidden Markov Models (HMMs) using the Multi Scale Structure Description (MSSD).
HMMs have been used for structure prediction [Asai91,Asai93A,Asai93B], sequence alignment [Haussler93], protein classification [Tanaka93], and motif extraction [Fujiwara94]. Most of these studies used the 20 amino acids as the discrete output symbols of the HMMs. In this paper, by contrast, the hidden states have continuous output distributions over the MSSD parameters of the protein structures. MSSD is a robust parameterization of protein structures based on the 3D coordinates of the alpha carbons [Onizuka94].
To obtain HMMs appropriate for this purpose, the network shapes of the HMMs must be determined. The HMM training here therefore consists of parameter learning and dynamic growth of the network shape. For network shape determination, the iterative duplication method [Fujiwara94] and the successive state splitting (SSS) algorithm [Tanaka93] have been used for protein HMMs. We used a modified iterative duplication method, in which negligible links are deleted and the states with the largest output variances are duplicated.
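One growth step of such a procedure can be sketched as follows. This is only an illustration under stated assumptions, not the procedure of [Fujiwara94] or of this paper: the pruning threshold `eps`, the even split of incoming probability between a state and its copy, the `jitter` offset of the two means, and the diagonal Gaussian outputs are all hypothetical choices. In practice, parameter re-estimation would be run again after each step.

```python
import math

def prune_links(trans, eps=1e-3):
    """Delete transitions with probability below eps, then renormalize rows.

    eps is an assumed threshold for a "negligible" link.
    """
    pruned = []
    for row in trans:
        kept = [p if p >= eps else 0.0 for p in row]
        total = sum(kept)
        pruned.append([p / total for p in kept] if total > 0 else kept)
    return pruned

def duplicate_widest_state(trans, means, variances, jitter=0.1):
    """Duplicate the state whose summed output variance is largest.

    Assumptions: incoming probability is split evenly between the state
    and its copy, the copy inherits the outgoing transitions, and the two
    means are moved apart by jitter standard deviations.
    """
    n = len(trans)
    k = max(range(n), key=lambda j: sum(variances[j]))
    new_trans = []
    for row in trans:
        half = row[k] / 2.0
        new_row = list(row)
        new_row[k] = half          # half of the incoming mass stays on k
        new_row.append(half)       # the other half goes to the copy
        new_trans.append(new_row)
    new_trans.append(list(new_trans[k]))  # copy shares k's outgoing row
    offset = [jitter * math.sqrt(v) for v in variances[k]]
    new_means = [list(m) for m in means]
    new_means[k] = [m - d for m, d in zip(means[k], offset)]
    new_means.append([m + d for m, d in zip(means[k], offset)])
    new_vars = [list(v) for v in variances] + [list(variances[k])]
    return new_trans, new_means, new_vars

# Toy two-state model with one-dimensional outputs.
trans = prune_links([[0.9995, 0.0005], [0.3, 0.7]])
trans, means, variances = duplicate_widest_state(
    trans, means=[[0.0], [5.0]], variances=[[1.0], [4.0]]
)
```

In the toy model the second state has the larger variance, so it is the one duplicated, and every row of the grown transition matrix still sums to one.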
After the HMM training, the features of the local structures are characterized not only by the output distributions of the states but also by the transition probabilities between them. Therefore, both continuous structures and short-range structures are categorized naturally as hidden states of the HMMs. Using the parameters of the 5-residue MSSD, which correspond closely to the secondary structures, "alpha helix", "beta strand", and many types of "turns" and "coils" are expressed in the HMMs.
By estimating the hidden state transitions with the Viterbi algorithm, the protein structures are aligned to the HMMs. The labeling of the local structures is a straightforward translation of this alignment. At the same time, the HMMs extract the rules between the local structures as matrices of transition probabilities. These rules are important for modeling protein structures, including higher-level structures.
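The alignment and labeling step can be illustrated with a minimal Viterbi decoder for an HMM with continuous outputs. The sketch assumes diagonal Gaussian output distributions, which the abstract does not specify, and uses a toy one-dimensional feature in place of real MSSD parameters. The function names, the model values, and the label names are hypothetical.

```python
import math

NEG_INF = float("-inf")

def safe_log(p):
    """Natural log that maps zero probability to minus infinity."""
    return math.log(p) if p > 0.0 else NEG_INF

def log_gauss(x, mean, var):
    """Log density of a diagonal-covariance Gaussian at vector x."""
    return sum(
        -0.5 * (math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v)
        for xi, m, v in zip(x, mean, var)
    )

def viterbi(obs, init, trans, means, variances):
    """Return the most likely hidden state path for the observations."""
    n = len(init)
    # delta[j]: best log score of any path ending in state j so far.
    delta = [
        safe_log(init[j]) + log_gauss(obs[0], means[j], variances[j])
        for j in range(n)
    ]
    back = []  # back[t][j]: best predecessor of state j at position t + 1
    for x in obs[1:]:
        prev, delta, ptr = delta, [], []
        for j in range(n):
            best = max(range(n), key=lambda i: prev[i] + safe_log(trans[i][j]))
            ptr.append(best)
            delta.append(
                prev[best]
                + safe_log(trans[best][j])
                + log_gauss(x, means[j], variances[j])
            )
        back.append(ptr)
    # Trace the best path backwards from the best final state.
    state = max(range(n), key=lambda j: delta[j])
    path = [state]
    for ptr in reversed(back):
        state = ptr[state]
        path.append(state)
    path.reverse()
    return path

# Toy model: two states with well separated one-dimensional outputs.
obs = [[0.1], [-0.2], [5.1], [4.8], [0.0]]
path = viterbi(
    obs,
    init=[0.5, 0.5],
    trans=[[0.9, 0.1], [0.1, 0.9]],
    means=[[0.0], [5.0]],
    variances=[[1.0], [1.0]],
)
names = {0: "A", 1: "B"}            # hypothetical local structure labels
labels = [names[s] for s in path]   # labeling is a lookup on the alignment
```

For this toy input the decoded path is [0, 0, 1, 1, 0], so the labels read A, A, B, B, A: one label per position, obtained directly from the state alignment.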