Abstract
In microarray experiments, each array records the expression levels of several thousand genes in a tissue sample. Our objective is to find groups of tissue samples on the basis of their gene expression patterns. Reliable and precise grouping of tissues is essential for the automated search for, and validation of, molecular
subtypes of disease. The major difficulty in this problem is that the number of tissues to be grouped is much smaller than the dimension of the data, which corresponds to the number of genes. In such a case, applying conventional model-based clustering with finite mixture models, e.g. Gaussian mixtures, leads to overfitting during density estimation. To overcome this difficulty, we consider an extension of factor analysis. In our probability model, referred to as the mixed factors model, the factor variable provides a parsimonious parameterization of the Gaussian mixture, that is, a parsimonious description of the clusters. As a result of this modeling, we can avoid overfitting during density estimation even when the dimension of the data is more
than several thousand and the number of samples is less than one hundred. Our method includes an automated search for clusters and data compression. The compressed data are constructed so as to be a plausible estimate that reveals the presence of groups. In addition, we can detect sets of genes, i.e. transcriptional module genes, that are relevant to explaining the presence of groups and are considered to be co-expressed in combination with other genes. We
will demonstrate the potential usefulness of our method by applying it to gene expression data.