Probabilistic Latent Semantic Analysis for prediction of Gene Ontology annotations

Masseroli, Marco; Chicco, Davide; Pinoli, Pietro

doi:10.1109/IJCNN.2012.6252767

Consistency and completeness of biomolecular annotations is a keypoint of correct interpretation of biological experiments. Yet, the associations between genes (or proteins) and features correctly annotated are just some of all the existing ones. As time goes by, they increase in number and become more useful, but they remain incomplete and some of them incorrect. To support and quicken their time-consuming curation procedure and to improve consistence of available annotations, computational methods that are able to supply a ranked list of predicted annotations are hence extremely useful. Starting from a previous work on the automatic prediction of Gene Ontology (GO) annotations based on the Singular Value Decomposition of the annotation matrix, where every matrix element corresponds to the association of a gene with a feature, we propose the use of a modified Probabilistic Latent Semantic Analysis (pLSA) algorithm, named pLSAnorm, to better perform such prediction. pLSA is a statistical technique from the natural language processing field, which has not been used in bioinformatics annotation prediction yet; it takes advantage of the latent information contained in the analyzed data co-occurrences. We proved the effectiveness of the pLSAnorm prediction method by performing k-fold cross-validation of the GO annotations of two organisms, Gallus gallus and Bos taurus. Obtained results demonstrate the efficacy of our approach.