Genome-wide annotation prediction with SVD truncation based on ROC analysis

Chicco, Davide; Masseroli, Marco

The correct interpretation of many biological experiments currently depends on the consistency of biomolecular annotation databases. Such databases are very useful for the scientific community but, unfortunately, incomplete by definition. Computational methods that supply ranked lists of predicted annotations are therefore extremely useful for improving their consistency. We started from previous work on the automatic prediction of Gene Ontology (GO) annotations based on the truncated Singular Value Decomposition (SVD) of the annotation matrix, where every matrix element corresponds to the association of a biomolecular entity with a GO controlled term. We then developed a new method in which the truncation level is chosen by computing and evaluating the Area Under the Curve (AUC) of the Receiver Operating Characteristic (ROC) curves obtained for different truncation levels. Let the matrix A, with m rows (genes) and n columns (annotation terms), represent all the annotations of a specific controlled vocabulary for a given organism. The entry A(i,j) is 1 if gene i is annotated to term j (or to any descendant of j) and 0 otherwise. The SVD-based annotation prediction is performed by computing a reduced-rank approximation Ak of the matrix A by means of the singular value decomposition. Ak contains real-valued entries related to the likelihood that gene i should be annotated to term j. For a given threshold t, if Ak(i,j) > t, gene i is predicted to be annotated to term j; if, in addition, A(i,j) = 0, a new annotation is suggested (Annotation Predicted, AP). Conversely, if A(i,j) > 0 and Ak(i,j) <= t, an existing annotation is flagged as semantically inconsistent with the available data (Annotation to be Reviewed, AR). The core of this SVD method is the truncation level k, which defines the size of the submatrix used by the algorithm to compute the SVD.
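The prediction step described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the toy matrix, and the values of k and t are assumptions chosen for illustration, and the reduced-rank matrix Ak is named A_k in the code.

```python
import numpy as np

def truncated_svd_predict(A, k):
    """Return the rank-k approximation A_k of the annotation matrix A."""
    # Thin SVD: A = U @ diag(s) @ Vt, singular values in descending order.
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    # Keep only the k largest singular values and their vectors.
    return (U[:, :k] * s[:k]) @ Vt[:k, :]

def classify(A, A_k, t):
    """Split thresholded predictions into AP and AR boolean masks."""
    predicted = A_k > t
    ap = predicted & (A <= 0)   # Annotation Predicted: new annotation suggested
    ar = ~predicted & (A > 0)   # Annotation to be Reviewed: existing one not supported
    return ap, ar

# Hypothetical toy matrix: 4 genes (rows) x 3 annotation terms (columns).
A = np.array([[1, 1, 0],
              [1, 1, 0],
              [1, 0, 1],
              [0, 0, 1]], dtype=float)

A_k = truncated_svd_predict(A, k=1)   # k and t are illustrative values only
ap, ar = classify(A, A_k, t=0.5)
print("suggested new annotations (gene, term):", np.argwhere(ap).tolist())
print("annotations to review (gene, term):", np.argwhere(ar).tolist())
```

With k equal to the rank of A, the approximation reproduces A up to rounding, so no AP or AR entries are produced; lower values of k are what make the method suggest changes.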
To select the best truncation level, we implemented a greedy algorithm that, for each truncation level considered, generates a ROC curve plotting the AR rate (1.0 - Sensitivity) against the AP rate (1.0 - Specificity) and computes the area under this curve (AUC). For a sample dataset, we sampled q truncation values by considering the maximum rank of A, the number of non-zero singular values along the diagonal of Sigma, and the gradient variations in the distribution of the AUCs. We used every qi as a truncation level for the SVD and computed the AUCqi of the corresponding ROCqi curve. Once all the AUCqi values had been calculated, we took as the best truncation the qi associated with min(AUCqi); since both axes of this curve are error rates, a smaller area corresponds to a better trade-off between them. To evaluate the performance of our method, we used gene annotations of different organisms available in July 2009 in an earlier version of the GO annotation databases. In the analysis of Gallus gallus annotations between genes and Biological Process terms, the truncation level suggested by the algorithm led to better results than other truncation levels. Given all the input annotations, the SVD method with the best truncation predicted the highest number of annotations whose presence was confirmed in a more recent version of the GO database (October 2011). Conversely, other truncation values, associated with higher AUC values, led to worse prediction results.
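The selection procedure can be sketched in Python as follows. This is an illustration under stated assumptions rather than the authors' algorithm: the function names, the toy matrix, and the candidate list are hypothetical, the thresholds are simply the distinct values of Ak, and the area is computed with the trapezoidal rule. How the q candidate values are actually sampled is only outlined in the text and is not reproduced here.

```python
import numpy as np

def error_curve_area(A, scores):
    """Area under the curve of AR rate (y) against AP rate (x)."""
    labels = A.ravel() > 0            # existing annotations
    s = scores.ravel()
    n_pos = labels.sum()
    n_neg = (~labels).sum()
    # Assumption: sweep every distinct score, plus both extremes.
    thresholds = np.concatenate(([-np.inf], np.unique(s), [np.inf]))
    ap_rate = np.empty(len(thresholds))
    ar_rate = np.empty(len(thresholds))
    for idx, t in enumerate(thresholds):
        predicted = s > t
        ap_rate[idx] = (predicted & ~labels).sum() / n_neg   # 1 - specificity
        ar_rate[idx] = (~predicted & labels).sum() / n_pos   # 1 - sensitivity
    # AP rate falls as t grows, so reverse to integrate over increasing x.
    x, y = ap_rate[::-1], ar_rate[::-1]
    return float(np.sum(np.diff(x) * (y[1:] + y[:-1]) / 2.0))

def select_truncation(A, candidates):
    """Return the candidate k with the smallest area, plus all areas."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    areas = {}
    for k in candidates:
        A_k = (U[:, :k] * s[:k]) @ Vt[:k, :]
        areas[k] = error_curve_area(A, A_k)
    best_k = min(areas, key=areas.get)
    return best_k, areas

# Hypothetical toy matrix: 5 genes x 4 terms, rank 3.
A = np.array([[1, 1, 0, 0],
              [1, 1, 0, 0],
              [1, 1, 1, 0],
              [0, 0, 1, 1],
              [0, 0, 1, 1]], dtype=float)

best_k, areas = select_truncation(A, candidates=[1, 2])
print("areas per truncation:", areas)
print("selected truncation:", best_k)
```

In this sketch a truncation equal to the rank of A reproduces A exactly and gives an area of zero, so the candidates are kept below the rank; otherwise the minimum would trivially select the full decomposition.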

RE.PUBLIC@POLIMI, research publications of the Politecnico di Milano