Integration and querying of genomic and proteomic semantic annotations for biomedical knowledge extraction

Masseroli, Marco; Canakoglu, Arif; Ceri, Stefano

doi:10.1109/TCBB.2015.2453944

Understanding complex biological phenomena involves answering complex biomedical questions on multiple biomolecular information simultaneously, which are expressed through multiple genomic and proteomic semantic annotations scattered in many distributed and heterogeneous data sources; such heterogeneity and dispersion hamper the biologists’ ability of asking global queries and performing global evaluations. To overcome this problem, we developed a software architecture to create and maintain a Genomic and Proteomic Knowledge Base (GPKB), which integrates several of the most relevant sources of such dispersed information (including Entrez Gene, UniProt, IntAct, Expasy Enzyme, GO, GOA, BioCyc, KEGG, Reactome and OMIM). Our solution is general, as it uses a flexible, modular and multilevel global data schema based on abstraction and generalization of integrated data features, and a set of automatic procedures for easing data integration and maintenance, also when the integrated data sources evolve in data content, structure and number. These procedures also assure consistency, quality and provenance tracking of all integrated data, and perform the semantic closure of the hierarchical relationships of the integrated biomedical ontologies. At http://www.bioinformatics.deib.polimi.it/GPKB/, a Web interface allows graphical easy composition of queries, although complex, on the knowledge base, supporting also semantic query expansion and comprehensive explorative search of the integrated data to better sustain biomedical knowledge extraction.