Numerous biomolecular data are available, but they are scattered in many databases and only some of them are curated by experts. Most available data are computationally derived and include errors and inconsistencies. Effective use of available data in order to derive new knowledge hence requires data integration and quality improvement. Many approaches for data integration have been proposed. Data warehousing seams to be the most adequate when comprehensive analysis of integrated data is required. This makes it the most suitable also to implement comprehensive quality controls on integrated data. We previously developed GFINDer (http://www.bioinformatics.polimi.it/GFINDer/), a web system that supports scientists in effectively using available information. It allows comprehensive statistical analysis and mining of functional and phenotypic annotations of gene lists, such as those identified by high-throughput biomolecular experiments. GFINDer backend is composed of a multi-organism genomic and proteomic data warehouse (GPDW). Within the GPDW, several controlled terminologies and ontologies, which describe gene and gene product related biomolecular processes, functions and phenotypes, are imported and integrated, together with their associations with genes and proteins of several organisms. In order to ease maintaining updated the GPDW and to ensure the best possible quality of data integrated in subsequent updating of the data warehouse, we developed several automatic procedures. Within them, we implemented numerous data quality control techniques to test the integrated data for a variety of possible errors and inconsistencies. Among other features, the implemented controls check data structure and completeness, ontological data consistency, ID format and evolution, unexpected data quantification values, and consistency of data from single and multiple sources. We use the implemented controls to analyze the quality of data available from several different biological databases and integrated in the GFINDer data warehouse. By doing so, we identified in these data a variety of different types of errors and inconsistencies; this enables us to ensure good quality of the data in the GFINDer data warehouse. We reported all identified data errors and inconsistencies to the curators of the original databases from where the data were retrieved, who mainly corrected them in subsequent updating of the original database. This contributed to improve the quality of the data available, in the original databases, to the whole scientific community.

Quality controls in integrative approaches to detect errors and inconsistencies in biological databases

GHISALBERTI, GIORGIO;MASSEROLI, MARCO;TETTAMANTI, LUCA
2010-01-01

Abstract

Numerous biomolecular data are available, but they are scattered in many databases and only some of them are curated by experts. Most available data are computationally derived and include errors and inconsistencies. Effective use of available data in order to derive new knowledge hence requires data integration and quality improvement. Many approaches for data integration have been proposed. Data warehousing seams to be the most adequate when comprehensive analysis of integrated data is required. This makes it the most suitable also to implement comprehensive quality controls on integrated data. We previously developed GFINDer (http://www.bioinformatics.polimi.it/GFINDer/), a web system that supports scientists in effectively using available information. It allows comprehensive statistical analysis and mining of functional and phenotypic annotations of gene lists, such as those identified by high-throughput biomolecular experiments. GFINDer backend is composed of a multi-organism genomic and proteomic data warehouse (GPDW). Within the GPDW, several controlled terminologies and ontologies, which describe gene and gene product related biomolecular processes, functions and phenotypes, are imported and integrated, together with their associations with genes and proteins of several organisms. In order to ease maintaining updated the GPDW and to ensure the best possible quality of data integrated in subsequent updating of the data warehouse, we developed several automatic procedures. Within them, we implemented numerous data quality control techniques to test the integrated data for a variety of possible errors and inconsistencies. Among other features, the implemented controls check data structure and completeness, ontological data consistency, ID format and evolution, unexpected data quantification values, and consistency of data from single and multiple sources. We use the implemented controls to analyze the quality of data available from several different biological databases and integrated in the GFINDer data warehouse. By doing so, we identified in these data a variety of different types of errors and inconsistencies; this enables us to ensure good quality of the data in the GFINDer data warehouse. We reported all identified data errors and inconsistencies to the curators of the original databases from where the data were retrieved, who mainly corrected them in subsequent updating of the original database. This contributed to improve the quality of the data available, in the original databases, to the whole scientific community.
2010
INF
File in questo prodotto:
File Dimensione Formato  
A43_JIB_2010_7(3)_119_1-13.pdf

Accesso riservato

: Post-Print (DRAFT o Author’s Accepted Manuscript-AAM)
Dimensione 460.22 kB
Formato Adobe PDF
460.22 kB Adobe PDF   Visualizza/Apri

I documenti in IRIS sono protetti da copyright e tutti i diritti sono riservati, salvo diversa indicazione.

Utilizza questo identificativo per citare o creare un link a questo documento: https://hdl.handle.net/11311/570146
Citazioni
  • ???jsp.display-item.citation.pmc??? ND
  • Scopus 11
  • ???jsp.display-item.citation.isi??? ND
social impact