The overwhelming amount of biomedical scientific texts calls for the development of effective language models able to tackle a wide range of biomedical natural language processing (NLP) tasks. The most recent dominant approaches are domain-specific models, initialized with general-domain textual data and then trained on a variety of scientific corpora. However, it has been observed that for specialized domains in which large corpora exist, training a model from scratch with just in-domain knowledge may yield better results. Moreover, the increasing focus on the compute costs for pre-training recently led to the design of more efficient architectures, such as ELECTRA. In this paper, we propose a pre-trained domain-specific language model, called ELECTRAMed, suited for the biomedical field. The novel approach inherits the learning framework of the general-domain ELECTRA architecture, as well as its computational advantages. Experiments performed on benchmark datasets for several biomedical NLP tasks support the usefulness of ELECTRAMed, which sets the novel state-of-the-art result on the BC5CDR corpus for named entity recognition, and provides the best outcome in 2 over the 5 runs of the 7th BioASQ-factoid Challange for the question answering task.

ELECTRAMed: a new pre-trained language representation model for biomedical NLP

Carlotta Orsenigo
2021-01-01

Abstract

The overwhelming amount of biomedical scientific texts calls for the development of effective language models able to tackle a wide range of biomedical natural language processing (NLP) tasks. The most recent dominant approaches are domain-specific models, initialized with general-domain textual data and then trained on a variety of scientific corpora. However, it has been observed that for specialized domains in which large corpora exist, training a model from scratch with just in-domain knowledge may yield better results. Moreover, the increasing focus on the compute costs for pre-training recently led to the design of more efficient architectures, such as ELECTRA. In this paper, we propose a pre-trained domain-specific language model, called ELECTRAMed, suited for the biomedical field. The novel approach inherits the learning framework of the general-domain ELECTRA architecture, as well as its computational advantages. Experiments performed on benchmark datasets for several biomedical NLP tasks support the usefulness of ELECTRAMed, which sets the novel state-of-the-art result on the BC5CDR corpus for named entity recognition, and provides the best outcome in 2 over the 5 runs of the 7th BioASQ-factoid Challange for the question answering task.
2021
Pre-trained language models
ELECTRA
Biomedical NLP
File in questo prodotto:
File Dimensione Formato  
ElectraMed_Orsenigo_2021.pdf

accesso aperto

: Pre-Print (o Pre-Refereeing)
Dimensione 128.31 kB
Formato Adobe PDF
128.31 kB Adobe PDF Visualizza/Apri

I documenti in IRIS sono protetti da copyright e tutti i diritti sono riservati, salvo diversa indicazione.

Utilizza questo identificativo per citare o creare un link a questo documento: https://hdl.handle.net/11311/1237608
Citazioni
  • ???jsp.display-item.citation.pmc??? ND
  • Scopus ND
  • ???jsp.display-item.citation.isi??? ND
social impact