RE.PUBLIC@POLIMI pubblicazioni di ricerca del Politecnico di Milano

We investigate a variant of the problem of automatic keyphrase extraction from scientific documents, which we define as Scientific Domain Knowledge Entity (SDKE) extraction. Keyphrases are noun phrases important to the documents themselves. In contrast, an SDKE is text that refers to a concept and can be classified as a process, material, task, dataset etc. A SDKE represents domain knowledge, but is not necessarily important to the document it is in. Supervised keyphrase extraction algorithms using non-sequential classifiers and global measures of informativeness (PMI, tf-idf) have been used for this task. Another approach is to use sequential labeling algorithms with local context from a sentence, as done in the named entity recognition. We show that these two methods can complement each other and a simple merging can improve the extraction accuracy by 5-7 percentiles. We further propose several heuristics to improve the extraction accuracy. Our preliminary experiments suggest that it is possible to improve the accuracy of the sequential learner itself by utilizing the predictions of the non-sequential model.

HESDK: A Hybrid Approach to Extracting Scientific Domain Knowledge Entities

Wu, Jian;Choudhury, Sagnik Ray;Chiatti, Agnese;Liang, Chen;Giles, C. Lee

2017-01-01

Abstract

We investigate a variant of the problem of automatic keyphrase extraction from scientific documents, which we define as Scientific Domain Knowledge Entity (SDKE) extraction. Keyphrases are noun phrases important to the documents themselves. In contrast, an SDKE is text that refers to a concept and can be classified as a process, material, task, dataset etc. A SDKE represents domain knowledge, but is not necessarily important to the document it is in. Supervised keyphrase extraction algorithms using non-sequential classifiers and global measures of informativeness (PMI, tf-idf) have been used for this task. Another approach is to use sequential labeling algorithms with local context from a sentence, as done in the named entity recognition. We show that these two methods can complement each other and a simple merging can improve the extraction accuracy by 5-7 percentiles. We further propose several heuristics to improve the extraction accuracy. Our preliminary experiments suggest that it is possible to improve the accuracy of the sequential learner itself by utilizing the predictions of the non-sequential model.

Scheda breve

Scheda completa

Scheda completa (DC)

	Anno di pubblicazione
	
				2017
			
	Titolo del libro
	
				2017 ACM/IEEE JOINT CONFERENCE ON DIGITAL LIBRARIES (JCDL 2017)
			
	Parole chiave
	
				conditional random field
digital library
keyphrase extraction
natural language processing
scientific domain knowledge entity
			
	Appare nelle tipologie:
	
				04.1 Contributo in Atti di convegno

File in questo prodotto:

Non ci sono file associati a questo prodotto.

I documenti in IRIS sono protetti da copyright e tutti i diritti sono riservati, salvo diversa indicazione.

Utilizza questo identificativo per citare o creare un link a questo documento: https://hdl.handle.net/11311/1299616

Citazioni

ND

6

2

ND

social impact