RE.PUBLIC@POLIMI pubblicazioni di ricerca del Politecnico di Milano

The complexities introduced by compiler optimization have long stood as a significant obstacle in binary analysis and reverse engineering. Function inlining, in particular, complicates function recognition by replacing function calls with the entire body of the callee, mixing code from multiple functions. State-of-the-art approaches can identify inlined functions at basic block granularity, but cannot determine which instructions belong to each function and precisely deduce inlined boundaries. Without this information, further analyses such as decompilation cannot be performed effectively. This article presents Highliner, a novel approach that improves state-of-the-art approaches by identifying inline instances at instruction-level granularity. Highliner operates downstream of block-level detectors: given basic blocks reported by state-of-the-art approaches as belonging to a specific inlined function, it labels each instruction as Inlined or Not inlined and recovers the inlined-function boundaries. We treat the problem as a sequence tagging task typical of NLP and implement a learning-based technique involving instruction embedding and recurrent neural networks. We compile a dataset of open-source projects with different optimizations and use the DWARF debug information standard to construct labeled sequences of inline instructions. We use this dataset to train, validate, and test a sequence labeling architecture in which instructions are encoded via the pre-trained assembly language transformer PalmTree and then processed by an RNN-based classifier to produce binary predictions. When evaluated as a binary classifier, Highliner achieves an F1-score of 0.94 overall. In addition, when specifically tested on recognizing function boundaries, Highliner achieves an Accuracy of 0.82 on initial boundaries and 0.83 on final boundaries.

Highliner: Enhancing Binary Analysis through NLP-Based Instruction-Level Detection of C++ Inline Functions

Dall'Aglio, Lorenzo;Binosi, Lorenzo;Carminati, Michele;Zanero, Stefano;Polino, Mario

2025-01-01

Abstract

The complexities introduced by compiler optimization have long stood as a significant obstacle in binary analysis and reverse engineering. Function inlining, in particular, complicates function recognition by replacing function calls with the entire body of the callee, mixing code from multiple functions. State-of-the-art approaches can identify inlined functions at basic block granularity, but cannot determine which instructions belong to each function and precisely deduce inlined boundaries. Without this information, further analyses such as decompilation cannot be performed effectively. This article presents Highliner, a novel approach that improves state-of-the-art approaches by identifying inline instances at instruction-level granularity. Highliner operates downstream of block-level detectors: given basic blocks reported by state-of-the-art approaches as belonging to a specific inlined function, it labels each instruction as Inlined or Not inlined and recovers the inlined-function boundaries. We treat the problem as a sequence tagging task typical of NLP and implement a learning-based technique involving instruction embedding and recurrent neural networks. We compile a dataset of open-source projects with different optimizations and use the DWARF debug information standard to construct labeled sequences of inline instructions. We use this dataset to train, validate, and test a sequence labeling architecture in which instructions are encoded via the pre-trained assembly language transformer PalmTree and then processed by an RNN-based classifier to produce binary predictions. When evaluated as a binary classifier, Highliner achieves an F1-score of 0.94 overall. In addition, when specifically tested on recognizing function boundaries, Highliner achieves an Accuracy of 0.82 on initial boundaries and 0.83 on final boundaries.

Scheda breve

Scheda completa

Scheda completa (DC)

	Anno di pubblicazione
	
				2025
			
	Titolo della rivista
	
				ACM TRANSACTIONS ON PRIVACY AND SECURITY
			
	Appare nelle tipologie:
	
				01.1 Articolo in Rivista

File in questo prodotto:

Non ci sono file associati a questo prodotto.

I documenti in IRIS sono protetti da copyright e tutti i diritti sono riservati, salvo diversa indicazione.

Utilizza questo identificativo per citare o creare un link a questo documento: https://hdl.handle.net/11311/1295976

Citazioni

ND

ND

ND

social impact