Highliner: Enhancing Binary Analysis through NLP-Based Instruction-Level Detection of C++ Inline Functions
Dall'Aglio, Lorenzo;Binosi, Lorenzo;Carminati, Michele;Zanero, Stefano;Polino, Mario
2025-01-01
Abstract
The complexities introduced by compiler optimization have long been a significant obstacle in binary analysis and reverse engineering. Function inlining, in particular, complicates function recognition by replacing function calls with the entire body of the callee, mixing code from multiple functions. State-of-the-art approaches can identify inlined functions at basic-block granularity, but they cannot determine which instructions belong to each function, and so cannot precisely deduce the boundaries of inlined code. Without this information, further analyses such as decompilation cannot be performed effectively. This article presents Highliner, a novel approach that improves on the state of the art by identifying inline instances at instruction-level granularity. Highliner operates downstream of block-level detectors: given the basic blocks that such detectors report as belonging to a specific inlined function, it labels each instruction as Inlined or Not inlined and recovers the inlined-function boundaries. We treat the problem as a sequence tagging task of the kind common in NLP and implement a learning-based technique built on instruction embedding and recurrent neural networks. We compile a dataset of open-source projects at different optimization levels and use DWARF debug information to construct labeled sequences of inlined instructions. We use this dataset to train, validate, and test a sequence labeling architecture in which instructions are encoded via the pre-trained assembly language transformer PalmTree and then processed by an RNN-based classifier to produce binary predictions. When evaluated as a binary classifier, Highliner achieves an overall F1-score of 0.94. When specifically tested on recognizing function boundaries, Highliner achieves an accuracy of 0.82 on initial boundaries and 0.83 on final boundaries.
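To make the task formulation concrete, the following is a minimal sketch of how per-instruction binary tags can be turned into inlined-region boundaries and scored with F1. It is not the authors' implementation: the function names `inline_spans` and `f1_score`, the toy label sequences, and the definition of a boundary as the first or last instruction of a contiguous run of Inlined tags are all assumptions made for illustration.

```python
from typing import List, Tuple


def inline_spans(labels: List[int]) -> List[Tuple[int, int]]:
    """Return (first, last) index pairs for each maximal run of 1s.

    Assumption: 1 means "Inlined", 0 means "Not inlined", and a boundary
    is the first or last instruction of a contiguous inlined run.
    """
    spans = []
    start = None
    for i, tag in enumerate(labels):
        if tag == 1 and start is None:
            start = i                      # run begins here
        elif tag == 0 and start is not None:
            spans.append((start, i - 1))   # run ended on the previous index
            start = None
    if start is not None:                  # run reaches the end of the sequence
        spans.append((start, len(labels) - 1))
    return spans


def f1_score(truth: List[int], pred: List[int]) -> float:
    """F1 of the positive (Inlined) class over aligned tag sequences."""
    tp = sum(1 for t, p in zip(truth, pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(truth, pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(truth, pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical tags for eight instructions (not data from the paper).
truth = [0, 1, 1, 1, 0, 0, 1, 1]
pred = [0, 1, 1, 0, 0, 0, 1, 1]

true_spans = inline_spans(truth)
pred_spans = inline_spans(pred)
score = f1_score(truth, pred)
```

In this toy case the predicted tags miss one instruction at the end of the first inlined region, so the initial boundary of that region is recovered while its final boundary is not. This is the kind of error that a per-instruction F1-score alone does not expose, and it may be why boundary accuracy is reported separately.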


