Apícula: Static Detection of API Calls in Generic Streams of Bytes

D’Onghia, Mario;Salvadore, Matteo;Carminati, Michele;Polino, Mario;Zanero, Stefano
2022

Abstract

API functions often require the crafting of specific inputs and may return some output that is usually processed by the code that immediately follows their invocation. In this work, we claim that, for some APIs, those two stages are both frequently similar across different binaries and sufficiently unique to be fingerprinted. We build upon this intuition and present Apícula, a static analysis tool for identifying API calls in generic streams of bytes, such as memory dumps, network traffic, or object code files. In a nutshell, Apícula leverages the control flow graph of a binary to generate a set of fingerprints for all basic blocks that end with a call instruction. Those sets are then compared against a database of pre-computed fingerprints to establish whether any known API is being invoked. Because it applies to unstructured byte streams, Apícula can complement the reverse engineering process when this is carried out over memory dumps collected after a cyber-incident. Moreover, it can enable behavioral analysis in a fully static way, by identifying sequences of API calls even in non-executable binaries. We provide a series of experiments that are instrumental (1) in demonstrating that the same fingerprints computed for specific APIs can be observed across different binaries and (2) in identifying a subset of the Windows APIs whose usage can be detected by Apícula with sufficient precision and sensitivity, focusing in particular on malicious binaries. Furthermore, we illustrate two techniques that can be used to validate different fingerprint databases in case someone wants to detect APIs belonging to libraries different from those that we consider in this work. In particular, we prove that fingerprints associated with different APIs are remarkably dissimilar and therefore can be employed for distinguishing between APIs. More specifically, we find that fingerprint sets associated with different APIs present on average a Jaccard index of 0.000125; in comparison, the average Jaccard index between fingerprint sets associated with the same API is 0.29 for binaries compiled with the same optimization level and 0.07 for binaries compiled with different optimization levels. Moreover, we show that we can build databases of fingerprints that are sufficiently comprehensive to identify specific APIs in unseen binaries. More precisely, we identify 228 different APIs among the Windows APIs (including the C run-time libraries) whose usage can be detected by Apícula with sensitivity greater than 80% and a false discovery rate lower than 5%.
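
The metrics named in the abstract can be illustrated with a minimal Python sketch. It is not the tool's implementation: the abstract does not say how fingerprints are derived or how a match is decided, so the fingerprint strings, the `best_match` helper, and the `threshold` parameter are hypothetical, and the API names only label toy data. The sketch shows the Jaccard index between two fingerprint sets, a lookup against a small database, and the standard definitions of sensitivity and false discovery rate.

```python
from typing import Dict, FrozenSet, Optional, Tuple

Fingerprints = FrozenSet[str]


def jaccard(a: Fingerprints, b: Fingerprints) -> float:
    """Jaccard index |a ∩ b| / |a ∪ b|; taken as 0.0 when both sets are empty."""
    union = a | b
    if not union:
        return 0.0
    return len(a & b) / len(union)


def best_match(
    query: Fingerprints,
    database: Dict[str, Fingerprints],
    threshold: float,
) -> Optional[Tuple[str, float]]:
    """Return (api_name, score) for the most similar database entry,
    or None if no entry reaches the (hypothetical) threshold."""
    best: Optional[Tuple[str, float]] = None
    for api_name, known in database.items():
        score = jaccard(query, known)
        if score >= threshold and (best is None or score > best[1]):
            best = (api_name, score)
    return best


def sensitivity(tp: int, fn: int) -> float:
    """True positives over all real positives; assumes tp + fn > 0."""
    return tp / (tp + fn)


def false_discovery_rate(tp: int, fp: int) -> float:
    """False positives over all reported positives; assumes tp + fp > 0."""
    return fp / (tp + fp)


# Toy database: placeholder fingerprint strings, not real Apícula output.
database = {
    "CreateFileW": frozenset({"f1", "f2", "f3", "f4"}),
    "VirtualAlloc": frozenset({"g1", "g2", "g3"}),
}
# Fingerprints of one hypothetical basic block ending with a call instruction.
query = frozenset({"f1", "f2", "x9"})

print(best_match(query, database, threshold=0.2))
```

With these toy sets, the query shares two of five distinct fingerprints with the first entry and none with the second, so only the first entry clears the threshold. Under the definitions above, the abstract's detection criteria correspond to `sensitivity(...) > 0.8` and `false_discovery_rate(...) < 0.05`.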

Files in this record:
apicula_final.pdf: main article, pre-print (pre-refereeing) version; open access; Adobe PDF, 978.7 kB

Use this identifier to cite or link to this document: http://hdl.handle.net/11311/1216855