File Block Classification by Support Vector Machines
Sportiello, Luigi; Zanero, Stefano
2011-01-01
Abstract
Retrieving files without the support of file system structures is arguably essential for digital forensics. Files are typically stored as sequences of data blocks, which must be reassembled during retrieval. A common approach, among others, is file carving, which generally detects the original block sequences by means of signatures of known file headers and footers. This is challenging for fragmented files, where blocks belonging to different files may be interleaved. Classifying file blocks by file type based on their content can support a successful reconstruction. We propose to classify file blocks using Support Vector Machines (SVMs), and we study in depth the impact of appropriately selecting the features used in the classification process. We analyze several candidate features and test their performance over a large and representative collection of file blocks and file types. We find that SVM classifiers can achieve good accuracy and that one specific type of feature, based on the byte frequency distribution, performs well across almost all of the examined file types.
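To make the byte frequency distribution feature concrete, the following is a minimal sketch of how such a feature vector might be computed for a single block. It is an illustration only, not the authors' implementation: the function names, the 512-byte block size, and the normalization to relative frequencies are all assumptions, since the abstract does not specify them.

```python
from collections import Counter
from typing import List

# Assumed block size in bytes; the abstract does not state the one used.
BLOCK_SIZE = 512


def split_into_blocks(data: bytes, block_size: int = BLOCK_SIZE) -> List[bytes]:
    """Cut raw data into consecutive blocks (the last one may be shorter)."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def byte_frequency_distribution(block: bytes) -> List[float]:
    """Return a 256-element vector of relative byte-value frequencies.

    Element v is the fraction of bytes in the block equal to v.
    An empty block yields an all-zero vector (an assumption made here
    to avoid division by zero).
    """
    if not block:
        return [0.0] * 256
    counts = Counter(block)  # iterating over bytes yields ints in 0..255
    total = len(block)
    return [counts.get(value, 0) / total for value in range(256)]


if __name__ == "__main__":
    sample = b"%PDF-1.4 example content " * 50
    blocks = split_into_blocks(sample)
    features = [byte_frequency_distribution(b) for b in blocks]
    print(len(blocks), "blocks,", len(features[0]), "features per block")
```

Each 256-dimensional vector, labeled with the file type its block came from, could then serve as one training example for an SVM implementation (for instance `sklearn.svm.SVC` in scikit-learn, which is an assumption about tooling rather than the setup used in the paper).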