Feature Selection for Imbalanced Data with Deep Sparse Autoencoders Ensemble

Class imbalance is a common issue in many domain applications of learning algorithms, and oftentimes in these same domains it is much more relevant to correctly classify and profile minority class observations. This need can be addressed by Feature Selection (FS), which offers several further advantages, such as decreasing computational costs and aiding inference and interpretability. However, traditional FS techniques may become sub-optimal in the presence of strongly imbalanced data. To achieve the advantages of FS in this setting, we propose a filtering FS algorithm that ranks feature importance on the basis of the Reconstruction Error of a Deep Sparse AutoEncoders Ensemble (DSAEE). We use each DSAE, trained only on the majority class, to reconstruct both classes. From the analysis of the aggregated Reconstruction Error, we determine the features on which the minority class presents a different distribution of values w.r.t. the overrepresented one, thus identifying the most relevant features to discriminate between the two. We empirically demonstrate the efficacy of our algorithm in several experiments on high-dimensional datasets of varying sample size, showcasing its capability to select relevant and generalizable features to profile and classify the minority class, outperforming other benchmark FS methods. We also briefly present a real application in radiogenomics, where the methodology was applied successfully.


INTRODUCTION
A well-known problem of many real-life applications of statistical models and machine learning algorithms is class imbalance [5]. Examples can be found in many sensitive domains such as medicine [36], especially in the case of rare disease classification tasks [21], fraud detection [44], fault detection [52], cyber security [47] and many others [2]. All these domains share the same peculiarity: the importance of correctly identifying and profiling the minority class. In these contexts, a false negative is usually much more expensive w.r.t. a false positive. A straightforward example comes from the medical field, where a missed diagnosis is in many cases extremely risky for the patient's health and costly for the healthcare system [4,42]. Moreover, on top of the precise classification of minority class observations, domain experts are oftentimes interested in understanding which specific features (i.e. characteristics of their patients, customers, etc.) should be kept under control or investigated to drive decisions or future research. The importance of identifying the discriminant characteristics of the minority class is particularly evident in the clinical field, where an inaccurate feature selection can lead to an inaccurate diagnosis [24]. This observation holds for Genome Wide Association Studies for precision medicine [23], where the clinical interest lies in detecting the traits associated with a specific disease [6]. Answering this question, rather than merely classifying observations, gets harder as the number of features and the non-linearity of their interrelationships rise, driving growth in model complexity. One way of addressing this need is through Feature Selection (FS) techniques.
In general, FS helps in identifying highly influential features that provide intrinsic information and discriminant power for class separability, while decreasing computational costs, aiding inference and giving a better understanding of the model representation [17,43]. However, it has been argued that traditional FS techniques become sub-optimal or even prejudicial to classification effectiveness when the classes are strongly imbalanced [51,46]. In [46], the authors demonstrate through a simulation study how the overlap of the classes' distributions increases after feature selection because of the strong bias towards the majority class, hindering classification performance. Therefore, to achieve the advantages granted by FS, a method tailored to address imbalanced settings without affecting classification accuracy is desirable. Indeed, we argue that an FS method robust to class imbalance can address both the need for accurate classification of the underrepresented class and the need to identify the specific pieces of information that are most relevant for its identification. In other words, by selecting the most informative features to discriminate between classes, such an FS method can serve as a useful tool for the task of minority class profiling. In Section 4.5 we will briefly describe a real case study where the methodology presented in this work successfully played this role in a complex research setting. Nonetheless, although FS for imbalanced classification has recently been gaining momentum, the number of reported works on the subject is still limited [2], and few contributions have dealt with this multi-faceted problem [21].
For these reasons, in this work we focus on developing a novel FS method tailored to identify relevant features to discriminate the minority from the majority class in strongly imbalanced binary classification settings. To accomplish this task, we propose a filtering algorithm that ranks feature importance on the basis of a Deep Sparse AutoEncoders Ensemble (DSAEE).
From a methodological standpoint, the value provided by our proposal comes from the combination of two aspects: on the one hand, the choice of a particular type of AutoEncoder (AE) [22] as the underlying model; on the other, the inclusion of this model within an ensemble algorithm. Indeed, AEs are Neural Network (NN) models capable of flexibly capturing non-linear relationships among features [20]. These models have been exploited as feature selectors but, to the best of our knowledge, never tailored to class imbalance (cfr. Section 2.2). Here we claim that they can be effectively exploited as feature selectors specifically for imbalanced settings if we consider the duality between imbalanced minority class classification and outlier detection. Indeed, as the minority class is rare w.r.t. the majority one, its observations might be considered outliers w.r.t. the normal population (inliers) constituted by the overrepresented class. AEs were previously recognized as powerful reconstruction-based outlier detection methods [33,39,10,12,26,41] that score outliers by aggregating the Reconstruction Error (RE) of each observation. In this work, we propose to repurpose this reconstruction-based outlier detection approach to solve the problem of feature selection in imbalanced settings instead. Indeed, we apply an AE trained only on majority class observations to reconstruct both majority and minority classes: from the aggregation of the REs for each feature within each class, we determine where the minority class has a different distribution of values w.r.t. the majority class, thus identifying the most relevant features to discriminate between the two classes.
However, there exists the risk that a single AE fails to capture the correlations among features, especially in high-dimensional settings [12], as well as a natural variance in results that might depend on the data, the design of the model and the local search for parameters typical of many Machine Learning (ML) methods. By using an ensemble approach like the one proposed in this work, and taking a central estimator of the RE, like the mean or the median, this variance is reduced [14,10]. Nevertheless, in order to make ensemble learning methods work, the individual ensemble components must be adequately diverse [10,41]. This is achieved in our proposal by designing the algorithm s.t. each ensemble component can capture different aspects of the underlying majority class distribution. In particular, the novelty of our approach resides in fostering this diversity among components through (i) a sampling procedure tailored for imbalanced settings that builds different training and test sets to supply to each learner, and (ii) a sparsity constraint imposed on the models.
In light of the above, the contributions of this work are multiple. We enlarge the limited literature on FS tailored to deal with the daunting real-life issue of class imbalance. We do so by presenting an algorithm that repurposes the power of AEs as outlier detectors for reconstruction-based, minority-class-specific feature selection, which is a novelty for AE-based feature selectors in general. Finally, we robustify the selection through an ensemble approach, designed to foster the diversity of its components and accuracy on the minority class.
The remainder of the paper is organized as follows. In Section 2 we discuss some related works, strengthening our positioning w.r.t. other approaches; in Section 3 we provide some background on DSAEs, then we describe and discuss the proposed DSAEE algorithm in detail. In Section 4 we describe a series of experiments and proofs of concept developed on several datasets of varying sample size and dimensionality: first we empirically validate the good performance of the selected feature subsets despite the dimensionality reduction (Section 4.2); then, we compare our proposed methodology with other state-of-the-art and more traditional FS methods (Section 4.3). Additionally, we display some visualizations of the selected features to demonstrate their meaningfulness in discriminating the minority from the majority class (Section 4.4), and finally we briefly describe an application on real clinical data (Section 4.5). In Section 5 we highlight some relevant considerations on the proposed approach, and conclude with some final remarks and possible extensions.

RELATED WORKS
As stated in the introduction, in this paper we aim at presenting a novel FS method tailored to tackle class imbalance. Indeed, the method is designed to select a subset of informative features to reduce the impact of the strong imbalance between minority and majority classes on the classification performance. To frame the position of our proposal from a methodological point of view, in this section we first describe other works developing methods to this aim. Then, as we exploit AEs as the building blocks of our ensemble method for FS, we report on studies that utilized these models for this task, irrespective of the classes' distribution.

Feature Selection for Imbalanced Data
In general, there are three approaches to apply FS algorithms in classification: wrapper, embedded and filter methods [49]. Wrapper methods [30] make the FS revolve around the optimization of the performance of a predetermined classifier: the feature subset that maximises the defined performance metric is selected. In an imbalanced setting, the choice of the optimization metric is crucial. Indeed, among the available examples in the literature, some exploited the area under the ROC curve as a metric to select the best mix of features [11], others the F-measure [2,49,32], while in [34] the authors exploit, among others, a balanced loss function which takes the weighted average of false positives and false negatives. Despite their optimal results in terms of classification accuracy, wrapper methods are generally computationally expensive, and there is no guarantee of reaching a global optimum. Embedded methods [27] overcome this issue by determining the feature subset autonomously during classifier learning, for instance by including a regularization term in the loss function [37]. However, to the best of our knowledge, no embedded method has been designed specifically to tackle class imbalance. A hybrid embedded and wrapper approach is instead proposed in [31]. Nonetheless, all the aforementioned methods are strictly bound to a specific classifier. Filter methods [40] are pre-processing algorithms that measure the usefulness of the feature subset for classification by working on the original data without involving any classifier. They usually rank features' importance on the basis of suitable metrics, some specifically tailored for imbalanced classification problems [46,51,13]. Our proposal belongs to this classifier-agnostic type of algorithms.

AutoEncoder-based Feature Selection
We now provide a brief overview of how AutoEncoders (AEs) have been employed as feature selectors in the available literature. As mentioned, AEs [22] are a particular class of NNs widely used for learning data representations [7], dimensionality reduction [22] and outlier/anomaly detection [1,33,39,10,12,26,41]. This powerful representation learning method has recently been exploited for reconstruction-based feature selection as well. For instance, in [9] AEs are exploited as an unsupervised feature selection method, masking input features and using the Reconstruction Error (RE) of the masked input features to compute feature weights in a moving average manner. In [20] the authors combine AE regression and a weight penalization on the input layer: feature importance is then derived from the value of the weights associated to each feature. Other sparsity-based unsupervised approaches can be found in [16] and [48]. Finally, in the most recent work [8], the authors propose the Concrete AutoEncoder Feature Selector (CAEFS), which exploits the Concrete distribution to differentiate through the reconstruction loss and selects the input features that minimize it. All these approaches share an unsupervised setting and have demonstrated their potential as feature selectors against other state-of-the-art techniques. Nonetheless, they all train one AE model only, incurring the risks discussed in Section 1. Moreover, they are all FS methods designed for balanced classification, and this balanced selection of features has been argued to be potentially harmful in strongly imbalanced settings [51,46]. What distinguishes our DSAEE from the available examples of AE-based feature selectors is the ensemble approach to the problem (where each AE is one of a set of weak learners) and the tailoring of each model's training procedure, inspired by outlier detection methods, to specifically approach imbalanced datasets.

DSAE ENSEMBLE (DSAEE) FOR MINORITY CLASS FEATURE SELECTION
In Section 3.1 we provide some background on the DSAE components and detail the regularization we impose on the models to foster diversity among the components. In Section 3.2 we detail how the proposed algorithm encapsulates each component into a tailored training procedure to identify the most relevant features to discriminate the minority class in imbalanced settings.

Background: AutoEncoders and Deep Sparse AutoEncoders
An AE [22] is a NN trained to attempt to copy its input to its output. Let X ∈ ℝ^(N×J) be the input data matrix, X = {x_1, ..., x_N} a set of training vectors x_i (i ∈ {1, ..., N}), each characterized by J features. The shallow version of an AE is constituted by an input layer with J nodes, a hidden layer with H nodes (H usually smaller than J) that describes a code used to represent the input, and an output layer of size J. The network can be seen as constituted by two parts: an encoder and a decoder. The encoder function h = f(Wx + b) encodes each input vector x into an encoded version of itself of size H. Here f is usually non-linear and is referred to as the activation function, W ∈ ℝ^(H×J) is called the weight matrix and b is an H-dimensional bias vector. The decoder maps the encoded vector back to the J-dimensional space, in most cases using a squashing non-linear function x̂ = g(W′h + b′), with parameters W′ ∈ ℝ^(J×H) and b′ ∈ ℝ^J. The model is trained through gradient descent of the loss function L(x, x̂), where L is typically the Mean Squared Reconstruction Error (MSRE), i.e. the mean squared Euclidean distance between the input values and the reconstructed values for each observation. Each training observation x_i is thus mapped to a corresponding h_i, which is then mapped to a reconstruction x̂_i s.t. x̂_i ≈ x_i.
To expand the shallow network to a deep version, the formulation is similar, with the output of one layer being the input of the following layer. Usually, AEs are built with constraints that force them not only to replicate the input, but to learn effective representations of such input in the hidden layer. One way to obtain useful representations from the autoencoder is to introduce sparsity in the code layer (Sparse AutoEncoders, SAE) by imposing a regularization term in the loss function. To do so, the model adds a sparsity penalty Ω(h) on the hidden layer h (or the most internal layer, in the case of deep architectures) to the reconstruction error:

L(x, x̂) + Ω(h).    (1)

The regularization can take various forms. In a deep architecture (Deep Sparse AutoEncoder, DSAE), let us consider h_i^(l) as the activation of the most internal hidden layer (l) for the i-th observation vector x_i, i.e. the value of the function h_i^(l) = f^(l)(W^(l) h_i^(l−1) + b^(l)). One way of obtaining a sparse representation is to add a penalty term that penalizes the ℓ1 norm of the vector h_i^(l) for each observation i, controlled by a parameter λ, i.e. Ω(h_i^(l)) = λ ‖h_i^(l)‖_1.
The parameter λ can be optimized through grid search or can be arbitrarily chosen in the design phase of the model. This penalization term forces the model to activate the minimum number of hidden nodes needed to reconstruct the input. Paired with the input sampling described below, it increases the diversity among the learners in the ensemble. Moreover, it reduces the need for tailored choices or expensive optimization to define the proper architecture.
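As a concrete sketch (plain NumPy, not the authors' implementation; the tanh activations, the function names and the default λ are illustrative choices), the encoder, the decoder and the regularized loss of Eq. (1) for a shallow sparse AE can be written as:

```python
import numpy as np

def encode(x, W, b):
    """h = f(Wx + b): non-linear encoding of a J-dim input into H dims."""
    return np.tanh(W @ x + b)

def decode(h, W2, b2):
    """x_hat = g(W'h + b'): map the H-dim code back to the J-dim space."""
    return np.tanh(W2 @ h + b2)

def sparse_loss(x, x_hat, h, lam=1e-3):
    """Eq. (1): mean squared reconstruction error plus the L1 sparsity
    penalty Omega(h) = lam * ||h||_1 on the (innermost) code layer."""
    return np.mean((x - x_hat) ** 2) + lam * np.sum(np.abs(h))
```

In a deep architecture the same loss applies, with `h` taken as the activation of the most internal layer; the λ penalty pushes most code units towards zero, which is what fosters diversity among the ensemble components.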

The Ensemble Algorithm
Let us consider the binary supervised learning setup with a training set of (input, target) pairs (x_i, y_i), where y_i is the target taking values in {0, 1} and X ∈ ℝ^(N×J) is the input matrix. We consider the supervised learning problem to be imbalanced: the number of observations in the minority class (O = {x_i | y_i = 1}) is relevantly smaller than the number of observations in the majority class (M = {x_i | y_i = 0}). Our final objective consists in building a feature set F, with |F| < J (from now on the notation |⋅| will represent the cardinality of a set), selecting the most relevant features to discriminate the minority from the majority class. We therefore define X_O ∈ ℝ^(|O|×J) as the minority class observations and X_M ∈ ℝ^(|M|×J) as the majority class ones.
With the intention of building an ensemble of different learners from which to aggregate information to rank features, we first develop a tailored sampling procedure, inspired by outlier detection approaches, to train each learner on a different sample of data, selected with the rationale detailed in the following and schematized in Figure 1(a). In particular, from X_O and X_M and the respective outcomes y_O and y_M we generate a training set X_train and a test set X_test. The test set contains 2|O| data points, including all the minority class observations and an equal number of majority ones randomly drawn from M. The training set is instead composed of the majority class data excluded from the test set. This structure of the two datasets allows us to train each DSAE learner in an unsupervised fashion only on the overrepresented population, and to test its performance when facing both majority and minority class examples, so that we can compare the RE made on the two populations. The rationale behind this sampling procedure is that DSAEs trained to reconstruct normal observations only (i.e. the majority class) will make a higher RE when tested on outlier observations (i.e. minority class examples) never experienced during training. Once the two datasets are built, we train each DSAE on X_train to minimize the loss function formulated in (1). Then, we supply X_test, collect the reconstructed matrix X̂_test and, for each x_p in the test set, with p ∈ {1, ..., P = 2|O|}, we compute the vector of RE as the element-wise squared difference l_p = (x_p − x̂_p)². We thus obtain a matrix of RE, R = (X_test − X̂_test)², R ∈ ℝ^(P×J), with one row per observation p and the features on the columns, which we label including y_test.
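The sampling scheme and the per-feature RE computation can be sketched in a few lines of NumPy (illustrative code, not the authors' implementation; the function names are hypothetical, and the fitted model's reconstruction is passed in as an argument):

```python
import numpy as np

def build_split(X_maj, X_min, rng):
    """Sampling scheme of Figure 1(a): the test set holds all |O| minority
    observations plus |O| majority observations drawn at random; the
    training set keeps the remaining majority observations only."""
    idx = rng.permutation(len(X_maj))
    test_idx, train_idx = idx[:len(X_min)], idx[len(X_min):]
    X_train = X_maj[train_idx]
    X_test = np.vstack([X_min, X_maj[test_idx]])
    y_test = np.concatenate([np.ones(len(X_min)), np.zeros(len(X_min))])
    return X_train, X_test, y_test

def re_matrix(X_test, X_hat):
    """Element-wise squared reconstruction error R = (X_test - X_hat)^2,
    one row per tested observation, one column per feature."""
    return (X_test - X_hat) ** 2
```

In the full algorithm, `X_hat` would be the output of a DSAE trained on `X_train` only, so minority rows of `R` are expected to carry higher values.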
For B ensemble learners included in the algorithm, we produce B sampled training and test sets and concatenate the B matrices R, building the final RE matrix Q = {(l_1, y_1), ..., (l_K, y_K)} ∈ ℝ^(K×(J+1)), where K = BP is the total number of tested observations (now k ∈ {1, ..., K}) and (J + 1) is the number of features plus the label associated to each observation. As previously mentioned, we expect each AE to make a higher average RE on the observations originally belonging to the minority class, which were not evaluated during training. We can also consider each value l_kj in the vector l_k, i.e. the RE committed on feature j for observation k. In this case, if the observation belongs to O, we would expect the model to make a higher RE on the features where the minority class has a significantly different distribution of values w.r.t. the majority one. For a schema of the algorithm described in the following, refer to Figure 1(b). In order to select the most representative features to discriminate between minority and majority class, we subdivide the vectors l_k ∈ Q into two matrices: one composed of minority class RE (Q_O) and the other of majority class RE (Q_M). From these sets we can estimate the vectors of average RE per feature per group, l̄_O and l̄_M, both belonging to ℝ^J, where each element is computed as

l̄_{O,j} = (1/K_O) Σ_{k ∈ O} l_kj,    l̄_{M,j} = (1/K_M) Σ_{k ∈ M} l_kj,

and K_O = K_M = K/2 is the number of both minority and majority class examples in Q. Once we have computed the class-specific average REs per feature, we can proceed to the feature selection by studying how the RE of each feature varies between classes.
To select only the features where the difference in RE is remarkable (i.e. where the minority class is notably distant from the majority one), we first compute the vector of RE differences

Δ = l̄_O − l̄_M.

We observe the distribution of values taken by Δ to understand how distant minority class features are w.r.t. majority class ones. We can therefore define a quantile threshold q on the Δ distribution. The observed Δ values above the defined quantile (Δ^(q)) are considered relevant, and are therefore selected by the algorithm. In other words, we build the set of selected features including only the features with the highest difference in RE between the classes:

F = {j | Δ_j > Δ^(q)}.

From the original dataset X we can therefore extract a subset of features to either analyze per se or feed to any classifier. There is an inverse relation between q and the number of selected features: the higher the q, the lower the number of selected features. Algorithm 1 in Figure 1(c) reports the pseudo-code of the whole FS procedure.
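The selection step itself reduces to a per-feature mean difference and a quantile cut, as in this minimal sketch (illustrative NumPy code; `Q` is the concatenated RE matrix and `y` its label column):

```python
import numpy as np

def select_features(Q, y, q=0.9):
    """Keep the features whose gap in mean RE between minority (y == 1)
    and majority (y == 0) test observations exceeds the q-quantile of
    the Delta vector, i.e. F = {j | Delta_j > Delta^(q)}."""
    delta = Q[y == 1].mean(axis=0) - Q[y == 0].mean(axis=0)
    return np.flatnonzero(delta > np.quantile(delta, q)), delta
```

Raising `q` shrinks the selected set `F`, which is the inverse relation discussed above.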

Computational Complexity
Each DSAEE component has a complexity O(NWE) dependent on N (the number of observations in the data matrix), W (the number of weights in the network) and E (the number of epochs, or iterations, in the training). The complexity of training an ensemble of B DSAEs becomes ∼ O(B · NWE), growing linearly with the number of trained models. Both the number of employed components B and the architectural choices impacting W and E can be optimized to reduce training time and improve results as well. Moreover, the ensemble training can be easily parallelized, thus significantly cutting training time.
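Since the B components are independent, the parallelization is embarrassingly simple; a stdlib-only sketch (the `train_component` body is a hypothetical stand-in for one DSAE fit, not the authors' code):

```python
from concurrent.futures import ThreadPoolExecutor

def train_component(seed):
    """Hypothetical stand-in for fitting one DSAE component on its own
    sampled training set (an O(NWE) job); here it just returns its seed."""
    return seed

def train_ensemble(B, max_workers=4):
    """Run the B independent component fits concurrently: total work is
    still ~O(B*NWE), but wall-clock time shrinks with the worker count."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(train_component, range(B)))
```

In practice each worker would return its RE matrix R, to be concatenated into Q afterwards.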

EXPERIMENTS
To study the performance of the DSAEE and to test the validity of the claims raised in the previous sections we carried out several empirical evaluations. In particular, we were interested in testing the capability of our algorithm to select even extremely small subsets of features while keeping the classification performance sufficiently high, especially on the minority class. This evaluation was carried out in settings of varying dimensionality and sample size (see Section 4.2). Moreover, we compared the classification performance of our method against some benchmark FS algorithms (Section 4.3) and finally, we investigated in an interpretable and visual way the meaningfulness of the selected features and their capability to provide useful insights to discriminate between minority and majority classes (Section 4.4). To conclude, we also provide a brief description of a real data application in the challenging field of radiogenomics (see Section 4.5). Through this analysis, we highlight the relevant impact that we are bringing in terms of minority class profiling in complex real-life research scenarios.

Datasets and Performance Measures
For all the aforementioned numerical experiments we decided to adopt freely distributed datasets to make results accessible and reproducible. Moreover, some peculiar characteristics of each dataset allowed us to showcase different aspects of our algorithm and discuss its potential when applied to multifaceted scenarios. Note that the datasets exploited in our experiments were not originally imbalanced and in most cases were meant for multiclass classification problems. As a consequence, a preliminary subsetting of the chosen data was conducted. In the following, we list the adopted datasets and describe in detail the dataset-building choices we made for each of them. For all datasets, we selected one of the classes as the majority class, and we undersampled another class to represent the minority category. From the derived datasets, we extracted one subset on which we applied our feature selection method (Feature Selection DataSet, FSDS), while the remainder was held out to evaluate the classification accuracy of the selected features (Classification DataSet, CDS). In Table 1 we report all datasets, their composition, and the type of experiment they were exploited for.
1. ISOLET [15] (N = 370; J = 617). It consists of preprocessed speech data of people pronouncing the names of the letters in the English alphabet, and is widely used as a benchmark in the feature selection literature. Each feature is one of the 617 quantities produced as a result of the preprocessing. We chose class 'A' as the majority class, and 'B' as the minority one. Given the small number of observations available per class, this dataset allowed us to test the applicability of our algorithm in high-dimensionality, small-sample-size settings.
2. GISETTE [18] (N = 3,300; J = 5,000). This dataset was built for the NIPS 2003 feature selection challenge. The whole dataset contains 6,000 observations equally split between classes, with 5,000 features (50% of which are probes with no predictive power). We created 5 datasets including all 3,000 majority class observations and 300 randomly sampled minority class observations (9.05%), and we split them into FSDS and CDS according to a 75/25 ratio.
3. Epileptic Seizure [3] (N = [11,500; 7,300]; J = 178). In this functional dataset, each data point represents 178 seconds of EEG recording for one of the 500 patients in the study. Each of the 178 features is the value of the EEG at that timestamp. The label indicates whether the EEG is recording seizure activity ('Y') or not ('N'). This dataset was originally imbalanced, but we decided to increase the complexity by subsampling the minority class further (cfr. Table 1).
4. Fashion MNIST [45] (N = [7,350; 7,300]; J = 784). This dataset is composed of 28x28 grayscale images of clothing. To test our model we built two datasets with different imbalance rates. T-shirts were selected as the majority class, with coats as the minority one for the first dataset (∼5% of the whole dataset, with 7,350 total observations) and pullovers for the second (∼4%, with N = 7,300).
5. MNIST [28] (N = 8,292; J = 784). This dataset is composed of 28x28 grayscale images of hand-written digits. We selected two quite overlapping classes to test our model: the '7' digit class as the minority class and the '1' digit class as the majority one. This dataset, together with the two extracted from Fashion MNIST, simulates a setting of extreme imbalance (below a 95:5 ratio) and moderately high dimensionality, but with a large sample size.
It should be noted that the proposed algorithm is meant to be applied to features that do not present any dependence (i.e. the order of the features is irrelevant). Its applicability to image datasets is guaranteed by the fact that all images are centered, allowing us to meaningfully treat each pixel as an independent feature. The choice to add image datasets to these experiments derives from both their dimensionality and the clear readability of their results, that allow for visually investigating the selected features by representing them as pixels.
To evaluate the classification performance in an imbalanced setting, we decided not to adopt the classical accuracy on both classes. Instead, we chose the Sensitivity metric (i.e. the ratio of true positives to the sum of true positives and false negatives for observations belonging to the minority class) and the Area Under the Receiver Operating Characteristic (AUROC), which estimates the performance of a binary classifier by comparing false positive rates with true positive rates, and is a widely used metric to evaluate a model's capability to correctly classify both classes, especially in imbalanced settings.
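Both metrics can be computed without external dependencies; the sketch below (illustrative, plain NumPy) implements Sensitivity as TP / (TP + FN) and AUROC through its rank-statistic (Mann-Whitney) formulation:

```python
import numpy as np

def sensitivity(y_true, y_pred):
    """TP / (TP + FN) on the minority (positive, y == 1) class."""
    tp = np.sum((y_true == 1) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == 0))
    return tp / (tp + fn)

def auroc(y_true, scores):
    """AUROC as the probability that a randomly drawn positive is scored
    above a randomly drawn negative (ties count one half)."""
    pos, neg = scores[y_true == 1], scores[y_true == 0]
    greater = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return (greater + 0.5 * ties) / (len(pos) * len(neg))
```

An AUROC of 0.5 corresponds to a random scorer, 1.0 to a perfect ranking of minority over majority observations.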

Classification Performance of Selected Feature Subsets
Dimensionality reduction has impacts on computational time and complexity, noise reduction, model significance and results interpretability, but all these improvements should not come at the cost of a good classification performance on the classes of interest. In particular, in research scenarios such as those presented in Section 1, a minimum level of precision on minority class observations is desirable. On the obtained datasets we trained and tested five classifiers: Logistic Regression (LR), Decision Tree (DT), Support Vector Machines (SVM), Naive Bayes (NB) and k-Nearest Neighbors (KNN). We chose to test different classifiers to verify whether our model-agnostic feature selection approach provided good results independently of the subsequent classifier adopted. All algorithms were drawn from the scikit-learn library for Python [38] and their hyperparameters were kept in default mode, unless differently stated. Note that we applied the same classifiers to all experiments without tailoring their parameters to the data at hand. This choice does not resemble a traditional classification process in a real-life scenario, where classifiers are optimized to improve the performance on the data at hand, but it aims at showcasing the impact of the feature subset selection alone. Details on the code, the implementation and the specific architectural choices for the DSAEE are described and discussed in the Appendix.
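The evaluation loop can be sketched as follows with scikit-learn (the synthetic data and the `selected` index array are illustrative stand-ins; in the paper's pipeline `selected` would be the subset F chosen by the DSAEE on the FSDS, and the classifiers would run on the held-out CDS):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier

# Synthetic imbalanced data and a hypothetical selected-feature subset.
X, y = make_classification(n_samples=400, n_features=20,
                           weights=[0.9, 0.1], random_state=0)
selected = np.arange(8)  # stand-in for the DSAEE-selected subset F

X_tr, X_te, y_tr, y_te = train_test_split(X[:, selected], y,
                                          stratify=y, random_state=0)

# Default-configuration classifiers, as in the text.
classifiers = {
    "LR": LogisticRegression(max_iter=1000),
    "DT": DecisionTreeClassifier(random_state=0),
    "SVM": SVC(),
    "NB": GaussianNB(),
    "KNN": KNeighborsClassifier(),
}
scores = {name: clf.fit(X_tr, y_tr).score(X_te, y_te)
          for name, clf in classifiers.items()}
```

Keeping every classifier at its defaults isolates the effect of the feature subset from that of hyperparameter tuning.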
We tested the DSAEE feature selector on the Isolet, Fashion-MNIST, Gisette and Epileptic Seizure datasets. Results for the FMNIST and Isolet datasets are reported in Figure 2, where performance metrics are averaged over 5 trials and the x-axes display the size of the selected feature sets. On the FMNIST dataset (Figure 2, first two panels) most classifiers suffered from the dimensionality reduction until the smaller subsets, where their performance started growing again. On the contrary, the NB classifier had a steep improvement on both AUROC and Sensitivity, reaching almost perfect scores for subsets of extremely small dimensionality (8 features, ∼1% of the original 28x28 image). On the Isolet data, where the sample size is extremely small compared to the number of features and the minority class in the training set contains only 52 observations, the five classifiers performed in most cases as well as the baseline with all variables, despite the shrinking size of the feature subset. Sensitivity (Figure 2, fourth panel) increased substantially for KNN and SVM, while the LR classifier kept attaining an almost perfect score even as the cardinality (|F|) of the selected feature set decreased substantially (while improving on the AUROC score, as shown in Figure 2, third panel). In many cases, the classifiers obtained their best results as |F| decreased.
In Table 2 we report the classification performance on the Epileptic Seizure dataset. The first line summarizes baseline results. Note that we chose to include only the NB and SVM classifiers, as LR, KNN and DT demonstrated a baseline performance too poor to meaningfully consider them for classification on this data. On the contrary, NB and SVM showed a high baseline performance despite the strong imbalance. Decreasing the number of features used in classification did not hinder the performance, while reducing the dimensionality of the problem. For example, by reducing it to a third (|F| = 54), NB did not significantly reduce its AUROC or Sensitivity, while the performance for much smaller subsets (|F| = {18, 13, 9}) remained comparable with the baseline. In Figure 3 we report the results of the experiment on the Gisette data. Note that this dataset was designed for feature selection benchmarking experiments, including 2,500 predictive features and 2,500 probes. By looking at the |F| values on the x-axis, one can note that the feature subsets selected by the DSAEE are far smaller than the original number of noisy features. However, irrespective of the baseline performance with |F| = 5,000, all classifiers showed an increase in performance for some |F| values. This could mean that the algorithm first correctly excludes the noisy features; then, among the informative predictors, it progressively excludes correlated or redundant features, identifying the most useful for the classification task at hand. This hypothesis is well supported by the behavior of the NB classifier, which by design requires conditional independence to reach optimal classification [50]. In this experiment, NB yields a steep increase on both metrics for smaller |F| values.
Only LR suffered a steep decrease in Specificity, which was however offset by a significant improvement in AUROC (meaning that the performance is better balanced between the two classes) for subsets between 1,000 and 250 features.
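The evaluation protocol used in these experiments can be sketched as follows: classifiers are retrained on nested feature subsets of decreasing cardinality and scored on AUROC and Sensitivity. The snippet below is a minimal illustration on synthetic imbalanced data; the dataset shape, subset sizes and the simple mean-difference ranking (standing in for the DSAEE aggregated-RE ranking) are our assumptions, not the paper's setup.

```python
# Sketch of the classification protocol: score classifiers on progressively
# smaller feature subsets. The ranking is a mean-difference placeholder,
# standing in for the DSAEE reconstruction-error ranking.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.metrics import roc_auc_score, recall_score

X, y = make_classification(n_samples=1000, n_features=100, n_informative=10,
                           weights=[0.95, 0.05], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# Placeholder ranking (in the paper: DSAEE ensemble reconstruction error).
ranking = np.argsort(-np.abs(X_tr[y_tr == 1].mean(0) - X_tr[y_tr == 0].mean(0)))

results = {}
for k in (100, 50, 25, 10):                        # shrinking cardinalities
    cols = ranking[:k]
    for name, clf in (("NB", GaussianNB()), ("SVM", SVC(probability=True))):
        clf.fit(X_tr[:, cols], y_tr)
        proba = clf.predict_proba(X_te[:, cols])[:, 1]
        results[(k, name)] = (roc_auc_score(y_te, proba),       # AUROC
                              recall_score(y_te, proba > 0.5))  # Sensitivity
```

Averaging such a loop over several trials, as done in the paper, yields the curves of Figures 2 and 3.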

Feature Selection Benchmarking Experiment
We selected the benchmark feature selection methods for the performance comparison with our DSAEE approach so that they would be representative of different types of algorithms. In particular, we included (i) Chi-squared, a supervised filtering feature selection method based on univariate χ2 statistical tests, and (ii) Recursive Feature Elimination (RFE) [19], a supervised wrapper method that, when combined with an SVM classifier (RFE-SVM), was shown in [34] to be one of the best performing methods for feature selection in imbalanced settings. Finally, we also included (iii) the Concrete AutoEncoder Feature Selector (CAEFS), an unsupervised feature selection method based on AEs, which in [8] was shown to be superior to most of the related algorithms mentioned in Section 2.2.
All benchmark methods were applied to the FSDS, imposing a number of selected features equal to the number selected by the DSAEE at the different threshold levels; the subsets of selected features were then extracted from the CDS to test classification accuracy. We compared the performance on the Isolet and Fashion MNIST datasets, averaging over 5 trials for each experiment. In both cases we trained an ensemble of 25 DSAEs. In Figure 4 we report the results on Isolet using four different classifiers, in terms of Sensitivity and AUROC. Varying the threshold we selected different subsets of variables: the cardinality of each subset is reported on the x-axes. Regarding the NB and SVM classifiers, the DSAEE performed better than the competitors on both indicators for almost all variable subsets. In particular, it significantly outperformed the unsupervised CAEFS for the smaller subsets, while the main competitor at the smallest dimensionalities was the supervised RFE. Note that RFE-SVM is a feature selection method shown to be among the best performers in imbalanced settings [34], and the DSAEE either surpasses it or reaches comparable performance levels in most cases (see the two plots in the bottom left part of Figure 4). Similar results were obtained with the NB classifier. Regarding the LR classifier, all methods seemed to perform well on this dataset, but our methodology reaches an almost perfect Sensitivity score irrespective of the threshold level, down to only 7 variables, where the other AE-based FS method (CAEFS) lowered its average performance. These levels of Sensitivity and AUROC, irrespective of the adopted classifier, on a dataset with significantly small sample size and extremely high dimensionality, testify in favour of the applicability of our methodology in many real-life scenarios where the collection of observations might be costly or difficult.
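For reference, the two supervised baselines can be reproduced with standard scikit-learn tools. The snippet below is a minimal sketch on synthetic data: the feature count k, the data shape, and the linear SVM inside RFE are our assumptions; the χ2 filter requires non-negative inputs, hence the min-max scaling.

```python
# Benchmark selectors sketch: Chi-squared filtering and RFE with a linear SVM,
# each forced to return the same number of features (k) as the DSAEE.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.preprocessing import MinMaxScaler
from sklearn.feature_selection import SelectKBest, chi2, RFE
from sklearn.svm import LinearSVC

X, y = make_classification(n_samples=500, n_features=50, n_informative=8,
                           weights=[0.9, 0.1], random_state=0)
k = 10  # cardinality imposed by the DSAEE selection at a given threshold

X_pos = MinMaxScaler().fit_transform(X)            # chi2 requires X >= 0
chi2_idx = SelectKBest(chi2, k=k).fit(X_pos, y).get_support(indices=True)

rfe = RFE(LinearSVC(dual=False), n_features_to_select=k).fit(X, y)
rfe_idx = rfe.get_support(indices=True)
```

Both selectors return index sets of the imposed cardinality, which can then be used to slice the classification dataset exactly as described above.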
In Figure 5 we compare the performance of the DSAEE on the FMNIST dataset using the best performing classifier in terms of performance improvement (Figure 2). Our algorithm confirmed its superiority w.r.t. the competing AE-based FS method, while keeping a performance comparable to the other benchmark algorithms, all attaining very high performance down to an extremely small feature subset (8 pixels from the original 784). Note that, as can be seen in Figure 2, both datasets allowed for high prediction accuracy on both classes even before feature selection. This indicates that, despite the imbalanced setting, the two classes are probably sufficiently separated and consistently characterized to allow a classifier to correctly separate them and generalize well from only a few examples of the underrepresented class. For this reason, it is not surprising to see all algorithms (especially the supervised ones) perform quite well on this feature selection and classification task. Nonetheless, although our ensemble algorithm is based on unsupervised learners, it consistently reached or surpassed the supervised approaches, and performed significantly better than the unsupervised one.

FIGURE 4 Classification benchmarking against other FS methods on the Isolet dataset, for the NB, LR, SVM and DT classifiers. Each classifier has one plot per metric (AUROC on the left, Sensitivity on the right).

In Table 3 we report the total process runtime to complete all feature subset selections (for all threshold values) for the different algorithms, averaged over all trials. In the first column the DSAEs of the ensemble are trained in sequence, while in the last one we report the estimated average time to perform the algorithm's training in parallel. Even though the sequential training time is not prohibitive per se, the parallelized version outperforms the wrapper RFE and the other AE-based algorithm (CAEFS) by far, while enjoying the beneficial robustness of an ensemble framework.
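The parallelization exploits the fact that the ensemble components are mutually independent, so each fit can run concurrently. A minimal sketch with the standard library's thread pool is given below; the stand-in `train_one` replaces an actual DSAE fit, and the resampling scheme shown is illustrative.

```python
# Sequential vs parallel training of the ensemble: each component fits on an
# independent resample of the majority class, so the fits can run concurrently.
from concurrent.futures import ThreadPoolExecutor
import numpy as np

rng = np.random.default_rng(0)
X_majority = rng.normal(size=(200, 30))

def train_one(seed):
    # Stand-in for fitting one DSAE on a resampled majority-class subset.
    idx = np.random.default_rng(seed).choice(len(X_majority), 150, replace=False)
    return X_majority[idx].mean(axis=0)   # placeholder "trained model"

seeds = range(25)                         # 25 components, as in the experiments
models_seq = [train_one(s) for s in seeds]           # sequential version
with ThreadPoolExecutor() as pool:                   # parallel version
    models_par = list(pool.map(train_one, seeds))
```

Because the per-seed resampling is deterministic, the sequential and parallel runs produce identical component models, which is why only wall-clock time, not accuracy, differs between the two columns of Table 3.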

Interpretability
One advantage of FS for classification lies in the increased interpretability of the subsequent algorithms and results. Indeed, identifying the features that are most informative w.r.t. a target class within a dataset is insightful in itself in many application contexts. In the era of black-box classifiers, a reduction in the amount of information fed to these algorithms is per se a way of improving the interpretability of (and the control over) the obtained classifications. In the case of our proposed algorithm, the selected features are the subset of variables where the minority class differs most from the majority one. In Figure 6 we report some visualizations from the MNIST dataset that help in understanding the feature selection process performed by our algorithm. The small set of features selected with a threshold of 0.90 (Figure 6.f) is then overlaid (in gray scale) on the average representation of the two classes (Figure 7.a). This visualization allows us to recognize how the selected features include all pixels where the minority class ('7' digits) has different characteristics w.r.t. the '1' digits class. In Figure 7.b we propose the same visualization for the Fashion MNIST dataset. Note that these feature subsets were obtained in a highly imbalanced setting, as reported in Table 1, yet the selected features are extremely meaningful nonetheless.
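The pixel mask of Figure 6.f can be obtained from the class-wise average REs in a few lines. The snippet below is a sketch that uses random stand-ins for the ensemble-averaged per-pixel REs; the quantile-based thresholding rule is our assumption about how the threshold level translates into a selection.

```python
# Sketch of the Figure 6 pipeline: per-feature average RE for each class,
# their difference (the Delta vector), and a threshold to pick the features
# where the minority class deviates most. REs here are random stand-ins.
import numpy as np

rng = np.random.default_rng(0)
re_minority = rng.gamma(2.0, 1.0, size=784)   # avg per-pixel RE, minority ('7')
re_majority = rng.gamma(1.0, 1.0, size=784)   # avg per-pixel RE, majority ('1')

delta = re_minority - re_majority             # Delta vector (Figure 6.e)
alpha = 0.90                                  # threshold level (Figure 6.f)
mask = delta >= np.quantile(delta, alpha)     # selected pixels (28x28 overlay)
selected = np.flatnonzero(mask)
```

Reshaping `mask` to 28x28 and overlaying it on the class means reproduces the kind of visualization shown in Figures 6 and 7.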

Case Study application in Radiogenomics
Class imbalance is a daunting issue in many real-life applications, especially when dealing with medical and biological data [42] (cf. Section 1). So far, we presented simulation studies and proofs of concept to demonstrate the generalizable potential of the proposed algorithm. However, the value of the presented approach also lies in its applicability to complex scenarios arising from real-life research settings. In this section we therefore present a real data application of the DSAEE FS algorithm in the field of radiogenomics; a detailed report on the study can be found in [35], but for the aforementioned reasons we provide a brief description here nonetheless. Specifically, we focused on the long-term outcomes of radiotherapy in patients suffering from prostate cancer. The final aim was to validate genetic locations (in the form of Single Nucleotide Polymorphisms, or SNPs) that can be associated with Late Toxicity (LT) outcomes. Experts were interested in finding whether, among the features (i.e. the SNPs) highly associated with the 5 considered LT endpoints in previous studies on different cohorts, some could be validated as relevant for the cohort at hand (∼ 1,700 patients, with an incidence of the positive class always below 10% for each endpoint and a total number of 43 SNPs to evaluate). We applied our DSAEE to each of the 5 endpoints separately, and we selected SNPs with different threshold values ({0.7, 0.8, 0.9, 0.95}). Since this is an unsupervised setting, it is hard to comment on the precision of the results without the required clinical expertise. Notably, however, for one of the endpoints (i.e. Late Urinary Frequency), 3 SNPs identified as relevant by our method for all threshold values had previously been mentioned in the literature [25] as the most strongly associated with this endpoint. This is an interesting application case in which FS methods are useful to profile the minority class and provide useful insights to researchers. As introduced in Section 1, our FS algorithm is indeed tailored to respond to such needs and to deal with complex scenarios where the class of interest is extremely rare.

FIGURE 6
Results of the experiment on the 7 (minority) and 1 (majority) classes. In these 28x28 pixel images, each pixel represents a feature. Subfigures (a) and (b) represent the mean of all values the two classes take in the FS dataset. The color scale is shared across all six subfigures. Subfigure (c) reports the average RE for the minority class, while (d) represents the average RE for the majority class. Note that, since the '1' class is the majority one, the model learns to precisely reconstruct the center of the vertical line that draws the digit. The vector Δ is reported in (e), while (f) depicts the variables selected with a threshold of 0.9.

DISCUSSION AND CONCLUSIONS
In this paper we presented a Deep Learning-based ensemble approach to select features for highly imbalanced classification tasks. The proposed approach exploits Deep Sparse AutoEncoders as weak learners, each trained to learn the normal patterns in majority class observations and tested on both majority and minority class data. Diversity among the components of the ensemble is fostered by a tailored sampling procedure and by the sparsity constraint on the training loss function. Features are ranked by averaging the RE over the ensemble of learners, to identify the most informative ones, i.e. those where the minority class distribution differs most from the majority class. We performed a series of experiments to test the potential of our DSAEE. First, we verified the capability of our algorithm to avoid the degradation of classification performance induced by selecting feature subsets in a setting of strong imbalance [46]. We compared baseline performances with those obtained with subsets of increasingly small dimensionality, using a wide range of datasets with different characteristics to simulate diverse research application scenarios. Then, we benchmarked our method against other feature selection methods, demonstrating the superior or comparable performance of the DSAEE feature selector. Note that most of the algorithms we compared the DSAEE with had the advantage of being supervised, or even of being tailored to maximize prediction accuracy on the minority class (RFE-SVM).
Our FS algorithm is tailored to manage extremely imbalanced settings, with the aim of attaining all the advantages of FS methods without sacrificing too much classification performance when reducing the amount of information supplied to the classifiers. In some cases, the algorithm was capable of identifying subsets of the original features yielding an improved performance in terms of AUROC and/or Specificity (cf. Figure 2 and Figure 3). In particular, the improved Specificity might be induced by the training procedure of each ensemble DSAE component: AEs by nature represent an approximation of the identity function, and the applied model is compelled to learn the common characteristics of the data [41]. By training on the majority class only, the learnt data distribution does not include the characterizing aspects of minority class instances, thus generating higher reconstruction errors on those features. Moreover, the initial data sampling, once included in an ensemble framework, allows reliable information to be extracted even when the observations belonging to the minority class are limited. By creating different sampled training and test sets for each ensemble component, the minority class is studied against various subsets of the majority one, thus enhancing the informative power of the small underrepresented sample. On top of the sampling procedure, we added to the training loss function of our components a sparsity penalty term that, besides fostering the components' diversity, reduces the need for lengthy optimization of the DSAEs' architecture. Indeed, the penalty term forces the number of active nodes in the hidden layer to adapt to the sample of training data, autonomously reducing the risk of learning trivial representations. Besides all the above, the DSAEE feature selection algorithm is a filtering method, meaning that it is agnostic to the classifier exploited to discriminate between classes.
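The core ranking procedure summarized above can be sketched in a few lines. In this sketch, sklearn's `MLPRegressor` acts as a simplified stand-in for a Deep Sparse AutoEncoder (the sparsity penalty and deep architecture of the paper are not reproduced), and the synthetic data, ensemble size and resampling scheme are illustrative assumptions.

```python
# DSAEE ranking sketch: each component trains on a resample of the majority
# class only; features are ranked by the ensemble-averaged difference in
# per-feature reconstruction error between the minority and majority class.
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)
n_feat = 20
X_maj = rng.normal(0, 1, size=(300, n_feat))     # majority class
X_min = X_maj[:30].copy()
X_min[:, :3] += 3.0                              # minority differs on features 0-2

delta = np.zeros(n_feat)
B = 5                                            # ensemble size (25 in the paper)
for b in range(B):
    idx = rng.choice(len(X_maj), 200, replace=False)   # resample majority class
    ae = MLPRegressor(hidden_layer_sizes=(8,), max_iter=500, random_state=b)
    ae.fit(X_maj[idx], X_maj[idx])                     # learn identity on majority
    re_maj = (ae.predict(X_maj) - X_maj) ** 2          # per-feature squared RE
    re_min = (ae.predict(X_min) - X_min) ** 2
    delta += re_min.mean(0) - re_maj.mean(0)           # where minority deviates
delta /= B
ranking = np.argsort(-delta)                     # most discriminative first
```

The top-ranked features are those where the majority-trained autoencoders reconstruct the minority class worst, matching the intuition behind the Δ vector of the previous sections.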
This may slightly hinder classification accuracy compared, for instance, to wrapper methods, but gains generalizability of the identified features. Moreover, compared to wrapper methods, our approach does not incur the risk of sub-optimal solutions in high-dimensional settings, where evaluating all possible combinations of features would be computationally intractable. Compared to embedded methods, our AE-based approach is capable of capturing nonlinear relationships among features. Kernel-based embedded feature selection methods have been proposed to learn nonlinear representations [29], but they are limited by the fixed kernel, and the choice of the optimal kernel or combination of kernels is not straightforward.
In conclusion, with this work we took inspiration from different methodological domains to develop a novel filtering feature selection algorithm that is (i) robust, thanks to its ensemble nature, (ii) capable of learning complex patterns in data, thanks to its AE components, (iii) able to provide interpretable insights, and (iv) specifically tailored to tackle class imbalance. All these considerations promote the usefulness of our DSAEE feature selector in real-life contexts where data are imbalanced, minority class observations have great relevance, sample size is small, and interpretability of results is crucial. We provided a direct example in Section 4.5, where a real data application is briefly described. Future work might be devoted to studying the applicability of the DSAEE feature selector to imbalanced multi-class classification problems, or to further developing the analysis of the RE distributions to select features.

ACKNOWLEDGMENTS
The Authors thank the ERA PerMed Cofund program, grant agreement No ERAPERMED2018-44, RADprecise -Personalized radiotherapy: incorporating cellular response to irradiation in personalized treatment planning to minimize radiation toxicity.

TABLE A1
Details of the architecture and of the implementation are reported here for the four analysed datasets. The function tanh is the hyperbolic tangent; ReLU is the Rectified Linear Unit function. The Enc. section of the table reports the encoder architecture, while the Dec. section reports the decoder architecture.