Randomly perturbed random forests

Laura Anderlucci; Angela Montanari

Conferences CIMPA, 8th Latin American Conference on Statistical Computing (LACSC)

Last modified: 2024-05-14

Abstract

In supervised classification, the distribution of a single feature, of a combination of features, or of the class boundaries may change between the training set and the test set. This situation is known as dataset shift, and it means that the common assumption that training and test data follow the same distribution is often violated in real data applications. To address dataset shift, we propose to randomly introduce more variability into the training set by sketching the input data matrix, that is, by taking random projections of the units. We then modify the random forests algorithm to use sketched, rather than bootstrapped, versions of the original data. Results on real data show that perturbing the training data via matrix sketching improves prediction accuracy for test units whose distribution differs from that of the training data in its variance structure.
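The idea of growing each tree on a sketched copy of the training data, instead of a bootstrap sample, can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not specify the sketching matrix, its scaling, or how class labels are carried over, so everything below is an assumption made for illustration. Here each class block is centred, multiplied by a Gaussian random matrix with entries of variance 1/n_c (so that sketched rows reproduce the class sample covariance in expectation), and shifted back by the class mean. The function names (sketch_by_class, fit_sketched_forest, predict_sketched_forest) and the parameter frac are hypothetical, and scikit-learn's DecisionTreeClassifier stands in for the base learner.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier


def sketch_by_class(X, y, frac, rng):
    """Return a sketched copy of (X, y), built separately for each class.

    Assumption: for a class block X_c (n_c x p), each sketched row is
    mean_c + s @ (X_c - mean_c), with s ~ N(0, I / n_c). Rows are random
    linear combinations of units that keep the class mean and, in
    expectation, the class sample covariance.
    """
    X_parts, y_parts = [], []
    for label in np.unique(y):
        X_c = X[y == label]
        n_c = X_c.shape[0]
        k_c = max(2, int(round(frac * n_c)))  # sketched units for this class
        mean_c = X_c.mean(axis=0)
        S = rng.normal(0.0, 1.0 / np.sqrt(n_c), size=(k_c, n_c))
        X_parts.append(mean_c + S @ (X_c - mean_c))
        y_parts.append(np.full(k_c, label))
    return np.vstack(X_parts), np.concatenate(y_parts)


def fit_sketched_forest(X, y, n_trees=50, frac=1.0, seed=0):
    """Grow n_trees trees, each on an independently sketched training set."""
    rng = np.random.default_rng(seed)
    trees = []
    for _ in range(n_trees):
        X_s, y_s = sketch_by_class(X, y, frac, rng)
        tree = DecisionTreeClassifier(
            max_features="sqrt",  # random feature subset at each split
            random_state=int(rng.integers(2**31 - 1)),
        )
        trees.append(tree.fit(X_s, y_s))
    return trees


def predict_sketched_forest(trees, X):
    """Aggregate the trees' predictions by majority vote."""
    votes = np.stack([t.predict(X) for t in trees])  # (n_trees, n_samples)
    classes = trees[0].classes_  # identical across trees: every class is sketched
    counts = np.stack([(votes == c).sum(axis=0) for c in classes])
    return classes[counts.argmax(axis=0)]
```

A call such as trees = fit_sketched_forest(X_train, y_train) followed by predict_sketched_forest(trees, X_test) would then replace the usual fit and predict steps of a random forest. Sketching within each class is a design choice made here so that labels remain well defined after units are linearly combined; other sketching schemes are possible.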

Keywords

classification, dataset shift, data perturbation

References

Moreno-Torres, J. G., Raeder, T., Alaiz-Rodríguez, R., Chawla, N. V., & Herrera, F. (2012). A unifying view on dataset shift in classification. Pattern Recognition, 45(1), 521-530.

Ahfock, D. C., Astle, W. J., & Richardson, S. (2021). Statistical properties of sketching algorithms. Biometrika, 108(2), 283-297.

Falcone, R., Anderlucci, L., & Montanari, A. (2022). Matrix sketching for supervised classification with imbalanced classes. Data Mining and Knowledge Discovery, 36(1), 174-208.