Last modified: 2024-04-24
Abstract
K-Nearest Neighbors (KNN, [1]) is a relatively simple supervised method for both classification and regression tasks. It is referred to as a lazy learner since the training observations are used only to identify the neighbors of each test observation. The prediction for a test observation is given either by neighborhood averaging (regression) or by majority vote (classification). The identification of neighbors is therefore the core of KNN, and it critically depends on how the proximity (or distance) between a test observation and a candidate neighbor is measured. There is a wide range of distance measures to choose from, especially for non-continuous (categorical) data (see, e.g., [2]). In the case of mixed-type data, practitioners either convert all variables to a single type or use ad-hoc combinations of distance measures, such as the Gower (dis)similarity index [3]. We propose a KNN implementation that takes into account the inter-dependency structure among mixed-type variables, and we consider two different approaches: association-based and entropy-based.
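As an illustration of the standard baseline referred to above, the following minimal Python sketch combines a Gower-style dissimilarity [3] with a KNN majority vote on mixed-type data. The function names and toy data are our own, and the sketch does not implement the association-based or entropy-based approaches proposed here.

def gower_distance(x, z, num_idx, cat_idx, ranges):
    """Gower dissimilarity: range-scaled absolute differences for numeric
    variables, simple 0/1 mismatches for categorical ones, averaged."""
    d = sum(abs(x[j] - z[j]) / ranges[j] for j in num_idx)       # numeric part
    d += sum(0.0 if x[j] == z[j] else 1.0 for j in cat_idx)      # categorical part
    return d / (len(num_idx) + len(cat_idx))

def knn_classify(X_train, y_train, x_test, k, num_idx, cat_idx):
    # Per-variable ranges of the numeric columns, used for Gower scaling
    # (a zero range falls back to 1.0 to avoid division by zero).
    ranges = {j: (max(r[j] for r in X_train) - min(r[j] for r in X_train)) or 1.0
              for j in num_idx}
    dists = [gower_distance(x, x_test, num_idx, cat_idx, ranges) for x in X_train]
    neighbors = sorted(range(len(dists)), key=dists.__getitem__)[:k]  # k closest
    labels = [y_train[i] for i in neighbors]
    return max(set(labels), key=labels.count)                    # majority vote

# Toy mixed-type data: (age, income, colour); the last variable is categorical.
X = [(25, 50.0, "red"), (30, 60.0, "blue"), (22, 48.0, "red"), (40, 90.0, "blue")]
y = ["A", "B", "A", "B"]
print(knn_classify(X, y, (27, 55.0, "red"), k=3, num_idx=[0, 1], cat_idx=[2]))

Note that each variable contributes a value in [0, 1] to the dissimilarity, which is how Gower's index lets numeric and categorical variables be combined on a common scale.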
References
1. Cover, T., and Hart, P. (1967). Nearest neighbor pattern classification. IEEE Transactions on Information Theory 13(1), 21–27.
2. Van De Velden, M., Iodice D'Enza, A., Markos, A., and Cavicchia, C. (2023). A general framework for implementing distances for categorical variables. arXiv preprint arXiv:2301.02190.
3. Gower, J. (1971). A general coefficient of similarity and some of its properties. Biometrics 27(4), 857–871.