Last modified: 2024-06-18
Abstract
We introduce the Symbolic Principal Component Analysis method for histogram-valued symbolic variables (HI-PCA), an extension of the Symbolic Principal Component Analysis method for interval-valued symbolic variables (I-PCA). HI-PCA is grounded in classical set theory and probability theory: it uses the notions of bins and intervals to project histograms onto the principal components. The paper also presents several theorems that provide theoretical support for the HI-PCA method.
We aim to explore the geometric aspects of projecting histogram bins onto principal components. The central question is whether, when PCA is applied, two bins whose supporting intervals are disjoint in the original data also have disjoint projections of those intervals onto the principal components. Understanding these projections is crucial because they can affect the projected frequencies. To ground HI-PCA in set theory, we have developed definitions and theorems that generalize the equations initially proposed in . Finally, we illustrate the implementation of the proposed method in the R programming language using the RSDA package.
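The central question can be made concrete with a small numerical sketch. The Python snippet below is not the HI-PCA procedure and does not use RSDA. It assumes the standard way of projecting a hyperrectangle onto a fixed direction (taking the minimum and maximum of the linear combination over the box), and the function names, data, and loading vector are hypothetical values chosen only for illustration.

```python
def project_interval(box, loading):
    """Project a hyperrectangle onto one direction.

    box: list of (low, high) pairs, one per variable.
    loading: list of weights, one per variable (an assumed
             principal-component direction).
    Returns the (min, max) of sum_j loading[j] * x_j over the box.
    """
    low = sum(min(w * a, w * b) for (a, b), w in zip(box, loading))
    high = sum(max(w * a, w * b) for (a, b), w in zip(box, loading))
    return low, high


def disjoint(i1, i2):
    """True when two closed intervals share no point."""
    return i1[1] < i2[0] or i2[1] < i1[0]


# Hypothetical data: two bins of variable 1 with disjoint supports,
# each paired with the same interval of variable 2.
bin_a = [(0.0, 1.0), (0.0, 4.0)]
bin_b = [(2.0, 3.0), (0.0, 4.0)]
loading = [0.6, 0.8]  # unit-length direction, chosen for illustration

proj_a = project_interval(bin_a, loading)
proj_b = project_interval(bin_b, loading)

print("disjoint in original variable:", disjoint(bin_a[0], bin_b[0]))
print("disjoint after projection:", disjoint(proj_a, proj_b))
```

In this toy case the supports of the two bins are disjoint in the original variable, yet their projections overlap, because the spread of the second variable is carried into the projected intervals. This is the kind of situation in which projected frequencies can be affected.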
References
Billard, L. and Diday, E. (2006). Symbolic Data Analysis: Conceptual Statistics and Data Mining. John Wiley & Sons Ltd, United Kingdom.
Rodriguez, O. (2023). RSDA: R to Symbolic Data Analysis. R package version 3.2.1.
Brito, P. and Dias, S. (2022). Analysis of Distributional Data. CRC Press, United States of America.
Rodriguez, O. (2000). Classification et Modèles Linéaires en Analyse des Données Symboliques. Ph.D. Thesis, Paris IX-Dauphine University.
Verde, R., Irpino, A. and Balzanella, A. (2016). Dimension Reduction Techniques for Distributional Symbolic Data. IEEE Transactions on Cybernetics 46(2), 344-355.