TabText: A Flexible and Contextual Approach to Tabular Data Representation
Last modified: 2024-05-15
Abstract
In collaboration with Hartford HealthCare (HHC), we have developed highly accurate machine learning (ML) models that predict nine inpatient outcomes (e.g., short-term discharges, ICU transfers, and mortality) using tabular data from electronic medical records. Hundreds of medical staff currently use our models, resulting in a significant reduction in average patient length of stay and projected annual benefits of $55-$72 million for HHC. Given this successful implementation, a natural question arises: how can we extend these tools to benefit hospitals with limited resources, small patient populations, and/or non-standardized healthcare records? To address these challenges, we introduce TabText, a systematic framework that leverages Large Language Models to process and extract contextual information from tabular structures, resulting in more complete and flexible data representations. We show that 1) applying our TabText framework enables the generation of high-performing predictive models with minimal data processing, and 2) augmenting tabular data with TabText representations can significantly improve the performance of standard ML models across all nine prediction tasks, especially when trained on small datasets.
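As a rough illustration of the kind of pipeline the abstract describes, the sketch below serializes tabular rows into contextual sentences, encodes them with a pretrained language model, and concatenates the resulting embeddings with the original numeric features before fitting a standard classifier. It is not the authors' implementation: the column names, the serialization template, the choice of encoder ("all-MiniLM-L6-v2"), and the downstream logistic regression are all assumptions made for this example; the actual framework is described in the referenced paper.

```python
# Minimal sketch of a TabText-style augmentation pipeline (illustrative only).
# Assumptions: hypothetical EMR-like columns, a general-purpose sentence
# encoder, and a simple downstream classifier.
import numpy as np
import pandas as pd
from sentence_transformers import SentenceTransformer
from sklearn.linear_model import LogisticRegression

# Hypothetical tabular snapshot of two inpatient stays.
df = pd.DataFrame({
    "age": [67, 54],
    "unit": ["ICU", "Med/Surg"],
    "heart_rate": [112, 84],
    "on_ventilator": [True, False],
})
y = np.array([1, 0])  # hypothetical binary outcome labels


def row_to_text(row: pd.Series) -> str:
    """Serialize one row into a contextual sentence ("column is value" style)."""
    parts = [f"{col.replace('_', ' ')} is {val}" for col, val in row.items()]
    return "The patient's " + "; ".join(parts) + "."


texts = df.apply(row_to_text, axis=1).tolist()

# Encode the text representations with a pretrained sentence encoder.
encoder = SentenceTransformer("all-MiniLM-L6-v2")
text_features = encoder.encode(texts)  # shape: (n_rows, embedding_dim)

# Augment the original numeric features with the text embeddings.
numeric = df[["age", "heart_rate"]].to_numpy(dtype=float)
X = np.hstack([numeric, text_features])

# Any standard ML model can then be trained on the augmented representation.
clf = LogisticRegression(max_iter=1000).fit(X, y)
print(clf.predict(X))
```

In this sketch, the text embedding simply adds extra columns alongside the tabular features, which matches the "augmenting tabular data with TabText representations" idea at a high level; how the context is phrased and which language model is used are design choices studied in the paper itself.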
Keywords
Large Language Models, Healthcare Analytics, Data Augmentation
References
Carballo, K. V., Na, L., Ma, Y., Boussioux, L., Zeng, C., Soenksen, L. R., & Bertsimas, D. (2022). TabText: A Flexible and Contextual Approach to Tabular Data Representation. arXiv preprint arXiv:2206.10381.