Combining Topic Modeling and Word Embedding to Predict Match Outcomes in Association Football

Sourav Adhikari

Conferences CIMPA, 18th International Federation of Classification Societies

Last modified: 2024-04-08

Abstract

This study proposes a novel approach for predicting association football (soccer) match outcomes by leveraging text data from newspaper previews published in The Guardian, historical results, and bookmakers' odds. Match results are categorized as a home team win, a draw, or an away team win. Both Word2Vec and Llama 2 are used to extract team features in the form of word embeddings. Topic modeling is subsequently applied to obtain match-specific features. Additionally, predictions from the Dixon and Coles model and pre-match bookmakers' odds are integrated to create an ensemble model. The resulting feature set is used for multi-class classification with a random forest classifier. Classification performance is assessed using several evaluation metrics, such as precision, recall, and F1 score, after extensive cross-validation. In comparison with existing approaches, the proposed method achieved an accuracy of 58.33% using text-based features and 64.5% with the ensemble approach. The presented methodology aims to provide an alternative way of extracting team-specific and match-specific features while enhancing prediction accuracy by integrating text and structured data.
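
As a rough illustration of the evaluation step, the sketch below computes accuracy and per-class precision, recall, and F1 for the three outcome classes. It is only a minimal sketch under stated assumptions: the label names, the helper functions `per_class_scores` and `accuracy`, and the toy predictions are hypothetical and are not taken from the study's code or data.

```python
from collections import Counter

# Assumed label names for the three outcome classes described in the abstract.
LABELS = ("home", "draw", "away")


def per_class_scores(y_true, y_pred, labels=LABELS):
    """Return {label: (precision, recall, f1)} for a multi-class prediction."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # predicted this class, but it was wrong
            fn[truth] += 1  # missed the true class
    scores = {}
    for label in labels:
        p_den = tp[label] + fp[label]
        r_den = tp[label] + fn[label]
        precision = tp[label] / p_den if p_den else 0.0
        recall = tp[label] / r_den if r_den else 0.0
        pr_sum = precision + recall
        f1 = 2 * precision * recall / pr_sum if pr_sum else 0.0
        scores[label] = (precision, recall, f1)
    return scores


def accuracy(y_true, y_pred):
    """Fraction of matches whose outcome was predicted correctly."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


# Made-up labels for illustration only; not results from the study.
y_true = ["home", "home", "draw", "away", "away", "home"]
y_pred = ["home", "draw", "draw", "away", "home", "home"]

scores = per_class_scores(y_true, y_pred)
overall = accuracy(y_true, y_pred)
```

Reporting per-class scores alongside accuracy matters in this setting because draws are typically the hardest outcome to predict, and a single accuracy figure can hide weak performance on that class.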

Keywords

text analysis, topic modeling, large language models, feature engineering, classification