Call it data science for good — or for the love of the game.
For Philly Tech Week 2022 presented by Comcast, Freya Systems set out to create an event that would get people to think differently about data. The result was a hackathon, “Can You Predict Major League Soccer Game Outcomes?”, built around a unique data set that made the challenge harder to solve.
The virtual challenge attracted data science professionals, data science enthusiasts and college students, both locally and from as far away as California. Participants were required to use one of two open-source programming languages, Python or R.
There are several differences between the two languages, and people often gravitate toward one or the other based on their backgrounds. People with statistical experience tend to use R because its syntax is similar to that of other classic statistical programming languages like MATLAB or SAS. Python is a general-purpose programming language like C++ or Java, but in data science it is often preferred for machine learning models, especially deep learning with neural networks. Both languages have thousands of packages that contain specialized functions to manipulate data, plot data and construct models. Popular Python packages used in the hackathon included pandas and scikit-learn (imported as sklearn); on the R side, participants used dplyr, data.table and ranger, among others.
Participants were provided a training set of games already played from the 2018 to 2022 Major League Soccer (MLS) seasons. The test set included games already played in the 2022 season but without related statistics. The challenge was to understand how well a model built on data from fully completed seasons would predict the outcome of games that had yet to occur. Participants applied a variety of techniques, but three stood out: hyperparameter searching, tree-based models, and exploitation of the test (or evaluation) set.
Hyperparameter searching is a technique in which a function or library automatically looks for good hyperparameter values by running trials and comparing the results. A hyperparameter is a model-dependent setting that controls how the model is constructed and how it learns. Certain hyperparameter values can drastically change how well the model performs in its predictions. There are many libraries and techniques for the search. “Grid searching” is a brute-force approach that runs a trial for every combination of candidate values. Other libraries, such as Optuna, have their own search algorithms that concentrate trials on promising combinations rather than trying them all. Overall, hackathon participants who used this strategy tended to do better with their models.
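To make the brute-force idea concrete, here is a minimal Python sketch of a grid search. Everything in it is an assumption for illustration, not something hackathon participants wrote: the synthetic two-class data, the small nearest-neighbor classifier, and the helper names knn_predict, accuracy and grid_search are all hypothetical. In practice a library routine such as scikit-learn's GridSearchCV would usually do this work.

```python
import itertools
import random


def knn_predict(train, point, k, weighted):
    """Predict a label for `point` by a vote among its k nearest training rows."""
    by_distance = sorted(
        (sum((a - b) ** 2 for a, b in zip(features, point)) ** 0.5, label)
        for features, label in train
    )
    votes = {}
    for dist, label in by_distance[:k]:
        # When `weighted` is True, closer neighbors count for more.
        weight = 1.0 / (dist + 1e-9) if weighted else 1.0
        votes[label] = votes.get(label, 0.0) + weight
    return max(votes, key=votes.get)


def accuracy(train, valid, k, weighted):
    """Fraction of validation rows predicted correctly."""
    hits = sum(knn_predict(train, x, k, weighted) == y for x, y in valid)
    return hits / len(valid)


def grid_search(train, valid, grid):
    """Score every combination in `grid`; return (best_params, best_score, n_trials)."""
    names = list(grid)
    best_params, best_score, n_trials = None, -1.0, 0
    for values in itertools.product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        score = accuracy(train, valid, **params)
        n_trials += 1
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score, n_trials


# Synthetic two-class data (an assumption): class 1 is shifted away from class 0.
rng = random.Random(0)
rows = [((rng.gauss(c * 1.5, 1.0), rng.gauss(c * 1.5, 1.0)), c)
        for c in (0, 1) for _ in range(60)]
rng.shuffle(rows)
train_rows, valid_rows = rows[:80], rows[80:]

# Two hyperparameters: 5 values of k times 2 weighting choices = 10 trials.
grid = {"k": [1, 3, 5, 9, 15], "weighted": [False, True]}
best_params, best_score, n_trials = grid_search(train_rows, valid_rows, grid)
print(n_trials, best_params, round(best_score, 3))
```

The cost of this approach is visible in the grid itself: every added hyperparameter multiplies the number of trials, which is why smarter search libraries try to skip unpromising combinations.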
Another technique was the use of tree-based models such as Gradient Boosting Classifiers and Random Forests, which build their overall models out of decision trees. A Gradient Boosting Classifier starts with simple trees that make weak predictions, then adds trees one after another, each focused on correcting the errors made by the trees before it. A Random Forest Classifier constructs a group of independent trees in parallel (hence the name forest), gives each tree a random subset of the input variables, and uses a voting system across the trees to identify the winner. Random Forests tend to be quite robust, perform very well on a variety of datasets, and do not require much hyperparameter tuning for solid results. Hackathon participants who used one of these algorithms were also quite successful.
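As a rough illustration of how these two model families look in practice, the sketch below fits both with scikit-learn. It assumes scikit-learn is installed, and it uses synthetic three-class data as a stand-in for win, draw and loss outcomes; the hackathon data, features and settings are not reproduced here, and the hyperparameter values shown are arbitrary choices.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic three-class data (an assumption, not the MLS data set).
X, y = make_classification(
    n_samples=600, n_features=10, n_informative=5,
    n_classes=3, random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)

models = {
    # Many independent trees, each seeing random subsets of rows and features.
    "random_forest": RandomForestClassifier(n_estimators=200, random_state=0),
    # Shallow trees added one after another, each correcting earlier errors.
    "gradient_boosting": GradientBoostingClassifier(
        n_estimators=100, max_depth=2, learning_rate=0.1, random_state=0
    ),
}

scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = model.score(X_test, y_test)  # mean accuracy on held-out rows
    print(f"{name}: {scores[name]:.3f}")
```

Which of the two scores higher depends on the data and the settings, so the numbers from this toy example say nothing about how either model fared in the hackathon.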
The third strategy, which led to successful performance but wasn’t as popular as the other techniques, was exploitation of the test set. The evaluation set was provided to each participant with the outcomes included, so that participants could self-report their final prediction scores. Other hackathons typically do not include the actual outcomes in the evaluation dataset, so participants cannot see how well they performed against them. Here, true outcomes were included, and those who exploited this decision had the best results.
Exploiting the test set outperformed the other strategies because the competition was scored on how well each participant performed on that one test set. Traditional data science practice calls for a model that performs well across all inputs or datasets, not just a single test set, so that the model isn’t biased toward a specific set of inputs. In this hackathon, however, the teams that biased their models to perform well on the 113-row test set had the best results. Of course, that does not mean those teams would have the best predictive model for the entire 2022 MLS season, because the results of the first 113 games may not be representative of the rest of the season.
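A small simulation shows why a score tuned to one small test set can mislead. The sketch below is a deliberately extreme, hypothetical setup and not a description of what any team actually did: outcomes are random, each "model" is nothing more than a seeded random guesser, and the only strategy is to keep whichever guesser happens to score best on a 113-row test set. The labels and the predict helper are assumptions made for illustration.

```python
import random

OUTCOMES = ["home_win", "draw", "away_win"]  # hypothetical outcome labels
N_TEST, N_FUTURE, N_MODELS = 113, 2000, 300


def predict(model_seed, split, n_games):
    """Stand-in 'model': reproducible random guesses, so it has no real skill."""
    model_rng = random.Random(f"{model_seed}:{split}")
    return [model_rng.choice(OUTCOMES) for _ in range(n_games)]


def accuracy(predictions, truth):
    return sum(p == t for p, t in zip(predictions, truth)) / len(truth)


truth_rng = random.Random(42)
# Random outcomes: no model can genuinely do better than about one in three.
test_truth = [truth_rng.choice(OUTCOMES) for _ in range(N_TEST)]
future_truth = [truth_rng.choice(OUTCOMES) for _ in range(N_FUTURE)]

# Score every candidate on the small test set and keep the best one.
test_scores = {
    seed: accuracy(predict(seed, "test", N_TEST), test_truth)
    for seed in range(N_MODELS)
}
best_seed = max(test_scores, key=test_scores.get)
best_test_acc = test_scores[best_seed]

# Evaluate the selected candidate on games it was never selected against.
future_acc = accuracy(predict(best_seed, "future", N_FUTURE), future_truth)

print(f"best accuracy on the {N_TEST}-row test set: {best_test_acc:.3f}")
print(f"same model on {N_FUTURE} unseen games:      {future_acc:.3f}")
```

The selected guesser looks clearly better than chance on the 113 rows it was chosen on, yet falls back to roughly one in three on unseen games. Real models do have skill, so the gap is smaller in practice, but the same selection effect applies whenever the test outcomes steer the modeling choices.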
Philly Tech Week 2022 gave Freya Systems the chance to host our first hackathon. It was successful and quite fun. Of course, we’d be remiss if we didn’t use the experience to improve future efforts. First, we won’t include the outcomes in the test set. Second, we will give participants more time to work on the problem. Participants had only 72 hours to complete the challenge, and as a result, only six of the 18 teams submitted results. More time would have allowed us to see more unique techniques. Finally, we would love to have more data, as is the struggle with most data science problems. More data allows models to be more robust and learn more intricate details.
At the end of the day, this hackathon represented an everyday challenge in data science — how to balance the priorities of budget versus quality versus time.