The 2024 U.S. Open (source analytics)

A quick and dirty way to fuel my obsession with tennis, for free.

I’m a huge fan of open source technology as a means of producing real business value. In my career as a data professional, I’ve used open source technologies across the data lifecycle: Airflow for orchestration and ingestion, Preset for business intelligence, dbt for transformation, and others.1  

Recently, I've been scoping an embedded data application that enables user self-service over several gigabytes of data. As part of this system, we're evaluating newer, open source tools. Wanting more hands-on exposure to these tools, and with the 2024 U.S. Open looming, I set out to build an open-source, low-fidelity data app centered around… tennis.

The biggest surprise? I don’t foresee Carlos Alcaraz leaving New York with a trophy.

Kaylan Bigun (USA) scores a point against Valentin Vacherot (MCO) in the qualifying rounds.

Scoping the available data

If you’ve watched a professional tennis match on television, you’ve seen the Hawkeye technology that dynamically calls balls in or out.2 I quickly learned that Hawkeye “sits on their data like a dragon on gold” as one Redditor phrased it, and I needed other means of collecting match and shot-by-shot data.

Luckily, I came across Jeff Sackmann’s tennis blog, Heavy Topspin (would highly recommend). He runs an open source data collection project for all Grand Slam matches. Pro: the data exists! (and is open source). Con: because of the manual data collection process, it is inconsistent.

Let the fun begin…

Jacob Matson’s MDS in a box served as the wireframe for which tools could fit. A short evaluation led to the following plan:

  1. DuckDB for local warehousing

  2. Python scripts for ingestion and light transformation

  3. Superset for visualization

Duck what?

DuckDB is an open-source, in-process analytics database that has received a lot of hype over the past 1-2 years. It is integrated across enterprise SaaS; Omni, for example, uses DuckDB as its in-browser analytics engine to move processing locally and limit data warehouse compute.

I ended up installing DuckDB via pip inside a Python virtual environment to keep package management simple, and I relied on the Python API to persist storage, create schemas, and run queries:

import duckdb

# connect to (or create) a persistent database file on disk
conn = duckdb.connect('tennis_slam.db')

Python scripts for ingestion

Once the database was created, I manually loaded a csv from Sackmann’s repo into the db:

conn.sql("create table raw_2011_usopen_matches as select * from read_csv('https://raw.githubusercontent.com/JeffSackmann/tennis_slam_pointbypoint/master/2011-usopen-matches.csv')")

And then could easily explore a subset of 2011 U.S. Open match results:

conn.sql("select * from raw_2011_usopen_matches limit 100")

To no surprise, Djokovic is the first result.
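
For anything beyond eyeballing rows in the terminal, the result of conn.sql() can also be pulled straight into a pandas DataFrame. A minimal sketch, assuming pandas is installed in the same virtual environment:

# pull the query result into pandas for easier inspection
matches_df = conn.sql("select * from raw_2011_usopen_matches limit 100").df()
print(matches_df.head())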

I eventually polished a script that ran through the normal data-y stuff:

  1. Looped through and ingested all csvs in the repo

  2. Wrapped commands in functions

  3. Set up schemas for raw and intermediate models

  4. Unioned tables and points data across years

The script ingested and transformed close to 100 files in just under 2 minutes. Not too shabby 🤙
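
The full script isn't reproduced here, but its shape is roughly the sketch below; the year range, table names, and schema layout are illustrative assumptions rather than the exact code:

import duckdb

BASE_URL = "https://raw.githubusercontent.com/JeffSackmann/tennis_slam_pointbypoint/master"
YEARS = range(2011, 2024)  # illustrative range; the repo covers more tournaments than this

conn = duckdb.connect('tennis_slam.db')
conn.sql("create schema if not exists raw")
conn.sql("create schema if not exists intermediate")

def ingest_matches(year):
    """Load one year's U.S. Open matches file into the raw schema."""
    url = f"{BASE_URL}/{year}-usopen-matches.csv"
    conn.sql(f"create or replace table raw.usopen_matches_{year} as select * from read_csv('{url}')")

for year in YEARS:
    ingest_matches(year)

# union the yearly tables into a single intermediate model, matching columns by name
union_sql = " union all by name ".join(
    f"select *, {year} as tournament_year from raw.usopen_matches_{year}" for year in YEARS
)
conn.sql(f"create or replace table intermediate.usopen_matches as {union_sql}")

Unioning by name rather than by position helps guard against the column drift you'd expect from a manually collected dataset.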

Clone & Configure Superset

Given my experience with Preset, I chose Superset as the BI layer. It was easy to clone the repo as a submodule and spin up the demo environment with Docker. However, I ran into a bit of trouble connecting Superset to the DuckDB database I created.

Because Superset does not include a DuckDB SQLAlchemy driver out of the box, you have to add the additional database driver yourself. This is as simple as adding the duckdb-engine package to a ./docker/requirements-local.txt file and rebuilding the Docker image.

But I ran into problems telling Docker where to find the database. I eventually discovered that I needed to mount the directory containing the DuckDB database into the Superset container by adding the path to the docker-compose.yml file:

x-superset-volumes: &superset-volumes
## default volumes ##
  - /Users/willdecesare/Documents/GitHub/tennis-analytics-duckdb:/app/duckdb

Furthermore, after trying to rebuild the Docker container, my machine was left in a never-ending npm run build cycle. I was limited by my machine’s memory (an M1 Mac with 8GB): even after increasing Docker’s allowable memory and swap, I still could not rebuild the container.

Thankfully, I’m not the only person who encountered this issue. After this quick fix, I was able to successfully rebuild the container, navigate to the local Superset UI, and add the DuckDB database with the SQLAlchemy URI of duckdb:////app/duckdb/tennis_slam.db.
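
Before adding the connection in the UI, it is worth sanity-checking that the duckdb-engine dialect can actually open the file. A minimal check, run from inside the Superset container where /app/duckdb is the mounted path (swap in the local path otherwise):

# verify that the SQLAlchemy URI resolves and the database is readable
from sqlalchemy import create_engine, text

engine = create_engine("duckdb:////app/duckdb/tennis_slam.db")
with engine.connect() as connection:
    print(connection.execute(text("select 42")).scalar())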

Explore some data!

I admittedly did not perform any quality checks on the data before ingesting. Hence, the following visuals are not entirely accurate. But, you get a sense of the final product assuming better quality data.

European players have conquered the U.S. Open of late.

Djokovic displaying sheer dominance across all surfaces.

Would prompting Claude 3.5 Sonnet to “create a bar chart of top 10 most matches played in the tennis U.S. Open by player from 2018-2023” have saved a significant amount of time for the same output? Yes:

Do women not play in the U.S. Open? 🤔

But as Ralph Waldo Emerson once wrote: “it’s not about how you built the thing, it’s about over-engineering a way to get there".3

Closing Thoughts

DuckDB was easy to set up and query, and it integrates well with Python. Little configuration was needed, the ingestion process was smooth, and it was a great experience overall spinning up a database outside of the typical Postgres and Snowflake options. However, DuckDB’s speed still needs a real pressure test: I only realized halfway through development that this dataset was not large enough to warrant DuckDB over other means of local visualization.

A UI wrapper is needed for effective development. Consuming results in the terminal is not an ideal developer experience. I’m sure a Snowflake-like IDE exists, but I did not spend the time searching for it.

My rustiness with Docker showed. I refreshed my knowledge of containerization: which changes show up in Superset live and which do not, when to rebuild images versus just firing up the service, etc. It ended up being a great refresher, and I likely learned more about Docker than DuckDB from this project.

The value of data is alive and well. Analytics drive decision-making in most sports nowadays. Sackmann did not predict an analytics revolution ten years ago, but I believe Hawkeye has an opportunity to build a SaaS product and market it to top professional players. The TAM might be small, but I could foresee top players leveraging a premium recommendation engine on top of this data.4

And lastly, predictions 🏆

Men’s
My pick: Novak Djokovic
Sleeper pick: Taylor Fritz
Rooting for: Carlos Alcaraz, Tommy Paul

Women’s
My pick: Aryna Sabalenka
Sleeper pick: Qinwen Zheng
Rooting for: Iga Swiatek, Danielle Collins (go hoos)5

Have an interesting data problem to solve, interested in the repo for the project above, or want to chat about why Carlos Alcaraz has the tools to become the greatest tennis player ever? Reach out at [email protected].

1  SQLFluff for SQL standardization and Datafold for data diff (RIP) are (were) a few other favorites in the space.

2  For a quick explainer on how Hawkeye works, check out this video.

3  The actual quote is “It’s not the destination, it’s the journey,” but I’m sure Emerson was thinking about app development when he wrote it, right? Right?

4  TAM = Total Addressable Market, or the Intro to Business means of sizing business opportunities.

5  “Wahoos, often shortened to 'Hoos, is a nickname for sports teams of the University of Virginia (officially the Cavaliers), and more generally, a nickname for University students and alumni.” - Wikipedia