Utilizing Graph to Make Sense of March Madness

Finding Correlation in March Madness Data with TigerGraph, Plotly and Seaborn

McKenzie Steenson
10 min read · Apr 12, 2022
Photo by Todd Greene on Unsplash

March has come and gone, and that means only one thing: another March Madness tournament has left most of us sad and disappointed with the results of our brackets. According to the NCAA, a perfect bracket has never been verifiably picked, so it technically left all of us disappointed with our picks. This was my very first year creating a bracket. I came into it with an unreasonable amount of simultaneous confidence and doubt in my choices. Of course, I had Gonzaga chosen to win it all. Why wouldn’t they? They had the best season of all the teams, and they had a number one seed going into the tournament. But I soon learned that this meant nothing at all, as they lost in an upset during the round of 16 to fourth-seeded Arkansas.

This got the engineer and data scientist in me interested to see if there was any way to make better choices when filling out next year’s bracket. Could any of a team’s statistics over the season point to them making it further in the tournament?

Objective

In this blog, we will walk through creating our March Madness solution in TigerGraph GraphStudio, complete with designing a schema, loading data, and querying the data with GSQL. Then, we will walk through visualizing and exploring our query results with Plotly and Seaborn. This will give us insight into how a team’s regular season statistics and outcomes impacted their tournament run.

Tools Used

  • TigerGraph Cloud
  • TigerGraph GraphStudio
  • Jupyter Notebook
  • pyTigerGraph
  • Plotly Express
  • Seaborn
  • MatPlotLib
  • Cufflinks

Data

The Kaggle dataset is a compilation of the regular season statistics and postseason outcomes for all 355 NCAA Men’s Basketball teams from the 2013, 2014, 2015, 2016, 2017, 2018, and 2019 Division 1 seasons.

TigerGraph Setup

If you are not already familiar with TigerGraph, review the following resources to learn how to create a new TigerGraph solution, start the solution, and access it through GraphStudio:

Through the next few sections of the blog, I will be walking through the high level specifics of my solution. For a more detailed overview of creating your own graph solution, check out the TigerGraph GraphStudio documentation:

Schema

Navigating within TG GraphStudio, we click on the left navbar to our basketball solution:

We can now go to our Design Schema page. Here we can create vertices and edges, and publish our graph. I’ve created vertices for a Conference, a Team and a Season. The edges that connect each of the vertices denote their relationships:

I have added attributes to the ‘RESULTS_OF’ edge that connects a Team to a Season. These are the results of the statistics of each Season (year) in our dataset:

Mapping Data

Next, we navigate to Map Data To Graph on the navbar. Here, we upload our .csv file and map the columns of the data file to the vertices and edges in our schema:

Load Data

Before we can explore and query our data, it must be loaded into our graph:

Explore Data

Now, we can explore our data! Below is a simple search for Gonzaga. We can show all the connections that Gonzaga has in our data! This shows they are in the West Coast Conference, and we can get their results from each season:

Write Queries

Finally, we can navigate to the Write Queries tab and create queries to explore our data:

Here is an example query to get the average offensive and defensive efficiencies of each team over the seven seasons:

TG GraphStudio offers a few different ways to view the output of a query. The image below shows the vertex output:

The JSON output:

And chart output:

Our basketball solution is ready to explore and visualize!
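As a cross-check on the average-efficiency query described above, the same aggregation can be sketched in pandas. This is not the GSQL query itself, only a rough equivalent of what it computes. The column names (TEAM, YEAR, ADJOE, ADJDE) are assumptions about the CSV layout, and the rows are made-up toy values rather than real statistics:

```python
import pandas as pd

# Toy rows standing in for the Kaggle CSV; column names and values are
# illustrative assumptions, not real statistics.
rows = [
    {"TEAM": "Team A", "YEAR": 2018, "ADJOE": 120.0, "ADJDE": 95.0},
    {"TEAM": "Team A", "YEAR": 2019, "ADJOE": 124.0, "ADJDE": 91.0},
    {"TEAM": "Team B", "YEAR": 2018, "ADJOE": 110.0, "ADJDE": 100.0},
    {"TEAM": "Team B", "YEAR": 2019, "ADJOE": 114.0, "ADJDE": 98.0},
]
stats = pd.DataFrame(rows)

# Average each team's offensive and defensive efficiency across its seasons.
avg_eff = (
    stats.groupby("TEAM")[["ADJOE", "ADJDE"]]
    .mean()
    .rename(columns={"ADJOE": "avg_off_eff", "ADJDE": "avg_def_eff"})
    .reset_index()
)
print(avg_eff)
```

If the numbers from a quick check like this disagree with the query output for the same teams, that usually points to a data-mapping problem rather than a problem with the query.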

Install Libraries and Create a Connection

Before getting to work in a Jupyter Notebook, the required libraries must be installed. You can run the commands below within a notebook’s cells, or you can install each one from a terminal with pip install [LIBRARYNAME]:

!pip install pyTigerGraph 
!pip install plotly
!pip install seaborn
!pip install matplotlib
!pip install cufflinks

Once those libraries have been installed, you can open up a Jupyter Notebook and get them imported. I’ve also included numpy and pandas for dataframe manipulation:

import pyTigerGraph as tg
import plotly.express as px
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import cufflinks as cf
import warnings

Create the connection to the graph:

conn = tg.TigerGraphConnection(host="https://SUBDOMAIN.i.tgcloud.io", password="PASSWORD", graphname="basketball")
conn.apiToken = conn.getToken(conn.createSecret())

Substitute SUBDOMAIN and PASSWORD with their respective values specific to your solution.

Finally, check that the connection was successfully made:

print(conn.gsql('ls', options=[]))

Here is a quick snippet of the expected outcome of that function:

We are ready to run our installed queries!

Run Installed Queries

Installed queries used in the Notebook:

  • conferenceAppearanceAndPostseasonOutcome:
  • marchMadnessAppearances:
  • teamsThroughElite8:
  • topWinPercentages:
  • totalSeasonStats: this query is a collection of accumulators covering all the statistics for each team. Check it out by importing my solution here!

Note: @ denotes local (vertex-attached) accumulators in TigerGraph, while @@ denotes global accumulators. Local accumulators are used in these queries to accumulate the statistics specific to each team within each select statement.
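To make the distinction concrete, here is a loose Python analogy rather than GSQL: a per-team dictionary plays the role of a local accumulator, and a single running total plays the role of a global one. The team names and win counts are made up for illustration:

```python
from collections import defaultdict

# Hypothetical (team, season wins) pairs, as if read from RESULTS_OF edges.
edges = [("Team A", 30), ("Team A", 28), ("Team B", 20)]

total_wins_per_team = defaultdict(int)  # plays the role of a local @ accumulator
total_wins_overall = 0                  # plays the role of a global @@ accumulator

for team, wins in edges:
    total_wins_per_team[team] += wins   # one running value per vertex (team)
    total_wins_overall += wins          # one running value for the whole query

print(dict(total_wins_per_team), total_wins_overall)
```

In GSQL the accumulation happens in parallel across vertices, so the analogy only covers where each value lives, not how it is computed.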

Let’s check out how many appearances each conference has made in the tournament and how they have performed.

Run the installed query for the conference appearances:

conferences = conn.runInstalledQuery("conferenceAppearanceAndPostseasonOutcome")[0]["result"]

The output of running the installed query will be in JSON format, so we must transform the output into a data frame:

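conference = []
appearances = []
pastElite8 = []
# Each result is a vertex; pull the fields we need out of its attributes.
for d in conferences:
    for key, value in d['attributes'].items():
        if key == 'id':
            conference.append(value)
        elif key == '@numOtherAppearances':
            appearances.append(value)
        elif key == '@pastElie8':  # spelled as in the original; must match the accumulator name in the installed query
            pastElite8.append(value)
# Assemble the three lists into a dataframe.
d = {'conference': conference, 'appearances': appearances, 'pastElite8': pastElite8}
df = pd.DataFrame(data=d)
df.head()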
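The same transformation can be written more compactly by handing the attribute dictionaries straight to pandas. The sketch below assumes the response has the shape the loop above expects (a list of vertices, each with an attributes dictionary containing those three keys). The sample values are made up, and `df_sample` is a name used only for this illustration:

```python
import pandas as pd

# Hand-written sample shaped like the loop above expects; values are made up.
sample = [
    {"attributes": {"id": "Conf X", "@numOtherAppearances": 10, "@pastElie8": 3}},
    {"attributes": {"id": "Conf Y", "@numOtherAppearances": 4, "@pastElie8": 0}},
]

# Map the query's attribute keys to friendlier column names.
columns = {
    "id": "conference",
    "@numOtherAppearances": "appearances",
    "@pastElie8": "pastElite8",
}

# Build one row per vertex, keep only the attributes of interest, then rename.
df_sample = (
    pd.DataFrame([vertex["attributes"] for vertex in sample])[list(columns)]
    .rename(columns=columns)
)
print(df_sample)
```

Unlike the loop, this version raises a KeyError if an expected attribute is missing, which can be a useful early warning that a key name does not match the query.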

And the resulting dataframe:

Let’s plot this data for all the conferences and show the Linear Regression Line for the data:

# trendline="ols" relies on the statsmodels package being installed
fig = px.scatter(df, x="appearances", y="pastElite8", title="Conference Appearances and Who Has Made It Past the Elite 8", hover_name=conference, hover_data=['conference'], trendline="ols")
fig.show()
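The line that trendline="ols" draws is an ordinary least squares fit. For anyone curious what that means numerically, here is a minimal sketch with NumPy. The data points are invented for illustration and are not the real conference numbers:

```python
import numpy as np

# Made-up (appearances, past Elite 8) pairs, not the real conference data.
x_appearances = np.array([5, 10, 20, 40, 50], dtype=float)
y_past_elite8 = np.array([0, 1, 3, 8, 10], dtype=float)

# Degree-1 least-squares fit: y ~ slope * x + intercept.
slope, intercept = np.polyfit(x_appearances, y_past_elite8, deg=1)

# R^2 measures how much of the variance in y the fitted line explains.
predicted = slope * x_appearances + intercept
ss_res = np.sum((y_past_elite8 - predicted) ** 2)
ss_tot = np.sum((y_past_elite8 - y_past_elite8.mean()) ** 2)
r_squared = 1 - ss_res / ss_tot

print(f"slope={slope:.3f}, intercept={intercept:.3f}, R^2={r_squared:.3f}")
```

For a single predictor, R^2 equals the square of the Pearson correlation coefficient, which ties this chart to the heat maps later in the post.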

As we can see, teams in the Big East Conference have made 52 appearances in the tournament and have made it through the Elite 8 11 times. Now we can rinse and repeat this process to visualize the rest of our installed queries:

  • Top winning teams throughout the seven seasons:
  • How many times teams have actually made it to the tournament:
  • Win loss percentage compared to number of tournament appearances over seven seasons:
  • Regular season wins compared to tournament appearances:
  • Win/loss percentage compared to number of games played from the Elite 8 onward:

The above charts were created with the installed queries and dataframe merging. All of the query calls and data manipulation can be found in this Jupyter Notebook!

Checking Out the Correlation

Is there something in the regular season statistics that differentiates the last 8 teams in the tournament? Let’s see if we can find any sort of correlation between the win/loss percentage of the team during the regular season and how many games through the Elite 8 they have made.

Much like before, we call the relevant installed query and store its output in a dataframe, but this time we also merge it with the Elite 8 appearances dataframe. Once that is finished, we create a heat map to visualize the correlation:

matrix = df6.corr().round(2)
sns.heatmap(matrix, annot=True)
plt.show()
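The merge step is not shown above, so here is a minimal sketch of how it could look. The two small dataframes are hypothetical stand-ins for the query results; their names, columns, and values are all assumptions made for illustration:

```python
import pandas as pd

# Hypothetical stand-ins for two query results; names and numbers are made up.
win_pct = pd.DataFrame({
    "team": ["A", "B", "C", "D"],
    "win_pct": [0.90, 0.75, 0.60, 0.55],
})
elite8 = pd.DataFrame({
    "team": ["A", "B", "C"],
    "elite8_games": [4, 2, 1],
})

# Left-join so teams that never reached the Elite 8 are kept, then count them as 0.
merged = win_pct.merge(elite8, on="team", how="left")
merged["elite8_games"] = merged["elite8_games"].fillna(0)

# Restrict to numeric columns before computing pairwise Pearson correlations.
matrix_sample = merged.select_dtypes("number").corr().round(2)
print(matrix_sample)
```

Selecting the numeric columns first matters on recent pandas versions, where corr() no longer silently skips text columns such as the team name.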

A strong correlation is typically considered to be a coefficient greater than 0.75. Given how much smaller our sample becomes, I would treat anything above 0.60 as a fairly strong correlation. From the data we currently have, there isn’t much correlation between the regular season outcome and whether or not a team will make it to the Elite 8. Interestingly enough, the number of games won matters more than a team’s overall win/loss percentage.

Let’s check another piece of data, team efficiency. We’ll be including the Average Offensive and Defensive Efficiencies of the teams over the 7 seasons to see what kind of impact that has on the tournament outcomes through the Elite 8:

Regular season offensive and defensive efficiency does not seem to be a major indicator of tournament performance either, but it looks like offense is more important than defense. Teams have to score to win!

Let’s call a different query, one that gets the averages of all the regular season statistics, so we can compare their correlations in one heat map:

As we can see from the heat map above, the power rating has the clearest correlation with a team’s Elite 8 appearances at .46. The power rating is made up of three components: a team’s winning percentage, the average opponent’s winning percentage, and the average opponent’s opponent’s winning percentage. Offensive efficiency has the next strongest correlation at .32. Neither is a particularly strong correlation, but both are better than all of the other factors.
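To illustrate how three such components can be blended into a single rating, here is a small sketch. The 25/50/25 weights are those of the classic RPI formula, which this description resembles. They are an assumption on my part, since no weights are stated here, and the dataset’s own power rating column may be computed differently:

```python
def power_rating(wp: float, owp: float, oowp: float,
                 weights: tuple = (0.25, 0.50, 0.25)) -> float:
    """Weighted blend of three winning percentages, each between 0 and 1.

    wp   : the team's own winning percentage
    owp  : the average opponent's winning percentage
    oowp : the average opponent's opponent's winning percentage
    The default weights follow the classic RPI formula (an assumption here).
    """
    w_wp, w_owp, w_oowp = weights
    return w_wp * wp + w_owp * owp + w_oowp * oowp


# Made-up inputs for a hypothetical team.
example_rating = power_rating(wp=0.80, owp=0.55, oowp=0.50)
print(round(example_rating, 3))
```

Because half of the weight sits on the opponents’ record in this formulation, a team with a modest record against a hard schedule can outrank a team with a better record against weaker opponents.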

Conclusion

Kansas and UNC were the two teams that made it to the final of this year’s tournament. Prior to the tournament, FiveThirtyEight gave Kansas a 19% chance of making the final game and a 9% chance of winning the tournament, while UNC was given a 1% chance to make the final and a very improbable 0.4% chance of winning the tournament. UNC was leading the game by 15 points at halftime and lost to Kansas by only 3 points. Saint Peter’s, a 15-seeded team with a 0.2% chance of making it to the Elite 8, defied those odds and became the lowest-seeded team to ever make it that far in the tournament. Talk about going against the odds.

Through my data exploration with TigerGraph and visualization tools, I gained better insight into why choosing a winning bracket and predicting the tournament outcomes is so difficult: the historical regular season data has little to no bearing on the outcomes because of the unpredictable nature of the tournament itself. Next year, I’ll make sure to pay attention to how many games a team wins through their season as well as their power rating, but I will also need to throw some upsets in there just to be safe.

March Madness is just that: madness.

What’s Next + Resources

Thank you for finishing this blog! You now have the tools and the knowledge to create your own graph, write a few queries, and visualize correlation in a dataset of your choosing. Check out sports data found on Kaggle and data.world! Utilize your new skills and knowledge and take TigerGraph’s Million Dollar Challenge for your chance to win grand cash prizes for your graph solutions!

If you need any help or you’d like to showcase what you create, join the TigerGraph Discord Community:

Check out the repository below that contains the Jupyter Notebook and my TigerGraph solution export:

Thanks again, and good luck with your next bracket!

McKenzie Steenson

Developer Relations Intern at TigerGraph, lover of music, golf and video games. Connect with me: https://www.linkedin.com/in/mckenzie-steenson/