Danblog

Medical Resident, Software Developer, Basketball Fan

(More) Fun with NFL Stats


I came across this blog post from J253. The author uses a Kaggle dataset that contains play-by-play data for every NFL game between 2009 and 2017--pretty cool! The post explores play type by down, yards to go, and position on the field. It's a great introduction to the dataset and to using Bokeh for visualization.

With 9 seasons of data available, the obvious next step to me was to visualize how these trends change over time. Any football fan has heard ad nauseam about the explosion of the passing game over the last decade or two. In this post, I'll adapt a couple of these visualizations to allow us to compare data year-to-year. In the process, I'll demonstrate a method for creating dynamic visualizations in Bokeh using custom JavaScript.

In [1]:
import pandas as pd
from collections import Counter
import itertools
pd.set_option('display.max_columns', 150)
In [2]:
from bokeh.plotting import figure, show
from bokeh.models import ColumnDataSource, CustomJS, Slider, Legend, FixedTicker
from bokeh.io import output_notebook
from bokeh.resources import CDN
from bokeh.embed import file_html
from bokeh.transform import factor_cmap
from bokeh.palettes import Paired, Spectral
from bokeh.layouts import row, column, widgetbox
# from IPython.core.display import HTML
output_notebook()
In [3]:
path = '~/Downloads/NFL Play by Play 2009-2017 (v4).csv'
df = pd.read_csv(path, low_memory=False)
In [4]:
# filter by team if desired
TEAM = 'all'
if TEAM == 'all':
    team_df = df
else:
    team_df = df.loc[df['posteam'] == TEAM]

# keep only plays with a recorded down (mask built from team_df itself so the indexes match)
team_df = team_df.loc[team_df['down'].notnull()]

play_types = ('Pass', 'Run', 'Punt', 'Field Goal')

I'll skip the boilerplate, as it's unchanged from J253's post. We'll add a seasons variable: a tuple of integers representing the seasons of interest, as they are represented in the dataset.

In [5]:
seasons = (2009, 2010, 2011, 2012, 2013, 2014, 2015, 2016, 2017)

The key difference in implementation is that we will create a separate ColumnDataSource for each season. We will (re)create two different visualizations:

  1. Third Down Play Type by Yards To Go
  2. Play Type by Yard Line

Let's start with the first. Here we define our yards-to-go range of interest and filter the dataset to include only third down plays.

In [6]:
# define range of yards to go I want to look at
y2g = range(1,25)

# filter down the total team_df to just third downs
team_df_d3 = team_df.loc[team_df['down'] == 3]

# Our x axis will be yards to go, defined above as y2g
x = y2g

Next we'll make a quick and dirty function that filters the DataFrame by season, then formats the data so that we can use it as a "column":

In [7]:
def plays_by_y2g(t:str, season:int):
    # t must be either 'Run' or 'Pass' or 'Field Goal' or 'Punt'
    # filter play types by season
    team_df_season = team_df_d3.loc[team_df_d3['Season'] == season]
    plays_on_d3 = [Counter(team_df_season.loc[team_df_season['ydstogo'] == yrd]['PlayType']) for yrd in y2g]

    # extract the count of a particular play type for each 'ydstogo' value
    return [play[t] for play in plays_on_d3]

For example:

In [8]:
plays_by_y2g("Run", 2009)
Out[8]:
[619,
 192,
 123,
 97,
 78,
 59,
 59,
 44,
 49,
 68,
 22,
 24,
 24,
 20,
 22,
 14,
 9,
 11,
 6,
 4,
 6,
 6,
 2,
 3]
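The same counts can also be computed without building a Counter per yards-to-go value. Here is a minimal sketch using a pandas groupby, assuming the same column names (`Season`, `ydstogo`, `PlayType`). The function name `plays_by_y2g_grouped` and the toy DataFrame are made up for illustration and are not part of the original notebook:

```python
import pandas as pd

def plays_by_y2g_grouped(plays, play_type, season, y2g=range(1, 25)):
    """Count plays of one type per yards-to-go value for a single season."""
    # keep only the requested season and play type
    subset = plays[(plays['Season'] == season) & (plays['PlayType'] == play_type)]
    counts = subset.groupby('ydstogo').size()
    # reindex so every yards-to-go value appears, with 0 where there were no plays
    return counts.reindex(list(y2g), fill_value=0).tolist()

# toy data standing in for the third-down DataFrame (hypothetical values)
toy = pd.DataFrame({
    'Season':   [2009, 2009, 2009, 2009, 2010],
    'ydstogo':  [1, 1, 3, 1, 1],
    'PlayType': ['Run', 'Run', 'Run', 'Pass', 'Run'],
})

result = plays_by_y2g_grouped(toy, 'Run', 2009, y2g=range(1, 5))
print(result)  # [2, 0, 1, 0]
```

The reindex step matters: without it, yards-to-go values with no plays would be missing from the result, and the list would no longer line up with the x axis.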

Now we can create a ColumnDataSource for each season. I'll use a list comprehension and tuple unpacking to do this in a single statement:

In [9]:
(source09, source10, source11, source12, source13,
source14, source15, source16, source17) = [
    ColumnDataSource(dict(x=x, y_runs=plays_by_y2g('Run', season), y_pass=plays_by_y2g('Pass', season)))
                     for season in seasons
]

To visualize changes over time, we'll use a Bokeh slider. Without a dedicated Bokeh server, the browser can't call back into Python, so the client needs to have all of the data it needs to draw the visualization up front. Thankfully, with some custom JavaScript (see footnote 1), we can create a callback that switches the source data used for the visualization as we manipulate a slider:

// define the data sources passed to the callback as arguments
var data = source.data;
var nine = nine.data;
var ten = ten.data;
var eleven = eleven.data;
var twelve = twelve.data;
var thirteen = thirteen.data;
var fourteen = fourteen.data;
var fifteen = fifteen.data;
var sixteen = sixteen.data;
var seventeen = seventeen.data;

// bokeh makes the object triggering the callback available as cb_obj
var f = cb_obj.value;

// switch on the season selected in the slider (cb_obj.value)
if (f == 2009) {
      for (var e in data) delete data[e];  // clears the dummy datasource
      data['x'] = nine['x'];
      data['y_runs'] = nine['y_runs'];
      data['y_pass'] = nine['y_pass'];
}

if (f == 2010) {
  for (var e in data) delete data[e];
  data['x'] = ten['x'];
  data['y_runs'] = ten['y_runs'];
  data['y_pass'] = ten['y_pass'];
}
// etc...
source.change.emit();
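Since the branches above differ only in which source they copy from, the callback text can also be generated rather than written out per season. Here is a rough sketch in Python that builds an equivalent, more compact JavaScript string. The helper name `build_callback_code` and the argument names `s2009` ... `s2017` are assumptions for illustration; the original callback uses `nine`, `ten`, and so on:

```python
seasons = (2009, 2010, 2011, 2012, 2013, 2014, 2015, 2016, 2017)

def build_callback_code(seasons, columns):
    """Return JavaScript that copies the selected season's columns into `source`."""
    # map each season to a hypothetical CustomJS argument name such as s2009
    lookup = ", ".join(f"{season}: s{season}" for season in seasons)
    # one assignment per column, mirroring the data['x'] = nine['x'] lines
    copies = "\n".join(f"data['{col}'] = selected['{col}'];" for col in columns)
    return (
        f"var sources = {{{lookup}}};\n"
        "var data = source.data;\n"
        "var selected = sources[cb_obj.value].data;\n"
        f"{copies}\n"
        "source.change.emit();\n"
    )

code = build_callback_code(seasons, ["x", "y_runs", "y_pass"])
print(code)
```

To use a string like this, the `args` passed to CustomJS would need matching names, for example `{f"s{season}": src for season, src in zip(seasons, sources)}` plus a `source` entry for the data source the plot draws from.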

We'll use the following function to load the above javascript, define a callback with CustomJS, and create a season slider:

In [10]:
def create_season_slider(js_path):
    # read our custom javascript
    with open(js_path) as f:
        code = f.read()
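    # define a callback function that will be used each time the slider value changes.
    # `source` is the source the plot draws from (source09); the callback clears and
    # overwrites it, so pass the 2009 data as a separate copy. Otherwise `source` and
    # `nine` are the same object and the 2009 data is lost after the first change.
    nine = ColumnDataSource(data=dict(source09.data))
    callback = CustomJS(args=dict(source=source09, nine=nine, ten=source10, eleven=source11,
                              twelve=source12, thirteen=source13, fourteen=source14,
                              fifteen=source15, sixteen=source16, seventeen=source17), code=code)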
    # create a slider (see http://bokeh.pydata.org/en/0.10.0/docs/user_guide/interaction.html#slider)
    season_slider = Slider(start=2009, end=2017, value=2009, step=1,
                    title="Season", callback=callback)
    # finally, associate our season slider with the callback
    callback.args["season"] = season_slider
    return season_slider

Finally, we'll display our plot and slider within a Bokeh layout, initializing with data from the 2009 season.

In [11]:
def show_plot_with_slider(plot, slider):
    layout = column(plot, widgetbox(slider))
    show(layout)
    
p = figure(title='Third Down Play Type by Yards to Go', toolbar_location=None, tools='',
           plot_height=350, plot_width=750, y_range=(0, 600))

p.line(source=source09, x="x", y="y_pass", color='#2b83ba', legend='Pass', line_width=4)
p.line(source=source09, x="x", y="y_runs", color='#abdda4', legend='Run', line_width=4)
p.legend.location = 'top_left'

show_plot_with_slider(p, create_season_slider("./js/modify_bokeh_slider_3dytg.js"))

We'll follow the same steps to generate a dynamic, year-by-year visualization of play type by position on the field.

In [12]:
# define range of yard lines and set it as our x-axis
yrdline = range(0,101)
x = yrdline

# helper function from J253
def total_plays(i, play, cou):
    total_plays = sum([cou[i][play] for play in play_types])
    if total_plays == 0:
        return 1
    else:
        return total_plays

def play_by_yrd(t:str, season:int):
    # normalized for number of plays on each yard line
    # t must be either 'Run' or 'Pass' or 'Field Goal' or 'Punt'
    team_df_season = team_df.loc[team_df['Season'] == season]
    # create list of Counters for every yard line on the field.
    plays_on_yrd = [Counter(team_df_season.loc[team_df_season['yrdline100'] == yrd]['PlayType']) for yrd in yrdline]
    return [play[t]/total_plays(i, play, plays_on_yrd) for i, play in enumerate(plays_on_yrd)]

(source09, source10, source11, source12, source13,
source14, source15, source16, source17) = [
    ColumnDataSource(dict(x=x, y_runs=play_by_yrd('Run', season), y_pass=play_by_yrd('Pass', season),
                          y_punt=play_by_yrd('Punt', season), y_fg=play_by_yrd('Field Goal', season)))
                     for season in seasons
]

p = figure(title='Play Type by Yard Line', toolbar_location=None, tools='',
           plot_height=350, plot_width=750, x_range=(1,99), y_range=(0, .75))
y_pass = p.line(source=source09, x="x", y="y_pass", color='#2b83ba', line_width=4)
y_runs = p.line(source=source09, x="x", y="y_runs", color='#abdda4', line_width=4)
y_punt = p.line(source=source09, x="x", y="y_punt", color='#fdae61', line_width=4)
y_fg = p.line(source=source09, x="x", y="y_fg", color='#d7191c', line_width=4)
p.xaxis.axis_label = 'Yard Line (100 is team\'s own goal line)'
p.yaxis.axis_label = 'Fraction of Plays'
p.xaxis.ticker = FixedTicker(ticks=list(range(0, 101, 5)))

# Display the legend outside of the plot; otherwise it blocks some of the data
legend = Legend(items=[
    ("Pass", [y_pass]),
    ("Run", [y_runs]),
    ("Punt", [y_punt]),
    ("Field Goal", [y_fg]),
], location=(0, -30))
p.add_layout(legend, 'right')

# The JS is mostly the same as above, but here also sets the "y_punt" and "y_fg" columns
# each time the slider is updated
show_plot_with_slider(p, create_season_slider("./js/modify_bokeh_slider_yrd.js"))
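The normalization inside play_by_yrd divides each play type's count by the total across the four tracked play types, so the plotted values are shares rather than raw counts. Here is a minimal, standard-library sketch of that step on made-up numbers. The helper name `play_shares` and the 'Other' category are illustrative and not from the original notebook:

```python
from collections import Counter

play_types = ('Pass', 'Run', 'Punt', 'Field Goal')

def play_shares(counter):
    """Convert one yard line's play counts into shares of the tracked play types."""
    total = sum(counter[p] for p in play_types)
    if total == 0:
        # no tracked plays on this yard line: report zero shares instead of dividing by zero
        return {p: 0.0 for p in play_types}
    return {p: counter[p] / total for p in play_types}

# hypothetical counts for a single yard line; 'Other' is not a tracked play type
toy = Counter({'Pass': 6, 'Run': 3, 'Punt': 1, 'Other': 2})
shares = play_shares(toy)
print(shares)
```

Note that untracked play types are left out of the denominator, as in the helper above, so the four shares for a yard line sum to 1 whenever at least one tracked play occurred there.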

Footnotes

1 Thanks to this Stack Overflow answer for the custom JavaScript callback.