Documenting the NBA Stats API

Fri 01 July 2016
Python
#nba, #stats, #swagger

Introduction

Those of you who have played around with stats.nba.com have likely experienced the same mix of excitement and frustration that I have. Clearly, there is an awesome amount of data that is lurking just beneath the surface: boxscores, advanced stats, shot charts, SportVU tracking data. The NBA tempts us by making it possible to take advantage of their underlying stats API if you can figure it out, but (I assume purposely) makes it difficult due to the lack of any documentation. Extracting any meaningful data can quickly become a huge time sink.

There are a number of good clients for the NBA Stats API (seemethere/nba_py, nickb1080/nba), but whether you're writing a client or just trying to use and understand an existing client, the biggest barrier to entry is the time needed to manually inspect requests and responses in an effort to find the right endpoint and parameters.

Probably the best effort to document the NBA Stats API can be found in the nba_py wiki on GitHub. This document was first posted in September, and was updated again in March. Big kudos to seemethere for creating this, as it's undoubtedly the hardest part of creating any documentation for the API. But it does leave some questions unanswered.

In an initial effort to improve the available documentation for the NBA Stats API, I aimed to answer a few of those questions, and to provide a different format for the documentation that would be machine-readable (allowing for easy client generation) and user friendly (allowing anyone to quickly view the documentation in the browser). In this post, I'll show how I used the documentation from the nba_py wiki to generate a JSON representation of the NBA Stats API that complies with the Swagger Specification. I used Python to parse the wiki and generate Swagger-compatible JSON.

The code that was used to generate this documentation can be found at github.com/danielwelch/little-pynny, and the current browsable version of the documentation, generated using Swagger UI, can be found at the GitHub Pages project page.

Parsing the nba_py Wiki

To start, let's parse the documentation from the nba_py wiki to create a JSON structure that contains the name of each endpoint, the names of its parameters, and whether or not each parameter is required. Most of the creative heavy lifting was already done here; I just adapted this JavaScript from nickb1080 to Python:

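def parse_endpoint(block):
    # The first line is the "##" heading: drop its first three characters
    # and its last character to get the endpoint name.
    name = block[0][3:-1]
    # Parameter lines start on the third line; drop the four-character
    # prefix and skip lines that are empty afterwards.
    param_names = [line[4:] for line in block[2:]
                   if line[4:] != '']
    # "required" is a placeholder until we ask the server.
    params = [
        {"name": p, "required": False}
        for p in param_names
    ]
    return {"endpoint": name, "params": params, "description": None}


def get_endpoints(path):
    with open(path, 'r') as f:
        md = f.read()

    # Blocks are separated by blank lines; keep only the ones that
    # begin with a "##" heading, one per endpoint.
    blocks = (chunk.split("\n") for chunk in md.split("\n\n"))
    endpoints = [parse_endpoint(block)
                 for block in blocks
                 if block[0].startswith("##")]

    return endpoints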

Running get_endpoints(PATH_TO_NBA_PY_DOCS) will return a list of dictionaries, each containing the endpoint name (same as its path) and information about the parameters for that endpoint. Using this information, we can programmatically hit these endpoints and inspect the responses to determine if the endpoint is active. I did this using requests:

import requests

BASE_URL = 'http://stats.nba.com/stats/{endpoint}'
HEADERS = {'user-agent': ('Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5) '
                          'AppleWebKit/537.36 (KHTML, like Gecko) '
                          'Chrome/45.0.2454.101 Safari/537.36'),
           'referer': 'http://stats.nba.com/scores/'}

# The list of endpoint dictionaries built in the previous step
data = get_endpoints(PATH_TO_NBA_PY_DOCS)

for d in data:
    method = d["endpoint"].lower()
    # Send the request with no parameters on purpose
    r = requests.get(
        BASE_URL.format(endpoint=method),
        headers=HEADERS
    )

    # Inspect r.text to determine if endpoint is active
    # and, if so, what params are required.

For now, the "required" key in each parameter dictionary is just a placeholder. The next step is to determine what parameters are actually required, and update our data structure accordingly. Thankfully, the server will actually tell us what parameters are required for each endpoint if we send it a bad request. For example, here's the response body when we send a GET request to stats.nba.com/stats/boxscoreadvancedv2/ without parameters:

GameID is required; The StartPeriod property is required.; The EndPeriod property is required.; The StartRange property is required.; The EndRange property is required.; The RangeType property is required.

Well, that's nice. These responses can be parsed with regular expressions. Of course, in the tradition of the NBA Stats API, there are some inconsistencies. Some parameters ("Game Scope", "Player Scope", "Stat Category") are two words instead of one word in the response body, so we need to account for this in our regexes.

import re


class ParamMatchException(Exception):
    """Raised when a phrase matches neither pattern (minimal stand-in; the
    original definition is not shown in this post)."""


def define_required_params(endpoint, response_phrases):
    required_params = []
    for phrase in response_phrases:
        # Two phrasings occur: "X is required" and "The X property is required."
        m1 = re.match(r"^(?P<param>(\w*)|(Game Scope)|(Player Scope)|(Stat Category)) is required\.?", phrase)
        m2 = re.match(r"^The (?P<param>(\w*)|(Game Scope)|(Player Scope)|(Stat Category)) property is required\.?", phrase)
        if m1 is None:
            if m2 is None:
                raise ParamMatchException(endpoint["endpoint"], phrase)
            else:
                param = m2.group("param")
        else:
            param = m1.group("param")
        # deal with the random params nba puts a space in for no reason
        if param == "Game Scope":
            required_params.append("GameScope")
        elif param == "Player Scope":
            required_params.append("PlayerScope")
        elif param == "Stat Category":
            required_params.append("StatCategory")
        else:
            required_params.append(param)

    # Flag every documented parameter the server reported as required
    for p in endpoint["params"]:
        if p["name"] in required_params:
            p["required"] = True

    return endpoint

So, we can split the response text around "; " and pass the phrases and our original data structure to the define_required_params function.
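The splitting step might look something like the sketch below. It is a self-contained illustration rather than the code from the repo: split_phrases and extract_param are hypothetical helper names, the single combined regex is my own simplification, and the sample text is the boxscoreadvancedv2 response shown above.

```python
import re

# Sample body copied from the boxscoreadvancedv2 response shown earlier.
SAMPLE = (
    "GameID is required; The StartPeriod property is required.; "
    "The EndPeriod property is required.; The StartRange property is required.; "
    "The EndRange property is required.; The RangeType property is required."
)

# Covers both "X is required" and "The X property is required."
# The name may contain one space (e.g. "Game Scope"); the lazy "??" keeps
# the word "property" from being swallowed into the name.
PHRASE_RE = re.compile(
    r"^(?:The )?(?P<param>\w+(?: \w+)??) (?:property )?is required\.?$"
)


def split_phrases(text):
    """Split a response body on ';' and drop empty pieces."""
    return [piece.strip() for piece in text.split(";") if piece.strip()]


def extract_param(phrase):
    """Return the parameter name from one phrase, with any space removed."""
    match = PHRASE_RE.match(phrase)
    if match is None:
        raise ValueError("Unrecognized phrase: %r" % phrase)
    return match.group("param").replace(" ", "")


required = [extract_param(p) for p in split_phrases(SAMPLE)]
print(required)
```

Splitting on ";" and stripping whitespace, instead of splitting on "; " exactly, also tolerates a trailing semicolon at the end of the body.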

At this point, if we dumped this data into JSON, we would have an array of objects like this:

{
    "endpoint": "boxscoreadvancedv2",
    "description": null,
    "active": true,
    "params": [
        {
            "name": "GameID",
            "required": true
        },
        {
            "name": "StartPeriod",
            "required": true
        },
        {
            "name": "EndPeriod",
            "required": true
        },
        {
            "name": "StartRange",
            "required": true
        },
        {
            "name": "EndRange",
            "required": true
        },
        {
            "name": "RangeType",
            "required": true
        }
    ]
}

Converting to Swagger

While you could use the above example JSON to generate a simple client using something like Jinja2 in Python, it would probably be better to put this information into a standardized format. The Swagger Specification is a popular choice and has useful tools for generating clients as well as good-looking documentation. Reformatting our current structure to comply with the Swagger Specification is easy enough, and the code I used to do this can be found in the repo (swaggerify.py). The Swagger UI representation of this documentation is hosted on the little-pynny project page.
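As a rough illustration of that reformatting step (this is not the actual swaggerify.py from the repo), here is a minimal sketch that maps the structure above onto a Swagger 2.0 document. The function name to_swagger, the title and version strings, the response description, and the choice to treat every parameter as a string in the query string are all assumptions made for the example.

```python
import json


def to_swagger(endpoints):
    """Build a Swagger 2.0 document (as a dict) from parsed endpoints."""
    paths = {}
    for ep in endpoints:
        parameters = [
            {
                "name": p["name"],
                "in": "query",      # assumption: params go in the query string
                "required": p["required"],
                "type": "string",   # assumption: real types are not known yet
            }
            for p in ep["params"]
        ]
        paths["/" + ep["endpoint"]] = {
            "get": {
                "description": ep["description"] or "",
                "parameters": parameters,
                "responses": {"200": {"description": "Successful response"}},
            }
        }
    return {
        "swagger": "2.0",
        "info": {"title": "NBA Stats API", "version": "0.1.0"},  # placeholders
        "host": "stats.nba.com",
        "basePath": "/stats",
        "schemes": ["http"],
        "paths": paths,
    }


# A shortened version of the boxscoreadvancedv2 entry shown earlier.
example = {
    "endpoint": "boxscoreadvancedv2",
    "description": None,
    "params": [
        {"name": "GameID", "required": True},
        {"name": "StartPeriod", "required": True},
    ],
}

doc = to_swagger([example])
print(json.dumps(doc, indent=2))
```

Keeping the output as a plain dictionary until the final json.dumps call makes it easy to fill in descriptions or parameter types later without touching the conversion logic.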

Limitations and Drawbacks

There are some pretty clear limitations to this approach. First, the documentation contains no description or explanation of what these endpoints actually do, only a placeholder for a description to be added in the future. While some endpoints are fairly straightforward, with meaning that can be inferred from a response, other endpoints return responses that are more opaque.

Second, the fact that this documentation is generated from a static starting point means that any new endpoints that are added to the NBA API will have to be manually discovered and added. Similarly, this approach relies on the fact that the nba_py documentation is accurate and exhaustive. On the bright side, regenerating the documentation should identify newly deprecated endpoints, and update the documentation to reflect that.

What's Next?

Next, it would be nice to fill out the Swagger representation with more useful information. Each endpoint would benefit from a description or an example of what it returns. Additionally, it would be great to document the correct use of each parameter for an endpoint. The first step in further documenting the parameters could be inspecting the server's response to bad parameters. For example, a request to stats.nba.com/stats/homepagev2/ with inappropriate parameters will receive the following response:

The field StatType must match the regular expression '^(Traditional)|(Advanced)|(Tracking)$'.; The field LeagueID must match the regular expression '^\d{2}$'.; The field GameScope must match the regular expression '^(Season)|(Last 10)|(Yesterday)|(Finals)$'.; The field PlayerScope must match the regular expression '^(All Players)|(Rookies)$'.;

Some parameter documentation could be generated quickly and easily by parsing these sorts of responses.
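Here is one way that parsing might be sketched, using the homepagev2 response above as sample input. This is only an illustration under a couple of assumptions: parse_constraints is a hypothetical helper name, and reading each parenthesized alternative as an allowed value is my interpretation of the patterns the server returns.

```python
import re

# Sample body copied from the homepagev2 response shown above.
SAMPLE = (
    "The field StatType must match the regular expression "
    r"'^(Traditional)|(Advanced)|(Tracking)$'.; "
    "The field LeagueID must match the regular expression "
    r"'^\d{2}$'.; "
    "The field GameScope must match the regular expression "
    r"'^(Season)|(Last 10)|(Yesterday)|(Finals)$'.; "
    "The field PlayerScope must match the regular expression "
    r"'^(All Players)|(Rookies)$'.;"
)

# Captures the field name and the quoted pattern from each phrase.
FIELD_RE = re.compile(
    r"The field (?P<name>\w+) must match the regular expression "
    r"'(?P<pattern>[^']*)'"
)


def parse_constraints(text):
    """Map each field name to its pattern and any listed alternatives."""
    constraints = {}
    for match in FIELD_RE.finditer(text):
        pattern = match.group("pattern")
        # Alternatives written as (A)|(B)|(C) are read as allowed values;
        # patterns without parentheses yield an empty list.
        values = re.findall(r"\(([^()]*)\)", pattern)
        constraints[match.group("name")] = {"pattern": pattern, "enum": values}
    return constraints


constraints = parse_constraints(SAMPLE)
for name, info in constraints.items():
    print(name, info["enum"] or info["pattern"])
```

Swagger 2.0 parameter objects accept both pattern and enum keys, so values extracted this way could be merged into the generated documentation fairly directly.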