NBA Team Box Scores (Part 1 of ?): Web Scraping with BeautifulSoup

  • Tuesday, Oct 20, 2020

Introduction

This project uses Python and BeautifulSoup to scrape ESPN for the box scores of regular season games for NBA teams. The box score is a selected set of statistics which summarize the results of a game.

This data will be useful for answering questions such as:

  • Which NBA statistics are actually useful for determining wins?
  • Can we predict whether a team will make the playoffs?
  • Can we predict whether a team will win a game?

The answers to these questions would be most useful to a team or coach deciding which aspects of gameplay to improve. If securing more offensive rebounds turns out to matter more for winning than free throw percentage, then teams should spend more practice time on rebounding than on free throws.

In basketball, the box score summarizes statistics such as games played (GP), games started (GS), minutes played (MIN or MPG), field goals made (FGM), field goals attempted (FGA), field goal percentage (FG%), 3-pointers made (3PM), 3-pointers attempted (3PA), 3-point percentage (3P%), free throws made (FTM), free throws attempted (FTA), free throw percentage (FT%), offensive rebounds (OREB), defensive rebounds (DREB), total rebounds (REB), assists (AST), turnovers (TOV), steals (STL), blocked shots (BLK), personal fouls (PF), points scored (PTS), and plus/minus (+/-).

Scrape ESPN Team Page for Team Names

First, we need a list of the NBA teams, so let’s look at the structure of the ESPN website. The teams page looks like the following:

Inspecting the page shows that the links for each team sit inside a <div> container with the class ‘pl3’.

So let’s open the team page, parse it, and find all containers of that class.

import pandas as pd
import numpy as np
from bs4 import BeautifulSoup
from urllib import request
from urllib.error import HTTPError, URLError
import time

team_url = request.urlopen('http://www.espn.com/nba/teams').read()
team_soup = BeautifulSoup(team_url,'lxml')
team_containers = team_soup.find_all('div',{'class':'pl3'})

To get the team name, we grab the text of the first <a> tag inside each team’s container. To get the abbreviation, we take the second link in the container, split its URL on ‘/’, and grab the 7th element (index 6). Finally, we create a dictionary that maps each abbreviation to its team name. This will come in handy later.

team_names = [team.a.text for team in team_containers]
team_abbrs = [team.find_all('a',href=True)[1]['href'].split('/')[6] for team in team_containers]
teams = dict(zip(team_abbrs,team_names))
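
The abbreviation step depends on where the abbreviation sits in the link path. Here is a minimal sketch of that split, assuming an href shaped like the one below (the exact path is a hypothetical stand-in, not copied from the live page):

```python
# Hypothetical href, assumed to resemble the second link in a team container.
sample_href = '/nba/team/stats/_/name/bos/boston-celtics'

# The path starts with '/', so the first element of the split is an empty string.
parts = sample_href.split('/')
# parts == ['', 'nba', 'team', 'stats', '_', 'name', 'bos', 'boston-celtics']

# Index 6 (the 7th element) is the abbreviation.
abbr = parts[6]

# A variant that does not depend on a fixed position: take the element after 'name'.
abbr_by_name = parts[parts.index('name') + 1]

print(abbr, abbr_by_name)
```

The second form may be a little more robust if the number of path segments before ‘name’ ever changes, though it still assumes the abbreviation directly follows ‘name’.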

Let’s make sure we got all 30 teams.

print(len(teams))
teams
30

{'bos': 'Boston Celtics',
 'bkn': 'Brooklyn Nets',
 'ny': 'New York Knicks',
 'phi': 'Philadelphia 76ers',
 'tor': 'Toronto Raptors',
 'chi': 'Chicago Bulls',
 'cle': 'Cleveland Cavaliers',
 'det': 'Detroit Pistons',
 'ind': 'Indiana Pacers',
 'mil': 'Milwaukee Bucks',
 'den': 'Denver Nuggets',
 'min': 'Minnesota Timberwolves',
 'okc': 'Oklahoma City Thunder',
 'por': 'Portland Trail Blazers',
 'utah': 'Utah Jazz',
 'gs': 'Golden State Warriors',
 'lac': 'LA Clippers',
 'lal': 'Los Angeles Lakers',
 'phx': 'Phoenix Suns',
 'sac': 'Sacramento Kings',
 'atl': 'Atlanta Hawks',
 'cha': 'Charlotte Hornets',
 'mia': 'Miami Heat',
 'orl': 'Orlando Magic',
 'wsh': 'Washington Wizards',
 'dal': 'Dallas Mavericks',
 'hou': 'Houston Rockets',
 'mem': 'Memphis Grizzlies',
 'no': 'New Orleans Pelicans',
 'sa': 'San Antonio Spurs'}

Scraping ESPN Team Schedule Page for Game IDs

The page containing a game’s box score is of the following format: https://www.espn.com/nba/matchup?gameId=401070218. So we need to get a list of all the game IDs. We do this by looking at the schedule page of each team.

A team’s regular season schedule is listed on a URL with the following format: https://www.espn.com/nba/team/schedule/_/name/ABBR/season/YEAR/seasontype/2, where ABBR (team abbreviation) and YEAR are our variables of interest.
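
The two URL formats above can be wrapped in small helpers. This is only a sketch; the function names `schedule_url` and `matchup_url` are made up for illustration and are not used elsewhere in this post.

```python
def schedule_url(team_abbr, year, season_type=2):
    """Build a team's schedule URL (the regular-season URL above uses seasontype 2)."""
    return ('https://www.espn.com/nba/team/schedule/_/name/'
            f'{team_abbr}/season/{year}/seasontype/{season_type}')


def matchup_url(game_id):
    """Build the matchup (team box score) URL for a game ID."""
    return f'https://www.espn.com/nba/matchup?gameId={game_id}'


print(schedule_url('gs', 2019))
print(matchup_url('401070218'))
```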

The 2018-2019 Regular Season Schedule for the Golden State Warriors looks like the following:

# For the 2018-2019 regular season
year = '2019'

team_schedules = dict()
for team_abbr in list(teams.keys()):
    # Open the Team Schedule page and parse it.
    url = 'https://www.espn.com/nba/team/schedule/_/name/'+team_abbr+'/season/'+year+'/seasontype/2'
    schedule_url = request.urlopen(url).read()
    time.sleep(1)
    schedule_soup = BeautifulSoup(schedule_url,'lxml')

    # The class for each table row of a team's schedule contains the string "Table__TR".
    # So, search for that and exclude the first row which contains column names.
    game_rows = schedule_soup.select('tr[class*="Table__TR"]')[1:]

    # Grab the dates of each game.
    game_dates = [game.td.span.text for game in game_rows]
    # Prefix each game ID with the team_abbr. These will be used as dictionary keys later on.
    # For each row, get all the <a> tag hyperlinks, select the third hyperlink,
    # split on the equals sign, and take the second element of the split to get the gameID.
    game_ids = [team_abbr+'_'+game.find_all('a',href=True)[2]['href'].split('=')[1] for game in game_rows]

    # The game locations are stored in a div element of class "flex items-center opponent-logo".
    game_logos = [game.find_all('div', attrs={'class':'flex items-center opponent-logo'}) for game in game_rows]
    # Game locations are indicated by "@" or "vs" for away and home, respectively.
    game_locs = [game[0].span.text for game in game_logos]

    # The result of a game is stored in a span element with a class containing the string "fw-bold clr-".
    game_results = [game.text for game in schedule_soup.select('span[class*="fw-bold clr-"]')]

    # Put table rows into a dictionary with a key of format teamABBR_gameID
    team_rows = list(zip(game_dates, game_locs, game_results))
    team_schedules.update(zip(game_ids,team_rows))
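
The loop above pulls the game ID by splitting the link on ‘=’. As an alternative sketch, the standard library can parse the query string instead, which keeps working if a link carries extra parameters. The helper name `game_id_from_href` is hypothetical, and the sample link simply reuses the matchup URL format as a stand-in for a schedule-row link.

```python
from urllib.parse import urlparse, parse_qs


def game_id_from_href(href):
    """Return the gameId query parameter from a link, or None if it is absent."""
    query = parse_qs(urlparse(href).query)
    values = query.get('gameId')
    return values[0] if values else None


# Stand-in link (assumption): same shape as the matchup URL format.
href = 'https://www.espn.com/nba/matchup?gameId=401070218'
team_abbr = 'gs'

# Build the teamABBR_gameID key used for the schedule dictionary.
key = f'{team_abbr}_{game_id_from_href(href)}'
print(key)

# Splitting the key on '_' recovers both parts again.
abbr, game_id = key.split('_')
```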

Scraping the Box Scores

There are three tables of class “mod-data” on the Team Matchup page. Recall, the Team Matchup URL is of the format: https://www.espn.com/nba/matchup?gameId=401070218, and the page looks like the following:

We’re interested in the largest table, which contains the team box score (FG, FG%, 3PT, etc.) for this particular game.

def get_box_score(table_data):
    """ Get all the cell data from a list of table rows. """
    box_data = []
    for tr in table_data:
        cells = tr.find_all('td')
        # Strip tabs and newlines from the text of each cell in the row.
        row = [cell.text.strip('\t\n') for cell in cells]
        box_data.append(row)
    return box_data

final_data = []
for key in list(team_schedules.keys()):
    team_abbr = key.split('_')[0]
    game_id = key.split('_')[1]

    # Make sure the URL opens
    try:
        box_url = request.urlopen('https://www.espn.com/nba/matchup?gameId='+game_id).read()
        time.sleep(1)
    # If there is an HTTP error, wait 2 mins and try once more.
    # Note: a second failure here is not caught and will stop the loop.
    except HTTPError as e:
        print(team_abbr, game_id, e.reason)
        time.sleep(120)
        box_url = request.urlopen('https://www.espn.com/nba/matchup?gameId='+game_id).read()

    # Make sure the URL has the same table format.
    try:
        box_soup = BeautifulSoup(box_url,'lxml')

        # Table containing the box score data
        box_table = box_soup.find_all('table',attrs={'class':'mod-data'})[0]

        # Get abbreviation and game points of the home team.
        home_team_container = box_soup.find_all('div', attrs={'class':'team home'})
        home_team_abbr = home_team_container[0].find_all('span', attrs={'class':'abbrev'})[0].text.lower()
        home_team_pts = box_soup.find_all('div', attrs={'class': 'score icon-font-before'})[0].text

        # Get abbreviation and game points of the away team.
        away_team_container = box_soup.find_all('div', attrs={'class':'team away'})
        away_team_abbr = away_team_container[0].find_all('span', attrs={'class':'abbrev'})[0].text.lower()
        away_team_pts = box_soup.find_all('div', attrs={'class': 'score icon-font-after'})[0].text

        # Put the game points and a column name in a list.
        points = ['PTS', away_team_pts, home_team_pts]

        # Table rows with an indent class
        indents = box_table.find_all('tr', attrs={'class':'indent'})
        # Table rows with a highlight class
        highlights = box_table.find_all('tr', attrs={'class':'highlight'})

        # Get the box scores
        highlights_data = get_box_score(highlights)
        indents_data = get_box_score(indents)

        # Concatenate points, highlights, and indents lists. Then transpose.
        box_data = np.concatenate(([points],highlights_data,indents_data)).T

        # List the team's data first then the opponent's data.
        # Also include the gameID, game date, game location, and game result.
        if team_abbr == away_team_abbr:
            row = np.concatenate(([team_abbr,teams[team_abbr],game_id],team_schedules[key],box_data[1],
                                  [home_team_abbr,teams[home_team_abbr]],box_data[2]))
        else:
            row = np.concatenate(([team_abbr,teams[team_abbr],game_id],team_schedules[key],box_data[2],
                                  [away_team_abbr,teams[away_team_abbr]],box_data[1]))
        final_data.append(row)

    # If the URL does not have the box score, skip this game ID and continue to the next one.
    except IndexError as e:
        print(team_abbr, game_id, e)
        continue
mil 401070856 list index out of range
lac 401070856 list index out of range

It looks like game 401070856, between the Milwaukee Bucks and the LA Clippers, is the only game whose matchup page did not have the expected box score layout. It is printed twice because the game appears on both teams’ schedules.
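
The wait-and-retry step in the loop could also be factored into a helper. The sketch below is one possible shape, not the code used to build this dataset. The name `fetch_with_retry` and its parameters are assumptions, and the `opener` argument exists only so the logic can be exercised without network access.

```python
import time
from urllib import request
from urllib.error import HTTPError, URLError


def fetch_with_retry(url, retries=2, wait=120, opener=request.urlopen):
    """Fetch a URL, sleeping and retrying when an HTTP or URL error occurs."""
    for attempt in range(retries + 1):
        try:
            return opener(url).read()
        except (HTTPError, URLError) as e:
            if attempt == retries:
                raise  # Out of attempts; let the caller decide what to do.
            # Both error types expose a `reason` attribute.
            print(url, getattr(e, 'reason', e))
            time.sleep(wait)
```

In the main loop this would replace the try/except around `request.urlopen`, for example `box_url = fetch_with_retry('https://www.espn.com/nba/matchup?gameId=' + game_id)`. Unlike the original, it also catches `URLError`, which is imported at the top of the post but never used.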

Let’s get a list of the box score stats.

box_data[0]
array(['PTS', 'FG', 'Field Goal %', '3PT', 'Three Point %', 'FT',
       'Free Throw %', 'Rebounds', 'Assists', 'Steals', 'Blocks',
       'Total Turnovers', 'Fast Break Points', 'Points in Paint', 'Fouls',
       'Largest Lead', 'Offensive Rebounds', 'Defensive Rebounds',
       'Points Off Turnovers', 'Technical Fouls', 'Flagrant Fouls'],
      dtype='<U20')
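
To see why the labels end up in `box_data[0]`, here is the concatenate-then-transpose step on a few rows, using plain lists and `zip` instead of NumPy. The values come from the Spurs at Wizards game (gameID 401071855). Which rows carry the ‘highlight’ class and which carry the ‘indent’ class is an assumption made for illustration.

```python
# Each scraped row has the layout [label, away value, home value].
points = ['PTS', '129', '112']
highlights_data = [['FG', '51-91', '42-88'],
                   ['Field Goal %', '56.0', '47.7']]
indents_data = [['Offensive Rebounds', '12', '9']]

# Stack the rows, then transpose so each output row holds one "column".
rows = [points] + highlights_data + indents_data
labels, away, home = (list(col) for col in zip(*rows))

print(labels)  # ['PTS', 'FG', 'Field Goal %', 'Offensive Rebounds']
print(away)    # ['129', '51-91', '56.0', '12']
print(home)    # ['112', '42-88', '47.7', '9']
```

After the transpose, index 0 holds the stat names, index 1 the away team’s values, and index 2 the home team’s values. That is why the loop picks `box_data[1]` or `box_data[2]` depending on whether the team was away or home.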

The final data frame will contain columns for a team (‘teamPTS’, ‘teamDREB’, etc.) and for the opponent of that team (‘opptPTS’, ‘opptDREB’, etc.) for a given game. It will also include the gameID, the date the game was played, whether the game was home (‘vs’) or away ('@'), and the result of the game for the team (‘W’ or ‘L’).

col_names = ['teamABBR', 'teamName','gameID', 'gameDate', 'gameLoc', 'teamResult', 'teamPTS', 'teamFG', 'teamFG%',
             'team3PT', 'team3PT%', 'teamFT', 'teamFT%', 'teamTREB', 'teamASST', 'teamSTL', 'teamBLK',
             'teamTO', 'teamFB_PTS', 'teamPNT_PTS', 'teamFOUL', 'teamLG_LEAD', 'teamOREB', 'teamDREB',
             'teamTO_PTS', 'teamFOUL_T', 'teamFOUL_F',
             'opptABBR', 'opptName', 'opptPTS', 'opptFG', 'opptFG%',
             'oppt3PT', 'oppt3PT%', 'opptFT', 'opptFT%', 'opptTREB', 'opptASST', 'opptSTL', 'opptBLK',
             'opptTO', 'opptFB_PTS', 'opptPNT_PTS', 'opptFOUL', 'opptLG_LEAD', 'opptOREB', 'opptDREB',
             'opptTO_PTS', 'opptFOUL_T', 'opptFOUL_F']

# Create the data frame
df = pd.DataFrame(final_data, columns=col_names)
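
Since the team and opponent columns share the same stat suffixes, the list can also be generated rather than typed out. This sketch uses a different variable name (`col_names_generated`, an assumption) so it does not overwrite `col_names`.

```python
# The 21 stat suffixes, in the same order as the scraped box score labels.
stat_suffixes = ['PTS', 'FG', 'FG%', '3PT', '3PT%', 'FT', 'FT%', 'TREB', 'ASST', 'STL', 'BLK',
                 'TO', 'FB_PTS', 'PNT_PTS', 'FOUL', 'LG_LEAD', 'OREB', 'DREB',
                 'TO_PTS', 'FOUL_T', 'FOUL_F']

# Columns describing the game itself.
meta_cols = ['teamABBR', 'teamName', 'gameID', 'gameDate', 'gameLoc', 'teamResult']

col_names_generated = (meta_cols
                       + ['team' + s for s in stat_suffixes]
                       + ['opptABBR', 'opptName']
                       + ['oppt' + s for s in stat_suffixes])

print(len(col_names_generated))  # 50
```

Generating the names this way keeps the team and opponent columns in step if a stat is ever added or renamed.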

Let’s check the data frame for a random gameID.

df[df.gameID=='401071855']
      teamABBR  teamName            gameID     gameDate    gameLoc  teamResult  teamPTS  teamFG  teamFG%  team3PT
2045  wsh       Washington Wizards  401071855  Fri, Apr 5  vs       L           112      42-88   47.7     9-32
2455  sa        San Antonio Spurs   401071855  Fri, Apr 5  @        W           129      51-91   56.0     10-25

      opptTO  opptFB_PTS  opptPNT_PTS  opptFOUL  opptLG_LEAD  opptOREB  opptDREB  opptTO_PTS  opptFOUL_T  opptFOUL_F
2045  9       2           56           18        24           12        30        9           1           0
2455  10      0           54           16        4            9         25        8           0           0
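
One quick sanity check on the rows above is to confirm that the made-attempted strings agree with the scraped percentages. This is a minimal sketch; the helper name `shooting_pct` is hypothetical, and it assumes at least one attempt (zero attempts would raise a `ZeroDivisionError`).

```python
def shooting_pct(made_attempted):
    """Convert a 'made-attempted' string such as '42-88' to a percentage."""
    made, attempted = (int(x) for x in made_attempted.split('-'))
    return round(100 * made / attempted, 1)


print(shooting_pct('42-88'))  # 47.7, matching teamFG% for the Wizards
print(shooting_pct('51-91'))  # 56.0, matching teamFG% for the Spurs
```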

Looks like we’re successful! Now we just save our data to a CSV file.

df.to_csv('../data/nba_team_box_scores_'+str(year)+'.csv')