NBA Box Scores: Data Cleaning with Pandas (Part 2 of?)

  • Wednesday, Oct 28, 2020
blog-image

I’ll use Pandas to clean the 2018-2019 Regular Season NBA Team Box Score data scraped from ESPN.com. The web scraping code can be found here and the CSV file of the raw data can be found here.

TL/DR: To prepare the data for future data analysis and predictive modeling, I’ll be doing the following:

  1. Drop redundant/unnecessary columns.
  2. Create new columns for ‘shots made’ and ‘shots attempted’ for Field Goals, 3 Pointers, and Free Throws.
  3. Create boolean columns for game wins and game losses.
  4. Create boolean columns for home games and away games.
import pandas as pd
import numpy as np

Looking at the Data

Let’s inspect the data to get a general overview of what was scraped from the web.

def display_df(df):
    """ Set column and row display options for a data frame. """
    with pd.option_context("display.min_rows", 10, "display.max_rows", 20, "display.max_columns", 100):
        display(df)

Read in the CSV file of the scraped NBA data and load it as a Pandas data frame.

year = '2019'
df = pd.read_csv('../data/nba_team_box_scores_'+str(year)+'.csv')

Let’s look at all of the columns of the first 5 and the last 5 rows of the data.

display_df(df)
Unnamed: 0 teamABBR teamName gameID gameDate gameLoc teamResult teamPTS teamFG teamFG% team3PT team3PT% teamFT teamFT% teamTREB teamASST teamSTL teamBLK teamTO teamFB_PTS teamPNT_PTS teamFOUL teamLG_LEAD teamOREB teamDREB teamTO_PTS teamFOUL_T teamFOUL_F opptABBR opptName opptPTS opptFG opptFG% oppt3PT oppt3PT% opptFT opptFT% opptTREB opptASST opptSTL opptBLK opptTO opptFB_PTS opptPNT_PTS opptFOUL opptLG_LEAD opptOREB opptDREB opptTO_PTS opptFOUL_T opptFOUL_F
0 0 bos Boston Celtics 401070213 Tue, Oct 16 vs W 105 42-97 43.3 11-37 29.7 10-14 71.4 64 21 7 5 15 6 34 20 18 12 43 11 0 0 phi Philadelphia 76ers 87 34-87 39.1 5-26 19.2 14-23 60.9 57 18 8 5 16 16 50 20 4 6 41 14 0 0
1 1 bos Boston Celtics 401070219 Fri, Oct 19 @ L 101 40-99 40.4 14-36 38.9 7-10 70.0 59 29 4 6 16 18 34 27 8 16 33 12 0 0 tor Toronto Raptors 113 43-92 46.7 13-35 37.1 14-19 73.7 57 24 3 3 13 14 46 19 12 12 37 11 0 0
2 2 bos Boston Celtics 401070711 Sat, Oct 20 @ W 103 33-82 40.2 9-25 36.0 28-33 84.8 55 21 8 2 15 9 36 26 16 11 36 17 1 0 ny New York Knicks 101 36-87 41.4 12-35 34.3 17-27 63.0 60 23 8 1 16 11 40 25 1 11 35 15 0 0
3 3 bos Boston Celtics 401070721 Mon, Oct 22 vs L 90 37-91 40.7 9-40 22.5 7-9 77.8 57 21 6 3 12 23 48 15 0 9 42 9 0 0 orl Orlando Magic 93 38-93 40.9 10-35 28.6 7-10 70.0 57 28 6 2 9 6 42 15 13 8 41 9 0 0
4 4 bos Boston Celtics 401070746 Thu, Oct 25 @ W 101 33-86 38.4 11-32 34.4 24-33 72.7 60 19 7 1 12 5 32 27 6 10 35 13 1 0 okc Oklahoma City Thunder 95 37-94 39.4 7-28 25.0 14-25 56.0 70 19 5 6 16 10 42 27 16 16 41 16 1 0
2453 2453 sa San Antonio Spurs 401071835 Tue, Apr 2 vs W 117 42-80 52.5 14-30 46.7 19-22 86.4 51 23 6 6 17 9 36 17 8 4 41 12 1 0 atl Atlanta Hawks 111 43-94 45.7 12-34 35.3 13-20 65.0 48 26 13 2 11 21 48 18 9 11 32 17 0 0
2454 2454 sa San Antonio Spurs 401071846 Wed, Apr 3 @ L 85 31-81 38.3 5-27 18.5 18-25 72.0 45 15 9 7 11 3 44 14 0 11 27 20 2 0 den Denver Nuggets 113 47-87 54.0 12-33 36.4 7-12 58.3 57 41 6 3 15 10 58 20 30 8 39 11 0 0
2455 2455 sa San Antonio Spurs 401071855 Fri, Apr 5 @ W 129 51-91 56.0 10-25 40.0 17-20 85.0 49 34 3 4 9 2 56 18 24 12 30 9 1 0 wsh Washington Wizards 112 42-88 47.7 9-32 28.1 19-22 86.4 43 22 2 4 10 0 54 16 4 9 25 8 0 0
2456 2456 sa San Antonio Spurs 401071869 Sun, Apr 7 @ W 112 43-82 52.4 9-22 40.9 17-22 77.3 54 27 10 5 18 19 34 18 27 11 38 22 1 0 cle Cleveland Cavaliers 90 31-79 39.2 8-29 27.6 20-27 74.1 45 16 14 1 18 10 30 16 6 10 25 22 2 0
2457 2457 sa San Antonio Spurs 401071900 Wed, Apr 10 vs W 105 41-88 46.6 8-24 33.3 15-18 83.3 60 22 6 2 11 9 46 14 22 8 45 13 0 0 dal Dallas Mavericks 94 37-91 40.7 11-37 29.7 9-12 75.0 47 27 5 4 7 12 36 19 1 7 35 9 0 0

2458 rows × 51 columns

Let’s see if there are any missing values, the data type of each column, the shape of the data frame, and the size of the data.

df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2458 entries, 0 to 2457
Data columns (total 51 columns):
 #   Column       Non-Null Count  Dtype
---  ------       --------------  -----
 0   Unnamed: 0   2458 non-null   int64
 1   teamABBR     2458 non-null   object
 2   teamName     2458 non-null   object
 3   gameID       2458 non-null   int64
 4   gameDate     2458 non-null   object
 5   gameLoc      2458 non-null   object
 6   teamResult   2458 non-null   object
 7   teamPTS      2458 non-null   int64
 8   teamFG       2458 non-null   object
 9   teamFG%      2458 non-null   float64
 10  team3PT      2458 non-null   object
 11  team3PT%     2458 non-null   float64
 12  teamFT       2458 non-null   object
 13  teamFT%      2458 non-null   float64
 14  teamTREB     2458 non-null   int64
 15  teamASST     2458 non-null   int64
 16  teamSTL      2458 non-null   int64
 17  teamBLK      2458 non-null   int64
 18  teamTO       2458 non-null   int64
 19  teamFB_PTS   2458 non-null   int64
 20  teamPNT_PTS  2458 non-null   int64
 21  teamFOUL     2458 non-null   int64
 22  teamLG_LEAD  2458 non-null   int64
 23  teamOREB     2458 non-null   int64
 24  teamDREB     2458 non-null   int64
 25  teamTO_PTS   2458 non-null   int64
 26  teamFOUL_T   2458 non-null   int64
 27  teamFOUL_F   2458 non-null   int64
 28  opptABBR     2458 non-null   object
 29  opptName     2458 non-null   object
 30  opptPTS      2458 non-null   int64
 31  opptFG       2458 non-null   object
 32  opptFG%      2458 non-null   float64
 33  oppt3PT      2458 non-null   object
 34  oppt3PT%     2458 non-null   float64
 35  opptFT       2458 non-null   object
 36  opptFT%      2458 non-null   float64
 37  opptTREB     2458 non-null   int64
 38  opptASST     2458 non-null   int64
 39  opptSTL      2458 non-null   int64
 40  opptBLK      2458 non-null   int64
 41  opptTO       2458 non-null   int64
 42  opptFB_PTS   2458 non-null   int64
 43  opptPNT_PTS  2458 non-null   int64
 44  opptFOUL     2458 non-null   int64
 45  opptLG_LEAD  2458 non-null   int64
 46  opptOREB     2458 non-null   int64
 47  opptDREB     2458 non-null   int64
 48  opptTO_PTS   2458 non-null   int64
 49  opptFOUL_T   2458 non-null   int64
 50  opptFOUL_F   2458 non-null   int64
dtypes: float64(6), int64(32), object(13)
memory usage: 979.5+ KB

Doesn’t look like there are any missing values, which is nice.

Now let’s get an idea of the distributions of the data columns.

display_df(df.describe())
Unnamed: 0 gameID teamPTS teamFG% team3PT% teamFT% teamTREB teamASST teamSTL teamBLK teamTO teamFB_PTS teamPNT_PTS teamFOUL teamLG_LEAD teamOREB teamDREB teamTO_PTS teamFOUL_T teamFOUL_F opptPTS opptFG% oppt3PT% opptFT% opptTREB opptASST opptSTL opptBLK opptTO opptFB_PTS opptPNT_PTS opptFOUL opptLG_LEAD opptOREB opptDREB opptTO_PTS opptFOUL_T opptFOUL_F
count 2458.000000 2.458000e+03 2458.000000 2458.000000 2458.000000 2458.000000 2458.000000 2458.000000 2458.000000 2458.000000 2458.000000 2458.000000 2458.000000 2458.000000 2458.000000 2458.000000 2458.000000 2458.000000 2458.000000 2458.000000 2458.000000 2458.000000 2458.000000 2458.000000 2458.000000 2458.000000 2458.000000 2458.000000 2458.000000 2458.000000 2458.000000 2458.000000 2458.000000 2458.000000 2458.000000 2458.000000 2458.000000 2458.000000
mean 1228.500000 4.010713e+08 111.195688 46.143124 35.520342 76.712449 53.513832 24.583401 7.633035 4.951587 14.080146 13.800651 48.593979 20.903173 12.553702 10.345403 34.813670 16.433686 0.638324 0.043938 111.195688 46.143124 35.520342 76.712449 53.513832 24.583401 7.633035 4.951587 14.080146 13.800651 48.593979 20.903173 12.553702 10.345403 34.813670 16.433686 0.638324 0.043938
std 709.707804 3.712529e+02 12.648481 5.333076 8.682060 10.237144 7.451704 5.119945 2.947996 2.485877 3.927467 7.001434 10.047117 4.293987 9.376523 3.759008 5.467135 6.216017 0.861955 0.208932 12.648481 5.333076 8.682060 10.237144 7.451704 5.119945 2.947996 2.485877 3.927467 7.001434 10.047117 4.293987 9.376523 3.759008 5.467135 6.216017 0.861955 0.208932
min 0.000000 4.010702e+08 68.000000 27.800000 11.500000 26.300000 32.000000 10.000000 0.000000 0.000000 3.000000 0.000000 16.000000 9.000000 0.000000 1.000000 18.000000 2.000000 0.000000 0.000000 68.000000 27.800000 11.500000 26.300000 32.000000 10.000000 0.000000 0.000000 3.000000 0.000000 16.000000 9.000000 0.000000 1.000000 18.000000 2.000000 0.000000 0.000000
25% 614.250000 4.010710e+08 103.000000 42.600000 29.600000 70.000000 48.000000 21.000000 6.000000 3.000000 11.000000 9.000000 42.000000 18.000000 5.000000 8.000000 31.000000 12.000000 0.000000 0.000000 103.000000 42.600000 29.600000 70.000000 48.000000 21.000000 6.000000 3.000000 11.000000 9.000000 42.000000 18.000000 5.000000 8.000000 31.000000 12.000000 0.000000 0.000000
50% 1228.500000 4.010713e+08 111.000000 46.000000 35.300000 77.100000 53.000000 24.000000 7.000000 5.000000 14.000000 13.000000 48.000000 21.000000 11.000000 10.000000 35.000000 16.000000 0.000000 0.000000 111.000000 46.000000 35.300000 77.100000 53.000000 24.000000 7.000000 5.000000 14.000000 13.000000 48.000000 21.000000 11.000000 10.000000 35.000000 16.000000 0.000000 0.000000
75% 1842.750000 4.010716e+08 119.000000 49.500000 40.900000 84.000000 58.000000 28.000000 9.000000 6.000000 17.000000 18.000000 56.000000 24.000000 18.000000 13.000000 38.000000 20.000000 1.000000 0.000000 119.000000 49.500000 40.900000 84.000000 58.000000 28.000000 9.000000 6.000000 17.000000 18.000000 56.000000 24.000000 18.000000 13.000000 38.000000 20.000000 1.000000 0.000000
max 2457.000000 4.010719e+08 168.000000 64.900000 84.200000 100.000000 83.000000 42.000000 20.000000 19.000000 29.000000 53.000000 84.000000 38.000000 58.000000 26.000000 55.000000 44.000000 5.000000 2.000000 168.000000 64.900000 84.200000 100.000000 83.000000 42.000000 20.000000 19.000000 29.000000 53.000000 84.000000 38.000000 58.000000 26.000000 55.000000 44.000000 5.000000 2.000000

1. Drop “Unnamed: 0” Column

The “Unnamed: 0” column appears to only contain row numbers which is redundant with the data frame index. So we can drop that column.

df = df.drop(columns='Unnamed: 0')
df.head()
teamABBR teamName gameID gameDate gameLoc teamResult teamPTS teamFG teamFG% team3PT opptTO opptFB_PTS opptPNT_PTS opptFOUL opptLG_LEAD opptOREB opptDREB opptTO_PTS opptFOUL_T opptFOUL_F
0 bos Boston Celtics 401070213 Tue, Oct 16 vs W 105 42-97 43.3 11-37 16 16 50 20 4 6 41 14 0 0
1 bos Boston Celtics 401070219 Fri, Oct 19 @ L 101 40-99 40.4 14-36 13 14 46 19 12 12 37 11 0 0
2 bos Boston Celtics 401070711 Sat, Oct 20 @ W 103 33-82 40.2 9-25 16 11 40 25 1 11 35 15 0 0
3 bos Boston Celtics 401070721 Mon, Oct 22 vs L 90 37-91 40.7 9-40 9 6 42 15 13 8 41 9 0 0
4 bos Boston Celtics 401070746 Thu, Oct 25 @ W 101 33-86 38.4 11-32 16 10 42 27 16 16 41 16 1 0

5 rows × 50 columns

2. Create new columns for Shots Made and Shots Attempted

The stats scraped from ESPN.com list shots made and shots attempted as a single string delienated by a hyphen. Field Goals (teamFG or opptFG), for example:

df.teamFG[0], df.opptFG[0]
('42-97', '34-87')

We’ll split on the hyphen and create new columns for shots made and attempted for Field Goals, 3 Pointers, and Free Throws. We’ll also convert the new columns to integers.

df[['teamFGM', 'teamFGA']] = df.teamFG.str.split('-', expand=True).astype(int)

Check to make sure the new columns were created.

df.T.tail(3).T
opptFOUL_F teamFGM teamFGA
0 0 42 97
1 0 40 99
2 0 33 82
3 0 37 91
4 0 33 86
2453 0 42 80
2454 0 31 81
2455 0 51 91
2456 0 43 82
2457 0 41 88

2458 rows × 3 columns

The new columns have been appended to the end of the data frame. Also, we’ll check that the data type was converted.

df.teamFGM.dtype
dtype('int64')

Looks good, so we’ll split the remaining columns.

df[['opptFGM', 'opptFGA']] = df.opptFG.str.split('-', expand=True).astype(int)

df[['team3PTM', 'team3PTA']] = df.team3PT.str.split('-', expand=True).astype(int)
df[['oppt3PTM', 'oppt3PTA']] = df.oppt3PT.str.split('-', expand=True).astype(int)

df[['teamFTM', 'teamFTA']] = df.teamFT.str.split('-', expand=True).astype(int)
df[['opptFTM', 'opptFTA']] = df.opptFT.str.split('-', expand=True).astype(int)
df.T.tail(12).T
teamFGM teamFGA opptFGM opptFGA team3PTM team3PTA oppt3PTM oppt3PTA teamFTM teamFTA opptFTM opptFTA
0 42 97 34 87 11 37 5 26 10 14 14 23
1 40 99 43 92 14 36 13 35 7 10 14 19
2 33 82 36 87 9 25 12 35 28 33 17 27
3 37 91 38 93 9 40 10 35 7 9 7 10
4 33 86 37 94 11 32 7 28 24 33 14 25
2453 42 80 43 94 14 30 12 34 19 22 13 20
2454 31 81 47 87 5 27 12 33 18 25 7 12
2455 51 91 42 88 10 25 9 32 17 20 19 22
2456 43 82 31 79 9 22 8 29 17 22 20 27
2457 41 88 37 91 8 24 11 37 15 18 9 12

2458 rows × 12 columns

3. Handling Categorical Variables with One-Hot Encoding

Categorical variables are variables that have two or more discrete values. They come in 2 flavors: nominal, where there is no particular order (e.g. ‘red’, ‘green’, ‘blue’), and ordinal, where there is order (e.g. ‘small’, ‘medium’, ‘large’).

Most machine learning models require numerical data as input. So, it’s necessary to convert categorical data to numerical data using one of two methods:

  1. Label Encoding assigns an integer value to each category (e.g. red = 1, green = 2, blue =3). One draw back to this type of encoding is that the machine learning model may assume there is some kind of rank amongst the variables (e.g. red > green > blue). But sometimes a ranked order does not exist.
  2. One-Hot Encoding can circumvent the implicit ranking by adding new features (e.g. is_red = 0, is_green = 1, is_blue = 0). If a data point is green, then a boolean value of 1 (True) is listed as an is_green feature, while is_red and is_blue are listed as 0 (False). It’s important to note that if the cardinality were high (i.e. if there were a lot of different values, think lots of postal codes), one-hot encoding could lead to high memory consumption and poor modeling efficiency.

For this data we’ll convert the categorical variables (game location and team result) to numerical data using one-hot encoding.

df = pd.get_dummies(df, columns=['gameLoc', 'teamResult'])
df.T.tail(5).T
opptFTA gameLoc_@ gameLoc_vs teamResult_L teamResult_W
0 23 0 1 0 1
1 19 1 0 1 0
2 27 1 0 0 1
3 10 0 1 1 0
4 25 1 0 0 1

Now we’ll save our cleaned data frame.

df.to_csv('../data/nba_team_box_scores_'+str(year)+'_cleaned.csv')