I’ll use Pandas to clean the 2018-2019 Regular Season NBA Team Box Score data scraped from ESPN.com. The web scraping code can be found here and the CSV file of the raw data can be found here.
TL/DR: To prepare the data for future data analysis and predictive modeling, I’ll be doing the following:
- Drop redundant/unnecessary columns.
- Create new columns for ‘shots made’ and ‘shots attempted’ for Field Goals, 3 Pointers, and Free Throws.
- Create boolean columns for game wins and game losses.
- Create boolean columns for home games and away games.
import pandas as pd
import numpy as np
Looking at the Data
Let’s inspect the data to get a general overview of what was scraped from the web.
def display_df(df):
""" Set column and row display options for a data frame. """
with pd.option_context("display.min_rows", 10, "display.max_rows", 20, "display.max_columns", 100):
display(df)
Read in the CSV file of the scraped NBA data and load it as a Pandas data frame.
year = '2019'
df = pd.read_csv('../data/nba_team_box_scores_'+str(year)+'.csv')
Let’s look at all of the columns of the first 5 and the last 5 rows of the data.
display_df(df)
Unnamed: 0 | teamABBR | teamName | gameID | gameDate | gameLoc | teamResult | teamPTS | teamFG | teamFG% | team3PT | team3PT% | teamFT | teamFT% | teamTREB | teamASST | teamSTL | teamBLK | teamTO | teamFB_PTS | teamPNT_PTS | teamFOUL | teamLG_LEAD | teamOREB | teamDREB | teamTO_PTS | teamFOUL_T | teamFOUL_F | opptABBR | opptName | opptPTS | opptFG | opptFG% | oppt3PT | oppt3PT% | opptFT | opptFT% | opptTREB | opptASST | opptSTL | opptBLK | opptTO | opptFB_PTS | opptPNT_PTS | opptFOUL | opptLG_LEAD | opptOREB | opptDREB | opptTO_PTS | opptFOUL_T | opptFOUL_F | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0 | bos | Boston Celtics | 401070213 | Tue, Oct 16 | vs | W | 105 | 42-97 | 43.3 | 11-37 | 29.7 | 10-14 | 71.4 | 64 | 21 | 7 | 5 | 15 | 6 | 34 | 20 | 18 | 12 | 43 | 11 | 0 | 0 | phi | Philadelphia 76ers | 87 | 34-87 | 39.1 | 5-26 | 19.2 | 14-23 | 60.9 | 57 | 18 | 8 | 5 | 16 | 16 | 50 | 20 | 4 | 6 | 41 | 14 | 0 | 0 |
1 | 1 | bos | Boston Celtics | 401070219 | Fri, Oct 19 | @ | L | 101 | 40-99 | 40.4 | 14-36 | 38.9 | 7-10 | 70.0 | 59 | 29 | 4 | 6 | 16 | 18 | 34 | 27 | 8 | 16 | 33 | 12 | 0 | 0 | tor | Toronto Raptors | 113 | 43-92 | 46.7 | 13-35 | 37.1 | 14-19 | 73.7 | 57 | 24 | 3 | 3 | 13 | 14 | 46 | 19 | 12 | 12 | 37 | 11 | 0 | 0 |
2 | 2 | bos | Boston Celtics | 401070711 | Sat, Oct 20 | @ | W | 103 | 33-82 | 40.2 | 9-25 | 36.0 | 28-33 | 84.8 | 55 | 21 | 8 | 2 | 15 | 9 | 36 | 26 | 16 | 11 | 36 | 17 | 1 | 0 | ny | New York Knicks | 101 | 36-87 | 41.4 | 12-35 | 34.3 | 17-27 | 63.0 | 60 | 23 | 8 | 1 | 16 | 11 | 40 | 25 | 1 | 11 | 35 | 15 | 0 | 0 |
3 | 3 | bos | Boston Celtics | 401070721 | Mon, Oct 22 | vs | L | 90 | 37-91 | 40.7 | 9-40 | 22.5 | 7-9 | 77.8 | 57 | 21 | 6 | 3 | 12 | 23 | 48 | 15 | 0 | 9 | 42 | 9 | 0 | 0 | orl | Orlando Magic | 93 | 38-93 | 40.9 | 10-35 | 28.6 | 7-10 | 70.0 | 57 | 28 | 6 | 2 | 9 | 6 | 42 | 15 | 13 | 8 | 41 | 9 | 0 | 0 |
4 | 4 | bos | Boston Celtics | 401070746 | Thu, Oct 25 | @ | W | 101 | 33-86 | 38.4 | 11-32 | 34.4 | 24-33 | 72.7 | 60 | 19 | 7 | 1 | 12 | 5 | 32 | 27 | 6 | 10 | 35 | 13 | 1 | 0 | okc | Oklahoma City Thunder | 95 | 37-94 | 39.4 | 7-28 | 25.0 | 14-25 | 56.0 | 70 | 19 | 5 | 6 | 16 | 10 | 42 | 27 | 16 | 16 | 41 | 16 | 1 | 0 |
… | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … |
2453 | 2453 | sa | San Antonio Spurs | 401071835 | Tue, Apr 2 | vs | W | 117 | 42-80 | 52.5 | 14-30 | 46.7 | 19-22 | 86.4 | 51 | 23 | 6 | 6 | 17 | 9 | 36 | 17 | 8 | 4 | 41 | 12 | 1 | 0 | atl | Atlanta Hawks | 111 | 43-94 | 45.7 | 12-34 | 35.3 | 13-20 | 65.0 | 48 | 26 | 13 | 2 | 11 | 21 | 48 | 18 | 9 | 11 | 32 | 17 | 0 | 0 |
2454 | 2454 | sa | San Antonio Spurs | 401071846 | Wed, Apr 3 | @ | L | 85 | 31-81 | 38.3 | 5-27 | 18.5 | 18-25 | 72.0 | 45 | 15 | 9 | 7 | 11 | 3 | 44 | 14 | 0 | 11 | 27 | 20 | 2 | 0 | den | Denver Nuggets | 113 | 47-87 | 54.0 | 12-33 | 36.4 | 7-12 | 58.3 | 57 | 41 | 6 | 3 | 15 | 10 | 58 | 20 | 30 | 8 | 39 | 11 | 0 | 0 |
2455 | 2455 | sa | San Antonio Spurs | 401071855 | Fri, Apr 5 | @ | W | 129 | 51-91 | 56.0 | 10-25 | 40.0 | 17-20 | 85.0 | 49 | 34 | 3 | 4 | 9 | 2 | 56 | 18 | 24 | 12 | 30 | 9 | 1 | 0 | wsh | Washington Wizards | 112 | 42-88 | 47.7 | 9-32 | 28.1 | 19-22 | 86.4 | 43 | 22 | 2 | 4 | 10 | 0 | 54 | 16 | 4 | 9 | 25 | 8 | 0 | 0 |
2456 | 2456 | sa | San Antonio Spurs | 401071869 | Sun, Apr 7 | @ | W | 112 | 43-82 | 52.4 | 9-22 | 40.9 | 17-22 | 77.3 | 54 | 27 | 10 | 5 | 18 | 19 | 34 | 18 | 27 | 11 | 38 | 22 | 1 | 0 | cle | Cleveland Cavaliers | 90 | 31-79 | 39.2 | 8-29 | 27.6 | 20-27 | 74.1 | 45 | 16 | 14 | 1 | 18 | 10 | 30 | 16 | 6 | 10 | 25 | 22 | 2 | 0 |
2457 | 2457 | sa | San Antonio Spurs | 401071900 | Wed, Apr 10 | vs | W | 105 | 41-88 | 46.6 | 8-24 | 33.3 | 15-18 | 83.3 | 60 | 22 | 6 | 2 | 11 | 9 | 46 | 14 | 22 | 8 | 45 | 13 | 0 | 0 | dal | Dallas Mavericks | 94 | 37-91 | 40.7 | 11-37 | 29.7 | 9-12 | 75.0 | 47 | 27 | 5 | 4 | 7 | 12 | 36 | 19 | 1 | 7 | 35 | 9 | 0 | 0 |
2458 rows × 51 columns
Let’s see if there are any missing values, the data type of each column, the shape of the data frame, and the size of the data.
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2458 entries, 0 to 2457
Data columns (total 51 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Unnamed: 0 2458 non-null int64
1 teamABBR 2458 non-null object
2 teamName 2458 non-null object
3 gameID 2458 non-null int64
4 gameDate 2458 non-null object
5 gameLoc 2458 non-null object
6 teamResult 2458 non-null object
7 teamPTS 2458 non-null int64
8 teamFG 2458 non-null object
9 teamFG% 2458 non-null float64
10 team3PT 2458 non-null object
11 team3PT% 2458 non-null float64
12 teamFT 2458 non-null object
13 teamFT% 2458 non-null float64
14 teamTREB 2458 non-null int64
15 teamASST 2458 non-null int64
16 teamSTL 2458 non-null int64
17 teamBLK 2458 non-null int64
18 teamTO 2458 non-null int64
19 teamFB_PTS 2458 non-null int64
20 teamPNT_PTS 2458 non-null int64
21 teamFOUL 2458 non-null int64
22 teamLG_LEAD 2458 non-null int64
23 teamOREB 2458 non-null int64
24 teamDREB 2458 non-null int64
25 teamTO_PTS 2458 non-null int64
26 teamFOUL_T 2458 non-null int64
27 teamFOUL_F 2458 non-null int64
28 opptABBR 2458 non-null object
29 opptName 2458 non-null object
30 opptPTS 2458 non-null int64
31 opptFG 2458 non-null object
32 opptFG% 2458 non-null float64
33 oppt3PT 2458 non-null object
34 oppt3PT% 2458 non-null float64
35 opptFT 2458 non-null object
36 opptFT% 2458 non-null float64
37 opptTREB 2458 non-null int64
38 opptASST 2458 non-null int64
39 opptSTL 2458 non-null int64
40 opptBLK 2458 non-null int64
41 opptTO 2458 non-null int64
42 opptFB_PTS 2458 non-null int64
43 opptPNT_PTS 2458 non-null int64
44 opptFOUL 2458 non-null int64
45 opptLG_LEAD 2458 non-null int64
46 opptOREB 2458 non-null int64
47 opptDREB 2458 non-null int64
48 opptTO_PTS 2458 non-null int64
49 opptFOUL_T 2458 non-null int64
50 opptFOUL_F 2458 non-null int64
dtypes: float64(6), int64(32), object(13)
memory usage: 979.5+ KB
Doesn’t look like there are any missing values, which is nice.
Now let’s get an idea of the distributions of the data columns.
display_df(df.describe())
Unnamed: 0 | gameID | teamPTS | teamFG% | team3PT% | teamFT% | teamTREB | teamASST | teamSTL | teamBLK | teamTO | teamFB_PTS | teamPNT_PTS | teamFOUL | teamLG_LEAD | teamOREB | teamDREB | teamTO_PTS | teamFOUL_T | teamFOUL_F | opptPTS | opptFG% | oppt3PT% | opptFT% | opptTREB | opptASST | opptSTL | opptBLK | opptTO | opptFB_PTS | opptPNT_PTS | opptFOUL | opptLG_LEAD | opptOREB | opptDREB | opptTO_PTS | opptFOUL_T | opptFOUL_F | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
count | 2458.000000 | 2.458000e+03 | 2458.000000 | 2458.000000 | 2458.000000 | 2458.000000 | 2458.000000 | 2458.000000 | 2458.000000 | 2458.000000 | 2458.000000 | 2458.000000 | 2458.000000 | 2458.000000 | 2458.000000 | 2458.000000 | 2458.000000 | 2458.000000 | 2458.000000 | 2458.000000 | 2458.000000 | 2458.000000 | 2458.000000 | 2458.000000 | 2458.000000 | 2458.000000 | 2458.000000 | 2458.000000 | 2458.000000 | 2458.000000 | 2458.000000 | 2458.000000 | 2458.000000 | 2458.000000 | 2458.000000 | 2458.000000 | 2458.000000 | 2458.000000 |
mean | 1228.500000 | 4.010713e+08 | 111.195688 | 46.143124 | 35.520342 | 76.712449 | 53.513832 | 24.583401 | 7.633035 | 4.951587 | 14.080146 | 13.800651 | 48.593979 | 20.903173 | 12.553702 | 10.345403 | 34.813670 | 16.433686 | 0.638324 | 0.043938 | 111.195688 | 46.143124 | 35.520342 | 76.712449 | 53.513832 | 24.583401 | 7.633035 | 4.951587 | 14.080146 | 13.800651 | 48.593979 | 20.903173 | 12.553702 | 10.345403 | 34.813670 | 16.433686 | 0.638324 | 0.043938 |
std | 709.707804 | 3.712529e+02 | 12.648481 | 5.333076 | 8.682060 | 10.237144 | 7.451704 | 5.119945 | 2.947996 | 2.485877 | 3.927467 | 7.001434 | 10.047117 | 4.293987 | 9.376523 | 3.759008 | 5.467135 | 6.216017 | 0.861955 | 0.208932 | 12.648481 | 5.333076 | 8.682060 | 10.237144 | 7.451704 | 5.119945 | 2.947996 | 2.485877 | 3.927467 | 7.001434 | 10.047117 | 4.293987 | 9.376523 | 3.759008 | 5.467135 | 6.216017 | 0.861955 | 0.208932 |
min | 0.000000 | 4.010702e+08 | 68.000000 | 27.800000 | 11.500000 | 26.300000 | 32.000000 | 10.000000 | 0.000000 | 0.000000 | 3.000000 | 0.000000 | 16.000000 | 9.000000 | 0.000000 | 1.000000 | 18.000000 | 2.000000 | 0.000000 | 0.000000 | 68.000000 | 27.800000 | 11.500000 | 26.300000 | 32.000000 | 10.000000 | 0.000000 | 0.000000 | 3.000000 | 0.000000 | 16.000000 | 9.000000 | 0.000000 | 1.000000 | 18.000000 | 2.000000 | 0.000000 | 0.000000 |
25% | 614.250000 | 4.010710e+08 | 103.000000 | 42.600000 | 29.600000 | 70.000000 | 48.000000 | 21.000000 | 6.000000 | 3.000000 | 11.000000 | 9.000000 | 42.000000 | 18.000000 | 5.000000 | 8.000000 | 31.000000 | 12.000000 | 0.000000 | 0.000000 | 103.000000 | 42.600000 | 29.600000 | 70.000000 | 48.000000 | 21.000000 | 6.000000 | 3.000000 | 11.000000 | 9.000000 | 42.000000 | 18.000000 | 5.000000 | 8.000000 | 31.000000 | 12.000000 | 0.000000 | 0.000000 |
50% | 1228.500000 | 4.010713e+08 | 111.000000 | 46.000000 | 35.300000 | 77.100000 | 53.000000 | 24.000000 | 7.000000 | 5.000000 | 14.000000 | 13.000000 | 48.000000 | 21.000000 | 11.000000 | 10.000000 | 35.000000 | 16.000000 | 0.000000 | 0.000000 | 111.000000 | 46.000000 | 35.300000 | 77.100000 | 53.000000 | 24.000000 | 7.000000 | 5.000000 | 14.000000 | 13.000000 | 48.000000 | 21.000000 | 11.000000 | 10.000000 | 35.000000 | 16.000000 | 0.000000 | 0.000000 |
75% | 1842.750000 | 4.010716e+08 | 119.000000 | 49.500000 | 40.900000 | 84.000000 | 58.000000 | 28.000000 | 9.000000 | 6.000000 | 17.000000 | 18.000000 | 56.000000 | 24.000000 | 18.000000 | 13.000000 | 38.000000 | 20.000000 | 1.000000 | 0.000000 | 119.000000 | 49.500000 | 40.900000 | 84.000000 | 58.000000 | 28.000000 | 9.000000 | 6.000000 | 17.000000 | 18.000000 | 56.000000 | 24.000000 | 18.000000 | 13.000000 | 38.000000 | 20.000000 | 1.000000 | 0.000000 |
max | 2457.000000 | 4.010719e+08 | 168.000000 | 64.900000 | 84.200000 | 100.000000 | 83.000000 | 42.000000 | 20.000000 | 19.000000 | 29.000000 | 53.000000 | 84.000000 | 38.000000 | 58.000000 | 26.000000 | 55.000000 | 44.000000 | 5.000000 | 2.000000 | 168.000000 | 64.900000 | 84.200000 | 100.000000 | 83.000000 | 42.000000 | 20.000000 | 19.000000 | 29.000000 | 53.000000 | 84.000000 | 38.000000 | 58.000000 | 26.000000 | 55.000000 | 44.000000 | 5.000000 | 2.000000 |
1. Drop “Unnamed: 0” Column
The “Unnamed: 0” column appears to only contain row numbers which is redundant with the data frame index. So we can drop that column.
df = df.drop(columns='Unnamed: 0')
df.head()
teamABBR | teamName | gameID | gameDate | gameLoc | teamResult | teamPTS | teamFG | teamFG% | team3PT | … | opptTO | opptFB_PTS | opptPNT_PTS | opptFOUL | opptLG_LEAD | opptOREB | opptDREB | opptTO_PTS | opptFOUL_T | opptFOUL_F | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | bos | Boston Celtics | 401070213 | Tue, Oct 16 | vs | W | 105 | 42-97 | 43.3 | 11-37 | … | 16 | 16 | 50 | 20 | 4 | 6 | 41 | 14 | 0 | 0 |
1 | bos | Boston Celtics | 401070219 | Fri, Oct 19 | @ | L | 101 | 40-99 | 40.4 | 14-36 | … | 13 | 14 | 46 | 19 | 12 | 12 | 37 | 11 | 0 | 0 |
2 | bos | Boston Celtics | 401070711 | Sat, Oct 20 | @ | W | 103 | 33-82 | 40.2 | 9-25 | … | 16 | 11 | 40 | 25 | 1 | 11 | 35 | 15 | 0 | 0 |
3 | bos | Boston Celtics | 401070721 | Mon, Oct 22 | vs | L | 90 | 37-91 | 40.7 | 9-40 | … | 9 | 6 | 42 | 15 | 13 | 8 | 41 | 9 | 0 | 0 |
4 | bos | Boston Celtics | 401070746 | Thu, Oct 25 | @ | W | 101 | 33-86 | 38.4 | 11-32 | … | 16 | 10 | 42 | 27 | 16 | 16 | 41 | 16 | 1 | 0 |
5 rows × 50 columns
2. Create new columns for Shots Made and Shots Attempted
The stats scraped from ESPN.com list shots made and shots attempted as a single string delienated by a hyphen. Field Goals (teamFG or opptFG), for example:
df.teamFG[0], df.opptFG[0]
('42-97', '34-87')
We’ll split on the hyphen and create new columns for shots made and attempted for Field Goals, 3 Pointers, and Free Throws. We’ll also convert the new columns to integers.
df[['teamFGM', 'teamFGA']] = df.teamFG.str.split('-', expand=True).astype(int)
Check to make sure the new columns were created.
df.T.tail(3).T
opptFOUL_F | teamFGM | teamFGA | |
---|---|---|---|
0 | 0 | 42 | 97 |
1 | 0 | 40 | 99 |
2 | 0 | 33 | 82 |
3 | 0 | 37 | 91 |
4 | 0 | 33 | 86 |
… | … | … | … |
2453 | 0 | 42 | 80 |
2454 | 0 | 31 | 81 |
2455 | 0 | 51 | 91 |
2456 | 0 | 43 | 82 |
2457 | 0 | 41 | 88 |
2458 rows × 3 columns
The new columns have been appended to the end of the data frame. Also, we’ll check that the data type was converted.
df.teamFGM.dtype
dtype('int64')
Looks good, so we’ll split the remaining columns.
df[['opptFGM', 'opptFGA']] = df.opptFG.str.split('-', expand=True).astype(int)
df[['team3PTM', 'team3PTA']] = df.team3PT.str.split('-', expand=True).astype(int)
df[['oppt3PTM', 'oppt3PTA']] = df.oppt3PT.str.split('-', expand=True).astype(int)
df[['teamFTM', 'teamFTA']] = df.teamFT.str.split('-', expand=True).astype(int)
df[['opptFTM', 'opptFTA']] = df.opptFT.str.split('-', expand=True).astype(int)
df.T.tail(12).T
teamFGM | teamFGA | opptFGM | opptFGA | team3PTM | team3PTA | oppt3PTM | oppt3PTA | teamFTM | teamFTA | opptFTM | opptFTA | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 42 | 97 | 34 | 87 | 11 | 37 | 5 | 26 | 10 | 14 | 14 | 23 |
1 | 40 | 99 | 43 | 92 | 14 | 36 | 13 | 35 | 7 | 10 | 14 | 19 |
2 | 33 | 82 | 36 | 87 | 9 | 25 | 12 | 35 | 28 | 33 | 17 | 27 |
3 | 37 | 91 | 38 | 93 | 9 | 40 | 10 | 35 | 7 | 9 | 7 | 10 |
4 | 33 | 86 | 37 | 94 | 11 | 32 | 7 | 28 | 24 | 33 | 14 | 25 |
… | … | … | … | … | … | … | … | … | … | … | … | … |
2453 | 42 | 80 | 43 | 94 | 14 | 30 | 12 | 34 | 19 | 22 | 13 | 20 |
2454 | 31 | 81 | 47 | 87 | 5 | 27 | 12 | 33 | 18 | 25 | 7 | 12 |
2455 | 51 | 91 | 42 | 88 | 10 | 25 | 9 | 32 | 17 | 20 | 19 | 22 |
2456 | 43 | 82 | 31 | 79 | 9 | 22 | 8 | 29 | 17 | 22 | 20 | 27 |
2457 | 41 | 88 | 37 | 91 | 8 | 24 | 11 | 37 | 15 | 18 | 9 | 12 |
2458 rows × 12 columns
3. Handling Categorical Variables with One-Hot Encoding
Categorical variables are variables that have two or more discrete values. They come in 2 flavors: nominal, where there is no particular order (e.g. ‘red’, ‘green’, ‘blue’), and ordinal, where there is order (e.g. ‘small’, ‘medium’, ‘large’).
Most machine learning models require numerical data as input. So, it’s necessary to convert categorical data to numerical data using one of two methods:
- Label Encoding assigns an integer value to each category (e.g. red = 1, green = 2, blue =3). One draw back to this type of encoding is that the machine learning model may assume there is some kind of rank amongst the variables (e.g. red > green > blue). But sometimes a ranked order does not exist.
- One-Hot Encoding can circumvent the implicit ranking by adding new features (e.g. is_red = 0, is_green = 1, is_blue = 0). If a data point is green, then a boolean value of 1 (True) is listed as an is_green feature, while is_red and is_blue are listed as 0 (False). It’s important to note that if the cardinality were high (i.e. if there were a lot of different values, think lots of postal codes), one-hot encoding could lead to high memory consumption and poor modeling efficiency.
For this data we’ll convert the categorical variables (game location and team result) to numerical data using one-hot encoding.
df = pd.get_dummies(df, columns=['gameLoc', 'teamResult'])
df.T.tail(5).T
opptFTA | gameLoc_@ | gameLoc_vs | teamResult_L | teamResult_W | |
---|---|---|---|---|---|
0 | 23 | 0 | 1 | 0 | 1 |
1 | 19 | 1 | 0 | 1 | 0 |
2 | 27 | 1 | 0 | 0 | 1 |
3 | 10 | 0 | 1 | 1 | 0 |
4 | 25 | 1 | 0 | 0 | 1 |
Now we’ll save our cleaned data frame.
df.to_csv('../data/nba_team_box_scores_'+str(year)+'_cleaned.csv')