Introduction
This data science project analyzes Super Bowl TV ratings data. The purpose is to gain insight into the following questions:
1. What are the most extreme game outcomes?
2. How does the game affect television viewership?
3. How have viewership, TV ratings, and ad cost evolved over time?
We will briefly discuss the data, tools & methodologies before diving into analyzing the Super Bowl TV ratings data.
Data
The data used in this project was scraped and curated from Wikipedia. The data set is comprised of 3 CSV files:
1. Super Bowl game data – superbowls.csv
2. TV data – tv.csv
3. Halftime musician data – halftime_musicians.csv
In assessing the data set, there were some issues with missing data, however, we were able to focus the analysis around areas where the data was whole. Summaries of the data, notes on the data set issues, and code are available in Appendix for those that want to take a deeper dive into the technical details.
Tools & Methodologies
Python was used for this project including the following packages:
– Pandas to analyze the data
– Matplotlib & Seaborn to visualize the data
Analysis & Insights
1. What are the most extreme game outcomes?
Let’s start by looking at combined points for each Super Bowl by visualizing the distribution. Let’s also pinpoint the Super Bowls with the highest and lowest scores.
Distribution of Combined Scores
Most combined scores are around 40-50 points, with the extremes being roughly equal distance away in opposite directions.
Highest Combined Points
Date | Super Bowl | Combined Points | Winning Team | Winning Pts | Losing Team | Losing Pts | |
1995-01-29 | 29 | 75 | San Francisco 49ers | 49 | San Diego Chargers | 26 | |
2018-02-04 | 52 | 74 | Philadelphia Eagles | 41 | New England Patriots | 33 |
The two highest highest combined scores are 75 and 74 from Super Bowls 29 and 52 respectively. Each of these two games featured dominant quarterback performances. One of the highest scoring games happened recently in 2018 when the underdog Philadelphia Eagles, led by Nick Foles, beat Tom Brady’s New England Patriots 41-33.
Lowest Combined Points
Date | Super Bowl | Combined Points | Winning Team | Winning Pts | Losing Team | Losing Pts |
1973-01-14 | 7 | 21 | Miami Dolphins | 14 | Washington Redskins | 7 |
1975-01-12 | 9 | 22 | Pittsburgh Steelers | 16 | Minnesota Vikings | 6 |
1969-01-12 | 3 | 23 | New York Jets | 16 | Baltimore Colts | 7 |
The lowest combined scores occurred in Super Bowl 3 and 7, which featured tough defenses that dominated. In 1975, Super Bowl 9’s 16-6 low score can be attributed to inclement weather. The field was slick from overnight rain, and it was cold at 46 °F (8 °C), making it hard for the Pittsburgh Steelers and Minnesota Vikings to do much offensively.
UPDATE: In Super Bowl LIII in 2019, the Patriots and Rams broke the record for the lowest-scoring Super Bowl with a combined score of 16 points (13-3 for the Patriots).
Let’s examine point difference next:
Point Difference
Lowest Margin of Victories
Date | Super Bowl | Margin of Victory | Winning Team | Winning Pts | Losing Team | Losing Pts |
1991-01027 | 25 | 1 | New York Giants | 20 | Buffalo Bills | 19 |
The vast majority of Super Bowls are close games which makes sense since both teams are likely strong since they have made it to the championship game. The closest game ever was Super Bowl 25 in 1991 when the New York Giants defeated the Buffalo Bills by one point.
Highest Margin of Victories
Date | Super Bowl | Margin of Victory | Winning Team | Winning Pts | Losing Team | Losing Pts |
1990-01-28 | 24 | 45 | San Francisco 49ers | 55 | Denver Broncos | 10 |
1986-01-26 | 20 | 36 | Chicago Bears | 46 | New England Patriots | 10 |
1993-01-31 | 27 | 35 | Dallas Cowboys | 52 | Buffalo Bills | 17 |
2014-02-02 | 48 | 35 | Seattle Seahawks | 43 | Denver Broncos | 8 |
The biggest margin of victory was in Super Bowl 24 in 1991 when Hall of Famer Joe Montana led the the San Francisco 49ers to a 55-10 rout over the Denver Broncos.
2. How does the game affect television viewership?
Do large point differences translate to lost viewers? Let’s combine the game data and TV data to find out. We can plot household share (average percentage of U.S. households with a TV in use that were watching for the entire broadcast) vs. point difference to find out.
Point Difference vs TV Ratings
The downward sloping regression line and the 95% confidence interval for that regression suggest that bailing on a game if it is a blowout is common. However, we must take this with a grain of salt because the linear relationship in the data is weak due to our small sample size of 52 games.
3. How have viewership, TV ratings, and ad cost evolved over time?
Regardless of the score though, it’s likely that most people stick it out for the halftime show which is good for the TV networks and advertisers. A 30-second ad spot costs a hefty $5 million now, but has it always been that way? And how have the number of viewers and household ratings trended alongside ad cost?
We can see viewers increased before ad costs did. Maybe the networks weren’t very data savvy and were slow to react? Another hypothesis: maybe Super Bowl halftime shows weren’t that good in the earlier years? It turns out Michael Jackson’s Super Bowl 27 performance, one of the most watched events in American TV history, was when the NFL realized the value of Super Bowl airtime and decided they needed to sign big name musicians from then on out.
Conclusion
In this project, we loaded, cleaned, then explored Super Bowl game, television, and halftime show data. We visualized the distributions of combined points, point differences, and halftime show performances using histograms. We discovered that blowouts do appear to lead to a drop in viewers but also caution that the linear relationship in the data is weak due to our small sample size of 52 games. We used line plots to see how ad cost increases lagged behind viewership increases. Ad costs have risen significantly since Michael Jackson’s highly watched Super Bowl 27 performance when the NFL seems to have realized that big name musicians can increase viewership along with the value of advertising airtime.
————————————————————————————————————————————————————–
Appendices
A. Data
# Summary of the Super Bowl game data to inspect super_bowls.info() print('\n') # Summary of the TV data to inspect tv.info() print('\n') # Summary of the halftime musician data to inspect halftime_musicians.info()
RangeIndex: 52 entries, 0 to 51 Data columns (total 18 columns): Column Non-Null Count Dtype 0 date 52 non-null object 1 super_bowl 52 non-null int64 2 venue 52 non-null object 3 city 52 non-null object 4 state 52 non-null object 5 attendance 52 non-null int64 6 team_winner 52 non-null object 7 winning_pts 52 non-null int64 8 qb_winner_1 52 non-null object 9 qb_winner_2 2 non-null object 10 coach_winner 52 non-null object 11 team_loser 52 non-null object 12 losing_pts 52 non-null int64 13 qb_loser_1 52 non-null object 14 qb_loser_2 3 non-null object 15 coach_loser 52 non-null object 16 combined_pts 52 non-null int64 17 difference_pts 52 non-null int64 dtypes: int64(6), object(12) memory usage: 7.4+ KB RangeIndex: 53 entries, 0 to 52 Data columns (total 9 columns): Column Non-Null Count Dtype 0 super_bowl 53 non-null int64 1 network 53 non-null object 2 avg_us_viewers 53 non-null int64 3 total_us_viewers 15 non-null float64 4 rating_household 53 non-null float64 5 share_household 53 non-null int64 6 rating_18_49 15 non-null float64 7 share_18_49 6 non-null float64 8 ad_cost 53 non-null int64 dtypes: float64(4), int64(4), object(1) memory usage: 3.9+ KB RangeIndex: 134 entries, 0 to 133 Data columns (total 3 columns): Column Non-Null Count Dtype 0 super_bowl 134 non-null int64 1 musician 134 non-null object 2 num_songs 88 non-null float64 dtypes: float64(1), int64(1), object(1) memory usage: 3.3+ KB
B. Notes on Dataset Issues
For the Super Bowl game data, we can see the dataset appears whole except for missing values in the backup quarterback columns (qb_winner_2 and qb_loser_2), which make sense given most starting QBs in the Super Bowl (qb_winner_1 and qb_loser_1) play the entire game.
From the visual inspection of TV and halftime musicians data, there is only one missing value displayed, but there are likely more. The Super Bowl goes all the way back to 1967, and the more granular columns (e.g. the number of songs for halftime musicians) probably weren’t tracked reliably over time. Wikipedia is great resource but not perfect.
An inspection of the .info() output for tv and halftime_musicians shows us that there are multiple columns with null values.
For the TV data, the following columns have a significant number of missing values:
1. total_us_viewers
(amount of U.S. viewers who watched at least some part of the broadcast)
2. rating_18_49
(average % of U.S. adults 18-49 who live in a household with a TV that were watching for the entire broadcast)
3. share_18_49
(average % of U.S. adults 18-49 who live in a household with a TV in use that were watching for the entire broadcast)
For the halftime musician data, there are missing numbers of songs performed (num_songs
) for about a third of the performances. There are a lot of potential reasons for these missing values. Was the data ever tracked? Was it lost in history? We have taken note of where the dataset isn’t perfect then focused the analysis on portions of the data set where it is whole.
C. Code
# Import pandas import pandas as pd # Load the CSV data into DataFrames super_bowls = pd.read_csv('datasets/super_bowls.csv') tv = pd.read_csv('datasets/tv.csv') halftime_musicians = pd.read_csv('datasets/halftime_musicians.csv') # Summary of the Super Bowl game data to inspect super_bowls.info() print('\n') # Summary of the TV data to inspect tv.info() print('\n') # Summary of the halftime musician data to inspect halftime_musicians.info() # Import matplotlib and set plotting style from matplotlib import pyplot as plt %matplotlib inline plt.style.use('seaborn') # Plot a histogram of combined points plt.hist(super_bowls.combined_pts) plt.xlabel('Combined Points') plt.ylabel('Number of Super Bowls') plt.show() # Display the Super Bowls with the highest and lowest combined scores display(super_bowls[super_bowls['combined_pts'] > 70]) display(super_bowls[super_bowls['combined_pts'] < 25]) # Plot a histogram of point differences plt.hist(super_bowls.difference_pts) plt.xlabel('Point Difference') plt.ylabel('Number of Super Bowls') plt.show() # Display the closest game(s) and biggest blowouts display(super_bowls[super_bowls['difference_pts'] == 1]) display(super_bowls[super_bowls['difference_pts'] >= 35]) # Join game and TV data, filtering out SB I because it was split over two networks games_tv = pd.merge(tv[tv['super_bowl'] > 1], super_bowls, on='super_bowl') # Import seaborn import seaborn as sns # Create a scatter plot with a linear regression model fit sns.regplot(x='difference_pts', y='share_household', data=games_tv) plt.xlabel("Point Difference") plt.ylabel("TV Ratings (household share)") # Create a figure with 3x1 subplot and activate the top subplot plt.subplot(3, 1, 1) plt.plot(tv.super_bowl, tv.avg_us_viewers, color='#648FFF') plt.title('Average Number of US Viewers') # Activate the middle subplot plt.subplot(3, 1, 2) plt.plot(tv.super_bowl, tv.rating_household, color='#DC267F') plt.title('Household Rating') # Activate the bottom subplot plt.subplot(3, 1, 3) plt.plot(tv.super_bowl, tv.ad_cost, color='#FFB000') plt.title('Ad Cost') plt.xlabel('SUPER BOWL') # Improve the spacing between subplots plt.tight_layout()