TV Ratings & Ad Costs Data Analytics using Python

Introduction

This data science project analyzes Super Bowl TV ratings data. The purpose is to gain insight into the following questions:

1. What are the most extreme game outcomes?
2. How does the game affect television viewership?
3. How have viewership, TV ratings, and ad cost evolved over time?

We will briefly discuss the data, tools & methodologies before diving into analyzing the Super Bowl TV ratings data.

Data

The data used in this project was scraped and curated from Wikipedia. The data set comprises three CSV files:

1. Super Bowl game data – super_bowls.csv
2. TV data – tv.csv
3. Halftime musician data – halftime_musicians.csv

In assessing the data set, we found some issues with missing data; however, we were able to focus the analysis on areas where the data was whole. Summaries of the data, notes on the data set issues, and code are available in the Appendices for those who want to take a deeper dive into the technical details.

Tools & Methodologies

Python was used for this project, including the following packages:
– Pandas to analyze the data
– Matplotlib & Seaborn to visualize the data

Analysis & Insights

1. What are the most extreme game outcomes?

Let’s start by looking at combined points for each Super Bowl by visualizing the distribution. Let’s also pinpoint the Super Bowls with the highest and lowest scores.

Distribution of Combined Scores

[Figure: histogram of combined points per Super Bowl (combined-points.png)]

Most combined scores fall around 40-50 points, with the extremes lying roughly the same distance away on either side.

Highest Combined Points

Date	Super Bowl	Combined Points	Winning Team	Winning Pts	Losing Team	Losing Pts
1995-01-29	29	75	San Francisco 49ers	49	San Diego Chargers	26
2018-02-04	52	74	Philadelphia Eagles	41	New England Patriots	33

The two highest combined scores are 75 and 74, from Super Bowls 29 and 52 respectively. Each of these two games featured dominant quarterback performances. One of the highest-scoring games happened recently, in 2018, when the underdog Philadelphia Eagles, led by Nick Foles, beat Tom Brady’s New England Patriots 41-33.

Lowest Combined Points

Date	Super Bowl	Combined Points	Winning Team	Winning Pts	Losing Team	Losing Pts
1973-01-14	7	21	Miami Dolphins	14	Washington Redskins	7
1975-01-12	9	22	Pittsburgh Steelers	16	Minnesota Vikings	6
1969-01-12	3	23	New York Jets	16	Baltimore Colts	7

The lowest combined scores occurred in Super Bowls 7, 9, and 3. Super Bowls 3 and 7 featured tough defenses that dominated. In 1975, Super Bowl 9’s low 16-6 score can be attributed to inclement weather: the field was slick from overnight rain, and it was cold at 46 °F (8 °C), making it hard for the Pittsburgh Steelers and Minnesota Vikings to do much offensively.

UPDATE: In Super Bowl LIII in 2019, the Patriots and Rams broke the record for the lowest-scoring Super Bowl with a combined score of 16 points (13-3 for the Patriots).
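The extremes above can also be pulled out by rank rather than by a hand-picked cutoff. The following is a minimal sketch that uses only the five rows shown in the two tables (the full data set has 52 games, through Super Bowl 52); the name `games` is an illustrative stand-in for the real `super_bowls` DataFrame.

```python
import pandas as pd

# Rows copied from the two tables above; "games" is a stand-in
# for the full super_bowls DataFrame.
games = pd.DataFrame(
    {
        "super_bowl": [29, 52, 7, 9, 3],
        "winning_pts": [49, 41, 14, 16, 16],
        "losing_pts": [26, 33, 7, 6, 7],
    }
)
games["combined_pts"] = games["winning_pts"] + games["losing_pts"]

# nlargest / nsmallest return the n rows with the most extreme values,
# already sorted, so no threshold has to be chosen in advance.
highest = games.nlargest(2, "combined_pts")
lowest = games.nsmallest(3, "combined_pts")

print(highest[["super_bowl", "combined_pts"]])
print(lowest[["super_bowl", "combined_pts"]])
```

Ranking this way keeps working if new games are appended, whereas fixed cutoffs would need to be revisited.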

Let’s examine point difference next:

Point Difference

Lowest Margins of Victory

Date	Super Bowl	Margin of Victory	Winning Team	Winning Pts	Losing Team	Losing Pts
1991-01-27	25	1	New York Giants	20	Buffalo Bills	19

The vast majority of Super Bowls are close games, which makes sense: both teams have made it to the championship game, so both are likely strong. The closest game ever was Super Bowl 25 in 1991, when the New York Giants defeated the Buffalo Bills by one point.

Highest Margins of Victory

Date	Super Bowl	Margin of Victory	Winning Team	Winning Pts	Losing Team	Losing Pts
1990-01-28	24	45	San Francisco 49ers	55	Denver Broncos	10
1986-01-26	20	36	Chicago Bears	46	New England Patriots	10
1993-01-31	27	35	Dallas Cowboys	52	Buffalo Bills	17
2014-02-02	48	35	Seattle Seahawks	43	Denver Broncos	8

The biggest margin of victory was in Super Bowl 24 in 1990, when Hall of Famer Joe Montana led the San Francisco 49ers to a 55-10 rout of the Denver Broncos.
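The margin of victory is simply winning points minus losing points. As a small sketch, assuming only the scores listed in the two tables above (the name `games` is illustrative, not the project's actual DataFrame), the closest game and the biggest blowout can be located like this:

```python
import pandas as pd

# Scores copied from the margin-of-victory tables above.
games = pd.DataFrame(
    {
        "super_bowl": [25, 24, 20, 27, 48],
        "winning_pts": [20, 55, 46, 52, 43],
        "losing_pts": [19, 10, 10, 17, 8],
    }
)
games["difference_pts"] = games["winning_pts"] - games["losing_pts"]

# idxmin / idxmax give the row label of the smallest / largest margin.
closest = games.loc[games["difference_pts"].idxmin()]
biggest = games.loc[games["difference_pts"].idxmax()]

print(closest[["super_bowl", "difference_pts"]])
print(biggest[["super_bowl", "difference_pts"]])
```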

2. How does the game affect television viewership?
Do large point differences translate to lost viewers? Let’s combine the game data and TV data to find out. We can plot household share (the average percentage of U.S. households with a TV in use that were watching for the entire broadcast) against point difference.

Point Difference vs TV Ratings

The downward-sloping regression line, together with its 95% confidence interval, suggests that viewers do tend to bail on a game when it turns into a blowout. However, we must take this with a grain of salt: the linear relationship in the data is weak, and the sample is small (the data set covers only 52 games).
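One way to put a number on how weak the relationship is would be to report the slope together with the Pearson correlation. The helper below is a hedged sketch, not part of the original analysis: `linear_fit_summary` is a hypothetical name, and it assumes a DataFrame with `difference_pts` and `share_household` columns, such as the merged `games_tv` frame in the Appendix.

```python
import numpy as np
import pandas as pd


def linear_fit_summary(df, x="difference_pts", y="share_household"):
    """Return slope, intercept, Pearson r, r squared, and sample size."""
    clean = df[[x, y]].dropna()
    # np.polyfit with deg=1 returns the coefficients as (slope, intercept).
    slope, intercept = np.polyfit(clean[x], clean[y], deg=1)
    # Series.corr uses the Pearson correlation by default.
    r = clean[x].corr(clean[y])
    return {
        "slope": float(slope),
        "intercept": float(intercept),
        "r": float(r),
        "r_squared": float(r) ** 2,
        "n": len(clean),
    }
```

Calling `linear_fit_summary(games_tv)` would return the fitted slope and the correlation; an `r_squared` close to zero would mean point difference explains little of the variation in household share.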

3. How have viewership, TV ratings, and ad cost evolved over time?
Regardless of the score, though, it’s likely that most people stick it out for the halftime show, which is good for the TV networks and advertisers. A 30-second ad spot costs a hefty $5 million now, but has it always been that way? And how have the number of viewers and household ratings trended alongside ad cost?

We can see viewers increased before ad costs did. Maybe the networks weren’t very data savvy and were slow to react? Another hypothesis: maybe Super Bowl halftime shows weren’t that good in the earlier years? It turns out Michael Jackson’s Super Bowl 27 performance, one of the most watched events in American TV history, was when the NFL realized the value of Super Bowl airtime and decided they needed to sign big name musicians from then on out.
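Because viewers and ad cost are measured in very different units, it can help to rescale each series to a common baseline before comparing when each one started to climb. The function below is a sketch under assumptions: `index_to_first` is a hypothetical helper, and it expects one row per Super Bowl, so any duplicate rows for the same game would need to be handled first.

```python
import pandas as pd


def index_to_first(df, columns, order_by="super_bowl"):
    """Rescale each column so that its earliest value equals 100."""
    ordered = df.sort_values(order_by).set_index(order_by)[columns]
    # Dividing by the first row aligns on column names, so every
    # column is scaled by its own starting value.
    return ordered / ordered.iloc[0] * 100
```

For example, `index_to_first(tv[tv['super_bowl'] > 1], ['avg_us_viewers', 'ad_cost'])` would express both series as a percentage of their Super Bowl 2 values, which makes a lag between the two easier to see on one axis.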

Conclusion

In this project, we loaded, cleaned, then explored Super Bowl game, television, and halftime show data. We visualized the distributions of combined points, point differences, and halftime show performances using histograms. We discovered that blowouts do appear to lead to a drop in viewers, but caution that the linear relationship in the data is weak and the sample is small (52 games in the data set). We used line plots to see how ad cost increases lagged behind viewership increases. Ad costs have risen significantly since Michael Jackson’s highly watched Super Bowl 27 performance, when the NFL seems to have realized that big name musicians can increase viewership along with the value of advertising airtime.

————————————————————————————————————————————————————–

Appendices

A. Data

# Summary of the Super Bowl game data to inspect
super_bowls.info()
print('\n')
# Summary of the TV data to inspect
tv.info()
print('\n')
# Summary of the halftime musician data to inspect
halftime_musicians.info()

RangeIndex: 52 entries, 0 to 51
Data columns (total 18 columns):
Column Non-Null Count Dtype

0 date 52 non-null object
1 super_bowl 52 non-null int64
2 venue 52 non-null object
3 city 52 non-null object
4 state 52 non-null object
5 attendance 52 non-null int64
6 team_winner 52 non-null object
7 winning_pts 52 non-null int64
8 qb_winner_1 52 non-null object
9 qb_winner_2 2 non-null object
10 coach_winner 52 non-null object
11 team_loser 52 non-null object
12 losing_pts 52 non-null int64
13 qb_loser_1 52 non-null object
14 qb_loser_2 3 non-null object
15 coach_loser 52 non-null object
16 combined_pts 52 non-null int64
17 difference_pts 52 non-null int64
dtypes: int64(6), object(12)
memory usage: 7.4+ KB

RangeIndex: 53 entries, 0 to 52
Data columns (total 9 columns):
Column Non-Null Count Dtype

0 super_bowl 53 non-null int64
1 network 53 non-null object
2 avg_us_viewers 53 non-null int64
3 total_us_viewers 15 non-null float64
4 rating_household 53 non-null float64
5 share_household 53 non-null int64
6 rating_18_49 15 non-null float64
7 share_18_49 6 non-null float64
8 ad_cost 53 non-null int64
dtypes: float64(4), int64(4), object(1)
memory usage: 3.9+ KB

RangeIndex: 134 entries, 0 to 133
Data columns (total 3 columns):
Column Non-Null Count Dtype

0 super_bowl 134 non-null int64
1 musician 134 non-null object
2 num_songs 88 non-null float64
dtypes: float64(1), int64(1), object(1)
memory usage: 3.3+ KB

B. Notes on Dataset Issues
For the Super Bowl game data, we can see the dataset appears whole except for missing values in the backup quarterback columns (qb_winner_2 and qb_loser_2), which makes sense given that most starting QBs in the Super Bowl (qb_winner_1 and qb_loser_1) play the entire game.

From a visual inspection of the TV and halftime musician data, only one missing value is displayed, but there are likely more. The Super Bowl goes all the way back to 1967, and the more granular columns (e.g. the number of songs for halftime musicians) probably weren’t tracked reliably over time. Wikipedia is a great resource, but it is not perfect.

An inspection of the .info() output for tv and halftime_musicians shows us that there are multiple columns with null values.

For the TV data, the following columns have a significant number of missing values:

1. total_us_viewers
(number of U.S. viewers who watched at least some part of the broadcast)

2. rating_18_49
(average % of U.S. adults 18-49 who live in a household with a TV that were watching for the entire broadcast)

3. share_18_49
(average % of U.S. adults 18-49 who live in a household with a TV in use that were watching for the entire broadcast)

For the halftime musician data, the number of songs performed (num_songs) is missing for about a third of the performances. There are a lot of potential reasons for these missing values. Was the data ever tracked? Was it lost in history? We have taken note of where the dataset isn’t perfect and then focused the analysis on the portions of the data set where it is whole.
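The non-null counts in the .info() output can be turned into explicit missing-value counts and percentages. For instance, num_songs has 88 non-null values out of 134 rows, so 46 values (about 34%) are missing. The helper below is a minimal sketch; `missing_summary` is a hypothetical name and not part of the original code.

```python
import pandas as pd


def missing_summary(df):
    """Count and percentage of missing values per column, worst first."""
    summary = pd.DataFrame(
        {
            "missing": df.isna().sum(),
            "pct_missing": (df.isna().mean() * 100).round(1),
        }
    )
    # Keep only the columns that actually have gaps.
    return summary[summary["missing"] > 0].sort_values("missing", ascending=False)
```

Running `missing_summary(tv)` or `missing_summary(halftime_musicians)` would list only the incomplete columns, which is quicker to scan than the full .info() output.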

C. Code

# Import pandas
import pandas as pd

# Load the CSV data into DataFrames
super_bowls = pd.read_csv('datasets/super_bowls.csv')
tv = pd.read_csv('datasets/tv.csv')
halftime_musicians = pd.read_csv('datasets/halftime_musicians.csv')

# Summary of the Super Bowl game data to inspect
super_bowls.info()
print('\n')

# Summary of the TV data to inspect
tv.info()
print('\n')

# Summary of the halftime musician data to inspect
halftime_musicians.info()

# Import matplotlib and set plotting style
from matplotlib import pyplot as plt
%matplotlib inline
# Matplotlib 3.6+ renamed this style; on older versions use 'seaborn'
plt.style.use('seaborn-v0_8')

# Plot a histogram of combined points
plt.hist(super_bowls.combined_pts)
plt.xlabel('Combined Points')
plt.ylabel('Number of Super Bowls')
plt.show()

# Display the Super Bowls with the highest and lowest combined scores
display(super_bowls[super_bowls['combined_pts'] > 70])
display(super_bowls[super_bowls['combined_pts'] < 25])

# Plot a histogram of point differences
plt.hist(super_bowls.difference_pts)
plt.xlabel('Point Difference')
plt.ylabel('Number of Super Bowls')
plt.show()

# Display the closest game(s) and biggest blowouts
display(super_bowls[super_bowls['difference_pts'] == 1])
display(super_bowls[super_bowls['difference_pts'] >= 35])

# Join game and TV data, filtering out SB I because it was split over two networks
games_tv = pd.merge(tv[tv['super_bowl'] > 1], super_bowls, on='super_bowl')

# Import seaborn
import seaborn as sns

# Create a scatter plot with a linear regression model fit
sns.regplot(x='difference_pts', y='share_household', data=games_tv)
plt.xlabel("Point Difference")
plt.ylabel("TV Ratings (household share)")
plt.show()

# Create a figure with 3x1 subplot and activate the top subplot
plt.subplot(3, 1, 1)
plt.plot(tv.super_bowl, tv.avg_us_viewers, color='#648FFF')
plt.title('Average Number of US Viewers')

# Activate the middle subplot
plt.subplot(3, 1, 2)
plt.plot(tv.super_bowl, tv.rating_household, color='#DC267F')
plt.title('Household Rating')

# Activate the bottom subplot
plt.subplot(3, 1, 3)
plt.plot(tv.super_bowl, tv.ad_cost, color='#FFB000')
plt.title('Ad Cost')
plt.xlabel('SUPER BOWL')

# Improve the spacing between subplots
plt.tight_layout()
plt.show()
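The join step above drops Super Bowl 1 because it was split over two networks. A hedged way to guard that step is to ask pandas to verify that the join keys are unique on both sides. The sketch below uses made-up toy frames (`tv_toy` and `games_toy` are illustrative names and values, not the real data) and assumes, as the 53 rows in tv suggest, that Super Bowl 1 appears once per network.

```python
import pandas as pd

# Toy stand-ins for tv and super_bowls; the values are made up.
tv_toy = pd.DataFrame(
    {"super_bowl": [1, 1, 2, 3], "network": ["X", "Y", "X", "Y"]}
)
games_toy = pd.DataFrame(
    {"super_bowl": [1, 2, 3], "difference_pts": [10, 20, 30]}
)

# validate="one_to_one" raises MergeError if either side repeats a key.
try:
    pd.merge(tv_toy, games_toy, on="super_bowl", validate="one_to_one")
    duplicate_keys_detected = False
except pd.errors.MergeError:
    duplicate_keys_detected = True

# After dropping Super Bowl 1 the keys are unique and the merge succeeds.
merged = pd.merge(
    tv_toy[tv_toy["super_bowl"] > 1],
    games_toy,
    on="super_bowl",
    validate="one_to_one",
)
print(duplicate_keys_detected)
print(merged)
```

With the check in place, a duplicated game would fail loudly instead of silently adding extra points to the scatter plot.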