The Olympic Games are highly anticipated international sporting events normally held every four years. While the first Olympic Games are traditionally dated to 776 BC, the inaugural Games of the modern Olympics took place more recently in the year 1896. The Olympics encourage cultural exchanges and foster a diverse community; the Games create an opportunity for athletes from around the world to come together to compete and represent their countries in a wide range of sporting events. The Olympics have developed over time to become more inclusive of many different groups, and allow for different parts of the world to showcase their athletes' aptitude in a multitude of sports. From the inaugural modern Olympic Games of all male athletes to the current Olympic Games of a diverse mix of athletes including women, refugees, and people of different backgrounds and abilities, the Olympics have evolved immensely over the past 124 years.
In this tutorial, we aim to provide insight into the evolution of the Olympic Games and into how global social, political, and economic changes have shaped them. Using Olympic data from the past century, we will see how athlete demographics, representation, and medal leaders have changed or remained the same over time. Our original dataset contains information spanning 120 years of Olympic events; this historical dataset was scraped from www.sports-reference.com and then posted as a CSV file on Kaggle.com. However, we wanted to focus on a smaller subset of countries to make the analyses more meaningful; the following parts of the tutorial look at eleven countries, spanning four continents, that rank among the top medal winners up to 2016.
The libraries we used have been imported below, and include pandas, numpy, matplotlib, seaborn, folium, sklearn, and more.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sb
import folium
from sklearn import linear_model
import statsmodels.api as sm
import statsmodels.formula.api as smf
from statistics import mean
from sklearn import tree
from sklearn.model_selection import train_test_split
from sklearn import metrics
First, the data is read from the CSV file into a DataFrame using pandas. Each row represents an athlete who competed in a specific event at a specific Olympic Games. The first five entries of the raw data can be seen below.
olympics_df = pd.read_csv('athlete_events.csv')
olympics_df.head()
| | ID | Name | Sex | Age | Height | Weight | Team | NOC | Games | Year | Season | City | Sport | Event | Medal |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | A Dijiang | M | 24.0 | 180.0 | 80.0 | China | CHN | 1992 Summer | 1992 | Summer | Barcelona | Basketball | Basketball Men's Basketball | NaN |
1 | 2 | A Lamusi | M | 23.0 | 170.0 | 60.0 | China | CHN | 2012 Summer | 2012 | Summer | London | Judo | Judo Men's Extra-Lightweight | NaN |
2 | 3 | Gunnar Nielsen Aaby | M | 24.0 | NaN | NaN | Denmark | DEN | 1920 Summer | 1920 | Summer | Antwerpen | Football | Football Men's Football | NaN |
3 | 4 | Edgar Lindenau Aabye | M | 34.0 | NaN | NaN | Denmark/Sweden | DEN | 1900 Summer | 1900 | Summer | Paris | Tug-Of-War | Tug-Of-War Men's Tug-Of-War | Gold |
4 | 5 | Christine Jacoba Aaftink | F | 21.0 | 185.0 | 82.0 | Netherlands | NED | 1988 Winter | 1988 | Winter | Calgary | Speed Skating | Speed Skating Women's 500 metres | NaN |
The data contains many NaN values, which can be seen even in the first five rows of the table displayed above. A NaN in the Medal column means the athlete participated but did not win a medal; a NaN in the Height or Weight column means that measurement is missing for the athlete. For our purposes, the full dataset is also larger than we need. Thus, we begin to tidy the data. We create a new DataFrame called top_df that drops every row with a NaN value in the Medal, Height, or Weight column, so it contains only medalists with recorded measurements, and that keeps only the top 11 countries in terms of number of Summer Olympics medals won (according to TIME). The countries are the United States of America, Russia, Great Britain, Germany, France, Italy, China, Sweden, Hungary, Australia, and Japan, which together span four continents. Lastly, we reset top_df's index so that it starts at 0 and increases consecutively. This produces a more manageable dataset for our tutorial.
# create top_df to contain data consisting of the same time period, but only 11 countries and NaN values dropped
# copy the original olympics_df to a new one called top_df (representing data of the top medal winning countries)
top_df = olympics_df.copy()
# drop all the rows with NaN values in the medal, height, or weight columns
top_df = top_df[top_df['Medal'].notna()]
top_df = top_df[top_df['Height'].notna()]
top_df = top_df[top_df['Weight'].notna()]
# create a list of the abbreviated names of countries we will be focusing on
countries = ['USA', 'RUS', 'GBR', 'FRA', 'CHN', 'ITA', 'AUS', 'SWE', 'HUN', 'JPN', 'GER']
# keep only the rows whose NOC code is in the countries list
top_df = top_df[top_df['NOC'].isin(countries)]
# reset index values after removing rows
top_df.reset_index(drop=True, inplace=True)
# display data
top_df
| | ID | Name | Sex | Age | Height | Weight | Team | NOC | Games | Year | Season | City | Sport | Event | Medal |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 62 | Giovanni Abagnale | M | 21.0 | 198.0 | 90.0 | Italy | ITA | 2016 Summer | 2016 | Summer | Rio de Janeiro | Rowing | Rowing Men's Coxless Pairs | Bronze |
1 | 67 | Mariya Vasilyevna Abakumova (-Tarabina) | F | 22.0 | 179.0 | 80.0 | Russia | RUS | 2008 Summer | 2008 | Summer | Beijing | Athletics | Athletics Women's Javelin Throw | Silver |
2 | 73 | Luc Abalo | M | 23.0 | 182.0 | 86.0 | France | FRA | 2008 Summer | 2008 | Summer | Beijing | Handball | Handball Men's Handball | Gold |
3 | 73 | Luc Abalo | M | 27.0 | 182.0 | 86.0 | France | FRA | 2012 Summer | 2012 | Summer | London | Handball | Handball Men's Handball | Gold |
4 | 73 | Luc Abalo | M | 31.0 | 182.0 | 86.0 | France | FRA | 2016 Summer | 2016 | Summer | Rio de Janeiro | Handball | Handball Men's Handball | Silver |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
14800 | 135508 | Vera Igorevna Zvonaryova | F | 23.0 | 172.0 | 59.0 | Russia | RUS | 2008 Summer | 2008 | Summer | Beijing | Tennis | Tennis Women's Singles | Bronze |
14801 | 135520 | Julia Zwehl | F | 28.0 | 167.0 | 60.0 | Germany | GER | 2004 Summer | 2004 | Summer | Athina | Hockey | Hockey Women's Hockey | Gold |
14802 | 135525 | Martin Zwicker | M | 29.0 | 175.0 | 64.0 | Germany | GER | 2016 Summer | 2016 | Summer | Rio de Janeiro | Hockey | Hockey Men's Hockey | Bronze |
14803 | 135563 | Olesya Nikolayevna Zykina | F | 19.0 | 171.0 | 64.0 | Russia | RUS | 2000 Summer | 2000 | Summer | Sydney | Athletics | Athletics Women's 4 x 400 metres Relay | Bronze |
14804 | 135563 | Olesya Nikolayevna Zykina | F | 23.0 | 171.0 | 64.0 | Russia | RUS | 2004 Summer | 2004 | Summer | Athina | Athletics | Athletics Women's 4 x 400 metres Relay | Silver |
14805 rows × 15 columns
Now, we narrow our focus further by creating another subset of the data. The new DataFrame contains data only from the beginning of the 20th century and the beginning of the 21st century. This subset allows us to draw direct comparisons between the earliest and the most recent modern Olympic Games. To create the new DataFrame, we remove rows pertaining to years outside of the ranges 1896-1912 and 2000-2016. Moreover, we look at only the Summer Olympics, thus removing rows pertaining to the Winter Olympics.
# time_pds_df contains data of the first five and last five modern Olympics
time_pds_df = top_df.copy()
# create list of the years in focus
years = [1896, 1900, 1904, 1908, 1912, 2000, 2004, 2008, 2012, 2016]
# tidy further by dropping any rows that are not Summer Olympics
time_pds_df = time_pds_df[time_pds_df['Season'] == 'Summer'].reset_index(drop=True)
# keep only the rows whose year falls within the time periods in focus
time_pds_df = time_pds_df[time_pds_df['Year'].isin(years)]
# reset index after dropping rows in df
time_pds_df.reset_index(drop=True, inplace=True)
# display data
time_pds_df
| | ID | Name | Sex | Age | Height | Weight | Team | NOC | Games | Year | Season | City | Sport | Event | Medal |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 62 | Giovanni Abagnale | M | 21.0 | 198.0 | 90.0 | Italy | ITA | 2016 Summer | 2016 | Summer | Rio de Janeiro | Rowing | Rowing Men's Coxless Pairs | Bronze |
1 | 67 | Mariya Vasilyevna Abakumova (-Tarabina) | F | 22.0 | 179.0 | 80.0 | Russia | RUS | 2008 Summer | 2008 | Summer | Beijing | Athletics | Athletics Women's Javelin Throw | Silver |
2 | 73 | Luc Abalo | M | 23.0 | 182.0 | 86.0 | France | FRA | 2008 Summer | 2008 | Summer | Beijing | Handball | Handball Men's Handball | Gold |
3 | 73 | Luc Abalo | M | 27.0 | 182.0 | 86.0 | France | FRA | 2012 Summer | 2012 | Summer | London | Handball | Handball Men's Handball | Gold |
4 | 73 | Luc Abalo | M | 31.0 | 182.0 | 86.0 | France | FRA | 2016 Summer | 2016 | Summer | Rio de Janeiro | Handball | Handball Men's Handball | Silver |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
6102 | 135508 | Vera Igorevna Zvonaryova | F | 23.0 | 172.0 | 59.0 | Russia | RUS | 2008 Summer | 2008 | Summer | Beijing | Tennis | Tennis Women's Singles | Bronze |
6103 | 135520 | Julia Zwehl | F | 28.0 | 167.0 | 60.0 | Germany | GER | 2004 Summer | 2004 | Summer | Athina | Hockey | Hockey Women's Hockey | Gold |
6104 | 135525 | Martin Zwicker | M | 29.0 | 175.0 | 64.0 | Germany | GER | 2016 Summer | 2016 | Summer | Rio de Janeiro | Hockey | Hockey Men's Hockey | Bronze |
6105 | 135563 | Olesya Nikolayevna Zykina | F | 19.0 | 171.0 | 64.0 | Russia | RUS | 2000 Summer | 2000 | Summer | Sydney | Athletics | Athletics Women's 4 x 400 metres Relay | Bronze |
6106 | 135563 | Olesya Nikolayevna Zykina | F | 23.0 | 171.0 | 64.0 | Russia | RUS | 2004 Summer | 2004 | Summer | Athina | Athletics | Athletics Women's 4 x 400 metres Relay | Silver |
6107 rows × 15 columns
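Because the earlier tidying also removed rows with missing height or weight, it is worth checking how many rows remain for each year in focus. The following is a minimal sketch, assuming a DataFrame with a `Year` column like `time_pds_df`; the toy values and the helper name `rows_per_year` are illustrative assumptions rather than part of the original analysis.

```python
import pandas as pd

def rows_per_year(df, years):
    """Return the number of rows for each requested year, with 0 for years that have no rows."""
    counts = df['Year'].value_counts()
    return counts.reindex(years, fill_value=0)

# toy stand-in for time_pds_df (values are made up for illustration)
toy_df = pd.DataFrame({'Year': [1896, 1896, 1900, 2000, 2000, 2000, 2016]})
years = [1896, 1900, 1904, 1908, 1912, 2000, 2004, 2008, 2012, 2016]

summary = rows_per_year(toy_df, years)
print(summary)
```

Running the same helper on `time_pds_df` would show directly which of the ten Games years still have data after tidying.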
To provide a visualization and overview of the data we are working with, we created a world map using Folium. Olympic athletes travel across ocean and land to participate in the high-stakes Games; they have prepared and trained incredibly hard for years to reach the point of competing at the Olympic level.
The map marks each country's capital by longitude and latitude, and displays additional information in the pop-ups. The viewer is able to easily see the various locations of countries in our focus on the map as well as click on the markers to obtain the exact location name and a specific country's total medal count from 1896 to 2016. Additionally, the markers for the countries are distinguished from one another through the assignment of different colors that correspond to a specific continent; red represents North America, blue represents Europe, green represents Asia, and orange represents Australia.
# create world map pinpointing the eleven highest medal earning countries
map_osm = folium.Map()
# create df with country, country's capital name, and the capital's longitude and latitude values using values from map_data
map_data = [
    ['USA', 'Washington DC, USA', 38.8899, -77.0091],
    ['RUS', 'Moscow, Russia', 55.7558, 37.6173],
    ['GBR', 'London, UK', 51.5072, -0.1276],  # London lies just west of the prime meridian, so its longitude is negative
    ['FRA', 'Paris, France', 48.8566, 2.3522],
    ['CHN', 'Beijing, China', 39.9042, 116.4074],
    ['ITA', 'Rome, Italy', 41.9028, 12.4964],
    ['AUS', 'Canberra, Australia', -35.2809, 149.1300],
    ['SWE', 'Stockholm, Sweden', 59.3293, 18.0686],
    ['HUN', 'Budapest, Hungary', 47.4979, 19.0402],
    ['JPN', 'Tokyo, Japan', 35.6762, 139.6503],
    ['GER', 'Berlin, Germany', 52.5200, 13.4050],
]
# columns of the new df are NOC, capital, latitude, and longitude
loc_df = pd.DataFrame(map_data, columns = ['NOC', 'Capital', 'Latitude', 'Longitude'])
# create new column that currently stores NaN values, but will be updated to store each country's total medal count
loc_df['Medal_Count'] = np.nan
# count the medal-winning rows in top_df for each country (one row per athlete per medal)
medal_counts = top_df['NOC'].value_counts()
# update loc_df medal count column by looking up each country's NOC code in medal_counts
loc_df['Medal_Count'] = loc_df['NOC'].map(medal_counts).astype(float)
# display loc_df
loc_df
| | NOC | Capital | Latitude | Longitude | Medal_Count |
---|---|---|---|---|---|
0 | USA | Washington DC, USA | 38.8899 | -77.0091 | 4385.0 |
1 | RUS | Moscow, Russia | 55.7558 | 37.6173 | 1134.0 |
2 | GBR | London, UK | 51.5072 | -0.1276 | 1035.0 |
3 | FRA | Paris, France | 48.8566 | 2.3522 | 987.0 |
4 | CHN | Beijing, China | 39.9042 | 116.4074 | 985.0 |
5 | ITA | Rome, Italy | 41.9028 | 12.4964 | 1061.0 |
6 | AUS | Canberra, Australia | -35.2809 | 149.1300 | 1207.0 |
7 | SWE | Stockholm, Sweden | 59.3293 | 18.0686 | 765.0 |
8 | HUN | Budapest, Hungary | 47.4979 | 19.0402 | 791.0 |
9 | JPN | Tokyo, Japan | 35.6762 | 139.6503 | 843.0 |
10 | GER | Berlin, Germany | 52.5200 | 13.4050 | 1612.0 |
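Note that each row of the dataset is one athlete in one event, so the counts above are athlete-level: a relay or team medal contributes one row per team member. A hedged sketch of how the two ways of counting differ, using a made-up toy frame in place of `top_df`; dropping duplicates on `NOC`, `Games`, `Event`, and `Medal` is an assumption about what identifies a single medal, and it would undercount events in which one country wins two identical medals.

```python
import pandas as pd

# toy stand-in for top_df: two USA teammates share one relay gold (values are made up)
toy_df = pd.DataFrame({
    'NOC':   ['USA', 'USA', 'USA', 'FRA'],
    'Games': ['2008 Summer', '2008 Summer', '2008 Summer', '2008 Summer'],
    'Event': ['Relay', 'Relay', 'Sprint', 'Handball'],
    'Medal': ['Gold', 'Gold', 'Silver', 'Gold'],
})

# athlete-level count: one per row, matching how Medal_Count is computed in the tutorial
athlete_medals = toy_df['NOC'].value_counts()

# event-level count: a medal shared by teammates is counted once
event_medals = (
    toy_df.drop_duplicates(subset=['NOC', 'Games', 'Event', 'Medal'])['NOC']
    .value_counts()
)

print(athlete_medals)
print(event_medals)
```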
# add each country's marker to the map with color associated with continent & include a pop-up containing location name and medal count
for indice, row in loc_df.iterrows():
    # assign country an icon color
    if row['NOC'] == 'USA':  # North America
        color = 'red'
    elif row['NOC'] in ('RUS', 'GBR', 'FRA', 'ITA', 'HUN', 'GER', 'SWE'):  # Europe
        color = 'blue'
    elif row['NOC'] in ('CHN', 'JPN'):  # Asia
        color = 'green'
    elif row['NOC'] == 'AUS':  # Australia
        color = 'orange'
    # add marker to map
    folium.Marker(
        location=[row["Latitude"], row["Longitude"]],
        popup=str(row["Capital"]) + '\nMedal Count: ' + str(row["Medal_Count"]),
        icon=folium.map.Icon(color=color)
    ).add_to(map_osm)
# display map
map_osm
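The continent-to-color assignment described above can also be expressed as a pair of dictionary lookups, which keeps the grouping in one place. This is a sketch rather than the code used to produce the map; the names `CONTINENT_BY_NOC`, `COLOR_BY_CONTINENT`, and `marker_color`, and the `'gray'` fallback, are hypothetical.

```python
# continent for each NOC code in focus, following the grouping described in the text
CONTINENT_BY_NOC = {
    'USA': 'North America',
    'RUS': 'Europe', 'GBR': 'Europe', 'FRA': 'Europe', 'ITA': 'Europe',
    'HUN': 'Europe', 'GER': 'Europe', 'SWE': 'Europe',
    'CHN': 'Asia', 'JPN': 'Asia',
    'AUS': 'Australia',
}

# marker color for each continent, as listed in the text
COLOR_BY_CONTINENT = {
    'North America': 'red',
    'Europe': 'blue',
    'Asia': 'green',
    'Australia': 'orange',
}

def marker_color(noc, default='gray'):
    """Return the marker color for a NOC code, or the fallback color for unknown codes."""
    continent = CONTINENT_BY_NOC.get(noc)
    return COLOR_BY_CONTINENT.get(continent, default)

print(marker_color('JPN'))
```

Inside the marker loop, the color could then be obtained with `marker_color(row['NOC'])`, and an unexpected NOC code would fall back to the default instead of reusing the previous row's color.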
We begin by measuring and analyzing a few variables over time. In the following graphs, the heights, weights, and ages of Olympians are plotted over the time period 1896 to 2016. Over the course of the past century, advances in equality, scientific discoveries, and technologies have resulted in a more diverse and wide-ranging group of Olympic athletes. The visualization of the height, weight, and age of Olympians over time reveals several significant trends.
# list of variable names to measure
variables = ["Height", "Weight", "Age"]
# displays graphs of the target variable vs time
fig3, axes = plt.subplots(len(variables), 1, figsize=(15,20), squeeze=False)
plt.subplots_adjust(hspace=.4)
# creates a scatterplot using matplotlib pyplot for each variable vs year
for i in range(len(variables)):
    top_df.plot(kind='scatter', x="Year", y=variables[i], ax=axes[i, 0],
                title=variables[i] + " vs Year")
# display graphs
plt.show()
Before we discuss the relationships between height, weight, age, and time shown in the scatterplots above (created using matplotlib pyplot), you may have noticed that all three plots have no data points for 1916, 1940, and 1944. These are the only Games in the period covered by our data that were cancelled, and all three were cancelled because of war: World War I led to the cancellation of the 1916 Olympics, and World War II led to the cancellation of the 1940 and 1944 Olympics.
Now, let's take a look at the individual graphs. Overall, we can see that there are significant increases in data points per year over the course of the time period of 1896 to 2016.
Looking at the height vs. year plot, we can see that the range of heights increases over time. In the early 1900s, Olympians' heights range from about 155 cm to 200 cm (a difference of 45 cm); in the early 2000s, they range from about 135 cm to 215 cm (a difference of 80 cm). The height range has expanded in both the taller and the shorter direction. This may be attributed to several factors. Firstly, social, technological, and economic changes, together with a world population significantly larger than it was a century ago, mean that there are inherently more people with different body types and heights to draw athletes from; the increased focus on training and supplementation may also have played a role. Secondly, over the course of Olympic history, more countries and individuals have competed in the Olympics. The greater representation and inclusivity (for example, allowing women or refugees to participate) have affected the height distribution of Olympic athletes over time; the average height of men is greater than the average height of women, and the average height of a group of people from one country or ethnicity may differ from that of another. Lastly, different sporting events lend themselves to competitors of certain heights. For example, the average height of gymnasts differs significantly from the average height of basketball players. The changes in, and growth of, the number of sporting events included in the Olympics over the past century affect who competes and, thus, the overall heights of athletes.
The weight vs. year plot also shows the range of values increasing over time. In the early 1900s, Olympians' weights range from about 42 kg to 120 kg (a difference of 78 kg); in the early 2000s, they range from about 37 kg to 140 kg (a difference of 103 kg). While the range of weights has also expanded in both directions, lighter and heavier, it has grown proportionally less than the range of heights: by roughly a third, whereas the height range nearly doubled. The trend can be attributed to the same factors discussed above for height.
Looking at the age vs. year plot, we can see that the spread of ages fluctuates from one Games to the next. However, the minimum and maximum ages do change significantly from 1896 to 2016: both younger and older athletes have competed since 1896. In 1896, the ages range from about 20 to 27; in 2016, the ages range from about 15 to 57. Given that there is no overall age limit for competitors in the Olympic Games, it is likely that this widening toward both younger and older athletes will continue.
Ultimately, we may conclude that the bodies of Olympic athletes are becoming more differentiated from one another. Athletes' success in various sporting events is achieved in part through specialized and more extreme physiques, aided by social and technological progress.
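The ranges quoted above were read off the scatterplots by eye. They can also be computed directly; the following is a minimal sketch on made-up values, assuming a frame with `Year` and a numeric column such as `Height`, as in `top_df`. The helper name `yearly_range` is hypothetical.

```python
import pandas as pd

def yearly_range(df, column):
    """Return the minimum, maximum, and range of a numeric column for each year."""
    stats = df.groupby('Year')[column].agg(['min', 'max'])
    stats['range'] = stats['max'] - stats['min']
    return stats

# toy stand-in for top_df (heights in cm, values are made up)
toy_df = pd.DataFrame({
    'Year':   [1900, 1900, 1900, 2000, 2000, 2000],
    'Height': [160.0, 175.0, 190.0, 140.0, 180.0, 210.0],
})

height_stats = yearly_range(toy_df, 'Height')
print(height_stats)
```

Calling `yearly_range(top_df, 'Weight')` or `yearly_range(top_df, 'Age')` in the same way would give the corresponding per-year ranges for the other two variables.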
Gender and sex equality in sports has been a topic of discussion since the early years of the Olympic Games. While it remains an issue today, athletes and organizations are helping to pave the way for gender equality in sports. The International Olympic Committee (IOC) has been actively promoting the advancement of gender equality in the modern Olympic Movement and beyond since the 1900s. The IOC has stated that "sport is one of the most powerful platforms for promoting gender equality and empowering women and girls" (Olympics). Across the globe, women and girls face discrimination on the basis of sex and gender. Analyzing the Olympic data presents an opportunity to see some of the progress made over the past century in the fight for gender equality, especially in sports; however, the fight against sex and gender discrimination is one that women and girls around the world continue to wage every day.
The dataset we are currently working with consists of data on athletes who won a medal at the Olympic Games from 1896 to 2016 while representing one of the eleven leading medal-winning countries. We divided the data into the earlier Olympic years (before 1950) and the later Olympic years (1950 and later). We then created two pie charts using matplotlib pyplot, one for each period, to show the proportion of male to female medal winners. The pie charts convey a part-to-whole relationship that clearly shows that male athletes have dominated the medal counts in comparison to female athletes. Comparing the two pie charts, we can also see that the difference between the percentages of male and female medal winners in the earlier Olympic years is immense; while the difference is smaller in the chart depicting the later Olympic years, the medal counts for the two sexes are still not close to equal. The main driver is that, throughout Olympic history, a smaller percentage of competitors have been female. Increased accessibility and encouragement for women and girls to participate in sports have helped narrow the gap; however, it still very much exists, as female athletes continue to face challenges and oppression around the world.
# create new dataframes, that are copies of top_df, to be manipulated
early_df = top_df.copy()
later_df = top_df.copy()
# early_df consists of data from before 1950, later_df consists of data from 1950 and later
# reset index after dropping rows from both df
early_df = early_df[early_df['Year'] < 1950].reset_index(drop=True)
later_df = later_df[later_df['Year'] >= 1950].reset_index(drop=True)
# create pie chart using matplotlib pyplot that plots number of males and females from the earlier Olympic years (before 1950)
plt.figure(figsize=[9,7])
early_df['Sex'].value_counts().plot.pie()
plt.title('Male and Female Medalists Count from 1896 to 1950')
# display pie chart
plt.show()
# create pie chart using matplotlib pyplot that plots number of males and females from the later Olympic years (1950 and later)
plt.figure(figsize=[9,7])
later_df['Sex'].value_counts().plot.pie()
plt.title('Male and Female Medalists Count from 1950 to 2016')
# display pie chart
plt.show()
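The pie charts show the split visually but do not print the shares. A short sketch of how the female share of medalists in each period could be computed, using a made-up toy frame with the same `Year` and `Sex` columns as `top_df`; the period labels are illustrative.

```python
import pandas as pd

# toy stand-in for top_df (values are made up for illustration)
toy_df = pd.DataFrame({
    'Year': [1920, 1920, 1936, 1936, 1988, 1988, 2012, 2012],
    'Sex':  ['M', 'M', 'M', 'F', 'M', 'F', 'F', 'M'],
})

# label each row with its period, using the same 1950 cutoff as the pie charts
era = toy_df['Year'].lt(1950).map({True: 'before 1950', False: '1950 and later'})

# the mean of a True/False column is the fraction of True values
female_share = toy_df['Sex'].eq('F').groupby(era).mean()
print(female_share)
```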
Next, we create scatterplots using matplotlib pyplot that display the number of female and male medalists over time. We use Python dictionaries to first store the key-value pairs of a specific year and its female/male count before creating the plots. To enhance the analysis, we also added regression lines to both graphs that show the rate of increase in female/male medalist count over time. We can see that both plots have a generally increasing trend during the time period of 1896 to 2016. This is reasonable, as the Olympics generally had more competitors over time, thus resulting in more medalists. The more revealing aspect comes from the slope (the m value) of each regression line. For the Number of Female Medalists Over Time graph, the slope of the line is 4.717; for the Number of Male Medalists Over Time graph, the slope of the line is 3.316. Each slope is measured in medalists per calendar year, and the values show that the number of female medalists increased at a faster rate over time than the number of male medalists. This aligns with the fact that the percentage of female competitors increased over time due to social movements and greater sex and gender equality globally. The Olympics began with only male competitors, but have steadily fostered a more diverse and inclusive community of athletes of both sexes. The boxplot of sex vs. year of athletes further conveys the differing distributions of female and male athletes over time. The whiskers of the box representing males stretch farther back than those of the box representing females, since in the early Games female athletes were either not allowed to participate or were participating in small numbers. The quartiles and median marks also convey how female athletes have only more recently medaled in greater numbers at the Olympics.
# create a dictionary that will store key/value pairs for female count for a given year
fem_yr = dict()
# create copy of top_df stored in female_df, then keep only data pertaining to female athletes
female_df = top_df.copy()
female_df = female_df[female_df['Sex'] == 'F'].reset_index(drop=True)
# iterate through female_df to count up the number of females in a given year
for index, row in female_df.iterrows():
    if row['Year'] in fem_yr:
        fem_yr[row['Year']] = fem_yr[row['Year']] + 1
    else:
        fem_yr[row['Year']] = 1
# create lists for the keys and values of the fem_yr dict
year_lst = []
female_count = []
for k,v in fem_yr.items():
    year_lst.append(k)
    female_count.append(v)
# create a scatterplot using matplotlib pyplot of female medalists count over time
plt.scatter(year_lst, female_count, s=50)
plt.title("Number of Female Medalists Over Time", fontsize = 12)
plt.xlabel('Year')
plt.ylabel('Count')
# add regression line
m, b = np.polyfit(year_lst, female_count, 1)
y = female_df['Year']
plt.plot(y, m*y+b, color='red')
# print slope of line that shows numerical rate of increase
print('Slope of line: ' + str(m))
Slope of line: 4.717016778966828
# create a dictionary that will store key/value pairs for male count for a given year
male_yr = dict()
# create copy of top_df stored in male_df, then keep only data pertaining to male athletes
male_df = top_df.copy()
male_df = male_df[male_df['Sex'] == 'M'].reset_index(drop=True)
# iterate through male_df to count up the number of males in a given year
for index, row in male_df.iterrows():
    if row['Year'] in male_yr:
        male_yr[row['Year']] = male_yr[row['Year']] + 1
    else:
        male_yr[row['Year']] = 1
# create lists for the keys and values of the male_yr dict
year_lst2 = []
male_count = []
for k,v in male_yr.items():
    year_lst2.append(k)
    male_count.append(v)
# create a scatterplot using matplotlib pyplot of male medalists count over time
plt.scatter(year_lst2, male_count, s=50)
plt.title("Number of Male Medalists Over Time", fontsize = 12)
plt.xlabel('Year')
plt.ylabel('Count')
# add regression line
m2, b2 = np.polyfit(year_lst2, male_count, 1)
y2 = male_df['Year']
plt.plot(y2, m2*y2+b2, color='red')
# print slope of line that shows numerical rate of increase
print('Slope of line: ' + str(m2))
Slope of line: 3.3164683242687882
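To make the meaning of these slopes concrete, here is a small worked sketch on made-up counts. It uses `groupby(...).size()` as a shorter alternative to the dictionary loop; the toy numbers are assumptions chosen so the result is easy to check by hand (2, 4, and 6 medalists at three consecutive Games).

```python
import numpy as np
import pandas as pd

# toy stand-in for female_df: 2, 4, and 6 medalists at three consecutive Games
toy_df = pd.DataFrame({
    'Year': [2000] * 2 + [2004] * 4 + [2008] * 6,
    'Sex':  ['F'] * 12,
})

# number of medalist rows per year
counts = toy_df[toy_df['Sex'] == 'F'].groupby('Year').size()

# fit a straight line count = slope * year + intercept
years = counts.index.to_numpy(dtype=float)
slope, intercept = np.polyfit(years, counts.to_numpy(dtype=float), 1)

print('Slope per calendar year:', slope)
print('Slope per four-year Games cycle:', 4 * slope)
```

In this toy case the count rises by 2 from one Games to the next, so the fitted slope is 0.5 per calendar year. By the same reasoning, the slopes printed above correspond to roughly four times as many additional medalists per Games.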
# create a boxplot using seaborn for plotting categorical sex variable over time (years)
sb.boxplot(x='Year', y='Sex', data=top_df, width = 0.8).set_title('Sex vs. Year of Olympic Athletes from 1896 to 2016')
Text(0.5, 1.0, 'Sex vs. Year of Olympic Athletes from 1896 to 2016')
Building upon our previous computations and analyses, we now create a regression model for the data in top_df to examine whether there is a significant relationship between athletes' sex and time over the course of Olympic history. Using statsmodels, we fit a linear regression model by ordinary least squares (OLS), a common technique for estimating linear regressions, and use the summary to further analyze the data.
# copy top_df and store in new reg_df
reg_df = top_df.copy()
# convert the 'Sex' column's string values (M/F) to numerical values (0/1) to be used in the regression model
reg_df['Sex'] = reg_df['Sex'].map({'M': 0, 'F': 1})
# create arrays of year and sex values needed to create the regression model
xplot = np.array(reg_df['Year'].values)
yplot = np.array(reg_df['Sex'].values)
# fit a linear regression model using statsmodels for year and sex
model = sm.OLS(yplot, xplot, missing='drop')
model = model.fit()
# print out OLS regression result summary for inspection; no constant column was added, so the model has a single coefficient and no intercept
model.summary()
Dep. Variable: | y | R-squared (uncentered): | 0.367 |
---|---|---|---|
Model: | OLS | Adj. R-squared (uncentered): | 0.367 |
Method: | Least Squares | F-statistic: | 8580. |
Date: | Sun, 19 Dec 2021 | Prob (F-statistic): | 0.00 |
Time: | 07:26:45 | Log-Likelihood: | -10130. |
No. Observations: | 14805 | AIC: | 2.026e+04 |
Df Residuals: | 14804 | BIC: | 2.027e+04 |
Df Model: | 1 | ||
Covariance Type: | nonrobust |
| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
---|---|---|---|---|---|---|
x1 | 0.0002 | 1.98e-06 | 92.627 | 0.000 | 0.000 | 0.000 |
Omnibus: | 66488.601 | Durbin-Watson: | 1.225 |
---|---|---|---|
Prob(Omnibus): | 0.000 | Jarque-Bera (JB): | 2530.877 |
Skew: | 0.568 | Prob(JB): | 0.00 |
Kurtosis: | 1.323 | Cond. No. | 1.00 |
In the above regression model summary, there are several important values to take note of. First, we see that R-squared is 0.367. Because the model was fit without a constant term, statsmodels reports the uncentered R-squared, which measures variation around zero rather than around the mean; it should therefore not be read as the model explaining 36.7% of the variability in our sex variable in the usual sense. We can also look at the value for Prob (F-statistic), which is the p-value for the null hypothesis that the year coefficient is zero. The reported value of 0.00 tells us that results like ours would be very unlikely if that null hypothesis were true. Lastly, the p-value of 0.000 for the coefficient is less than 0.05, which means that the parameter is significantly different from zero.
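To illustrate why the presence of an intercept matters for R-squared, the sketch below fits both versions of the model with NumPy on made-up data and computes the two definitions by hand. The toy data, the helper `fit_r_squared`, and the use of `np.linalg.lstsq` in place of statsmodels are all assumptions made for illustration; the numbers are not the tutorial's results.

```python
import numpy as np

# toy data (made up): 31 Games years with 10 medalists each, where the number
# of women per Games rises step by step from 0 to 5 over time
games_years = np.arange(1896, 2017, 4)
year = np.repeat(games_years, 10).astype(float)
sex = np.concatenate([
    np.r_[np.ones(i // 6), np.zeros(10 - i // 6)]  # 1 = female, 0 = male
    for i in range(len(games_years))
])

def fit_r_squared(X, y, centered):
    """Fit y on X by least squares and return centered or uncentered R-squared."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    # centered R-squared compares against the mean; uncentered compares against zero
    baseline = y - y.mean() if centered else y
    return 1.0 - (resid @ resid) / (baseline @ baseline)

# model without an intercept (what sm.OLS fits when no constant column is added)
r2_uncentered = fit_r_squared(year.reshape(-1, 1), sex, centered=False)

# model with an intercept column of ones
X_const = np.column_stack([np.ones_like(year), year])
r2_centered = fit_r_squared(X_const, sex, centered=True)

print('uncentered R-squared, no intercept:', round(r2_uncentered, 3))
print('centered R-squared, with intercept:', round(r2_centered, 3))
print('share of women in the toy data:', round(sex.mean(), 3))
```

In this toy example the uncentered value lands close to the overall share of women rather than reflecting how well year predicts sex. With statsmodels, an intercept is typically included by passing `sm.add_constant(xplot)` as the design matrix, after which the summary reports the usual centered R-squared.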
As seen in an earlier analysis, the range of ages of Olympic athletes has widened over time. Here, we continue to look at the distribution of ages, focusing on comparing the beginning (first five) and latter (past five) Olympic years.
First, we created two violin plots (using matplotlib pyplot): one displaying the distribution of age over time for the period 1896-1912, and the other displaying the distribution of age over time for the period 2000-2016. Looking at the beginning years plot, we can see that only two of the five Olympic years in focus have data after our tidying process. The range of the data is fairly small for both 1896 and 1900. More importantly, we can see that the median age of athletes in 1896 is 23, with greater frequency around ages 21 and 26. The median age of athletes in the 1900 Olympics is also 23; in this case, the violin is unimodal and shows that the greatest frequency is at age 23. Now, looking at the latter years plot, we can see that the median age of athletes for all five Olympics is about 26. This is higher than the median for the beginning Olympic years, which is understandable since, as we have seen previously, the ages and age range of Olympic athletes have increased over time. We can also see that the five violins are unimodal, with the greatest frequency appearing around the median age of 26. Given evolving athlete mindsets, goals, and abilities, as well as the new training and supplementation regimens that recent decades have allowed for, the distributions of ages seen in the following graphs make sense.
The distribution plot created using seaborn combines the beginning and latter years data to present a visualization of the density of data at different ages. It is apparent that a large number of Olympic athletes fall in the age range of early to late 20s, which has the highest density. Overall, we can see that the majority of competitors are between their mid-teens and 40 years old. To make this more explicit, we created a bar graph that assigns athletes to their corresponding age groups: athletes under 18 belong to the teen age group, those aged 18-24 to the youth age group, those aged 25-54 to the adult age group, and those aged 55 and up to the senior age group. The bar graph, created using matplotlib pyplot, reveals that the largest age group is indeed adults, followed closely by the youth age group. In comparison to the other age groups in our data, the teen age group is much smaller; there are also no seniors.
# create plot using matplotlib pyplot
fig, ax = plt.subplots()
fig.set_figheight(6)
fig.set_figwidth(6)
# iterate through the dataframe and add the key value pairs of year and age to the age_yr dictionary
age_yr = dict()
for index, row in time_pds_df.iterrows():
    if row['Year'] < 2000:
        if row['Year'] in age_yr:
            age_yr[row['Year']].append(row['Age'])
        else:
            age_yr[row['Year']] = [row['Age']]
# create lists for the keys and values of the age_yr dict
x = []
y = []
for k, v in age_yr.items():
    x.append(k)
    y.append(v)
# create violin plot for x (year) and y (age)
ax.violinplot(y, x, widths=2, showmeans=True)
ax.set_xlabel("Year")
ax.set_ylabel("Age")
ax.set_title("Distribution of Age Over Time: 1896-1910")
fig.savefig("violin.png")
# create plot using matplotlib pyplot
fig2, ax2 = plt.subplots()
fig2.set_figheight(6)
fig2.set_figwidth(6)
# iterate through the dataframe and add the key value pairs of year and age to the age_yr2 dictionary
age_yr2 = dict()
for index, row in time_pds_df.iterrows():
    if row['Year'] >= 2000:
        if row['Year'] in age_yr2:
            age_yr2[row['Year']].append(row['Age'])
        else:
            age_yr2[row['Year']] = [row['Age']]
# create lists for the keys and values of the age_yr2 dict
x2 = []
y2 = []
for k, v in age_yr2.items():
    x2.append(k)
    y2.append(v)
# create violin plot for x (year) and y (age)
ax2.violinplot(y2, x2, widths=2, showmeans=True)
ax2.set_xlabel("Year")
ax2.set_ylabel("Age")
ax2.set_title("Distribution of Age Over Time: 2000-2016")
# save under a separate file name so the 1896-1910 figure is not overwritten
fig2.savefig("violin_2000_2016.png")
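The per-year age lists that feed the violin plots, and the medians discussed in the text, can also be computed directly with a pandas groupby. The following is a minimal sketch rather than the code we ran: `sample_df` is a small made-up stand-in for `time_pds_df`, and `ages_by_year` is a hypothetical helper name. Note that the plotting calls above pass `showmeans=True`, so the horizontal line drawn inside each violin marks the mean; computing the median separately, as below, is one way to check the values read off the plots.

```python
import pandas as pd

# Made-up rows standing in for time_pds_df (values are illustrative only)
sample_df = pd.DataFrame({
    "Year": [1896, 1896, 1896, 1900, 1900, 2000, 2000, 2016],
    "Age": [21.0, 23.0, 26.0, 22.0, 24.0, 25.0, 27.0, 26.0],
})


def ages_by_year(df, start, end):
    """Return a Series mapping each year in [start, end] to its list of ages."""
    in_period = df[(df["Year"] >= start) & (df["Year"] <= end)]
    return in_period.groupby("Year")["Age"].apply(list)


early = ages_by_year(sample_df, 1896, 1910)
late = ages_by_year(sample_df, 2000, 2016)

# positions (years) and datasets (age lists) in the form ax.violinplot expects
x_early, y_early = early.index.tolist(), early.tolist()
x_late, y_late = late.index.tolist(), late.tolist()

# median age per Olympic year, to read alongside the violins
median_by_year = sample_df.groupby("Year")["Age"].median()
print(median_by_year)
```

With lists built this way, `ax.violinplot(y_early, x_early, widths=2, showmeans=True)` would draw the same kind of plot as the loop-based version, with one function covering both time periods.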
# create a distribution plot using seaborn which represents the distribution of age against the density distribution
# distplot is deprecated in seaborn; histplot with a KDE overlay and density scaling draws the equivalent plot
sb.histplot(time_pds_df.Age, kde=True, stat="density")
<AxesSubplot:xlabel='Age', ylabel='Density'>
# create new column in reg_df for age group
reg_df['Age_Group'] = 'empty'
# iterate through reg_df to assign the athlete to an age group based on their age
# Teen: under 18, Youth: 18-24, Adult: 25-54, Senior: 55 and up
for index, row in reg_df.iterrows():
    if row['Age'] < 18:
        reg_df.at[index, 'Age_Group'] = 'Teen'
    elif row['Age'] >= 18 and row['Age'] < 25:
        reg_df.at[index, 'Age_Group'] = 'Youth'
    elif row['Age'] >= 25 and row['Age'] < 55:
        reg_df.at[index, 'Age_Group'] = 'Adult'
    else:
        reg_df.at[index, 'Age_Group'] = 'Senior'
# create bar graph showing athlete count for each age group
plt.figure(figsize=[9,7])
reg_df['Age_Group'].value_counts().plot.barh()
plt.title('Age Group vs. Athlete Count')
plt.xlabel('Count')
plt.ylabel('Age Group')
# display plot
plt.show()
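The same age-group assignment can be written without a row-by-row loop by using `pd.cut`. This is a sketch under assumptions, not the code we ran: the `ages` Series below is made-up data standing in for `reg_df['Age']`, and the bin edges simply restate the group boundaries given above. One difference worth knowing: in the loop version, a missing age fails every comparison and falls through to the `else` branch, so it would be labeled Senior, whereas `pd.cut` leaves a missing age unlabeled.

```python
import pandas as pd

# Made-up ages standing in for reg_df['Age']; the last entry is a missing value
ages = pd.Series([15, 17, 18, 24, 25, 40, 54, 55, None], name="Age")

# right=False closes each bin on the left: [0, 18), [18, 25), [25, 55), [55, inf)
age_group = pd.cut(
    ages,
    bins=[0, 18, 25, 55, float("inf")],
    labels=["Teen", "Youth", "Adult", "Senior"],
    right=False,
)

# count athletes per group; missing ages are excluded from the counts
counts = age_group.value_counts()
print(counts)
```

On the real data, `reg_df['Age_Group'] = pd.cut(reg_df['Age'], ...)` with the same arguments would fill the column in one step, and the bar graph code would work unchanged.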
Now, let's determine whether age, height, and weight can be used to predict an athlete's sex. We created a decision tree classifier using entropy as its criterion, performed holdout validation, ran the prediction on the test set, and measured the accuracy. In the run described here, the accuracy was 0.7820, meaning the decision tree classified 78.20% of the held-out test set correctly. Because the train/test split is random, this value changes slightly from run to run; the output shown below the code, about 0.7696, comes from a different run. While an accuracy closer to 100% would be more desirable, an accuracy in the 77-78% range is still fairly good.
# remove any rows that contain NaN or infinite values
reg_df = reg_df[~reg_df.isin([np.nan, np.inf, -np.inf]).any(axis=1)]
# decision tree is the algorithm used for classification
# using entropy (range of 0 to 1) as the criterion for measuring the quality of the split and selecting the best features
# for the max_depth hyperparameter, we chose the value 3 versus default value of None which would allow nodes to expand
# until all leaves are pure or until all leaves contain less than min_samples_split samples; after testing different
# values, max_depth of 3 consistently produced higher accuracy values.
dtc = tree.DecisionTreeClassifier(criterion = 'entropy', max_depth=3)
# the predictors to be used include age, height, and weight
# store the data and the target data into two variables
cols = ['Age', 'Height', 'Weight']
X = reg_df[cols] # input features
y = reg_df.Sex # target variable
# perform holdout validation on the algorithm
# split data for separate training and testing sets, where 70% of dataset is used for training, and 30% for testing
X_train,X_test,y_train,y_test=train_test_split(X,y,test_size=.3)
# fit the model, train the decision tree classifier
dtc = dtc.fit(X_train,y_train)
# predict the test dataset using predict()
y_pred = dtc.predict(X_test)
# the performance metric used is accuracy (using metrics from sklearn)
dtc_accuracy = metrics.accuracy_score(y_test, y_pred)
# print accuracy of decision tree classifier
print("Accuracy:", dtc_accuracy)
Accuracy: 0.7695945945945946
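Since the accuracy depends on which rows land in the test set, it can help to fix the random seed and to compare the tree against a trivial baseline. The sketch below is illustrative only: `demo_df` is synthetic data with invented numbers, not our Olympic dataset, and the `random_state` value, the `stratify` argument, the `DummyClassifier` baseline, and the 5-fold cross-validation are additions that were not part of our original analysis.

```python
import numpy as np
import pandas as pd
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for reg_df: every number here is invented for illustration
rng = np.random.default_rng(0)
n = 400
sex = rng.choice(["M", "F"], size=n, p=[0.6, 0.4])
is_m = sex == "M"
demo_df = pd.DataFrame({
    "Age": rng.normal(25, 4, n),
    "Height": np.where(is_m, rng.normal(180, 7, n), rng.normal(168, 7, n)),
    "Weight": np.where(is_m, rng.normal(78, 10, n), rng.normal(62, 9, n)),
    "Sex": sex,
})

X = demo_df[["Age", "Height", "Weight"]]
y = demo_df["Sex"]

# random_state fixes the shuffle so the same rows are held out on every run;
# stratify=y keeps the class proportions the same in both sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

# same settings as the tutorial's classifier, plus a fixed seed
dtc = DecisionTreeClassifier(criterion="entropy", max_depth=3, random_state=42)
dtc.fit(X_train, y_train)
tree_acc = dtc.score(X_test, y_test)

# baseline that always predicts the most frequent class in the training set
baseline = DummyClassifier(strategy="most_frequent").fit(X_train, y_train)
baseline_acc = baseline.score(X_test, y_test)

# 5-fold cross-validation gives a spread of accuracies instead of a single number
cv_scores = cross_val_score(dtc, X, y, cv=5)

print("Tree accuracy:", tree_acc)
print("Baseline accuracy:", baseline_acc)
print("Cross-validated accuracy:", cv_scores.mean(), "+/-", cv_scores.std())
```

Comparing against the baseline shows how much of the accuracy comes from the features rather than from one class simply being more common, which matters when the sexes are not equally represented in the data.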
The Olympic Games are known to garner large viewing audiences, as people cheer on some or all of the international athletes competing in a wide range of sporting events. You may even be one of those viewers! If you are interested in the Olympics, whose multifaceted nature lends itself to intriguing and informative analyses of society, people, and sports, there is still an abundance of data left to explore. Additionally, with the Olympics occurring every four years, there will be future influxes of Olympic athlete data to be collected, wrangled, and analyzed.
We were able to learn and apply many data science principles and processes as we worked through the pipeline that consisted of data curation and management, exploratory data analysis, hypothesis testing and machine learning, and prose to provide context and insights. Ultimately, we were able to deduce that the demographics and characteristics of athletes who have competed in the Olympics during the time period of 1896 to 2016 changed considerably. The visualizations of the data, computations, and analyses further reveal how global and societal happenings impacted the Olympics. In analyzing the course of the Olympics' history, we were able to see a trend of increased diversity and representation. It will be interesting to see how trends continue or change course in the future.