From Ancient Greece to Your Screen: How the Olympics Have Developed Over Time

Emily Chang and Vidushi Vashist


The Olympic Games are highly anticipated international sporting events normally held every four years. While the first Olympic Games are traditionally dated to 776 BC, the inaugural Games of the modern Olympics took place more recently in the year 1896. The Olympics encourage cultural exchanges and foster a diverse community; the Games create an opportunity for athletes from around the world to come together to compete and represent their countries in a wide range of sporting events. The Olympics have developed over time to become more inclusive of many different groups, and allow for different parts of the world to showcase their athletes' aptitude in a multitude of sports. From the inaugural modern Olympic Games of all male athletes to the current Olympic Games of a diverse mix of athletes including women, refugees, and people of different backgrounds and abilities, the Olympics have evolved immensely over the past 124 years.

In this tutorial, we aim to provide valuable insight into the evolution of the Olympic Games, as well as a further understanding of how global social, political, and economic changes have impacted various aspects of the Olympics. Taking a look at Olympic data from the past century, we will see how athlete demographics, representation, and medal leaders have changed or remained the same over time. Our original dataset contains information spanning 120 years of Olympic events; this historical dataset was scraped and then posted online as a csv file. We wanted, however, to focus on a smaller subset of countries to gather more meaningful analyses; the following parts of the tutorial will look at the countries, which cover four different continents, in the top ten medal winners up to 2016.

The libraries we used have been imported below, and include pandas, numpy, matplotlib, seaborn, folium, sklearn, and more.

Getting Started with the Data

Data Collection and Wrangling

First, the data is read from the csv file into a DataFrame using Pandas. Each row represents an athlete that competed in a specific event in the Olympics. The first five entries of the raw data can be seen below.
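As a minimal sketch of this loading step, the snippet below reads a tiny made-up sample instead of the real file. The column names (Name, Sex, Age, Height, Weight, NOC, Year, Season, Sport, Medal) and the helper `load_olympics` are assumptions for illustration and may not match the actual csv.

```python
import io

import pandas as pd

# Made-up rows standing in for the real csv file; column names are assumed.
SAMPLE_CSV = """Name,Sex,Age,Height,Weight,NOC,Year,Season,Sport,Medal
Athlete A,M,24,180,80,USA,1900,Summer,Athletics,Gold
Athlete B,F,21,,,FRA,2000,Summer,Fencing,
Athlete C,M,30,175,70,JPN,2016,Summer,Judo,Bronze
"""


def load_olympics(source):
    """Read the csv (a file path or file-like object) into a DataFrame."""
    return pd.read_csv(source)


raw_df = load_olympics(io.StringIO(SAMPLE_CSV))
print(raw_df.head())  # first five rows (only three exist in this sample)
```

With the real data, a file path would be passed to `load_olympics` in place of the `io.StringIO` object. Empty fields, such as a missing height or medal, are read in as NaN.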

The data contains many rows with NaN values, which can be seen even in the first five rows of the table displayed above. The NaN values result from Olympic athletes who participated and did not win a medal, as well as incomplete data on athletes where height or weight are missing. For our goals and purposes, the dataset is also far too large to perform meaningful computations and analyses on. Thus, we begin to tidy the data. We create a new DataFrame called top_df that drops all rows containing NaN values, and only includes data on the top 11 countries in terms of number of Summer Olympics medals won (according to TIME). The countries include the United States of America, Russia, Great Britain, Germany, France, Italy, China, Sweden, Hungary, Australia, and Japan, which also span four continents. Lastly, we reset top_df's index values to start at 0 and increase indices consecutively. This produces a more desirable dataset to work with for our tutorial.
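The tidying steps described above could look roughly like the following. This is a sketch under assumptions: the country column is taken to be named NOC, and the three-letter codes in `TOP_NOCS` are our guess at how the eleven countries are labeled in the data.

```python
import pandas as pd

# Assumed country codes for the eleven countries listed in the text.
TOP_NOCS = ["USA", "RUS", "GBR", "GER", "FRA", "ITA",
            "CHN", "SWE", "HUN", "AUS", "JPN"]


def tidy(raw_df, nocs=TOP_NOCS):
    """Drop rows with any NaN, keep the selected countries, reset the index."""
    kept = raw_df.dropna()
    kept = kept[kept["NOC"].isin(nocs)]
    return kept.reset_index(drop=True)


# Made-up rows for illustration only.
raw_df = pd.DataFrame({
    "NOC": ["USA", "BRA", "FRA", "JPN", "CHN"],
    "Height": [180.0, 170.0, None, 165.0, 172.0],
    "Medal": ["Gold", "Silver", "Bronze", None, "Gold"],
})
top_df = tidy(raw_df)
print(top_df)
```

Note that because non-medalists have NaN in the medal column, dropping every row with a NaN also restricts the data to medal winners.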

Now, we intend to narrow our focus further, which leads us to create another subset of the data. The new DataFrame contains data only from the beginning of the 20th century and the beginning of the 21st century. This subset allows us to directly compare and contrast Olympic data from the earliest and the most recent modern Olympic Games. To create the new DataFrame, we remove rows pertaining to years outside of the ranges 1896-1912 and 2000-2016. Moreover, we look at only the Summer Olympics, thus removing rows pertaining to the Winter Olympics.
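One way to express this filter is with boolean masks, as sketched below. The column names Year and Season and the helper `early_and_recent` are assumptions for illustration.

```python
import pandas as pd


def early_and_recent(df):
    """Keep Summer Games rows from 1896-1912 or 2000-2016 (both inclusive)."""
    early = df["Year"].between(1896, 1912)
    recent = df["Year"].between(2000, 2016)
    summer = df["Season"] == "Summer"
    return df[(early | recent) & summer].reset_index(drop=True)


# Made-up rows for illustration only.
sample = pd.DataFrame({
    "Year": [1896, 1912, 1936, 2002, 2016],
    "Season": ["Summer", "Summer", "Summer", "Winter", "Summer"],
})
subset_df = early_and_recent(sample)
print(subset_df)
```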

Exploratory Data Analysis

Map Overview of Countries of Focus

To provide a visualization and overview of the data we are working with, we created a world map using Folium. Olympic athletes travel across ocean and land to participate in the high-stakes Games; they have prepared and trained incredibly hard for years to reach the point of competing at the Olympic level.

The map marks each country's capital by longitude and latitude, and displays additional information in the pop-ups. The viewer is able to easily see the various locations of countries in our focus on the map as well as click on the markers to obtain the exact location name and a specific country's total medal count from 1896 to 2016. Additionally, the markers for the countries are distinguished from one another through the assignment of different colors that correspond to a specific continent; red represents North America, blue represents Europe, green represents Asia, and orange represents Australia.
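A map like the one described can be sketched with Folium's `Map`, `Marker`, and `Icon` classes. Everything in `PLACES` is illustrative: the coordinates are approximate, the medal totals are placeholders rather than real counts, and the helper names are hypothetical.

```python
# Marker color per continent, following the scheme described in the text.
CONTINENT_COLORS = {
    "North America": "red",
    "Europe": "blue",
    "Asia": "green",
    "Australia": "orange",
}

# Approximate capital coordinates; "medals" is a placeholder, not a real total.
PLACES = [
    {"country": "United States of America", "capital": "Washington, D.C.",
     "continent": "North America", "lat": 38.9, "lon": -77.0, "medals": 0},
    {"country": "France", "capital": "Paris",
     "continent": "Europe", "lat": 48.9, "lon": 2.35, "medals": 0},
    {"country": "Japan", "capital": "Tokyo",
     "continent": "Asia", "lat": 35.7, "lon": 139.7, "medals": 0},
    {"country": "Australia", "capital": "Canberra",
     "continent": "Australia", "lat": -35.3, "lon": 149.1, "medals": 0},
]


def popup_text(place):
    """Text shown when a marker is clicked."""
    return f'{place["capital"]} ({place["country"]}), total medals: {place["medals"]}'


def build_map(places):
    """Return a folium.Map with one colored marker per country capital."""
    import folium  # imported here so the helpers above work without folium

    world = folium.Map(location=[20, 0], zoom_start=2)
    for place in places:
        folium.Marker(
            location=[place["lat"], place["lon"]],
            popup=popup_text(place),
            icon=folium.Icon(color=CONTINENT_COLORS[place["continent"]]),
        ).add_to(world)
    return world
```

Calling `build_map(PLACES)` in a notebook cell would display the map, assuming Folium is installed; the medal placeholders would be replaced with totals computed from the dataset.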

Height, Weight, and Age Distribution Over Time

We begin by measuring and analyzing a few variables over time. In the following graphs, the heights, weights, and ages of Olympians are plotted over the time period 1896 to 2016. Over the course of the past century, advances in equality, scientific discoveries, and technologies have resulted in a more diverse and wide-ranging group of Olympic athletes. The visualizations of the height, weight, and age of Olympians over time reveal several significant trends.
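Scatterplots of this kind could be produced along the following lines. This is only a sketch: the column names and the made-up rows are assumptions, and the `Agg` backend is selected so the code also runs without a display.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend; drop this line in a notebook
import matplotlib.pyplot as plt
import pandas as pd


def plot_over_time(df, columns=("Height", "Weight", "Age")):
    """Draw one scatterplot of each column against Year, side by side."""
    fig, axes = plt.subplots(1, len(columns), figsize=(15, 4))
    for ax, col in zip(axes, columns):
        ax.scatter(df["Year"], df[col], s=10, alpha=0.5)
        ax.set_xlabel("Year")
        ax.set_ylabel(col)
        ax.set_title(f"{col} vs. Year")
    fig.tight_layout()
    return fig, axes


# Made-up rows for illustration only.
sample = pd.DataFrame({
    "Year": [1896, 1900, 2000, 2016],
    "Height": [170.0, 180.0, 160.0, 195.0],
    "Weight": [65.0, 80.0, 50.0, 100.0],
    "Age": [22, 25, 19, 34],
})
fig, axes = plot_over_time(sample)
```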

Before we discuss the meaningful relationships between height, weight, age, and time revealed in the above scatterplots created using matplotlib pyplot, you may have noticed that all three plots have zero points for the Olympic years of 1916, 1940, and 1944. These are the only Olympics in the period covered by our data to have been cancelled, all due to war. World War I led to the cancellation of the 1916 Olympics, and World War II led to the cancellation of two Olympics, those of 1940 and 1944. You can read more about when world events disrupted the Olympic Games here.

Now, let's take a look at the individual graphs. Overall, we can see that there are significant increases in data points per year over the course of the time period of 1896 to 2016.

Looking at the height vs. year plot, we can see that the range of heights increases over time. In the early 1900s, Olympians' heights range from about 155 to 200 cm (a difference of 45 cm); in the early 2000s, they range from about 135 to 215 cm (a difference of 80 cm). The height range has expanded in both the taller and shorter directions. This may be attributed to several factors. Firstly, social, technological, and economic changes have resulted in a more diverse human population; moreover, with a significantly larger population than a century ago, there are inherently more people with different body types and heights. The increased focus on training and supplementation may also have played a role in human growth in areas such as height. Secondly, over the course of Olympic history, more countries and individuals have competed in the Olympics. The greater representation and inclusivity (for example, allowing women or refugees to participate) have impacted the height distribution of Olympic athletes over time; the average height of men is greater than the average height of women, and the average height of a group of people from one country or ethnicity may differ from that of another. Lastly, different sporting events lend themselves to competitors of certain heights. For example, the average height of gymnasts differs significantly from the average height of basketball players. The changes and increase in the number of sporting events included in the Olympics over the past century affect who competes, and thus, the overall height of athletes.

The weight vs. year plot also conveys a trend of an increasing range of values over time. In the early 1900s, Olympians' weights range from about 42 to 120 kg (a difference of 78 kg); in the early 2000s, they range from about 37 to 140 kg (a difference of 103 kg). While the range of weights has also expanded in both directions, lighter and heavier, the relative expansion is smaller than that of the height measurements. The trend can also be attributed to the points made above regarding height distribution over time.

Looking at the age vs. year plot, we can see that the range of ages generally oscillates from one Games to the next. However, the minimum and maximum ages do change significantly from 1896 to 2016. A greater number of older and younger athletes have competed in the Olympics since 1896. In 1896, the ages range from about 20 to 27; in 2016, the ages range from about 15 to 57. Given that there is no age limit for competitors in the Olympic Games, it is likely that the trend toward greater numbers of both older and younger athletes will continue.
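The ranges quoted above were read off the plots, but they can also be computed directly. The sketch below groups by year and takes the minimum, maximum, and their difference; the helper `yearly_range` is hypothetical, and the sample rows simply reuse the approximate height figures from the text.

```python
import pandas as pd


def yearly_range(df, column):
    """Minimum, maximum, and range of a column for each Olympic year."""
    stats = df.groupby("Year")[column].agg(["min", "max"])
    stats["range"] = stats["max"] - stats["min"]
    return stats


# Illustrative rows built from the approximate ranges mentioned in the text.
sample = pd.DataFrame({
    "Year": [1900, 1900, 2000, 2000, 2000],
    "Height": [155.0, 200.0, 135.0, 180.0, 215.0],
})
height_range = yearly_range(sample, "Height")
print(height_range)
```

The same helper applies unchanged to the weight and age columns.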

Ultimately, we may draw the conclusion that the bodies of Olympic athletes are becoming more differentiated from one another. Athletes' success in various sporting events is in part achieved through their specialized and more extreme physiques, aided by social and technological progress.

Male and Female Olympic Athletes Over Time

Gender and sex equality in sports has been a topic of discussion since the beginning years of the Olympic Games. While it remains an issue today, athletes and organizations are helping to pave the way for gender equality in sports. The International Olympic Committee (IOC) has been actively promoting the advancement of gender equality in the modern Olympic Movement and beyond since the 1900s. The IOC has stated that "sport is one of the most powerful platforms for promoting gender equality and empowering women and girls" (Olympics). Across the globe, women and girls face discrimination on the basis of sex and gender. Analyzing the Olympic data presents an opportunity to see some form of progress in the fight for gender equality, especially in sports, in the past century; however, the fight against sex and gender discrimination is still one that women and girls around the world continue to battle every day.

The dataset we are currently working with consists of data on athletes who won a medal at a Summer Olympic Games representing a country in the top ten medal leaders in the world from 1896 to 2016. We divided the data into the beginning Olympic years (before 1950) and latter Olympic years (1950 and later). We then created two pie charts using matplotlib pyplot for the beginning and latter Olympic years to show the proportion of male medal winners to female winners. The pie charts convey a part-to-whole relationship that clearly shows that male athletes have dominated the medal counts in comparison to female athletes. Comparing the two pie charts to one another, we can also see that the difference in percentage of male and female medal winners in the earlier Olympic years is immense; while the difference is smaller in the plot depicting the later Olympic years, it is clear that the medal counts for the two sexes are not close to equal. The main factor is that throughout the course of Olympic history, there has been a smaller percentage of female athletes. Increased accessibility and encouragement for women and girls to participate in sports have helped narrow the gap; however, it still very much exists as female athletes continue to face challenges and oppression around the world.
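A sketch of the split and the two pie charts follows. The helper names, the column names Year and Sex, and the sample rows are all assumptions made for illustration.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend; drop this line in a notebook
import matplotlib.pyplot as plt
import pandas as pd


def sex_shares(df, split_year=1950):
    """Share of each sex among medalists before and from split_year onward."""
    early = df.loc[df["Year"] < split_year, "Sex"].value_counts(normalize=True)
    late = df.loc[df["Year"] >= split_year, "Sex"].value_counts(normalize=True)
    return early, late


def plot_sex_pies(early, late, split_year=1950):
    """Draw one pie chart per era, labeled with percentage shares."""
    fig, axes = plt.subplots(1, 2, figsize=(10, 5))
    titles = (f"Before {split_year}", f"{split_year} and later")
    for ax, shares, title in zip(axes, (early, late), titles):
        ax.pie(shares.values, labels=list(shares.index), autopct="%1.1f%%")
        ax.set_title(title)
    return fig, axes


# Made-up medalist rows for illustration only.
sample = pd.DataFrame({
    "Year": [1900, 1904, 1908, 1912, 2000, 2016],
    "Sex": ["M", "M", "M", "F", "M", "F"],
})
early_shares, late_shares = sex_shares(sample)
fig, axes = plot_sex_pies(early_shares, late_shares)
```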

Next, we create scatterplots using matplotlib pyplot that display the number of female and male medalists over time. We use Python dictionaries to first store the key-value pairs of a specific year and the female/male count before creating the plots. To enhance the analysis, we also added regression lines to both graphs that show the rate of increase in female/male medalist count over time. We can see that both plots have a general increasing trend during the time period of 1896 to 2016. This is reasonable, as the Olympics generally had more competitors over time, thus resulting in more medalists. The more revealing aspect comes from the slope (the m value) of the regression line. For the Number of Female Medalists Over Time graph, the slope of the line is 4.717; for the Number of Male Medalists Over Time graph, the slope of the line is 3.316. The values show that the number of female medalists increased at a faster rate over time compared to male medalists. This aligns with the fact that the percentage of female competitors increased over time due to social movements and greater sex and gender equality globally. The Olympics began with only male competitors, but have steadily fostered a more diverse and inclusive community of athletes of both sexes. The boxplot of sex vs. year of athletes further conveys the differing distribution of female and male athletes over time. The whiskers of the boxplot representing males stretch farther than those of the one representing females, since female athletes were either not allowed to participate in the early Games or were participating in small numbers. The quartiles and median mark also convey how female athletes have only more recently medaled in greater numbers at the Olympics.
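The counting and line-fitting steps might look like the sketch below, which stores counts in a dictionary keyed by year and fits a straight line with `numpy.polyfit`. Whether the original analysis used `polyfit` is not stated, so treat that choice, the helper names, and the sample records as assumptions.

```python
import numpy as np


def medalists_per_year(records, sex):
    """Count medalists of one sex per year; records are (year, sex) pairs."""
    counts = {}
    for year, athlete_sex in records:
        if athlete_sex == sex:
            counts[year] = counts.get(year, 0) + 1
    return counts


def fit_line(counts):
    """Least-squares line through (year, count); returns (slope, intercept)."""
    years = sorted(counts)
    values = [counts[year] for year in years]
    slope, intercept = np.polyfit(years, values, 1)
    return slope, intercept


# Made-up records for illustration only.
records = [
    (1900, "F"), (1900, "M"), (1900, "M"),
    (1904, "F"), (1904, "F"), (1904, "M"),
    (1908, "F"), (1908, "F"), (1908, "F"),
]
female_counts = medalists_per_year(records, "F")
female_slope, female_intercept = fit_line(female_counts)
```

The slope is the m value discussed above: the average change in medalist count per calendar year.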

Regression Model for Our Data

Building upon our previous computations and analyses, we now create a regression model for the data in top_df to analyze athletes' sex and time over the course of Olympic history, and to test whether there is a significant relationship between the two variables. Using statsmodels, we fit a linear regression model and use its summary to further analyze the data. Ordinary least squares (OLS) is a common technique for fitting linear regression models.
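To make the quantities in an OLS summary concrete, the sketch below fits the same kind of model with plain NumPy rather than statsmodels and computes R-squared by hand. It assumes sex is coded as 0 (male) or 1 (female) and regressed on year; the coding, the helper `ols_fit`, and the sample values are illustrative assumptions.

```python
import numpy as np


def ols_fit(x, y):
    """Fit y = intercept + slope * x by least squares and report R-squared."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    design = np.column_stack([np.ones_like(x), x])  # intercept column + x
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    fitted = design @ coef
    ss_res = np.sum((y - fitted) ** 2)    # unexplained variation
    ss_tot = np.sum((y - y.mean()) ** 2)  # total variation
    return {
        "intercept": coef[0],
        "slope": coef[1],
        "r_squared": 1.0 - ss_res / ss_tot,
    }


# Made-up data: year of each medal and whether the medalist was female (1) or male (0).
years = [1900, 1900, 1950, 1950, 2000, 2000]
is_female = [0, 0, 0, 1, 1, 1]
result = ols_fit(years, is_female)
print(result)
```

R-squared is the fraction of the variation in the response that the fitted line accounts for, which is how the percentage in a regression summary should be read.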

In the above regression model summary, there are several important values to take note of. First, we see that R-squared is 0.367. In percentage terms, this means that our model explains 36.7% of the variability in our sex variable. We can also look at the value for Prob (F-statistic), which is the p-value for the null hypothesis that the regression coefficients are all zero. Since the value is 0.00, a result this strong would be very unlikely if the null hypothesis were true. Lastly, the p-value of the coefficient is 0.000; since it is less than 0.05, the parameter is significantly different from zero.

Distribution of Ages In Beginning Olympic Years vs Latter Olympic Years

As seen in an earlier analysis, the range of ages of Olympic athletes has widened over time. Here, we continue to look at the distribution of ages, focusing on comparing the beginning (first five) and latter (most recent five) Olympic years.

First, we created two violin plots using matplotlib pyplot: one displaying the distribution of age over time for the time period 1896-1910, and the other displaying the distribution of age over time for the time period 2000-2016. Looking at the beginning years plot, we can see that only two of the five Olympic years in focus have data after our tidying process. The range in data is fairly small for both the 1896 and 1900 data. More importantly, we can see that the median age of athletes in 1896 is 23, with greater frequency around ages 21 and 26. The median age of athletes in the 1900 Olympics is also 23; in this case, the violin is unimodal, and shows that the greatest frequency is at age 23. Now, looking at the latter years plot, we can see that the median age of athletes for all five Olympics is about 26. This is higher than the median for the beginning Olympic years, which is understandable since, as we have seen previously, the range of ages of Olympic athletes has widened over time. We can also see that the five violins are unimodal, with the greatest frequency appearing around the median age of 26. Given the evolving mindsets, goals, and abilities of Olympic athletes, as well as the new training and supplementation regimens that recent decades have allowed for, the distributions of ages seen in the following graphs make sense.
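A violin plot with one violin per Olympic year can be sketched with matplotlib's `Axes.violinplot`. The helper `age_violins`, the column names, and the ages below are made up for illustration.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend; drop this line in a notebook
import matplotlib.pyplot as plt
import pandas as pd


def age_violins(df):
    """Draw one violin of athlete ages per year, with the median marked."""
    years = sorted(df["Year"].unique())
    groups = [df.loc[df["Year"] == year, "Age"].to_numpy() for year in years]
    positions = list(range(len(years)))
    fig, ax = plt.subplots()
    ax.violinplot(groups, positions=positions, showmedians=True)
    ax.set_xticks(positions)
    ax.set_xticklabels([str(year) for year in years])
    ax.set_xlabel("Year")
    ax.set_ylabel("Age")
    return fig, ax


# Made-up ages for illustration only.
sample = pd.DataFrame({
    "Year": [2000] * 5 + [2016] * 5,
    "Age": [22, 24, 26, 28, 30, 20, 25, 26, 27, 33],
})
fig, ax = age_violins(sample)
medians = sample.groupby("Year")["Age"].median()
```

The width of each violin reflects how many athletes fall near a given age, so a single bulge indicates a unimodal distribution.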

The distplot created using seaborn combines the beginning and latter years data to present a visualization of the density of data at different ages. It is apparent that a large number of Olympic athletes fall in the age range of early to late 20s, which has the highest density. Overall, we can see that the majority of competitors are between their mid-teens and 40 years old. To make this more explicit, we created a bar graph that assigns athletes to their corresponding age groups. Athletes under 18 belong to the teen age group, those aged 18-24 to the youth age group, those aged 25-54 to the adult age group, and those aged 55 and up to the senior age group. The bar graph, created using matplotlib pyplot, reveals that the largest age group is indeed adults, followed closely by the youth age group. In comparison to the other age groups in our data, the teen age group is much smaller; there are also no seniors.
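The age-group assignment can be written compactly with `pandas.cut`, as in this sketch. The bin edges follow the group definitions above; the helper name and the sample ages are made up.

```python
import pandas as pd

# Left-closed bins: [0, 18) teen, [18, 25) youth, [25, 55) adult, [55, inf) senior.
AGE_BINS = [0, 18, 25, 55, float("inf")]
AGE_LABELS = ["teen", "youth", "adult", "senior"]


def age_group(ages):
    """Map each age to its group label."""
    return pd.cut(ages, bins=AGE_BINS, labels=AGE_LABELS, right=False)


# Made-up ages chosen to sit on the group boundaries.
ages = pd.Series([15, 18, 24, 25, 54, 57])
groups = age_group(ages)
group_counts = groups.value_counts()
print(group_counts)
```

Passing `right=False` makes each bin include its lower edge, so an 18-year-old counts as youth and a 25-year-old as an adult. The resulting counts can be handed directly to a bar chart.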

Machine Learning

Now, let's determine whether we can find a predictive relationship between sex and age, height, and weight. After creating a decision tree classifier using entropy as its criterion, performing holdout validation on the algorithm, conducting the prediction, and then measuring the accuracy, we see that the prediction resulted in an accuracy value of 0.7820; this means that the decision tree algorithm had a 78.20% test accuracy. While a prediction's accuracy closer to 100% would be more desirable, the accuracy is still fairly good at 78.20%.
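The workflow described above can be sketched with scikit-learn as follows. It assumes sex is the label being predicted from age, height, and weight. The data here are randomly generated for illustration, so the resulting accuracy says nothing about the real dataset, and the split size and random seed are arbitrary choices.

```python
import numpy as np
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier


def holdout_accuracy(features, labels, test_size=0.25, seed=0):
    """Train an entropy-based decision tree and score it on a held-out split."""
    x_train, x_test, y_train, y_test = train_test_split(
        features, labels, test_size=test_size, random_state=seed, stratify=labels
    )
    clf = DecisionTreeClassifier(criterion="entropy", random_state=seed)
    clf.fit(x_train, y_train)
    return accuracy_score(y_test, clf.predict(x_test))


# Synthetic athletes: columns are age (years), height (cm), weight (kg).
rng = np.random.default_rng(0)
n = 200
male = np.column_stack([rng.normal(26, 5, n), rng.normal(180, 7, n), rng.normal(78, 10, n)])
female = np.column_stack([rng.normal(25, 5, n), rng.normal(167, 6, n), rng.normal(60, 8, n)])
features = np.vstack([male, female])
labels = np.array(["M"] * n + ["F"] * n)

accuracy = holdout_accuracy(features, labels)
print(f"test accuracy: {accuracy:.4f}")
```

Holding out a test split matters because a fully grown decision tree can fit its training data almost perfectly, which would make training accuracy a misleading measure.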