Baseball and Integration: the Effects of MLB Integration on the Negro Leagues
L. Colby Bogie
Abstract
In 1947, Jackie Robinson became the first African-American to play Major League Baseball in the modern era. Before Robinson broke MLB’s color barrier, non-white baseball players competed in alternative professional baseball leagues, known collectively as the Negro Leagues. Using a Negro Leagues dataset from the website Retrosheet, I examined the Negro League seasons immediately before and immediately after Robinson’s debut in order to see how the integration of MLB affected Negro League baseball. I found that the scoring environment (total runs scored per game) remained relatively stable over these seasons, but attendance dropped precipitously in the 1948 and 1949 Negro League seasons.
Motivation
In December 2020, Major League Baseball announced that it was reclassifying certain Negro League seasons as “major league baseball,” opening the door for the statistics of non-white professionals from the early 20th century to take their place alongside the statistics of white players like Babe Ruth and Ty Cobb. MLB chose 1948 as the final season of Negro League baseball that would be considered “major league” quality. Some Negro Leagues continued to play for a few seasons after 1948, but during that time, more and more of the best black players were leaving the Negro Leagues to join MLB teams. My goal was to look at the statistical record of the Negro Leagues both before and after 1948 in order to determine whether there is a noticeable difference in Negro League baseball before and after the integration of MLB.
Dataset(s)
I downloaded the Negro League datasets available at the baseball data website Retrosheet. Retrosheet is dedicated to recovering, preserving, and presenting as much accurate historical baseball data as possible. In the case of the Negro Leagues, record keeping was spotty, so this dataset contains many gaps and estimations, which I tried to filter out during my data cleaning. There’s a tremendous amount of granular, game-by-game data in the Retrosheet dataset, but for simplicity’s sake, I decided to look at two relatively simple pieces of data: total runs scored per regular season game and attendance per regular season game.
Data Preparation and Cleaning
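This was by far the most time-consuming part of my project. I had to separate the data into regular season datasets for 1946, 1947, 1948, and 1949, filter out the rows with zero or missing (“NaN”) attendance, and strip characters such as “<” and “?” from the estimated attendance entries.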
Research Question(s)
How did the run-scoring environment (i.e., total runs per game) and average attendance in the Negro Leagues change in the seasons after Jackie Robinson broke the MLB color barrier in 1947?
Methods
Once I had cleaned the data and separated it into four regular season datasets for 1946, 1947, 1948, and 1949, I plotted the changes in runs per game and attendance using two methods: line plots of the average values and box-and-whisker plots that also showed the variability and distribution of the data.
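As a rough illustration of the numbers behind those two kinds of plots, here is a minimal sketch using only the Python standard library. It is not the code used for this project: the game records are made up, the field names (`season`, `visitor_runs`, `home_runs`, `attendance`) are hypothetical, and the quartile method is an assumption. The mean is the value a line plot of averages would show, and the quartiles are the values a box-and-whisker plot would draw.

```python
from statistics import mean, quantiles

# Made-up game records for illustration only; these are NOT Retrosheet data,
# and the field names are hypothetical.
games = [
    {"season": 1946, "visitor_runs": 5, "home_runs": 3, "attendance": 6000},
    {"season": 1946, "visitor_runs": 2, "home_runs": 7, "attendance": 9000},
    {"season": 1946, "visitor_runs": 4, "home_runs": 4, "attendance": 12000},
    {"season": 1946, "visitor_runs": 6, "home_runs": 1, "attendance": 3000},
    {"season": 1949, "visitor_runs": 3, "home_runs": 5, "attendance": 2000},
    {"season": 1949, "visitor_runs": 4, "home_runs": 4, "attendance": 1500},
    {"season": 1949, "visitor_runs": 6, "home_runs": 3, "attendance": 2500},
    {"season": 1949, "visitor_runs": 2, "home_runs": 6, "attendance": 2000},
]


def summarize(values):
    """Return the mean plus the five numbers a box plot is built from."""
    # The "inclusive" quartile method is an assumption; plotting libraries
    # may interpolate quartiles slightly differently.
    q1, median, q3 = quantiles(values, n=4, method="inclusive")
    return {
        "mean": mean(values),
        "min": min(values),
        "q1": q1,
        "median": median,
        "q3": q3,
        "max": max(values),
    }


def by_season(records, value_fn):
    """Group one value per game by season, then summarize each season."""
    grouped = {}
    for record in records:
        grouped.setdefault(record["season"], []).append(value_fn(record))
    return {season: summarize(vals) for season, vals in sorted(grouped.items())}


# Total runs per game is the sum of both teams' scores.
runs_summary = by_season(games, lambda g: g["visitor_runs"] + g["home_runs"])
attendance_summary = by_season(games, lambda g: g["attendance"])

for season, stats in runs_summary.items():
    print(season, "runs per game:", stats)
for season, stats in attendance_summary.items():
    print(season, "attendance:", stats)
```

Plotting the `mean` values by season would give the line plots, and the remaining values describe the box and whiskers for each season.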
Findings, slide 1: Average Runs Per Game
This chart shows that there was very little change in the overall run environment of the Negro Leagues in the seasons before and after the integration of MLB.
Findings, slide 2: Runs Per Game Box Plot
Displayed as a box plot, the runs-per-game data looks even less variable over this time period: a few outlier games in 1947 appear to be partly responsible for the apparent slight increase in scoring that season.
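To show how a few high-scoring games can pull an average upward, here is a small sketch that flags outliers with the common 1.5 × IQR rule and compares the mean with and without them. The run totals are made up, and the 1.5 × IQR fence is an assumption about how the box plot identifies outliers; it is a widely used convention, not necessarily the one behind the chart.

```python
from statistics import mean, median, quantiles

# Made-up total runs per game for one season; NOT Retrosheet data.
runs_per_game = [7, 8, 8, 9, 9, 10, 10, 11, 30]


def split_outliers(values, k=1.5):
    """Split values into (typical, outliers) using fences at k * IQR."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    typical = [v for v in values if low <= v <= high]
    outliers = [v for v in values if v < low or v > high]
    return typical, outliers


typical, outliers = split_outliers(runs_per_game)

print("outlier games:", outliers)
print("mean with outliers:", round(mean(runs_per_game), 2))
print("mean without outliers:", mean(typical))
print("median:", median(runs_per_game))
```

In this toy example a single extreme game raises the mean by more than two runs while the median stays put, which is the kind of effect a box plot makes visible and a line plot of averages hides.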
Findings, slide 3: Average Attendance Per Game
While runs per game didn’t change much over these seasons, attendance did. Jackie Robinson debuted for the Dodgers on April 15, 1947, at the start of the MLB season; his debut seemed to correlate almost immediately with a marked decline in Negro League attendance.
Findings, slide 4: Attendance Per Game Box Plot
Finally, a box plot of the game-by-game attendance data shows a fuller picture of the attendance decline. In addition to the average decreasing, the highly attended outlier games at the upper range of the 1946 data rapidly became a thing of the past.
Limitations
There are several limitations here, especially on the attendance data, which is incomplete. After I filtered out the zeroes and “NaN” rows in the attendance column, my datasets shrank from about 400 rows to about 200 rows. Also, some of the 1948 attendance data was estimated, as the original entries included characters such as “<” and “?”, which I stripped from the data before performing my analysis. I suspect that the overall trend I showed here would remain true with even more complete data, but I can’t know that for sure.
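The attendance cleaning described above could look something like the following sketch. It is a hypothetical reconstruction rather than the code used for this project: the function name `clean_attendance` and the sample values are invented for illustration, and treating blank entries as missing is an assumption.

```python
import math


def clean_attendance(raw):
    """Return attendance as a positive int, or None if the entry is unusable.

    Hypothetical helper: strips the "<" and "?" markers that flag estimated
    entries, then rejects zero, blank, NaN, and non-numeric values.
    """
    if raw is None:
        return None
    text = str(raw).strip().replace("<", "").replace("?", "")
    try:
        value = float(text)
    except ValueError:
        # Blank or non-numeric entries are treated as missing (an assumption).
        return None
    if not math.isfinite(value) or value <= 0:
        return None
    return int(value)


# Made-up raw entries for illustration; NOT Retrosheet data.
raw_values = ["4200", "<5000", "3000?", "0", "", float("nan"), None, 1800]

cleaned = [v for v in map(clean_attendance, raw_values) if v is not None]
print(cleaned)
print(f"kept {len(cleaned)} of {len(raw_values)} rows")
```

One design choice worth noting: stripping “<” and “?” keeps the estimated entries in the analysis as if they were exact. Recording which rows were estimates before stripping would make it possible to rerun the analysis without them and check whether the trend holds.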
The runs-per-game data seems to be more complete and more reliable, but it’s ultimately a relatively shallow way to evaluate a baseball season. A more in-depth analysis could have looked at strikeout rate, home run rate, and walk rate to draw a more complete picture of the offensive environment of the Negro League over these seasons.
Conclusions
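In the seasons immediately before and immediately after Jackie Robinson’s MLB debut, I found that the run-scoring environment (total runs per game) remained relatively stable, while attendance dropped precipitously in the 1948 and 1949 Negro League seasons.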
Acknowledgements
This analysis was made possible by the hard work of the volunteer researchers at Retrosheet, who comb through old newspapers in order to create the most complete and most reliable statistical record of historical baseball.
I would also like to thank my wife, Amy Bogie, for her perspective and feedback as I worked through the project.
References
The Negro Leagues dataset I used can be downloaded from Retrosheet at the following address: https://www.retrosheet.org/NegroLeagues/NegroLeagues.html