How Machine Learning is Transforming Major League Baseball
In the past few decades, no sport has been impacted by the rise of data analytics and machine learning quite like professional baseball. Since the "Moneyball" era of the early 2000s, when the low-budget Oakland Athletics used advanced statistics to gain a competitive advantage, teams across Major League Baseball have been ramping up their investments in data science and technology in search of an edge.
Today, MLB front offices are filled with analysts poring over complex datasets to help guide decision-making in all facets of the game. By leveraging massive amounts of data collected through technologies like Statcast, which tracks the physics of every pitch and batted ball, teams are attempting to optimize everything from lineup construction and defensive positioning to player development and injury prevention.
At the forefront of baseball's analytics revolution are teams like the Houston Astros, Los Angeles Dodgers, New York Yankees, and Tampa Bay Rays. Despite varying market sizes and payroll constraints, these clubs have found sustained success in recent years thanks in large part to their cutting-edge analytics departments.
The Data Behind the Game
So what kinds of data are these teams analyzing? It starts with traditional statistics like batting average, home runs, ERA, and the like. But thanks to technological advancements, the amount of available data has exploded and teams now have access to much more granular details about what's happening on the field.
With radar and camera systems installed in every MLB stadium, analysts can measure things like:
- Exit velocity: the speed of the baseball as it comes off the bat
- Launch angle: the vertical angle at which the ball leaves the bat
- Spin rate: the rate of spin on a pitched baseball, which affects its movement
- Catch probability: the likelihood that a batted ball will be caught given its trajectory and the positioning of the defense
These metrics, along with dozens more, provide a treasure trove of information that can be used to evaluate players, optimize game strategy, and identify undervalued talent. For example, Statcast classifies a batted ball as a "barrel" when it combines an exit velocity of at least 98 mph with a launch angle in a narrow band (26 to 30 degrees at 98 mph, widening as exit velocity rises), a combination that has historically produced a batting average of at least .500 and a slugging percentage of at least 1.500. Armed with this type of knowledge, hitters can make swing adjustments to achieve the optimal combination of exit velocity and launch angle, while pitchers and defenses look for ways to suppress hard contact.
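To make the link between exit velocity, launch angle, and outcomes concrete, here is a minimal sketch that computes the home run rate inside a chosen exit velocity and launch angle window. The `BattedBall` class and `home_run_rate` function are hypothetical helpers written for this illustration, and the six batted balls are made-up values, not real Statcast data. The field names `launch_speed` and `launch_angle` mirror the column names Statcast exports use.

```python
from dataclasses import dataclass


@dataclass
class BattedBall:
    launch_speed: float  # exit velocity in mph
    launch_angle: float  # vertical launch angle in degrees
    event: str           # outcome label, e.g. "home_run" or "field_out"


def home_run_rate(balls, min_speed, min_angle, max_angle):
    """Return the share of batted balls inside the window that were home runs.

    Returns None when no batted ball falls inside the window.
    """
    in_window = [
        b for b in balls
        if b.launch_speed >= min_speed and min_angle <= b.launch_angle <= max_angle
    ]
    if not in_window:
        return None
    return sum(b.event == "home_run" for b in in_window) / len(in_window)


# Made-up sample for illustration only (not real Statcast data)
sample = [
    BattedBall(104.2, 28.0, "home_run"),
    BattedBall(99.1, 31.0, "field_out"),
    BattedBall(96.5, 26.0, "double"),
    BattedBall(101.7, 12.0, "single"),
    BattedBall(88.3, 30.0, "field_out"),
    BattedBall(107.9, 27.5, "home_run"),
]

rate = home_run_rate(sample, min_speed=95.0, min_angle=25.0, max_angle=35.0)
print(f"Home run rate in window: {rate:.3f}")
```

Running the same function over a full season of real batted balls, and sliding the window around, is one simple way to see how quickly outcomes change with small differences in contact quality.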
Machine Learning Use Cases
So how are MLB teams putting all this data to use? Let's look at a few specific applications of machine learning in baseball:
Projecting Player Performance
One of the most important and challenging tasks facing MLB front offices is trying to predict how players will perform in the future. Teams use machine learning algorithms to analyze past data and identify trends and patterns that can help inform expectations for a player's production going forward. This is especially critical for things like free agent signings and contract extensions, where accurately projecting a player's value can mean the difference between a good investment and a bad one.
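Team projection models are proprietary, so the following is only a baseline sketch of the general idea: weight recent seasons more heavily, then pull the result toward the league average to account for limited sample size. It is loosely inspired by simple public projection systems such as Marcel. The function name `project_rate`, the 5/4/3 season weights, the 1,200 plate appearances of regression, and the player numbers are all assumptions chosen for illustration.

```python
def project_rate(seasons, league_rate, regression_pa=1200, weights=(5, 4, 3)):
    """Project a per-plate-appearance rate (e.g. home runs per PA).

    seasons: list of (successes, plate_appearances), most recent season first.
    league_rate: league-average rate used as the regression target.
    regression_pa: how many league-average plate appearances to blend in
                   (an illustrative default, not a tuned value).
    """
    weighted_successes = 0.0
    weighted_pa = 0.0
    for (successes, pa), weight in zip(seasons, weights):
        weighted_successes += weight * successes
        weighted_pa += weight * pa
    # Blend in league-average performance to regress toward the mean
    return (weighted_successes + regression_pa * league_rate) / (weighted_pa + regression_pa)


# Hypothetical hitter: (home runs, plate appearances) for the last three seasons
recent_seasons = [(30, 600), (25, 550), (20, 500)]
projected = project_rate(recent_seasons, league_rate=0.03)
print(f"Projected home runs per plate appearance: {projected:.4f}")
```

The projection lands between the hitter's weighted historical rate and the league rate, which is the intended behavior: the fewer plate appearances a player has, the more the estimate leans on the league average. A machine learning model would typically add features such as age, batted-ball quality, and park effects on top of a baseline like this.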
Optimizing Lineups and Matchups
Another key application of machine learning is in the realm of game strategy. By analyzing historical data on batter-pitcher matchups, platoon splits, and other factors, teams can gain an edge by optimizing their lineups and making data-driven decisions about when to make substitutions. Some teams have even started using machine learning models during games to help guide in-game tactical moves.
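One classic, publicly known way to estimate a single batter-pitcher matchup is the log5 formula, which combines the batter's rate, the pitcher's rate allowed, and the league rate. The sketch below applies it to a pinch-hitting decision. It is not a description of any team's in-game model, and the player names and on-base figures are hypothetical.

```python
def log5(batter_rate, pitcher_rate, league_rate):
    """Estimate the probability of an outcome in a specific matchup.

    batter_rate: how often the batter achieves the outcome (e.g. reaching base).
    pitcher_rate: how often the pitcher allows that outcome.
    league_rate: league-wide rate of the outcome.
    """
    success = (batter_rate * pitcher_rate) / league_rate
    failure = ((1 - batter_rate) * (1 - pitcher_rate)) / (1 - league_rate)
    return success / (success + failure)


# Hypothetical bench options with on-base rates against this pitcher's handedness
bench = {"Hitter A": 0.360, "Hitter B": 0.310}
pitcher_obp_allowed = 0.300
league_obp = 0.320

estimates = {
    name: log5(obp, pitcher_obp_allowed, league_obp) for name, obp in bench.items()
}
best = max(estimates, key=estimates.get)

for name, prob in estimates.items():
    print(f"{name}: estimated on-base probability {prob:.3f}")
print(f"Suggested pinch hitter: {best}")
```

A useful sanity check on the formula is that a league-average batter facing a league-average pitcher comes out exactly at the league rate. Real matchup models go further by accounting for pitch mix, park, and small-sample noise in head-to-head history.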
Injury Prevention
Injuries are a huge problem in baseball, costing teams millions of dollars in lost production every year. Machine learning can help by identifying risk factors and warning signs that a player may be prone to injury. By analyzing biomechanical data, workload metrics, and recovery times, teams can take proactive steps to keep their players healthy and on the field.
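As a small illustration of what a workload metric can look like, the sketch below computes an acute-to-chronic workload ratio from daily pitch counts: the average load over the last week divided by the average load over the last four weeks. This is one heuristic from the sports science literature, not a method the article attributes to any MLB team, and the function name, the 7- and 28-day windows, the 1.5 flag threshold, and the pitch counts are all assumptions for illustration.

```python
def acute_chronic_ratio(daily_pitches, acute_days=7, chronic_days=28):
    """Ratio of recent average daily workload to longer-term average workload.

    daily_pitches: pitch counts per day, oldest first.
    Returns None if the pitcher threw no pitches in the chronic window.
    """
    if len(daily_pitches) < chronic_days:
        raise ValueError("need at least chronic_days of history")
    acute = sum(daily_pitches[-acute_days:]) / acute_days
    chronic = sum(daily_pitches[-chronic_days:]) / chronic_days
    if chronic == 0:
        return None
    return acute / chronic


# Hypothetical pitcher: one 90-pitch outing per week for four weeks
steady = ([90] + [0] * 6) * 4
steady_ratio = acute_chronic_ratio(steady)

# Same pitcher, but the most recent week has two heavy outings
spike = steady[:-7] + [110, 0, 0, 0, 100, 0, 0]
spike_ratio = acute_chronic_ratio(spike)

FLAG_THRESHOLD = 1.5  # illustrative cutoff, not a validated medical threshold
for label, ratio in [("steady", steady_ratio), ("spike", spike_ratio)]:
    status = "review workload" if ratio > FLAG_THRESHOLD else "ok"
    print(f"{label}: ratio {ratio:.2f} ({status})")
```

In practice a ratio like this would be one input among many, alongside biomechanical measurements and recovery data, rather than a decision rule on its own.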
Analyzing Baseball Data with pybaseball
For data scientists and baseball enthusiasts looking to do their own analyses, the pybaseball library offers a simple way to get started. pybaseball is a Python package that lets users pull and manipulate data from several public sources, including Baseball Savant (the home of Statcast data), Baseball Reference, and FanGraphs.
With just a few lines of code, pybaseball can pull pitch-level Statcast data (tracking metrics begin with the 2015 season) as well as season-level batting and pitching statistics for players and teams. This data can then be cleaned, visualized, and analyzed using popular tools like pandas, matplotlib, and scikit-learn.
For example, here's how we could use pybaseball to compare a batter's performance against fastballs vs. breaking balls:
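from pybaseball import statcast_batter
# Grab all available Statcast data for Mike Trout from the 2019 season
data = statcast_batter('2019-03-28', '2019-09-29', player_id=545361)
# Filter the data to only include fastballs and breaking balls
# (sinkers 'SI' and knuckle curves 'KC' are included alongside the original codes)
fb_data = data.loc[data.pitch_type.isin(['FF', 'FT', 'SI'])]
brk_data = data.loc[data.pitch_type.isin(['CU', 'KC', 'SL'])]
# Outcomes that count as hits, and outcomes that do not count as at-bats
HITS = ['single', 'double', 'triple', 'home_run']
NON_AB = ['walk', 'intent_walk', 'hit_by_pitch', 'sac_fly', 'sac_bunt', 'catcher_interf']

def batting_average(pitches):
    # `events` is only filled in on the pitch that ends a plate appearance.
    # Approximation: rare events (e.g. baserunning outs) are treated as at-bats.
    outcomes = pitches.events.dropna()
    at_bats = outcomes[~outcomes.isin(NON_AB)]
    return at_bats.isin(HITS).mean()

# Calculate batting average (hits / at-bats) against each pitch type
fb_ba = batting_average(fb_data)
brk_ba = batting_average(brk_data)
print(f"Mike Trout's 2019 batting average vs fastballs: {fb_ba:.3f}")
print(f"Mike Trout's 2019 batting average vs breaking balls: {brk_ba:.3f}")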
This is just a small taste of the kinds of analyses that are possible with pybaseball and machine learning. By digging into the data, we can gain insights into player tendencies, team strengths and weaknesses, league-wide trends, and much more.
The Future of Machine Learning in Baseball
As analytics continue to reshape the game of baseball, it's natural to wonder about the implications for the future of the sport. There's no doubt that teams' increasing reliance on data and machine learning has led to changes in the way the game is played and managed.
In recent years, we've seen a massive uptick in strikeouts, home runs, and defensive shifts as teams look to optimize their outcomes based on what the data tells them is most effective. Some traditionalists argue that this data-driven approach is sucking the life out of the game and making it less entertaining to watch. Others counter that analytics are simply revealing new insights about the sport and that MLB has always been a game of adjustments.
Regardless of where one falls on that debate, it's clear that machine learning is here to stay in baseball. As data capture technologies continue to advance and teams invest more resources into their analytics departments, we can expect to see even more innovation in the years to come. From AI-powered scouting tools and player development systems to real-time win probability models deployed in-game, the possibilities are endless.
One thing is for sure: as long as there is data to be analyzed and a competitive edge to be gained, machine learning will continue to play a major role in America's pastime. The teams that are able to most effectively harness the power of data and turn it into actionable insights will be the ones left standing at the end of October.