# The Backwards K — Baseball Strikeout Looking

The backwards K is normally used to denote a called third strike in a strikeout. It’s typically written on a scorecard. I’ve been looking for the backwards K so I can denote the strikeout looking on Twitter, and I finally found it:

(for unsupported browsers — Chrome)

The easiest way to use this character is to copy and paste the backwards K from above and save it in a note or somewhere else you can routinely copy it from. The character actually comes from Apple’s implementation of the Unicode block for the artificial, Latinized version of the Lisu alphabet. This alphabet contains an upside-down, turned K which looks similar enough to a backwards K that I think it will pass on Twitter.

If you don’t see the backwards K in the block above, your computer or mobile device probably isn’t using a font that supports that specific character. It’s supported on Macs and iPhones (as well as the Edge browser in Windows 10).
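If copying the glyph is inconvenient, the character can also be produced from its code point. This is a minimal sketch that assumes the turned K in question is U+A4D8 in the Lisu block; that code point is not stated in this post, so check the printed Unicode name before relying on it.

```python
import unicodedata

# Assumption: U+A4D8 is the Lisu letter that resembles a backwards K.
BACKWARDS_K = "\ua4d8"


def describe(ch):
    """Return the code point and official Unicode name of a character."""
    return f"U+{ord(ch):04X} {unicodedata.name(ch)}"


print(BACKWARDS_K, describe(BACKWARDS_K))
```

Whether the glyph actually renders still depends on the font support described above.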

References:
http://unicode.org/charts/PDF/UA4D0.pdf
https://en.wikipedia.org/wiki/Fraser_alphabet

# MLB — Pace of Play [Working Post]

This post is a work in progress. The data concerning the pace of play is rather messy, and this project is rather large compared to what I normally tackle. For that reason I’m going to start this post and update it as a ‘working post’. Please feel free to contact me if you have any input: @seandolinar on Twitter.

Having collected the time between pitches from PITCH/fx, I was able to look at the different factors that affect how long pitchers take between pitches. [I’m defining this as the pitch pace.] PITCH/fx has a time stamp associated with each pitch. Using that time stamp, I calculated the time between each pitch, then combined the result with other information available about each at-bat to draw some conclusions about what affects pace of play.
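The calculation itself is just a difference of consecutive time stamps. Here is a minimal sketch, assuming the time stamps are available as ISO-format strings in pitch order; the real PITCH/fx format may differ, and the sample values are made up.

```python
from datetime import datetime


def pitch_gaps(timestamps):
    """Seconds between consecutive pitches, given ISO time stamps in pitch order."""
    times = [datetime.fromisoformat(ts) for ts in timestamps]
    return [(later - earlier).total_seconds()
            for earlier, later in zip(times, times[1:])]


# Hypothetical time stamps for four pitches in one at-bat.
stamps = [
    "2014-08-08T19:05:10",
    "2014-08-08T19:05:28",
    "2014-08-08T19:05:51",
    "2014-08-08T19:06:09",
]
gaps = pitch_gaps(stamps)
print(gaps)
```

If a time stamp was entered late, the corresponding gap comes out negative, so a cleaning step would need to flag those values before computing any summary.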

The most obvious influence on the time between pitches is whether or not there was a baserunner. This was rather simple to explore since PITCH/fx provides information on whether or not there is a runner on 1B, 2B, or 3B. Using this I was able to create the following table of median pitch pace. [I’ll explain why I decided to use the median and not the mean/average later.]

The data matches what your experience with baseball suggests. Pitchers will slow down the game when there is a runner on base. This will happen for several reasons: run-game tactics, conferences on the mound, and even time for the ball to get back to the pitcher after the play. Given the fact there is a slight drop off for when there isn’t an open base or there are two outs, I would conclude that the run-game prevention tactics play a rather significant role in the pitch pace.

The distribution of pitch pace data shows how often pitchers take 5-10 seconds, 10-15 seconds, 15-20 seconds, etc. between pitches. Both distributions are highly skewed right, so the average pitch pace isn’t representative of the central tendency of the data set; the median works a lot better in this situation to describe a typical outcome.
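A small made-up sample shows why the skew matters. The values below are hypothetical pitch paces, not PITCH/fx data: most fall between 14 and 24 seconds, and two long delays drag the mean upward while the median stays put.

```python
from statistics import mean, median

# Hypothetical pitch paces in seconds: mostly 14-24 s, plus two long delays.
paces = [14, 16, 17, 18, 19, 20, 21, 22, 24, 65, 110]

print(f"mean   = {mean(paces):.1f} s")
print(f"median = {median(paces):.1f} s")
```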

The most frequent pitch pace with the bases empty is in the 15-20 second range, while the most frequent pitch pace bumps up to 20-25 seconds when runners are on base. MLB is kicking around the idea of having a 20-second pitch clock. From the distribution, it becomes apparent that keeping the time between pitches under 20 seconds would have an impact on the pace of play.

I created a box plot to show another perspective of the distributions. The mean of the runners on base pitch pace is significantly higher than the mean of the pitch pace with bases empty.

Data Background

PITCH/fx data isn’t designed to accurately measure the time between pitches, so it has some problems. A human operator is needed to enter data on each pitch, such as ball/strike, information about the hit, or whether runs scored. For this reason, the data is very messy. Subtracting the time of the prior pitch from the time of each subsequent pitch sometimes yields negative numbers, because the operator entered the previous pitch after the pitcher threw the next one. For these reasons I have to re-examine how I clean and process the PITCH/fx data.

Further Work

I need to clean the data further. This will include identifying and excluding first pitches from at-bats and aggregating each at-bat. This should alleviate some of the delay problems associated with the human entry component of PITCH/fx.

I want to look at leverage’s impact on the pitch pace. My initial analysis is that leverage doesn’t matter all too much when you consider if there’s a player on base or not since leverage and having a player on base are collinear. With cleaner data the effect of leverage or post season play might be more apparent.

I’m going to look at the time between innings. This should change depending on the broadcast; national broadcasts have longer commercial breaks. There also should be artifacts for weather delays.

Pitching changes should also be included. Inning breaks with new pitchers tend to be longer; it would be nice to see how much longer they are in aggregate.

All of these need to be programmed into a parser that looks at the data sequentially. My plan is to update this page once I have more research available.

# MLB — Run Distribution Per Game & Per Inning — Negative Binomial

This is an extension of an earlier post I wrote about the runs per inning distribution. In this post I use the negative binomial distribution to better model how MLB teams score runs in an inning or in a game. I wrote a primer on the math of the different distributions mentioned in the post for reference.

The Baseball Side

A team in the American League will average .4830 runs per inning, but does this mean they will score a run every two innings? This seems intuitive if you apply math from Algebra I [1 run / 2 innings ~ .4830 runs/inning]. However, if you attend a baseball game, the vast majority of innings you’ll watch will be scoreless. This large number of scoreless innings can be described by discrete probability distributions that account for teams scoring none, one, or multiple runs in one inning.

Runs in baseball are considered rare events and count data, so they will follow a discrete probability distribution if they are random. The overall goal of this post is to describe the random process that arises with scoring runs in baseball. Previously, I’ve used the Poisson distribution (PD) to describe the probability of getting a certain number of runs within an inning. The Poisson distribution describes count data like car crashes or earthquakes over a given period of time and defined space. This worked reasonably well to get the general shape of the distribution, but it didn’t capture all the variance that the real data set contained. It predicted fewer scoreless innings and many more 1-run innings than what really occurred. The PD makes an assumption that the mean and variance are equal. In both runs per inning and runs per game, the variance is about twice as much as the mean, so the real data will ‘spread out’ more than a PD predicts.

The graph above shows an example of the application of count data distributions. The actual data is in gray and the Poisson distribution is in yellow. It’s not a terrible way to approximate the data or to conceptually understand the randomness behind baseball scoring, but the negative binomial distribution (NBD) works much better. The NBD is also a discrete probability distribution, but it finds the probability of a certain number of failures occurring before a certain number of successes. It would answer the question: what’s the probability that I get 3 TAILS before I get 5 HEADS if I keep flipping a coin? At first this doesn’t intuitively seem to relate to a baseball game or an inning, but that will be explained later.

From a conceptual standpoint, the two distributions are closely related. So if you are trying to explain to a friend over a beer why 73% of all MLB innings are scoreless, either will work. I’ve plotted both distributions for comparison throughout the post. The second section of the post will discuss the specific equations and their application to baseball.

Runs per Inning

Because of the difference in rules regarding the designated hitter between the two leagues, there will be a different expected value [average] and variance of runs/inning for each league. I separated the two leagues to get a better fit for the data. Using data from 2011-2013, the American League had an expected value of 0.4830 runs/inning with a 1.0136 variance, while the National League had an expected value of 0.4468 runs/inning with a .9037 variance. [So NL games are shorter and more boring to watch.] Using only the expected value and the variance, the negative binomial distribution [the red line in the graph] approximates the distribution of runs per inning more accurately than the Poisson distribution.

It’s clear that there are a lot of scoreless innings and very few innings with multiple runs scored. This distribution allows someone to calculate the probability of an MLB team scoring more than 7 runs in an inning or the probability that the home team forces extra innings when down by a run in the bottom of the 9th. Using a pitcher’s expected runs/inning, the NBD could be used to approximate the pitcher’s chances of throwing a shutout, assuming he pitches all 9 innings.

Runs Per Game

The NBD and PD can be used to describe the runs scored in a game by a team as well. Once again, I separated the AL and NL, because the AL had an expected run value of 4.4995 runs/game and a 9.9989 variance, and the NL had 4.2577 runs/game expected value and 9.1394 variance. This data is taken from 2008-2013. I used a larger span of years to increase the total number of games.

Even though MLB teams average more than 4 runs in a game, the single most likely run total for one team in a game is actually 3 runs. The negative binomial distribution once again modeled the distribution well, but the Poisson distribution fit much worse than it did in the previous graph. Both models, however, underestimate the shutout rate. A remedy for this is to adjust for zero-inflation, which would increase the likelihood of a shutout in the model and adjust the rest of the probabilities accordingly. The need for zero-inflation implies that baseball scoring isn’t completely random: a manager is more likely to use his best pitchers to continue a shutout than to assign pitchers from the bullpen at random.
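As a sketch of what that adjustment looks like, a zero-inflated model mixes a point mass at zero with the original count distribution. The mixing weight $\pi$ is a symbol introduced here purely for illustration, not a value estimated in this post, and $P_{NB}$ stands for the negative binomial probabilities described in the math section:

$P(X=0) = \pi + (1-\pi) P_{NB}(0)$

$P(X=k) = (1-\pi) P_{NB}(k), \quad k \geq 1$

With $\pi = 0$ this reduces to the plain negative binomial; any $\pi > 0$ moves probability toward shutouts and scales the other run totals down proportionally.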

Hits Per Inning

It turns out the NBD and PD are useful for many other baseball statistics, like hits per inning.

The distribution of hits per inning is somewhat similar to runs per inning, except the expected value is higher and the variance is lower relative to the expected value. [AL: .9769 hits/inning, 1.2847 variance | NL: .9677 hits/inning, 1.2579 variance (2011-2013)] Since the variance is much closer to the expected value, the hits per inning distribution has more values in the middle and fewer at the extremes than the runs per inning distribution.

I could spend all day finding more applications of the NBD and PD, because there are really a lot of examples within baseball. Understanding how these discrete distributions work will help you understand how the game works, and they can be used to model outcomes within baseball.

The Math Side

Hopefully, you skipped down to this section right away if you are curious about the math behind this. I’ve compiled the numbers used in the graphs for the American League above for those curious enough to look at examples of the actual values.

The Poisson distribution is given by the equation:

$P(X = x) = \frac{e^{-\lambda}\lambda^x}{x!}$

There are two parameters for this equation: expected value [$\lambda$] and the number of runs you are looking to calculate the probability for [$x$]. To determine the probability of a team scoring exactly three runs in a game, you would set $x = 3$ and using the AL expected runs per game you’d calculate:

$P(X = 3) = \frac{e^{-4.4995}4.4995^3}{3!} = 16.87\%$

This is repeated for the entire set of $x$ = {0, 1, 2, 3, 4, 5, 6, … } to get the Poisson distribution used throughout the post.
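The same calculation is easy to repeat in code. This is a minimal sketch using only the Python standard library; the function and constant names are mine, and the only input is the AL expected runs per game quoted above.

```python
from math import exp, factorial


def poisson_pmf(x, lam):
    """P(X = x) for a Poisson distribution with expected value lam."""
    return exp(-lam) * lam**x / factorial(x)


AL_RUNS_PER_GAME = 4.4995  # expected value quoted in the text

for runs in range(8):
    print(f"P(X = {runs}) = {poisson_pmf(runs, AL_RUNS_PER_GAME):.2%}")
```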

One of the assumptions the PD makes is that the mean and the variance are equal. For these examples, this assumption doesn’t hold true, so the empirical data from actual baseball results is overdispersed and doesn’t quite fit the PD. The NBD accounts for the variance by including it in the parameters.

The negative binomial distribution is usually symbolized by the following equation:

$P(X=k) = {{r+k-1}\choose{k}} p^{r} (1-p)^{k}$

where $r$ is the number of successes, $k$ is the number of failures, and $p$ is the probability of success. A key restriction is that a success has to be the last event in the series of successes and failures.

Unfortunately, we don’t have a clear value for $p$ or a clear concept of what will be measured, because the NBD measures the probability of binary Bernoulli trials. It helps to view this problem from the vantage point of the fielding team or pitcher: a SUCCESS will be defined as getting out of the inning or game, and a FAILURE will be allowing 1 run to score. This conforms to the restriction by having a success [getting out of the inning/game] be the final event of the series.

In order to make this work, the NBD needs to be parameterized differently, in terms of the mean, the variance, and the number of runs allowed [failures]. The following equations are derived from the mean and variance equations of a negative binomial. $\alpha$ represents the ‘odds in favor’ of getting out of the inning, and $r$ is the expected value multiplied by the ‘odds in favor’, which yields a real, non-integer value for the number of successes. The NBD can then be written as

$P(X=k) = \frac{\Gamma(k+r)}{\Gamma(k+1)\Gamma(r)} (\frac{\alpha}{1+\alpha})^{r} (\frac{1}{1+\alpha})^{k}$

where

$r = \text{Expected Value} \times \alpha; \quad \alpha = \frac{\text{Expected Value}}{\text{Variance} - \text{Expected Value}}$

So using the same example as for the PD, this would yield:

$r = 4.4995 * 0.8182 = 3.6815 ; \alpha = \frac{4.4995}{9.9989 - 4.4995} = 0.8182$

$P(X=3) = \frac{\Gamma(3+3.6815)}{\Gamma(3+1)\Gamma(3.6815)} (\frac{0.8182}{1+0.8182})^{3.6815} (\frac{1}{1+0.8182})^{3}$

$= 14.36\%$

The above equations are adapted from this blog about negative binomials and this one about applying the distribution to baseball. The $\Gamma$ function is used in the equation instead of a combination operator because the combination operator, specifically the factorial, can’t handle the non-whole numbers we are using to describe the number of successes, and the gamma function is a continuous function from 0 to infinity.
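The mean-and-variance parameterization above can be sketched in a few lines of Python. The function name is mine, and working through `math.lgamma` (the log of the gamma function) is a design choice for this sketch rather than anything taken from the sources above; it keeps the gamma terms from overflowing for large run totals.

```python
from math import exp, lgamma, log


def nbd_pmf(k, mean, variance):
    """P(X = k) for a negative binomial parameterized by mean and variance.

    Requires variance > mean (overdispersed data).
    """
    alpha = mean / (variance - mean)  # 'odds in favor' of a success
    r = mean * alpha                  # real-valued number of successes
    p = alpha / (1 + alpha)           # probability of success
    # Work in log space: log of Gamma(k + r) / (Gamma(k + 1) * Gamma(r)).
    log_coeff = lgamma(k + r) - lgamma(k + 1) - lgamma(r)
    return exp(log_coeff + r * log(p) + k * log(1 - p))


AL_MEAN, AL_VARIANCE = 4.4995, 9.9989  # runs per game quoted in the text

for runs in range(8):
    print(f"P(X = {runs}) = {nbd_pmf(runs, AL_MEAN, AL_VARIANCE):.2%}")
```

Swapping in the runs-per-inning mean and variance gives the per-inning distribution with no other changes.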

Conclusion

The negative binomial distribution is really useful in modeling the distribution of discrete count data from baseball for a given inning or game. The most interesting aspect of the NBD is that a success is considered getting out of the inning/game, while a failure would be letting a run score. This is a little counterintuitive if you approach modeling the distribution from the perspective of the batting team. While the NBD has a better fit, the PD has a simpler concept to explain: the count of discrete events over a given period of time, which might make it better to discuss over beers with your friends.

The fit of the NBD suggests that run scoring is a negative binomial process, but inconsistencies, especially with shutouts, indicate that elements of the game aren’t completely random. I attribute the underestimated number of shutouts to the increased use of the best relievers in shutout games compared to other games, which increases the total number of shutouts and subsequently decreases the frequency of other run totals.

All MLB data is from retrosheet.org. It’s available free of charge from there. So please check it out, because it’s a great data set. If there are any errors or if you have questions, comments, or want to grab a beer to talk about the Poisson distribution please feel free to tweet me @seandolinar.

# Pirates 2014 — Bullpen

All the graphs are pulled from this Fangraphs leaderboard.

The Pirates bullpen has been a source of problems and criticism for the Pirates this year. At the beginning of 2014, the bullpen had almost the same personnel as in the 2013 season. Bullpens can vary wildly from year to year, and the Pirates relievers pitched out of their minds for most of 2013, so you’d expect there to be some fall off. Currently [August 26, 2014], the Pirates lead MLB with 22 blown saves. Personally, I abhor saves and blown saves, but I needed to get this out of the way, since it’s the stat that will get thrown around the most. And for reference, Tony Watson [the All-Star] leads the team with 6 blown saves. So there’s that.

I wanted to look at some of the peripheral stats of the Pirates bullpen to understand the entire story. First, the Pirates starters have been terrible this year. They rank last in starter WAR, middle of the pack in FIP, and near the bottom in WPA. Analyzing that situation is for another day, but suffice it to say they give up a lot of runs before the bullpen gets into the game. The smaller the average lead the bullpen has to hold on to, the more often they will give up the lead [accrue a blown save]. Shutdowns and meltdowns are Fangraphs stats which are better for evaluating individual relievers than saves. They provide a broader evaluation of how a pitcher or bullpen has performed rather than just looking at save situations. For a shutdown a pitcher basically adds to the win probability while for a meltdown a pitcher subtracts from the win probability. For instance last night Jared Hughes had a meltdown allowing three runs and inverting the win probability.

The Pirates are in the middle of the pack for both of those stats. There really isn’t anything interesting here.

Finally, the Pirates’ reliever xFIP is not very good. It’s towards the lower end of MLB. xFIP is one of the better park-independent, context-independent predictors of pitching skill. It just uses BB, K, and flyballs [for HR/FB]. This will also ‘adjust’ for some of Grilli and Frieri’s HRs that they gave up when they were struggling earlier this year. Those struggles won’t affect the bullpen moving forward since they are no longer on the team.

This quick analysis answers my initial question about the Pirates bullpen: they aren’t good. They aren’t terrible, but they aren’t good. They do have two really good pitchers in Melancon and Watson, and two decent pitchers in Wilson and Hughes. Then the rest aren’t great. Given this analysis, what could the Pirates do to improve? Frieri was a gamble that didn’t pay off. But honestly, I think from a management standpoint, you had to get rid of Grilli to get him out of the closer role. John Axford might help. He’s been good in his 5 appearances for the Pirates so far, and his career xFIP is 3.26, which is pretty good. As far as a trade, ‘proven’ relievers are overvalued in the free agent market, and the trade market was really expensive this year. Overall, one reliever isn’t going to affect your win total dramatically.

# MLB — Bases Loaded. No Outs. No Runs.

Bases loaded, no outs is one of the most tense points of a close baseball game. If you are rooting for the team at the plate, you feel confident your team will score here. Anything else would be a huge disappointment. If you are rooting for the fielding team and your pitcher gets out of the jam, you are elated and praising the pitching staff for being able to handle pressure. Even though bases loaded, no outs (BLNO) seems like a sure thing, there is about a 15% chance the team DOESN’T score at all.

I’ve created this table of the probability of scoring AT LEAST ONE RUN in the various base-out states using data from 2011-2013. The base-out states represent the 8 possible combinations of runners on base with the 3 out states that can exist [24 total]. 1- – means there’s only a runner on first, 1-3 means first and third, and 123 is bases loaded. Looking at the chart, there is only an 85.18% chance that the team with BLNO scores a run. It’s one of the highest run probability situations, but there’s still a significant chance they won’t score a run.

This table considers every play that started with this base-out configuration and looks at the remainder of the inning to see if the team scored. [It uses every play in baseball from 2011-2013 including playoff games.] In general these numbers fluctuate slightly over time and between teams. This table is also context neutral, specifically batter neutral, so having Mike Trout at bat would significantly change the probability versus a player like Clint Barmes.
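The bookkeeping behind a table like this can be sketched as follows. The record layout, a tuple of base state, outs, and runs scored from that play through the end of the inning, is an assumption made for illustration, and the sample rows are invented rather than taken from Retrosheet.

```python
from collections import defaultdict


def run_probability(plays):
    """Share of plays from each base-out state after which at least one run scored.

    `plays` is an iterable of (bases, outs, runs_rest_of_inning) tuples, where
    runs_rest_of_inning counts runs from that play through the end of the inning.
    """
    scored = defaultdict(int)
    total = defaultdict(int)
    for bases, outs, runs_rest_of_inning in plays:
        state = (bases, outs)
        total[state] += 1
        if runs_rest_of_inning >= 1:
            scored[state] += 1
    return {state: scored[state] / total[state] for state in total}


# Hypothetical records, not real Retrosheet data.
sample = [
    ("123", 0, 2),
    ("123", 0, 0),
    ("123", 0, 1),
    ("123", 0, 1),
    ("--3", 0, 1),
    ("--3", 0, 0),
]
print(run_probability(sample))
```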

Looking at the table, it’s apparent that the lead runner is the most important factor in scoring AT LEAST one run, since base-out states that share the same lead runner, whether at third or at second, have similar probabilities. So having a lead-off triple is about as valuable [in the context of scoring ONLY one run] as having the bases loaded with no outs.

There are different run and out possibilities that exist with each base-out state. For the lead-off triple, there is no force play on the bases, while a bases-loaded situation has a force play at every bag including home. Having the bases loaded would turn a ground ball into a potential run-robbing force play, while a single runner on third would require a tag. Conversely, BLNO allows for walks and hit by pitches to drive in a run. This table also uses the entire rest of the inning, not just the play that occurs with BLNO. So if the team loads the bases with no outs, makes two outs, and then scores a run, it still counts as a success. A double play, which is easier to get with the bases loaded than with just a runner on third, will dramatically reduce the run probability for the rest of the inning, which lowers the probability credited to the original base-out state. In summary, there are trade-offs affecting the overall, context-neutral probability of each base-out state.

Example — Pirates Game

Failing to score a run in the context of this post means after loading the bases, the team does not score any runs before the end of the inning. All the probabilities are determined empirically.

Something kind of cool happened during the Pirates game last night (8/8/2014). There were two instances where the bases were loaded with no outs and the team wasn’t able to score any runs. Not being able to score any runs with the bases loaded and no outs isn’t that uncommon. A run-probability table can tell you that ~14% of the time a team will fail to score any runs for the rest of the inning after achieving that base-out state.

A base-out state is one of the 24 possible combinations of baserunners and number of outs. So there are 8 base states, bases empty, runner on first, etc. to bases loaded, and three different out states, 0, 1, or 2 outs. 8 x 3 = 24.
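The 8 x 3 count can be checked by enumerating the states directly. This is a small sketch; the label format (a digit for an occupied base, a dash for an empty one) is my own choice, modeled on labels like 1-3 and 123.

```python
from itertools import product

# Each base is either empty (False) or occupied (True): 2**3 = 8 base states.
base_states = list(product([False, True], repeat=3))
out_states = [0, 1, 2]


def label(bases):
    """Render a base state as text, e.g. '1-3' for runners on first and third."""
    return "".join(str(i + 1) if on else "-" for i, on in enumerate(bases))


base_out_states = [(label(b), outs) for b, outs in product(base_states, out_states)]
print(len(base_out_states))
print(base_out_states[:4])
```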

In the control room at the Pirates game last night, we were debating how often you see two occasions in the same game where no runs are scored after the bases are loaded with no outs. It turns out it is relatively rare, but it happened twice at PNC Park before 2014: May 12, 2002 and August 28, 2003.

Between 2003 and 2013, bases were loaded with no out and no runs scored 1,092 times. There were 25 games that this happened multiple times, which is 0.0923% of all games played during that time [27,094 games]. This is on par with the probability of seeing a no-hitter (0.111%) and less probable than seeing a walk-off walk to end the game (0.266%).

The probability of seeing a game with two or more non-scoring bases loaded/no outs situations is 0.0923%

Using the table below, bases empty/no outs will occur in every game (this happens at the start of every inning), and all the other base-out states have varying frequencies, with runners on third in low out-states being the rarest. Bases loaded/no outs is the rarest base-out state, occurring in only 21.92% of all games and occurring twice in the same game in only 6.05% of all games.

Just for reference, here is a chart of how often the base-out state events occur relative to all events. This would represent the probability that any random event (plate appearance, at-bat, stolen base, etc.) would have that base-out state.

All data is from retrosheet.org

# Moving Average Time Series — Baseball

Usually I use stats to describe baseball, but this post is going to use baseball to illustrate stats. There’ll be some math. If that scares you, you’ve been duly warned. Also I have collected the SAS output for each model for technical reference.

A time series is data that has been collected at a regular interval over time. This is rather intuitive given the definition, but time series are different from cross-sectional data, which is the type of data set most people are familiar with. The closing price of a stock is a time series, because it’s a measurement at 4PM every M-F. Cross-sectional data would be looking at which types of stocks in your portfolio gained the most over a quarter. This is one measurement (quarterly change) made for many different stocks. Not every data set fits neatly into a category, and the analysis goal is different for each instrument.

The goal of univariate time series analysis (TSA) is to forecast a variable only using past observations of that variable. In the case of the stock market example, TSA seeks to project what the closing price for the next day will be using data from the specified time frame. However, finance is boring and I wanted a data set that I can extract some insight from, so we’ll be looking at MLB strikeouts (K) per year and home runs (HR) per year as the data sets.

What does a time series look like? If you scroll down or look up a stock market graph, you’ll see what a time series looks like. It’s messy. I created this data set so I can describe this process accurately. It’s a first-order moving average process with a lag-1 coefficient of 0.9 and a series mean of 0. I’ve also included the normal linear regression (OLS) trend for the time series, which shows a slightly positive trend. This is a typical analytical technique to show that a time series is moving. In this case the trend is non-significant over these 50 data points. There is no trend, and the mean is zero.

The model that corresponds to the graph above has the general form as follows:

$y_t = \mu + a_t + \theta_1 a_{t-1}$

where $y$ is the time-dependent target variable, $\mu$ is the average of the entire series of data, $\theta$ is the regression coefficient, and $a$ is a time dependent shock to the system. The $t$ terms describe which time period the variable is from starting with the most current one, $t=50$.
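A series like this can be generated in a few lines. This is a sketch under stated assumptions: the shocks are drawn from a normal distribution with standard deviation 1, and the seed is arbitrary; neither value comes from the data set used for the graph.

```python
import random


def simulate_ma1(n, theta, mu=0.0, sigma=1.0, seed=None):
    """Simulate y_t = mu + a_t + theta * a_{t-1} with normally distributed shocks."""
    rng = random.Random(seed)
    shocks = [rng.gauss(0.0, sigma) for _ in range(n + 1)]  # includes a_0
    return [mu + shocks[t] + theta * shocks[t - 1] for t in range(1, n + 1)]


series = simulate_ma1(n=50, theta=0.9, seed=42)
print(len(series), sum(series) / len(series))
```

Each observation shares one shock with its neighbor, which is what gives the series its one-period 'memory'.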

Before describing the model above, it is important to fully understand what $a_t$ represents. This is a shock term that can encompass a lot of different things. If you are considering something like quarterly earnings, factors influencing the shock term are unemployment, economic growth, marketing campaigns, etc. We are looking at the data in the absence of this knowledge, and since we are in the dark, the causes of the shocks appear random. The $a_t$ terms should be normally distributed and not autocorrelated. The expected value should be zero, $E[a_t] = 0$. The expected value is another way to describe the average of all the $a_t$ terms.

Here’s a great way to think about the MA process. Think about simplified personal monthly expenditures where you have a constant salary and a modest savings account. Shocks that would be included in the $a_t$ term would be unexpected expenses. The unexpected expense could influence the next time period if you had to dip into savings. So a high unexpected expense in January would impact the spending in February, because you’d have to pay off your credit card or put money back into savings.

There are many more details to understanding time series such as autocorrelation. Hopefully I’ll write a separate post on that in the future.

Let’s look at some real data. Luckily, I have every play from MLB in a database thanks to retrosheet.org, so we’ll look at some time series from there, specifically HRs and Ks per year. Conceptually, for this rudimentary modeling, an MA process makes sense. A shock from the previous year like expansion, steroids, or selection bias would carry over year to year. Looking at the time series graph below, it doesn’t behave like the previous time series that was centered around zero. This time series is considered non-stationary, which means there’s a trend and that trend changes over time. The number of HRs per season increased over time up until around 2001, when it leveled off and started to decline. There’s a trend up until 2001 and a trend after it, and they aren’t the same. To get around this, instead of modeling the actual values, the differences between two years of HRs will be modeled. A difference ($\nabla$) is simply $y_t - y_{t-1}$. For example, the difference in HRs between 2013 and 2012 would be -279 HRs.
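Differencing is a one-line operation. In this sketch the season totals are hypothetical, chosen only so that the last change comes out to -279; they are not the actual Retrosheet totals.

```python
def difference(series):
    """First difference: y_t - y_{t-1} for each consecutive pair."""
    return [curr - prev for prev, curr in zip(series, series[1:])]


# Hypothetical season totals, chosen only so the last change is -279.
hr_totals = [4500, 4900, 4621]
print(difference(hr_totals))
```

The differenced series is one observation shorter than the original, since the first season has no earlier value to subtract.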

The green line shows the actual HRs each year. The ‘cantaloupe’ colored lines are the 50% confidence interval (CI) of the forecast. The red line shows the forecasted values. I used 50% CIs to show likely deviations, not statistically significant deviations.

The differenced moving average model [ARIMA(0,1,1)] takes the form:

$\nabla y_t = \mu + a_t - \theta a_{t-1}$

Substituting the estimated coefficient for $\theta$ and $\mu$ a forecast can be made with the following equation:

$y_{t+1} = \mu + y_t + a_{t+1} - \theta * a_{t}$

$y_{t+1} = 50.11163 + y_t + a_{t+1} - .45073 * a_{t}$

The last equation is used to generate the forecast line and ultimately the 50% CI lines. The interpretation of this equation is that about half of the shock from the previous time period still has an effect on the change to the current period. The forecast predicts that home runs will actually increase over the next few years and not continue the decline. Looking backwards, the model can be used to identify some years of interest, and I’ve marked those on the graph. Expansion probably has the greatest impact on the number of HRs, because it dilutes the talent pool and increases the total number of games per season. If you wanted to measure the impact training or steroids had on HRs, you’d want to use a HR/game time series [see below] instead of total HRs. [This is total HRs between both teams.]
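To make the forecast equation concrete, here is a minimal sketch of a one-step-ahead forecast. It assumes the first shock is zero and recovers later shocks recursively from the model equation; statistical packages may initialize the shocks differently, so this will not necessarily reproduce the SAS output. The coefficients are the ones quoted above, while the season totals are hypothetical.

```python
def forecast_next(series, mu, theta):
    """One-step-ahead forecast for the differenced MA(1) model.

    Shocks are recovered recursively from
        a_t = (y_t - y_{t-1}) - mu + theta * a_{t-1},
    starting from a_0 = 0 (an assumption made for this sketch). The future
    shock a_{t+1} has expected value zero, so it drops out of the forecast.
    """
    shock = 0.0
    for prev, curr in zip(series, series[1:]):
        shock = (curr - prev) - mu + theta * shock
    return mu + series[-1] - theta * shock


# Coefficients quoted in the text; the season totals are hypothetical.
MU, THETA = 50.11163, 0.45073
hr_totals = [4500, 4900, 4621]
print(round(forecast_next(hr_totals, MU, THETA), 1))
```

A large negative shock in the last season pushes the next forecast up, which is the sense in which part of the previous shock still affects the next change.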

HR/Gm is the time series that a baseball analyst would want to use, because it controls for extra games from expansion, so the trends are also less pronounced. This is still a non-stationary time series, so it needs to be differenced like the previous model and can be described by the following equation:

$y_{t+1} = 0.0045989 + y_t + a_{t+1} - .49927 * a_{t}$

Still, the greatest shocks are the expansion years, which tend to have a bit of a lingering effect before regressing. 1987 now stands as a really enigmatic outlier. There was no expansion that year. The best explanation is that there was a strike zone change, but I can only find that in one article. The home run outburst of the late 90s and early 2000s coincides with the ‘steroid era’ and two closely spaced rounds of expansion. This post isn’t interested in analyzing steroids’ effect on MLB, only in noting that its ‘shock’ is mixed in with the expansion team ‘shock’. It should also be noted that HRs/Gm haven’t returned to pre-1993 expansion levels.

Looking at the opposite of a home run, strike outs per year have a much steadier trend, and it’s increasing.

The graph displayed above is also a differenced first-order moving average process, ARIMA(0,1,1). Its equation looks very similar to the last two, so I won’t write it out. The parameters can be found in the SAS output appendix I have for this page. The forecast shows a definite increase in total strike outs over the next few years. Just like the HR per year time series, the time series of Ks is best analyzed by looking at K/Gm. The K/Gm time series turns out to be a different model than the first three models, because it is just a random walk around a linear trend.

This process has random shocks around a positive trend with no ‘memory’ of the past shocks like the other three models had. This model for K/Gm, ARIMA(0,1,0), looks a little different than the ARIMA(0,1,1) models seen earlier since there is no lagged $a_{t-1}$ term. The ARIMA(0,1,0) model is given by the following equation:

$\nabla y_t = \mu + a_t$

and the forecast equation with parameters in it would be:

$y_{t+1} = 0.11637 + y_t + a_t$

This indicates that the K/Gm will increase by 0.11637 every year on average. Obviously, since there are only 54 outs in a baseball game, this trend can’t go on forever. As of the beginning of August 2014, the current K/Gm is 15.4 and it is forecasted to be 15.2497, which is within the 50% CI of the forecast.
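To make the drift interpretation concrete, here is a minimal Python sketch of point forecasts from a random walk with drift. Future shocks have an expected value of zero, so each step ahead simply adds the drift. The function name and the starting value of 15.0 are my own placeholders for illustration; only the 0.11637 drift comes from the model above.

```python
def drift_forecast(last_value, drift, steps):
    """Point forecasts for a random walk with drift.

    The expected value of each future shock a_t is zero, so the
    h-step-ahead forecast is last_value + h * drift.
    """
    return [last_value + h * drift for h in range(1, steps + 1)]


# Hypothetical starting K/Gm of 15.0; the drift is the fitted mu from above.
forecasts = drift_forecast(last_value=15.0, drift=0.11637, steps=3)
for step, value in enumerate(forecasts, start=1):
    print(f"{step} season(s) ahead: {value:.4f} K/Gm")
```

This only produces the center line of the forecast. The CI bands widen with each step because the shocks accumulate, and that part isn’t sketched here.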

While these models can make predictions about baseball, I wouldn’t consider these the best [or even good] models for forecasting, since we could incorporate other variables or improve the granularity of the forecast to individual players. There also isn’t much value in saying there’ll be more strike outs in 2014 than 2013. However, this example is a good academic exercise in understanding how univariate time series work. And hopefully it provides some insight into both time series and a little bit about trends in baseball.

# Pirates Do Not Need Help Against LHP

Stats in this post are current up to right before the July 31, 2014 PIT-ARZ game.

The MLB non-waiver trade deadline just passed. I’m not interested in debating what teams should or should not have done except to say the price for quality players was very high this year. The whole supply & demand, free market thing really worked in the favor of teams that were already out of the post season race. It was suggested that the Pirates needed a right-handed batter (RHB), since they don’t do well against left-handed pitching (LHP). I had my doubts this was really true, and adding a good RHB won’t improve the team beyond what general improvements you could expect from that batter. MLB teams generally do better against LHP, since most batters are RHB and the RHB/LHP split favors the batter.

Before getting into this, note that LHP make up only 21% of the Pirates’ season-to-date plate appearances. Out of all the problems the Pirates could have, making a roster move to address this one isn’t necessary unless you are looking to platoon. More on that later.

Looking at the team batting splits, the Pirates have an overall .722 OPS and a LHP .670 OPS. On the surface, it appears they are performing worse against LHP, and I will concede the argument the Pirates HAVE performed worse against LHP so far in 2014, but this shouldn’t continue going forward.

The Pirates have 4,152 plate appearances racked up thru July 30th, but only 867 of them have occurred against LHP (~21%). To put this in perspective, that is equivalent to less than one month of games. How accurate are batting statistics at the end of April? They aren’t. Put simply, the Pirates’ ‘struggles’ against LHP can mostly be attributed to a small sample size.

I went and laid out all the outcomes (1B, BB, 2B, etc.) in a vector of plate appearances and had the computer randomly draw 900 samples from the entire Pirates season and computed the OPS 1000 different times. Then I plotted them below.
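Here is a rough Python sketch of that kind of resampling. The outcome counts are made-up placeholders, not the Pirates’ actual totals, the OPS calculation is simplified (no HBP or sacrifice flies), and I’m assuming the draws are made without replacement; the function names are just for illustration.

```python
import random

# Hypothetical season outcome counts (NOT the Pirates' real numbers).
OUTCOME_COUNTS = {"OUT": 2850, "BB": 350, "1B": 650, "2B": 180, "3B": 20, "HR": 100}
TOTAL_BASES = {"OUT": 0, "BB": 0, "1B": 1, "2B": 2, "3B": 3, "HR": 4}


def ops(sample):
    """Simplified OPS: walks count toward OBP only; HBP and SF are ignored."""
    walks = sum(1 for outcome in sample if outcome == "BB")
    hits = sum(1 for outcome in sample if TOTAL_BASES[outcome] > 0)
    total_bases = sum(TOTAL_BASES[outcome] for outcome in sample)
    at_bats = len(sample) - walks
    obp = (hits + walks) / len(sample)
    slg = total_bases / at_bats if at_bats else 0.0
    return obp + slg


def simulate_ops(plate_appearances, sample_size, n_draws, seed=0):
    """Draw sample_size plate appearances n_draws times and return each OPS."""
    rng = random.Random(seed)
    return [ops(rng.sample(plate_appearances, sample_size)) for _ in range(n_draws)]


# Lay every plate appearance out in one long vector, one entry per outcome.
season = [outcome for outcome, n in OUTCOME_COUNTS.items() for _ in range(n)]

draws = simulate_ops(season, sample_size=900, n_draws=1000)
threshold = 0.670
share_below = sum(1 for value in draws if value < threshold) / len(draws)
print(f"Overall OPS: {ops(season):.3f}")
print(f"Share of 900-PA samples below {threshold:.3f}: {share_below:.1%}")
```

Changing `sample_size` to something like 3000 shows how much tighter the spread of sampled OPS values gets with more plate appearances.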

Due to the central limit theorem, the mean should hover around .720 (the overall OPS) and the data should be normally distributed. Because of this, I constructed the normal distribution curve and then used it to calculate the probability that a 900 plate-appearance sample with an OPS below .670 would be drawn from the Pirates’ total plate appearances. It turns out that 9% of the time the program will select plate appearances that total less than a .670 OPS. 9% isn’t that likely, but it is not outrageous to conclude the Pirates’ low vs-LHP OPS is due to small sample size. This is applicable not just to LHP vs overall splits, but to any low-count split, including RISP. I wrote about this previously and came to a similar conclusion.

The composite distribution curves below illustrate what happens as sample size increases and why small sample sizes are problematic. The red vertical line is the .670 OPS mark. On the 900-sample distribution (vs LHP) there is a 9% probability of drawing a .670 OPS or lower from the Pirates’ total plate appearances. This is the area underneath the curve to the left of the red line. Using the 3000-sample distribution curve, it’s 0.0016%. There is barely any area under the 3000-PA curve at that point, and this is a huge difference. (3000 samples are approximately how many the team has against RHP.)

One more graph! This is a histogram of the differences between the LHP OPS and the overall OPS. The Pirates are on the low end of it. Not great, but there’s a lot of variation there.

Switching from statistics to baseball, the Pirates have the second fewest plate appearances against LHP in MLB. They are 11-9 in games started by a LHP. That alone should discount the poor-performance-against-LHP argument, but the team batting stats suggest that they are struggling, and that has been woven into a narrative.

Looking closely at the Pirates’ roster, there are many solid RHBs: McCutchen [their best hitter], Martin, Marte, Sanchez, and Mercer/Harrison are pretty good against lefties. Now, some of these players are underperforming against LHP this year, but this is where the small sample size comes in again. You wouldn’t conclude that any of these batters lost their platoon advantage after only 80 plate appearances. Going forward, almost all of these bats should regress to their normal platoon splits.

Then there are the left-handed bats: Pedro Alvarez, Gregory Polanco, and Ike Davis. Their platoon splits are pretty atrocious both for 2014 and career-wise. For example, Alvarez has a .787 OPS vs RHP and a .517 OPS against LHP this year. I don’t want to get into analyzing what’s wrong with the Pirates’ left-handed bats, except to say they are terrible against LHP. The argument should change from “the Pirates don’t do well against LHP” to “the Pirates’ left-handed batters are terrible against LHP.”

What can be done about this? The simple answer is to get better left-handed batters. Since that’s not really possible, the next best option would be platooning the left-handed batters. Ike Davis is already platooned with Gaby Sanchez, and Pedro Alvarez is barely starting any games. Polanco has regressed from his debut, but I think the best idea is for him to play every day and deal with LOOGY relievers. I also don’t know how many fans actually want to see him platooned or are suggesting that he should be. With all this in mind, I’m not quite sure what acquiring a right-handed bat would accomplish. The Pirates are already trying to find a place for RHB Josh Harrison to play. He’s been having a good season, no matter what you think about Harrison. Furthermore, the Pirates have a guy who’s been killing LHP this year and has decent splits against them for his career. And that’s Jose Tabata.

Bottom line, adding a RHB wouldn’t help much because the team splits against LHP are still a small sample size. Beyond the statistics, the two big left-handed bats have terrible splits against LHP, and these problems have already been addressed by platooning and benching.

# Predicting Baseball Wins with WAR

There is a lot of debate about the usefulness of the comprehensive baseball statistic WAR (Wins Above Replacement). I don’t think that WAR is the end-all statistic, but it is a useful tool. Why? Because it can describe relatively accurately how a player contributes to a team. It also can help fans understand the real impact of one player. I might have to refer people here once they start clamoring that a single player will change the direction of a team at the trade deadline.

If anyone wants a primer on the details of what goes into the WAR stat, check out baseball-reference.com’s comparison between systems. Basically, WAR is the number of statistical wins the player is responsible for above a replacement player. In theory the replacement is the mediocre AAA player that is not a prospect. That statistic is the middle estimate of the impact the player will have. A player can be ‘responsible’ for more wins than their WAR number, but also drastically fewer. Think of WAR as the average number of wins he’s responsible for.

For probably over a year, I’ve wanted to see if WAR actually can predict the number of wins a team will have. I forget my original methods of trying to determine this, but this time round, I used FanGraphs’ WAR numbers for both pitching and batting from the last decade of seasons for all 30 teams. That’s 300 data points. After assembling the data and then running it through a basic linear regression, I was quite happy with what I saw. I’ve heard that if you add 48 to the team’s WAR number you will get their total wins, and this can be seen mathematically by looking at real data.

I’ve graphed the actual wins to WAR and actual wins to the Pythagorean predicted wins for comparison. [Pythagorean wins performed better.] The linear regression for the WAR comparison actually turns out to be incredibly powerful. The regression coefficient is almost exactly equal to one, meaning that each unit increase in WAR means an equal increase in wins. The y-intercept is +48.5, which means for the last decade the number of theoretical replacement wins has been just about 48. This should make sense, since the calculation of WAR is calibrated to a 48 win replacement level. The actual implementation of WAR works really well to predict team wins. Unfortunately, this model will have a 95% prediction interval of 20 wins. That seems like a lot, but it shows how much luck has to do with a baseball season.
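For anyone who wants to try the same kind of fit, here is a minimal least-squares sketch in plain Python. The data is synthetic: I generate 300 fake team-seasons as wins = 48 + WAR + noise, where the noise level of 5 wins and the WAR range are assumptions I picked for illustration, not the FanGraphs data used above.

```python
import random


def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x, plus r-square."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    ss_res = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return slope, intercept, 1 - ss_res / ss_tot


# Synthetic team-seasons (assumed values, NOT real FanGraphs data).
rng = random.Random(1)
team_war = [rng.uniform(15, 55) for _ in range(300)]
team_wins = [48 + war + rng.gauss(0, 5) for war in team_war]

slope, intercept, r_square = fit_line(team_war, team_wins)
print(f"wins = {intercept:.1f} + {slope:.2f} * WAR  (r-square = {r_square:.2f})")
```

With real data you would swap the two synthetic lists for each team’s season WAR total and actual win total.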

Pythagorean wins are typically used to show how lucky a team has or hasn’t been in a given year. This is actually a slightly better predictor of a team’s success than WAR. There is less variance since run differential is just one step away from wins. You can see from the histograms that the spread on Pythagorean wins is less than with WAR. This can also be seen in the r-square for the linear regression. The Pythagorean wins linear model has an r-square of .87 while the WAR model has an r-square of .77. This means that 87% and 77% of the variance is explained by the respective models, indicating that Pythagorean wins is slightly more accurate. The trade off is that WAR can give you player-level detail while run differential is only team-specific.
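For reference, the Pythagorean expectation turns runs scored and runs allowed into an expected win total. Here is a small sketch using the classic exponent of 2; other exponents are in common use, and I’m not claiming this is the exact variant behind the numbers above. The run totals in the example are made up.

```python
def pythagorean_wins(runs_scored, runs_allowed, games=162, exponent=2.0):
    """Expected wins from run totals using the Pythagorean expectation.

    win% = RS^k / (RS^k + RA^k), where k is the exponent (2 in the classic form).
    """
    scored = runs_scored ** exponent
    allowed = runs_allowed ** exponent
    return games * scored / (scored + allowed)


# Hypothetical team that outscores its opponents 700 to 650 over 162 games.
print(f"{pythagorean_wins(700, 650):.1f} expected wins")
```

A team that wins noticeably more games than this number suggests is usually described as having been lucky.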

As always, let’s look at what the Pirates did.

A theme I always harp on is that the 2013 Pirates were good and really lucky. This can be seen by the data point for 2013 falling above the linear regression trend line. If you were wondering, 2012 and 2011 (the two ‘collapse’ years) also fall above this line. I don’t know if this is the best way to measure a collapse, but the in-season stats did indicate regression during all three seasons 2011-2013.

# Probability and Sunday Night Baseball

There’s nothing I like more than a bases-loaded, no-outs situation in baseball. This might be my favorite situation/stat no one realizes. There’s around a 15% chance that the team who has the bases loaded will not score at all that inning! 15% might not seem like much, but over the course of the season it happens often.

Let’s set the scene: Bottom of the ninth, down by two, the Pirates knock in a run and get McCutchen on 1st with no outs to move within one run of the Cardinals.

This is a win probability graph FanGraphs has for every game. I’m not entirely sure what all they consider when calculating a win probability, but it mirrors the data I have, so there’s not much to discuss there. Clearly, the closest they came to winning the game was after Barmes walked, putting Alvarez, the winning run, on 2nd.

source: FanGraphs

According to my run probability calculations for 2013, the probability to score at least one run with bases loaded and no outs was lower than with runners on second/third or first/third and no outs. [Probabilities: 123: 77.9%, 1_3: 82.4%, _23: 90.9%] The advantage of having the bases loaded is that a walk or HBP brings a runner home, but the downside is there is an easy force at home. That would hurt the Pirates in this instance because Mercer didn’t hit the ball past the pitcher’s mound, making for an easy 1-2-3 double play.
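A calculation like this boils down to counting. Below is a minimal sketch, assuming you already have one record per base-out situation with the number of runs that scored from that point to the end of the half-inning. The record layout, the function name, and the toy data are all hypothetical, not my actual 2013 dataset.

```python
from collections import defaultdict


def scoring_probability(situations):
    """Probability of scoring at least one run from each (base_state, outs).

    situations: iterable of (base_state, outs, runs_after) tuples, where
    runs_after is the runs scored from that point to the end of the half-inning.
    """
    counts = defaultdict(lambda: [0, 0])  # (base_state, outs) -> [scored, total]
    for base_state, outs, runs_after in situations:
        tally = counts[(base_state, outs)]
        tally[1] += 1
        if runs_after > 0:
            tally[0] += 1
    return {key: scored / total for key, (scored, total) in counts.items()}


# Toy records only; "123" means bases loaded and "_23" means runners on 2nd and 3rd.
toy_records = [
    ("123", 0, 2),
    ("123", 0, 0),
    ("123", 0, 1),
    ("123", 0, 0),
    ("_23", 0, 1),
]
for (bases, outs), prob in sorted(scoring_probability(toy_records).items()):
    print(f"bases {bases}, {outs} out: {prob:.1%} chance of at least one run")
```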

# Chicago Transit Authority — Ridership

Waiting for the break of day…oooOOOOO…25 or 6 to 4!
-Chicago (formerly The Chicago Transit Authority)

I was lucky to live in Chicago during the summer of 2012. The thing I miss most from Chicago is the transit system. Taking the ‘L’ to work every day was much more relaxing and interesting than having to drive in. No parking hassles or gas. It was great. Transit data is critical to making those systems much more efficient. Fortunately, Rahm Emanuel is kind enough to release some of the transit data from the Chicago Transit Authority (CTA). The data only contains ridership per day information from each station, so I am limited in the insight this analysis can produce.

Before diving into the results of the descriptive analytics, let’s look at how the ‘L’ is designed. There are eight different lines, each designated by a color. All of the lines go through the downtown area called ‘The Loop’, so named because the elevated track forms a huge loop around a huge block of the city. There are two main lines which also go underground: the Red and Blue Line. These two lines run all night and carry the most passengers. When a Chicagoan rides the ‘L’, they swipe their pass at the entrance of a station, then board their desired train in either direction. Unlike transit systems like the Metro in DC, there is no exit swipe. So every data point in this post is going to be a person swiping at a station to board a train, but we can’t determine which direction or destination.

There’s also another problem: several stations service multiple lines. Clark/Lake has practically every line go through it. Without more resolution in how the data are measured, the most I can infer from the data is which stations are the most popular on certain days. This comes from the assumption that if a person arrives at a station they will leave from the same station.

This visualization looks a lot like a CTA map. I don’t have a good way to automatically draw the lines between the stations, but I think the location data that’s attached to the station names does a good job of recreating the CTA map everyone is used to seeing. I’ve labeled any ‘L’ stations which service multiple lines in an order of importance. The priority is Red, Blue, Orange, Brown, Green, Purple, Pink, Yellow, in that order. The reasoning behind this is that these are the largest or most popular lines, so the station will have the majority of patrons using these lines. From this map, Clark/Lake, the station with the most train lines, is the most popular. Terminuses (termini?) of the lines also have a lot of use. This can help visualize where the transfer points and the most popular entry points or destinations are. The Red Line and Blue Line have the most stops and the most ridership. Admittedly, this analysis has problems parsing Brown/Red Line customers, but there is higher ridership at the non-transfer stations of the Red Line; that confirms that more customers are using the Red Line in general. I have a separate post for the chart of the ridership of every station. The chart is way too big to put in this post; you’d be scrolling for days. It’s worth checking out, though, to drill down into the details.

Ridership is rarely constant. In fact, the ridership of the ‘L’ falls into three predictable groups of days: weekdays, Saturdays, and Sundays. The differences between these day-groups will effectively ruin any time-series analysis, because the average values vary so much between the three groups that any trends are going to be hard to spot. The graph would look very erratic. To account for this, any time-series graphs are split into those three groups.
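Splitting the data this way is simple to do. Here is a small Python sketch that buckets daily ridership records by day-group using the calendar date. The `(date, rides)` layout and the ride counts are made-up placeholders rather than the actual CTA file format, and holidays aren’t given any special handling.

```python
from collections import defaultdict
from datetime import date


def day_group(day):
    """Classify a date as 'Weekday', 'Saturday', or 'Sunday'."""
    weekday_index = day.weekday()  # Monday is 0, Sunday is 6
    if weekday_index == 5:
        return "Saturday"
    if weekday_index == 6:
        return "Sunday"
    return "Weekday"


def split_by_day_group(daily_rides):
    """Group (date, rides) records so each day-group can be plotted separately."""
    groups = defaultdict(list)
    for day, rides in daily_rides:
        groups[day_group(day)].append((day, rides))
    return dict(groups)


# Hypothetical ride counts for one station (NOT real CTA numbers).
sample = [
    (date(2012, 9, 7), 9000),
    (date(2012, 9, 8), 12000),
    (date(2012, 9, 9), 5000),
    (date(2012, 9, 10), 8500),
]
for group, records in split_by_day_group(sample).items():
    average = sum(rides for _, rides in records) / len(records)
    print(f"{group}: {len(records)} day(s), average {average:.0f} rides")
```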

I can’t write a post without tying baseball into it. This will be no exception, because in 2012 I spent a lot of time watching baseball games on the North Side and South Side of Chicago. Both ballparks are connected by the Red Line, and the stations are extremely close to the parks. So what would we expect to see? Baseball games have attendance ranging from 10k to 30k, so this should present a spike in daily ridership. I graphed the daily ridership for the two stops adjacent to the ballparks and then labeled when there was a home or away game.

Any spike or dip not explained by baseball is labeled. St. Patrick’s Day has a large spike all over the CTA system, but the spike at Addison is particularly high because of all the bars in the neighborhood. The largest non-baseball spike in the Addison station’s ridership came during the gay pride parade, since this is an incredibly popular event in a neighborhood near the station.

There is one anomaly I forgot to point out on the graph: there’s a spike when the Cubs are away on Sept 8th. At first I thought this might have been Labor Day, but it turns out that The Boss was playing at Wrigley Friday night, and I remember walking by it late at night. So there’s a spike for the 7th (the day the concert actually was) and then a larger spike on the 8th, presumably because the concert ended near or after midnight or people stayed after and drank in one of the fine establishments around the park. [I went to a Sox game earlier that day, so I accounted for a few data points that day.]

I arrived on June 12, 2012, right in the middle of a Cubs-Tigers series. Those were three really crowded games in Wrigleyville. Ridership at the two ballpark ‘L’ stations peaks during the summer, especially at Wrigley. You wouldn’t have guessed that the Sox were in contention for a division title up until the last week of the season. The Cubs have a huge tourist draw, including me, since I went to at least a dozen games while I lived there, and you can see a surge during the summer (vacation) months. I would leave Chicago on October 20, 2012, and start out on #SeanTrek about 11 days later.