Getting Lucky in a Playoff Series

Sports carry uncertainty and randomness in every aspect of the game, including how champions are determined. This is one area where you wouldn’t expect much variability, since you would want the team that has the best roster composition and played the hardest to win the championship. This concern usually comes up in arguments against the one-game Wild Card round that MLB introduced in 2012: too much can happen in one game to let it determine the fate of a season. [The counter-argument to this specifically is that the division winners now have a reward for winning the division, besides having cool sweatshirts.]

The Sports Side

The basis for championship series in MLB, NBA, and NHL is an odd-numbered series of games, with the champion being the team that wins the majority of games in that series. Most sports use a 7-game series; for example, the Boston Red Sox had to win 4 games to win the World Series last year. Using an example of randomness I got from Leonard Mlodinow’s The Drunkard’s Walk: How Randomness Rules Our Lives, I can illustrate how a team that’s clearly an underdog can win a playoff series against a superior opponent. Mlodinow has a recorded lecture where he explains what he wrote in his book. [It’s a good book, you should read it.]

Let’s use two teams; one is the Favorite, and it is assumed they will beat the Underdog 55% of the time [given enough games]. This also means that the Underdog will win 45% of the time. This represents win probabilities more uneven than you are likely to find in a playoff game, since teams are typically much more evenly matched [at least in baseball]. The last assumption of this example is that the teams’ win probabilities don’t change with a different starting pitcher or home field/court/ice advantage. These are terrible assumptions if you wanted to project a real playoff series, but the underlying principle of random sequencing will still hold true.

In order to win the playoff series, a team has to win a certain number of games before the opponent wins that number. To model this distribution based on pure randomness, you can use the negative binomial distribution to determine the probability that the Favorite will win a 7-game playoff series in 4, 5, 6, or 7 games. If you wanted to design a playoff series to minimize the chances that the underdog will win, you’d want to choose a number of games which would have the smallest probability of the underdog winning the series.

This chart shows the probability for all 8 possible outcomes of a 7-game playoff series based on a 55/45 winning percentage split and pure randomness with no home-field advantage. As you can see, there’s a substantial chance [39% probability] that the Underdog will win a 7-game series. 39% is rather large, and this is a 7-game series. Baseball also employs a 5-game series for the division series (LDS) and a one-game playoff for the Wild Card round (WC). An upset becomes more likely as the number of games decreases. I’ve also added another set of teams (60/40 split, a greater disparity) for comparison’s sake.

It should be obvious that the 1-game series has the greatest chance of an upset, hence the objections to its use in baseball. Though my contention would be that a 3-game series does not offer much more certainty that the best team will win.
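The comparison across series lengths can be checked by brute force. The sketch below is not the method behind the chart; it simply enumerates every win/loss sequence, assuming a fixed per-game win probability and pretending all games are played (which does not change who ends up with the majority). The function name `upset_probability` is made up for this illustration.

```python
from itertools import product

def upset_probability(p_favorite, n_games):
    """Chance the Underdog wins a best-of-n_games series.

    Enumerates every sequence of results as if all n_games were played;
    the team with the majority of wins takes the series either way.
    """
    wins_needed = n_games // 2 + 1
    total = 0.0
    for outcome in product((True, False), repeat=n_games):  # True = Favorite wins
        favorite_wins = sum(outcome)
        underdog_wins = n_games - favorite_wins
        if underdog_wins >= wins_needed:
            total += p_favorite ** favorite_wins * (1 - p_favorite) ** underdog_wins
    return total

for p in (0.55, 0.60):
    for n in (1, 3, 5, 7):
        print(f"{p:.0%} favorite, best-of-{n}: underdog wins {upset_probability(p, n):.2%}")
```

With a 55/45 split the upset chance only falls from 45% in a single game to roughly 39% in a 7-game series, which is the point of the chart.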

The Math Side

I first calculated these probabilities by writing out all the possible combinations then adding up those probabilities. I have since realized there was a much easier way to determine these probabilities, and that is by using the negative binomial distribution (NBD). If you want to familiarize yourself with what the distribution represents please read the count data primer. In short, the NBD will determine the probability that a team will lose a certain number of games [0-3] before the other team wins 4 games. The NBD is defined by the following function:

$P(X=k) = {{r+k-1}\choose{k}} p^{r} (1-p)^{k}$

where X is the random variable whose probability we are calculating, k is the number of Team A losses [this will vary], r is the number of Team A wins [for the 7-game series, it will be 4 games], and p is the probability of Team A winning a single game. In this example we will determine the probability of Team A winning a 7-game series in 6 games [k = 2] when Team A has a 55%/45% advantage over Team B.

$P(X=2) = {{4+2-1}\choose{2}} 0.55^{4} (1-0.55)^{2}$

$P(X=2) = 10 * 0.0915 * 0.2025 = 18.53\%$

This is the probability for just one possible outcome: Team A wins the series in 6 games. To determine the probability that Team A wins the series, you add the probabilities for each outcome in which Team A wins: in 4, 5, 6, or 7 games. So this calculation is then repeated for every loss possibility:

$P(WinningSeries) = P(X=0) + P(X=1) + P(X=2) + P(X=3)$

$P(WinningSeries) = 9.15\% + 16.47\% + 18.53\% + 16.68\% = 60.83\%$

From these calculations, there is a 60.83% chance that Team A wins the series. Conversely, there is a 39.17% [100% – 60.83%] chance that Team B, the inferior team, wins because of random sequencing.
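The arithmetic above is easy to reproduce. This is a minimal sketch of the negative binomial formula using Python’s `math.comb`; the function name `series_win_in` is only an illustrative choice.

```python
from math import comb

def series_win_in(losses, p_win, wins_needed=4):
    """P(team loses exactly `losses` games before its `wins_needed`-th win)."""
    return comb(wins_needed + losses - 1, losses) * p_win ** wins_needed * (1 - p_win) ** losses

# Team A wins 55% of individual games; it can lose 0-3 games and still win the series.
by_losses = {k: series_win_in(k, 0.55) for k in range(4)}
team_a_total = sum(by_losses.values())

for k, prob in by_losses.items():
    print(f"Team A wins in {4 + k} games: {prob:.2%}")
print(f"Team A wins the series: {team_a_total:.2%}")
print(f"Team B wins the series: {1 - team_a_total:.2%}")
```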

Conclusion

The MLB Wild Card game rightfully gets criticized for being too susceptible to having a bad day or getting a bad bounce. I wanted to illustrate that any playoff series has a lot of randomness in it. Beyond the numbers, people remember the bad bounces way more than they remember the positive or neutral events that occur [negativity bias]. A bad bounce or a pitcher having a bad day could just as easily benefit the team you are rooting for. The only real way to root out the randomness would be to play hundreds of games, and somehow I don’t think that is feasible.

Do MLB Playoff Odds Work?

One of the more fan-accessible advanced stats is playoff odds [technically postseason probabilities]. Playoff odds range from 0% to 100%, telling the fan the probability that a certain team will reach the MLB postseason. These are determined by creating a Monte Carlo simulation which runs the baseball season thousands of times [FanGraphs runs theirs 10,000 times]. In those simulations, if a team reaches the postseason 5,000 times, then the team is predicted to have a 50% probability of making the postseason. FanGraphs and Baseball Prospectus run these every day, so playoff odds can be collected daily and, when graphed, show the story of a team’s season.
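To make the Monte Carlo idea concrete, here is a toy sketch. It is not how FanGraphs or Baseball Prospectus build their models: the four-team league, the fixed per-game win probabilities, the two postseason slots, and the independent game draws are all assumptions for illustration, and it runs 2,000 seasons rather than 10,000 to keep it quick.

```python
import random

def simulate_playoff_odds(win_probs, games=162, slots=2, runs=2_000, seed=0):
    """Estimate each team's postseason probability by replaying the season.

    win_probs maps a team name to an assumed, fixed per-game win probability.
    Every game is drawn independently, which ignores the real schedule.
    """
    rng = random.Random(seed)
    appearances = {team: 0 for team in win_probs}
    for _ in range(runs):
        # Simulate one season: count wins for every team.
        wins = {
            team: sum(rng.random() < p for _ in range(games))
            for team, p in win_probs.items()
        }
        # Rank by wins, breaking ties with a random draw.
        ranked = sorted(wins, key=lambda team: (wins[team], rng.random()), reverse=True)
        for team in ranked[:slots]:
            appearances[team] += 1
    return {team: count / runs for team, count in appearances.items()}

# Hypothetical league, not real projections.
hypothetical_league = {"Team A": 0.58, "Team B": 0.53, "Team C": 0.50, "Team D": 0.45}
odds = simulate_playoff_odds(hypothetical_league)
for team, prob in odds.items():
    print(f"{team}: {prob:.1%}")
```

Re-running this each day with updated win probabilities and the games already played is what would produce a daily playoff odds curve.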

Above is a composite graph of three different types of teams. The Dodgers were identified as a good team early in the season and their playoff odds stayed high because of consistently good play. The Brewers started their season off strong but had two steep drop-offs in early July and early September. Even though the Brewers had more wins than the Dodgers, the FanGraphs playoff odds never valued the Brewers more than the Dodgers. The Royals started slow and had a strong finish to secure their first postseason berth since 1985. All these seasons are different and their stories are captured by the graph. Generally, this is how fans will remember their team’s season: by the storyline.

Since the playoff odds change every day and become either 100% or 0% by the end of the season, the projections need to be compared to the actual results at the end of the season. The interpretation of having a playoff probability of 85% means that 85% of the time teams with the given parameters will make the postseason.

I gathered the entire 2014 season of playoff odds from FanGraphs and put the predictions in buckets of 10% increments of playoff probability. If the model is well calibrated, about 20% of all the predictions in the 20% bucket should belong to teams that go on to the postseason. The same logic applies to all the buckets: 0%, 10%, 20%, etc.
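The bucketing can be sketched in a few lines. The data below is invented for illustration (it is not the 2014 FanGraphs data), and rounding each prediction to the nearest 10% is one assumed way to define the buckets.

```python
from collections import defaultdict

def calibration_table(predictions, bucket_pct=10):
    """Group (predicted_probability, made_postseason) pairs into buckets.

    Returns {bucket_percent: (number_of_predictions, observed_postseason_rate)}.
    """
    tallies = defaultdict(lambda: [0, 0])  # bucket -> [predictions, postseason teams]
    for prob, made_postseason in predictions:
        bucket = int(round(prob * 100 / bucket_pct)) * bucket_pct
        tallies[bucket][0] += 1
        tallies[bucket][1] += int(made_postseason)
    return {bucket: (n, made / n) for bucket, (n, made) in sorted(tallies.items())}

# Hypothetical daily predictions paired with the end-of-season result.
sample = [
    (0.08, False), (0.12, False), (0.11, True),
    (0.52, True), (0.49, False),
    (0.91, True), (0.88, True), (0.93, False),
]
table = calibration_table(sample)
for bucket, (n, rate) in table.items():
    print(f"{bucket:>3}% bucket: {n} predictions, {rate:.0%} made the postseason")
```

A well-calibrated model would show observed rates close to each bucket’s label once enough predictions are collected.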

Above is a chart comparing the buckets to the actual results. Since this is only using one year of data and only 10 teams made the playoffs, the results don’t quite match up to the buckets. The overall pattern is encouraging, but I would insist on looking at multiple years before drawing any real conclusions. The results for any given year are subject to the ‘stories’ of the 30 teams that play that season. For example, the 2014 season did have a team like the 2011 Red Sox, who failed to make the postseason after having a > 95% playoff probability. This is colloquially considered an epic ‘collapse’, but the 95% probability prediction not only implies there’s a chance the team might fail, it PREDICTS that 5% of such teams will fail. So there would be nothing wrong with the playoff odds model if ‘collapses’ like the Red Sox only happened once in a while.

The playoff probability model relies on an expected winning percentage. Unlike a binary variable like making the postseason, a winning percentage is a more continuous quantity, which makes the model easier to evaluate. For the most part, teams stay close to their initial predicted winning percentage and finish the season very near the prediction. Not every prediction is correct, but if there are enough good predictions the predictive model is useful. Teams also aren’t static, so bad teams can become worse by trading away players at the trade deadline, and other teams can improve by acquiring those good players. There are also factors like injuries or player improvement that the prediction system can’t account for because they are unpredictable by definition. The following line graph allows you to pick a team and check how they did relative to the predicted winning percentage. Some teams are spot on, but there are a few, like the Orioles or Red Sox, which are really far off.

The residual distribution [the actual values – the predicted values] should be a normal distribution centered around 0 wins. The following graph shows the residual distribution in numbers of wins. The teams in the middle had actual results close to the predicted values, while the values on the edges of the distribution are more extreme deviations. You would expect that improved teams would balance out the teams that got worse. However, the graph is skewed toward the teams that became much worse, implying that there is some mechanism that makes bad teams lose more often. This is where attitude, trades, and changes in strategy would come into play. I’d go so far as to say this is evidence that the soft skills of a team, like chemistry, break down.

Since I don’t have access to more years of FanGraphs projections or other projection systems, I can’t do a full evaluation of the team projections. More years of playoff odds should yield probability buckets that reflect the expectation much better than a single year. This would allow for more than 10 different paths to the postseason to be present in the data. In the absence of this, I would say the playoff odds and predicted win expectancy are on the right track and a good predictor of how a team will perform.

Statistics — Probability vs. Odds

Probability and odds are two basic statistical terms that describe the likeliness that an event will occur. They are often used interchangeably in casual conversation or even in published material. However, they are not mathematically equivalent because they look at likeliness in different contexts. In everyday conversation when numbers or values aren’t given, the two terms are synonymous. If an event has a high probability, then it has high odds of happening. The incorrect usage arises when a person ascribes a mathematical value to either the odds or probability they are discussing. Hopefully, if you aren’t quite sure what the exact mathematical difference is, this will clear it up for you.

Probability is defined as the fraction of desired outcomes in the context of every possible outcome, with a value between 0 and 1, where 0 would be an impossible event and 1 would represent an inevitable event. Probabilities are usually given as percentages. [e.g. a 50% probability that a coin will land on HEADS.] Odds can have any value from zero to infinity and they represent a ratio of desired outcomes versus the field. Odds are a ratio, and can be given in two different ways: ‘odds in favor’ and ‘odds against’. ‘Odds in favor’ describe the chance that an event will occur, while ‘odds against’ describe the chance that an event will not occur. If you are familiar with gambling, ‘odds against’ are what Vegas gives as odds. More on that later. For the coin flip, the odds in favor of a HEADS outcome are 1:1, not 50%.

Visual Math

Simple probability of event A occurring is mathematically defined as:

$P(A) = \frac{Number \ of \ Event \ A}{Total \ Number \ of \ Events}$

The best way to illustrate this is with the classic marbles-in-a-bag example. The graphic below depicts all the marbles in an opaque bag that one marble will be pulled out of. There are 6 blue, 3 red, 2 yellow, and 1 green for a total of 12 marbles in the bag.

The probability of pulling a red marble would be calculated by taking the total number of red marbles and dividing it by the total number of marbles.

OR

$P(RED) = \frac{3 \ RED \ marbles}{12 \ TOTAL \ marbles} = 25\%$.

Notice that the probability calculation includes the red marbles in the denominator, because probability considers the context of the entire event space. Odds, on the other hand, are the ratio of favorable outcomes to unfavorable outcomes. The denominator contains ONLY the marbles that aren’t favorable outcomes. Odds use the context of good outcomes versus bad outcomes. Written as fractions, these two values are completely different: the probability is 1/4 while the odds in favor are 1/3. You can see how mistakenly interchanging the terms could give the wrong information. The ‘odds in favor’ of RED would be mathematically calculated by:

$Odds\_Favor(RED) = \frac{3 \ RED \ marbles}{9 \ NOT \ RED \ marbles} = 1:3$.

To find ‘odds against’ you would simply flip odds in favor upside down and this describes the odds of the event not occurring.

OR

$Odds\_Against(RED) = \frac{9 \ NOT \ RED \ marbles}{3 \ RED \ marbles} = 3:1$.

Gambling

‘Odds against’ are commonly used in the context of gambling. When you hear that the Seattle Seahawks’ Vegas odds to win the Super Bowl are 5:1 [Retrieved 9/19/2014], the 5:1 refers to the ‘odds against’ Seattle winning the Super Bowl. Using some quick math, we can determine that the probability of Seattle winning the Super Bowl would be 1/(5+1) = 1/6, or 16.7%.

Vegas odds are technically payoff odds, because they describe the payout if you were to win the bet. The payout on the Seahawks would win you $5 for every $1 bet on Seattle winning the Super Bowl. They aren’t true odds, since no one is really sure what the true odds are, because you can’t simply count and weigh the possibilities like with the bag of marbles. The payoff will increase when the event becomes less likely. If you could create a reliable predictive model that told you the Seahawks actually had a 20% probability to win the Super Bowl, you could bet on the Seahawks, knowing that their actual probability to win is better than the 16.7% Vegas is giving them. And if you made enough bets like this you could beat Vegas.
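The reasoning behind “beating Vegas” is an expected value calculation. Here is a small sketch using exact fractions; it assumes a losing bet forfeits the full stake and ignores the bookmaker’s margin, and the function names are made up for this example.

```python
from fractions import Fraction

def implied_probability(odds_against):
    """Probability implied by 'odds against' of N:1, i.e. 1 / (N + 1)."""
    return 1 / (1 + odds_against)

def expected_profit_per_dollar(payoff_odds, true_probability):
    """Average profit on a $1 bet: win `payoff_odds` dollars or lose the $1 stake."""
    return true_probability * payoff_odds - (1 - true_probability)

vegas_odds = Fraction(5)                      # 5:1 against Seattle
break_even = implied_probability(vegas_odds)  # Fraction(1, 6)

print(f"Implied probability: {break_even} = {float(break_even):.1%}")
print("Profit per $1 if the true probability is 1/6:",
      expected_profit_per_dollar(vegas_odds, break_even))
print("Profit per $1 if the true probability is 20%:",
      expected_profit_per_dollar(vegas_odds, Fraction(1, 5)))
```

At the implied 1/6 the bet breaks even on average; at a true 20% it earns 20 cents per dollar on average, which is why a better model than the market’s would be valuable.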

Mathematical Relationship

I stated earlier that probability and odds were colloquially interchangeable when values aren’t given. This is true, because the two are mathematically related. Odds can be computed from probability and probability from odds.

$P(A) = \frac{Odds\_Favor(A)}{1 + Odds\_Favor(A)}$

$Odds\_Favor(A) = \frac{P(A)}{1 - P(A)}$

Using the RED marble example [P(RED) = 1/4 and Odds_Favor(RED) = 1/3] we can demonstrate how these are equivalent:

$P(RED) = \frac{1/3}{1 + 1/3} = \frac{1/3}{4/3} = \frac{1}{4}$

$Odds\_Favor(RED) = \frac{1/4}{1 - 1/4} = \frac{1/4}{3/4} = \frac{1}{3}$
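The same conversions can be written as a short sketch with Python’s `fractions.Fraction`, which keeps the marble arithmetic exact. The function names are illustrative only.

```python
from fractions import Fraction

marbles = {"blue": 6, "red": 3, "yellow": 2, "green": 1}
total = sum(marbles.values())  # 12 marbles in the bag

def probability(color):
    """Favorable outcomes over ALL outcomes."""
    return Fraction(marbles[color], total)

def odds_in_favor(color):
    """Favorable outcomes over UNFAVORABLE outcomes."""
    return Fraction(marbles[color], total - marbles[color])

def odds_to_probability(odds):
    return odds / (1 + odds)

def probability_to_odds(p):
    return p / (1 - p)

print("P(RED) =", probability("red"))                    # 1/4
print("Odds in favor of RED =", odds_in_favor("red"))    # 1/3
print("Odds against RED =", 1 / odds_in_favor("red"))    # 3 (i.e. 3:1)
print("Odds -> probability:", odds_to_probability(odds_in_favor("red")))
print("Probability -> odds:", probability_to_odds(probability("red")))
```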

MLB — Run Distribution Per Game & Per Inning — Negative Binomial

This is an extension of an earlier post I wrote about the runs per inning distribution. In this post I use the negative binomial distribution to better model how MLB teams score runs in an inning or in a game. I wrote a primer on the math of the different distributions mentioned in the post for reference.

The Baseball Side

A team in the American League will average .4830 runs per inning, but does this mean they will score a run every two innings? This seems intuitive if you apply math from Algebra I [1 run / 2 innings ~ .4830 runs/inning]. However, if you attend a baseball game, the vast majority of innings you’ll watch will be scoreless. This large number of scoreless innings can be described by discrete probability distributions that account for teams scoring none, one, or multiple runs in one inning.

Runs in baseball are considered rare events and count data, so they will follow a discrete probability distribution if they are random. The overall goal of this post is to describe the random process that arises with scoring runs in baseball. Previously, I’ve used the Poisson distribution (PD) to describe the probability of getting a certain number of runs within an inning. The Poisson distribution describes count data like car crashes or earthquakes over a given period of time and defined space. This worked reasonably well to get the general shape of the distribution, but it didn’t capture all the variance that the real data set contained. It predicted fewer scoreless innings and many more 1-run innings than what really occurred. The PD makes an assumption that the mean and variance are equal. In both runs per inning and runs per game, the variance is about twice as much as the mean, so the real data will ‘spread out’ more than a PD predicts.

The graph above shows an example of the application of count data distributions. The actual data is in gray and the Poisson distribution in yellow. It’s not a terrible way to approximate the data or to conceptually understand the randomness behind baseball scoring, but the negative binomial distribution (NBD) works much better. The NBD is also a discrete probability distribution, but it finds the probability of a certain number of failures occurring before a certain number of successes. It would answer the question, what’s the probability that I get 3 TAILS before I get 5 HEADS when I continue to flip a coin. This doesn’t at first intuitively seem like it relates to a baseball game or an inning, but that will be explained later.

From a conceptual standpoint, the two distributions are closely related. So if you are trying to describe why 73% of all MLB innings are scoreless to a friend over a beer, either will work. I’ve plotted both distributions for comparison throughout the post. The second section of the post will discuss the specific equations and their application to baseball.

Runs per Inning

Because of the difference in rules regarding the designated hitter between the two leagues, there will be a different expected value [average] and variance of runs/inning for each league. I separated the two leagues to get a better fit for the data. Using data from 2011-2013, the American League had an expected value of 0.4830 runs/inning with a 1.0136 variance, while the National League had 0.4468 runs/inning as the expected value with a 0.9037 variance. [So NL games are shorter and more boring to watch.] Using only the expected value and the variance, the negative binomial distribution [the red line in the graph] approximates the distribution of runs per inning more accurately than the Poisson distribution.

It’s clear that there are a lot of scoreless innings, and very few innings with multiple runs scored. This distribution allows someone to calculate the probability of an MLB team scoring more than 7 runs in an inning, or the probability that the home team, down by a run in the bottom of the 9th, forces extra innings. Using a pitcher’s expected runs/inning, the NBD could be used to approximate the pitcher’s chances of throwing a shutout, assuming he pitches all 9 innings.
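As a rough check on the scoreless-inning claim, the sketch below builds a negative binomial from just the AL mean and variance quoted above and compares its scoreless probability with the Poisson’s. It uses the ratio between consecutive terms rather than the gamma function; the function name `nbd_from_moments` is made up, and this is not necessarily how the graph was produced.

```python
from math import exp, log

def nbd_from_moments(mean, variance, max_k):
    """Negative binomial probabilities for k = 0..max_k from a mean and variance.

    Requires variance > mean. Uses alpha = mean / (variance - mean),
    r = mean * alpha and p = alpha / (1 + alpha).
    """
    alpha = mean / (variance - mean)
    r = mean * alpha
    p = alpha / (1 + alpha)
    probs = [exp(r * log(p))]  # P(X = 0) = p^r
    for k in range(max_k):
        # P(X = k + 1) = P(X = k) * (k + r) / (k + 1) * (1 - p)
        probs.append(probs[-1] * (k + r) / (k + 1) * (1 - p))
    return probs

AL_MEAN, AL_VARIANCE = 0.4830, 1.0136  # AL runs per inning, 2011-2013

nbd = nbd_from_moments(AL_MEAN, AL_VARIANCE, max_k=7)
poisson_scoreless = exp(-AL_MEAN)  # Poisson P(X = 0) = e^(-lambda)
more_than_seven = 1 - sum(nbd)     # everything beyond 7 runs

print(f"Scoreless inning, NBD:     {nbd[0]:.1%}")
print(f"Scoreless inning, Poisson: {poisson_scoreless:.1%}")
print(f"More than 7 runs, NBD:     {more_than_seven:.3%}")
```

The NBD puts scoreless innings at about 72%, much closer to what actually happens than the Poisson’s roughly 62%.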

Runs Per Game

The NBD and PD can be used to describe the runs scored in a game by a team as well. Once again, I separated the AL and NL, because the AL had an expected run value of 4.4995 runs/game and a 9.9989 variance, and the NL had 4.2577 runs/game expected value and 9.1394 variance. This data is taken from 2008-2013. I used a larger span of years to increase the total number of games.

Even though MLB teams average more than 4 runs in a game, the single most likely run total for one team in a game is actually 3 runs. The negative binomial distribution once again modeled the distribution well, but the Poisson distribution had a terrible fit when compared to the previous graph. Both models, however, underestimate the shut-out rate. A remedy for this is to adjust for zero-inflation. This would increase the likelihood of getting a shut out in the model and adjust the rest of the probabilities accordingly. An inference of needing zero-inflation is that baseball scoring isn’t completely random. A manager is more likely to use his best pitchers to continue a shut out rather than randomly assign pitchers from the bullpen.
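One common way to express zero-inflation is to mix the base distribution with an extra point mass at zero. The sketch below shows only the mechanics: the base probabilities and the 2% extra share are invented for illustration, not fitted to any baseball data.

```python
def zero_inflate(pmf, extra_zero_share):
    """Mix a base distribution with an extra point mass at zero.

    A share `extra_zero_share` of games is forced to zero runs; the remaining
    games follow the base distribution, so the total still sums to 1.
    """
    inflated = [(1 - extra_zero_share) * prob for prob in pmf]
    inflated[0] += extra_zero_share
    return inflated

# Toy base distribution for 0, 1, 2, 3 and "4 or more" runs (hypothetical values).
base = [0.05, 0.11, 0.14, 0.14, 0.56]
adjusted = zero_inflate(base, extra_zero_share=0.02)

for runs, (before, after) in enumerate(zip(base, adjusted)):
    print(f"{runs} runs: {before:.3f} -> {after:.3f}")
```

The shutout probability goes up and every other probability is scaled down by the same factor, which is the adjustment described above.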

Hits Per Inning

It turns out the NBD/PD are useful in many other baseball statistics like hits per inning.

The distribution for hits per inning is somewhat similar to runs per inning, except the expected value is higher and the variance is lower relative to the expected value. [AL: .9769 hits/inning, 1.2847 variance | NL: .9677 hits/inning, 1.2579 variance (2011-2013)] Since the variance is much closer to the expected value, the hits per inning distribution has more values in the middle and fewer at the extremes than the runs per inning distribution.

I could spend all day finding more applications of the NBD and PD, because there are really a lot of examples within baseball. Understanding how these discrete distributions work will help you understand how the game works, and they could be used to model outcomes within baseball.

The Math Side

Hopefully, you skipped down to this section right away if you are curious about the math behind this. I’ve compiled the numbers used in the graphs for the American League above for those curious enough to look at examples of the actual values.

The Poisson distribution is given by the equation:

$P(X = x) = \frac{e^{-\lambda}\lambda^x}{x!}$

This equation takes two inputs: the expected value [$\lambda$], which is the distribution’s only parameter, and the number of runs you want the probability for [$x$]. To determine the probability of a team scoring exactly three runs in a game, you would set $x = 3$ and, using the AL expected runs per game, you’d calculate:

$P(X = 3) = \frac{e^{-4.4995}4.4995^3}{3!} = 16.87\%$

This is repeated for the entire set of $x$ = {0, 1, 2, 3, 4, 5, 6, … } to get the Poisson distribution used throughout the post.
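A minimal sketch of that repetition, using only the standard library. The cutoff at 15 runs is an arbitrary choice for display.

```python
from math import exp, factorial

def poisson_pmf(x, lam):
    """P(X = x) for a Poisson distribution with expected value lam."""
    return exp(-lam) * lam ** x / factorial(x)

AL_RUNS_PER_GAME = 4.4995  # expected value used in the example above

distribution = [poisson_pmf(x, AL_RUNS_PER_GAME) for x in range(16)]
for runs, prob in enumerate(distribution):
    print(f"{runs:>2} runs: {prob:.2%}")
```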

One of the assumptions the PD makes is that the mean and the variance are equal. For these examples, this assumption doesn’t hold true, so the empirical data from actual baseball results doesn’t quite fit the PD and is overdispersed. The NBD accounts for the variance by including it in the parameters.

The negative binomial distribution is usually symbolized by the following equation:

$P(X=k) = {{r+k-1}\choose{k}} p^{r} (1-p)^{k}$

where $r$ is the number of successes, $k$ is the number of failures, and $p$ is the probability of success. A key restriction is that a success has to be the last event in the series of successes and failures.

Unfortunately, we don’t have a clear value for $p$ or a clear concept of what will be measured, because the NBD measures the probability of binary Bernoulli trials. It helps to view this problem from the vantage point of the fielding team or pitcher, because a SUCCESS will be defined as getting out of the inning or game, and a FAILURE will be allowing 1 run to score. This conforms to the restriction by having a success [getting out of the inning/game] as the final event of the series.

In order to make this work, the NBD needs to be parameterized differently: in terms of the mean, the variance, and the number of runs allowed [failures]. The following equations are derived from the mean and variance equations of a negative binomial. $\alpha$ represents the ‘odds in favor’ of getting out of the inning. And $r$ is the expected value multiplied by the ‘odds in favor’, which will yield a real, non-integer value for the number of successes. The NBD can then be written as

$P(X=k) = \frac{\Gamma(k+r)}{\Gamma(k+1)\Gamma(r)} (\frac{\alpha}{1+\alpha})^{r} (\frac{1}{1+\alpha})^{k}$

where

$r = Expected \ Value * \alpha; \ \alpha = \frac{Expected \ Value}{Variance - Expected \ Value}$

So using the same example as the PD distribution, this would yield:

$r = 4.4995 * 0.8182 = 3.6815 ; \alpha = \frac{4.4995}{9.9989 - 4.4995} = 0.8182$

$P(X=3) = \frac{\Gamma(3+3.6815)}{\Gamma(3+1)\Gamma(3.6815)} (\frac{0.8182}{1+0.8182})^{3.6815} (\frac{1}{1+0.8182})^{3}$

$= 14.36\%$

The above equations are adapted from this blog about negative binomials and this one about applying the distribution to baseball. The $\Gamma$ function is used in the equation instead of a combination operator because the combination operator, specifically the factorial, can’t handle the non-whole numbers we are using to describe the number of successes, and the gamma function is a continuous function from 0 to infinity.
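The reparameterized formula can be evaluated directly with the log-gamma function from Python’s standard library. This is a sketch of the equations as written above, not the code behind the graphs; the function name `nbd_pmf` and the use of `math.lgamma` (to avoid very large intermediate values) are my own choices.

```python
from math import exp, lgamma, log

def nbd_pmf(k, mean, variance):
    """P(X = k) for a negative binomial described by its mean and variance.

    Uses alpha = mean / (variance - mean) and r = mean * alpha, so the
    variance has to be larger than the mean (overdispersed data).
    """
    if variance <= mean:
        raise ValueError("variance must exceed the mean for this parameterization")
    alpha = mean / (variance - mean)
    r = mean * alpha
    # log of Gamma(k + r) / (Gamma(k + 1) * Gamma(r))
    log_coefficient = lgamma(k + r) - lgamma(k + 1) - lgamma(r)
    log_prob = log_coefficient + r * log(alpha / (1 + alpha)) + k * log(1 / (1 + alpha))
    return exp(log_prob)

AL_MEAN, AL_VARIANCE = 4.4995, 9.9989  # AL runs per game, 2008-2013

for runs in range(11):
    print(f"{runs:>2} runs: {nbd_pmf(runs, AL_MEAN, AL_VARIANCE):.2%}")
```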

Conclusion

The negative binomial distribution is really useful in modeling the distribution of discrete count data from baseball for a given inning or game. The most interesting aspect of the NBD is that a success is considered getting out of the inning/game, while a failure would be letting a run score. This is a little counterintuitive if you approach modeling the distribution from the perspective of the batting team. While the NBD has a better fit, the PD has a simpler concept to explain: the count of discrete events over a given period of time, which might make it better to discuss over beers with your friends.

The fit of the NBD suggests that run scoring is a negative binomial process, but inconsistencies, especially with shut outs, indicate that elements of the game aren’t completely random. I’m explaining the underestimated number of shut outs by the increased use of the best relievers in shut out games compared with other games, which increases the total number of shut outs and subsequently decreases the frequency of other run totals.

All MLB data is from retrosheet.org. It’s available free of charge from there. So please check it out, because it’s a great data set. If there are any errors or if you have questions, comments, or want to grab a beer to talk about the Poisson distribution please feel free to tweet me @seandolinar.

Count Data Distribution Primer — Binomial / Negative Binomial / Poisson

Count data is exclusively whole number data where each increment represents one of something. It could be a car accident, a run in baseball, or an insurance claim. The critical thing here is that these are discrete, distinct items. Count data behaves differently than continuous data, and the distribution [frequency of different values] is different between the two. Random continuous data typically follows the normal distribution, which is the bell curve everyone remembers from high school grade systems. [Which is a really bad way to grade, but I digress.] Count data generally follows the Binomial/Negative Binomial/Poisson distribution, depending on the context in which you are viewing the data; all three distributions are mathematically related.

Binomial Distribution:

The binomial distribution (BD) is the collection of probabilities of getting a certain number of successes in a given number of trials specifically measuring Bernoulli trials [a yes/no event similar to a coin flip, but it’s not necessarily 50/50]. My favorite example to understand the binomial distribution is using it to determine the probability that you’d get exactly 5 HEADS if you flipped a coin 10 times [it’s NOT 50%!].

It’s actually 24.61%. The probability of getting heads in any given coin flip is 50%, but over 10 flips, you’ll only get exactly 5 HEADS and 5 TAILS about 25% of the time. The equation below gives the two popular notations for the binomial probability mass function. $n$ is the total number of trials [the graph above used $n$ = 10]. $r$ is the number of successes you want to know the probability for. You calculate this function for each number of HEADS [0-10] for $r$ to get the distribution above. $p$ is the simple probability of success for each trial [$p$ = .5 for the coin flip].

$P(X=r) = {{n}\choose{r}} p^{r} (1-p)^{n-r} = \frac{n!}{r!(n-r)!} p^{r} (1-p)^{n-r}$

The equation has three parts. The first part is the combination ${{n}\choose{r}}$, which is the number of combinations when you have $n$ total items taken $r$ at a time. Combinations disregard order, so the set {1, 4, 9} is the same as {4, 9, 1}. This part of the equation tells you how many possible ways there are to get to a certain outcome, since there are many ways to get 5 HEADS in 10 tosses. Since ${{10}\choose{5}}$ is larger than any other combination, 5 HEADS will have the largest probability.

There are two more terms in the equation. $p^r$ is the joint probability of getting $r$ successes in a particular order, and $(1-p)^{n-r}$ is the corresponding probability of getting the failures, also in a particular order. I find it helpful to conceptualize the equation as having three parts accounting for different things: the total combinations of successes and failures, the probability of the successes, and the probability of the failures.
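The three parts map directly onto code. A minimal sketch of the 10-flip example, with `binomial_pmf` as an illustrative name:

```python
from math import comb

def binomial_pmf(r, n, p):
    """P(exactly r successes in n trials), each with success probability p."""
    ways = comb(n, r)                     # number of orderings with r successes
    success_part = p ** r                 # r successes in one particular order
    failure_part = (1 - p) ** (n - r)     # n - r failures in that same order
    return ways * success_part * failure_part

heads_distribution = [binomial_pmf(r, 10, 0.5) for r in range(11)]
for heads, prob in enumerate(heads_distribution):
    print(f"{heads:>2} HEADS: {prob:.2%}")
```

The line for 5 HEADS comes out to 24.61%, the largest value in the list.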

Negative Binomial Distribution:

While there is a good reason for it, the name of the negative binomial distribution (NBD) is confusing. Nothing I will present will involve making anything negative so, let’s just get that out of the way and ignore it. The binomial distribution uses the probability of successes in the total number of ATTEMPTS. To contrast this, the negative binomial distribution uses the probability that a certain number of FAILURES occur before the $r$th SUCCESS. This has many applications specifically when a sequence terminates after the $r$th success such as modeling the probability that you will sell out of the 25 cups of lemonade you have stocked for a given number of cars that pass by. The idea is that you would pack up your lemonade stand after you sell out, so cars that would pass by after the final success won’t matter. Another good example is modeling the win probability of a 7-game sports playoff series. The team that wins the series must win 4 games and specifically the last game played in the series, since the playoff series terminates after one team reaches 4 wins.

One of the more important restrictions on the NBD is that the last event must be a success. Going back to the sports playoff series example, the team that wins the series will NEVER lose the last game. With the 10 coin-flip example, the BD was looking for the probability of getting a certain number of HEADS within a set number of coin flips. Using the NBD, we will look for the probability of getting a certain number of TAILS before the 5th HEADS. The total number of flips will not ALWAYS equal 10 and can actually exceed 10, as seen below.

The probability mass function that describes the NBD graph above is given below:

$P(X=k) = {{r+k-1}\choose{k}} p^{r} (1-p)^{k}$

The equation for the NBD has the same parts as the BD: the combinations, the successes, and the failures. In the NBD case there are fewer combinations than in the BD [for the same total number of coin flips]. This is because the last outcome is held fixed at a success. The probability of success and failure parts of the equation are conceptually the same as in the BD. The failure portion is written differently because the number of failures is a parameter $k$ instead of a derived quantity like [$n-r$].
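One way to see how the two distributions connect is to compare them on the same coin. In the sketch below (names are illustrative), getting the 5th HEADS with at most 5 TAILS along the way is the same event as seeing at least 5 HEADS in the first 10 flips, so the two sums should agree.

```python
from math import comb

def nbd_pmf(k, r, p):
    """P(exactly k failures occur before the r-th success)."""
    return comb(r + k - 1, k) * p ** r * (1 - p) ** k

# Probability of k TAILS before the 5th HEADS with a fair coin.
tails_before_fifth_head = {k: nbd_pmf(k, 5, 0.5) for k in range(11)}
for tails, prob in tails_before_fifth_head.items():
    print(f"{tails:>2} TAILS (total flips = {5 + tails}): {prob:.2%}")

# 5th HEADS arrives within 10 flips  <=>  at least 5 HEADS in 10 flips.
within_ten_flips = sum(nbd_pmf(k, 5, 0.5) for k in range(6))
at_least_five_heads = sum(comb(10, h) * 0.5 ** 10 for h in range(5, 11))
print(f"NBD sum: {within_ten_flips:.4f}, BD sum: {at_least_five_heads:.4f}")
```

Any row with more than 5 TAILS is a sequence longer than 10 flips, which the binomial setup cannot represent.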

Poisson Distribution:

The Poisson Distribution (PD) is directly related to both the BD and the NBD, because it is the limiting case of both of them. As the number of trials goes to infinity while the expected count stays fixed, the Poisson distribution emerges. The graph for the PD will look similar to the NBD or the BD, but there is no coin flip example for comparison, since the PD describes counts from a non-discrete process like traffic flow or earthquakes. The major difference is not what is represented, but how it is viewed and calculated. The Poisson distribution is described by the equation:

$P(X = x) = \frac{e^{-\lambda}\lambda^x}{x!}$

$\lambda$ is the expected value [or the mean] for an event and $x$ is the count value. If you knew that an average of 0.2 car crashes happen at an intersection on a given day, then you could evaluate the equation for $x$ = {0, 1, 2, 3, 4, 5, … } and get the PD for the problem.
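Here is a minimal sketch of that calculation, using the 0.2 crashes per day from the example. It also prints a binomial distribution with a large number of trials next to it to show the limiting behavior; the choice of 10,000 trials and the function names are assumptions made for illustration.

```python
from math import comb, exp, factorial

def poisson_pmf(x, lam):
    """P(X = x) for a Poisson distribution with mean lam."""
    return exp(-lam) * lam**x / factorial(x)

def binomial_pmf(x, n, p):
    """P(X = x) successes in n independent trials with success probability p."""
    return comb(n, x) * p**x * (1 - p)**(n - x)

lam = 0.2      # average crashes per day, from the example in the text
n = 10_000     # assumed large number of trials, chosen only for illustration
for x in range(6):
    # The binomial uses p = lam / n so both distributions share the same mean.
    print(x, round(poisson_pmf(x, lam), 6), round(binomial_pmf(x, n, lam / n), 6))
```

The two columns should agree to several decimal places, which is the limiting relationship described above.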

One of the restrictions and major issues with the use of the PD is that the model assumes the mean and the variance are equal. In most real data instances the variance is greater than the mean, so the PD tends to favor more values around the expected value than real data reflects.
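One quick way to check this restriction against a data set is to compare the sample variance to the sample mean. The sketch below first confirms numerically that the PD’s mean and variance match, then computes a variance-to-mean ratio for a list of counts. The counts are made up for the example, and the helper names are my own.

```python
from math import exp, factorial
from statistics import mean, variance

def poisson_pmf(x, lam):
    """P(X = x) for a Poisson distribution with mean lam."""
    return exp(-lam) * lam**x / factorial(x)

def mean_and_variance(pmf, support):
    """Mean and variance of a discrete distribution over the given support."""
    mu = sum(x * pmf(x) for x in support)
    var = sum((x - mu) ** 2 * pmf(x) for x in support)
    return mu, var

def dispersion_ratio(counts):
    """Sample variance divided by sample mean; about 1 for Poisson-like data."""
    return variance(counts) / mean(counts)

lam = 0.2
# Truncating the support at 50 is an approximation; the tail beyond it is negligible here.
print(mean_and_variance(lambda x: poisson_pmf(x, lam), range(50)))

made_up_counts = [0, 0, 0, 0, 5]  # hypothetical data, not from any real source
print(dispersion_ratio(made_up_counts))
```

A ratio well above 1 suggests the data are more spread out than a PD with the same mean would allow.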

If you are interested in the derivations and math behind these I recommend this site: http://statisticalmodeling.wordpress.com/. I feel like they explain the derivation of the negative binomial better than most places I’ve found. It addresses why it’s called the NEGATIVE binomial distribution as well. The site also contains derivations of the PD being the limiting case of the BD and NBD.

Moving Average Time Series — Baseball

Usually I use stats to describe baseball, but this post is going to use baseball to illustrate stats. There’ll be some math. If that scares you, you’ve been duly warned. Also I have collected the SAS output for each model for technical reference.

A time series is data that has been collected at a regular interval over time. This is rather intuitive when given the definition, but time series are different from cross-sectional data, which is the type of data set most people are familiar with. The closing price of a stock is a time series, because it’s a measurement at 4PM every M-F. Cross-sectional data would be looking at which types of stocks in your portfolio gained the most over a quarter. This is one measurement (quarterly change) made for many different stocks. Not every data set fits neatly into a category, and the analysis goal is different for each instrument.

The goal of univariate time series analysis (TSA) is to forecast a variable only using past observations of that variable. In the case of the stock market example, TSA seeks to project what the closing price for the next day will be using data from the specified time frame. However, finance is boring and I wanted a data set that I can extract some insight from, so we’ll be looking at MLB strikeouts (K) per year and home runs (HR) per year as the data sets.

What does a time series look like? If you scroll down or look up a stock market graph, you’ll see what a time series looks like. It’s messy. I created this data set, so I can describe this process accurately. It’s a first-order moving average process with a lag-1 coefficient of 0.9 and a series mean of 0. I’ve also included the ordinary least squares (OLS) linear regression trend for the time series, which shows a slightly positive trend. This is a typical analytical technique to show that a time series is moving. In this case the trend is non-significant over these 50 data points. There is no trend, and the mean is zero.

The model that corresponds to the graph above has the general form as follows:

$y_t = \mu + a_t + \theta_1 a_{t-1}$

where $y$ is the time-dependent target variable, $\mu$ is the average of the entire series of data, $\theta_1$ is the regression coefficient on the previous period’s shock, and $a$ is a time-dependent shock to the system. The $t$ subscripts describe which time period the variable is from, starting with the most current one, $t=50$.

Before describing the model above, it is important to fully understand what the $a_t$ represents. This is a shock term that can encompass a lot of different things. If you are considering something like quarterly earnings, factors influencing the shock term are unemployment, economic growth, marketing campaigns, etc. We are looking at the data in the absence of this knowledge, and since we are in the dark, the causes of the shocks appear random. The $a_t$ terms should be normally distributed and not autocorrelated. The expected value should be zero, $E[a_t] = 0$. The expected value is another way to describe the average of all the $a_t$ terms.
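To make the model concrete, here is a small simulation sketch of a first-order moving average process with normally distributed, independent shocks. It is not the code behind the graph in this post; the function names, the seed, and the unit shock variance are assumptions chosen for illustration.

```python
import random

def simulate_ma1(n, mu=0.0, theta=0.9, sigma=1.0, seed=None):
    """Simulate y_t = mu + a_t + theta * a_{t-1} with independent normal shocks."""
    rng = random.Random(seed)
    # Draw n + 1 shocks so the first observation has a previous shock to use.
    shocks = [rng.gauss(0.0, sigma) for _ in range(n + 1)]
    return [mu + shocks[t] + theta * shocks[t - 1] for t in range(1, n + 1)]

def lag1_autocorrelation(series):
    """Sample correlation between each value and the value one period earlier."""
    n = len(series)
    mean = sum(series) / n
    num = sum((series[t] - mean) * (series[t - 1] - mean) for t in range(1, n))
    den = sum((y - mean) ** 2 for y in series)
    return num / den

series = simulate_ma1(50, mu=0.0, theta=0.9, seed=1)
print(lag1_autocorrelation(series))
```

For this model the theoretical lag-1 autocorrelation is $\theta_1 / (1 + \theta_1^2)$, about 0.497 when $\theta_1 = 0.9$. With only 50 points the sample value can land noticeably above or below that.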

Here’s a great way to think about the MA process. Think about simplified personal monthly expenditures where you have a constant salary and a modest savings account. Shocks that would be included in the $a_t$ term would be unexpected expenses. The unexpected expense could influence the next time period if you had to dip into savings. So a high unexpected expense in January would impact the spending in February, because you’d have to pay off your credit card or put money back into savings.

There are many more details to understanding time series such as autocorrelation. Hopefully I’ll write a separate post on that in the future.

Let’s look at some real data. Luckily, I have every play from MLB in a database thanks to retrosheet.org, so we’ll look at some time series from there, specifically HRs and Ks per year. Conceptually, for this rudimentary modeling, an MA process makes sense. A shock from the previous year like expansion, steroids, or selection bias would carry over year to year. Looking at the time series graph below, it doesn’t behave like the previous time series that was centered around zero. This time series is considered non-stationary, which means there’s a trend and that trend changes over time. The number of HRs per season increased over time up until around 2001, when it leveled off and started to decline. There’s a trend up until 2001 and a trend after it, and they aren’t the same. To get around this, instead of modeling the actual values, the differences between two consecutive years of HRs will be modeled. A difference ($\nabla$) is simply $y_t - y_{t-1}$. For example, the difference in HRs between 2013 and 2012 would be -279 HRs.
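Differencing is simple enough to show in a couple of lines. This is a minimal sketch with made-up yearly totals, not the Retrosheet numbers, and the function name is my own.

```python
def difference(series):
    """First difference: y_t - y_{t-1} for each consecutive pair of values."""
    return [current - previous for previous, current in zip(series, series[1:])]

yearly_totals = [100, 120, 115, 130]  # hypothetical values for illustration
print(difference(yearly_totals))  # [20, -5, 15]
```

The differenced series is one value shorter than the original, since the first year has no earlier year to subtract.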

The green line is the actual HRs each year. The ‘cantaloupe’ colored lines are the 50% confidence interval (CI) of the forecast. The red line is the forecasted values. I used 50% CIs to show likely deviations, not statistically significant deviations.

The differenced moving average model [ARIMA(0,1,1)] takes the form:

$\nabla y_t = \mu + a_t - \theta a_{t-1}$

Substituting the estimated coefficient for $\theta$ and $\mu$ a forecast can be made with the following equation:

$y_{t+1} = \mu + y_t + a_{t+1} - \theta * a_{t}$

$y_{t+1} = 50.11163 + y_t + a_{t+1} - .45073 * a_{t}$

The last equation is used to generate the forecast line and ultimately the 50% CI lines. The interpretation of this equation is that roughly half of the shock from the previous time period still has an effect on the change to the current period. The forecast predicts that home runs will actually increase relative to the past few years and not continue the decline. Looking backwards, the model can be used to identify some years of interest, and I’ve marked those on the graph. Expansion probably has the greatest impact on the number of HRs, because it dilutes the talent pool and increases the total number of games per season. If you wanted to measure the impact training or steroids had on HRs, you’d want to use a HR/game time series [see below] instead of total HRs. [This is total HRs between both teams.]
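The forecast equation can be sketched in a few lines. Since the future shock $a_{t+1}$ has an expected value of zero, the one-step forecast is $\mu + y_t - \theta a_t$, where $a_t$ has to be recovered from the past data. The sketch below does that recursively and assumes the shock before the first observation is zero, which is a simplification and not necessarily what SAS does. The function names and the toy series are made up; only the two coefficients come from the equation above.

```python
def estimate_shocks(series, mu, theta):
    """Recover shocks from a_t = (y_t - y_{t-1}) - mu + theta * a_{t-1}.

    Assumes the shock before the first difference is 0 (a simplification).
    """
    shocks = []
    previous_shock = 0.0
    for previous, current in zip(series, series[1:]):
        shock = (current - previous) - mu + theta * previous_shock
        shocks.append(shock)
        previous_shock = shock
    return shocks

def forecast_next(series, mu=50.11163, theta=0.45073):
    """One-step forecast mu + y_t - theta * a_t, using E[a_{t+1}] = 0."""
    shocks = estimate_shocks(series, mu, theta)
    last_shock = shocks[-1] if shocks else 0.0
    return mu + series[-1] - theta * last_shock

toy_series = [0, 2, 3]  # hypothetical values, not real HR totals
print(forecast_next(toy_series, mu=1.0, theta=0.5))
```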

The HR/Gm is the time series that a baseball analyst would want to use, because it controls for extra games from expansion, so the trends are also less pronounced. This is still a non-stationary time series, so it needs to be differenced like the previous model and can be described by the following equation:

$y_{t+1} = 0.0045989 + y_t + a_{t+1} - .49927 * a_{t}$

Still, the greatest shocks are the expansion years, which tend to have a bit of a lingering effect before regressing. 1987 now stands as a really enigmatic outlier. There was no expansion that year. The best explanation is there was a strike zone change, but I can only find that in one article. The home run outburst of the late 90s and early 2000s coincides with the ‘steroid era’ and two close periods of expansion. This post isn’t interested in analyzing steroids’ effect on MLB, only in noting that its ‘shock’ is mixed in with the expansion team ‘shock’. Also, it should be noted HRs/Gm haven’t returned to pre-1993 expansion levels.

Looking at the opposite of a home run, the strike outs per year has a trend that is much more steady, and it’s increasing.

The graph displayed above is also a differenced first-order moving average process, ARIMA(0,1,1). Its equation looks very similar to the last two, so I won’t write it out. The parameters can be found in the SAS output appendix I have for this page. The forecast has a definite increase in total strikeouts over the next few years. Just like the HR per year time series, the time series of Ks is best analyzed by looking at the K/Gm. The K/Gm time series turns out to be a different model than the first three models, because it is just a random walk around a linear trend.

This process has random shocks around a positive trend with no ‘memory’ of the past shocks like the other three models had. This model for K/Gm, ARIMA(0,1,0), looks a little different than the ARIMA(0,1,1) models seen earlier since there is no lagged $a_{t-1}$ term. The ARIMA(0,1,0) model is given by the following equation:

$\nabla y_t = \mu + a_t$

and the forecast equation with parameters in it would be:

$y_{t+1} = 0.11637 + y_t + a_{t+1}$

This indicates that the K/Gm will increase by 0.11637 every year on average. Obviously, since there are only 54 outs in a baseball game, this trend can’t go on forever. As of the beginning of August 2014, the current K/Gm is 15.4 and it is forecasted to be 15.2497, which is within the 50% CI of the forecast.
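Because the future shocks average out to zero, a forecast several years out from this model is just the last value plus the drift times the number of years. The sketch below shows that, along with a rough count of how many years the trend would need to hit a ceiling such as 54 outs. The function names are my own, and starting from the 2014 forecast of 15.2497 is an assumption made for the example.

```python
from math import ceil

def drift_forecast(last_value, drift, steps):
    """Forecast for a random walk with drift: last value plus drift per step."""
    return last_value + drift * steps

def years_until(limit, last_value, drift):
    """Whole years needed for the drift-only forecast to reach the limit."""
    return ceil((limit - last_value) / drift)

drift = 0.11637            # estimated yearly increase in K/Gm from the model
start = 15.2497            # assumed starting point: the 2014 forecast in the text
print(drift_forecast(start, drift, 5))
print(years_until(54, start, drift))
```

The second number comes out in the hundreds of years, so the ceiling is not a practical concern for short forecasts, even though the linear trend has to break eventually.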

While these models can make predictions about baseball, I wouldn’t consider these the best [or even good] models for forecasting, since we could incorporate other variables or improve the granularity of the forecast to individual players. There also isn’t much value in saying there’ll be more strikeouts in 2014 than 2013. However, this example is a good academic exercise in understanding how univariate time series work. And hopefully it provides some insight into both time series and a little bit about trends in baseball.

Where Do People Tweet?

This is a representative map of Twitter from 11am to 11pm EDT yesterday.

Chicago Transit Authority — Ridership

Waiting for the break of day…oooOOOOO…25 or 6 to 4!
-Chicago (formerly The Chicago Transit Authority)

I was lucky to live in Chicago during the summer of 2012. The thing I miss most from Chicago is the transit system. Taking the ‘L’ to work every day was much more relaxing and interesting than having to drive in. No parking hassles or gas. It was great. Transit data is critical to making those systems much more efficient. Fortunately, Rahm Emanuel is kind enough to release some of the transit data from the Chicago Transit Authority (CTA). The data only contains ridership per day information from each station, so I am limited in the insight this analysis can produce.

Before diving into the results of the descriptive analytics, let’s look at how the ‘L’ is designed. There are eight different lines, each designated by a color. All of the lines go through the downtown area called ‘The Loop’, so named because the elevated track forms a huge loop around a huge block of the city. There are two main lines which also go underground: the Red and Blue Lines. These two lines run all night and carry the most passengers. When a Chicagoan rides the ‘L’, they swipe their pass at the entrance of a station, then board their desired train in either direction. Unlike transit systems like the Metro in DC, there is no exit swipe. So every data point in this post is going to be a person swiping at a station to board a train, but we can’t determine which direction or destination.

There’s also another problem: several stations service multiple lines. Clark/Lake has practically every line go through it. Without more resolution in how the data are measured, the most I can infer from the data is which stations are the most popular on certain days. This comes from the assumption that if a person arrives at a station, they will leave from the same station.

This visualization looks a lot like a CTA map. I don’t have a good way to automatically draw the lines between the stations, but I think the location data that’s attached to the station names does a good job of recreating the CTA map everyone is used to seeing. I’ve labeled any ‘L’ stations which service multiple lines in an order of importance. The priority is Red, Blue, Orange, Brown, Green, Purple, Pink, Yellow, in that order. The reasoning behind this is that these are the largest or most popular lines, so the station will have the majority of patrons using these lines. From this map, Clark/Lake, the station with the most train lines, is the most popular. Terminuses (termini?) of the lines also have a lot of use. This can help visualize where the transfer points, the most popular entry points, or destinations are. The Red Line and Blue Line have the most stops and the most ridership. Admittedly, this analysis has problems parsing Brown/Red Line customers, but there is higher ridership at the non-transfer stations of the Red Line, which suggests that more customers are using the Red Line in general. I have a separate post for the chart of the ridership of every station. The chart is way too big to put in this post; you’d be scrolling for days. It’s worth checking out though to drill down into the details.

Ridership is rarely constant. In fact, the ridership of the ‘L’ falls into three predictable groups of days: weekdays, Saturdays, and Sundays. The differences between the day-groups will effectively ruin any time-series analysis, because the average values of the three groups vary so much that any trends are going to be hard to spot. The graph would look very erratic. To account for this, any time-series graphs are split into those three groups.
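The split itself can be sketched as below. This is a minimal example that derives the group from the calendar date, with made-up ride counts and function names of my own. It ignores holidays, which is an assumption; a holiday that falls on a weekday would probably belong with the Sundays.

```python
from collections import defaultdict
from datetime import date

def day_group(day):
    """Classify a date as 'weekday', 'Saturday', or 'Sunday' (holidays ignored)."""
    weekday = day.weekday()  # Monday is 0, Sunday is 6
    if weekday == 5:
        return "Saturday"
    if weekday == 6:
        return "Sunday"
    return "weekday"

def split_by_day_group(records):
    """Split (date, rides) pairs into one list per day group."""
    groups = defaultdict(list)
    for day, rides in records:
        groups[day_group(day)].append((day, rides))
    return dict(groups)

# Hypothetical daily ride counts for a single station.
records = [
    (date(2012, 9, 7), 9000),
    (date(2012, 9, 8), 12000),
    (date(2012, 9, 9), 5000),
]
for group, rows in split_by_day_group(records).items():
    print(group, rows)
```

Each group can then be graphed as its own time series.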

I can’t write a post without tying baseball into it. This will be no exception, because in 2012 I spent a lot of time watching baseball games on the North Side and South Side of Chicago. Both teams are connected by the Red Line, and the stations are extremely close to the parks. So what would we expect to see? Baseball games have attendance ranging from 10k to 30k, so this should present a spike in daily ridership. I graphed the daily ridership for the two stops adjacent to the ballparks and then labeled when there was a home or away game.

Any spike or dip not explained by baseball is labeled. St. Patrick’s Day has a large spike all over the CTA system, but the spike at Addison is particularly high because of all the bars in the neighborhood. The largest non-baseball spike in the Addison station’s ridership came during the gay pride parade, since this is an incredibly popular event in a neighborhood near the station.

There is one anomaly I forgot to point out on the graph: there’s a spike when the Cubs are away on Sept 8th. At first I thought this might have been Labor Day, but it turns out that The Boss was playing at Wrigley Friday night, and I remember walking by it late at night. So there’s a spike for the 7th (the day the concert actually was) and then a larger spike for the 8th, presumably because the concert ended near or after midnight or people stayed after and drank in one of the fine establishments around the park. [I went to a Sox game earlier that day, so I accounted for a few data points that day.]

I arrived on June 12, 2012, right in the middle of a Cubs-Tigers series. Those were three really crowded games in Wrigleyville. Ridership at the two ballpark ‘L’ stations peaks during the summer, especially at Wrigley. You wouldn’t have guessed that the Sox were in contention for a division title up until the last week of the season. The Cubs have a huge tourist draw, including me, because I went to at least a dozen games while I lived there, and you can see a surge during the summer (vacation) months. I would leave Chicago on October 20, 2012, and start out on #SeanTrek about 11 days later.

Text Message Analytics — Numbers

People communicate a lot through text messages, and lucky for me, iPhones keep track of the text messages I’ve sent. iPhones store your text messages in a SQLite database, and this database is readily accessible in your iPhone backup on your computer. [This is why encrypting your backup might be a good idea if you have sensitive data.] I want to eventually perform some advanced text analytics to try to interpret the content of the text messages. This post is only going to look at the ‘numbers’ aspect of my text messages. All the numbers on the following pages include both sent and received texts, and exclude texts that I either deleted or that were deleted by the system. [I know I’ve deleted threads. I don’t think iOS deletes old messages, but it’s a possibility till I know otherwise.]

The simplest stat from text messages is how many I have sent/received per day or per week. The chart below has both. The notable trend is that there have been more text messages sent/received the longer I’ve had my iPhones. I’d suggest this is a little biased, since I would be more likely to have deleted text threads that are much older, but I still think there would be a slight upward trend regardless.

I wanted to look at area codes just out of curiosity. I thought that I would have the most texts between me and a 412 or 724 number. I’m a little surprised how many 412 numbers there are, given how many people I know living in the Pittsburgh suburbs and that I’m a 724. I think 412 is Allegheny County, while 724 is anything outside of that. I’m also a little surprised how close the number of traditional SMS text messages, which go through your carrier network as opposed to over the Internet, is to the number sent over the Internet, since most of my friends have iPhones.

The last chart is my favorite, a breakdown of how often I text for each hour of the day. I think this tells you something about my behavior, albeit nothing common sense won’t tell you. I generally text earlier in the morning (7am to 10am) more often during the work week compared to the weekend. There’s a spike at 12PM (lunch time) and 9PM (making plans/socializing) for any day of the week. There’s virtually no texting between 4am and 6am. There are some texts that occur after the ‘Ted-Mosby-Hour’ of 2am, after which nothing good happens, but not a spike like there might have been in college.

‘Kids, if it’s after 2am, don’t text, just go home….and watch How I Met Your Mother.’
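The hour-of-day breakdown described above boils down to counting messages by hour, split by weekday versus weekend. Here is a minimal sketch of that tally. It assumes the message timestamps have already been pulled out of the database and converted to local-time `datetime` objects; the timestamps shown are made up, and the function name is my own.

```python
from collections import Counter
from datetime import datetime

def hourly_counts(timestamps):
    """Count messages per hour of day, separately for weekdays and weekends."""
    counts = {"weekday": Counter(), "weekend": Counter()}
    for ts in timestamps:
        # datetime.weekday() returns 5 for Saturday and 6 for Sunday.
        key = "weekend" if ts.weekday() >= 5 else "weekday"
        counts[key][ts.hour] += 1
    return counts

# Hypothetical message timestamps for illustration.
timestamps = [
    datetime(2014, 8, 4, 12, 5),   # a Monday at lunch time
    datetime(2014, 8, 4, 12, 40),
    datetime(2014, 8, 2, 21, 15),  # a Saturday evening
]
counts = hourly_counts(timestamps)
for hour in range(24):
    print(hour, counts["weekday"][hour], counts["weekend"][hour])
```

A `Counter` returns 0 for hours with no messages, so the loop over all 24 hours works even for the quiet stretch between 4am and 6am.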

#SeanTrek GeoTracks 2012

You might remember #SeanTrek — the 46 day, 12,000 mile, 34 state excursion I took back at the very end of 2012. I didn’t know how I was going to use this at the time, but I geotagged just about everything I did on the trip. I checked in to every place on Foursquare and obtained over 700 points in Portland and San Francisco, which is insane, because I checked in at just about everything I did or place I went. On top of Foursquare, I geotagged every tweet I sent and picture I took. This resulted in me now having thousands of data points of both timestamps and location data.

The above map is what happens when you put all of them together. It outlines my entire trip! The denser the marks, the longer I was in one place exploring it. Sparse points mean I was driving a lot. You’ll find a lot of marks around Pittsburgh, Portland, SF, LA, Austin, and New Orleans, because I spent the most time there and didn’t drive much in most of those cities. I have a rather nice record of a long trip that didn’t require me to painstakingly record exactly what I did.

This map only has geotag data and the type of media. I’m hoping to use the geotag data and the timestamp to get an average speed between the two points. I also want to geocode some tweets or photos that were not geocoded in 2012 by interpolating using the timestamp now.
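Both of those ideas can be sketched with a great-circle distance and a little linear interpolation. This is only a rough outline under stated assumptions: each point is a made-up `(timestamp, latitude, longitude)` tuple, the distance is a straight line over the Earth’s surface (so it understates road miles), and the interpolation assumes constant speed along that line. The function names are my own.

```python
from datetime import datetime
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_MILES = 3958.8  # approximate mean radius of the Earth

def haversine_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance in miles between two latitude/longitude points."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = phi2 - phi1
    dlambda = radians(lon2 - lon1)
    h = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlambda / 2) ** 2
    return 2 * EARTH_RADIUS_MILES * asin(sqrt(h))

def average_speed_mph(point_a, point_b):
    """Average straight-line speed between two points with different timestamps."""
    (t1, lat1, lon1), (t2, lat2, lon2) = point_a, point_b
    hours = (t2 - t1).total_seconds() / 3600
    return haversine_miles(lat1, lon1, lat2, lon2) / hours

def interpolate_position(point_a, point_b, when):
    """Estimate (lat, lon) at time `when`, assuming steady travel between points."""
    (t1, lat1, lon1), (t2, lat2, lon2) = point_a, point_b
    fraction = (when - t1) / (t2 - t1)
    return lat1 + fraction * (lat2 - lat1), lon1 + fraction * (lon2 - lon1)

# Hypothetical points one hour and one degree of longitude apart on the equator.
a = (datetime(2012, 11, 1, 8, 0), 0.0, 0.0)
b = (datetime(2012, 11, 1, 9, 0), 0.0, 1.0)
print(average_speed_mph(a, b))
print(interpolate_position(a, b, datetime(2012, 11, 1, 8, 30)))
```

Interpolating latitude and longitude directly is fine for points that are close together; for long gaps between geotags the estimate gets less trustworthy.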

Once I properly extract the data from the tweets, I can have hashtags or mentions searchable by frequency and location. I used #SeanTrek a lot more than any other hashtag on the trip. Though curiously enough, the first tweet mentioning #SeanTrek is not geotagged (a technical glitch). Hopefully, I’ll get some more things mapped out in the future.
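Counting hashtags and mentions could look something like the sketch below. It uses simple regular expressions that treat a hashtag or mention as `#` or `@` followed by letters, digits, or underscores, which is an approximation of how Twitter parses them. The tweets are made up for the example, and the function name is my own.

```python
import re
from collections import Counter

HASHTAG = re.compile(r"#(\w+)")
MENTION = re.compile(r"@(\w+)")

def count_tags(tweets):
    """Return (hashtag counts, mention counts), ignoring letter case."""
    hashtags, mentions = Counter(), Counter()
    for text in tweets:
        hashtags.update(tag.lower() for tag in HASHTAG.findall(text))
        mentions.update(name.lower() for name in MENTION.findall(text))
    return hashtags, mentions

# Hypothetical tweets for illustration.
tweets = [
    "Leaving Pittsburgh #SeanTrek",
    "Made it to Portland #seantrek #coffee @someone",
]
hashtags, mentions = count_tags(tweets)
print(hashtags.most_common())
print(mentions.most_common())
```

Pairing each tweet’s counts with its geotag would be the next step toward making them searchable by location.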