# Moving Average Time Series — Baseball

Usually I use stats to describe baseball, but this post is going to use baseball to illustrate stats. There’ll be some math. If that scares you, you’ve been duly warned. Also I have collected the SAS output for each model for technical reference.

A time series is data that has been collected at a regular interval over time. The definition is intuitive, but time series differ from cross-sectional data, which is the type of data set most people are familiar with. The closing price of a stock is a time series, because it’s a measurement at 4PM every M-F. Cross-sectional data would be looking at which types of stocks in your portfolio gained the most over a quarter. This is one measurement (quarterly change) made for many different stocks. Not every data set fits neatly into a category, and the analysis goal is different for each instrument.

The goal of univariate time series analysis (TSA) is to forecast a variable only using past observations of that variable. In the case of the stock market example, TSA seeks to project what the closing price for the next day will be using data from the specified time frame. However, finance is boring and I wanted a data set that I can extract some insight from, so we’ll be looking at MLB strikeouts (K) per year and home runs (HR) per year as the data sets.

What does a time series look like? If you scroll down or look up a stock market graph, you’ll see one. It’s messy. I created this data set, so I can describe this process accurately. It’s a first-order moving average process with a lag-1 coefficient of 0.9 and a series mean of 0. I’ve also included the ordinary least squares (OLS) linear regression trend for the time series, which shows a slightly positive slope. Fitting a trend line like this is a typical analytical technique to show that a time series is moving. In this case the trend is non-significant over these 50 data points. There is no trend, and the mean is zero.

The model that corresponds to the graph above has the general form as follows:

$y_t = \mu + a_t + \theta_1 a_{t-1}$

where $y$ is the time-dependent target variable, $\mu$ is the average of the entire series of data, $\theta_1$ is the moving average coefficient on the previous period’s shock, and $a$ is a time-dependent shock to the system. The $t$ subscripts describe which time period the variable is from, starting with the most current one, $t=50$.

Before describing the model above, it is important to fully understand what $a_t$ represents. This is a shock term that can encompass a lot of different things. If you are considering something like quarterly earnings, factors influencing the shock term are unemployment, economic growth, marketing campaigns, etc. We are looking at the data in the absence of this knowledge, and since we are in the dark, the causes of the shocks appear random. The $a_t$ terms should be normally distributed and not autocorrelated. The expected value should be zero, $E[a_t] = 0$. The expected value is another way to describe the average of all the $a_t$ terms.
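
To make the model concrete, here is a minimal sketch of how a series like the one described could be simulated. It is not the code used for the original graph: the function names, the seed, and the choice of standard normal shocks (variance 1) are assumptions for illustration.

```python
import numpy as np

def simulate_ma1(n, theta, mu=0.0, seed=0):
    """Simulate y_t = mu + a_t + theta * a_{t-1} with standard normal shocks."""
    rng = np.random.default_rng(seed)
    a = rng.standard_normal(n + 1)   # one extra shock so the first y has a lagged term
    return mu + a[1:] + theta * a[:-1]

def ols_trend(y):
    """Return (slope, intercept) of an ordinary least squares line of y against time."""
    t = np.arange(1, len(y) + 1)
    slope, intercept = np.polyfit(t, y, deg=1)
    return slope, intercept

# 50 points, lag-1 coefficient 0.9, series mean 0, as described in the text
y = simulate_ma1(n=50, theta=0.9)
slope, intercept = ols_trend(y)
print(f"sample mean = {y.mean():.3f}, OLS slope = {slope:.4f}")
```

With only 50 points, the sample mean and the fitted slope will wander a bit from zero from seed to seed, which is the same effect as the slightly positive but non-significant trend mentioned earlier.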

Here’s a great way to think about the MA process. Think about simplified personal monthly expenditures where you had a constant salary and a modest savings account. Shocks that would be included in the $a_t$ term would be unexpected expenses. The unexpected expense could influence the next time period if you had to dip into savings. So a high unexpected expense in January would impact the spending in February, because you’d have to pay off your credit card or put money back into savings.

There are many more details to understanding time series such as autocorrelation. Hopefully I’ll write a separate post on that in the future.

Let’s look at some real data. Luckily, I have every play from MLB in a database thanks to retrosheet.org, so we’ll look at some time series from there, specifically HRs and Ks per year. Conceptually, for this rudimentary modeling, an MA process makes sense. A shock from the previous year like expansion, steroids, or selection bias would carry over from year to year. Looking at the time series graph below, it doesn’t behave like the previous time series that was centered around zero. This time series is considered non-stationary, which means there’s a trend and that trend changes over time. The number of HRs per season increased up until around 2001, when it leveled off and started to decline. There’s a trend up until 2001 and a trend after it, and they aren’t the same. To get around this, instead of modeling the actual values, the differences between consecutive years of HRs will be modeled. A difference ($\nabla$) is simply $y_t - y_{t-1}$, for example the difference in HRs between 2013 and 2012, which would be -279 HRs.
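
Differencing is easy to see on a toy series. The sketch below uses made-up season totals (hypothetical numbers, not the Retrosheet data) and also shows that the step is reversible, which is why a forecast of the differences can be turned back into a forecast of the totals.

```python
import numpy as np

# Hypothetical season totals, for illustration only (not actual MLB data)
totals = np.array([100.0, 112.0, 109.0, 125.0, 121.0])

# First difference: each entry is y_t - y_{t-1}
diffs = np.diff(totals)   # differences are 12, -3, 16, -4

# Undo the differencing: first value plus the running sum of the differences
rebuilt = np.concatenate(([totals[0]], totals[0] + np.cumsum(diffs)))

print(diffs)
print(rebuilt)
```

Note that the differenced series is one observation shorter than the original, because the first year has no previous year to subtract.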

The green line is the actual HRs each year. The ‘cantaloupe’ colored lines are the 50% confidence interval (CI) of the forecast. The red line is the forecasted values. I used 50% CIs to show likely deviations, not statistically significant deviations.

The differenced moving average model [ARIMA(0,1,1)] takes the form:

$\nabla y_t = \mu + a_t - \theta a_{t-1}$

Substituting the estimated values of $\theta$ and $\mu$, a forecast can be made with the following equation:

$y_{t+1} = \mu + y_t + a_{t+1} - \theta * a_{t}$

$y_{t+1} = 50.11163 + y_t + a_{t+1} - .45073 * a_{t}$

The last equation is used to generate the forecast line and, ultimately, the 50% CI lines. The interpretation of this equation is that roughly half (about 45%) of the shock from the previous time period still has an effect on the change in the current period. The forecast predicts that home runs will actually increase relative to the past few years rather than continue the decline. Looking backwards, the model can be used to identify some years of interest, and I’ve marked those on the graph. Expansion probably has the greatest impact on the number of HRs, because it dilutes the talent pool and increases the total number of games per season. If you wanted to measure the impact training or steroids had on HRs, you’d want to use a HR/game time series [see below] instead of total HRs. [This is total HRs between both teams.]
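
The forecast equation for total HRs can be sketched in a few lines. The two coefficients are the ones quoted in this post; everything else (the function names, the input values, and the shock standard deviation `sigma`) is a hypothetical placeholder, since the fitted residuals live in the SAS output. The future shock $a_{t+1}$ is unknown, so the point forecast replaces it with its expected value of zero, and the 50% band assumes normally distributed shocks.

```python
from statistics import NormalDist

MU = 50.11163      # estimated constant from the post
THETA = 0.45073    # estimated MA(1) coefficient from the post

def forecast_next(y_t, a_t, mu=MU, theta=THETA):
    """One-step point forecast: a_{t+1} is replaced by its expected value, 0."""
    return mu + y_t - theta * a_t

def interval(point, sigma, level=0.50):
    """Symmetric normal interval around a one-step forecast; sigma is the shock std dev."""
    z = NormalDist().inv_cdf(0.5 + level / 2)   # about 0.674 for a 50% interval
    return point - z * sigma, point + z * sigma

# Hypothetical inputs for illustration only (not values from the SAS output)
y_t, a_t, sigma = 4700.0, -100.0, 300.0
point = forecast_next(y_t, a_t)
low, high = interval(point, sigma)
print(f"forecast = {point:.1f}, 50% interval = ({low:.1f}, {high:.1f})")
```

In this sketch a negative shock last year (a season that came in under its forecast) pushes this year’s forecast up by about 45% of that miss, because of the minus sign in front of $\theta$.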

The HR/Gm is the time series that a baseball analyst would want to use, because it controls for extra games from expansion, so the trends are also less pronounced. This is still a non-stationary time series, so it needs to be differenced like the previous model and can be described by the following equation:

$y_{t+1} = 0.0045989 + y_t + a_{t+1} - .49927 * a_{t}$

Still, the greatest shocks are the expansion years, which tend to have a bit of a lingering effect before regressing. 1987 now stands as a really enigmatic outlier. There was no expansion that year. The best explanation is there was a strike zone change, but I can only find that in one article. The home run outburst of the late 90s and early 2000s coincides with the ‘steroid era’ and two close periods of expansion. This post isn’t interested in analyzing steroids’ effect on MLB, only that its ‘shock’ is mixed in with the expansion team ‘shock’. Also, it should be noted that HRs/Gm haven’t returned to pre-1993 expansion levels.

Looking at the opposite of a home run, strikeouts per year have a trend that is much more steady, and it’s increasing.

The graph displayed above is also a differenced first-order moving average process, ARIMA(0,1,1). Its equation looks very similar to the last two, so I won’t write it out. The parameters can be found in the SAS output appendix I have for this page. The forecast has a definite increase in total strikeouts over the next few years. Just like the HR per year time series, the time series of Ks is best analyzed by looking at K/Gm. The K/Gm time series turns out to be a different model than the first three, because it is just a random walk around a linear trend.

This process has random shocks around a positive trend with no ‘memory’ of the past shocks like the other three models had. This model for K/Gm, ARIMA(0,1,0), looks a little different than the ARIMA(0,1,1) models seen earlier since there is no lagged $a_{t-1}$ term. The ARIMA(0,1,0) model is given by the following equation:

$\nabla y_t = \mu + a_t$

and the forecast equation with parameters in it would be:

$y_{t+1} = 0.11637 + y_t + a_{t+1}$

This indicates that the K/Gm will increase by 0.11637 every year on average. Obviously, since there are only 54 outs in a baseball game, this trend can’t go on forever. As of the beginning of August 2014, the current K/Gm is 15.4 and it is forecasted to be 15.2497, which is within the 50% CI of the forecast.
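
Because this model has no memory of past shocks, forecasting further ahead is just arithmetic on the drift. Here is a rough sketch under stated assumptions: the last observed K/Gm is not quoted in this post, so it is backed out from the 15.2497 forecast, and `sigma` in `forecast_std` is a placeholder for the shock standard deviation in the SAS output.

```python
import math

DRIFT = 0.11637   # estimated yearly increase in K/Gm from the post

def forecast_k_per_game(y_last, h, drift=DRIFT):
    """h-step point forecast for ARIMA(0,1,0) with drift: future shocks average to zero."""
    return y_last + h * drift

def forecast_std(sigma, h):
    """Forecast std dev grows with sqrt(h), because shocks accumulate in a random walk."""
    return sigma * math.sqrt(h)

def years_until(level, y_last, drift=DRIFT):
    """Whole years until the drift line alone reaches `level`."""
    return math.ceil((level - y_last) / drift)

# Assumption: last observed value implied by the post's forecast of 15.2497
y_2013 = 15.2497 - DRIFT

print(forecast_k_per_game(y_2013, h=1))   # one year ahead
print(forecast_k_per_game(y_2013, h=5))   # five years ahead
print(years_until(54, y_2013))            # years for the trend line to hit 54 outs
```

At this drift the trend line would need over three centuries to reach 54 strikeouts per game, which is a reminder that the model describes recent history rather than a long-run law.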

While these models can make predictions about baseball, I wouldn’t consider these the best [or even good] models for forecasting, since we could incorporate other variables or improve the granularity of the forecast to individual players. There also isn’t much value in saying there’ll be more strikeouts in 2014 than 2013. However, this example is a good academic exercise in understanding how univariate time series work. And hopefully it provides some insight into both time series and a little bit about trends in baseball.

# Twitter Analysis – Penguins Game 7

I’ve been listening to 93.7 The Fan while running the analysis for this, and I never realized that people can say the same thing over and over again but in slightly different ways. Also all tweets were captured AFTER THE CONCLUSION OF THE 1st PERIOD.

Everyone knows Twitter is the best venue to vent your anger about sports teams. I was able to use the statistical programming language R to scrape tweets which had certain keywords or hashtags in them, put them in a database, and then flag the tweets that contained certain keywords or collections of words. I had about 20 keywords including: “penguins”, “pens”, “rangers”, “game 7”, and “firebylsma”. I also searched for the handles of the local hockey writers, because a lot of people will reply to the sports writers during the game with their own opinions.
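
The flagging step is simple to sketch. The original work was done in R with a database; the version below is a Python illustration only, with hypothetical function names and made-up example tweets, and it uses just the five keywords listed here rather than the full set of about 20.

```python
import re

# Keywords quoted in the post; the full list of about 20 is not given
KEYWORDS = ["penguins", "pens", "rangers", "game 7", "firebylsma"]

def build_pattern(keywords):
    """Compile one case-insensitive regex matching any keyword as a whole word."""
    alternatives = "|".join(re.escape(k) for k in keywords)
    return re.compile(rf"(?<!\w)(?:{alternatives})(?!\w)", re.IGNORECASE)

def flag_tweets(tweets, keywords=KEYWORDS):
    """Return (tweet, matched_keywords) pairs; matches are lowercased and de-duplicated."""
    pattern = build_pattern(keywords)
    flagged = []
    for text in tweets:
        hits = sorted({m.lower() for m in pattern.findall(text)})
        flagged.append((text, hits))
    return flagged

# Made-up tweets for illustration
examples = [
    "Pens and Rangers in Game 7!",
    "Enough already #FireBylsma",
    "What happens next?",
]
for text, hits in flag_tweets(examples):
    print(hits, "<-", text)
```

The whole-word check matters for short keywords: without it, “pens” would also flag a tweet that only says “happens”.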

The first graph has the total number of tweets that I scraped and the tweets that I flagged as ‘swearing’. For the most part, I feel like if someone swore in the tweet, it indicates anger or at the very least aggressiveness. As mentioned before, these graphs begin after the 1st period ends. Tweets containing the keyword ‘rangers’ have been filtered out of this first graph so that it includes a greater share of Penguins fans.

The quickest and most basic analysis is the number of tweets as a time-series. As soon as you look at the time-series line graph, you can tell when the game ends [9:41 PM]. It’s like Mt. Everest in the graph. Looking closer you can see the spikes where each team scores. An interesting occurrence happened right before the end of the game. Twitter got quiet. The tweets per minute dropped below 500 right before it exploded to a few thousand per minute. I am attributing this silence to people actually watching the game during the tense last minute.
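
Turning a pile of tweets into a tweets-per-minute series is mostly a matter of truncating each timestamp to the minute and counting. This is a minimal Python sketch, not the R code behind the graphs; the function names and the handful of timestamps (including the date) are hypothetical.

```python
from collections import Counter
from datetime import datetime

def tweets_per_minute(timestamps):
    """Count tweets per one-minute bin; keys are datetimes truncated to the minute."""
    return Counter(ts.replace(second=0, microsecond=0) for ts in timestamps)

def busiest_minute(counts):
    """Return (minute, count) for the bin with the most tweets."""
    return max(counts.items(), key=lambda item: item[1])

# Made-up timestamps for illustration (the date is a placeholder)
stamps = [
    datetime(2014, 1, 1, 21, 40, 5),
    datetime(2014, 1, 1, 21, 41, 12),
    datetime(2014, 1, 1, 21, 41, 30),
    datetime(2014, 1, 1, 21, 41, 59),
    datetime(2014, 1, 1, 21, 43, 2),
]
counts = tweets_per_minute(stamps)
minute, n = busiest_minute(counts)
print(minute.strftime("%H:%M"), n)
```

One design caveat: minutes with no tweets at all never appear in the counter, so a plot built from it should fill those bins with zero, otherwise a quiet stretch would be drawn as a straight line between its neighbors.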

The tweets peaked about 2 minutes after the game ended indicating a minimal lag which includes the time of picking up the smart phone, unlocking it, and composing the tweet. However, the angry, swearing tweets peaked right as the game ended indicating more visceral emotions instead of more thought-out, 140-character commentary. There are two severe dips that I can’t account for at 9:49 PM and 10:04 PM. If anyone knows something that occurred at this time, please let me know. Since there is a clear downward trend and the game was no longer being played, I am going to write those dips off as some technical difficulties that didn’t allow a lot of tweets to be sent at those times.

Since Twitter isn’t an invention specific to just Penguins fans, I separated and compared two sets of tweets: one containing the word “penguins” and one containing “rangers”. There seem to be a lot more Rangers fans than Penguins fans on Twitter, because the “rangers” tweets outnumber the “penguins” tweets at almost every point in time. The “penguins” tweets did spike when they scored their only goal. Interestingly enough, there was a lot of swearing right when the goal was scored. Penguins fans are just so angry!

Bottom line: calm down; it’s just sports.