# Count Data Distribution Primer — Binomial / Negative Binomial / Poisson

Count data is exclusively whole-number data where each increment represents one of something. It could be a car accident, a run in baseball, or an insurance claim. The critical thing here is that these are discrete, distinct items. Count data behaves differently from continuous data, and the distribution [the frequency of different values] differs between the two. Random continuous data often follows the normal distribution, which is the bell curve everyone remembers from high school grade systems. [Which is a really bad way to grade, but I digress.] Count data generally follows the Binomial, Negative Binomial, or Poisson distribution, depending on the context in which you are viewing the data; all three distributions are mathematically related.

Binomial Distribution:

The binomial distribution (BD) is the collection of probabilities of getting a certain number of successes in a given number of Bernoulli trials [yes/no events similar to a coin flip, but not necessarily 50/50]. My favorite example for understanding the binomial distribution is using it to determine the probability that you’d get exactly 5 HEADS if you flipped a coin 10 times [it’s NOT 50%!].

It’s actually 24.61%. The probability of getting heads in any given coin flip is 50%, but over 10 flips, you’ll only get exactly 5 HEADS and 5 TAILS about 25% of the time. The equation below gives the two popular notations for the binomial probability mass function. $n$ is the total number of trials [the graph above used $n$ = 10]. $r$ is the number of successes you want to know the probability for. You calculate this function for each number of HEADS [0-10] for $r$ to get the distribution above. $p$ is the probability of success on each individual trial [$p$ = 0.5 for the coin flip].

$P(X=r) = {{n}\choose{r}} p^{r} (1-p)^{n-r} = \frac{n!}{r!(n-r)!} p^{r} (1-p)^{n-r}$
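As a quick check on the 24.61% figure, plugging the coin-flip values [$n$ = 10, $r$ = 5, $p$ = 0.5] into the formula gives:

$P(X=5) = {{10}\choose{5}} (0.5)^{5} (0.5)^{5} = \frac{252}{1024} \approx 0.2461$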

The equation has three parts. The first part is the combination ${{n}\choose{r}}$, which is the number of ways to choose $r$ items from $n$ total items. Combinations disregard order, so the set {1, 4, 9} is the same as {4, 9, 1}. This part of the equation tells you how many possible ways there are to reach a certain outcome, since there are many ways to get 5 HEADS in 10 tosses. Since ${{10}\choose{5}}$ is larger than any other ${{10}\choose{r}}$ and, for a fair coin, every individual sequence of 10 flips is equally likely, 5 HEADS has the largest probability.

There are two more terms in the equation. $p^r$ is the joint probability of getting $r$ successes in one particular order, and $(1-p)^{n-r}$ is the corresponding probability of getting the $n-r$ failures in that same sequence. I find it helpful to conceptualize the equation as having three parts accounting for different things: the total combinations of successes and failures, the probability of the successes, and the probability of the failures.
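The three-part reading of the formula can be sketched in a few lines of Python. This is only an illustration: the function name `binomial_pmf` and the intermediate variable names are my own choices, and it uses nothing beyond the standard library's `math.comb`.

```python
from math import comb


def binomial_pmf(r: int, n: int, p: float) -> float:
    """Probability of exactly r successes in n Bernoulli trials."""
    ways = comb(n, r)              # part 1: orderings that contain r successes
    successes = p ** r             # part 2: probability of the r successes
    failures = (1 - p) ** (n - r)  # part 3: probability of the n - r failures
    return ways * successes * failures


# The coin-flip example: 10 flips of a fair coin.
n, p = 10, 0.5
distribution = [binomial_pmf(r, n, p) for r in range(n + 1)]
for r, prob in enumerate(distribution):
    print(f"{r:2d} HEADS: {prob:.4f}")
```

The line for 5 HEADS should read 0.2461, and the eleven probabilities should add up to 1.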

Negative Binomial Distribution:

While there is a good reason for it, the name of the negative binomial distribution (NBD) is confusing. Nothing I will present will involve making anything negative, so let’s just get that out of the way and ignore it. The binomial distribution gives the probability of a number of successes in a fixed total number of ATTEMPTS. In contrast, the negative binomial distribution gives the probability that a certain number of FAILURES occur before the $r$th SUCCESS. This has many applications, specifically when a sequence terminates after the $r$th success, such as modeling the probability that you will sell out of the 25 cups of lemonade you have stocked after a given number of cars pass by. The idea is that you would pack up your lemonade stand after you sell out, so cars that pass by after the final success won’t matter. Another good example is modeling the win probability of a 7-game sports playoff series. The team that wins the series must win 4 games, including the last game played, since the playoff series terminates after one team reaches 4 wins.
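The playoff example can be sketched numerically. The code below is a rough illustration, not a real sports model: it assumes every game is independent and that the team has the same win probability `p` in each game, and the function name `series_win_probability` is hypothetical. It sums the negative binomial probabilities ${{r+k-1}\choose{k}} p^{r} (1-p)^{k}$ with $r$ = 4 wins over $k$ = 0, 1, 2, 3 losses.

```python
from math import comb


def series_win_probability(p: float, wins_needed: int = 4) -> float:
    """Chance that a team with per-game win probability p takes the series.

    Assumes independent games with a constant p (a simplification).
    """
    total = 0.0
    # k = games the eventual winner loses before clinching its final win
    for k in range(wins_needed):
        total += comb(wins_needed + k - 1, k) * p ** wins_needed * (1 - p) ** k
    return total


for p in (0.5, 0.55, 0.6):
    print(f"p = {p:.2f}: series win probability = {series_win_probability(p):.4f}")
```

For evenly matched teams [$p$ = 0.5] the result should come out to exactly 0.5, which is a handy sanity check.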

One of the more important restrictions on the NBD is that the last event must be a success. Going back to the sports playoff series example, the team that wins the series will NEVER lose the last game. With the 10 coin-flip example, the BD was looking for the probability of getting a certain number of HEADS within a set number of coin flips. Using the NBD, we will look for the probability of getting a certain number of TAILS before the 5th HEAD. The total number of flips is no longer fixed at 10: it equals 5 plus the number of TAILS, so it can be smaller than 10 or exceed it, as seen below.

The probability mass function that describes the NBD graph above is given below:

$P(X=k) = {{r+k-1}\choose{k}} p^{r} (1-p)^{k}$

The equation for the NBD has the same parts as the BD: the combinations, the successes, and the failures. Here $k$ is the number of failures that occur before the $r$th success. In the NBD case there are fewer combinations than in the BD [for the same total number of coin flips]. This is because the last outcome is held fixed as a success, so only the first $r+k-1$ flips can be rearranged. The success and failure parts of the equation are conceptually the same as in the BD. The failure portion is written differently because the number of failures is the variable $k$ instead of a derived quantity like $n-r$.
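To make the comparison of combinations concrete, here is a small Python sketch for the coin example with $r$ = 5 HEADS and a fair coin. The function name `negative_binomial_pmf` is my own, and the side-by-side count of orderings is just one way to illustrate the point.

```python
from math import comb


def negative_binomial_pmf(k: int, r: int, p: float) -> float:
    """Probability of exactly k failures before the r-th success."""
    return comb(r + k - 1, k) * p ** r * (1 - p) ** k


r, p = 5, 0.5
for k in range(11):
    total_flips = r + k
    nbd_ways = comb(r + k - 1, k)   # last flip is fixed as a HEAD
    bd_ways = comb(total_flips, r)  # any arrangement of r HEADS
    print(f"{k:2d} TAILS ({total_flips:2d} flips): "
          f"P = {negative_binomial_pmf(k, r, p):.4f}, "
          f"orderings NBD/BD = {nbd_ways}/{bd_ways}")
```

For 5 TAILS [10 total flips] the NBD counts 126 orderings against the BD's 252, and the probability works out to about 0.1230.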

Poisson Distribution:

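The Poisson Distribution (PD) is directly related to both the BD and the NBD, because it is a limiting case of both of them. For the BD, as the number of trials goes to infinity while the probability of success on each trial shrinks so that the expected number of successes $np$ stays fixed, the Poisson distribution emerges; for the NBD, the analogous limit lets $r$ grow while the expected number of failures stays fixed. The graph for the PD will look similar to the NBD or the BD, but there is no example comparing it to the coin flip, since the PD describes events that occur over a continuous interval of time or space, like traffic flow or earthquakes, rather than a fixed number of discrete trials. The major difference is not what is represented, but how it is viewed and calculated. The Poisson distribution is described by the equation: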

$P(X = x) = \frac{e^{-\lambda}\lambda^x}{x!}$

$\lambda$ is the expected value [or the mean] of the number of events in the interval and $x$ is the count value. If you knew that an average of 0.2 car crashes happen at an intersection on a given day, then you could evaluate the equation for $x$ = {0, 1, 2, 3, 4, 5, … } and get the PD for the problem.
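The car-crash example can be evaluated directly. This is a minimal sketch using only the standard library; the function name `poisson_pmf` and the variable `lam` are my own naming.

```python
from math import exp, factorial


def poisson_pmf(x: int, lam: float) -> float:
    """Probability of observing exactly x events when the mean is lam."""
    return exp(-lam) * lam ** x / factorial(x)


lam = 0.2  # average crashes per day at the intersection
for x in range(6):
    print(f"{x} crashes: {poisson_pmf(x, lam):.4f}")
```

With a mean of 0.2, the probability of a crash-free day comes out to about 0.8187, and the probabilities fall off quickly for larger counts.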

One of the restrictions and major issues with the use of the PD is that the model assumes the mean and the variance are equal [both are $\lambda$]. In most real data the variance is greater than the mean [a condition known as overdispersion], so the PD concentrates more probability around the expected value than real data reflects.
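One way to see the mean-equals-variance restriction is to compute both numerically and compare against a negative binomial with the same mean. The sketch below is illustrative only: the parameter values are arbitrary, the helper `mean_and_variance` is hypothetical, and the sums are truncated at 100 counts, where the remaining tail is negligible for these parameters.

```python
from math import comb, exp, factorial


def poisson_pmf(x: int, lam: float) -> float:
    return exp(-lam) * lam ** x / factorial(x)


def negative_binomial_pmf(k: int, r: int, p: float) -> float:
    return comb(r + k - 1, k) * p ** r * (1 - p) ** k


def mean_and_variance(pmf, support):
    """Mean and variance of a discrete distribution over the given support."""
    mean = sum(x * pmf(x) for x in support)
    variance = sum((x - mean) ** 2 * pmf(x) for x in support)
    return mean, variance


lam = 5.0
r, p = 5, 0.5          # chosen so the expected number of failures is also 5
support = range(100)   # truncated support (assumption: tail is negligible)

poisson_stats = mean_and_variance(lambda x: poisson_pmf(x, lam), support)
nbd_stats = mean_and_variance(lambda k: negative_binomial_pmf(k, r, p), support)

print(f"Poisson:           mean = {poisson_stats[0]:.3f}, variance = {poisson_stats[1]:.3f}")
print(f"Negative binomial: mean = {nbd_stats[0]:.3f}, variance = {nbd_stats[1]:.3f}")
```

Both distributions have a mean of 5, but the negative binomial's variance is about 10 while the Poisson's is pinned at 5. That extra spread is one reason the NBD is often considered when count data looks overdispersed.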

If you are interested in the derivations and math behind these, I recommend this site: http://statisticalmodeling.wordpress.com/. I feel like they explain the derivation of the negative binomial better than most places I’ve found. It addresses why it’s called the NEGATIVE binomial distribution as well. The site also contains derivations of the PD being the limiting case of the BD and NBD.
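If you would rather see the limiting behavior numerically before reading a derivation, the sketch below compares the BD with the PD as the number of trials grows. It is an informal check, not a proof: it holds the mean at an arbitrary $\lambda$ = 2 by setting $p = \lambda / n$, only looks at counts 0 through 10, and the helper name `max_gap` is my own.

```python
from math import comb, exp, factorial


def binomial_pmf(r: int, n: int, p: float) -> float:
    return comb(n, r) * p ** r * (1 - p) ** (n - r)


def poisson_pmf(x: int, lam: float) -> float:
    return exp(-lam) * lam ** x / factorial(x)


lam = 2.0  # fixed mean (arbitrary choice for illustration)


def max_gap(n: int) -> float:
    """Largest |Binomial(n, lam/n) - Poisson(lam)| over counts 0..10."""
    p = lam / n
    return max(abs(binomial_pmf(x, n, p) - poisson_pmf(x, lam))
               for x in range(11))


for n in (10, 100, 1000, 10000):
    print(f"n = {n:5d}: largest gap = {max_gap(n):.6f}")
```

The gap should shrink steadily as $n$ increases, which is the numerical footprint of the BD converging to the PD.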