Investigating the Pareto Distribution

During my graduate degree program in Statistics I needed to know the ins and outs of many different probability distributions, including the normal, binomial, beta, gamma, exponential, and Poisson. Surprisingly, I never ran into one of the best-known distributions in popular culture: the Pareto distribution.

You may have heard of the 80/20 rule. Most commonly you hear that 80% of the wealth is controlled by 20% of people. Maybe 99% of book sales are generated by 1% of authors? These numbers don’t need to add up to 100%. We could just as easily say 70% of productivity comes from 10% of workers, or 90% of your musical ability comes from 30% of your practice. But no matter how you slice it, there’s a sense of skewness in how some percentage of input generates a disproportionate amount of output. My goal in this post is to describe mathematically how the Pareto distribution arises from simple assumptions, and how it captures this skewness phenomenon.

Let’s describe a real-life property, such as a random individual’s age, via a random variable X. Assume X\sim Exponential(1). That is, X follows an exponential distribution with PDF and CDF

f_X(x) = e^{-x} and F_X(x) = 1-e^{-x} for x\geq 0.

From here let’s assume that everyone’s net worth grows over time at a constant rate of return. Ignore units and just assume everyone starts life on equal footing with a net worth of 1. Then we’ll use Y=e^{\alpha X} to describe a random individual’s net worth, where \alpha > 0 is the rate of return. The CDF of Y is derived like so…

\begin{aligned} F_Y(y) &= P(Y<y) \\ &= P(e^{\alpha X} < y) \\ &= P(X < \frac{1}{\alpha} \ln y) \\ &= F_X\left(\frac{1}{\alpha} \ln y\right) \\ &= 1-\left(\frac{1}{y}\right)^{\frac{1}{\alpha}} \text{ for } y\geq 1. \end{aligned}

The PDF comes from taking the derivative of the CDF. We have…

f_Y(y) = \frac{1}{\alpha} y^{-(1+\frac{1}{\alpha})} \text{ for } y\geq 1.

This is the Pareto distribution (typically parameterized to remove the reciprocal of \alpha).
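As a quick sanity check on the CDF derived above, here is a minimal simulation sketch in Python. The value \alpha = 0.5, the sample size, and the helper names are my own choices for illustration and do not come from the derivation itself.

```python
import math
import random

random.seed(0)  # fixed seed so the run is repeatable
alpha = 0.5     # hypothetical rate of return, chosen only for illustration
n = 100_000

# X ~ Exponential(1) is the age, Y = exp(alpha * X) is the net worth
samples = [math.exp(alpha * random.expovariate(1.0)) for _ in range(n)]

def pareto_cdf(y, alpha):
    """F_Y(y) = 1 - y**(-1/alpha) for y >= 1, as derived above."""
    return 1 - y ** (-1 / alpha)

def empirical_cdf(y):
    """Fraction of simulated net worths strictly below y."""
    return sum(s < y for s in samples) / n

# The two columns of numbers should be close to each other
for y in (1.5, 2.0, 4.0):
    print(y, round(empirical_cdf(y), 3), round(pareto_cdf(y, alpha), 3))
```

The empirical and theoretical values should agree to within sampling noise, which is roughly what we would expect if Y really follows the Pareto CDF.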

Consider p=1-F_Y(y_0). This value tells us the proportion of the population with a net worth above y_0. A natural question to ask is how much wealth this portion of the population controls. To answer this question, let’s first calculate the following quantities.

\begin{aligned} \text{Total Wealth} &= \int_1^\infty y f_Y(y) dy = \frac{1}{1-\alpha } \\  \text{Total Wealth Above } y_0 &= \int_{y_0}^\infty y f_Y(y) dy  = \frac{y_0^{1-\frac{1}{\alpha }}}{1-\alpha }\end{aligned}
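For readers who want the intermediate steps, here is the second integral written out. This expansion is my own and uses the density f_Y(y) = \frac{1}{\alpha} y^{-(1+\frac{1}{\alpha})} obtained by differentiating the CDF. Note that the integral only converges when \alpha < 1, which the results above implicitly assume.

\begin{aligned} \int_{y_0}^\infty y f_Y(y) dy &= \frac{1}{\alpha}\int_{y_0}^\infty y^{-\frac{1}{\alpha}} dy \\ &= \frac{1}{\alpha}\left[\frac{y^{1-\frac{1}{\alpha}}}{1-\frac{1}{\alpha}}\right]_{y_0}^\infty \\ &= \frac{1}{\alpha}\cdot\frac{\alpha}{1-\alpha}\, y_0^{1-\frac{1}{\alpha}} \\ &= \frac{y_0^{1-\frac{1}{\alpha}}}{1-\alpha} \end{aligned}

Setting y_0 = 1 recovers the total wealth \frac{1}{1-\alpha}.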

Taking the ratio of the above terms we can define a new function, W(p), giving the proportion of wealth owned by the top fraction p of the population.

W(p) =y_0^{\frac{\alpha -1}{\alpha }} = p ^{1-\alpha}

where the following substitution is applied

y_0 = F_Y^{-1}(1-p) = p^{-\alpha}.

Now let’s consider finding a value \alpha that expresses the most popular form of the Pareto Principle, i.e. the 80/20 rule. Simply setting W(p) = \frac{4}{5} and p = \frac{1}{5} we solve and get \alpha =\log_5(4) \approx 0.8614.

In our everyday world 86% may seem like an unrealistic rate of return, but keep in mind that we never gave units for our underlying distribution of age. If the units of X are decades and you assume a person lives only 10 years on average, then you achieve this same level of inequality with just a 9% annualized rate of return (since e^{\alpha/10} - 1 \approx 0.09). Of course this model is highly simplified since human lifespans don’t follow an exponential distribution with a 10 year mean. Still, it speaks volumes to how inequality can arise from something as innocuous as age and a modest return on investment.
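The arithmetic behind the 80/20 value of \alpha and the 9% annualized figure can be reproduced with a few lines of Python. This is only a sketch: the function name wealth_share is mine, and the conversion assumes, as in the paragraph above, that X is measured in decades and that \alpha is a continuously compounded return.

```python
import math

# Rate of return (per unit of X) that reproduces the 80/20 rule
alpha = math.log(4) / math.log(5)

def wealth_share(p, alpha):
    """W(p) = p**(1 - alpha): share of wealth held by the top fraction p."""
    return p ** (1 - alpha)

# Assumption: X is in decades, so alpha is a continuously compounded
# return per decade; convert it to an effective annual rate.
annual_rate = math.exp(alpha / 10) - 1

print(round(alpha, 4))                     # 0.8614
print(round(wealth_share(0.2, alpha), 4))  # 0.8
print(round(annual_rate, 4))               # 0.09
```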

Oddly enough, the Pareto distribution is not unique in capturing this sense of inequality. We could have used this same methodology for many distributions. A natural question then becomes how \alpha must change to capture the 80/20 rule when you assume a different underlying distribution for age besides the Exponential distribution.

Standard Deviation vs MAD

Standard deviation is the most commonly used metric for measuring the volatility, or spread, of data. Besides the mean, it is the most commonly used metric in data science, to the point where we never stop to think about why it’s used.

The formula for standard deviation is given by:

\sigma = \sqrt{\frac{1}{N}\sum_{i=1}^N(x_i-\mu)^2} \text{, where } \mu = \frac{1}{N}\sum_{i=1}^N x_i.

Not the simplest formula, but it does its job in capturing how data deviates from its mean. Except there is a simpler, more intuitive formula available to us. This metric is called Mean Absolute Deviation (or MAD for short). The formula looks like this:

\text{MAD} = \frac{1}{N}\sum_{i=1}^N\left|x_i-\mu\right| .

If you want to know how spread out your data is, doesn’t it make more sense to find the mean, then take the average distance of all the data from that mean? I see no reason why not.

So why is standard deviation the industry standard for measuring spread? I think it’s partly an accident of history, and partly due to the fact that square functions are easier to work with in mathematics (and lead to nicer advanced results).

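Compare the square function to the absolute value function. The square function is smooth everywhere, while the absolute value function has a sharp corner at zero where its derivative is undefined, which makes it awkward to handle with the standard tools of calculus.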

Okay, so we can see that there are alternatives to standard deviation. What difference does it make when analyzing data?

The consequence of using a 2nd order (or higher) polynomial when measuring spread is that data points further from the mean are weighted more heavily. Consider the comparison of the two data sets below:


MAD is constant between the two, but standard deviation increases when we include data points that are further out.
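To make this concrete, here is a small sketch in Python. The two data sets are made up for illustration (they are not the ones from the original comparison): both have mean 0 and MAD 1, but the second pushes two of its points further out.

```python
import math

def mean(xs):
    return sum(xs) / len(xs)

def std_dev(xs):
    """Population standard deviation (divides by N, as in the formula above)."""
    mu = mean(xs)
    return math.sqrt(sum((x - mu) ** 2 for x in xs) / len(xs))

def mad(xs):
    """Mean absolute deviation around the mean."""
    mu = mean(xs)
    return sum(abs(x - mu) for x in xs) / len(xs)

# Hypothetical data sets with the same mean and the same MAD
tight = [-1, -1, 1, 1]
spread = [-2, 0, 0, 2]

for name, data in (("tight", tight), ("spread", spread)):
    print(name, "MAD =", mad(data), "std =", round(std_dev(data), 3))
# tight MAD = 1.0 std = 1.0
# spread MAD = 1.0 std = 1.414
```

The squared term is what makes the standard deviation of the second set larger, even though the average distance from the mean is identical.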

Is this good or bad? It just depends on what you’re looking for. Are you more interested in detecting possible outliers between two data sets? Use standard deviation. More interested in how the data deviates from the mean? Give MAD a try.

Making Inferences from “Fake News” (a Bayesian Approach)

When it comes to understanding news on politically divisive topics I follow a simple rule:

Only trust biased news sources when they say the opposite of what you would expect, otherwise lean on your own intuition.

For example if CNN or Vox were to say something positive about Donald Trump you should give that information some merit. If they say something negative then consider it noise and instead lean on your own intuition (you can use the inverse logic on Fox News or Breitbart). This heuristic can actually be justified using Bayesian inference and the language of probability theory.

The most well-known biased news source is Fox News, so I will use them for the following example.

[Image: two tables, one for the source’s bias and one for your own personal bias]

The first table above reflects what it means for a source to be biased: A high probability of support (or non-support) for a person or position, regardless of the data. The second table can be thought of as your own personal bias on a given issue. If you think you are perfectly unbiased on the topic then you would set the corresponding probabilities at 50/50.

Recall Bayes’ Rule:

P(A|B) = \frac{P(B|A) \times P(A)}{P(B)}

From this we can obtain the following results:

[Image: table of results]

This result exactly reflects my initial heuristic, i.e. we can place a high degree of certainty in an event if a biased source says the opposite of our expectations. Also note that if the source says exactly what we expect it to say (i.e. Fox tends to say good things about Trump), then the probability above comes pretty close to our own personal intuition. This means the source can mostly be ignored.

An interesting result emerges when we compare this analysis to an unbiased source. Consider how we might perceive information from an unbiased source such as Reuters:

[Image: the same analysis for an unbiased source]

Our intuition tells us that positive news about Trump is more reliable coming from Reuters than from a pro-Trump source. But what I find even more interesting is that Reuters is actually less reliable for assessing negative news about Trump than Fox News.
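The comparison can be sketched with Bayes’ Rule in a few lines of Python. Every probability below is a made-up number chosen for illustration, not a value from the tables in this post, and the function name posterior is my own.

```python
def posterior(prior_true, p_report_given_true, p_report_given_false):
    """Bayes' rule: P(claim is true | the source reports it)."""
    evidence = (p_report_given_true * prior_true
                + p_report_given_false * (1 - prior_true))
    return p_report_given_true * prior_true / evidence

prior_good = 0.5  # hypothetical: no personal bias either way

# Hypothetical pro-Trump source: reports "good" 99% of the time when the
# news really is good, and still 90% of the time when it is bad.
fox_good = posterior(prior_good, 0.99, 0.90)     # P(good | says good)
fox_bad = posterior(1 - prior_good, 0.10, 0.01)  # P(bad | says bad)

# Hypothetical unbiased source: right 80% of the time either way.
reuters_good = posterior(prior_good, 0.80, 0.20)
reuters_bad = posterior(1 - prior_good, 0.80, 0.20)

print(round(fox_good, 3), round(fox_bad, 3))          # 0.524 0.909
print(round(reuters_good, 3), round(reuters_bad, 3))  # 0.8 0.8
```

Under these assumed numbers the biased source barely moves us off our prior when it says what we expect, but its negative report is more convincing than the same report from the unbiased source.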

Lesson: Sometimes Fake News is actually the most reliable source of information!

Counter-Intuitive Sampling Result

I’ve been reading ET Jaynes’ book on probability theory and encountered an interesting little result.

Consider the classic urn model where we are presented with a bucket containing 4 balls (2 red and 2 white) and we draw from this urn without replacement. Let R_i denote the event that a red ball is picked from the urn on the ith draw. So P(R_1) would be interpreted as the probability of picking a red ball on the first draw.

Now consider the following probabilities P(R_1|R_2) and P(R_1| R_2 \text{ or } R_3). For the second probability, we consider the case where a red ball is picked on the 2nd or 3rd draw (possibly both). Which probability do you suppose is greater? Think about this question for a second before reading on.

Let’s imagine that the sampling experiment has already taken place and you selected all the balls from the urn blindly. I tell you that the 2nd ball you selected was red. What do you suppose are the chances the first ball you drew was also red? This is easy to calculate…

P(R_1|R_2) = \frac{P(R_1, R_2)}{P(R_2)} = \frac{\frac{2}{4}\times\frac{2-1}{4-1}}{\frac{1}{2}} = \frac{1}{3}

This result easily matches our intuition. If 1 of the 3 balls that are unaccounted for is red, we should expect to have a 1 in 3 chance of having selected it first.

Now consider the case where I tell you that either the 2nd or 3rd ball (possibly both) is red. Let’s calculate the chance that the first ball selected is also red…

\begin{aligned} P(R_1|R_2 \text{ or } R_3) &= \frac{P(R_1 \text{ and } (R_2 \text{ or } R_3))}{P(R_2 \text{ or } R_3)} \\ &= \frac{P(R_1 \text{ and } (R_2 \text{ or } R_3))}{P(R_2) + P(R_3) - P(R_2, R_3)} \\ &= \frac{\frac{1}{2} \times \frac{2}{3}}{\frac{1}{2} + \frac{1}{2} - \frac{2}{4}\times\frac{2-1}{4-1}} \\ &= \frac{2}{5} \end{aligned}

Therefore P(R_1|R_2) < P(R_1| R_2 \text{ or } R_3). This result clashes with my own intuition. I would expect that knowledge of a red ball picked on the 2nd or 3rd draw would reduce my chances of having picked a red ball on the 1st draw, especially since R_2 \text{ or } R_3 includes the possibility of both red balls being selected on the 2nd and 3rd draws. How is it that the knowledge of a red ball possibly being drawn in an additional spot actually increases our odds of selecting it on the first draw?

Here is how Jaynes attempts to intuit the phenomenon: The information R_2 reduces the number of red balls available for the first draw by one, and it reduces the number of balls in the urn available for the first draw by one, giving P(R_1|R_2) = (M-1)/(N-1) = 1/3. The information [ R_2 \text{ or } R_3 ] reduces the ‘[expected] number of red balls’ available for the first draw, but it reduces the number of balls in the urn available for the first draw by two.

So similarly to how we calculate P(R_1|R_2) =(2-1) / (4-1) =1/3, we should think of P(R_1|R_2 \text{ or } R_3)=(2-\langle R\rangle )/(4-2), where \langle R\rangle is the expected number of red balls removed when we know that the 2nd, 3rd, or both picks withdrew a red ball. Let’s do that:

\langle R\rangle = \frac{1\times P(R_2,\bar{R_3}) + 1\times P(\bar{R_2},R_3) +2\times P(R_2,R_3)}{P(R_2,\bar{R_3}) + P(\bar{R_2},R_3) + P(R_2,R_3)} =  \frac{1\times \frac{1}{2}\frac{2}{3} + 1\times \frac{1}{2}\frac{2}{3} +2\times \frac{1}{2}\frac{1}{3}}{\frac{1}{2}\frac{2}{3} + \frac{1}{2}\frac{2}{3} +\frac{1}{2}\frac{1}{3}} = \frac{6}{5}

Thus we attain our desired result.

P(R_1|R_2 \text{ or } R_3)=\frac{2-\langle R\rangle}{4-2} =\frac{4/5}{2} = \frac{2}{5}
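Because the urn is so small, all three numbers above can be checked by brute force. The sketch below is my own check rather than anything from Jaynes: it labels the four balls, enumerates every equally likely draw order, and counts. The helper names are hypothetical.

```python
from fractions import Fraction
from itertools import permutations

# Label the balls so that all 4! draw orders are equally likely
colours = {0: "red", 1: "red", 2: "white", 3: "white"}
orders = list(permutations(colours))

def is_red(order, draw):
    """True if the ball taken on the given draw (1-based) is red."""
    return colours[order[draw - 1]] == "red"

def conditional(event, given):
    """P(event | given), found by counting equally likely draw orders."""
    kept = [o for o in orders if given(o)]
    return Fraction(sum(event(o) for o in kept), len(kept))

def r1(o):
    return is_red(o, 1)

def r2(o):
    return is_red(o, 2)

def r2_or_r3(o):
    return is_red(o, 2) or is_red(o, 3)

p_given_r2 = conditional(r1, r2)
p_given_r2_or_r3 = conditional(r1, r2_or_r3)

# Expected number of reds among draws 2 and 3, given at least one is red
kept = [o for o in orders if r2_or_r3(o)]
expected_reds = Fraction(sum(is_red(o, 2) + is_red(o, 3) for o in kept), len(kept))

print(p_given_r2, p_given_r2_or_r3, expected_reds)  # 1/3 2/5 6/5
```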

After encountering this result I couldn’t help but think about the famous Monty Hall problem. Both lead to seemingly contradictory results where probabilities change in counter-intuitive ways based on our knowledge. Is there really a relationship between the Monty Hall problem and the above result? Or is the connection tenuous at best?

Casino Math: Markov Chains


Understanding and applying Markov chains is an essential component of calculating probabilities in casino games that would otherwise become unwieldy. I have used Markov chains in calculating probabilities associated with popular slot features such as collection bonuses, sticky Wilds, and Lightning Link-style bonuses.

I typically use Markov chains in games where there are a reasonable number of states the player can go through. The definition of reasonable depends on time constraints and the computational power available. Because matrix multiplication is involved, the processing time grows roughly cubically with the number of states.

The Basics

To model a game as a Markov process we must define the following items:

  1. A well-defined state space \mathcal{S}=\{s_1, s_2, ..., s_N\}. This is simply the set of all the states the player can be in.
  2. A probability distribution over the initial state, X_0, i.e. the initial probability of being in each state. X_0 is typically represented by a 1\times N row vector.
  3. The transition matrix, T. Each element of T defines the probability of moving from one state to another. For example, the probability of transitioning from state s_i to state s_j would be given by the ith row and jth column of T. Note that it is essential that T does not change from round to round.

From here we can easily determine the state probability distribution X_n at each step in the process:

 X_n = X_0 T^n

X_n is a 1\times N dimensional vector that represents the probability of being in each state after the nth step of the process.


Consider a game where a player is given an equal chance of starting with 1, 2 or 3 fair coins. At the beginning of each round all coins are flipped and every coin that flipped heads is removed. The game is played until all coins have been removed. As a prize for making it to each successive round the player is paid $1 at the beginning of the first round, $2 at the beginning of the second round, $3 at the beginning of the third, etc.

To model this game as a Markov process we first define all the states the player can be in at each round. Since a player holds at most 3 coins, the states are 1) no coins removed, 2) 1 coin removed, 3) 2 coins removed, and 4) all coins removed (game over).

Since the game has an equal chance of starting with 3, 2, or 1 coins (that is, with 0, 1, or 2 coins already removed), we define the initial state vector like so:

X_0 = \left[\frac{1}{3} \  \frac{1}{3} \ \frac{1}{3} \ 0 \right]

It usually takes a little more work to determine the transition matrix. For this game it is defined as follows:

T = \left[\begin{array}{cccc}\frac{1}{8} & \frac{3}{8} & \frac{3}{8} & \frac{1}{8} \\ 0 & \frac{1}{4} & \frac{1}{2} & \frac{1}{4} \\0 & 0 & \frac{1}{2} & \frac{1}{2} \\0 & 0 & 0 & 1 \\ \end{array} \right]
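Each row of T is just a binomial distribution over how many of the remaining fair coins land tails and stay in play. A minimal Python sketch of that construction follows; the function name transition_matrix and its max_coins parameter are my own and are not part of any library.

```python
from fractions import Fraction
from math import comb

def transition_matrix(max_coins=3):
    """Build T with states ordered as above: max_coins left, ..., 1 left, 0 left."""
    coins = list(range(max_coins, -1, -1))
    T = []
    for c in coins:          # coins held at the start of the round
        row = []
        for d in coins:      # coins held after the flip
            if d > c:
                # Coins are only ever removed, never added
                row.append(Fraction(0))
            else:
                # Exactly d of the c fair coins land tails and are kept
                row.append(Fraction(comb(c, d), 2 ** c))
        T.append(row)
    return T

for row in transition_matrix(3):
    print([str(p) for p in row])
```

Writing the matrix this way makes it easy to try the same game with more starting coins, at the cost of a larger state space.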

From here we can determine the state distribution vector at each round of the game…

X_n = X_0 T^n = \left[\frac{1}{3} \  \frac{1}{3} \ \frac{1}{3} \ 0 \right]  \left[\begin{array}{cccc}\frac{1}{8} & \frac{3}{8} & \frac{3}{8} & \frac{1}{8} \\ 0 & \frac{1}{4} & \frac{1}{2} & \frac{1}{4} \\0 & 0 & \frac{1}{2} & \frac{1}{2} \\0 & 0 & 0 & 1 \\ \end{array} \right]^n = \left[ p_{1,n} \ p_{2,n} \ p_{3,n} \ p_{4,n} \right]


where

\begin{array}{rl} p_{1,n} & = \frac{8^{-n}}{3} \\ p_{2,n} & =\frac{4^{-n}}{3}+\frac{1}{3}(3\cdot 4^{-n}-3\cdot 8^{-n}) \\ p_{3,n} & = \frac{2^{-n}}{3}+\frac{1}{3}(-2^{1-2n}+2^{1-n}) +\frac{1}{3}(-3\cdot 2^{1-2n}+3\cdot 2^{-n}+3\cdot 8^{-n}) \\ p_{4,n} & = \frac{1}{3}(1-2^{-n}) +\frac{1}{3}(1-2^{1-n}+4^{-n}) + \frac{1}{3}(1-3\cdot 2^{-n}+3\cdot 4^{-n}-8^{-n}) \\ \end{array}

Side note: In case you’re wondering how to get a nice formula for T^n, you can take a look at this example. In my case, I cheated and used Mathematica. 🙂

The values p_{i,n} represent the probability of being in state s_i during round n. Note that based on the above equations it is clear that \lim_{n\rightarrow\infty}p_{4,n} = 1, implying that the game is guaranteed to end given enough time. A few more interesting properties of this game can be uncovered by analyzing these equations. For example, on average, how many rounds of this game can the player be expected to play?

\text{Expected rounds}=\sum_{n=1}^\infty (p_{1,n}+p_{2,n}+p_{3,n})=\frac{101}{63}\approx 1.603

How about the value of the game itself?

\text{Expected game value}=\sum_{n=1}^\infty (p_{1,n}+p_{2,n}+p_{3,n})n=\frac{4580}{1323}\approx\$3.46
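If you don’t have Mathematica handy, the same numbers can be reproduced numerically. The sketch below is one possible check, not the method used above: it iterates X_n = X_{n-1} T with exact fractions, compares each step against the closed forms (written in an equivalent factored form, with q = 2^{-n} the chance that a single coin survives n flips), and accumulates the two sums from n = 1, following the convention of the formulas above. The cutoff of 200 rounds is an arbitrary truncation of the infinite sums.

```python
from fractions import Fraction as F

T = [
    [F(1, 8), F(3, 8), F(3, 8), F(1, 8)],
    [F(0), F(1, 4), F(1, 2), F(1, 4)],
    [F(0), F(0), F(1, 2), F(1, 2)],
    [F(0), F(0), F(0), F(1)],
]
X0 = [F(1, 3), F(1, 3), F(1, 3), F(0)]

def step(x, T):
    """One step of the chain: row vector times transition matrix."""
    return [sum(x[i] * T[i][j] for i in range(len(x))) for j in range(len(T[0]))]

def closed_form(n):
    """State probabilities after n steps, in factored form."""
    q = F(1, 2) ** n  # chance a single coin survives n flips
    p1 = q ** 3 / 3
    p2 = (q ** 2 + 3 * q ** 2 * (1 - q)) / 3
    p3 = (q + 2 * q * (1 - q) + 3 * q * (1 - q) ** 2) / 3
    p4 = ((1 - q) + (1 - q) ** 2 + (1 - q) ** 3) / 3
    return [p1, p2, p3, p4]

x = X0
expected_rounds = F(0)
expected_value = F(0)
for n in range(1, 201):
    x = step(x, T)
    assert x == closed_form(n)   # matrix iteration agrees with the formulas
    alive = 1 - x[3]             # p_{1,n} + p_{2,n} + p_{3,n}
    expected_rounds += alive
    expected_value += n * alive

print(float(expected_rounds))  # approaches 101/63
print(float(expected_value))   # approaches 4580/1323
```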

What other interesting properties from this game can you discover by modeling the game as a Markov chain?