As expected we get our answer of -3, but this solution takes a bit of work: we need to know how to code and we need a computer. It’s also a bit messy, because if we had iterated by an increment that didn’t land on -3 exactly (say 0.031) we would not get the exact answer.
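That brute-force search might look something like this sketch (an assumption on my part: the function being minimized here is \(f(x) = (x+3)^2\), consistent with the derivative \(f'(x) = 2(x+3)\) used in this post):

```python
import numpy as np

# Brute-force minimization of f(x) = (x + 3)**2 by scanning a grid.
def f(x):
    return (x + 3) ** 2

# With a step of 0.1 the grid hits -3.0, so we recover the true minimum.
xs = np.arange(-10, 10, 0.1)
best = xs[np.argmin(f(xs))]
print(best)  # the grid point closest to -3

# With a step of 0.031 the grid skips over -3, so the answer is only close.
xs_off = np.arange(-10, 10, 0.031)
best_off = xs_off[np.argmin(f(xs_off))]
print(best_off)  # near -3, but not exact
```

Notice that the quality of the answer depends entirely on the step size we happened to choose.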
If we know some basic calculus we know that our minimum has to be where the derivative is at 0. We can very easily work out that
$$f'(x) = 2(x+3)$$
And that
$$2(x + 3) = 0 $$
When
$$ x = -3 $$
Knowing basic calculus, this latter solution becomes much easier.
But even when the calculus part is hard, solving it once often makes future solutions much easier. Take for example finding the maximum likelihood estimate for a normal distribution with a mean of \(\mu\) and standard deviation of \(\sigma\).
To solve this we start with our PDF for the normal distribution \(\varphi\):
$$\varphi(x) = \frac{1}{\sigma \sqrt{2\pi}}e^{-\frac{1}{2}(\frac{x-\mu}{\sigma})^2}$$
Now computing the derivative of this is not necessarily “easy” but it’s certainly something we can do. All we really care about is when
$$\varphi'(x) = 0$$
Which we can find (computing the derivative is, of course, left as an exercise for the reader) happens when:
$$\varphi'(x) = \frac{\mu-x}{\sigma^2}\varphi(x) = 0$$
Since \(\varphi(x)\) is always positive, this can only be zero when \(x = \mu\). This allows us to realize the amazing fact that for any normal distribution we come across, we know the maximum likelihood estimate is at \(x = \mu\)!
Even though our calculus might take us a bit of work, once this is done the problem of doing maximum likelihood estimation for any Normal distribution truly does become easy!
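As a quick numerical sanity check (a sketch, with an arbitrarily chosen \(\mu\) and \(\sigma\)), we can evaluate \(\varphi\) on a grid and confirm that its maximum lands at \(\mu\):

```python
import numpy as np

# The normal PDF, written out directly from the formula above.
def phi(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

mu, sigma = 1.7, 0.4  # arbitrary example parameters
xs = np.linspace(mu - 4 * sigma, mu + 4 * sigma, 10001)
x_max = xs[np.argmax(phi(xs, mu, sigma))]
print(x_max)  # the grid point where the density peaks: mu
```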
Proposing an Analytic solution to our problem
Let’s revisit our original problem this time attempting to find an analytic solution. This is a very interesting case because arguably this is the simplest Bayesian hypothesis test you can imagine.
Recall that we have two random variables representing our beliefs in each test. These are distributed according to the posteriors which we described earlier.
$$A \sim \text{Beta}(2+1, 13+1)$$
$$B \sim \text{Beta}(3+1,11+1)$$
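In code (a sketch using NumPy's Beta sampler, with an arbitrary seed) these two posteriors are:

```python
import numpy as np

rng = np.random.default_rng(1337)  # arbitrary seed for reproducibility

# Posterior beliefs about the conversion rate of each variant.
a_samples = rng.beta(2 + 1, 13 + 1, size=100_000)  # A ~ Beta(3, 14)
b_samples = rng.beta(3 + 1, 11 + 1, size=100_000)  # B ~ Beta(4, 12)

print(a_samples.mean())  # close to the analytic mean 3/17
print(b_samples.mean())  # close to the analytic mean 4/16
```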
Here is where I skipped some steps in reasoning. What we want to know is:
$$P(B > A)$$
Which is not expressed in a particularly useful mathematical way. A better way to solve this is to consider this as the sum (or difference in this case) of two random variables. What we really want to know is:
$$P(B - A > 0)$$
In order to solve this problem we can think of a new random variable \(X\) which is going to be the difference between B and A:
$$X = B - A$$
Finally we’ll suppose we have a probability density function for \(X\) we’ll call \(\text{pdf}_X\). If we know \(\text{pdf}_X\) our solution is pretty close, we just need to integrate between 0 and the max domain of this distribution:
$$P(B > A) = P(B - A > 0) = \int_{x=0}^{\text{max}}\text{pdf}_{X}(x)\,dx$$
Already this is starting to look a bit complicated, but there’s one big problem ahead. Unlike Normally distributed random variables, we have no equivalent of the Normal sum theorem (we’ll cover this in a bit) for Beta distributed random variables.
What does \(\text{pdf}_X\) look like? For starters we know it’s not a Beta distribution itself. We can see this because we know the domain (or support) of this distribution is not \([0,1]\). Because they are Beta distributed, A and B can both take on values from 0 to 1, which means the maximum result of this difference is 1 but the minimum is -1. So whatever this distribution is, its domain is \([-1,1]\) meaning it cannot be a Beta distribution.
We can use various rules about sum of random variables to determine the mean and variance of this distribution, but without knowing the exact form of this distribution we are unable to solve the integral analytically.
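Concretely (a sketch, assuming \(A\) and \(B\) are independent so that the variances add): \(E[X] = E[B] - E[A]\) and \(\text{Var}(X) = \text{Var}(A) + \text{Var}(B)\), which we can compute from the standard Beta formulas:

```python
# Mean and variance of a Beta(a, b) distribution.
def beta_mean(a, b):
    return a / (a + b)

def beta_var(a, b):
    return a * b / ((a + b) ** 2 * (a + b + 1))

# X = B - A with A ~ Beta(3, 14) and B ~ Beta(4, 12), assumed independent.
mean_x = beta_mean(4, 12) - beta_mean(3, 14)
var_x = beta_var(3, 14) + beta_var(4, 12)
print(mean_x, var_x)
```

So we know exactly where \(X\) is centered and how spread out it is, yet still not what shape it takes.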
Here we can see that even in this profoundly simple problem the analytical solution is frustratingly evasive.
Thankfully the joy of deriving this integral can be found in this great post on Cross Validated by Corey Yanofsky. You can also find some additional exposition in Evan Miller’s post on the topic.
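For the curious, the closed form from Evan Miller’s post fits in a few lines (a sketch; I use `math.lgamma` for the Beta function to keep things numerically stable, and assume integer posterior parameters as in our problem):

```python
from math import exp, lgamma, log

def log_beta_fn(a, b):
    # log of the Beta function B(a, b)
    return lgamma(a) + lgamma(b) - lgamma(a + b)

def prob_b_beats_a(alpha_a, beta_a, alpha_b, beta_b):
    # Exact P(B > A) for independent A ~ Beta(alpha_a, beta_a) and
    # B ~ Beta(alpha_b, beta_b), following Evan Miller's derivation.
    total = 0.0
    for i in range(alpha_b):
        total += exp(
            log_beta_fn(alpha_a + i, beta_a + beta_b)
            - log(beta_b + i)
            - log_beta_fn(1 + i, beta_b)
            - log_beta_fn(alpha_a, beta_a)
        )
    return total

print(prob_b_beats_a(3, 14, 4, 12))  # exact P(B > A) for our posteriors
```

Note how much work hides in that short sum compared to simply sampling.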
This brings up the next point of the question regarding a ‘non-brute force’ method. Coming from a computer science background, I would generally define ‘brute force’ as naively trying solutions until you find the best one. For this problem that code would involve iterating through probabilities and then coming up with some way to analyze the results.
What we are doing in our simulation is really a form of Monte-Carlo integration. We’ve already established that \(P(B > A)\) is equivalent to \(\int_{x=0}^{\text{max}}\text{pdf}_{X}(x)\). Since \(\text{pdf}_{X}\) is unknown we are approximating it by sampling from the possible differences between B and A.
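That Monte-Carlo integration is only a few lines (a sketch with NumPy and an arbitrary seed):

```python
import numpy as np

rng = np.random.default_rng(42)
n = 1_000_000

# Sample from both posteriors and count how often B beats A.
a = rng.beta(3, 14, size=n)  # A ~ Beta(2+1, 13+1)
b = rng.beta(4, 12, size=n)  # B ~ Beta(3+1, 11+1)
p_b_beats_a = (b > a).mean()
print(p_b_beats_a)  # Monte-Carlo estimate of P(B > A)
```

Each sampled pair is one draw from the unknown \(\text{pdf}_{X}\); the fraction of draws above zero approximates the integral.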
It may help to visualize this distribution: