The Elements of Statistical Learning 2.3
Overview of Supervised Learning Part 2 (Statistical Decision Theory: Expected Value of a Function)
Hopefully last time you gained some intuition about expected values, because this time we will start deriving the linear model by taking the expected value of the loss function.
The journey begins with the squared error loss function, [y − f(x)]², because it is the most common and convenient choice compared to the other loss functions I will cover later. If that looks scary, denote the squared error function by “L” and it should look more familiar: E[L]. We also begin this derivation with quantitative outputs, which is why the expected value formula appears in its continuous form, with an integral sign. Before we continue, let’s refresh our memory a little.
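Written out, the quantity we are averaging is what the book calls the expected prediction error (EPE):

\mathrm{EPE}(f) \;=\; \mathbb{E}\big[(Y - f(X))^2\big] \;=\; \int \big[y - f(x)\big]^2 \,\Pr(dx,\, dy)

This is exactly E[L] with L = [y − f(x)]², averaged over the joint distribution of X and Y.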
Previously, the subject of our expected value was the random variable, as opposed to the observed value (the actual occurrence), which is the subject of the average. We also established that the expected value of a random variable is the sum of every possible value multiplied by its corresponding probability; for a continuous variable the sum becomes an integral and the probability weights come from the PDF (probability density function). But what happens when we are looking for the expected value of a function of a random variable instead of the random variable itself?
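In symbols: for a discrete random variable X with probability mass function p(x), and for a continuous X with density f_X(x),

\mathbb{E}[X] \;=\; \sum_x x\, p(x) \qquad \text{and} \qquad \mathbb{E}[X] \;=\; \int x\, f_X(x)\, dx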
The short answer is to replace the random variable with the function’s output. What about the PDF part of the equation: does that mean we must multiply the output by the PDF of the function’s output? That would work, but thanks to the gods of probability theory, we can achieve the same thing by multiplying the function’s output by the PDF of the original random variable instead, which is much easier to find than the PDF of the function’s output.
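This convenient fact is usually called the law of the unconscious statistician (LOTUS): for a function g of a continuous random variable X with density f_X,

\mathbb{E}[g(X)] \;=\; \int g(x)\, f_X(x)\, dx

so we never have to work out the distribution of g(X) itself.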
Wait a minute, this sounds too convenient. How does it work? What is the intuition?
If we back up a little, this idea should make intuitive sense. If a certain input leads to an intermediate outcome, and ultimately to the final output, then the distribution of that initial input should also be reflected in the distribution of the final output.
Let’s say I have a factory that makes socks, and it turns out that my profit margin increases when I produce more happy socks. Let’s also assume that the quantity of happy socks increases if and only if there are more happy workers. In this sense, the number of happy workers mirrors the number of happy socks: I don’t need to know the PDF of happy socks as long as I have the PDF of happy workers. In our notation, worker happiness plays the role of the random variable X with density f_X, and the happy-sock count plays the role of the function output, the loss L, whose PDF we never have to find.
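To make this concrete, here is a toy numerical check of LOTUS in Python. The distribution of worker happiness and the happiness-to-socks function below are entirely made up for illustration; the point is that both computations agree without ever deriving the distribution of g(X).

import numpy as np

rng = np.random.default_rng(0)

# Made-up setup: worker happiness X ~ Exponential(scale=2.0),
# and happy-sock output is g(x) = 3*x + 1 (a hypothetical relationship).
def g(x):
    return 3 * x + 1

# 1) Monte Carlo: average g(X) over samples of X.
x = rng.exponential(scale=2.0, size=1_000_000)
print(np.mean(g(x)))  # ~ 7.0, since E[3X + 1] = 3*E[X] + 1 = 3*2 + 1

# 2) LOTUS: integrate g(x) against the PDF of X (no PDF of g(X) needed).
xs = np.linspace(0.0, 60.0, 200_001)
pdf = np.exp(-xs / 2.0) / 2.0       # Exponential(scale=2) density
dx = xs[1] - xs[0]
print(np.sum(g(xs) * pdf) * dx)     # ~ 7.0 as well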
Conclusion
Ok, cool, socks, but wait, why do we take the expected value of the squared error loss function again?
If the expected value is the probability-weighted average of all possible values in the sample space, then taking the expected value of the loss means averaging how bad our predictions are over every possible input-output pair. Remember, the output of the squared error measures how bad a prediction is: when most predictions land close to the truth, the average penalty stays small, and as the errors grow, the penalty grows quadratically.
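As a quick illustration (the data-generating process below is invented for this sketch), averaging the squared error over many draws shows how the EPE rewards a rule that tracks the truth and punishes one that ignores it:

import numpy as np

rng = np.random.default_rng(1)

# Invented data-generating process: Y = 2*X + noise.
x = rng.uniform(0.0, 1.0, 100_000)
y = 2.0 * x + rng.normal(0.0, 0.1, x.size)

def good_rule(x):
    return 2.0 * x          # tracks the true relationship

def bad_rule(x):
    return np.ones_like(x)  # constant guess, ignores x entirely

# Monte Carlo estimates of EPE = E[(Y - f(X))^2] for each rule.
print(np.mean((y - good_rule(x)) ** 2))  # ~ 0.01 (just the noise variance)
print(np.mean((y - bad_rule(x)) ** 2))   # ~ 0.34, a much larger average penalty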
One last thing: since we don’t have any data (inputs) to train on, we are not computing an empirical average here. Keep in mind that the purpose is to derive the regression function, and more specifically a linear model, by taking the expected value of the squared error loss. Next time we will continue with the EPE by breaking down Pr(dx, dy). See you then!