The Elements of Statistical Learning 2.4

Overview of Supervised Learning Part 2 (Statistical Decision Theory-Conditional Expectation-EPE)

袁晗 | Luo, Yuan Han
3 min read · Dec 3, 2021
img src: https://imgflip.com/tag/math?sort=top-2021-02

Now that we are more familiar with expected value in terms of squared error loss, let’s talk about expected prediction error (EPE) and come full circle.

The EPE is the expected value we discussed previously. But since we are taking the expected value of the loss function, which outputs a quantity of error, we call it EPE. What I didn’t mention last time is its argument, which ranges over an infinite number of candidate functions. The goal here is to find the one function that gives the lowest squared error loss. Let’s go through that step by step.
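Written out in symbols, the EPE of a candidate predictor f under squared error loss is (roughly, in the notation of ESL section 2.4):

```latex
% Expected prediction error of a predictor f under squared error loss
\mathrm{EPE}(f) \;=\; \mathbb{E}\big[(Y - f(X))^2\big]
\;=\; \int \big(y - f(x)\big)^2 \,\Pr(dx,\, dy)
```

The argument of EPE is the whole function f, which is why the search runs over functions rather than over single numbers.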

img src: https://stats.stackexchange.com/questions/92180/expected-prediction-error-derivation/102662#102662?newreg=b382eef8c7d247deab11fd33f89e4663
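The key step in that derivation is conditioning on X, which splits the single expectation into an outer expectation over X and an inner, conditional expectation over Y given X:

```latex
% Condition on X: outer expectation over X, inner conditional expectation over Y given X
\mathrm{EPE}(f) \;=\; \mathbb{E}_{X}\,\mathbb{E}_{Y \mid X}\!\big[\,(Y - f(X))^2 \mid X\,\big]
```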

At this point, you should be asking why we spent so much effort rearranging E[] into E_x[]. We did so to expose the argument of E_x[], which is E_(y|x)[]. Once we have that, we can isolate E_(y|x)[] and drop it inside any function we like, which means we are no longer bound to just taking its expected value. Instead, we can do all sorts of things, like look for the specific value at which E_(y|x)[] attains its minimum, and that is exactly what we do. Because the inner expectation is conditioned on X = x, it suffices to minimize the EPE pointwise, one x at a time: we yank E_(y|x)[] out of E_x[] and place it behind argmin_c (argmin_c returns the value of c for which the input function, E_(y|x)[] in our case, attains its minimum). Note that in the computer world you can think of argmin_c as a function, because that is exactly what it does: it takes an input and returns an output. In the mathematical world it is more precisely an operator that maps a function to the point where that function is minimized.

The “c” in argmin_c represents the value our predictor outputs at the point x. When we do that, we also naturally replace the EPE notation with function notation, because solving the argmin at every x yields a function f(x), not the single expectation that E_x[] returns. But we are not done yet.

img src: https://web.stanford.edu/~hastie/Papers/ESLII.pdf
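In the book’s notation (roughly ESL eq. 2.12), the pointwise minimization reads:

```latex
% Minimize pointwise: at each x, choose the constant c with the smallest conditional expected loss
f(x) \;=\; \operatorname*{argmin}_{c}\; \mathbb{E}_{Y \mid X}\!\big[\,(Y - c)^2 \mid X = x\,\big]
```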

If we differentiate the inner conditional expectation with respect to c, we get the following.

img src: https://stats.stackexchange.com/questions/418967/solving-argmin-ey-c2-x?noredirect=1&lq=1
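Spelled out, the differentiation the linked answer walks through looks roughly like this (swapping the derivative and the expectation, which is fine here):

```latex
% Differentiate the conditional expected squared loss with respect to c and set it to zero
\frac{\partial}{\partial c}\,\mathbb{E}\big[(Y - c)^2 \mid X = x\big]
  \;=\; -2\,\big(\mathbb{E}[Y \mid X = x] - c\big) \;=\; 0
\;\;\Longrightarrow\;\; c \;=\; \mathbb{E}[Y \mid X = x]
```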

Wait a second, why do we do that? Remember from calculus class that we can find a minimum or maximum with the derivative, because at a max/min the slope equals 0. So after we find the derivative, aka the fancy slope, we set it equal to 0. How do we know it’s a min instead of a max? Because we are differentiating a squared function, which is convex (it opens upward; its second derivative is 2 > 0), so the critical point is a minimum. Solving it gives us the conditional expectation, aka the regression function.

Conditional Expectation

In human terms, taking the average of y at every x is what minimizes the expected squared error loss.
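As a quick sanity check, here is a small simulation sketch in Python (the data and numbers are made up for illustration): at a single fixed x, a brute-force search over constant predictions lands on the sample average of y.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data for illustration: at a fixed x, suppose Y | X = x is
# distributed around a true value of 3.0 with some noise. The 10,000 draws
# below stand in for the conditional distribution Pr(Y | X = x).
y_given_x = rng.normal(loc=3.0, scale=1.0, size=10_000)

def expected_squared_loss(c, y):
    """Average squared error when we always predict the constant c."""
    return np.mean((y - c) ** 2)

# Try a grid of constant predictions c and keep the one with the lowest loss
# (a brute-force stand-in for argmin_c).
candidates = np.linspace(0.0, 6.0, 601)
losses = [expected_squared_loss(c, y_given_x) for c in candidates]
best_c = candidates[np.argmin(losses)]

print(f"best constant prediction:   {best_c:.3f}")
print(f"sample mean of y at this x: {y_given_x.mean():.3f}")
# Up to grid resolution and sampling noise, the two agree:
# the conditional average is the squared-error minimizer.
```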

Conclusion

It turns out machines make decisions in a way not that different from us: choose the most probable outcome. Then what is the point of having machines do it? Well, counting is also a simple task, but a calculator can work out 3729572141 * 257512312 in a split second. Next time let’s talk about Nearest Neighbors. Through the lens of Statistical Decision Theory, it is the same guy with a different haircut.
