Often we want to measure the size of some mathematical object. When the object can be built out of real numbers in the way that scalars, vectors, and matrices can, there is a consistent strategy for defining the size of the object as a whole. We will call the size of the object its norm and write the norm of $x$ as $N(x)$ or, more commonly, as $||x||$.
For scalars, we can use absolute values as a measure of size:
\[
N(x) = ||x|| \triangleq |x|
\]
For reasons that will be clear in a moment, we can also write this in a "more complicated" way:
\[
N(x) = ||x|| \triangleq (|x|^2)^{\frac{1}{2}}
\]
For $n$-dimensional vectors, we can use Euclidean norms as a measure of size:
\[
N(x) = ||x|| \triangleq (\sum_{i = 1}^{n} x_{i}^2)^{\frac{1}{2}}
\]
Again, we can write this in a "more complicated" way:
\[
N(x) = ||x|| \triangleq (\sum_{i = 1}^{n} |x_{i}|^2)^{\frac{1}{2}}
\]
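As a quick check in R (the vector here is just a made-up example):

```{r}
# A minimal check of the vector definition above; x is an arbitrary example.
x <- c(3, -4)
sqrt(sum(abs(x)^2))  # 5, the Euclidean norm of (3, -4)
```
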
For $m$-by-$n$ matrices, we can use what is called the Frobenius norm as a measure of size, which we write in the "more complicated" way from the start, since you are less likely to have seen it in another form:
\[
N(X) = ||X|| \triangleq (\sum_{i = 1}^{m} \sum_{j = 1}^{n} |x_{i, j}|^2)^{\frac{1}{2}}
\]
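Again as a sketch, the same computation for a matrix agrees with the Frobenius norm that base R's `norm()` computes when `type = "F"`:

```{r}
# The same sum-square-root recipe for a matrix, compared against base R.
X <- matrix(c(1, -2, 3, -4), nrow = 2)
c(sqrt(sum(abs(X)^2)), norm(X, type = "F"))
```
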
When these measures of size are all presented right next to one another, it becomes clear that they are actually the same mathematical operation given different names depending on the type of object being measured. For any object built up out of real numbers with extents along $k$ dimensions, you sum up the squared entries and then compute the square root. Scalars are a degenerate case in which there are $0$ dimensions.
Note that the operations of squaring and computing square roots occur in all of these definitions. This is not a deep fact about the essential truths of our universe: we typically square things for mathematical convenience, because the resulting objective functions are differentiable. Nothing about the concept of size requires this, and modern statistics and machine learning have moved beyond it to explore a much richer set of norms.
This richer set of norms, a natural generalization of the $L_{2}$-style norms defined above, consists of the $L_{p}$ norms, also known as the Minkowski $p$-metrics. For scalars, the $L_{p}$ norm is,
\[
N_{p}(x) = ||x||_{p} \triangleq (|x|^p)^{\frac{1}{p}}
\]
For vectors, the $L_{p}$ norm is,
\[
N_{p}(x) = ||x||_{p} \triangleq (\sum_{i = 1}^{n} |x_{i}|^p)^{\frac{1}{p}}
\]
For matrices, the $L_{p}$ norm is
\[
N_{p}(X) = ||X||_{p} \triangleq (\sum_{i = 1}^{m} \sum_{j = 1}^{n} |x_{i, j}|^p)^{\frac{1}{p}}
\]
The norms we described earlier are all special cases in which $p = 2$. But now that you've seen all the forms of the $L_{p}$ metric, the "more complicated" ways to write things are much more natural than the simpler ways, no?
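To make the pattern concrete, here is a small R sketch; the function name `lp_norm` is ad hoc, the same code handles scalars, vectors, and matrices, and $p = 2$ recovers the norms defined above:

```{r}
# A general L_p norm: sum the absolute values raised to p, then take the p-th root.
lp_norm <- function(x, p) sum(abs(x)^p)^(1 / p)

x <- c(3, -4)
c(lp_norm(x, 1), lp_norm(x, 2), sqrt(sum(x^2)))  # L1, L2, and the Euclidean norm agree at p = 2
```
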
Thankfully, once you understand norms, we can trivially describe three of the fundamental objects of modern statistical theory:
* Means: $\mu^{*} = \arg \min_{\mu} \sum_{i = 1}^{n} ||x_{i} - \mu||_{2}^{2}$
* OLS linear regressions: $\beta^{*} = \arg \min_{\beta} ||Y - X\beta||_{2}$
* Rank-$k$ SVDs: $\bar{X}^{*} = \arg \min_{\bar{X}} ||X - \bar{X}||_{2}$, where $\bar{X}$ has rank $k$
The reason for this is simple: almost all of statistical theory can be understood in terms of prediction error minimization. We have (1) a system that generates predictions, (2) a ground truth we want to predict, and (3) a measure of the error in our predictions. We typically measure that error using a norm of the discrepancy between our predictions and the ground truth. The standard algorithms in statistics are simply mechanisms for making out-of-the-box predictions that minimize errors measured with $L_{2}$ norms.
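To make this concrete, here is a rough numerical illustration in R. Everything is simulated, and the generic optimizers are only stand-ins for the closed-form answers from `mean()`, `lm()`, and `svd()`; the point is just to show which criterion each object minimizes.

```{r}
# Simulated data; generic optimizers stand in for the usual closed-form solutions.
set.seed(1)
x <- rnorm(100, mean = 3)

# The mean minimizes the sum of squared deviations.
mu_star <- optimize(function(mu) sum((x - mu)^2), interval = range(x))$minimum
c(mu_star, mean(x))

# OLS minimizes the L2 norm of the residual vector Y - X beta.
X <- cbind(1, rnorm(100))
Y <- drop(X %*% c(1, 2)) + rnorm(100)
beta_star <- optim(c(0, 0), function(b) sqrt(sum((Y - X %*% b)^2)))$par
cbind(optim = beta_star, lm = coef(lm(Y ~ X - 1)))

# The rank-1 reconstruction from svd() minimizes the Frobenius discrepancy
# among all rank-1 matrices (Eckart-Young).
M <- matrix(rnorm(20), nrow = 5)
s <- svd(M)
M1 <- s$d[1] * (s$u[, 1] %o% s$v[, 1])
norm(M - M1, type = "F")
```
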
Once you have mastered means, OLS, and the SVD, we can move on to their $L_{1}$ counterparts (a quick numerical check follows this list):
* Medians: $m^{*} = \arg \min_{m} \sum_{i = 1}^{n} ||x_{i} - m||_{1}$
* LAD linear regression: $\beta^{*} = \arg \min_{\beta} ||Y - X\beta||_{1}$
* Robust low-rank approximation (an $L_{1}$ analogue of the SVD): $\bar{X}^{*} = \arg \min_{\bar{X}} ||X - \bar{X}||_{1}$, where $\bar{X}$ has rank $k$
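The same kind of sketch works for the $L_{1}$ versions, again on simulated data. A dedicated solver such as `quantreg::rq()` is what you would use for LAD regression in practice; `optim()` is used here only to make the objective explicit.

```{r}
# Skewed data, so the mean and median differ; generic optimizers again
# stand in for the usual algorithms.
set.seed(2)
x <- rexp(101)

# The median minimizes the sum of absolute deviations.
m_star <- optimize(function(m) sum(abs(x - m)), interval = range(x))$minimum
c(m_star, median(x))

# LAD regression minimizes the L1 norm of the residuals; optim() gives an
# approximate solution here.
X <- cbind(1, rnorm(101))
Y <- drop(X %*% c(1, 2)) + rcauchy(101)  # heavy-tailed noise
beta_lad <- optim(c(0, 0), function(b) sum(abs(Y - X %*% b)))$par
beta_lad
```
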
After that, we can regularize these basic models by penalizing the parameters we fit with these same norms (a sketch of the two penalized regressions follows the list). We then get:
* Bayesian (MAP) inference for means with Normal priors: $\mu^{*} = \arg \min_{\mu} \sum_{i = 1}^{n} ||x_{i} - \mu||_{2}^{2} + \lambda ||\mu||_{2}^{2}$
* Ridge regression: $\beta^{*} = \arg \min_{\beta} ||Y - X\beta||_{2}^{2} + \lambda ||\beta||_{2}^{2}$
* L2 regularized SVD: $\bar{X}^{*} = \arg \min_{\bar{X}} ||X - \bar{X}||_{2}^{2} + \lambda ||\bar{X}||_{2}^{2}$, where $\bar{X}$ has rank $k$
* Bayesian (MAP) inference for medians with Normal priors: $m^{*} = \arg \min_{m} \sum_{i = 1}^{n} ||x_{i} - m||_{1} + \lambda ||m||_{2}^{2}$
* LASSO regression: $\beta^{*} = \arg \min_{\beta} ||Y - X\beta||_{2}^{2} + \lambda ||\beta||_{1}$
* L1 regularized SVD: $\bar{X}^{*} = \arg \min_{\bar{X}} ||X - \bar{X}||_{2}^{2} + \lambda ||\bar{X}||_{1}$, where $\bar{X}$ has rank $k$
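And here is a sketch of the ridge and LASSO criteria, minimized directly with `optim()`. The simulated data and the value of $\lambda$ are arbitrary, and in practice you would use a dedicated solver (for example, the `glmnet` package); the point is only that these are the same penalized norms written out in code.

```{r}
# Simulated data and an arbitrary lambda; optim() minimizes the penalized
# criteria directly, which is enough to compare the fits.
set.seed(3)
n <- 200
X <- cbind(rnorm(n), rnorm(n), rnorm(n))
Y <- drop(X %*% c(2, 0, -1)) + rnorm(n)
lambda <- 5

ridge_obj <- function(b) sum((Y - X %*% b)^2) + lambda * sum(b^2)
lasso_obj <- function(b) sum((Y - X %*% b)^2) + lambda * sum(abs(b))

beta_ridge <- optim(rep(0, 3), ridge_obj)$par
beta_lasso <- optim(rep(0, 3), lasso_obj)$par
rbind(ols = coef(lm(Y ~ X - 1)), ridge = beta_ridge, lasso = beta_lasso)
```
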
We also note that the following models are all special cases of, variations of, or straightforward applications of the standard SVD:
* Factor analysis
* PCA
* MDS
* Procrustean analysis
* More...
Really, prediction error minimization is very nearly a universal principle of statistical theory.