Commit e3f9949b authored by Felix Ramnelöv

Lab 1: Cleaned up notes

parent 9e38aaf8
@@ -30,11 +30,13 @@ mse <- function(y, y_hat)
# Column names of the features and of the target (column 5 of the scaled training data)
feature_cols <- colnames(X_train)
target_col <- colnames(train_scaled)[5]
print(target_col)

# Regression formula: target on all features, without an intercept
formula <- as.formula(paste(target_col, "~", paste(feature_cols, collapse = " + "), "-1"))

# Fit the linear model and extract the coefficient vector and residual standard error
model <- lm(formula, data = train_scaled)
model_summary <- summary(model)
print(model_summary)
theta <- model_summary$coefficients[, 1]
sigma <- model_summary$sigma
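# A possible continuation (sketch, not from the original script): evaluate the fit
# with the mse() helper defined earlier in this file; test_scaled is assumed to be
# the test data scaled in the same way as train_scaled.
mse_train <- mse(train_scaled[[target_col]], predict(model, newdata = train_scaled))
mse_test <- mse(test_scaled[[target_col]], predict(model, newdata = test_scaled))
print(c(train = mse_train, test = mse_test))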
@@ -38,20 +38,20 @@ Confusion matrix and misclassification error are computed.
Misclassification errors:
- $E_{\text{mis,train}} = 0.04500262$
- $E_{\text{mis,test}} = 0.05329154$
3. Comment: The easy cases were indeed easy to recognize visually, while the hard cases were difficult to recognize even by eye.
4. Model complexity is highest when $k$ is lowest and decreases as $k$ increases (as seen in the graph, where the training error grows with increasing $k$). The optimal $k$ is where the validation error is at its minimum, i.e. $k = 3$.
Formula: $R(Y, \hat{Y}) = \frac{1}{N} \sum_{i=1}^{N} I(Y_i \neq \hat{Y}_i)$
![Misclassification rate depending on k](./assignment1-4.png)
Test error ($k = 3$): $0.02403344$. This is higher than the training error but slightly lower than the validation error. In our view it is a fairly good model, considering that it classifies correctly $\approx 98\%$ of the time.
5. Optimal $k = 6$, where the average cross-entropy loss is lowest. The average cross-entropy loss takes the predicted probabilities into account, which is a better evaluation of a model with a multinomial distribution: it lets us quantify how wrong a classification is, not just whether it is wrong (see the code sketch after this list).
Formula: $R(Y, \hat{p}(Y)) = - \frac{1}{N} \sum_{i=1}^{N} \sum_{m=1}^{M} I(Y_i = C_m) \log \hat{p}(Y_i = C_m)$
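A minimal R sketch of the two error measures above (assuming `y` holds the true labels, `y_pred` the predicted labels, and `p_hat` an $n \times M$ matrix of predicted class probabilities with one column per class; all three names are hypothetical):

```r
# Misclassification rate: share of predictions that differ from the true labels
misclass_rate <- function(y, y_pred) {
  mean(y != y_pred)
}

# Average cross-entropy loss: negative mean log-probability assigned to the true class.
# A small constant guards against log(0) for confident but wrong predictions.
cross_entropy <- function(y, p_hat, eps = 1e-15) {
  true_class_prob <- p_hat[cbind(seq_along(y), match(y, colnames(p_hat)))]
  -mean(log(true_class_prob + eps))
}
```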
@@ -59,9 +59,55 @@ Confusion matrix and misclassification error are computed.
## Assignment 2
2. In the estimation summary shown below, the features are ordered by significance. DFA is clearly the most significant, followed by PPE and HNR in decreasing order.
Estimation summary:
| Coefficient | Estimate | Std. Error | t value | Pr(>\|t\|) |
| ------------- | ---------- | ---------- | ------- | ---------- |
| DFA | -0.280318 | 0.020136 | -13.921 | < 2e-16 |
| PPE | 0.226467 | 0.032881 | 6.887 | 6.70e-12 |
| HNR | -0.238543 | 0.036395 | -6.554 | 6.41e-11 |
| Shimmer.APQ11 | 0.305546 | 0.061236 | 4.990 | 6.34e-07 |
| Jitter.Abs. | -0.169609 | 0.040805 | -4.157 | 3.31e-05 |
| NHR | -0.185387 | 0.045567 | -4.068 | 4.84e-05 |
| Shimmer.APQ5 | -0.387507 | 0.113789 | -3.405 | 0.000668 |
| Shimmer | 0.592436 | 0.205981 | 2.876 | 0.004050 |
| RPDE | 0.004068 | 0.022664 | 0.179 | 0.857556 |
| Shimmer.APQ3 | 32.070932 | 77.159242 | 0.416 | 0.677694 |
| Shimmer.DDA | -32.387241 | 77.158814 | -0.420 | 0.674695 |
| Jitter.RAP | -5.269544 | 18.834160 | -0.280 | 0.779658 |
| Jitter.DDP | 5.249558 | 18.837525 | 0.279 | 0.780510 |
| Jitter.PPQ5 | -0.074568 | 0.087766 | -0.850 | 0.395592 |
| Jitter... | 0.186931 | 0.149561 | 1.250 | 0.211431 |
| Shimmer.dB. | -0.172655 | 0.139316 | -1.239 | 0.215315 |
Mean square error:
- Formula: $$\text{MSE} = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$$
- $\text{MSE}_{\text{train}} = 0.878543102826276$
- $\text{MSE}_{\text{test}} = 0.935447712156739$
3. The functions were implemented using the following formulas (a code sketch follows at the end of this assignment):
- _Loglikelihood_: $$\log P(T | \theta, \sigma) = -\frac{n}{2} \log(2 \pi \sigma^2) - \frac{1}{2 \sigma^2} \sum_{i=1}^{n} (T_i - \mathbf{X}_i \boldsymbol{\theta})^2$$
- _Ridge_: $$\mathcal{L}_{\text{ridge}}(\theta, \sigma, \lambda) = \lambda \sum_{j=1}^{p} \theta_j^2 - \log P(T | \theta, \sigma)$$
- _RidgeOpt_: $$\hat{\theta}, \hat{\sigma} = \arg \min_{\theta, \sigma} \mathcal{L}_{\text{ridge}}(\theta, \sigma, \lambda)$$
- _DF_: $$\text{df}(\lambda) = \text{tr}\left( X \left( X^T X + \lambda I \right)^{-1} X^T \right)$$
4. Optimal $\boldsymbol{\theta}$ for $\lambda \in \{1, 100, 1000\}$:
- $\lambda = 1$:
- $\text{MSE}_{\text{train}} = 0.878681448897974$
- $\text{MSE}_{\text{test}} = 0.934684486872397$
- $df = 13.8607362829965$
- $\lambda = 100$:
- $\text{MSE}_{\text{train}} = 0.889775499501371$
- $\text{MSE}_{\text{test}} = 0.934131808081541$
- $df = 9.92488712829542$
- $\lambda = 1000$:
- $\text{MSE}_{\text{train}} = 0.939949118364897$
- $\text{MSE}_{\text{test}} = 0.967756869359676$
- $df = 5.6439254878463$
$\lambda = 100$ seems to be the most suitable penalty parameter, considering that we can drop $df(1) - df(100) \approx 4$ degrees of freedom without any significant change in $\text{MSE}_{\text{test}}$.
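A minimal R sketch of how the functions in step 3 and the comparison in step 4 could be implemented. The names `X_train`, `y_train`, `X_test` and `y_test` (scaled feature matrices and target vectors) are assumptions, as is the choice of `optim()` with box constraints; `mse()` refers to the helper defined earlier in the script.

```r
# Log-likelihood of the targets under a Gaussian linear model without intercept
loglikelihood <- function(theta, sigma, X, y) {
  n <- nrow(X)
  -n / 2 * log(2 * pi * sigma^2) - sum((y - X %*% theta)^2) / (2 * sigma^2)
}

# Ridge: penalized negative log-likelihood
ridge <- function(par, X, y, lambda) {
  theta <- par[1:ncol(X)]
  sigma <- par[ncol(X) + 1]
  lambda * sum(theta^2) - loglikelihood(theta, sigma, X, y)
}

# RidgeOpt: minimize the ridge objective over theta and sigma
ridge_opt <- function(X, y, lambda) {
  init  <- c(rep(0, ncol(X)), 1)        # start at theta = 0, sigma = 1
  lower <- c(rep(-Inf, ncol(X)), 1e-6)  # keep sigma positive
  optim(par = init, fn = ridge, X = X, y = y, lambda = lambda,
        method = "L-BFGS-B", lower = lower)$par
}

# DF: effective degrees of freedom, tr(X (X'X + lambda I)^-1 X')
df_ridge <- function(X, lambda) {
  sum(diag(X %*% solve(t(X) %*% X + lambda * diag(ncol(X))) %*% t(X)))
}

# Compare the three penalties on training and test data
for (lambda in c(1, 100, 1000)) {
  par_hat   <- ridge_opt(X_train, y_train, lambda)
  theta_hat <- par_hat[1:ncol(X_train)]
  cat("lambda =", lambda,
      "| MSE train =", mse(y_train, X_train %*% theta_hat),
      "| MSE test =", mse(y_test, X_test %*% theta_hat),
      "| df =", df_ridge(X_train, lambda), "\n")
}
```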