- _What are the practical approaches for reducing the expected new data error,
according to the book?_
[p. 76]
In practice the aim is to achieve a small error in production, i.e. a small $E_{\text{new}}$. Of course $E_{\text{new}}$ cannot be computed in practice, but the decomposition $E_{\text{new}} = E_{\text{train}} + \text{generalisation gap}$ allows two conclusions to be drawn. First, $E_{\text{new}}$ will on average be larger than $E_{\text{train}}$, meaning that if $E_{\text{train}}$ is already greater than the required $E_{\text{new}}$, the problem needs to be reconsidered. Second, the generalisation gap (and thereby $E_{\text{new}}$) decreases as the size of the training data increases, so collecting more training data can help a lot in reducing $E_{\text{new}}$. In addition, model complexity can be evaluated using $E_{\text{hold-out}}$: if $E_{\text{hold-out}} \approx E_{\text{train}}$, underfitting is likely and it might be beneficial to increase the complexity, whereas if $E_{\text{train}}$ is close to zero while $E_{\text{hold-out}}$ is not, overfitting is likely and it might be beneficial to decrease the complexity instead. [p. 76-77]
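A minimal sketch (not from the book) of this diagnostic, assuming scikit-learn, a decision tree as an arbitrary example model, and placeholder data; the numeric thresholds used to call the two errors "close" are illustrative assumptions only:
```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Placeholder data standing in for a real training set (assumption).
X, y = np.random.randn(500, 5), np.random.randint(0, 2, 500)
X_train, X_hold, y_train, y_hold = train_test_split(X, y, test_size=0.3)

model = DecisionTreeClassifier(max_depth=3).fit(X_train, y_train)
E_train = np.mean(model.predict(X_train) != y_train)   # training error
E_hold_out = np.mean(model.predict(X_hold) != y_hold)  # hold-out error

# Thresholds below are arbitrary choices, not values from the book.
if abs(E_hold_out - E_train) < 0.01:
    print("E_hold-out ≈ E_train: underfitting likely, consider more complexity")
elif E_train < 0.01 and E_hold_out > 0.1:
    print("E_train ≈ 0 but E_hold-out is not: overfitting likely, reduce complexity")
```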
- _What important aspect should be considered when selecting minibatches,
according to the book?_
[p. 125]
When using minibatches it is vital to ensure that the different batches are balanced and representative of the whole dataset. Consider, for example, a training dataset with different output classes that is sorted by those classes: the first minibatch might then contain only a single output class and thus give a poor representation of the whole dataset. It is therefore important that the batches are formed randomly, for example by randomly shuffling the training data and then dividing it into minibatches in an ordered manner. [p. 125]
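A minimal sketch of this shuffle-then-split scheme, assuming NumPy arrays; the helper name `make_minibatches` is a hypothetical choice, not from the book:
```python
import numpy as np

def make_minibatches(X, y, batch_size, rng=None):
    """Randomly shuffle the training data, then divide it into
    minibatches in an ordered manner, so that each batch is a
    representative random sample of the whole dataset."""
    if rng is None:
        rng = np.random.default_rng()
    perm = rng.permutation(len(X))  # random reordering of all indices
    X, y = X[perm], y[perm]
    return [(X[i:i + batch_size], y[i:i + batch_size])
            for i in range(0, len(X), batch_size)]
```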
- _Provide an example of modifications in a loss function and in data that can be
done to take into account the data imbalance, according to the book._
[p. 101]
An example of a modification of a loss function to take imbalance into account (in binary classification) is to reflect that misclassifying $y = 1$ is $C$ times more costly than misclassifying $y = -1$. The misclassification loss can be modified to:
$$
L(y,\hat{y}) =
\begin{cases}
0 & \text{if } \hat{y} = y, \\
1 & \text{if } \hat{y} \neq y \text{ and } y = -1, \\
C & \text{if } \hat{y} \neq y \text{ and } y = 1.
\end{cases}
$$
Other loss functions can be modified in a similar way. A similar effect can be achieved by, e.g., duplicating all positive training data points $C$ times in the training data instead of modifying the loss function. [p. 101-102]
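A minimal sketch of both options, assuming NumPy and labels in $\{-1, 1\}$; the value of `C` and the helper names are illustrative assumptions:
```python
import numpy as np

C = 5  # assumption: misclassifying y = +1 is 5 times more costly

def misclassification_loss(y, y_hat, C):
    """The modified misclassification loss from the equation above:
    0 if correct, 1 if a true y = -1 is misclassified,
    C if a true y = +1 is misclassified."""
    if y_hat == y:
        return 0
    return C if y == 1 else 1

def oversample_positives(X, y, C):
    """The data-side alternative: duplicate every positive training
    point so it appears C times in total (original plus C - 1 copies)."""
    pos = y == 1
    X_dup = np.concatenate([X] + [X[pos]] * (C - 1))
    y_dup = np.concatenate([y] + [y[pos]] * (C - 1))
    return X_dup, y_dup
```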