Underfitting Vs. Overfitting
If you are aiming to be a good data scientist, you should definitely understand the difference between underfitting and overfitting in practice, and how to find the best model fit in machine learning.
As you know, data preparation is the first precious step and requires a lot of work from a data scientist. Datasets are never clean, and even after pre-processing they may still contain noise related to the context of use: for example, some values that a data scientist considers correct could be treated as outliers by a domain expert. Obviously, we aim to build a model that fits the training data well enough to capture a representative model of the population, but does not fit the noise inherent in the available data.
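As a small illustration of this pre-processing step, here is a minimal sketch (using NumPy and pandas, on a toy feature I made up for the example) that flags candidate outliers with the common 1.5 × IQR rule. Whether a flagged value is really noise or a valid extreme case is exactly the kind of question a domain expert should weigh in on.

```python
# Minimal sketch: flagging potential outliers in a numeric feature with the
# 1.5 * IQR rule. The toy data and threshold are illustrative assumptions;
# deciding whether a flagged value is noise may require a domain expert.
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Toy feature: 200 "normal" values plus two extreme ones
feature = pd.Series(np.concatenate([rng.normal(50, 5, 200), [120, 130]]))

q1, q3 = feature.quantile([0.25, 0.75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

outlier_mask = (feature < lower) | (feature > upper)
print(f"Flagged {outlier_mask.sum()} candidate outliers for expert review")
```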
In the following figure, we notice that underfitting occurs when the model has high bias and low variance: models that are too simple are likely to underfit the training data and fail to capture the underlying regularities. Models with low bias and high variance, on the other hand, are likely to run into overfitting: complex models can fit the training data too well, sometimes including the outliers, and then lack generalisation, i.e. the ability to perform well on unseen data.
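To make the figure concrete, here is a minimal scikit-learn sketch: the same noisy toy data is fitted with polynomials of increasing degree, and comparing training and test errors exposes the two failure modes described above. The dataset, degrees and random seed are illustrative choices of mine, not part of the original figure.

```python
# Minimal sketch of the bias/variance trade-off: the same noisy data fitted
# with a too-simple model (degree 1, underfitting) and an overly complex one
# (degree 15, overfitting). Data, degrees and seeds are illustrative choices.
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(42)
X = rng.uniform(0, 1, 60).reshape(-1, 1)
y = np.sin(2 * np.pi * X).ravel() + rng.normal(0, 0.2, 60)  # true signal + noise

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

for degree in (1, 4, 15):
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(X_train, y_train)
    train_err = mean_squared_error(y_train, model.predict(X_train))
    test_err = mean_squared_error(y_test, model.predict(X_test))
    # Degree 1 tends to give high train AND test error (underfitting);
    # degree 15 tends to give low train error but high test error (overfitting).
    print(f"degree={degree:2d}  train MSE={train_err:.3f}  test MSE={test_err:.3f}")
```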
The question everyone should ask is: how can I build robust ML models that reduce both underfitting and overfitting? In practice, several strategies can be tested, depending on the available data, to find the best fit.
On the one hand, underfitting can be avoided by removing noise from the data or improving the feature engineering step, either by using better feature extraction methods or, conversely, by reducing the number of features with a feature selection approach. It also helps to increase the model complexity and to choose the best parameters for the type of learning, as in the sketch below.
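As an illustration of two of these levers, the following sketch combines feature selection (SelectKBest) with a higher-capacity model (a deeper decision tree). The breast cancer dataset, the value of k and the depths are assumptions chosen just for the example; on your own data they would need tuning.

```python
# Minimal sketch of two levers against underfitting: feature selection
# (SelectKBest) and increasing model capacity (a deeper decision tree).
# Dataset and parameter values are illustrative assumptions.
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.tree import DecisionTreeClassifier
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)

for max_depth in (1, 5):  # depth 1 is likely too simple; depth 5 adds capacity
    pipeline = make_pipeline(
        SelectKBest(score_func=f_classif, k=10),  # keep the 10 most informative features
        DecisionTreeClassifier(max_depth=max_depth, random_state=0),
    )
    scores = cross_val_score(pipeline, X, y, cv=5)
    print(f"max_depth={max_depth}  mean CV accuracy={scores.mean():.3f}")
```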
On the other hand, avoiding overfitting requires serious consideration of resampling approaches with a robust validation phase, increasing the training data if possible, and using stratified classes during the learning phase. Sometimes, reducing the model complexity and analysing the learning curve over the learning process can also help. For example, non-parametric algorithms like decision trees need to be pruned to improve results [Bramer, M. (2007). Avoiding overfitting of decision trees. Principles of Data Mining, 119–134].
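To make this concrete, here is a minimal scikit-learn sketch that combines stratified cross-validation with a pruned decision tree (cost-complexity pruning via ccp_alpha). The dataset and the alpha values are illustrative assumptions, not the only way to prune a tree.

```python
# Minimal sketch of two overfitting remedies mentioned above: stratified
# resampling (StratifiedKFold) for validation, and pruning a decision tree
# via its cost-complexity parameter (ccp_alpha). Values are illustrative.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)  # keeps class proportions in each fold

for ccp_alpha in (0.0, 0.01):  # 0.0 = unpruned tree, 0.01 = moderately pruned
    tree = DecisionTreeClassifier(ccp_alpha=ccp_alpha, random_state=0)
    scores = cross_val_score(tree, X, y, cv=cv)
    print(f"ccp_alpha={ccp_alpha}  mean CV accuracy={scores.mean():.3f}")
```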
If you want to learn more about specific topics in data science, clap this post and send me your suggestions, I will be glad to share more and more :)
Author: Khalida DOUIBI, Data Scientist & PhD in Biomedical Informatics.