I’ve done a couple of blog posts in the past on statistical learning, see here if you haven’t read them yet. In this blog post I’ll explain the most popular way to compare models and decide which one is best. It’s known as the test-train split. This is really only useful for supervised problems.
The test set approach
The test-set approach is quite intuitive once you hear about it. You have \(n\) observed data points, each with explanatory variables \(x_i\) and a response \(y_i\), and the end goal is to predict the \(y_i\).
Generally, we don’t really care how our model performs on the data we already have; we want it to be able to predict future data, where we can get the explanatory variables but cannot obtain the response. But in previous blog posts I only mentioned training error, which is how the model performs on the data we observed and fit the model on.
In order to mimic the “future data” scenario we use a method commonly referred to as a test set or hold-out set. It’s simple: you split your dataset in two, one part called the training set with \(n_{train}\) observations and one called the test set with \(n_{test}\) observations. Generally \(n_{test}\) is chosen to be about \(20\%\) of your data, but this choice is fairly arbitrary.
From here you fit your model on the training set, then predict the response for each observation in the test set. We then compare these predictions to the actual responses.
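Here’s a minimal sketch of that procedure in Python, assuming scikit-learn is available; the simulated data, linear model, and mean-squared-error metric are just illustrative stand-ins for whatever you’re actually working with.

```python
# A minimal sketch of the test-train split using scikit-learn.
# The data here is simulated; in practice x and y are your observed dataset.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=(200, 1))            # explanatory variables x_i
y = 2.0 * x[:, 0] + rng.normal(0, 1, size=200)   # responses y_i

# Hold out about 20% of the observations as the test set.
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.2, random_state=1
)

# Fit on the training set only.
model = LinearRegression().fit(x_train, y_train)

# Predict the test responses and compare them to the actual responses.
test_mse = mean_squared_error(y_test, model.predict(x_test))
print(f"Test MSE: {test_mse:.3f}")
```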
Why do we do this?
Models can be overfit to data, that is, they focus too much on explaining the dataset they were trained on, which causes them to perform extremely poorly on new data.
Almost any model will perform much better on the training dataset than the test dataset, simply because it has already seen the training dataset. Making the model more complex generally makes it perform better on the training set, but this, more often than not, actually decreases its performance on the test dataset. Since the test dataset is closer to the real-life application of our model, it’s a better way to measure how the model will perform on unseen data.
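One way to see this is to fit models of increasing flexibility and compare their training and test errors. Below is a rough sketch using polynomial regression on simulated data; the noise level and the specific degrees tried are arbitrary illustrative choices, but the typical pattern is that training error keeps falling as the degree grows while test error eventually gets worse.

```python
# Sketch: training error vs. test error as model complexity grows.
# Simulated data and polynomial degrees are arbitrary illustrative choices.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
x = rng.uniform(-3, 3, size=(150, 1))
y = np.sin(x[:, 0]) + rng.normal(0, 0.3, size=150)

x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.2, random_state=1
)

for degree in (1, 3, 10, 20):
    fit = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    fit.fit(x_train, y_train)
    train_mse = mean_squared_error(y_train, fit.predict(x_train))
    test_mse = mean_squared_error(y_test, fit.predict(x_test))
    print(f"degree {degree:2d}: train MSE {train_mse:.3f}, test MSE {test_mse:.3f}")
```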
Uses
The test-train split can be used to compare the predictive accuracy of any two models. In particular, it can be used for:
- Selecting parameters in models, e.g. which variables to include in your linear regression or how deep to let your classification tree grow.
- Picking which model to use. It can compare different types of models like LDA, K-nn and classification trees: just run the test-train procedure on each model and see which one performs best (see the sketch after this list).
- Obtaining an estimate for how well your model will do on future data.
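As a sketch of that second use, here’s a rough comparison of a K-nn classifier and a classification tree on the same split. The built-in iris data, the choice of k, and the tree depth are all just placeholders for your own dataset and candidate models.

```python
# Sketch: comparing two model types on the same held-out test set.
# The iris data and the hyperparameters below are purely illustrative.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=1
)

models = {
    "K-nn (k=5)": KNeighborsClassifier(n_neighbors=5),
    "Classification tree": DecisionTreeClassifier(max_depth=3, random_state=1),
}

# Fit each candidate on the training set and score it on the test set;
# whichever does better on the test set is the one we'd prefer.
for name, model in models.items():
    model.fit(X_train, y_train)
    acc = accuracy_score(y_test, model.predict(X_test))
    print(f"{name}: test accuracy {acc:.3f}")
```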
Other uses include prediction competitions; for example, Kaggle runs competitions with cash prizes. They offer hundreds of live competitions where they supply a training set that you fit a model to; you then run the model on the test set and submit your predictions to Kaggle. They measure how well you did and place you on a leaderboard.
Conclusion
This is just a quick summary of how the test-train split works. If you’re interested in using this type of approach it’s quite easy to code up. An immediate extension would be cross-validation, a little twist on the test-train split that gives each part of the dataset a chance to be in the test set. See Chapter 7 of The Elements of Statistical Learning.
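If you want a quick taste of cross-validation, here is a rough sketch using scikit-learn’s K-fold cross-validation; the model and the choice of 5 folds are arbitrary assumptions for illustration.

```python
# Sketch: 5-fold cross-validation, where each fold takes a turn as the test set.
# The K-nn model and the choice of 5 folds are arbitrary illustrative choices.
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
scores = cross_val_score(KNeighborsClassifier(n_neighbors=5), X, y, cv=5)
print("Per-fold accuracy:", scores)
print("Average accuracy:", scores.mean())
```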