Suppose you have some data and you want to build a model to predict future observations. Data splitting means dividing your data into two, not necessarily equal, parts. One part is used to build the model and the other to evaluate it in some way. There are two related, but distinct, reasons why people do this.

It’s well known that if you use the same data both to build the model and to test its performance, you’ll be overoptimistic about how well your model will do in predicting new data. Analysts have various tricks to avoid this overconfidence, such as cross-validation, but these are not perfect. Furthermore, if you need to prove to someone else how well you’ve done, they will be sceptical of such tricks. The gold standard is to reserve part of your data as a test set and to build and fit your model on the remaining training set. You then use this model to predict the observations in the test set. Because the test set has been held back, it’s (almost) like having fresh data, so the performance on the test set is a realistic assessment of how the model will perform on future data. But you lose something in the splitting: the training data is smaller than the full data, so the model you select and fit won’t be as good. If you don’t need to prove how good your model is, this form of data splitting is a bad idea. It’s as if a customer orders a new car and asks that the seller drive it around for 10,000 miles to prove there’s nothing wrong with it. The customer will receive the assurance that the car is not a lemon, but it won’t be as good as getting a brand new car in perfect condition.
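To make the idea concrete, here is a minimal sketch in Python. The data, the straight-line model, and the function names are all invented for illustration: we simulate from a known model, hold back a test set, fit on the training set only, and compare the in-sample and held-out errors.

```python
import random

# Toy data from a hypothetical straight-line model: y = 2 + 3x + noise.
random.seed(1)
xs = [random.uniform(0, 10) for _ in range(100)]
data = [(x, 2 + 3 * x + random.gauss(0, 1)) for x in xs]

# Hold back 30 observations as a test set; build the model on the rest.
random.shuffle(data)
train, test = data[:70], data[70:]

def fit_line(pts):
    """Least-squares fit of y = a + b*x."""
    n = len(pts)
    mx = sum(x for x, _ in pts) / n
    my = sum(y for _, y in pts) / n
    b = (sum((x - mx) * (y - my) for x, y in pts)
         / sum((x - mx) ** 2 for x, _ in pts))
    return my - b * mx, b

def mse(pts, a, b):
    """Mean squared prediction error of y = a + b*x on pts."""
    return sum((y - (a + b * x)) ** 2 for x, y in pts) / len(pts)

a, b = fit_line(train)
train_err = mse(train, a, b)  # in-sample error: tends to be optimistic
test_err = mse(test, a, b)    # held-out error: realistic for fresh data
```

Because the test set played no part in the fit, `test_err` is the honest estimate of future performance; `train_err` will be optimistic on average.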

By the way, if you find your model doesn’t do as well as you’d hoped on the test set, you might be tempted to go back and change the model. But the performance of this new model cannot be judged cleanly with the test set, because you’ve borrowed some information from the test set to help build the model. You only get one shot at using the test set.

There’s a second reason why you might use data splitting. The typical model building process involves choosing from a wide range of potential models. Usually there is substantial *model uncertainty* about the choice, but you take your best pick. You then use the data to estimate the parameters of your chosen model. Statistical methods are good at assessing the *parametric uncertainty* in your estimates, but don’t reflect the model uncertainty at all. This is a big reason why statistical models tend to be overconfident about future predictions. That’s where data splitting can help. You use the first part of your data to build the model and the second part to estimate the parameters of your chosen model. This results in more realistic estimates of the uncertainty in predictions. Again, you pay a price for the data splitting: you use less data to choose the model and less data to fit it, so your point predictions will not be as good as if you used the full data for everything. But your statements of the uncertainty in your predictions will be better (and probably wider).
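A minimal sketch of this second use, again with invented toy data: Z_1 is used only to choose between two candidate models (intercept-only versus a straight line, compared by Gaussian AIC), and Z_2 is then used only to estimate the chosen model’s parameters.

```python
import math
import random

# Toy data: a hypothetical straight-line truth, y = 1 + 0.5x + noise.
random.seed(2)
xs = [random.uniform(0, 10) for _ in range(100)]
data = [(x, 1 + 0.5 * x + random.gauss(0, 1)) for x in xs]
z1, z2 = data[:50], data[50:]

def fit_line(pts):
    """Least-squares fit of y = a + b*x."""
    n = len(pts)
    mx = sum(x for x, _ in pts) / n
    my = sum(y for _, y in pts) / n
    b = (sum((x - mx) * (y - my) for x, y in pts)
         / sum((x - mx) ** 2 for x, _ in pts))
    return my - b * mx, b

def gauss_aic(pts, pred, k):
    """AIC for a Gaussian model with k mean parameters: n*log(RSS/n) + 2k."""
    n = len(pts)
    rss = sum((y - pred(x)) ** 2 for x, y in pts)
    return n * math.log(rss / n) + 2 * k

# Model choice on Z_1 only.
mean1 = sum(y for _, y in z1) / len(z1)
a1, b1 = fit_line(z1)
aic_mean = gauss_aic(z1, lambda x: mean1, 1)
aic_line = gauss_aic(z1, lambda x: a1 + b1 * x, 2)

# Parameter estimation on Z_2 only, for whichever model Z_1 preferred.
if aic_line < aic_mean:
    a, b = fit_line(z2)
else:
    a, b = sum(y for _, y in z2) / len(z2), 0.0
```

Because the parameter estimates come from data that played no role in the model choice, the usual standard errors computed on Z_2 are not distorted by the selection step.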

So judging the value of data splitting in this context means we have to attach some value to the uncertainty assessment as well as to the point prediction. Scoring rules are a good way to do this. All this and more is discussed in my paper *Does data splitting improve prediction?* Although it depends on the circumstances, I show that this form of data splitting can improve prediction.
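For instance, the logarithmic score rewards a forecaster for honest uncertainty statements, not just accurate point predictions. A toy illustration, assuming Gaussian predictive distributions (this is not the paper’s own example):

```python
import math
import random

# Observations actually drawn from N(0, 1).
random.seed(3)
obs = [random.gauss(0, 1) for _ in range(1000)]

def neg_log_score(xs, mu, sigma):
    """Average negative log predictive density under N(mu, sigma^2).
    Lower is better; it penalises overconfident (too-narrow) predictions."""
    return sum(0.5 * math.log(2 * math.pi * sigma ** 2)
               + (x - mu) ** 2 / (2 * sigma ** 2) for x in xs) / len(xs)

honest = neg_log_score(obs, 0.0, 1.0)         # predictive sd matches reality
overconfident = neg_log_score(obs, 0.0, 0.5)  # intervals too narrow
```

Both forecasters make the same point prediction (zero), but the overconfident one scores worse, which is exactly the behaviour needed if we are to attach value to uncertainty assessment.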

Dear Professor Faraway,

Thank you for writing this useful article! I am curious whether you have also considered the following estimator:

Use Z_1 to build models, Z_2 to select one of them, and then reuse all of Z to (re)estimate the chosen model’s parameters.

(For instance, in your example 3.2, you might fit a full stepwise path using Z_1. Then choose a model from that path using AICs computed on Z_2. Finally refit the chosen model on all of Z.)
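A minimal sketch of this three-step recipe, with invented toy data standing in for the stepwise/AIC setup of example 3.2 (two candidate models replace the stepwise path): build the candidates on Z_1, select with AICs computed on Z_2, then refit the winner on all of Z.

```python
import math
import random

# Toy data: a hypothetical straight-line truth, y = 1 + 0.5x + noise.
random.seed(4)
xs = [random.uniform(0, 10) for _ in range(100)]
z = [(x, 1 + 0.5 * x + random.gauss(0, 1)) for x in xs]
z1, z2 = z[:50], z[50:]

def fit_line(pts):
    """Least-squares fit of y = a + b*x."""
    n = len(pts)
    mx = sum(x for x, _ in pts) / n
    my = sum(y for _, y in pts) / n
    b = (sum((x - mx) * (y - my) for x, y in pts)
         / sum((x - mx) ** 2 for x, _ in pts))
    return my - b * mx, b

def gauss_aic(pts, pred, k):
    """AIC for a Gaussian model with k mean parameters: n*log(RSS/n) + 2k."""
    n = len(pts)
    rss = sum((y - pred(x)) ** 2 for x, y in pts)
    return n * math.log(rss / n) + 2 * k

# Step 1: build the candidate models on Z_1 (a stand-in for the stepwise path).
mean1 = sum(y for _, y in z1) / len(z1)
a1, b1 = fit_line(z1)
candidates = [("intercept-only", lambda x: mean1, 1),
              ("line", lambda x: a1 + b1 * x, 2)]

# Step 2: select among the candidates using AICs computed on Z_2.
name, _, _ = min(candidates, key=lambda c: gauss_aic(z2, c[1], c[2]))

# Step 3: re-estimate the chosen model's parameters on all of Z.
if name == "line":
    a, b = fit_line(z)
else:
    a, b = sum(y for _, y in z) / len(z), 0.0
```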

I imagine this could do at least as well as SAFE.

It also seems commonly used in practice. For example, it’s analogous to the usual strategy of cross-validating to choose lambda for the lasso, then refitting on the full dataset at that lambda.

If you do not have time to try this yourself, would you be willing to share your code?

Best wishes,

Jerzy


You are breaking the model choice down into an exploratory phase, where a set of candidate models is generated, followed by a model selection phase. I agree that this corresponds to actual practice in some cases. Your suggestion seems reasonable, but it does have the disadvantage of using all the data in some part of the model choice. So maybe it would be an improvement over SAFE and maybe not – it would surely depend on the situation.

Would be worth checking out although I’d need to figure out my old code.
