Assessments of where to get more information drive a large amount of
regulatory activity. Many regulatory problems are, in essence, prediction
problems. Supervised machine learning solves prediction problems, and it does
so by trawling through all the data we have, by itself, looking for patterns.
What, then, is so different to econometrics?
For prediction problems, unlike most traditional economic problems, we are
normally interested only in maximising our predictive power. To prevent bad
events, such as product mis-selling, or to catch illicit behaviour, we want to
know where it is likely to occur.
There are many different algorithms – you may have heard of Random Forests
or Deep Learning – and they find patterns in different ways. But, for those with
a little interest in the statistics, at their core these techniques all benefit from
the same basic trick: splitting off a hold-out sample of data – the test set –
from the training sample. By doing this, we can exploit all kinds of funky and
crazy heuristics to spot patterns in the training sample, and then use the test
set to see how well the algorithm actually predicts.
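For those who want to see the trick concretely, here is a minimal sketch in
Python using scikit-learn. The synthetic dataset, the Random Forest choice and
every parameter below are illustrative assumptions, not a recommended setup.

```python
# A minimal sketch of the train/test split on synthetic data
# (dataset and model choice are illustrative assumptions).
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

# Hold out a test set; the algorithm never sees it during training.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

model = RandomForestClassifier(random_state=0).fit(X_train, y_train)

# Accuracy on the held-out test set shows how well the patterns found
# in the training sample actually generalise.
print("test accuracy:", model.score(X_test, y_test))
```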
A second statistical trick is crucial to achieving great prediction. All these
algorithms run the risk of overfitting the training data: the model can get overly
complicated and predict brilliantly well in the training sample.[26] But it predicts
terribly out of sample, because it does not represent the true structure and
relationships that exist in the world. To avoid this, we need to ‘regularise’
complexity. An important way of doing this is to create extra test sets within the
training sample, allowing us to choose the right level of complexity before
moving to the test set.[27]
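As an illustration of choosing complexity inside the training sample, here is
one possible sketch, reusing the X_train/X_test split from the previous
example; the logistic regression model and the grid of regularisation
strengths are assumptions for illustration only.

```python
# Internal validation folds are carved out of the training sample to
# pick the regularisation strength C; only then do we touch the test set.
# (Model and parameter grid are illustrative assumptions.)
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

search = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid={"C": [0.001, 0.01, 0.1, 1, 10]},  # smaller C = more regularisation
    cv=5)                                          # 5 folds inside the training sample
search.fit(X_train, y_train)

print("chosen C:", search.best_params_["C"])
print("test accuracy:", search.score(X_test, y_test))
```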
These two tricks – the test set and regularisation – allow us to find true and
meaningful, rather than spurious, correlations between variables. Armed with
these tricks, computer scientists invented a myriad of pattern-hunting heuristics.
[26] Overfitting is the problem that, as you put more and more parameters into your model
when working with the training data, you are able to explain more and more of your data.
But, after a certain point, this fit is completely spurious. Take an example: suppose we
want to predict which horses will be fastest at the racetrack. Wouldn’t that be nice? Say
we first consider whether the horse likes the turf soft or firm, the age of the horse, or the
horse’s previous achievements. These can all be useful predictors. But imagine we throw in
some other variables – whether the horse seems to like Wednesdays, whether its name
begins with the letter A, and so on. The mechanics of prediction with a set amount of
data mean that more variables can only ever improve the in-sample fit. But these
variables are just noise: any fit we get from them is entirely spurious. We can’t really
predict which horses are going to win their races, unfortunately. In fact, it turns out that
we get the same problem even if we add lots of variables that seem reasonable – that are
not clearly garbage. Too many variables – too much complication in our model – creates
spurious correlation.
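To make the footnote’s point concrete, here is a toy sketch on purely synthetic
‘horse’ data, with all variables and numbers invented for illustration: adding
pure-noise columns can only raise the in-sample fit, while the out-of-sample fit
deteriorates.

```python
# Toy illustration of overfitting: noise "predictors" (likes Wednesdays,
# name begins with A, ...) raise in-sample fit but hurt out-of-sample fit.
# All data here are synthetic and illustrative.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
n = 100
useful = rng.normal(size=(n, 3))   # genuine predictors (going, age, form)
speed = useful @ np.array([1.0, -0.5, 0.8]) + rng.normal(size=n)

train, test = slice(0, 60), slice(60, None)
for n_noise in [0, 20, 40]:
    X = np.hstack([useful, rng.normal(size=(n, n_noise))])
    fit = LinearRegression().fit(X[train], speed[train])
    print(n_noise, "noise vars:",
          "train R^2 =", round(fit.score(X[train], speed[train]), 2),
          "test R^2 =", round(fit.score(X[test], speed[test]), 2))
```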
[27] The technique of creating extra test sets within the training sample is called
cross-validation; it allows us to understand the properties of our model without using
the hold-out sample.
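As a sketch, five-fold cross-validation might look as follows in scikit-learn,
reusing the training sample from the first example; the model and fold count
are illustrative assumptions.

```python
# Each of five folds of the training sample serves in turn as a temporary
# test set, so the true hold-out sample stays untouched. (Illustrative only.)
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

scores = cross_val_score(
    RandomForestClassifier(random_state=0), X_train, y_train, cv=5)
print("fold accuracies:", scores.round(2))
```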