Modeling 101

“Essentially, all models are wrong, but some are useful.” – George E.P. Box

from Empirical Model-Building and Response Surfaces (1987) co-authored with Norman R. Draper, p. 424.

Mr. Box’s quote is quite popular and used a lot these days. However, I am not sure if all those who quote it fully understand it. And that includes me.

What I took away from the quote.

A model is an approximation of a process that we don’t fully understand. We model processes to get a deeper understanding of the process itself. To build a model we use what we understand about a process and try to approximate (or sometimes ignore) what we don’t know.

All models are wrong because of the inherent approximation. But some can be useful because what we have “approximated (or ignored)” is not important for understanding or replicating the process.

“Relative simplicity” is an important virtue of a good model. By “relative simplicity” I mean the simplicity of a model compared to the actual process. For example, a “simple model” of Relativity theory for a physicist could still take me two lifetimes to understand.

Math lends itself very well to representing these models. We test the usefulness of a model against the real process by giving both identical inputs and comparing the outputs. This is easy for some models and not so easy for others (like verifying the existence of the Higgs boson using the Large Hadron Collider, which cost an estimated $9b to build).
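To make the "identical inputs, compare the outputs" idea concrete, here is a minimal sketch in Python. Everything in it is made up for illustration: `real_process` stands in for a process we can observe, `simple_model` is an approximation that ignores a small term, and `tolerance` is whatever error the user of the model is willing to live with.

```python
def is_useful(model, process, inputs, tolerance):
    """Return True if the model stays within `tolerance` of the process for every input."""
    return all(abs(model(x) - process(x)) <= tolerance for x in inputs)


# Hypothetical stand-in for the real process (an assumption, not real data).
def real_process(x):
    return x ** 2 + 0.1 * x


# Hypothetical model that ignores the small linear term.
def simple_model(x):
    return x ** 2


inputs = [0, 1, 2, 3, 4, 5]

# The same model can pass or fail depending on how much error we tolerate.
print(is_useful(simple_model, real_process, inputs, tolerance=1.0))
print(is_useful(simple_model, real_process, inputs, tolerance=0.2))
```

The model is "wrong" at every input except 0, yet it passes with the looser tolerance and fails with the tighter one. Whether it is useful depends on the range of inputs and on the error the user accepts, not on the model being exact.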

A model is useful if, for a range of inputs, the output of the model closely matches the real output of the process. (How close? It is a matter of threshold and the +/- error range the user of the model is comfortable with.) To summarize, a useful model has some predictive power with

The game of football is a process. Over the years a lot of people have tried to improve our understanding of the game by building models based on what they understood from the data available to them. We have been collecting more and more information as we have progressed. More information has led to a better understanding of the game and has dispelled some of the myths. But it has also helped create “new myths” (or truths, until they get proven wrong in due course).

There are a lot of models out there for football. Some are based on past results, some on what happens in a game (events like shots, goals, final-third passes, etc.) and so forth. I don’t quite agree with all of them. Some I agree with more than others.

What do I look for in a model?

A model should show an understanding of the process: an awareness of what the process does and how it works. It doesn’t have to be comprehensive (because it is a model after all) but it should capture the essence of the process. Example: If I want to build a model to predict the winner of a football game, I want my model to take into consideration how a game of football is won. By scoring more goals than the opponent. How do I score more goals? By taking a lot of (hopefully good) shots and not letting the opponent take good shots. How have the two teams been doing in the run-up to the game? And so forth. I can list 20-30 items and I am sure I would still have missed many things. The goal is not to account for every factor but to identify the handful of factors that capture the essence, or “signal” as Nate Silver calls it in his popular new book.
I am skeptical of any model that does not understand the process that it is trying to be a model of.
A model should factor in the nature of the underlying data from a process. For example: Based on Chris Anderson’s analysis, the number of goals scored in a game is not normally distributed. Something like that needs to be factored in if the model uses goals scored as an input. This is probably more important for models making long-term predictions, because the inherent characteristics of data tend to manifest over a longer period of time rather than a shorter one. Example: The probability of getting 4 heads if you toss a fair coin 4 times is 6.25%. But if you repeat the experiment 5 times, you might get 4 heads once, twice, thrice, 4 times, 5 times or never. You might have to repeat the experiment thousands of times to see the observed frequency of 4 heads converge to 6.25%.
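The coin example is easy to try out. Below is a small simulation sketch in Python; the function name `four_heads_rate`, the seed and the repetition counts are all arbitrary choices of mine. The exact value comes from 0.5 × 0.5 × 0.5 × 0.5 = 0.0625, i.e. 6.25%.

```python
import random


def four_heads_rate(repetitions, rng):
    """Fraction of repetitions in which 4 fair coin tosses all come up heads."""
    hits = 0
    for _ in range(repetitions):
        # One experiment: toss a fair coin 4 times and check for 4 heads.
        if all(rng.random() < 0.5 for _ in range(4)):
            hits += 1
    return hits / repetitions


exact = 0.5 ** 4  # 0.0625, i.e. 6.25%

rng = random.Random(42)  # fixed seed (arbitrary) so the run is repeatable
for n in (5, 100, 10_000, 200_000):
    rate = four_heads_rate(n, rng)
    print(f"{n:>7} repetitions: observed {rate:.4f}, exact {exact:.4f}")
```

With only 5 repetitions the observed rate can only be 0, 0.2, 0.4, 0.6, 0.8 or 1, so it cannot even equal 6.25%. With many more repetitions it should settle close to the exact value, which is why the nature of the data matters more for long-term predictions.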

Building a good model is not trivial and is an iterative process. But if the first version of a model doesn’t address the above, it might be time for a rethink.