Tom’s Ten Data Tips – November 2010
How to build predictive models
Building predictive models is the “bread and butter” of applied data mining. Because you can compare using a model with your current “best practice”, it is relatively straightforward to monetize the added value of data mining. This calculation should be a default part of your process, not just to support business decision making, but also to foster buy-in for your work.
To excel as a data miner, you need programming skills, solid mastery of methodology and statistics, and last but not least a curious, investigative mind. People who take an interest in this field can look forward to a rewarding career with endless opportunities. Data are all around us, and this “flood” is not about to stop anytime soon!
1. Make Sure You Understand The Modeling Objective
“We need a predictive model!” all too often marks the beginning of modeling projects. Before you begin the actual modeling work, make sure you truly understand the business process(es) in place that you are supposed to support. How will the model be deployed? How will the business then be better off using the model? For every improvement, something has to give. How will the business be worse off by employing the model? Is there any way to mitigate these drawbacks? How will the change be managed? How will the effectiveness of the model be monitored?
All these questions need to be considered at the outset of the project. For the sake of clarity: any other time would be too late. Think of model building as working backwards: start with the desired outcomes, and then work your way back to what needs to happen in order to achieve those outcomes. There is no use in building a great model for the wrong problem. Getting the modeling objective wrong is the most frequently occurring regret in post-mortems. Don’t join the line; get a head start with a good briefing. Unless you are feeling particularly lucky… (Dirty Harry)
2. Get A Feel For Your Data
After you’ve achieved clarity on the modeling objective, hold your horses for another moment. Before you dive in with both feet, it is often a good idea to first play around with your data set. This holds in particular for the relation between the predictor variables and the target. When you stumble on “intra data set” patterns, you’ll know which variables to expect in your model. More importantly, you might also get some inspiration for new ones to derive.
You need to know which variables to expect in your model, because if some of them are conspicuously absent, this might spell trouble. Investigating new and/or derived variables is the most powerful way to look for (big) improvements in prediction accuracy.
3. Habitually Compare A
The algorithm people are most familiar with almost always gives the best results. This, of course, has less to do with the suitability of the algorithm than with the skill level of the analyst. Don’t become a “one trick pony.” As Jerry Weinberg wrote in The Secrets of Consulting: “The child who receives a hammer for Christmas will discover that everything needs pounding.”
When you develop the habit of comparing a range of algorithms (even if you have your “favorite”), you keep an eye out for surprises and maintain at least a working knowledge of multiple approaches. A data mining suite that facilitates such comparisons is a great asset. Decision trees are fast and great for their transparency. Neural networks usually have superior lift (at least after some tuning), in particular when interaction plays a significant role. Regression tends to come out strong when the analyst knows how to represent the variables properly. An in-depth treatment of algorithm comparisons can be found in van der Putten’s excellent PhD thesis On Data Mining in Context.
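One way to make such comparisons routine is to score every candidate model on the same holdout set and rank them by lift. The following is only a minimal sketch: the function names, the model names and the toy scores are made up for illustration, and in practice the scores would come from whatever tool you use to fit the models.

```python
def lift_at(scores, labels, fraction=0.1):
    """Lift in the top `fraction` of cases, ranked by descending score.

    labels are 1 (responded) or 0 (did not respond).
    """
    if not scores or len(scores) != len(labels):
        raise ValueError("scores and labels must be non-empty and equally long")
    base_rate = sum(labels) / len(labels)
    if base_rate == 0:
        raise ValueError("lift is undefined when there are no responders")
    n_top = max(1, round(len(scores) * fraction))
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    top_rate = sum(label for _, label in ranked[:n_top]) / n_top
    return top_rate / base_rate


def compare_models(model_scores, labels, fraction=0.1):
    """Return (model name, lift) pairs, best first, all on the same holdout set."""
    results = [(name, lift_at(scores, labels, fraction))
               for name, scores in model_scores.items()]
    return sorted(results, key=lambda item: item[1], reverse=True)


# Hypothetical holdout set of ten cases (illustrative numbers only).
labels = [1, 0, 0, 1, 0, 0, 0, 0, 1, 0]
model_scores = {
    "decision_tree": [0.9, 0.2, 0.3, 0.4, 0.1, 0.5, 0.2, 0.3, 0.8, 0.1],
    "regression":    [0.3, 0.6, 0.2, 0.9, 0.1, 0.4, 0.2, 0.3, 0.5, 0.1],
}
ranking = compare_models(model_scores, labels, fraction=0.2)
for name, lift in ranking:
    print(f"{name}: lift {lift:.2f}")
```

Because every model is judged on identical cases with an identical measure, a surprise winner stands out immediately, which is the whole point of the habit.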
4. Split Your Mining Set
Sometimes you “merely” want the best possible model, but sometimes getting an estimate of how well the model will perform is important, too. The estimate (“accuracy prediction”) comes at a cost, however. When you split your mining set into only two sections, the estimate of predictive accuracy is (slightly) inflated, in particular if you alternated frequently between the training and test sets during model development. When you need an unbiased estimate of model accuracy, you must exclude a third part of the mining set (the evaluation set) from the model building process until you have settled on your final model.
Why would you need such an unbiased estimate? For most model applications, you want as high a lift as possible. But sometimes, you want neither more nor less than the estimate. This may occur when you need to plan your resources to handle response, and it is important to get neither too many nor too few responses. Like Goldilocks, you want your porridge not too warm, not too cold, but “just right.” We faced this when planning mortgage leads: application processing staff had to be hired (so came at additional cost), but if you get too many responses a queue forms, and warm leads may seek a quote from the competition if you make them wait too long.
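A three-way split can be sketched in a few lines. The 60/20/20 proportions, the fixed seed and the function name below are assumptions chosen for illustration, not recommendations from the text; the only point is that the evaluation set is carved out once and left alone until the final model is chosen.

```python
import random


def three_way_split(records, train=0.6, test=0.2, seed=42):
    """Shuffle once, then cut into training, test and evaluation sets.

    Whatever is left after the training and test shares becomes the
    evaluation set, which stays untouched during model development.
    """
    if train <= 0 or test <= 0 or train + test >= 1:
        raise ValueError("train and test must be positive and sum to less than 1")
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)  # seeded for a repeatable split
    n_train = round(len(shuffled) * train)
    n_test = round(len(shuffled) * test)
    training_set = shuffled[:n_train]
    test_set = shuffled[n_train:n_train + n_test]
    evaluation_set = shuffled[n_train + n_test:]
    return training_set, test_set, evaluation_set


# Hypothetical mining set of 100 record ids.
training_set, test_set, evaluation_set = three_way_split(range(100))
print(len(training_set), len(test_set), len(evaluation_set))
```

You may go back and forth between the training and test sets as often as you like; the accuracy you report should come from a single pass over the evaluation set at the very end.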
5. When A Model Looks Too Good To Be True, It Usually Is… :-(
When a predictive model returns an accuracy of prediction that goes (way) beyond the imaginable, be wary. Very wary. The most likely explanation is that something stinks. The variables used to make the prediction should not be a consequence of the response, and should have been captured at a point in time preceding the target variable you are trying to predict. When a model does “surprisingly” well, chances are that one of these rules has been violated. Variables that post-date response are called “leakers” (Berry & Linoff, 2000) or “anachronistic variables” (Pyle, 1999).
Unfortunately, there is no absolutely certain way to know which variables are “wrong” (except, maybe, for a 100% correlation with the target). This will always require in-depth investigation. A testing method we have found to be very effective is to plot all predictor variables (not just the ones that made it into the model) against the target variable, using some univariate measure of association. Then look for remarkable “jumps” in association. This approach is simple and has proven very effective, so make it a habit (see also tip #6).
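When your data dictionary happens to record when each variable was captured, the timing rule can be checked mechanically. This is a sketch under that assumption; the variable names and dates are invented for illustration, and many data sets carry no such metadata at all, in which case only investigation will tell.

```python
from datetime import date


def flag_anachronistic(capture_dates, target_date):
    """Return the names of variables captured on or after the target date.

    capture_dates maps a variable name to the date its value was recorded.
    """
    return sorted(name for name, captured in capture_dates.items()
                  if captured >= target_date)


# Hypothetical capture dates for three predictor variables.
capture_dates = {
    "age_at_application": date(2010, 1, 15),
    "credit_bureau_score": date(2010, 2, 1),
    "retention_call_flag": date(2010, 6, 20),  # recorded after the response
}
suspects = flag_anachronistic(capture_dates, target_date=date(2010, 3, 1))
print(suspects)
```

A variable that passes this check can still leak information, so treat it as a first filter rather than a clean bill of health.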
6. Customarily Plot Predictor Variables (Univariate) Against Target Variable
As a safety measure against “leakers” or “anachronistic variables” it is good practice to habitually plot predictor variables against the target. When building predictive models becomes an ongoing concern, producing this plot should probably be automated. Although the resulting graph tends to look roughly the same whichever measure you use, preferably use one that isn’t restricted to linear correlation, for instance Cramér’s V or entropy. When you sort all input variables by association (highest first), and find some “knee” (a sudden jump in values) in this scree plot, be suspicious of all variables left of this bend. See, for example, this plot.
Carefully inspect the variables left of the “knee” to determine if anything might be wrong with your data. Variables that have high (univariate) association with the target but nevertheless did not make it into the final model are of interest, too. They likely have significant multicollinearity with other variables that were included. In such cases, you might want to reconsider which variables you’d prefer to include in the final model. Obviously, considerations other than purely statistical ones play a role then. Entering business domain knowledge into the model building process in such a way is the hallmark of advanced data analytics (see also tip #10).
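Automating this check does not take much. The sketch below computes the plain (not bias-corrected) Cramér’s V for categorical predictors, sorts the variables by association and reports the largest drop as a crude stand-in for the “knee”. The variable names and toy values are hypothetical, continuous variables would need to be binned first, and a single largest drop is a simplification of what your eye does on the plot.

```python
import math
from collections import Counter


def cramers_v(xs, ys):
    """Uncorrected Cramér's V between two categorical sequences."""
    n = len(xs)
    if n == 0 or n != len(ys):
        raise ValueError("xs and ys must be non-empty and equally long")
    x_counts, y_counts = Counter(xs), Counter(ys)
    joint = Counter(zip(xs, ys))
    k = min(len(x_counts), len(y_counts)) - 1
    if k == 0:  # a constant variable carries no association
        return 0.0
    chi2 = 0.0
    for x, nx in x_counts.items():
        for y, ny in y_counts.items():
            expected = nx * ny / n
            chi2 += (joint.get((x, y), 0) - expected) ** 2 / expected
    return math.sqrt(chi2 / (n * k))


def rank_by_association(predictors, target):
    """Return (name, V) pairs sorted from strongest to weakest association."""
    scores = [(name, cramers_v(values, target))
              for name, values in predictors.items()]
    return sorted(scores, key=lambda item: item[1], reverse=True)


def knee_index(ranked):
    """Index i of the largest drop; variables 0..i lie left of the knee."""
    if len(ranked) < 2:
        raise ValueError("need at least two variables to find a knee")
    drops = [ranked[i][1] - ranked[i + 1][1] for i in range(len(ranked) - 1)]
    return drops.index(max(drops))


# Hypothetical toy data: eight customers, binary target.
target = [1, 1, 1, 1, 0, 0, 0, 0]
predictors = {
    "segment":       ["a", "a", "a", "b", "a", "a", "b", "b"],
    "region":        ["n", "s", "n", "s", "n", "s", "n", "s"],
    "refund_issued": ["y", "y", "y", "y", "n", "n", "n", "n"],
}
ranked = rank_by_association(predictors, target)
knee = knee_index(ranked)
for name, v in ranked:
    print(f"{name}: V = {v:.3f}")
print("inspect first:", [name for name, _ in ranked[:knee + 1]])
```

In this toy set the perfectly associated variable ends up alone left of the knee, which is exactly the kind of variable that deserves a hard look before you trust the model.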
7. PIE Is An Elusive Concept
In his classic work, Data Preparation for Data Mining (1999), Dorian Pyle described an encompassing concept that all (advanced) data miners should familiarize themselves with: the Prepared Information Environment (PIE). Some variable representations are easier to “read”, more amenable to information transfer, than others. The PIE “surrounds” the mathematical transformations that describe the relation between input and output variables. The PIE “connects” the equation through transformations on input variables in order to optimize the amount of information being passed on to the algorithm.
Dedicated data miners are encouraged to thoroughly study this material. Your author has read this particular book six times from cover to cover (some sections even more often), but then again, I am pretty dense :-).
8. Two-Stage Models Give Higher Yield, Albeit Lower Response
When you are predicting an amount that customers should deposit, the lifetime value of customers you are trying to acquire, or how much people will donate to charity, etc., you have a two-stage process. First you need to predict who will respond, and then you estimate how much.
Although there are plenty of algorithms that can deal with this problem directly, experience has shown that predictive accuracy tends to be higher when you build two separate models: one for predicting response, and one for predicting the amount. Combining these two models tends to work best. “Best” here means accruing the largest possible pool of funds. Note that optimizing for total funds leads to lower response percentages, as you are purposely going after the “big fish.”
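One common way to combine the two models is to multiply the predicted probability of response by the predicted amount, and to select on that expected value. The sketch below assumes this multiplicative combination; the customer records and numbers are invented purely to show the effect, and in practice both predictions would come from your two fitted models.

```python
def expected_funds(selection):
    """Sum of P(response) * predicted amount over the selected customers."""
    return sum(p * amount for _, p, amount in selection)


def expected_responses(selection):
    """Expected number of responders among the selected customers."""
    return sum(p for _, p, _ in selection)


def select_top(customers, key, n):
    """Pick the n customers that rank highest on the given key."""
    return sorted(customers, key=key, reverse=True)[:n]


# Hypothetical output of the two models:
# (customer id, predicted response probability, predicted amount if responding)
customers = [
    ("A", 0.30, 20.0),
    ("B", 0.25, 30.0),
    ("C", 0.05, 500.0),
    ("D", 0.04, 400.0),
    ("E", 0.20, 10.0),
]

by_response = select_top(customers, key=lambda c: c[1], n=2)
by_value = select_top(customers, key=lambda c: c[1] * c[2], n=2)

for label, chosen in (("by response", by_response), ("by value", by_value)):
    ids = [cid for cid, _, _ in chosen]
    print(f"{label}: {ids}, expected responses {expected_responses(chosen):.2f}, "
          f"expected funds {expected_funds(chosen):.2f}")
```

Selecting on expected value picks the two “big fish”: far fewer expected responses, but roughly three times the expected funds in this toy example.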
9. Always, Always Include An Explanation With Your Prediction
Even when “only” a prediction is required, you should still accompany each model you deliver with an explanation of its workings. There are three reasons for this:
1 – Sanity check
Sometimes, something goes awfully wrong with your dataset. Although this might seem “overkill”, in particular after following tip #2, you do not want to find yourself in the same embarrassing places your author wound up in after omitting this check :-( (e.g.: an EBCDIC-ASCII conversion had dropped all minus signs, which I had failed to notice…). Experienced data miners have the bruises and scars to show why they no longer consider sanity checks a “luxury”…
2 – Foster buy-in
When you supplement your prediction with an explanation of its dynamics, it becomes more transparent, less “mystical”, and therefore more acceptable to the business. For greater and wider application and adoption, this is often a good idea.
3 – Transform the business
Every once in a blue moon, the explanation you provide with your model may spark inspiration to approach the existing market in new ways, or find new markets for an existing proposition. Although this is a low chance occurrence, it potentially has a very high pay-off.
10. Model Engineering Is Fine Art
In some settings it makes perfect business sense to settle for a model that doesn’t quite have the highest predictive accuracy possible. This can be worthwhile because the “engineered” model provides significant side benefits.
We’ve seen examples in credit scoring, for instance, where the choice of model has an impact on which reason comes out for rejecting a credit application (providing this reason –sometimes upon request– is legally required in many markets). Some explanations are more “palatable” to consumers than others. So if you can engineer, for instance, that the most frequently occurring rejection reasons shift from “occupation” to “credit bureau score” (the latter being more “palatable”), that outcome might be quite valuable to the business. And the business can decide how much predictive accuracy they’d be willing to sacrifice in order to achieve that result (you’ve monetized alternative options, of course).
Further reading
Some excellent books on how to build predictive models:
Mastering Data Mining
Michael Berry & Gordon Linoff (2000)
ISBN# 0471331236
Business Modeling and Data Mining
Dorian Pyle (2003)
ISBN# 155860653-X
Data Preparation for Data Mining
Dorian Pyle (1999)
ISBN# 1558605290
Data Mining: Practical Machine Learning Tools and Techniques
Ian Witten & Eibe Frank (2005)
ISBN# 0120884070
On Data Mining in Context. PhD Thesis
Peter van der Putten (2010)







