

Tom's Ten Data Tips - July 2011

Hands On Data Mining

Data mining is part science and part art. Tools play an important role, but not nearly as important as the ingenuity and versatility of the modeler. Data mining is a craft that you acquire by building a lot of models, using statistical foresight and knowledge of the data you are working with, and by reviewing all the products you have delivered and how they are being used. Or by finding out why your models aren’t being used anymore…

1. What Is Data Mining, Really?

There is a lot of confusion in the industry about what data mining really is. Although we cannot claim or aspire to resolve this ambiguity, let’s point to a couple of different uses of the term. Oftentimes people with advanced SQL skills who can answer the most challenging information requests from a complex data schema are called “data miners.” And sometimes people (like actuaries) who calculate life expectancies and required insurance premiums are called “data miners.” Sometimes marketers who sift through big, multi-dimensional OLAP cubes in search of patterns in sales across regions, product lines, seasons, sales channels or premium types are called “data miners.”

All of these professionals are doing complex data investigations, and all are using lots of data. There’s nothing inherently “wrong” with calling all of these activities “data mining”, but it does become confusing, because the activities differ and so do their requirements with regard to data and tooling. We’d encourage a more focused use of the term “data mining” to encompass finding meaningful patterns in (very) large volumes of data, by means of automated procedures. Navigating an OLAP cube requires human brainpower, so that doesn’t qualify as an automated procedure. Composing complex queries doesn’t use any algorithm either, so we’d exclude that, too. A narrow definition of data mining keeps activities focused and clear.

2. Data Mining Is Not Data “Dredging”

In some circles, data mining has acquired a pejorative connotation. People who do not understand what data mining is about have a prejudice that data miners keep digging (“dredging”) in a data set until they find “some” correlations. Because the searches are exhaustive, people fear these correlations may be based on spurious patterns that arose out of pure coincidence. For reasons unbeknown to your author, econometrics professionals in particular seem to be hanging on to this misconception that “if you torture the data long enough, they will confess to anything.”

Data mining has proven its value in the commercial space, repeatedly showing significant impact on bottom-line results (your ultimate litmus test). But apart from that, the idea that data mining might cause a pursuit of trivial correlations shows a lack of appreciation for the fundamental empirical nature of data mining. By splitting your mining set into two or three parts (for purposes of cross-validation), you manage the risk of finding a “coincidental” correlation (see also tip# 6).
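
To make the risk concrete, here is a minimal sketch in Python (standard library only). It “dredges” pure random noise for the best-looking predictor on one half of the rows, then re-checks that same predictor on the other half. Everything in it is an illustrative assumption: the function names, the row and feature counts, and the use of a plain Pearson correlation. It is not taken from any particular tool.

```python
import random

def pearson(xs, ys):
    """Plain Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

def dredge(n_rows=200, n_features=300, seed=1):
    """Search pure noise for the 'best' predictor, then re-check it on held-out rows."""
    rng = random.Random(seed)
    # Target and features are independent noise: there is no true effect to find.
    target = [rng.gauss(0, 1) for _ in range(n_rows)]
    features = [[rng.gauss(0, 1) for _ in range(n_rows)] for _ in range(n_features)]
    half = n_rows // 2
    # Exhaustive search, using the first half of the rows only.
    in_sample = [pearson(f[:half], target[:half]) for f in features]
    best = max(range(n_features), key=lambda i: abs(in_sample[i]))
    # The same feature, evaluated on the rows the search never saw.
    holdout = pearson(features[best][half:], target[half:])
    return best, in_sample[best], holdout

best, r_train, r_holdout = dredge()
print(f"feature {best}: r = {r_train:+.2f} on the search half, "
      f"{r_holdout:+.2f} on the held-out half")
```

With this many candidate features, the winner of the search will usually show a correlation that looks respectable, and it will usually fall back toward zero on the held-out half. That second number is the one to trust.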

3. Create An Environment That ‘Works’

Common knowledge holds that about 80%-95% of effort in data mining projects is “wasted” on getting the data ‘right’, and this effort is usually referred to as “data preparation” (see also tip# 7). In most companies, the “preparation” includes assembling the mining data set. Because most data mining algorithms expect a flat file as input, you need to create one. A lot of this effort goes to denormalizing data from a Star Schema into a flat file structure. What that means is that every entity in your population of interest (e.g. every “customer”, “shipment”, etc.) is represented by exactly one row in the data set, with as many columns appended as reasonably possible.
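
As a rough illustration of “one row per entity”, the sketch below flattens a tiny, made-up extract in plain Python. The table contents, the column names (customer_id, region, amount) and the two aggregates are all hypothetical. In practice this would be a (heavy) SQL query against the warehouse, but the shape of the result is the same.

```python
from collections import defaultdict

# Hypothetical star-schema extract: one dimension table and one fact table.
customers = [
    {"customer_id": 1, "region": "North"},
    {"customer_id": 2, "region": "South"},
    {"customer_id": 3, "region": "North"},
]
orders = [
    {"customer_id": 1, "amount": 120.0},
    {"customer_id": 1, "amount": 80.0},
    {"customer_id": 2, "amount": 45.5},
]

def flatten(customers, orders):
    """Return exactly one row per customer, with order aggregates appended as columns."""
    totals = defaultdict(lambda: {"order_count": 0, "total_amount": 0.0})
    for order in orders:
        agg = totals[order["customer_id"]]
        agg["order_count"] += 1
        agg["total_amount"] += order["amount"]
    flat = []
    for cust in customers:
        # Customers without any orders still get a row, with zeroed aggregates.
        agg = totals.get(cust["customer_id"], {"order_count": 0, "total_amount": 0.0})
        flat.append({**cust, **agg})
    return flat

mining_set = flatten(customers, orders)
for row in mining_set:
    print(row)
```

Note that customer 3 has no orders and still appears in the result. Silently dropping such entities (the effect of an inner join) changes the population you are modeling.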

Because these kinds of queries can place a tremendous burden on CPU capacity and/or I/O, it is often a wise idea to involve a DBA in constructing (and scheduling) such queries. Even better: make these data sets a standard part of your architecture. Although the overhead for creation and storage is considerable, you’ll be able to fine-tune that effort. And if you take your data mining seriously, your ROI on data mining talent will pay off in spades by driving down the (non-value added) time they spend on gathering data sets.

4. Begin With The End Result In Mind

Hands on data mining means putting models to use. In practical business settings you need to begin data mining projects by thinking about how a data mining model will be used. Interestingly, this often shows that no model should be built at all… If you decide that a model is needed, focusing on the end product of your modeling effort (also called “implementation specification”) helps you ensure that the model can effectively be used. There’s a huge difference between a “good” or accurate model, and one that is useful. With 20/20 hindsight we can now build the perfect models to prevent this past credit crisis. And the data to do so were available back then, too. However, such a model, accurate as it may be, will not prevent the next credit crunch. So it would be absolutely useless today.

In other cases, a model to prevent churn may be what your client wants. But knowing who will defect does no good unless you have measures available to do something about this fate. Unless some intervention aimed at these soon-to-be churners can be employed in time, and with a positive impact on the bottom line, the model will be worthless from a business perspective. Academically, it may still be interesting.
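
A back-of-the-envelope calculation is often enough to decide this before any model is built. The function below is a minimal sketch; the formula is a deliberate simplification, and all parameter names and figures are made up for illustration.

```python
def campaign_value(n_targeted, churn_rate, save_rate, customer_value, cost_per_contact):
    """Expected net gain of contacting the customers that a churn model flags.

    churn_rate: share of targeted customers who would really have left
    save_rate:  share of those churners the intervention actually retains
    """
    expected_saves = n_targeted * churn_rate * save_rate
    return expected_saves * customer_value - n_targeted * cost_per_contact

# Same model, same targets; only the effectiveness of the intervention differs.
weak_offer = campaign_value(1000, 0.30, 0.10, 400.0, 15.0)
strong_offer = campaign_value(1000, 0.30, 0.20, 400.0, 15.0)
print(f"weak offer: {weak_offer:+.0f}, strong offer: {strong_offer:+.0f}")
```

With these assumed figures the weak offer loses about 3,000 and the strong one gains about 9,000. The model is identical in both cases, so its accuracy alone says nothing about whether it is useful.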

5. Domain Knowledge Is Key

One can never trust data to be what they appear to be. Pyle (2003): “… it seems intent on leading the unwary astray. Invariably, the data that a data miner has to use seems particularly well constructed to promote frustration.” Besides understanding what data represent, it behooves the miner to find out how they were gathered. This data collection process can have tremendous impact on the kind of models that can or cannot be built. If I got a dime for every time I heard “we capture all the data”, only later to find out that some were censored out, I’d be a richer man. People close to the business can tell you much about the (primary) process that led to the data recording in the first place.

This holds both for interpreting data as they appear in databases and for making sure that models are applied properly. When you see mostly capital M or F, and then the occasional lower case m or f, this may be some sort of coincidence, but maybe not. Finding out with domain experts is probably your best move. “Experience has taught once brash analysts that those familiar with the domain are usually as vital to the solution as the technology brought to bear” (John Elder).
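
Oddities like the M/F example only surface if you profile the raw values before any cleaning. A minimal sketch follows; the column contents are made up and the helper names are hypothetical.

```python
from collections import Counter

def value_profile(values):
    """Count every distinct raw value, most frequent first, without cleaning anything."""
    return Counter(values).most_common()

def case_variants(values):
    """Group raw codes that differ only by letter case, e.g. 'M' and 'm'."""
    groups = {}
    for value, count in Counter(values).items():
        groups.setdefault(value.upper(), {})[value] = count
    # Keep only the codes that appear in more than one spelling.
    return {key: variants for key, variants in groups.items() if len(variants) > 1}

# Made-up gender column with a handful of lower case codes.
gender = ["M"] * 480 + ["F"] * 510 + ["m"] * 6 + ["f"] * 4

print(value_profile(gender))
print(case_variants(gender))
```

The code can only flag the pattern. Whether the lower case codes come from a different source system, a different entry screen or plain typing habits is a question for the people who know the process.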

6. Split Your Mining Set In Three Parts If You Can Afford This (Otherwise Two)

Data miners always split their dataset to prevent stumbling on spurious results that don’t hold up. If you find an effect in one half of your data set, and it also holds in the other half, you are pretty safe. There are ingenious machine learning equivalents to this practice like bootstrapping and jackknifing, but a rough-cut data mining project tends to be performed under (considerable) time pressure which usually precludes most of these advanced techniques (see also tip# 9).

In data mining we usually speak of training, test, and validation sets. They are mutually exclusive subsets of your mining data set. You develop a model on the training set, and see how it performs on the test set. Then you go back to your training data, and try something different. Rinse and repeat. Only once you have settled on your final model do you assess model performance on the validation set. This is the only time you get an unbiased estimate of the “true” expected model performance (accuracy of prediction). If your data is sparse, you’ll have to forego the validation set.
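
A three-way split along these lines can be sketched in a few lines of Python. The 60/20/20 proportions, the fixed seed and the function name are assumptions for illustration. The set names follow the convention used in this tip, where the validation set is the one held out until the very end.

```python
import random

def three_way_split(rows, train=0.6, test=0.2, seed=42):
    """Shuffle rows and cut them into mutually exclusive training, test and validation sets."""
    if not 0 < train + test < 1:
        raise ValueError("train + test must leave room for a validation set")
    shuffled = list(rows)
    # A fixed seed makes the split reproducible from one modeling session to the next.
    random.Random(seed).shuffle(shuffled)
    n_train = int(len(shuffled) * train)
    n_test = int(len(shuffled) * test)
    training = shuffled[:n_train]
    testing = shuffled[n_train:n_train + n_test]
    validation = shuffled[n_train + n_test:]  # touched once, for the final model only
    return training, testing, validation

train_set, test_set, validation_set = three_way_split(range(1000))
print(len(train_set), len(test_set), len(validation_set))
```

A purely random split is the simplest choice. If your rows are not independent (several rows per customer, or a strong time trend), you may need to split by customer or by period instead.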

Take note that for some reason, SAS has acquired the habit of switching the names for testing and validation sets. With SAS, the terminology is such that you hold out the testing set until the final model is built.

7. Data Assembly Is Not Data Preparation

Many people confuse putting together a data set (data assembly) with data preparation. Although it is certainly true that you need a data set to ‘prepare’ yourself for a mining project, we suggest not confusing these two activities. Data assembly is the grunt work required to come up with a proper mining set. It provides the “raw material” to work with. Data preparation, on the other hand, is the skill to adjust the layout and transformations of the data set, for a specific modeling job. The specific target variable, associations within the data set, optimal strategy for dealing with missing values, and the modeling objective all need to be considered when determining how to best prepare a mining data set. Everything you do for data preparation is aimed at exposing the intrinsically available patterns so that your algorithms will do the best job of separating the signal (“true” effects) from the noise (chance fluctuations).
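
As one small example of a preparation step (as opposed to assembly), the sketch below fills missing values with the median and keeps a flag that records which rows were missing, so an algorithm can still pick up on “missingness” as a signal. This is just one possible strategy, chosen for illustration. As argued above, the best treatment depends on the target variable and the modeling objective. The function name is hypothetical.

```python
from statistics import median

def impute_with_flag(values):
    """Replace None with the median of the observed values.

    Returns the imputed column plus a 0/1 column marking the rows that were missing.
    Raises statistics.StatisticsError if no value was observed at all.
    """
    observed = [v for v in values if v is not None]
    fill = median(observed)
    imputed = [fill if v is None else v for v in values]
    was_missing = [1 if v is None else 0 for v in values]
    return imputed, was_missing

income, income_missing = impute_with_flag([10, None, 30, 20])
print(income, income_missing)
```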

8. Gather As Much Feedback As You Can

In some organizations, monitoring the performance of data mining models comes as an afterthought. When the results of a campaign fail to please, people dive into the models to see why they’re not getting good results. You have your answer there, already. You get better at building models by constantly reviewing what works, for how long, and why. Investing in looking back at your own models is not a luxury, it’s a bare necessity if you want to get good at this craft. Hence, you just can’t afford to postpone it… Do it before you’re building new models, while you’re building new models, and after you’ve built a new model. In short: all the time ☺
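
Monitoring does not have to be elaborate. One common yardstick is lift: the response rate among the top-scored customers divided by the overall response rate, tracked campaign after campaign. The sketch below is a minimal version. The function name, the input layout (score, outcome pairs) and the sample figures are assumptions for illustration.

```python
def lift_at(scored_outcomes, fraction=0.1):
    """Response rate in the top-scored fraction, divided by the overall response rate.

    scored_outcomes: iterable of (model_score, outcome) pairs, outcome being 1 or 0.
    Raises ZeroDivisionError if nobody responded at all.
    """
    ranked = sorted(scored_outcomes, key=lambda pair: pair[0], reverse=True)
    n_top = max(1, int(len(ranked) * fraction))
    top_rate = sum(outcome for _, outcome in ranked[:n_top]) / n_top
    overall_rate = sum(outcome for _, outcome in ranked) / len(ranked)
    return top_rate / overall_rate

# Made-up results of one campaign: ten customers, three responders.
campaign = [(0.9, 1), (0.8, 1), (0.7, 0), (0.6, 0), (0.5, 1),
            (0.4, 0), (0.3, 0), (0.2, 0), (0.1, 0), (0.0, 0)]
print(round(lift_at(campaign, fraction=0.2), 2))
```

A lift that drifts toward 1 over successive campaigns suggests the model has stopped discriminating. It is far better to see that coming than to discover it when results fail to please.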

9. Optimize The Model, As Well As Your Workload

By tinkering with parameters, you can always get a more accurate model. You can also improve the accuracy by making it more elaborate, for instance by using a two-stage model (see a previous newsletter on, tip# 8). But all too often, the data miner’s capacity forms a bottleneck for the organization. Even though it may well be worth your time financially to strive for a better model (time spent is justified in light of improved model performance), maybe this comes at the expense of other modeling projects that are being delayed. So optimize your efforts relative to your workload, and not necessarily with regard to the specific model you happen to be working on.

10. ‘Better’ Data Mining Tools Have Drawbacks As Well

It is absolutely true that the modern generation of data mining tools (GUI driven) provides a tremendous advantage over the “old style” largely programmable and syntax driven software. Not only has this boosted productivity (building more models per day), but more importantly, this has created a vastly larger pool of potential data miners. And since domain knowledge comes at a premium (see also tip# 5), the “overall” quality of output improves, on average. But then, what does “average” mean? The statistics joke is that with your head in the oven, and your feet in a bucket of ice, you are pretty comfortable, on “average” ☺

The modern generation of tools also implies that you can shoot yourself in the foot with more accuracy. You do this faster, more efficiently and more accurately. The need for metadata (see also a previous newsletter on meta data) goes up commensurately. Governance of model monitoring needs to grow in conjunction with the expansion of domains and frequency of use. But then, proper governance is required for any application of models! Traditional cost accounting models have become obsolete in Lean production systems (you could argue counterproductive, even). Mortgage application models need governance, too, as was proven by this credit crunch...

Further reading

Some excellent books for Hands-on Data Mining:

Business Modeling and Data Mining
Dorian Pyle (2003)

ISBN# 155860653-X

Mastering Data Mining
Michael Berry & Gordon Linoff (2000)

ISBN# 0471331236

Data Preparation for Data Mining
Dorian Pyle (1999)

ISBN# 1558605290

Data Mining: Practical Machine Learning Tools and Techniques
Ian Witten & Eibe Frank (2005)

ISBN# 0120884070

Contact
XLNT Consulting
Tom Breur, Principal

E-mail
Email Tom Breur

Telephone
+31-6-463 468 75

Address
Langestraat 8-03
5038 SE Tilburg
the Netherlands