Tom’s Ten Data Tips – August 2010
Text Mining
The digital universe is growing at an astounding pace, and the majority of that growth is in unstructured information. Text mining plays a crucial role in extracting value from these oceans of data. An estimated 80-95% of all data is in unstructured format. Emails, contracts, policies and patents all potentially contain enormous value, if you know how to process them in an automated fashion.
In the past, structured and unstructured business intelligence (BI) have evolved largely independently. Now business users are becoming aware of missed opportunities as a result of this disconnect. This has spawned the field of Enterprise Information Management. Text mining has been defined as “… a knowledge intensive process in which users interact with document collections using a suite of analysis tools” (Feldman & Sanger).
1. There’s A Continuum From Structured Via Semi-Structured To Unstructured
Text mining is often thought of as taking place on unstructured text. However, that is only partially true. Even a simple document can contain a fair amount of semantic and syntactic structure. On top of that, typographical elements like punctuation, special characters in combination with white space, line breaks, bold type, italics, etc. all add a lot of structure that imbues the text with meaning.
2. Text Mining Helps To Close The “Gap” Between The Structured And The ‘Unstructured World’
Traditionally, BI has pertained to structured data exclusively. Facts and dimensions are combined using set logic to arrive at numeric outcomes. The ‘unstructured world’ deals with information in emails, contracts, complaint letters, notifications, promotional materials, proposals, etc. Evidently, there is enormous potential value contained in these sources, if only we knew how to extract it. How many of our contracts have an insured value in excess of $100,000? Is the number of complaint letters really rising? Since when has this been the case? Which employees have been informed about project Aramac? Etc.
By unlocking the inherent value in unstructured data sources, an entirely new dimension of reporting can be created. Structured data provide a numeric foundation, if you will, and the unstructured information adds context and insight. It’s the difference between knowing when somebody called and knowing what they were talking about. Although you can report on the unstructured data itself, combining it with structured data provides “a story” behind the numbers.
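As a minimal sketch of how a question like the one about insured values might be answered, the snippet below pulls a dollar amount out of free-text contracts and turns it into a number that can be filtered like any structured field. The contract IDs, the wording of the contracts and the pattern are all hypothetical; real contracts would need far more robust extraction.

```python
import re

# Hypothetical contract snippets; IDs, wording and amounts are invented.
contracts = {
    "C-001": "The insured value of the premises is $250,000 as of signing.",
    "C-002": "Insured value: $75,000. Deductible: $1,000.",
    "C-003": "The parties agree on an insured value of $1,200,000.",
}

# "insured value", then up to 40 characters that are neither digits nor "$",
# then a dollar amount written with optional thousands separators.
AMOUNT = re.compile(r"insured value[^$\d]{0,40}\$([\d,]+)", re.IGNORECASE)


def insured_value(text):
    """Return the first insured value found in the text as an int, or None."""
    match = AMOUNT.search(text)
    if match is None:
        return None
    return int(match.group(1).replace(",", ""))


# Once extracted, the values behave like ordinary structured data.
values = {cid: insured_value(text) for cid, text in contracts.items()}
large = sorted(cid for cid, v in values.items() if v is not None and v > 100_000)

print(values)
print(large)
```

The extracted numbers can then be joined with existing structured records, for example on the contract ID, which is where the “story behind the numbers” comes from.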
3. “Number Crunchers” Mostly Outgun NLP
In text mining, an implicit “competition” has been going on between approaches that use language and its structure, and a parallel strand that started with Claude Shannon and is largely focused on the statistical (or information theoretic) properties of language, without direct reference to “meaning.” Natural Language Processing (NLP), sometimes also referred to as Computational Linguistics or Language Engineering, takes the meaning and structure of grammar as its primary view. With the evolution of computing power and improved algorithms, the NLP camp has slowly but steadily begun to lose this “arms race” to brute-force computing power that treats text (largely) like a “bag of words.”
There is one “class” of problems where NLP solutions still seem superior, albeit at a significantly higher (financial) cost of realizing a working solution: problems where it is essential that you retrieve all valid results, that is, where very high recall is required (see also tip #6).
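To illustrate what treating text like a “bag of words” means, here is a minimal sketch that counts words and compares documents by cosine similarity. The function names and example sentences are illustrative assumptions, not a reference to any particular product. Note how two sentences with opposite meanings look identical once word order is discarded.

```python
import math
from collections import Counter


def bag_of_words(text):
    """Lowercase, split on whitespace and count tokens; word order is discarded."""
    return Counter(text.lower().split())


def cosine_similarity(a, b):
    """Cosine of the angle between two word-count vectors."""
    dot = sum(a[word] * b[word] for word in a.keys() & b.keys())
    norm_a = math.sqrt(sum(count * count for count in a.values()))
    norm_b = math.sqrt(sum(count * count for count in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


s1 = bag_of_words("the dog bites the man")
s2 = bag_of_words("the man bites the dog")
s3 = bag_of_words("stock prices fell sharply")

# Same words, different meaning: the two bags are indistinguishable.
print(s1 == s2)
print(cosine_similarity(s1, s2))
# No words in common: similarity is zero.
print(cosine_similarity(s1, s3))
```

Despite this blind spot, purely statistical representations of this kind scale well, which is a large part of why they have gained ground.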
4. Text Mining And Information Retrieval (IR) Have Different Objectives
Although the underlying technology for text mining and information retrieval has more similarities than differences, it is the objective that distinguishes these fields. Text mining is used to extract patterns from within one document, or from a whole set of documents. Relating documents to each other often reveals patterns that could not have been found in individual ones. For example, our use of language evolves over time: some words become “fashionable”, while others go extinct.
IR is used to select appropriate documents based on a user query. Finding web pages with a search engine is probably the best known example. But querying a company’s knowledge base, searching for existing patents, selecting contracts with particular references, or finding law cases in a particular domain all come to mind as well.
5. Text Mining Is Mostly About Pre-Processing
The main difference between “regular” data mining (on structured data sets) and text mining lies in the role of pre-processing. Of course, pre-processing is time consuming in structured data mining too. But its role there is limited to helping algorithms get “easier” access to the signal through the noise.
The pre-processing tasks in text mining are aimed at turning unstructured or semi-structured documents into a structured format that can be mined at all. Only after the pre-processing can you do anything (automated) with text. It’s a necessary precondition to mining.
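The sketch below shows one common form this pre-processing can take: tokenizing, lowercasing, dropping stop words, and arranging the result as a document-term matrix that standard mining algorithms can consume. The stop word list, function names and sample documents are assumptions made for illustration; real pipelines typically add steps such as stemming.

```python
import re
from collections import Counter

# A tiny, hand-picked stop word list; real lists are much longer.
STOP_WORDS = {"the", "a", "an", "is", "of", "to", "and", "in", "our", "about"}


def preprocess(text):
    """Lowercase, keep alphabetic tokens only and drop stop words."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [token for token in tokens if token not in STOP_WORDS]


def document_term_matrix(documents):
    """Return (vocabulary, rows); rows[i][j] counts vocabulary[j] in document i."""
    counts = [Counter(preprocess(doc)) for doc in documents]
    vocabulary = sorted(set().union(*counts))
    rows = [[count[term] for term in vocabulary] for count in counts]
    return vocabulary, rows


docs = [
    "The complaint is about a late delivery.",
    "Late delivery of the contract, and a second complaint!",
]
vocabulary, matrix = document_term_matrix(docs)

print(vocabulary)
for row in matrix:
    print(row)
```

Each row is now a fixed-length numeric record, which is exactly the kind of input that “regular” data mining expects.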
6. Precision And Recall Both Matter, For (Very) Different Reasons
The two most commonly reported statistics in text mining are precision and recall. In most problems (if not all), it is only possible to increase precision at the expense of recall, and vice versa. If you want to increase both, your only option is to find a better, more accurate algorithm. Let’s look at an example to illustrate these two concepts.
If you search the internet, and we consider all pages that have been indexed by Google to be our “universe”, then we might search for “
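As a minimal sketch of the two statistics: precision is the share of retrieved documents that are relevant, and recall is the share of relevant documents that were retrieved. The page identifiers and the two result lists below are invented for illustration, and the function name is an assumption.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for a set of retrieved and relevant items."""
    retrieved, relevant = set(retrieved), set(relevant)
    true_positives = len(retrieved & relevant)
    precision = true_positives / len(retrieved) if retrieved else 0.0
    recall = true_positives / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical universe: four pages are truly relevant to the query.
relevant = {"p1", "p2", "p3", "p4"}

# A narrow query returns few pages, all of them relevant.
narrow = ["p1", "p2"]
# A broad query returns every relevant page, plus as many irrelevant ones.
broad = ["p1", "p2", "p3", "p4", "p5", "p6", "p7", "p8"]

print(precision_recall(narrow, relevant))  # (1.0, 0.5)
print(precision_recall(broad, relevant))   # (0.5, 1.0)
```

The narrow query trades recall for precision and the broad query does the opposite, which is the tension described above.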
7. Uniform Terminology Facilitates Recall
Information extraction (IE) is the process of extracting structured knowledge from unstructured text sources. One of the areas in which this has been (very) successfully applied is molecular biology. Because of the nature of the problem, recall needs to be very high. Researchers cannot accept a list of relevant results while remaining uncertain whether truly all relevant results have been obtained (see also tip #6). Similar problem constraints occur in litigation and patent research. In all such cases, users would rather “put up” with some spurious results than run the risk of missing some important ones. So high recall is required, if necessary at the expense of somewhat lower precision.
For more specific and unique terms, the task becomes easier. In contrast, when synonyms or homonyms exist, the IE task becomes (progressively) more difficult. In general medical science, for instance, considerable “confusion” is possible about terminology: sometimes the same condition is described using different terms, and sometimes the same term is used for different conditions. In such cases, the context needs to be taken into account. In molecular biology (where all jargon is uniquely defined) there is much less of this confusion, and results are therefore better.
8. Evolution In Text Mining Has Been A Significant Driver Of Knowledge Management
The field of knowledge management (KM) is rather fragmented. Professionals trained in information or library sciences have been leading this field. Its primary outcomes are representations of knowledge such as taxonomies, knowledge maps and ontologies. However, the (largely manual) process of codification is cumbersome and, what’s worse, error-prone.
Technologies like information retrieval, search engine technology, etc., have generated systems that can automate (at least partially) many of these tasks. A taxonomy is a hierarchical classification system (the best known is probably the one for the animal kingdom). If you blend a taxonomy with a thesaurus (linking different terms), you can “turbo charge” information retrieval and search systems. Together they help users arrive at their desired results with the fewest possible mouse clicks.
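One way to picture the blend of taxonomy and thesaurus is query expansion: a search term is widened with its synonyms (from the thesaurus) and with all narrower terms (from the taxonomy) before matching. The sketch below is a toy under stated assumptions; the vocabularies, documents and function names are all hypothetical.

```python
# Hypothetical thesaurus: each term maps to terms treated as equivalent.
THESAURUS = {
    "car": {"automobile"},
    "automobile": {"car"},
}

# Hypothetical taxonomy: each term maps to its direct narrower terms.
TAXONOMY = {
    "vehicle": {"car", "truck"},
    "car": {"sedan", "hatchback"},
}


def narrower_terms(term):
    """Collect all descendants of a term in the taxonomy."""
    found = set()
    stack = [term]
    while stack:
        current = stack.pop()
        for child in TAXONOMY.get(current, ()):
            if child not in found:
                found.add(child)
                stack.append(child)
    return found


def expand_query(term):
    """Return the term, its synonyms, and all narrower terms of either."""
    seeds = {term} | THESAURUS.get(term, set())
    expanded = set(seeds)
    for seed in seeds:
        expanded |= narrower_terms(seed)
    return expanded


def search(term, documents):
    """Return IDs of documents containing any term of the expanded query."""
    terms = expand_query(term)
    return [doc_id for doc_id, text in documents.items()
            if terms & set(text.lower().split())]


documents = {
    "d1": "used sedan for sale",
    "d2": "truck rental rates",
    "d3": "bicycle repair",
}

print(search("automobile", documents))
print(search("vehicle", documents))
```

A search for “automobile” finds the sedan advertisement even though the word never appears in it, which is the kind of shortcut that saves users extra queries and clicks.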
9. Text Mining Provides Risk Management For Litigation And Compliance
One of the business drivers of text mining has been support for litigation and compliance. This sounds much like a “defensive” motive. When companies get sued by customers, for instance, they may need to find all the contacts/communications they have had with this particular customer. Yet it is difficult to define what characterizes “all communications.” Given the way most document management systems are organized, there are no obvious “hooks” for an exhaustive retrieval. Sifting through email can hardly be limited to exact matches on particular keywords, because missing one or two messages can seriously jeopardize your legal case.
Exhaustive retrieval is similarly important when disputes about patents or contracts are at stake.
10. Text Mining Is But The Tip Of The Iceberg
We have focused (almost) exclusively on text mining in this newsletter. However, our need to make sense of unstructured information doesn’t stop there. Text mining happens to have evolved further, and seems (slightly) more mature. Annotating pictures or movies (face/image recognition), transcribing audio (spoken text), or unraveling genomes, are all fields that will evolve in years to come.
Other examples are mining humongous graphs like the web, or social networks. Both are phenomenal semi-structured tasks that still lie largely ahead of us.
Further reading
Some excellent books on Text Mining:
The Text Mining Handbook. Ronen Feldman & James Sanger (2007). ISBN 0521836573







