Tom's Ten Data Tips - January 2012
“Big Data”
Our society is overflowing with data, and data volumes keep growing at an unprecedented pace. Global growth in data volume is estimated at a staggering 60% per year, or about 10-fold in five years. Nothing seems to stop this incoming tsunami. Relational, SQL-based architectures don’t scale sufficiently to deal with this growth, in particular the growth of unstructured data. The McKinsey Global Institute has labeled this trend “the next frontier for innovation, competition and productivity.”
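The arithmetic behind "60% per year, or about 10-fold in five years" is plain compound growth. A minimal sketch to check it, assuming a constant annual rate (the function names are illustrative, not from any library):

```python
import math


def growth_factor(annual_rate: float, years: int) -> float:
    """Compound growth multiple after `years` at `annual_rate` (0.60 means 60%)."""
    return (1.0 + annual_rate) ** years


def years_to_multiply(annual_rate: float, multiple: float) -> float:
    """Years needed for volume to grow by `multiple` at a constant annual rate."""
    return math.log(multiple) / math.log(1.0 + annual_rate)


five_year_factor = growth_factor(0.60, 5)        # roughly 10.5
years_to_tenfold = years_to_multiply(0.60, 10)   # just under five years

print(f"{five_year_factor:.2f}x after 5 years")
print(f"{years_to_tenfold:.2f} years to reach 10x")
```

At 60% a year the volume multiplies by 1.6 five times over, which is slightly more than ten.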
1. What Is “Big Data”, Really?
Nobody seems to agree on what “Big Data” really stands for. It is clear, though, that there’s a “big” hype involved. There is no common definition, nor even agreement on the order of magnitude or the kind of data involved. Volume and diversity of data types seem to be the common denominators. Since storage providers have been most eager to jump on this bandwagon, one would assume that disk space must be a differentiator. We would argue that “Big Data” signifies a sea change away from traditional relational formats. End-user needs, shaped by unprecedented demands for low latency and scalability, are driving a search for alternative solutions to common data challenges. This originated largely with corporate powerhouses like Google, eBay, Facebook and Twitter that have led the way in leveraging gargantuan data volumes.
Unless you have (very) large volumes of unstructured or semi-structured data, and you need to analyze these in (near) real-time, the problem can most likely be solved with “traditional” (RDBMS) means. Why would you resort to relatively new, quirky solutions like Hadoop, and a bewildering array of NoSQL solutions? Development is largely a step back in time when you submit your needs to nerdy so-called “data scientists” (see also tip# 6), relying on not-yet-widely-used technology. Unless there is a business case for going massively parallel (see also tip# 4), traditional (familiar) relational database management systems will usually be a lot more comfortable and easier to manage.
2. “Big Data” Involves A Change In BI Architecture(s)
Traditionally, BI was a downstream process from business operations. A source system created some data that were extracted and warehoused, and subsequently made available to the business.
When the pressure on BI is to deliver data faster, physically moving data interferes with the need to make these data available more quickly. Because of the redundant storage they require, such extract-and-warehouse architectures are also less feasible for petabyte-scale data volumes. As a result, “Big Data” business intelligence solutions tend to be much more enmeshed in operational solutions, living alongside rather than downstream from the primary systems that create the data.
3. “Big Data” Has ‘Big’ Scalability Needs
For decades, the SQL relational model has dominated business intelligence and database technology. Traditional RDBMSs are all based on SQL, for which an alternative has now become available: NoSQL, short for “Not only SQL”. This doesn’t mean SQL databases will disappear any time soon, but the introduction of NoSQL platforms has played a pivotal role in enabling “Big Data” solutions like the Apache Hadoop eco-system.
NoSQL solutions (see also tip# 5) dominate “Big Data” projects for two reasons: relational, SQL solutions are (largely) restricted to “vertical” upscaling, and NoSQL offers (abundant) possibilities for selecting low-cost commodity hardware in a “horizontally” scalable infrastructure. “Vertical” scaling means getting a faster server, more RAM, faster I/O, etc. By “horizontal” scaling we refer to expanding the size of a cluster or grid by adding more hardware. A cluster consists of identical pieces of hardware; a grid is composed of divergent types of servers. Horizontal scaling leads to a more or less linear rise in cost, whereas vertical scaling typically leads to a steep, roughly exponential rise in cost once some “natural” capacity ceiling has been reached.
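To make the difference between the two cost curves concrete, here is a toy model. Every number in it (node capacity, node price, the capacity ceiling, the price premium beyond the ceiling) is a made-up assumption for illustration, not vendor pricing; only the shapes of the curves matter.

```python
import math

# All figures below are hypothetical and chosen only to make the shapes visible.
NODE_CAPACITY = 10.0    # load units one commodity node can handle
NODE_PRICE = 5_000.0    # price of one commodity node
CEILING = 40.0          # load at which a single server hits its "natural" ceiling
PREMIUM = 2.0           # price multiplier per extra node-equivalent beyond the ceiling


def horizontal_cost(load: float) -> float:
    """Scale out: buy as many identical nodes as the load requires."""
    return math.ceil(load / NODE_CAPACITY) * NODE_PRICE


def vertical_cost(load: float) -> float:
    """Scale up: one ever-bigger server, priced linearly up to the ceiling
    and exponentially beyond it."""
    if load <= CEILING:
        return load / NODE_CAPACITY * NODE_PRICE
    price_at_ceiling = CEILING / NODE_CAPACITY * NODE_PRICE
    return price_at_ceiling * PREMIUM ** ((load - CEILING) / NODE_CAPACITY)


for load in (20, 40, 80, 160):
    print(f"load={load:>3}  scale-out={horizontal_cost(load):>12,.0f}"
          f"  scale-up={vertical_cost(load):>14,.0f}")
```

Below the ceiling the two approaches cost the same in this model. Beyond it, doubling the load doubles the scale-out bill, while the scale-up bill grows by a far larger factor.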
4. Scalable, Massively Parallel Architectures Will Boldly Take You Where No Man Has Ever Gone Before
One of the ‘big’ things with Hadoop and similar “Big Data” solutions is that they provide the means to scale out in near linear fashion. With (Apache) Hadoop, you may launch your services on a cluster with only a few nodes, and then later gradually and smoothly expand this solution to multi-Terabyte scale, without ever structurally revising your architecture.
The ability to scale out a solution running on low-cost commodity hardware has brought heretofore-unimaginable data volumes within reach of medium to large enterprises. As the business grows, the hardware can “simply” grow along, without any serious need for architecture revision. The dramatic price drops in MPP (Massively Parallel Processing) technology, and the commoditization of storage (racks of multi-Terabyte disks serving every node in a cluster), have opened up business cases for leveraging data that previously only lived in the realm of mega corporations (with very deep pockets). This technology trend has a significant impact on business competition!
5. NoSQL Solutions Come In Many Shapes And Sizes
In the “Big Data” space, more particularly for BI, (Apache) Hadoop seems to have acquired a rather dominant position. Hadoop should be considered an eco-system, though, rather than a solution per se. It has drawn many developers, hence it encompasses quite a wide variety of solutions like Hive for data warehousing, MapReduce for distributed processing, ZooKeeper to coordinate distributed applications, Avro for data serialization, the HDFS distributed file system, the HBase database, and several others.
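For readers who have not seen the MapReduce idea before, here is a minimal single-process sketch of its three steps (map, shuffle, reduce) applied to a word count. It is plain Python and deliberately does not use the Hadoop API; on a real cluster the map and reduce calls would run in parallel on many nodes, and the function names here are only illustrative.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_phase(line: str) -> Iterator[Tuple[str, int]]:
    """Map: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Shuffle: group all emitted values by key, as the framework would."""
    grouped: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key: str, values: List[int]) -> Tuple[str, int]:
    """Reduce: collapse all values for one key into a single result."""
    return key, sum(values)


def word_count(lines: Iterable[str]) -> Dict[str, int]:
    pairs = (pair for line in lines for pair in map_phase(line))
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


counts = word_count(["big data big hype", "big clusters"])
print(counts)
```

Because each map call looks at one line only, and each reduce call at one key only, the work can be spread over as many machines as there are lines and keys.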
However, NoSQL extends way beyond Hadoop. It provides highly diverse solutions, geared to address a bewildering array of technical challenges. What these solutions share is that they deviate from “traditional” (BI), or “classical” relational needs. NoSQL solutions revolve around unstructured or semi-structured data: images, video or audio, (mathematical) graphs like the ones behind connection recommendations in Facebook or LinkedIn, high-dimensional relations (as opposed to the typical 2-dimensional relations in an RDBMS), text, GPS, etc. In short: all the areas into which “Big Data” has been expanding.
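As an illustration of the graph case, connection recommendations can be approximated by counting mutual connections ("friends of friends"). The sketch below is a generic textbook approach on a tiny in-memory graph with invented names; it is an assumption for illustration, not how Facebook or LinkedIn actually compute their recommendations.

```python
from collections import Counter
from typing import Dict, List, Set


def recommend(graph: Dict[str, Set[str]], person: str, top_n: int = 3) -> List[str]:
    """Rank people who are not yet connected to `person` by mutual connections."""
    direct = graph.get(person, set())
    mutual: Counter = Counter()
    for friend in direct:
        for candidate in graph.get(friend, set()):
            if candidate != person and candidate not in direct:
                mutual[candidate] += 1
    # Most mutual connections first; ties broken alphabetically for stable output.
    ranked = sorted(mutual.items(), key=lambda item: (-item[1], item[0]))
    return [name for name, _ in ranked[:top_n]]


# Hypothetical, symmetric friendship graph.
graph = {
    "ann": {"bob", "cat"},
    "bob": {"ann", "dan", "eve"},
    "cat": {"ann", "dan"},
    "dan": {"bob", "cat"},
    "eve": {"bob"},
}

suggestions = recommend(graph, "ann")
print(suggestions)
```

In a relational database the same question takes a self-join per hop, which is one reason graph-oriented stores exist.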
6. “Data Scientists” Are The Information ‘High Priests’ of “Big Data”
Jeff Hammerbacher (Cloudera) and DJ Patil (Greylock Partners) coined the term “data scientists” (in 2008) to describe a hybrid between “data analysts” and “research scientists.” Working with “Big Data” requires a specialized skill set. It helps to have an affinity for programming. You need a solid understanding of experimental methodology, at least a working knowledge of statistics, and experience in data management. This intersection of expertise is pretty rare, which explains why “data scientists” are in such high demand at the moment.
“Big Data” structures are elusive. Since so few professionals know how to make sense of these oceans of data, or how to cross-validate findings, business stakeholders ‘just’ need to rely on this new generation of data scientists to leverage their data assets in the best possible way. Even seemingly “simple” cross-checks on frequency counts and correlations can be so challenging in “Big Data” environments that business stakeholders often feel at the mercy of their information “high priests” (=data scientists).
7. “Big Data” Leverages The “Long Tail”
One of the cornerstones of “Big Data” is that you can ‘outsmart’ your competitors if you have more and better data than they have. “Better” data refers to richer customer profiles, which you can accrue by consolidating information from a wide array of sources. “More” data refers to a bigger customer base, and access to a larger share of the market. That is where the “Long Tail” comes in.
“Long Tail” marketing (Anderson, 2006) refers to tapping into ever smaller niches of the market, and (still) serving these niches with an offer that is as personalized and timely as it can be. Long-tail demand is usually described by a heavy-tailed (power-law) distribution rather than a Gaussian (Normal) one. In such a distribution, growing your base by a factor of two doubles the expected volume in every niche, and that can lift (far) more than twice as many niches above the level at which serving them pays off. This is where size truly matters! “Long Tail” marketing is all about “selling less of more”, or finding more profitable niches, yet serving these accurately, thus leveraging the data you have for this niche.
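A rough sketch of this size effect, under assumptions that are entirely ours: niche demand follows a Zipf-like power law, and a niche is worth serving once it has a minimum number of expected buyers. The catalog size, exponent and threshold are arbitrary illustrative values.

```python
def viable_niches(customers: int,
                  catalog_size: int = 100_000,
                  exponent: float = 0.8,
                  min_buyers: float = 5.0) -> int:
    """Count niches whose expected number of buyers reaches `min_buyers`.

    The niche ranked r gets a share of demand proportional to r ** -exponent
    (a Zipf-like power law, assumed here purely for illustration).
    """
    weights = [rank ** -exponent for rank in range(1, catalog_size + 1)]
    total = sum(weights)
    return sum(1 for weight in weights if customers * weight / total >= min_buyers)


small_base = viable_niches(10_000)
double_base = viable_niches(20_000)
print(small_base, double_base, double_base / small_base)
```

With a tail this flat (exponent below 1), doubling the customer base more than doubles the number of niches that clear the threshold. With a steeper tail the gain would be smaller, so the conclusion depends on how demand is actually distributed in your market.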
8. “Big Data” Is Here To Stay (And Grow Even More!)
Given the current growth rates in data volumes, nothing seems to stop further adoption of “Big Data” architectures. If anything, they are bound to gain (much) wider acceptance. As the technology matures, further innovations will continue to drive down TCO (Total Cost of Ownership).
Hardware cost is going down at a fairly “predictable” rate, and considerable drops in operating and maintenance (software) costs can be expected from stepwise innovations. Hadoop and most other “Big Data” applications are developed under the Open Source licensing model. The Open Source developer community has proven remarkably adept at finding effective solutions for the most time-consuming maintenance activities. Those seem to ‘automatically’ get resolved first. Developers “know” what kind of architectures lead to poorly maintainable solutions. As if driven by some “invisible hand”, the Open Source model consistently leads to high quality solutions. This trend will foster more widespread and more easily adoptable “Big Data” frameworks.
9. NoSQL Won’t Replace But Instead Will Live Alongside “Traditional” RDBMS
NoSQL environments, and in particular the Hadoop eco-system, play a central role in “Big Data” solutions. Their dominance will only increase as data volumes continue to grow. However, we’re unlikely to say goodbye to traditional (RDBMS) solutions any time soon. Relational models provide enormous flexibility, and the ACID principle (Atomicity, Consistency, Isolation, Durability) for database transaction processing still has, and probably will continue to have, important advantages for the overwhelming majority of applications. It’s important enough that no replacement seems in sight.
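To show what the “A” in ACID buys you, here is a small sketch using Python’s built-in sqlite3 module. The accounts table, the names and the amounts are invented for the example. The transfer credits one account and then debits the other; when the debit violates a constraint, the whole transaction is rolled back, including the credit that had already succeeded.

```python
import sqlite3
from typing import Dict

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE accounts ("
    " name TEXT PRIMARY KEY,"
    " balance INTEGER NOT NULL CHECK (balance >= 0))"
)
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [("alice", 100), ("bob", 50)])
conn.commit()


def transfer(db: sqlite3.Connection, src: str, dst: str, amount: int) -> bool:
    """Move `amount` from src to dst as one atomic unit of work."""
    try:
        with db:  # commits if the block succeeds, rolls back if it raises
            db.execute("UPDATE accounts SET balance = balance + ? WHERE name = ?",
                       (amount, dst))
            db.execute("UPDATE accounts SET balance = balance - ? WHERE name = ?",
                       (amount, src))
        return True
    except sqlite3.IntegrityError:
        return False  # the CHECK constraint rejected an overdraft


def balances(db: sqlite3.Connection) -> Dict[str, int]:
    return dict(db.execute("SELECT name, balance FROM accounts"))


failed = transfer(conn, "alice", "bob", 500)   # alice cannot cover this
after_failure = balances(conn)
print(failed, after_failure)
```

After the failed transfer both balances are unchanged: bob never keeps money that alice could not pay. Many NoSQL stores relax guarantees of this kind in exchange for scalability, which is why the two worlds complement each other.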
For most applications, “Big Data” stores will serve as “just” another source for a relational data warehouse. It’s only the persistent, carefully modeled data that pass this “durability” test. Exporting them to a persistent store where historical analysis can be performed is likely to play an important function in driving corporations for many, many years to come. So both architectures can in fact live side-by-side quite elegantly.
10. Analytics Leverages The Deep Potential Of “Big Data”
“Big Data” by itself, regardless of the form, shape, size, or type it comes in, is worthless unless business users actually do something with it that delivers value to their organization. That’s where business analytics comes in.
Business intelligence has always been on the edge of data repositories that were bigger than could be comfortably handled by the technology available at the time. In that sense, “Big Data” is not all that new. When you attempt to compile a comprehensive history of the organization, and integrate this across business lines (as is customary in an EDW), the result will perforce grow beyond what is being handled elsewhere in the company. Leveraging analytics in a “Big Data” environment is challenging, yet it is the royal road to success.
Further reading
Some excellent books on “Big Data”:
Super Crunchers: Why Thinking-by-Numbers Is the New Way to Be Smart.
Ian Ayres (2006)
ISBN# 0935716025
Too Big to Know: Rethinking Knowledge Now That the Facts Aren't the Facts, Experts Are Everywhere, and the Smartest Person in the Room Is the Room.
David Weinberger (2012)
ISBN# 0465021425
Reinventing Discovery: The New Era of Networked Science.
Michael Nielsen (2011)
ISBN# 0691148902
The Long Tail: Why the Future of Business is Selling Less of More.
Chris Anderson (2006)
ISBN# 1401302378







