With all the hype and anti-hype surrounding Big Data, the data management practitioner is, in an ironic turn of events, inundated with information about Big Data. It is easy to get lost trying to figure out whether you have Big Data problems and, if so, how to solve them. It turns out the secret to taming your Big Data problems is in the detail data. This article explains how focusing on the details is the most important part of a successful Big Data project.
Big Data is not a new idea. Gartner coined the term a decade ago, describing Big Data as data that exhibits three attributes: Volume, Velocity, and Variety. Industry pundits have been trying to figure out what that means ever since. Some have even added more "Vs" to try to explain why Big Data is new and different from all the data that came before it.
The cadence of commentary on Big Data has quickened to the extent that if you set up a Google News alert for "Big Data," you will spend more of your day reading about Big Data than implementing a Big Data solution. What the analysts gloss over and the vendors attempt to simplify is that Big Data is primarily a function of digging into the details of the data you already have.
Gartner might have coined the term "Big Data," but they did not invent the concept. Big Data was just rarer then than it is today. Many companies have been managing Big Data for ten years or more. These companies may not have had the efficiencies of scale that we benefit from today, yet they were certainly paying attention to the details of their data and storing as much of it as they could afford.
A Brief History of Data Management
Data management has always been a balancing act between the volume of data and our capacity to store, process, and understand it.
The biggest achievement of the Online Analytical Processing (OLAP) era was to give users interactive access to data that was summarized across multiple dimensions. OLAP systems spent a significant amount of time up front pre-calculating a wide variety of aggregations over a data set that could not otherwise be queried interactively. The output was called a "cube" and was typically stored in memory, giving end users the ability to ask any question that had a pre-computed answer and get results in less than a second.
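To make the pre-computation idea concrete, here is a minimal sketch in Python of how a cube trades up-front work for fast lookups. The sales records, the dimension names, and the helper functions build_cube and query are all hypothetical and chosen only for illustration; real OLAP engines are far more elaborate.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical detail records: (region, product, month, sales amount).
SALES = [
    ("east", "pillow", "jan", 120.0),
    ("east", "lamp", "jan", 80.0),
    ("west", "pillow", "jan", 200.0),
    ("west", "pillow", "feb", 50.0),
]
DIMENSIONS = ("region", "product", "month")


def build_cube(rows, dimensions):
    """Pre-compute SUM(amount) for every subset of the dimensions."""
    cube = defaultdict(float)
    for row in rows:
        values, amount = row[:-1], row[-1]
        for size in range(len(dimensions) + 1):
            for idx in combinations(range(len(dimensions)), size):
                # Key is the chosen (dimension, value) pairs in a fixed order.
                key = tuple((dimensions[i], values[i]) for i in idx)
                cube[key] += amount
    return dict(cube)


def query(cube, **filters):
    """Answer a question by lookup; only pre-aggregated totals are available."""
    key = tuple((d, filters[d]) for d in DIMENSIONS if d in filters)
    return cube[key]


CUBE = build_cube(SALES, DIMENSIONS)
```

With this sketch, query(CUBE, region="east") returns 200.0 without touching the detail rows. A question the cube was not built for, such as the median sale amount, has no pre-computed answer and would require going back to the detail data.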
Big Data is exploding as we enter the era of plenty: high bandwidth, greater storage capacity, and many processor cores. New software, written after these systems became available, is different from its forebears. Instead of highly tuned, high-priced systems that optimize for the minimum amount of data required to answer a question, the new software captures as much data as possible in order to answer as-yet-undefined queries. With this new data captured and stored, there are a lot of details that were previously unseen.
Why More Data Beats Better Algorithms
Before I get into how detail data is used, it is crucial to understand, at the algorithmic level, why detail data matters so much. Since the former Director of Technology at Amazon.com, Anand Rajaraman, first expounded the concept that "more data beats better algorithms," his claim has been supported and attacked many times. The truth behind his assertion is rather subtle. To really understand it, we need to be more specific about what Rajaraman said, then walk through a simple example of how it works.
Figure 1: Using little data to estimate a relationship.
Experienced statisticians understand that having more training data can improve the accuracy of and confidence in a model. For example, say we believe that the relationship between two variables, such as the number of pages viewed on a website and the percent likelihood of making a purchase, is linear. Having more data points would improve our estimate of the underlying linear relationship. Compare the graphs in Figures 1 and 2, which show that more data gives us a more accurate and confident estimate of the linear relationship.
Figure 2: The same relationship with more data.
A statistician would also be quick to point out that we cannot increase the effectiveness of this pre-selected model by adding even more data. Adding another 100 data points to Figure 2, for example, would not greatly improve the accuracy of the model. The marginal benefit of adding more training data in this case decreases quickly. Given this example, we could argue that having more data does not always beat more-sophisticated algorithms at predicting the expected outcome. To increase accuracy as we add data, we would need to change our model.
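The diminishing return described above can be checked with a small simulation. The following Python sketch assumes a truly linear relationship with made-up parameters (a slope of 1.5, an intercept of 5, and Gaussian noise); none of these numbers come from the figures. It fits a least-squares line to samples of 10, 100, and 1,000 points and averages the slope error over repeated trials.

```python
import random


def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def mean_slope_error(n_points, trials=200, true_slope=1.5, noise=10.0, seed=0):
    """Average absolute error of the fitted slope over repeated samples."""
    rng = random.Random(seed)
    total = 0.0
    for _trial in range(trials):
        # Assumed data: page views in [0, 50], linear trend plus noise.
        xs = [rng.uniform(0, 50) for _ in range(n_points)]
        ys = [true_slope * x + 5.0 + rng.gauss(0, noise) for x in xs]
        slope, _intercept = fit_line(xs, ys)
        total += abs(slope - true_slope)
    return total / trials


# Slope error at three sample sizes, each ten times larger than the last.
ERRORS = {n: mean_slope_error(n) for n in (10, 100, 1000)}
```

Under these assumptions the error shrinks roughly with the square root of the sample size, so each tenfold increase in data buys a smaller absolute improvement than the one before it. The model's form never changes, which is why more data alone eventually stops helping.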
The "trick" to effectively using more data is to make fewer initial assumptions about the underlying model and let the data guide which model is most appropriate. In Figure 1, we assumed the linear model after collecting very little data about the relationship between page views and propensity to purchase. As we will see, if we deploy our linear model, which was built on a small sample of data, onto a large data set, we will not get very accurate estimates. If instead we are not constrained by data collection, we could collect and plot all of the data before committing to any simplifying assumptions. In Figure 3, we see that additional data reveals a more complex clustering of data points.
Figure 3: Even more data shows a different relationship.
By making a few weak (that is, tentative) assumptions, we can evaluate alternative models. For example, we can use a density estimation technique instead of using the linear parametric model, or use other techniques. With an order of magnitude more data, we might see that the true relationship is not linear. For example, representing our model as a histogram as in Figure 4 would produce a much better picture of the underlying relationship.
Figure 4: The data in Figure 3 represented as a histogram.
Linear regression does not predict the relationship between the variables accurately because it makes too strong an assumption, one that prevents additional features in the data from being captured, such as the U-shaped dip between 20 and 30 on the x-axis. With this much data, using a histogram results in a very accurate model. Detail data allows us to pick a nonparametric model, such as estimating a distribution with a histogram, and gives us more confidence that we are building an accurate model.
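As a rough illustration of this comparison, the Python sketch below generates synthetic data with a U-shaped dip between 20 and 30 page views, then compares a straight-line fit against a histogram-style model that predicts the average outcome in each bin. The shape of true_rate, the noise level, and the bin width are assumptions made for the example, not values taken from the figures, and the binned average is only one simple way to build a histogram-based model.

```python
import random

BIN_WIDTH = 2  # assumed bin width, in page views


def true_rate(x):
    """Hypothetical purchase likelihood (%) with a dip between 20 and 30."""
    rate = 10.0 + 1.2 * x
    if 20 <= x < 30:
        rate -= 25.0 * (1 - ((x - 25) / 5) ** 2)  # deepest at x = 25
    return rate


def sample(n, rng):
    xs = [rng.uniform(0, 50) for _ in range(n)]
    ys = [true_rate(x) + rng.gauss(0, 3.0) for x in xs]
    return xs, ys


def fit_line(xs, ys):
    """Parametric model: least-squares straight line."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return lambda x: slope * x + intercept


def fit_histogram(xs, ys):
    """Nonparametric model: average outcome within each bin."""
    sums, counts = {}, {}
    for x, y in zip(xs, ys):
        b = int(x // BIN_WIDTH)
        sums[b] = sums.get(b, 0.0) + y
        counts[b] = counts.get(b, 0) + 1
    overall = sum(ys) / len(ys)  # fallback for bins with no training data
    means = {b: sums[b] / counts[b] for b in sums}
    return lambda x: means.get(int(x // BIN_WIDTH), overall)


def mse(model, xs, ys):
    return sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


rng = random.Random(42)
train_x, train_y = sample(5000, rng)
test_x, test_y = sample(1000, rng)
LINEAR_MSE = mse(fit_line(train_x, train_y), test_x, test_y)
HIST_MSE = mse(fit_histogram(train_x, train_y), test_x, test_y)
```

On held-out data, the binned model's error should come out well below the linear model's, because no straight line can follow the dip. The binned model depends on having enough detail data to fill every bin, which is the situation the article describes.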
If this were a much larger parameter space, the model itself, represented by just the histogram, could be very large. Using nonparametric models is common in Big Data analysis because detail data allows us to let the data guide our model selection, especially when the model is too large to fit in memory on a single machine. Some examples include item similarity matrices for millions of products and association rules derived using collaborative filtering techniques.
One Model to Rule Them All
The example in Figures 1 through 4 demonstrates a two-dimensional model mapping the number of pages a customer views on a website to the percent likelihood that the customer will make a purchase. It may be the case that one type of customer, say a homemaker looking for the right style of throw pillow, is more likely to make a purchase the more pages they view. Another type of customer, for example an amateur contractor, may view a lot of pages only when doing research. Contractors might be more likely to make a purchase when they go directly to the product they know they want. Introducing additional dimensions can dramatically complicate the model, and maintaining a single model can create an overly generalized estimation.
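A small simulation shows how a single model can wash out two opposite behaviors. This Python sketch invents two customer segments with made-up trends (purchase likelihood rising with page views for homemakers, falling for contractors) and compares one pooled line with one line per segment. The segment names, parameters, and sample sizes are hypothetical.

```python
import random


def fit_line(points):
    """Least-squares line through (x, y) points; returns (slope, intercept)."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Hypothetical segments: (intercept, slope) of purchase likelihood (%).
SEGMENTS = {"homemaker": (10.0, 1.5), "contractor": (80.0, -1.5)}

rng = random.Random(7)
data = {}
for name, (intercept, slope) in SEGMENTS.items():
    points = []
    for _ in range(500):
        x = rng.uniform(0, 40)  # pages viewed
        y = intercept + slope * x + rng.gauss(0, 5.0)
        points.append((x, y))
    data[name] = points

# One model for everyone versus one model per customer type.
pooled = [p for points in data.values() for p in points]
POOLED_SLOPE = fit_line(pooled)[0]
SEGMENT_SLOPES = {name: fit_line(points)[0] for name, points in data.items()}
```

Under these assumptions the pooled slope lands near zero, suggesting that page views barely matter, while the per-segment slopes recover the two opposite trends. Knowing which segment a customer belongs to is itself a detail that has to be captured in the data.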