Understanding What Big Data Can Deliver

Customer segmentation can be used to increase the accuracy of a model while keeping complexity under control. By using additional data to first identify which model to apply, it is possible to introduce additional dimensions and derive more accurate estimates. In this example, by looking at the first product that a customer searches for, we can select a different model to apply based on our prediction of which segment of the population the customer falls into. The segmentation model is based on data that is related to, yet distinct from, the data used by the model that predicts how likely the customer is to make a purchase. First, we consider the specific product that the customer looks at, and then we consider the number of pages the customer visits.
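A minimal sketch of this two-stage idea follows. The segment names, the product-to-segment mapping, and the logistic coefficients are all hypothetical values chosen for illustration; in practice they would be learned from data. The first stage picks a segment from the first product searched, and the second stage applies that segment's own purchase model to the number of pages visited.

```python
import math

# Hypothetical mapping from the first product searched to a customer segment.
SEGMENT_BY_FIRST_PRODUCT = {
    "throw pillow": "home_decor",
    "lumber": "contractor",
}

# Hypothetical per-segment logistic models: (intercept, weight per page visited).
PURCHASE_MODELS = {
    "home_decor": (-2.0, 0.30),
    "contractor": (-1.0, 0.10),
    "default": (-3.0, 0.20),
}


def pick_segment(first_product: str) -> str:
    """Stage 1: choose a segment from the first product the customer searched for."""
    return SEGMENT_BY_FIRST_PRODUCT.get(first_product, "default")


def purchase_probability(first_product: str, pages_visited: int) -> float:
    """Stage 2: apply the selected segment's model to the page count."""
    intercept, weight = PURCHASE_MODELS[pick_segment(first_product)]
    score = intercept + weight * pages_visited
    return 1.0 / (1.0 + math.exp(-score))


if __name__ == "__main__":
    for product in ("throw pillow", "lumber", "garden hose"):
        print(product, pick_segment(product), round(purchase_probability(product, 5), 3))
```

Each per-segment model stays simple (one input), while the overall system still reflects two dimensions of customer behavior.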

Demographics and Segmentation No Longer Are Sufficient

Applications that focus on identifying categories of users are built with user segmentation systems. Historically, user segmentation was based on demographic information. For example, a customer might have been identified as a male between the ages of 25-34 with an annual household income of $100,000-$150,000 and living in a particular county or zip code. As a means of powering advertising channels such as television, radio, newspapers, or direct mailings, this level of detail was sufficient. Each media outlet would survey its listeners or readers to identify the demographics for a particular piece of syndicated content and advertisers could pick a spot based on the audience segment.

With the evolution of online advertising and Internet-based media, segmentation started to become more refined. Instead of a dozen demographic attributes, publishers were able to get much more specific about a customer's profile. For example, based on Internet browsing habits, retailers could tell whether a customer lived alone, was in a relationship, traveled regularly, and so on. All this information was available previously, but it was difficult to collate. By instrumenting customer website browsing behavior and correlating this data with purchases, retailers could fine-tune their segmenting algorithms and create ads targeted to specific types of customers.

Today, nearly every Web page a user views is connected directly to an advertising network. These ad networks connect to ad exchanges to find bidders for the screen real estate of the user's Web browser. Ad exchanges operate like stock exchanges except that each bid slot is for a one-time ad to a specific user. The exchange uses the user's profile information or their browser cookies to convey the customer segment of the user. Advertisers work with specialized digital marketing firms whose algorithms try to match the potential viewer of an advertisement with the available ad inventory and bid appropriately.

Real-Time Updating of Data Matters (People Aren't Static)

Segmentation data used to change rarely, with one segmentation map reflecting the profile of a particular audience for months at a time. Today, segmentation can be updated throughout the day as customers' profiles change. Using the same information gleaned from user behavior that assigns a customer's initial segment group, organizations can update a customer's segment on a click-by-click basis. Each action better informs the segmentation model and is used to identify what information to present next.

The process of constantly re-evaluating customer segmentation has enabled new dynamic applications that were previously impossible in the offline world. For example, when a model results in an incorrect segmentation assignment, new data based on customer actions can be used to update the model. If presenting the homemaker with a power tool prompts the homemaker to go back to the search bar, the segmentation results are probably mistaken. As details about a customer emerge, the model's results become more accurate. A customer that the model initially predicted was an amateur contractor looking at large quantities of lumber may in fact be a professional contractor.
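The click-by-click re-evaluation described above might be sketched as a running score per segment that each action adjusts. The action names, segment names, and evidence weights below are hypothetical; a production system would derive them from a trained model rather than a hand-written table.

```python
from collections import defaultdict

# Hypothetical evidence table: how much each (action, item) pair shifts
# the score of a segment. Negative values count as evidence against it.
EVIDENCE = {
    ("view", "power tool"): {"amateur_contractor": 1.0},
    ("view", "throw pillow"): {"homemaker": 1.0},
    ("back_to_search", "power tool"): {"amateur_contractor": -1.5},
    ("view", "bulk lumber"): {"professional_contractor": 2.0},
}


class SegmentTracker:
    """Keeps a running score per segment and re-evaluates after every action."""

    def __init__(self, initial_segment):
        self.scores = defaultdict(float)
        self.scores[initial_segment] = 1.0

    def record(self, action, item):
        """Apply the evidence for one customer action and return the new segment."""
        for segment, delta in EVIDENCE.get((action, item), {}).items():
            self.scores[segment] += delta
        return self.current()

    def current(self):
        """Return the segment with the highest score so far."""
        return max(self.scores, key=self.scores.get)


if __name__ == "__main__":
    tracker = SegmentTracker("amateur_contractor")
    print(tracker.current())
    print(tracker.record("view", "bulk lumber"))
```

Unknown actions simply leave the scores unchanged, so the tracker can be fed every event without filtering.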

By constantly collecting new data and re-evaluating the models, online applications can tailor the experience to precisely what a customer is looking for. Over longer periods of time, models can take into account new data and adjust based on larger trends. For example, a stereotypical life trajectory involves entering into a long-term relationship, getting engaged, getting married, having children, and moving to the suburbs. At each stage in life and in particular during the transitions, one's segment group changes. By collecting detailed data about online behaviors and constantly reassessing the segmentation model, these life transitions are automatically incorporated into the user's application experience.

Instrument Everything

I have shown examples of how detail data can be used to pick better models, which result in more accurate predictions. I have also explained how models built on detail data can be used to create better application experiences and adapt more quickly to changes in customer behavior. If you've become a believer in the power of detail data and you're not already drowning in it, you likely want to know how to get some.

It is often said that the only way to get better at something is to measure it. This is true of customer engagement as well. By recording the details of an application, organizations can effectively recreate the flow of interaction. This includes not just the record of purchases, but a record of each page view, every search query or selected category, and the details of all items that a customer viewed. Imagine a store clerk taking notes as a customer browses, shops, or asks for assistance. All of these actions can be captured automatically when the interaction is digital.

Instrumentation can be accomplished in two ways. Most modern Web and application servers record logs of their activity to assist with operations and troubleshooting. By processing these logs, it is possible to extract the relevant information about user interactions with an application. A more direct method of instrumentation is to explicitly record actions taken by an application into a database. When the application, running in an application server, receives a request to display all the throw pillows in the catalog, it records this request and associates it with the current user.
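As a rough illustration of the second, more direct method, the sketch below records each action into a database table keyed by user. It uses Python's built-in sqlite3 module with an in-memory database; the table layout, column names, and action names are assumptions made for this example, not a prescribed schema.

```python
import sqlite3
import time


def open_event_store(path=":memory:"):
    """Open (or create) a store with a single hypothetical 'events' table."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS events ("
        " ts REAL, user_id TEXT, action TEXT, detail TEXT)"
    )
    return conn


def record_event(conn, user_id, action, detail):
    """Record one action taken by the application on behalf of a user."""
    conn.execute(
        "INSERT INTO events (ts, user_id, action, detail) VALUES (?, ?, ?, ?)",
        (time.time(), user_id, action, detail),
    )
    conn.commit()


def events_for_user(conn, user_id):
    """Return a user's recorded actions in insertion order."""
    rows = conn.execute(
        "SELECT action, detail FROM events WHERE user_id = ? ORDER BY rowid",
        (user_id,),
    )
    return rows.fetchall()


if __name__ == "__main__":
    conn = open_event_store()
    record_event(conn, "user-1", "list_category", "throw pillows")
    record_event(conn, "user-1", "view_item", "pillow-42")
    print(events_for_user(conn, "user-1"))
```

A request handler would call `record_event` at the point where it serves the catalog page, so the stored events mirror what the customer actually saw.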

Note: Automatic Data Collection

Some data is already collected automatically. Every Web server records details about the information requested by the customer's Web browser. While not well organized or obviously usable, this information often includes sufficient detail to reconstruct a customer's session. The log records include timestamps, session identifiers, client IP address and the request URL including the query string. If this data is combined with a session table, a geo-IP database and a product catalog, it is possible to fairly accurately reconstruct the customer's browsing experience.
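The reconstruction step can be sketched as follows. Real Web server log formats vary, so this example assumes a hypothetical whitespace-separated line of the form `timestamp session_id client_ip url`, with timestamps in a single sortable ISO 8601 style. It groups requests by session and decodes the query string; joining against a geo-IP database or product catalog is left out.

```python
from urllib.parse import parse_qs, urlsplit


def parse_log_line(line):
    """Parse one line of the assumed 'timestamp session_id client_ip url' format."""
    timestamp, session_id, client_ip, url = line.split()
    parts = urlsplit(url)
    return {
        "timestamp": timestamp,
        "session_id": session_id,
        "client_ip": client_ip,
        "path": parts.path,
        "query": parse_qs(parts.query),  # e.g. {"q": ["throw pillows"]}
    }


def sessions_from_log(lines):
    """Group parsed records by session and order each session by timestamp."""
    sessions = {}
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        record = parse_log_line(line)
        sessions.setdefault(record["session_id"], []).append(record)
    for records in sessions.values():
        # Assumes timestamps share one format, so string order is time order.
        records.sort(key=lambda r: r["timestamp"])
    return sessions


if __name__ == "__main__":
    log = [
        "2013-05-01T10:00:05Z s1 203.0.113.7 /search?q=throw+pillows",
        "2013-05-01T10:00:01Z s1 203.0.113.7 /",
        "2013-05-01T10:00:03Z s2 198.51.100.2 /item?id=42",
    ]
    for session_id, records in sessions_from_log(log).items():
        print(session_id, [(r["path"], r["query"]) for r in records])
```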

Test Constantly

The result of collecting detail data, building more accurate models, and refining customer segments is a lot of variability in what gets shown to a particular customer. As with any model-based system, past performance is not necessarily indicative of future results. The relationships between variables change, customer behavior changes, and of course reference data such as product catalogs change. In order to know whether a model is producing results that help drive customers to success, organizations must test and compare multiple models.

A/B testing is used to compare the performance of a fixed number of experiments over a set amount of time. For example, when deciding which of several versions of an image of a pillow a customer is most likely to click on, you can select a subset of customers to show one image or another. What A/B testing does not capture is the reason behind a result. It may be by chance that a high percentage of customers who saw version A of the pillow were not looking for pillows at all and would not have clicked on version B either.
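A simple A/B setup might look like the sketch below. Hashing the customer identifier to pick a variant is one common approach, assumed here for illustration; it keeps each customer in the same group for the duration of the test. The function names and the shape of the impression records are hypothetical.

```python
import hashlib


def assign_variant(customer_id, variants=("A", "B")):
    """Deterministically map a customer to a variant by hashing their ID."""
    digest = hashlib.sha256(customer_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % len(variants)
    return variants[bucket]


def click_through_rates(impressions):
    """Compute clicks / impressions per variant.

    `impressions` is an iterable of (variant, did_click) pairs.
    """
    shown = {}
    clicked = {}
    for variant, did_click in impressions:
        shown[variant] = shown.get(variant, 0) + 1
        clicked[variant] = clicked.get(variant, 0) + (1 if did_click else 0)
    return {variant: clicked[variant] / shown[variant] for variant in shown}


if __name__ == "__main__":
    print(assign_variant("customer-123"))
    print(click_through_rates([("A", True), ("A", False), ("B", False), ("B", False)]))
```

The rates say nothing about why one image won, and this sketch performs no significance testing, so a difference between variants may still be due to chance.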

An alternative to A/B testing is a class of techniques called Bandit algorithms. Bandit algorithms use the results of multiple models and constantly evaluate which experiment to run. Experiments that perform better (for any reason) are shown more often. The result is that experiments can be run constantly and measured against the data collected for each experiment. The combinations do not need to be predetermined and the more successful experiments automatically get more exposure.
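One of the simplest members of this class is the epsilon-greedy strategy, sketched below as an illustration; the article does not name a specific algorithm, so this choice is an assumption. With probability `epsilon` the code explores a random experiment, and otherwise it shows the experiment with the best average reward observed so far.

```python
import random


class EpsilonGreedy:
    """Epsilon-greedy bandit over a fixed set of experiments ("arms")."""

    def __init__(self, arms, epsilon=0.1, rng=None):
        self.arms = list(arms)
        self.epsilon = epsilon
        self.rng = rng or random.Random()
        self.pulls = {arm: 0 for arm in self.arms}
        self.mean_reward = {arm: 0.0 for arm in self.arms}

    def choose(self):
        """Explore with probability epsilon, otherwise exploit the best arm."""
        if self.rng.random() < self.epsilon:
            return self.rng.choice(self.arms)
        return max(self.arms, key=lambda arm: self.mean_reward[arm])

    def update(self, arm, reward):
        """Fold one observed reward (e.g. 1.0 for a click) into the running mean."""
        self.pulls[arm] += 1
        count = self.pulls[arm]
        self.mean_reward[arm] += (reward - self.mean_reward[arm]) / count


if __name__ == "__main__":
    # Hypothetical click rates, used only to simulate customer behavior.
    true_rates = {"pillow_image_a": 0.05, "pillow_image_b": 0.15}
    rng = random.Random(0)
    bandit = EpsilonGreedy(true_rates, epsilon=0.1, rng=rng)
    for _ in range(5000):
        arm = bandit.choose()
        bandit.update(arm, 1.0 if rng.random() < true_rates[arm] else 0.0)
    print(bandit.pulls)
```

Because the better-performing experiment is chosen more often as evidence accumulates, there is no fixed end date after which a winner must be declared.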


Big Data has seen a lot of hype in recent years, yet it remains unclear to most practitioners where they need to focus their time and attention. Big Data is, in large part, about paying attention to the details in a data set. The techniques available historically have been limited by the level of detail that the hardware available at the time could process. Recent developments in hardware capabilities have led to new software that makes it cost effective to store all of an organization's detail data. As a result, organizations have developed new techniques around model selection, segmentation, and experimentation. To get started with Big Data, instrument your organization's applications, start paying attention to the details, let the data inform the models, and test everything.

Aaron Kimball founded WibiData in 2010 and is the Chief Architect for the Kiji project. He has worked with Hadoop since 2007 and is a committer on the Apache Hadoop project. In addition, Aaron founded Apache Sqoop, which connects Hadoop to relational databases, and Apache MRUnit, for testing Hadoop projects.
