One of the hippest tech memes in 2012 was certainly "Big Data." As the image below shows, Google Trends charts the popularity of the search term "big data" tracing a hockey-stick curve that starts in the last half of 2011.
What does that mean? It can't just be that people have discovered that most data is too "big" for a human to read comfortably before bedtime. No. That is not what "big data" means. Business people don't like to read anything longer than three pages, so by that metric, data has been huge since the 8-inch floppy was on sale at Radio Shack.
The secret is that "big data" means different things depending on whether you are on the engineering or business side of the problem.
On the engineering side, big data means, "I need more and faster storage, and by the way, we should try out some of this cool new software I've been reading about on the inter-tubes." It means using SSDs instead of spinning platters, and it means clustering, partitioning, and sharding. But this is not news to us. What we really have a hard time understanding (or a hard time caring to understand, perhaps?) is what our colleagues on the business side want from us when it comes to big data.
The Psyche of a Data-Hungry Business Person
On the business side, big data means, "How do I capitalize on the data we have internally?" Decision-makers, marketing people, and business analysts have read dozens of articles describing the immense advantage this company or that company has reaped by better understanding the usage and preference trends of its customers, and they are tired of pretending at cocktail parties that their data gets out and works for them, too.
The key to understanding the allure of big data to the marketer or business stakeholder is that they view this data as a goose that lays golden eggs, and they cannot get their hands on it. To them, big data is data that is locked up away from the business analysts and stakeholders.
Understand What You Have and Where You Want To Go
For an engineer approaching a big data project, the first step is to map out what you have, and how it is structured. Because it's all so disparate and mad-whack, you will need a logic layer (that is, scripts) to tie it together and transform it into…you guessed it…even more data.
Now, when you collect data from all the myriad interactions humans have with technology all day every day, it's true that it piles up quickly. But just storing it isn't the problem. Moving it is the problem. What happens if you collect everything on magnetic tapes, and then suddenly a business stakeholder wants to move that data to Amazon S3? Better build in a three-month lead time. I worked on a project where it was faster to copy each day's changed data onto a huge pile of hard drives, and then truck those drives to the data center after work let out at 6 p.m., than to transfer the data over a dedicated OC-12 line. Once the primary data has parked itself, it hates to move.
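To put rough numbers on the trucks-versus-wire tradeoff, here is a back-of-the-envelope sketch. The OC-12 line rate of 622.08 Mbit/s is standard; the daily data volume, the link efficiency, and the copy and drive times are all made-up assumptions for illustration, not figures from the project described above.

```python
# Back-of-the-envelope comparison: network transfer vs. trucking drives.
# Only the OC-12 line rate is a known constant; every other number
# below is a hypothetical placeholder.

OC12_BITS_PER_SECOND = 622.08e6  # nominal line rate, before protocol overhead


def transfer_hours(data_terabytes, link_bits_per_second, efficiency=0.8):
    """Hours needed to push data_terabytes (decimal TB) over a link.

    efficiency is an assumed fraction of the line rate that ends up
    carrying payload rather than protocol overhead.
    """
    data_bits = data_terabytes * 1e12 * 8
    seconds = data_bits / (link_bits_per_second * efficiency)
    return seconds / 3600


def truck_hours(copy_hours, drive_hours):
    """Hours to copy data onto drives and then drive them across town."""
    return copy_hours + drive_hours


if __name__ == "__main__":
    daily_change_tb = 10.0  # hypothetical daily delta
    network = transfer_hours(daily_change_tb, OC12_BITS_PER_SECOND)
    truck = truck_hours(copy_hours=6.0, drive_hours=1.0)  # hypothetical
    print(f"OC-12: {network:.1f} h, truck: {truck:.1f} h")
```

Under these assumed numbers, the wire needs well over a day to move a single day's changes, which is the point at which the pile of hard drives starts to look sensible.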
Ready, Set, Stumble!
On the DBA and engineering side, direct access to the data is not the problem; rather, the difficulty lies in understanding the utility of that data. Engineers know how to access the data, but they don't know what questions the business side wants to ask. And the business side doesn't know what is there to ask about or what is possible and impossible, and generally doesn't have the training in statistics and analytics to phrase the question in a meaningful way. Most often, when they do form a question, it's of limited scope and takes a very long time for the engineering team to get around to answering, so that when the answer comes back, it just doesn't seem to have been worth the trouble. Often, the questions posed to the big data analyst by non-technical people reflect a lack of understanding of what information there actually is in the data, and sometimes they apply metrics borrowed from one area of life to every other digital area, regardless of applicability.
On the engineering side, a meaningful big data question usually involves connecting and correlating data in various forms from various sources, and this is the most interesting part of the problem for the engineers. You might have three different SQL databases containing row-oriented data in 20 tables, each with different nomenclature, relationships, and data types, plus a couple of key/value stores that have no fixed schema, a couple of document stores where data is stored according to a schema designed to support access only via software code, as well as a bunch of text files storing log data in tab-delimited lines.
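As a minimal sketch of that kind of correlation, the snippet below ties together three toy sources: an in-memory SQLite table standing in for a SQL database, a plain dict standing in for a key/value store, and a string of tab-delimited log lines. The table name, column names, the "user:<id>" key format, and the log layout are all hypothetical; real systems will disagree about every one of them, which is exactly the problem.

```python
import csv
import io
import sqlite3


def load_customers(conn):
    """Source 1: rows from a relational table, keyed by customer id as text."""
    rows = conn.execute("SELECT cust_id, full_name FROM customers")
    return {str(cust_id): {"name": name} for cust_id, name in rows}


def merge_preferences(profiles, kv_store):
    """Source 2: schemaless key/value pairs, keys assumed to be 'user:<id>'."""
    for key, value in kv_store.items():
        _, _, user_id = key.partition(":")
        if user_id in profiles:
            profiles[user_id]["prefs"] = value
    return profiles


def count_log_events(profiles, log_text):
    """Source 3: tab-delimited lines assumed to be timestamp, user id, event."""
    reader = csv.reader(io.StringIO(log_text), delimiter="\t")
    for row in reader:
        if len(row) != 3:
            continue  # skip blank or malformed lines
        _, user_id, _ = row
        if user_id in profiles:
            profiles[user_id]["events"] = profiles[user_id].get("events", 0) + 1
    return profiles


def build_demo_profiles():
    """Wire the three hypothetical sources together into one structure."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE customers (cust_id INTEGER PRIMARY KEY, full_name TEXT)"
    )
    conn.executemany(
        "INSERT INTO customers VALUES (?, ?)", [(1, "Ada"), (2, "Grace")]
    )
    kv_store = {"user:1": {"newsletter": True}, "user:3": {"newsletter": False}}
    log_text = (
        "2012-11-01T10:00\t1\tlogin\n"
        "2012-11-01T10:05\t1\tpurchase\n"
        "2012-11-01T11:00\t2\tlogin\n"
    )
    profiles = load_customers(conn)
    merge_preferences(profiles, kv_store)
    count_log_events(profiles, log_text)
    conn.close()
    return profiles


if __name__ == "__main__":
    for user_id, profile in build_demo_profiles().items():
        print(user_id, profile)
```

Notice that the only thing holding the three sources together is the assumption that a customer id means the same thing everywhere. In practice, most of the scripting effort goes into discovering where that assumption fails.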
The more software vendors, different systems, and new open-source projects you have, the more complex that collection of different data sources is going to be. The greater the time period the data spans, the greater the complexity of structures and formats you might encounter. An extreme example of this would be a project requiring the use of Fast Fourier analysis data stored on VHS cassettes used in conjunction with current social media graph data.
Automation and Advance Planning Are Your Friends
Engineers tend to have very long attention spans and move with a cautious deliberation that can be off-putting to a sales or marketing type accustomed to quick questions and quick answers. The reality is that if the data is inherited from various locations, mining it for answers takes time. Big data projects usually begin with offline, ETL-oriented processes that crawl data sources, mapping and reducing, condensing rarefied data into dense results over a period of hours or days. Marketing folks tend to want faster answers.
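For a sense of what "mapping and reducing" looks like at toy scale, here is a small sketch of an offline batch pass. It assumes a hypothetical log format of day, product, and amount separated by tabs; the function names are invented for illustration. A real job would spread the same two steps across many machines and many hours.

```python
from collections import defaultdict


def map_record(line):
    """Map step: turn one raw line into zero or more (key, value) pairs.

    Lines are assumed to be 'day<TAB>product<TAB>amount'. Anything that
    does not fit that shape is dropped rather than guessed at.
    """
    parts = line.rstrip("\n").split("\t")
    if len(parts) != 3:
        return []
    day, product, amount = parts
    try:
        return [((day, product), float(amount))]
    except ValueError:
        return []


def reduce_pairs(pairs):
    """Reduce step: condense many sparse pairs into one total per key."""
    totals = defaultdict(float)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)


def run_batch(lines):
    """Run the map step over every line, then reduce the combined output."""
    pairs = (pair for line in lines for pair in map_record(line))
    return reduce_pairs(pairs)


if __name__ == "__main__":
    sample = [
        "2012-11-01\twidget\t2.5",
        "2012-11-01\twidget\t1.5",
        "2012-11-02\tgadget\t3",
        "this line is garbage",
    ]
    print(run_batch(sample))
```

The whole pass has to finish before anyone gets an answer, which is why this style of processing feels slow to someone waiting on the result.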
One solution gaining ground is the collection of data into structures that pre-compute and aggregate results as the data comes in. The aggregate data is kept up-to-date in hash tables or document stores, and is thus more available, reducing the time to query and return results. The structure of this pre-aggregate data is highly dependent on the resulting answers that the business team wants out of it. Document store schemas may be flexible, but if the data isn't consistent, it's hard to use. When pre-aggregating data, it is especially important to learn the business requirements and perform a needs analysis before the project gets too far along.
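A minimal sketch of pre-aggregation, assuming the business team has already said it wants a count, a total, and an average per product: the hash table is updated as each event arrives, so answering the question later is a lookup rather than a crawl. The class and method names here are hypothetical, and a production version would live in a shared store rather than in one process's memory.

```python
from collections import defaultdict


class RunningAggregates:
    """Keeps a per-key count and total current as each event arrives."""

    def __init__(self):
        # One small record per key; created on first write.
        self._stats = defaultdict(lambda: {"count": 0, "total": 0.0})

    def record(self, key, amount):
        """Fold one incoming event into the aggregate for its key."""
        entry = self._stats[key]
        entry["count"] += 1
        entry["total"] += amount

    def summary(self, key):
        """Answer the pre-agreed question without rescanning raw data."""
        entry = self._stats.get(key)  # .get() does not create a new entry
        if entry is None:
            return None
        return {
            "count": entry["count"],
            "total": entry["total"],
            "mean": entry["total"] / entry["count"],
        }


if __name__ == "__main__":
    aggregates = RunningAggregates()
    for product, amount in [("widget", 10.0), ("widget", 20.0), ("gadget", 5.0)]:
        aggregates.record(product, amount)
    print(aggregates.summary("widget"))
```

The tradeoff is visible in the code: this structure can answer questions about counts and totals per product and nothing else. If the business team later wants a median or a per-region breakdown, the raw data has to be replayed, which is why the needs analysis has to come first.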
For the Business Folks
If you're on the business side trying to glean some insight into those elusive, often prickly creatures known as engineers by reading Dr. Dobb's, first off, I applaud your initiative. The fact that you're here is half the battle. That said, my advice is brutish and short: You have to articulate the question you are asking, the data you want to get back, and the hypothesis you are testing with this exercise. It's called the "scientific method," and big data is science. Without the business context, the big data query is simply an academic exercise.
Once you've coaxed your engineering partner into corralling the data you need, it's time to present it and showcase all of the hard-hitting, gloriously ROI-inducing results. You'll need to get comfortable with some data presentation tools, and I am not talking about Excel here (unless the data you want to present has maybe 10 bars you can put in a chart). I am talking about graphing, analytics, and statistics software such as R, or business intelligence software such as Pentaho or JasperSoft.
Success is Iterative
On both sides of the big data question, success depends on creating tools and processes to deal with the data you collect. Make tools to help manage your question-and-answer process. Try the tool and examine the result; most of the time, the answer is going to be less useful than you had hoped. Iterate, make the tool better, and try the analysis again. Try asking different questions! Show different answers.
As you gain practice, you will accelerate the feedback loop, thus getting better results. A successful big data project is far less about correlating disparate data structures and far more about articulating a testable hypothesis, designing that test, and evaluating the results.
Engineers, you will know when you are done because your business stakeholder will have a mandala-like glow of understanding surrounding them as they walk out of your office.
As CTO of Mogreet, Anthony Rossano directs all backend services, API, and messaging system development and operations. As a data geek, Anthony loves Ruby, NoSQL, clustered databases, key/value stores, and stuff like that.