Applying the Big Data Lambda Architecture


The Batch Layer

The raw data is first pre-processed and loaded into Hive. In Hive (remember, this constitutes the master dataset in the batch layer of our USN app) the following schema is used:

CREATE TABLE usn_base (
 actiontime STRING,
 originator STRING,
 action STRING,
 network STRING,
 target STRING,
 context STRING
) ROW FORMAT DELIMITED FIELDS TERMINATED BY '|';
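To make the row layout concrete, here is a minimal Python sketch of how one pipe-delimited line maps onto the six columns above. The UsnRecord type, the parse_line helper, and the sample line are hypothetical and only for illustration; they are not part of the USN app. Note that this sketch rejects malformed lines, whereas Hive would load them and fill missing columns with NULL.

```python
from typing import NamedTuple


class UsnRecord(NamedTuple):
    """One row of usn_base; field names mirror the Hive columns."""
    actiontime: str
    originator: str
    action: str
    network: str
    target: str
    context: str


def parse_line(line: str) -> UsnRecord:
    """Split a '|'-delimited line into the six usn_base columns."""
    fields = line.rstrip("\n").split("|")
    if len(fields) != 6:
        raise ValueError(f"expected 6 fields, got {len(fields)}: {line!r}")
    return UsnRecord(*fields)


# Hypothetical sample line; the real data may differ in detail.
sample = "2012-03-12T22:54:13-07:00|Michael|ADD|I|Ora Hatfield|bla"
record = parse_line(sample)
print(record.originator, record.action, record.target)
```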

To import the CSV data and build the master dataset, the shell script batch-layer.sh executes the following HiveQL commands:

LOAD DATA LOCAL INPATH '../data/usn-base-data.csv' INTO TABLE usn_base;

DROP TABLE IF EXISTS usn_friends;

CREATE TABLE usn_friends AS
SELECT actiontime, originator AS username, network, 
       target AS friend, context AS note
FROM usn_base
WHERE action = 'ADD'
ORDER BY username, network;
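For readers less familiar with HiveQL, the CREATE TABLE ... AS SELECT statement above boils down to a filter, a projection with renamed columns, and a sort. A rough in-memory Python equivalent might look like the following, assuming rows are dicts keyed by the usn_base column names. The build_usn_friends helper, the sample rows, and the 'REMOVE' action value are made up for illustration.

```python
def build_usn_friends(base_rows):
    """Mimic the CTAS query: keep 'ADD' actions, rename columns, sort."""
    friends = [
        {
            "actiontime": row["actiontime"],
            "username": row["originator"],
            "network": row["network"],
            "friend": row["target"],
            "note": row["context"],
        }
        for row in base_rows
        if row["action"] == "ADD"
    ]
    # Sort by user first, then by network, as in the ORDER BY clause.
    return sorted(friends, key=lambda r: (r["username"], r["network"]))


# Made-up sample rows for illustration only.
base_rows = [
    {"actiontime": "2012-03-12T22:54:13", "originator": "Ted", "action": "ADD",
     "network": "T", "target": "Ann", "context": "conference"},
    {"actiontime": "2012-04-01T08:00:00", "originator": "Michael", "action": "REMOVE",
     "network": "I", "target": "Bob", "context": "spam"},
    {"actiontime": "2012-05-20T10:30:00", "originator": "Michael", "action": "ADD",
     "network": "G", "target": "Cleo", "context": "work"},
]

usn_friends = build_usn_friends(base_rows)
for row in usn_friends:
    print(row["username"], row["network"], row["friend"])
```

The sketch holds everything in memory, which is fine for a few hundred rows; Hive performs the same steps as distributed jobs over the data in HDFS.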

With this, the USN app master dataset is ready and available in HDFS and I can move on to the next layer, the serving layer.

The Serving Layer of the USN App

The batch view used in the USN app is realized via an HBase table called usn_friends. This table is then used to drive the USN app front-end; it has the schema shown in Figure 4.

Figure 4: HBase schema used in the serving layer of the USN app.

After building the serving layer, I can use the HBase shell to verify that the table usn_friends, which holds the batch view, has been created with the expected schema:

$ ./bin/hbase shell
hbase(main):001:0> describe 'usn_friends'
...
 {NAME => 'usn_friends', FAMILIES => [{NAME => 'a', DATA_BLOCK_ENCODING => 'NONE',        true
 BLOOMFILTER => 'NONE', REPLICATION_SCOPE => '0', VERSIONS => '3', COMPRESSION => 'NONE',
 MIN_VERSIONS => '0', TTL => '-1', KEEP_DELETED_CELLS => 'false', BLOCKSIZE => '65536',
 IN_MEMORY => 'false', ENCODE_ON_DISK => 'true', BLOCKCACHE => 'false'}]}
1 row(s) in 0.2450 seconds

You can have a look at some more queries used in the demo user interface on the Wiki page of the GitHub repository.

Putting It All Together

Once the batch and serving layers have been initialized and launched as described, you can start the user interface. To use the CLI, make sure that HBase and the HBase Thrift service are running, and then run the following in the main USN app directory:

$ ./usn-ui.sh
This is USN v0.0

u ... user listings, n ... network listings, l ... lookup, s ... search, h ... help, q ... quit

Figure 5 shows a screenshot of the USN app front-end in action:

Figure 5: Screenshot of the USN app command line user interface.

The four main operations the USN front-end provides are as follows:

  • u ... user listing lists all acquaintances of a user
  • n ... network listing lists acquaintances of a user in a network
  • l ... lookup listing lists acquaintances of a user in a network and allows restrictions on the time range (from/to) of the acquaintanceship
  • s ... search provides search for an acquaintance over all users, allowing for partial match
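The semantics of these operations can be sketched in a few lines of Python over an in-memory list of friend records. This is only an illustration under assumptions: the real front-end queries HBase through the Thrift service, the function names, network codes, and sample data are hypothetical, and timestamps are assumed to be ISO 8601 strings in a single time zone so that plain string comparison orders them correctly.

```python
def user_listing(friends, username):
    """u: all acquaintances of a user, across networks."""
    return [f for f in friends if f["username"] == username]


def network_listing(friends, username, network):
    """n: acquaintances of a user within one network."""
    return [f for f in user_listing(friends, username) if f["network"] == network]


def lookup(friends, username, network, start=None, end=None):
    """l: like network_listing, optionally restricted to a time range."""
    hits = network_listing(friends, username, network)
    if start is not None:
        hits = [f for f in hits if f["actiontime"] >= start]
    if end is not None:
        hits = [f for f in hits if f["actiontime"] <= end]
    return hits


def search(friends, term):
    """s: case-insensitive partial match on the friend name, over all users."""
    needle = term.lower()
    return [f for f in friends if needle in f["friend"].lower()]


# Made-up records shaped like rows of usn_friends.
friends = [
    {"actiontime": "2012-05-20T10:30:00", "username": "Michael", "network": "G", "friend": "Cleo"},
    {"actiontime": "2012-03-12T22:54:13", "username": "Michael", "network": "T", "friend": "Ann"},
    {"actiontime": "2012-09-01T12:00:00", "username": "Michael", "network": "T", "friend": "Annika"},
    {"actiontime": "2012-07-04T09:15:00", "username": "Ted", "network": "T", "friend": "Ann"},
]

print([f["friend"] for f in user_listing(friends, "Michael")])
print([f["friend"] for f in network_listing(friends, "Michael", "T")])
print([f["friend"] for f in lookup(friends, "Michael", "T", start="2012-06-01")])
print([(f["username"], f["friend"]) for f in search(friends, "ann")])
```

Each operation narrows the previous one, which is why a single batch view organized by user and network can serve the first three; the search operation scans across all users.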

An example USN app front-end session is available at the GitHub repo for you to study.

What's Next?

I have intentionally kept USN simple. Although fully functional, it has several limitations (due to space restrictions here). I can suggest several improvements you could try, using the available code base as a starting point.

  • Bigger data: The most obvious limitation is not the app itself but the data size. Only a laughable 500 rows? This isn't Big Data, I hear you say. Rightly so. Nothing stops you from generating 500 million rows or more and trying it out. Certain processes, such as pre-processing and generating the layers, will take longer, but no architectural changes are necessary, and this is the whole point of the USN app.
  • Creating a full-blown batch layer: Currently, the batch layer is a sort of one-shot, while it should really run in a loop and append new data. This requires partitioning of the ingested data and some checks. Pail, for example, allows you to do the ingestion and partitioning in a very elegant way.
  • Adding a speed layer and automated import: It would be interesting to automate the import of data from the various social networks. For example, Google Takeout allows exporting all data in bulk mode, including G+ Circles. For a stab at the speed layer, one could try to use the Twitter firehose along with Storm.
  • More batch views: There is currently only one view (friend list per network, per user) in the serving layer. The USN app might benefit from different views to enable different queries most efficiently, such as time-series views of network growth or overlaps of acquaintanceships across networks.
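As a starting point for the last suggestion, the two additional views mentioned (network growth over time and overlaps of acquaintanceships across networks) could be prototyped in plain Python before committing to a Hive query or an HBase table layout. The sketch below assumes records shaped like usn_friends rows with ISO 8601 timestamps; the function names, network codes, and sample data are hypothetical.

```python
from collections import Counter, defaultdict


def growth_by_month(friends):
    """Count new acquaintances per (network, 'YYYY-MM') bucket."""
    counts = Counter()
    for f in friends:
        month = f["actiontime"][:7]  # 'YYYY-MM' prefix of an ISO 8601 string
        counts[(f["network"], month)] += 1
    return counts


def cross_network_overlap(friends, username):
    """Map each friend a user has in more than one network to those networks."""
    networks = defaultdict(set)
    for f in friends:
        if f["username"] == username:
            networks[f["friend"]].add(f["network"])
    return {name: sorted(nets) for name, nets in networks.items() if len(nets) > 1}


# Made-up records for illustration only.
friends = [
    {"actiontime": "2012-03-12T22:54:13", "username": "Michael", "network": "T", "friend": "Ann"},
    {"actiontime": "2012-03-30T08:00:00", "username": "Michael", "network": "G", "friend": "Ann"},
    {"actiontime": "2012-05-20T10:30:00", "username": "Michael", "network": "G", "friend": "Cleo"},
    {"actiontime": "2012-03-04T09:15:00", "username": "Ted", "network": "T", "friend": "Ann"},
]

growth = growth_by_month(friends)
overlap = cross_network_overlap(friends, "Michael")
print(sorted(growth.items()))
print(overlap)
```

In a real batch layer, each of these aggregations would be precomputed over the master dataset and written to its own table in the serving layer, so that the front-end only performs lookups.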

I hope you have as much fun playing around with the USN app and extending it as I had writing it in the first place. I'd love to hear back from you on ideas or further improvements either directly here as a comment or via the GitHub issue tracker of the USN app repository.

Michael Hausenblas is the Chief Data Engineer EMEA, MapR Technologies.

