Data Analysis with Open Source Tools Book Review
Given the large number of open source data collection and analysis libraries and utilities freely available on the Internet, the concept of combining data analysis with open source tools is a topic worthy of deeper exploration. How well does author Philipp Janert fare with this effort? Read on to find out.
This is the second book I've reviewed in the past month that was written by a physicist turned software developer and book author. However, unlike Ruby on Rails Tutorial author Michael Hartl, Data Analysis with Open Source Tools author Philipp Janert has pursued a consulting practice in algorithm development, data analysis, and mathematical modeling. That specialty makes him an ideal subject matter expert to write such a book.
The book is divided into four parts: Graphics, Analytics, Computation and Applications. The introduction opens with the sentence: "Imagine your boss comes to you and says: Here are 50 GB of logfiles - find a way to improve our business!" And thus the journey of learning how to dissect, differentiate, visualize and report vast quantities of data begins. Employing copious code samples and libraries in both the Python and R languages, coupled with the math required to understand how the algorithms work, the author steps readers through several hefty lessons in algorithmic development and implementation. Even though the book was written for a beginner and intermediate audience, some of the topics might be intimidating for those with weak math backgrounds. The author also opted to constrain the book's scope to general data analysis topics; as such, things like network analysis, natural language processing and "big data" are not covered.
The choice to open the first part of the book on the topic of graphics was a good one, since rendering the data in histograms and in probability, rank-order, scatter and time-series plots, with a little help from the NumPy, matplotlib and SciPy Python libraries, helps readers see trends and patterns more easily. Other open source tools demonstrated by the author include Gnuplot, the GNU Scientific Library (GSL), Pycluster (and the C Clustering Library), SimPy, Berkeley DB and SQLite, to name a few. In addition to a basic overview of classical statistics, the author presents computational analytic examples using Bernoulli trials and the Gaussian, Geometric, Poisson and Power-Law distributions.
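To give a flavor of the kind of exercise this part of the book deals with, here is a minimal sketch that is not taken from the book: it draws simulated counts from a Poisson distribution and bins them into a histogram using NumPy. The data, the rate parameter and the variable names are all assumptions made up for illustration, and the histogram is printed as text so the sketch needs no plotting library.

```python
import numpy as np

# Hypothetical data: 1,000 simulated event counts (not from the book).
# The rate lam=4.0 and the seed are arbitrary choices for illustration.
rng = np.random.default_rng(seed=42)
counts = rng.poisson(lam=4.0, size=1000)

# Use one bin per integer value, centered on that value, so each bar
# of the histogram covers exactly one possible count.
bin_edges = np.arange(counts.max() + 2) - 0.5
frequencies, _ = np.histogram(counts, bins=bin_edges)

# Crude text rendering: one '#' for every five observations.
for value, freq in enumerate(frequencies):
    print(f"{value:2d} | {'#' * (freq // 5)}")
```

With matplotlib installed, the same `counts` array could instead be passed to `matplotlib.pyplot.hist` to produce a graphical version of the plot.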
The third part of the book reviews the mining of data via simulation, clustering, Principal Component Analysis (PCA), self-organizing maps (SOMs, a.k.a. Kohonen maps), and others. The book's fourth and final part, on applications, puts the data to use via business intelligence, dashboards, and reports. Examples include financial calculation and modeling (e.g., depreciation, direct/indirect and opportunity costs, fixed/variable costs, etc.), predictive analytics (e.g., Bayesian, decision tree, instance and rule-based classifiers and nearest-neighbor methods) and others. The book also has three appendixes: one that reviews programming environments for scientific computation and data analysis, a rather lengthy one on various calculus tricks and techniques, and a final one on working with collections of data (featuring a phrase that I will no doubt use on occasion from now on, this being the "care and feeding of the data zoo").
Each chapter concludes with a workshop section featuring programming examples that reiterate and apply that chapter's principles. There are also insightful opinions interspersed throughout the book, such as an essay on why the map/reduce craze may not be the silver bullet its proponents advertise it to be, and a discussion of the nature of statistical learning. Unlike other O'Reilly-published books, Data Analysis with Open Source Tools doesn't have any of the usual 'animal track' tips and 'bear trap' warnings, so these brief commentaries by the author help connect the theory with real-world expectations and practices. The book's epilogue is aptly named "Facts Are Not Reality", reminding readers who may have become overly enthusiastic about the power of computation that "data-driven decision making is a contradiction in terms", "when the data contradicts appearances, appearances will win" and, the most potent reminder of all, "The most important things in life can't be measured." But for that which can be, this book has plenty of good advice and smart techniques to offer its audience.
Title: Data Analysis with Open Source Tools: A Hands-On Guide for Programmers and Data Scientists
Author: Philipp K. Janert
Publisher: O'Reilly Media
ISBN: 978-0-596-80235-6
Pages: 536
Price: $39.99