
Mike Riley

Dr. Dobb's Bloggers

Natural Language Processing with Python Book Review

October 11, 2009

Python is well known among programmers and system administrators alike for its powerful libraries, ranging from web frameworks and image processing to automated workflows and gaming.  A lesser-known yet extremely powerful Python library is the Natural Language Toolkit.  Natural Language Processing with Python demonstrates how to leverage this toolkit to create sophisticated NLP applications.  Read on for my review.

Ever since I programmed my first interactive application in BASIC on a TRS-80 nearly 25 years ago, I have dreamed of fluid, natural conversations with computers, à la science fiction stories like Arthur C. Clarke's 2001: A Space Odyssey and Philip K. Dick's Do Androids Dream of Electric Sheep?  The computing world has evolved by leaps and bounds since then, though we have yet to attain that elusive vision of ubiquitous, natural conversational interaction with a computer.  Text-to-speech engines are fast approaching the sound and inflection of a convincing human voice, and voice recognition has also greatly improved (my Android phone is highly accurate with short phrases - not quite 100%, but far better than what voice recognition was like even a few years ago).  Still, an intelligent back-end is required to hook all these technologies together into a cohesive, effortless user experience.  While Natural Language Processing with Python didn't quite attain this lofty goal, it educated me on the nuances of NLP and the difficult computing problems that need to be resolved before this futuristic vision can become commonplace.

The book starts off with the terms and concepts behind NLP and introduces the free, open source Natural Language Toolkit (NLTK).  It then walks through installing the toolkit, downloading the NLTK book data collection and running some simple Python scripts that show off the NLTK's functions and measure lexical diversity.  A fun exercise is running nltk.chat.chatbots(), which shows how NLP can interact with users in a not-quite-there Turing Test sort of way.

The next 10 chapters delve into all things NLP.  They cover accessing and processing large bodies of text (both text corpora and raw formats), a quick Python primer oriented toward NLP in Chapter 4 (complete with Matplotlib and NumPy data visualization examples), and using a part-of-speech tagger and automating such tagging via regular expressions, lookups and n-gram tagging.  Text and sequence classification and recognizing textual entailment (e.g., predicting the true/false relationships of text within a statement) are covered in Chapter 6, along with decision trees, information gain (a measure of "how much more organized the input values become when we divide them up using a given feature"), naive Bayes classifiers ("every feature gets a say in determining which label should be assigned to a given input value"), zero counts, smoothing, maximum entropy classifiers and linguistic pattern modeling.  From there the book moves on to information extraction architecture for unstructured data, chunking, chinking and tag patterns, tree traversal, named entity recognition (NER), relation extraction and more.  Chapter 8 covers sentence structure analysis (i.e., dealing with ubiquitous ambiguity) and context-free, dependency and weighted grammars, with feature-based grammars discussed in Chapter 9.  All this dense background comes together in Chapter 10, which applies an NLP interface to an underlying SQL-structured data source and uses propositional and first-order logic to understand the semantics of English sentences via the Principle of Compositionality, lambda calculus, quantified NPs, transitive verbs and discourse representation structures (DRS).

The final chapter covers managing linguistic data from various sources such as the web, Word document files and spreadsheets.  It uses the TIMIT corpus (developed by a consortium including Texas Instruments and the Massachusetts Institute of Technology) as its demonstration and concludes with an extended welcome to the Open Language Archives Community (OLAC).  The book closes with an Afterword that engages the reader in the various computational challenges of state-of-the-art NLP systems, lays out the NLTK roadmap and extends a bold invitation to "build new language technologies to better serve the needs of the information society, and ultimately as a pathway into deeper understanding of the vast riches of human language."  Who could turn down such an offer?!
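To make the information gain idea mentioned above a little more concrete, here is a minimal sketch in plain Python.  It is my own illustration rather than code from the book or the NLTK: the function names, the feature names and the four-row toy data set are all assumptions made up for the example.  It computes the entropy of the labels before a split and subtracts the weighted entropy of the groups after splitting on a feature.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy, in bits, of a list of class labels."""
    total = len(labels)
    if total == 0:
        return 0.0
    counts = Counter(labels)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def information_gain(examples, feature):
    """Label entropy minus the weighted entropy after splitting on `feature`.

    `examples` is a list of (feature_dict, label) pairs.
    """
    labels = [label for _, label in examples]
    # Group the labels by the value each example has for the chosen feature.
    groups = {}
    for features, label in examples:
        groups.setdefault(features[feature], []).append(label)
    remainder = sum(len(g) / len(labels) * entropy(g) for g in groups.values())
    return entropy(labels) - remainder


# Hypothetical toy data: two made-up features for guessing a label.
examples = [
    ({"last_letter": "a", "long": True}, "female"),
    ({"last_letter": "a", "long": False}, "female"),
    ({"last_letter": "k", "long": True}, "male"),
    ({"last_letter": "k", "long": False}, "male"),
]

print(information_gain(examples, "last_letter"))  # 1.0
print(information_gain(examples, "long"))         # 0.0
```

In this toy data the last_letter feature separates the labels perfectly, so it earns the full one bit of gain, while the long feature leaves each group as mixed as before and earns none.  A decision tree learner would prefer the first feature for its split.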

Each chapter concludes with a series of exercises of varying difficulty; unfortunately, answers to the exercises are nowhere to be found, not even on the book's website.  Some of the more public-facing examples of NLP in action can be found on popular web sites including ask.com and wolframalpha.com.  The authors do not point readers to commercial entities that have successfully incorporated the NLTK into their backend data processing applications, but such websites no doubt employ the principles discussed in the book.

In summary, Natural Language Processing with Python delivers a solid education for any computing professional interested in the complexity and current state of the art in NLP systems.  Python programmers will find the NLTK's implementation and use of NLP principles especially Pythonic.  While my dream of having an intelligent spoken-word conversation with my computer may have to wait for another 25 years of computing evolution, this book helped me understand the complexities of the problem and ways to get closer to the solution.

 

 
Title:  Natural Language Processing with Python
Authors: Steven Bird, Ewan Klein and Edward Loper
Publisher: O'Reilly Media
ISBN: 978-0-596-51649-9
Pages: 512
Price: $44.99 US

 
