Sentences in English have patterns that can be identified and extracted through Natural Language Processing (NLP) and computational linguistic techniques. Imagine the immense power a machine would have if it could construct simple or complex sentences by precisely comprehending the technique to construct meaningful sentences. Computational linguistics is a field that has long fascinated me and made me build various NLP-based systems for text extraction and mining. In this article, I explain algorithmically how various text fragments differ from each other.
POS Tagging and Learning
A few months back, I had this interesting idea of prototyping English constructs so that random well-formed sentences can be framed by bots/machines using predefined constructs. One way to achieve this is through computational linguistics and machine learning. I started my journey by analyzing all the "good" sentences in the "world" to build a system that identifies the grammatical structures of these sentences. Every sentence has a grammatical signature, which is not unique, but definitely evident. I've termed this grammatical signature "POSTALS Print." POSTALS (Part of Speech Tagging and Learning System) is an NLP system that I am building that identifies and learns these grammatical signatures. POSTALS Print is just like a fingerprint, with the notable difference that it is not unique.
For instance, consider this sentence:
"Happiness is but an occasional episode in the general drama of pain."
Let us explode this sentence to reveal its structure. I rely on the Penn Treebank bracketing guidelines:
(ROOT (S (NP (NN Happiness)) (VP (VBZ is) (ADVP (CC but)) (NP (NP (DT an) (JJ occasional) (NN episode)) (PP (IN in) (NP (NP (DT the) (JJ general) (NN drama)) (PP (IN of) (NP (NN pain))))))) (. .)))
Where:
S
= Simple declarative causeNP
= Noun PhraseNNS
= Plural NounVP
= Verb PhraseVBZ
= Third person singular present verbADVP
= Adverb PhraseCC
= Coordinating conjunctionNP
= Noun PhraseDT
= DeterminerJJ
= AdjectiveNN
= Singular NounPP
= Prepositional PhraseIN
= Subordinating conjunction
As you can see, English sentences can be easily prototyped into a POS tree. Now, let me further flatten the tree to get the POSTALS Print, which is the grammatical signature of the sentence:
S+NP+NN+VP+VBZ+ADVP+CC+NP+NP+DT+JJ+NN+PP+IN+NP+NP+DT+JJ+NN+PP+IN+NP+NN+
.
Change any one of the words while maintaining the same word categories and you can create many different sentences that have the same fingerprint.
Web Crawling and Text Mining
To enable the software to construct and validate text fragments, I need to train it with a reasonably accurate and massive text data mined from reliable text corpora on the Web. What is so massive and almost reliable? Wikipedia! If I can successfully extract all possible POSTALS Prints from all the Wikipedia articles, I will have the ability to analyze them and use them as a feed to train the system.
To do this and even extend it farther, I setup a cluster of 8 quad-core dedicated servers to extract and analyze some of the top eBooks from Project Gutenberg, popular Web pages, Wikipedia, and other artifacts from various literature sites. My clustered system, POSTALS, kept analyzing sentences and "learned" the usages of these sentences (Figure 1). I used a combination of Stanford's POS tagger and the LEX Parser.
Figure 1: POSTALS in action.
POSTALS also compared different sentences and computed scores similar to the popular Levenshtein distance, which is a string metric for measuring the difference between two sequences.
As I write this article, the system has already understood the construction semantics of text fragments through a set of 27 million sentences extracted from top 100 eBooks from Project Gutenberg and the sources I just mentioned.
A Little Math
Now, the biggest challenge for my system is to handle this massive data. Hypothetically, if
N
= Count of all possible POS tags,
S Min
= A
= Count of minimum words in a sentence,
S Max
= B
= Count of maximum words in a sentence,
Then, the total number of POSTALS Prints T
that can be created:
T=NA+NA+1+NA+2+…NB
So, to evaluate POSTALS Print for sentences having a minimum of 3 words and a maximum of 20 words with a total POS combination of 84 POS tags (using full PENN POS tags for maximum accuracy), we need to allow the system to learn:
843+844+845+…8420 text fragments.
This is a big number! Thus, POSTALS cannot truly aim at building so many POSTALS Prints. However, the good thing is that not all combinations of POS tags are valid English. In fact, later in my research, I did find the number to be much more manageable.
POS Tags Attract and Repel Each Other!
When I allowed POSTALS to crawl the Web, I was primarily focusing on creating POS bigrams for prominent POS tags. I was mostly interested in knowing which POS tags attract each other and which POS tags repel each other. After completing my the crawl through 27-million sentences, I computed a POS bigram distribution without probability computation. This bigram data can be used to computationally generate valid sentences on the go.
For example, from my research data, I know that for having a simple third-person singular present verb (VBZ
), I need to assume that the following POS tags appear more frequently before the VBZ
tag than do other tags:
[VP=22237435, RB=210375, VBZ=134798, SQ=27674, DT=7853, NN=6526, JJ=2556, CC=2396, SINV=2256, ,=2081, IN=1916, NNP=1527, RBR=1409, NNS=1363, PP=847, PRP=534, VBD=505, RBS=498, CD=402, "=367, JJS=341, NP=233, FW=120, RP=80, JJR=69, NNPS=64, -RRB-=47, VBN=31, TO=27, WRB=18, PRP$=15, WP=15, ''=12, NX=12, VB=11, WDT=10, :=8, VBP=7, POS=5, UH=1, VBG=1, SYM=0
]
And the following POS tags appear more frequently after the VBZ
tag:
[NP=8101503, VP=3964786, PP=2115363, ADJP=1994864, ADVP=1851702, S=1422499, SBAR=929467, RB=899301, .=513813, ,=310447, PRT=288317, CC=174890, UCP=44360, PRN=8591, :=4714, WHNP=3703, -RRB-=1920, -LRB-=1268, CONJP=1104, SQ=1059, "=769, ''=615, WHPP=571, NN=551, FRAG=437, SBARQ=257, INTJ=223, VBD=142, NNS=119, JJ=110, NNP=87, WHADJP=72, IN=54, X=23, VBP=13, JJR=9, VBZ=9, QP=8, VB=5, VBN=5, NNPS=2, CD=1, WHADVP=1
]
In the above example, VP=22237435
indicates the number occurrences of the tag VP
before the tag VBZ
is 22 million based on the crawled data with VBZ appearing 35134503 times in the Wikipedia text corpus.
Just to illustrate this case a little, to construct a valid sentence containing an adjective phrase (ADJP
), I use the following bigram distribution to compute the conditional probability of POS tags occurring with each other.
The following POS tags appear more frequently before the ADJP
tag:
[VBZ=1721580, DT=1204446, VBD=925102, VB=899263, VBP=833402, NP=726136, RB=711666, CC=485817, NN=362586, ,=343154, S=231026, VBN=199836, PRP=197600, NNS=197390, IN=175953, VBG=155692, JJ=91946, PRP$=86057, UCP=80301, FRAG=79791, NNP=75007, VP=32206, ADJP=27216, TO=21149, CD=17037, WRB=10900, SINV=8765, JJS=4479, JJR=4040, NNPS=3727, RBR=3627, WP$=3284, -LRB-=2364, "=2311, :=2304, RRC=2188, RP=1992, START=1542, ADVP=1505, NX=1429, EX=1238, -RRB-=1016, POS=824, FW=715, WHNP=525, SBAR=487, WDT=461, RBS=416, ''=321, SYM=219, PP=140, WP=109, PDT=30, UH=23, MD=14, LS=8, .=1
]
And the following POS tags appear more frequently after the ADJP
tag:
[JJ=5481218, RB=2569666, ADJP=423890, RBR=380790, RBS=236380, VBN=189200, JJR=147418, ADVP=101392, NP=90829, NN=56058, DT=49134, IN=44174, NNP=44171, JJS=40727, VBG=19958, PRP$=11235, RP=11029, WHADVP=6822, FW=5976, QP=5477, $=4369, PRN=4057, CD=3461, VB=3410, VBP=2738, "=1664, CC=1543, PP=1514, VBD=1218, #=1086, ,=1022, WRB=433, NNS=186, -LRB-=60, CONJP=58
]
From the above distribution, it is evident that English speakers often use a third-person singular present verbs before an adjective phrase. When I train this system to make it learn a billion POSTALS Prints, it will have the intelligence to deconstruct and construct sentences with precision.