Channels ▼
RSS

Design

Computational Linguistics and Text Mining


Distance Between Text Fragments

Having this research as a base, is it possible to find out how two text fragments are different from each other? In information theory, there are known ways to measure the edit distances between two fragments of sentences. These techniques are not effective since they don't evaluate the grammatical structure of the text fragments. For instance, Levenshtein proposed a model wherein single-character edits for transforming one text fragment to another is considered as the distance between these fragments). In the method that I devised using computational linguistics techniques, I have found an effective way to find the "grammatical distance" between text fragments.

Consider two similar text fragments, Fragment 1 and Fragment 2:

Grammatical deconstruction of the text based on POSTAL of Fragment 1: Life is an untold tale. generates the results in Figure 2.

TextDistance
Figure 2: Life is an untold tale.

Similarly, for Fragment 2: Acceptance is a subtle pain. the result is shown Figure 3.

TextDistance
Figure 3: Acceptance is a subtle pain.

Any string-comparison tool or distance-measuring tool will not treat these two fragments as "equal" though they are "grammatically equal." When I say that two text fragments are grammatically equal, they are identical in their POS sequences and ordering as shown in Figures 2 and 3.

Computing the Distance

The grammatical distance between these two strings is computed as follows: Both the POSTALS Prints are overlaid and the distance of one POS tag to the corresponding node in the other fragment is computed considering lesser weighting of distance for the earlier POS tags to a greater weighting of distance for the later POS tags.

Here is a reference implementation in Java:

    private static int getPostalsDistance(String s1, String s2) {
        String pennString1 = getPennString(s1);
        String cleanPenn1 = getCleanPenn(pennString1);
        String pennString2 = getPennString(s2);
        String cleanPenn2 = getCleanPenn(pennString2);
        StringTokenizer stok1 = new StringTokenizer(cleanPenn1, "+");
        ArrayList array1 = new ArrayList();
        ArrayList array2 = new ArrayList();
        while (stok1.hasMoreTokens()) {
            array1.add(stok1.nextToken());
        }
        StringTokenizer stok2 = new StringTokenizer(cleanPenn2, "+");
        while (stok2.hasMoreTokens()) {
            array2.add(stok2.nextToken());
        }
        int postalsDistance = 0;
        //Find the shorter pattern
        if (array1.size() < array2.size()) {
            for (int i = 0; i < array1.size(); i++) {
                String p1 = (String) array1.get(i);
                String p2 = (String) array2.get(i);
                if (!p1.equals(p2)) {
                    postalsDistance += i;
                }
            }
        } else {
            for (int i = 0; i < array2.size(); i++) {
                String p1 = (String) array2.get(i);
                String p2 = (String) array1.get(i);
                if (!p1.equals(p2)) {
                    postalsDistance += i;
                }
            }
        }
        return postalsDistance;
    }

What is more interesting is that when I have the ability to compute the POSTALS distance between two random text fragments, I can enable the system to construct similar sentences based on the POSTALS distance. I have built a GUI which, based on the input sentence and the POSTALS distance, generates random sentences (Figure 4).

TextDistance
Figure 4: GUI for the POSTALS system.

Conclusion

Using a combination of computational linguistics and text mining, we can build a complex text-generating system that can "talk" and "respond" like humans when trained effectively. What is the need of computing the POSTALS Print? How will this information be useful in linguistic studies? I believe:

  • In future, machines will have the ability to construct simple-to-complex English sentences with almost 100% accuracy provided the mined text base is reasonably huge and accurate.
  • The machines will start "understanding" the mood of sentences and be able to group sentences based on mood and other factors.
  • A grammar-checking system that flags bad grammar and auto-suggests right structures purely based on mined data is possible.
  • An AI-based learning system that can "understand" and "communicate" with us in English is possible.

When they do this, this kind of fingerprinting and categorization of sentences will play a crucial part in their linguistic capabilities.


Frank Jennings is a Senior Content and Community Lead at Adobe Systems.

[Update: Fixed number formatting and added some explanatory text.]


Related Reading


More Insights






Currently we allow the following HTML tags in comments:

Single tags

These tags can be used alone and don't need an ending tag.

<br> Defines a single line break

<hr> Defines a horizontal line

Matching tags

These require an ending tag - e.g. <i>italic text</i>

<a> Defines an anchor

<b> Defines bold text

<big> Defines big text

<blockquote> Defines a long quotation

<caption> Defines a table caption

<cite> Defines a citation

<code> Defines computer code text

<em> Defines emphasized text

<fieldset> Defines a border around elements in a form

<h1> This is heading 1

<h2> This is heading 2

<h3> This is heading 3

<h4> This is heading 4

<h5> This is heading 5

<h6> This is heading 6

<i> Defines italic text

<p> Defines a paragraph

<pre> Defines preformatted text

<q> Defines a short quotation

<samp> Defines sample computer code text

<small> Defines small text

<span> Defines a section in a document

<s> Defines strikethrough text

<strike> Defines strikethrough text

<strong> Defines strong text

<sub> Defines subscripted text

<sup> Defines superscripted text

<u> Defines underlined text

Dr. Dobb's encourages readers to engage in spirited, healthy debate, including taking us to task. However, Dr. Dobb's moderates all comments posted to our site, and reserves the right to modify or remove any content that it determines to be derogatory, offensive, inflammatory, vulgar, irrelevant/off-topic, racist or obvious marketing or spam. Dr. Dobb's further reserves the right to disable the profile of any commenter participating in said activities.

 
Disqus Tips To upload an avatar photo, first complete your Disqus profile. | View the list of supported HTML tags you can use to style comments. | Please read our commenting policy.
 

Video