Dr. Dobb's is part of the Informa Tech Division of Informa PLC




Automating Release Notifications


Java Tools for Web Page Retrieval and Text Extraction

The Jakarta Commons HttpClient library retrieves web pages efficiently; Listing One shows code that gets a page's content. Java has built-in support for parsing HTML: the HTMLEditorKit.ParserCallback class allows text extraction. The Jericho HTML Parser (jerichohtml.sourceforge.net/doc/index.html) is another option. Regular-expression parsing can be done with Java's java.util.regex.Matcher and Pattern classes; if full Perl 5 syntax support is required, the Jakarta ORO classes can be used. Retrieving pages from multiple websites concurrently is greatly simplified by the java.util.concurrent package and HttpClient's MultiThreadedHttpConnectionManager class. Java 6 features help as well: results are stored using Java 6's built-in JAXB support, and the TableRowSorter and RowFilter classes sort and filter the table of results in the UI.
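As a rough illustration of the HTMLEditorKit.ParserCallback approach, the sketch below collects the text runs of an HTML string using only classes that ship with the JDK. The class name TextExtractor and the method extractText are hypothetical names chosen for this example, not taken from the article's application, and the exact whitespace in the result depends on how the Swing parser splits text runs.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical helper class; the name is an assumption made for this sketch.
class TextExtractor {

    /** Returns the text content of an HTML string, with text runs joined by single spaces. */
    static String extractText(String html) {
        final StringBuilder text = new StringBuilder();

        // The parser calls handleText once for each run of character data it finds.
        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            @Override
            public void handleText(char[] data, int pos) {
                if (text.length() > 0) {
                    text.append(' ');
                }
                text.append(data);
            }
        };

        try {
            // The third argument (true) tells the parser to ignore charset directives,
            // which is appropriate because the input is already a decoded String.
            new ParserDelegator().parse(new StringReader(html), callback, true);
        } catch (IOException ex) {
            throw new UncheckedIOException(ex);
        }
        return text.toString();
    }

    public static void main(String[] args) {
        String html = "<html><body><h1>Release 2.1</h1>"
                    + "<p>Now <b>available</b>.</p></body></html>";
        System.out.println(extractText(html));
    }
}
```

The callback ignores tags entirely, so markup such as <b> never reaches the output; a fuller extractor might also override handleStartTag to skip elements whose text is not wanted.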

Finally, the Substance Look and Feel (https://substance.dev.java.net) gives the application a polished appearance.
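The regular-expression step mentioned earlier can be sketched with java.util.regex alone. The pattern below is purely an assumption for illustration: it looks for the word "version" or "release" followed by a dotted number, which may or may not resemble the patterns the article's application uses. VersionFinder and findVersion are hypothetical names.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class; the name and the pattern are assumptions made for this sketch.
class VersionFinder {

    // Case-insensitive: "version" or "release", whitespace, then a dotted number such as 2.1 or 2.1.4.
    private static final Pattern VERSION =
        Pattern.compile("(?i)\\b(?:version|release)\\s+(\\d+(?:\\.\\d+)+)");

    /** Returns the first dotted version number found in the text, or null if there is none. */
    static String findVersion(String text) {
        Matcher matcher = VERSION.matcher(text);
        return matcher.find() ? matcher.group(1) : null;
    }

    public static void main(String[] args) {
        System.out.println(findVersion("Acme Editor Release 2.1.4 is now available"));
    }
}
```

Compiling the Pattern once and reusing it is the usual choice when the same expression is applied to many pages, since Pattern instances are immutable and safe to share between threads.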

import java.io.ByteArrayOutputStream;
import java.io.InputStream;
import org.apache.commons.httpclient.HttpClient;
import org.apache.commons.httpclient.methods.GetMethod;

// Belongs inside a class that defines a logger field.
// Returns the body of the page at url as text, or null if retrieval fails.
public static String getContent(HttpClient httpClient, String url)
{
   logger.info("getting content from url: " + url);
   GetMethod get = new GetMethod(url);
   get.setFollowRedirects(true);
   // impersonate browser
   get.setRequestHeader("User-Agent",
      "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; .NET CLR 1.0.3705; "
      + ".NET CLR 1.1.4322; Media Center PC 4.0; .NET CLR 2.0.50727; "
      + ".NET CLR 3.0.04506.30)");
   get.setRequestHeader("Accept",
      "image/gif, image/x-xbitmap, image/jpeg, image/pjpeg, "
      + "application/x-shockwave-flash, application/xaml+xml, "
      + "application/vnd.ms-xpsdocument, application/x-ms-xbap, "
      + "application/x-msapplication, */*");
   get.setRequestHeader("Accept-Language", "en-us");
   get.setRequestHeader("Accept-Charset", "ISO-8859-1,utf-8;q=0.7,*;q=0.7");
   try
   {
      int statusCode = httpClient.executeMethod(get);
      logger.info("http status code: " + statusCode);
      InputStream is = get.getResponseBodyAsStream();
      ByteArrayOutputStream baos = new ByteArrayOutputStream(2048);
      byte[] buff = new byte[2048];
      int bytesRead = 0;
      // copy the response body into memory one buffer at a time
      while ( (bytesRead = is.read(buff)) != -1)
      {
         baos.write(buff, 0, bytesRead);
      }
      is.close();
      baos.close();
      // decode with the charset the server declared, not the platform default
      String text = baos.toString(get.getResponseCharSet());
      logger.debug("raw html=" + text);
      return text;
   }
   catch (Exception ex)
   {
      logger.error(ex);
   }
   finally
   {
      // always hand the connection back to the connection manager
      get.releaseConnection();
   }
   return null;
}
Listing One
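To show how java.util.concurrent can drive several retrievals at once, here is a minimal sketch that submits one task per URL to a fixed-size thread pool and collects the results. ConcurrentFetcher, fetchAll, and the fetcher parameter are hypothetical names introduced for this example. In a real application the fetcher would presumably delegate to getContent from Listing One, using an HttpClient built on MultiThreadedHttpConnectionManager; here it is a plain function so the example runs without a network. The lambda syntax requires Java 8 or later, which postdates the Java 6 features the article discusses.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;

// Hypothetical helper class; all names here are assumptions made for this sketch.
class ConcurrentFetcher {

    /**
     * Fetches every URL on a pool of the given size and returns url -> content
     * in submission order. A fetch that throws is recorded as null.
     */
    static Map<String, String> fetchAll(List<String> urls,
                                        Function<String, String> fetcher,
                                        int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            // Submit everything first so the fetches overlap.
            Map<String, Future<String>> pending = new LinkedHashMap<>();
            for (String url : urls) {
                Callable<String> task = () -> fetcher.apply(url);
                pending.put(url, pool.submit(task));
            }

            // Then wait for each result in turn.
            Map<String, String> results = new LinkedHashMap<>();
            for (Map.Entry<String, Future<String>> entry : pending.entrySet()) {
                try {
                    results.put(entry.getKey(), entry.getValue().get());
                } catch (ExecutionException ex) {
                    results.put(entry.getKey(), null);
                } catch (InterruptedException ex) {
                    Thread.currentThread().interrupt();
                    break;
                }
            }
            return results;
        } finally {
            pool.shutdownNow();
        }
    }

    public static void main(String[] args) {
        // Stand-in fetcher; a real one would perform the HTTP request.
        Map<String, String> pages = fetchAll(
            List.of("http://example.com/a", "http://example.com/b"),
            url -> "content of " + url,
            2);
        pages.forEach((url, body) -> System.out.println(url + " -> " + body));
    }
}
```

Submitting all tasks before calling get on any Future is what lets the downloads overlap; calling get immediately after each submit would serialize them.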

