The Intractable Screen Scraping Paradox


Windows Security

Does providing web access to your data have to mean surrendering control over its use?

Screen scraping is a threat to any web site that dispenses information that may be of value to others when parsed and stored in a database. It is a more serious threat when that information could be misused in ways that would violate other people's expectations of privacy or peace.

The term "screen scraping" once had a specific technical meaning. It was a programming technique for interacting with legacy terminal-based information systems that required data input and provided data output one green screen at a time. By parsing output based on its position on the screen, which was usually predictable, and by automating keyboard buffer input with the correct whitespace characters in the right places to switch between input fields, programmers could integrate legacy systems that were still reliable and operational into new applications rather than rewriting them.
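As a rough illustration of that older technique, the following sketch reads a field from a captured screen by its fixed position and builds the keystrokes to fill in a form. The class name, the screen layout, and the choice of Tab to move between fields and Enter to submit are all assumptions made for the example, not details of any particular terminal system.

```csharp
using System;

// Hypothetical helper for a captured terminal screen held as an array of rows.
public static class GreenScreen
{
    // Reads the field that starts at a fixed zero-based row and column.
    public static string ReadField(string[] screen, int row, int column, int length)
    {
        // Pad short rows so Substring never runs past the end of the line.
        string line = screen[row].PadRight(column + length);
        return line.Substring(column, length).Trim();
    }

    // Builds the keystrokes for a form. Assumption: Tab moves to the next
    // input field and Enter (carriage return) submits the screen.
    public static string BuildKeystrokes(params string[] fieldValues)
    {
        return string.Join("\t", fieldValues) + "\r";
    }
}
```

The parsing depends entirely on the layout staying put: if the host application moves a field by one column, the caller's row and column arguments have to change with it.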

Then the Web happened, and people everywhere started seeing things of value on their computer screens. Spammers saw an unlimited supply of fresh e-mail addresses in domain registration contact information, articles written for Web publications, resumes, and repositories of stored electronic communications such as Usenet newsgroups. Competitors saw countless hours of data entry effort paid for by others and realized that it would be cheaper to pay for a script to slurp all of that data and dump it to a database than to pay an army of data entry staff to reproduce the work.

Today's "screen scraping" is an information security threat faced by software or services that are valuable because they contain data but need to moderate access to that data to ensure that it is not misused or copied by a malicious third party.

Screen Scraping and Fair Use

Screen scraping is a paradox because it is only a problem for Web sites and applications that dispense data that other people are not supposed to use or possess. There is an implicit, and sometimes explicit, agreement between the user and the information provider: because ownership of the data remains with the copyright holder even after the data has been dispensed, recipients will self-police their behavior with respect to that data. As long as you abide by the principle of "fair use" there is no copyright violation in receiving, using, and possessing data from the Web, even if you use a screen scraper or Web crawler. Copyright law presently makes no distinction between different types of software that are equally capable of receiving data, even though we consider one type of software to be bad and another type to be good.

In the absence of a contract that establishes additional obligations for the recipient, or a copy protection or access control mechanism as defined by the Digital Millennium Copyright Act (DMCA) that the recipient (or the recipient's software) must circumvent illegally in order to receive the data, nothing can be done to stop screen scraping, other than to prevent it from being possible in the first place. The best way to do that is not to dispense data.

The protection we can achieve in practice is to ensure that it is not so trivial to screen scrape our systems that just anyone could do it in a few minutes. Unfortunately, that is precisely the case with Web sites that are not designed in advance to put up technical barriers to screen scraping. Consider the following C# code, which enters a loop and retrieves the full HTML output provided by a Web site that accepts a single name/value pair to indicate which of a number of possible database records to display. If the range of records is between 1 and 100,000 as indicated in this code, then at completion this program will have succeeded in requesting every byte of data that can be retrieved from the Web site because all 100,000 possible database records will have been saved locally.

Listing 1: A Web Site Screen Scraper in C#


using System;
using System.IO;
using System.Net;
using System.Threading;

namespace WebCrawler {
  class ScreenScraper {
    [STAThread]
    static void Main(string[] args) {
      string root = "c:\\crawler\\";
      if (!Directory.Exists(root)) {
        Directory.CreateDirectory(root);
      }
      int maxValue = 100000;
      int minValue = 1;
      // Random number source used to visit records in unpredictable order.
      Random rnd = new Random();
      // One iteration per record; each iteration downloads exactly one
      // record that has not been saved yet.
      for (int a = minValue; a <= maxValue; a++) {
        int b = 0;
        while (b == 0) {
          // Pick a random record number between minValue and maxValue.
          b = rnd.Next(minValue, maxValue + 1);
          // Records 1-100 go in directory "100"; all others are spread
          // across directories "0" through "99" by their last two digits.
          string dir = (b > 100) ? root + (b % 100) + "\\" : root + "100\\";
          if (!Directory.Exists(dir)) {
            Directory.CreateDirectory(dir);
          }
          string sFile = dir + b + ".htm";
          if (!File.Exists(sFile)) {
            // FQDN stands in for the target site's host name.
            Uri u = new Uri("http://FQDN/query.aspx?b=" + b);
            using (WebClient web = new WebClient()) {
              web.DownloadFile(u.ToString(), sFile);
            }
            Thread.Sleep(60000); // minimum 1 minute delay between requests
          }
          else {
            b = 0; // already saved; pick another record
          }
        }
      }
    }
  }
}

Many people defend against first generation Web-based screen scraping, in which spammers harvest e-mail addresses, by decorating their addresses in some fashion. Examples are seen all over the Web, such as <jasonc (at) science.org> or the clever variation that contains a self-modification instruction such as <jasoncREMOVE@science.org>. This makes it more difficult for spammers to parse the Web pages they retrieve, and it is in the parsing that the screen scraping is complete. Until it has been parsed, the data can't be used automatically.
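The two decorations can be sketched in a few lines, along with the kind of naive pattern match a harvesting script might apply. The class, its method names, and the regular expression are hypothetical; real harvesters vary and may well recognize the "(at)" form.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper that decorates an e-mail address before publication.
public static class AddressDecorator
{
    // "user@host" becomes "user (at) host".
    public static string ReplaceAtSign(string address)
    {
        return address.Replace("@", " (at) ");
    }

    // "user@host" becomes "userREMOVE@host".
    public static string InsertRemoveMarker(string address)
    {
        int at = address.IndexOf('@');
        if (at < 0) return address;
        return address.Substring(0, at) + "REMOVE" + address.Substring(at);
    }

    // A deliberately naive harvester check: does the text contain
    // something shaped like local-part@domain.tld?
    public static bool LooksHarvestable(string text)
    {
        return Regex.IsMatch(text, @"[A-Za-z0-9._-]+@[A-Za-z0-9-]+(\.[A-Za-z0-9-]+)+");
    }
}
```

Under this naive pattern the "(at)" form is not matched at all, while the REMOVE form is still matched but yields an address that is wrong until a human edits it.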

The second generation of defense against screen scraping imposes intelligent rate limits on database-driven Web sites. We know that a human user driving a browser can only make a small number of requests per minute, so logic on the server that detects too many requests per minute could presume that screen scraping is taking place and prevent access from the offending IP address for some period of time. This may prevent non-malicious Web crawlers and search engines from indexing the content of a Web site, but that is a desired side effect in most cases where robots.txt already asks such crawlers not to index database content. Most search engine crawlers won't do as the sample C# code shown here does and loop through every possible value of a name/value pair, so any automated requests for every possible record could be presumed to be malicious and could be blocked.
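A minimal sketch of such server-side logic follows, assuming a per-IP sliding window and a fixed block period. The class name, its parameters, and the thresholds a site would pass to it are hypothetical, and a production version would also need to expire idle entries.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical per-IP rate limiter with a temporary block on violation.
public class RequestRateLimiter
{
    private readonly int maxRequests;
    private readonly TimeSpan window;
    private readonly TimeSpan blockDuration;
    private readonly Dictionary<string, Queue<DateTime>> recent =
        new Dictionary<string, Queue<DateTime>>();
    private readonly Dictionary<string, DateTime> blockedUntil =
        new Dictionary<string, DateTime>();
    private readonly object sync = new object();

    public RequestRateLimiter(int maxRequests, TimeSpan window, TimeSpan blockDuration)
    {
        this.maxRequests = maxRequests;
        this.window = window;
        this.blockDuration = blockDuration;
    }

    // Returns true if the request may be served. Once a client already has
    // maxRequests inside the window, it is refused until the block expires.
    public bool IsAllowed(string clientIp, DateTime now)
    {
        lock (sync)
        {
            DateTime until;
            if (blockedUntil.TryGetValue(clientIp, out until))
            {
                if (now < until) return false;
                blockedUntil.Remove(clientIp);
            }
            Queue<DateTime> times;
            if (!recent.TryGetValue(clientIp, out times))
            {
                times = new Queue<DateTime>();
                recent[clientIp] = times;
            }
            // Discard timestamps that have aged out of the window.
            while (times.Count > 0 && now - times.Peek() >= window)
            {
                times.Dequeue();
            }
            if (times.Count >= maxRequests)
            {
                blockedUntil[clientIp] = now + blockDuration;
                times.Clear();
                return false;
            }
            times.Enqueue(now);
            return true;
        }
    }
}
```

A limiter built as new RequestRateLimiter(3, TimeSpan.FromMinutes(1), TimeSpan.FromMinutes(10)) refuses a fourth request made within the same minute, yet a client that waits a full minute between requests never comes close to the threshold.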

The sample C# program shown in this article puts the active thread to sleep for one minute between requests to avoid overloading the server. If such a delay between requests is all that a screen scraper needs to avoid triggering your rate limits, then rate limits alone may not be sufficient protection and should be supplemented with other safeguards.

Third generation defenses include prove-you're-human countermeasures that display a visual graphic or other Web page element that is difficult to parse under any circumstances except by a human who can easily see the hidden message in the noise. By requiring the hidden message to be typed by the human user, non-human users (screen scrapers and Web crawlers) can be effectively denied access.

Fourth generation defenses include context-sensitive request processing logic that layers in awareness of how a resource is really used when it is used according to its intended purpose by a human user. We know, for example, that no client (whether human, automated script, or typing chicken) should be allowed to request every possible name/value pair in succession until they have viewed all 100,000 database records. Therefore we could prevent access to any additional records for a period of time from any IP address that requests more than a few records by name/value pair in a sequence. This only works when the data is keyed in such a way that normal use of the system won't typically result in such sequenced requests.
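One way to sketch this kind of sequence awareness is to track, per IP address, how many consecutive record numbers a client has requested in a row. The class name and the maxRun threshold are hypothetical, and the sketch only recognizes ascending runs that step by exactly one.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical detector for clients that walk record numbers in order.
public class SequenceDetector
{
    private class ClientState
    {
        public int LastId;
        public int RunLength;
    }

    private readonly int maxRun;
    private readonly Dictionary<string, ClientState> clients =
        new Dictionary<string, ClientState>();

    public SequenceDetector(int maxRun)
    {
        this.maxRun = maxRun;
    }

    // Records a request and returns true once the client has requested
    // more than maxRun record numbers in ascending consecutive order.
    public bool IsSuspicious(string clientIp, int recordId)
    {
        ClientState state;
        if (!clients.TryGetValue(clientIp, out state))
        {
            clients[clientIp] = new ClientState { LastId = recordId, RunLength = 1 };
            return false;
        }
        // Extend the run on an exact +1 step; otherwise start a new run.
        state.RunLength = (recordId == state.LastId + 1) ? state.RunLength + 1 : 1;
        state.LastId = recordId;
        return state.RunLength > maxRun;
    }
}
```

With maxRun set to 3, requests for records 10, 11, and 12 pass and the request for 13 is flagged. A client that asks for records in no particular order is never flagged by this check.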

However, the C# program shown in this article randomizes the name/value pair until every possible value has been used, saving approximately 1,000 files per directory in directories named 0 through 99, with records 1 through 100 saved in a directory named 100. The names of the saved files record which values have already been retrieved. There is thus little value in any technical defense that is so easily thwarted.

The real teeth of your anti-screen scraping approach have to come, unfortunately, from one of those terrible facts of life today: a click-through contract. Force the people who access your site, or your software, to agree to be bound contractually by the terms of an agreement that prevents the misuse of your data. Then, if your data is misused, file a lawsuit. The first question of law to consider is whether that click-through contract applies to a software agent that we can only indirectly attribute to the defendant by proximity of the defendant to the computer on which the agent executed. It would be a good idea to gather some proof that a human at one time must have seen that there was a contract and "clicked" through it to access the data; otherwise you'll be arguing that a software program applied the defendant's digital signature by proxy. Proving which human it was who "digitally signed" the contract with the "click-through" may be unnecessary, depending upon the circumstances of the case.

Screen scraping is an intractable problem, like spam, that can only be dealt with properly through legislation and jurisprudence to unambiguously establish the standards of behavior we are all required to follow in the course of our electronic communications and virtual relationships.
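Gathering evidence of a click-through could start with something as simple as the record sketched below, written at the moment a visitor accepts the terms. The class and its fields are hypothetical, and what a court would actually accept as proof is a legal question that code cannot answer. The sketch only ties a time, an address, and a browser identification to a hash of the exact terms that were displayed.

```csharp
using System;
using System.Security.Cryptography;
using System.Text;

// Hypothetical record of one click-through acceptance.
public class AcceptanceRecord
{
    public DateTime AcceptedAtUtc { get; set; }
    public string ClientIp { get; set; }
    public string UserAgent { get; set; }
    public string TermsSha256 { get; set; }

    // Captures who accepted, when, and a SHA-256 digest of the terms text
    // so the record can later be matched to the version that was shown.
    public static AcceptanceRecord Create(string termsText, string clientIp,
                                          string userAgent, DateTime nowUtc)
    {
        using (SHA256 sha = SHA256.Create())
        {
            byte[] digest = sha.ComputeHash(Encoding.UTF8.GetBytes(termsText));
            return new AcceptanceRecord
            {
                AcceptedAtUtc = nowUtc,
                ClientIp = clientIp,
                UserAgent = userAgent,
                TermsSha256 = BitConverter.ToString(digest).Replace("-", "").ToLowerInvariant()
            };
        }
    }
}
```

Storing the digest rather than trusting a version label means that a later edit to the terms cannot be confused with the text the visitor actually saw.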


Jason Coombs is Director of Forensic Services for PivX Solutions Inc. (NASDAQ OTCBB: PIVX), a provider of security solutions, computer forensics, and expert witness services. Reach him at jasonc@science.org.

