Parsing Web Form Input in CGI Shell Scripts


UnixReview.com
January 2007

This month's Shell Corner is for CGI programmers. Chris Johnson provides the Bash (version 2.0 or greater) shell function parse_query for processing web forms.

by Chris F.A. Johnson

CGI scripts for processing web forms can be written in any language, even shell scripts. The hardest part is parsing the input from the form.

The input is delivered using one of two methods, GET or POST. The GET method adds the information to the end of the URL, and the web server makes it available to the CGI program in the environment variable, QUERY_STRING. The POST method places it on the program's standard input.
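
As a rough sketch of that difference, the bash helper below returns the raw form data for either method. The name get_form_data is hypothetical and not part of the article's code; the sketch assumes the web server sets REQUEST_METHOD and QUERY_STRING as described above.

```shell
#!/bin/bash
# get_form_data: hypothetical helper, not from the article.
# Prints the raw, still-encoded form data for a GET or POST request.
get_form_data()
{
    local data
    if [ "$REQUEST_METHOD" = "POST" ]
    then
        # POST: the data arrives on standard input. The body usually has
        # no trailing newline, so read may return non-zero even though it
        # has stored the data; the exit status is deliberately ignored.
        IFS= read -r data || :
        printf '%s\n' "$data"
    else
        # GET: the server has already placed the data in QUERY_STRING.
        printf '%s\n' "$QUERY_STRING"
    fi
}

# Simulate a GET request
REQUEST_METHOD=GET
QUERY_STRING='user=john&lastname=Doe'
get_form_data                                      # prints: user=john&lastname=Doe

# Simulate a POST request
REQUEST_METHOD=POST
printf 'user=jane&lastname=Roe' | get_form_data    # prints: user=jane&lastname=Roe
```

The sketch uses read -r so that backslashes in the input are kept literally.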

Whichever method is used, the input is in the form name=value, with multiple name-value pairs being separated by ampersands, e.g.: user=john&firstname=John&lastname=Doe.
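
A minimal sketch of splitting such a string on the ampersand, assuming a reasonably recent bash (it uses an array and a here-string). The variable names query and pairs are illustrative only; parse_query itself uses a different technique, shown later.

```shell
#!/bin/bash
# Split a query string into name=value pairs on the ampersand.
query='user=john&firstname=John&lastname=Doe'

# IFS is changed only for this one read command; -r keeps backslashes
# literal and -a stores the resulting fields in the array "pairs".
IFS='&' read -r -a pairs <<< "$query"

for pair in "${pairs[@]}"
do
    printf '%s\n' "$pair"
done
# prints:
# user=john
# firstname=John
# lastname=Doe
```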

To ensure safe transmission across 7-bit networks, and to prevent ambiguity, all 8-bit characters and most non-alphanumeric characters are converted to a 2-digit hex code with a leading percent sign. For example, all dollar signs, $, are converted to %24. Spaces are converted to plus signs. A CGI program must split the string using the ampersand as the delimiter, convert plus signs to spaces and hex codes to ASCII characters, as well as separate the values from the names.
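
The decoding half of that job can be sketched in bash as follows. The function name urldecode is hypothetical; it uses the same parameter expansion and printf %b approach that parse_query uses below, applied to a single value.

```shell
#!/bin/bash
# urldecode: hypothetical helper name, not from the article.
# Decodes one URL-encoded value and prints it.
urldecode()
{
    local s=${1//+/ }              # plus signs become spaces
    printf '%b\n' "${s//\%/\\x}"   # %24 becomes \x24, which %b expands to $
}

urldecode 'price=%2419.99+each'    # prints: price=$19.99 each
urldecode 'Jane%20Doe'             # prints: Jane Doe
```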

When I wrote my first CGI scripts, more than 10 years ago, I used some CGI utilities, written in C, to do the parsing and hex conversion. Since then, I have found ways of doing it in the shell itself. My latest version, which I wrote for my revised Word Finder and Anagram Solver recently, is not as portable as my previous scripts (it requires bash version 2 or later), but it is compact and flexible.

It is a single function, parse_query, that takes a list of variable names as its arguments. Values will only be assigned to the variables in the list. Since anyone can bypass a web form and submit any name-value pairs they like to the script, this ensures that malicious variable names have no effect.
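
To illustrate the idea of an allow-list, here is a small sketch. The helper is_allowed is hypothetical and uses a plain loop rather than the case pattern that parse_query uses; it only shows why a submitted name such as PATH would be ignored.

```shell
#!/bin/bash
# is_allowed: hypothetical helper, not from the article.
# Succeeds only if the name in $1 is one of the remaining arguments.
is_allowed()
{
    local name=$1 allowed
    shift
    for allowed in "$@"
    do
        [ "$name" = "$allowed" ] && return 0
    done
    return 1
}

# "PATH" stands in for a name that a hostile client might submit.
for submitted in user PATH lastname
do
    if is_allowed "$submitted" user firstname lastname
    then
        echo "$submitted: accepted"
    else
        echo "$submitted: ignored"
    fi
done
# prints:
# user: accepted
# PATH: ignored
# lastname: accepted
```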

Here is the function without interruption; a trial run and a detailed, line-by-line explanation follow:

parse_query()  #@ USAGE: parse_query var [var ...]
{
    local var val
    local IFS='&'
    vars="&$*&"
    [ "$REQUEST_METHOD" = "POST" ] && read QUERY_STRING
    set -f
    for item in $QUERY_STRING
    do
      var=${item%%=*}
      val=${item#*=}
      val=${val//+/ }
      case $vars in
         *"&$var&"* )
             case $val in
                 *%[0-9a-fA-F][0-9a-fA-F]*)
                      val=$( printf "%b" "${val//\%/\\x}." )
                      val=${val%.}
             esac
             eval "$var=\$val"
             ;;
      esac
    done
    set +f
}   

Trial Run

Put the parse_query function and the following lines into a file and execute it:
unset REQUEST_METHOD  ## just in case
QUERY_STRING="name=Jane%20Doe&shell=bash&quote=%22"
parse_query name shell quote
printf "%s\n" "name=$name" "shell=$shell" "quote=$quote"
echo
unset name shell quote
parse_query name quote
printf "%s\n" "name=$name" "shell=$shell" "quote=$quote"
The result should be:
name=Jane Doe
shell=bash
quote="

name=Jane Doe
shell=
quote="

How it works

I always use the POSIX form of function definition, rather than the ksh variety, and I always (well, almost always) add a comment marked with #@ to aid grepping for documentation. My good intentions to comment all my code often get no further than that:
parse_query()  #@ USAGE: parse_query var [var ...]
{
The variables var and val are used to hold the name of the variable and its value respectively. They are made local to the function and cannot be used as names in the web form:
    local var val
The Internal Field Separator is set to an ampersand so that the QUERY_STRING is broken up automatically:
    local IFS='&'
When $* is used inside double quotes, the arguments are separated by the first character of IFS; vars will contain all the allowable variables, and each will be preceded and followed by an ampersand:
    vars="&$*&"
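
A short demonstration of that joining behavior, using a hypothetical function named show_vars that is not part of the article's code:

```shell
#!/bin/bash
# show_vars: hypothetical demonstration function, not from the article.
show_vars()
{
    local IFS='&'
    # "$*" joins the arguments with the first character of IFS
    local vars="&$*&"
    printf '%s\n' "$vars"
}

show_vars name shell quote    # prints: &name&shell&quote&
```

Because every allowed name ends up with an ampersand on each side, a later pattern match on "&name&" cannot be fooled by a name that is merely a substring of an allowed one.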
If the method used by the form is GET, QUERY_STRING will already contain the form information. If it uses POST, this line reads it into QUERY_STRING:
    [ "$REQUEST_METHOD" = "POST" ] && read QUERY_STRING
File globbing is normally performed when unquoted variables are expanded; set -f turns it off. A browser sends most wildcard characters in hex format, but an asterisk may arrive unencoded, and a hand-built request can contain anything, so turning globbing off is a sensible precaution and a good habit to get into:
    set -f
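
The effect of set -f can be seen in this small sketch, which assumes mktemp -d is available and works in a scratch directory containing two files. The variable names are illustrative only.

```shell
#!/bin/bash
# Demonstrate what set -f prevents, using a scratch directory.
dir=$(mktemp -d) || exit 1
cd "$dir" || exit 1
touch one two

value='*'

set -- $value                 # globbing on: * expands to the file names
with=$#
echo "with globbing: $with"           # prints: with globbing: 2

set -f
set -- $value                 # globbing off: * stays a literal asterisk
without=$#
echo "without globbing: $without"     # prints: without globbing: 1
set +f

cd / && rm -r "$dir"          # clean up the scratch directory
```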
Since IFS is set to an ampersand, QUERY_STRING will be broken up into its constituent name=value segments, and a for loop processes each one in turn:
    for item in $QUERY_STRING
    do
The name of the variable is everything before the first equals sign:
      var=${item%%=*}
Its value is everything after the first equals sign:
      val=${item#*=}
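
These two expansions can be tried on their own. In the sketch below the sample item is made up, and contains a second equals sign to show that only the first one is used as the separator:

```shell
#!/bin/bash
item='expr=a=b'

var=${item%%=*}    # remove the longest suffix that starts with "="
val=${item#*=}     # remove the shortest prefix that ends with "="

printf '%s\n' "var=$var" "val=$val"
# prints:
# var=expr
# val=a=b
```

When an item contains no equals sign at all, neither pattern matches, so both expansions return the whole item.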
Plus signs are converted to spaces using the non-standard parameter expansion found in bash:
      val=${val//+/ }
The case statement checks that $var is found between ampersands in the $vars variable:
      case $vars in
         *"&$var&"* )
If it does, the nested case statement checks whether the value contains a hex code:
             case $val in
                 *%[0-9a-fA-F][0-9a-fA-F]*)
Here is where the conversion is done. There are four steps involved in this line: 1. Percent signs are converted to backslash-x (\x) using the parameter expansion mentioned above; 2. The built-in command printf's %b specifier converts the resulting hex code to the ASCII character; 3. a period is added to prevent command substitution stripping any trailing newlines; and 4. the result of printf via command substitution is assigned to val:
                      val=$( printf "%b" "${val//\%/\\x}." )
The superfluous period is removed:
                      val=${val%.}
             esac
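
The purpose of the added period is easiest to see with a value that ends in an encoded newline (%0A). The sketch below uses made-up variable names and compares the conversion with and without the period:

```shell
#!/bin/bash
encoded='line%0A'     # a value ending in an encoded newline

# Without the period, command substitution strips the trailing newline.
plain=$( printf '%b' "${encoded//\%/\\x}" )
echo "${#plain}"      # prints: 4

# With the period, the newline is no longer last, so it survives;
# the period is then removed.
kept=$( printf '%b' "${encoded//\%/\\x}." )
kept=${kept%.}
echo "${#kept}"       # prints: 5
```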
Finally, the value is assigned to the variable:
             eval "$var=\$val"
             ;;
      esac
    done 
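
The backslash in the eval line matters: it keeps $val from being expanded before eval sees it, so eval performs a plain assignment and the contents of the value are never executed. A small sketch with a deliberately hostile, made-up value:

```shell
#!/bin/bash
var=quote
val='$(echo oops); "x"'

# eval receives the text  quote=$val  and performs an ordinary
# assignment; the value is copied, not re-evaluated as shell code.
eval "$var=\$val"

printf '%s\n' "$quote"    # prints: $(echo oops); "x"
```

This is safe only because the name in var has already been checked against the list of allowed names.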
Purists will say that the next line is wrong; rather than turning globbing back on, the function should have saved the state of the options (opts=$-) before turning it off, and turn it on only if it was on when the function was entered (case $opts in *f*) ;; *) set +f;; esac). They'd be right, but as I never turn off globbing except for specific commands, I can get away with it.
    set +f
} 
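
For completeness, here is a sketch of the purists' approach in a small stand-alone function. The function split_words and the variable nwords are hypothetical and serve only to show the save-and-restore pattern:

```shell
#!/bin/bash
# split_words: hypothetical function, not from the article.
# Counts the whitespace-separated words in $1 without globbing,
# then restores the caller's globbing setting.
split_words()
{
    local opts=$-          # option flags in effect on entry
    set -f
    set -- $1              # unquoted on purpose: split, but do not glob
    nwords=$#
    case $opts in
        *f*) ;;            # globbing was already off: leave it off
        *) set +f ;;       # globbing was on: turn it back on
    esac
}

split_words 'a * b'
echo "$nwords"             # prints: 3
```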

After 20 years in magazine and newspaper publishing (as a writer, editor, graphic designer, and production manager), Chris F.A. Johnson now earns his living composing cryptic crossword puzzles, teaching chess, and dabbling in Unix systems administration and programming. He is also the author of Shell Scripting Recipes: A Problem-Solution Approach.

