Command-Line Argument Processing & the Argv Library

Argv, the extensible Java-based argument-handling library Oliver presents here, parses typical argument types such as Boolean flags and string values.

November 01, 2003
URL:http://www.drdobbs.com/jvm/command-line-argument-processing-the-ar/184405482

Dr. Dobb's Journal November 2003

An object-oriented interface that cleans up any program

By Oliver Goldman

Oliver is an architect at Adobe Systems. He can be reached at [email protected].

Code for processing command-line arguments is the inauspicious start of many a program. In C/C++, you can use getopt() to handle a bit of the work, but with or without such helper functions, somewhere along the way, you've got to write a loop to walk through those arguments one by one. The body of that loop is often ugly, error-prone, repetitive, and far from object oriented.

In this article, I examine the current state of affairs of argument handling, then present Argv, a library I wrote to avoid argument-handling problems. A quantum leap beyond getopt(), Argv provides a convenient and object-oriented interface that cleans up any program. Argv can parse typical argument types, such as Boolean flags, string values, and more, and can be extended to handle more complex cases.

Argv is written in Java and is available electronically from DDJ (see "Resource Center," page 5) and at http://software.charlie-dog.com/ under an open-source license. The technique I'll describe here can also be applied in other languages.

The Argument Parsing Problem

The fundamental problem in parsing command-line arguments is translating untyped information, encoded in an array of strings, into typed information stored in program variables. ("Untyped" here refers to variable typing, not to typing on a keyboard.) In fact, you can think of the goal as transforming an invocation of main(String argv[]) into an invocation of some function f, such as f(T1 a1,T2 a2,...).

Prototypical command-line parsing code looks something like Listing One. (Lest you think I made up a particularly poor example, this listing borrows heavily from the Solaris getopt() manual pages.) This approach might be called "string centric," in that the code is organized around the strings to be parsed: -a, -b, and -o.

The usage string provides some interesting additional information about the types of the arguments: -a and -b are actually two values in the same enumeration, and only one can be specified for any invocation. This information fails to translate cleanly into the code itself: Information about the relationship between -a and -b is split between two different cases of the switch statement, and the use of -a and -b is recorded separately in aflg and bflg. Thus, to understand the transformation between the untyped command-line arguments and the typed information stored in program variables, you must understand the interplay between each iteration through the switch statement and each case statement it contains. The size of the switch statement grows linearly with the number of options to the program, but because of the potential interplay, the complexity of the parsing code tends to grow much more quickly.

If you read enough command-line programs, you might get the impression that more than one programmer thinks this solution is less than elegant. For example, it's common to see this parsing loop moved out of main and into an auxiliary function with a name like parse_arguments(). The instinct is good, but this particular factoring does little to simplify the situation. First, it doesn't address the code complexity of the string-centric approach. Second, the parse_arguments() routine is, itself, complicated in that it must communicate with main() regarding a large number of variables—at least one for each possible command-line argument. Of course, this can be handled by long argument lists or even a structure to wrap up the corresponding variables into a single argument, but those don't seem to be popular choices. I have also seen parse_arguments() written to record its results into a set of member variables that the main() method can also access, thus operating entirely by side effect.

A Better Solution: The Argv Library

To find a simpler implementation, you need to refactor your solution so it is centered around the arguments to f(), not main(). As such, the processing of each individual argument to f() should be centralized, even at the cost of distributing the processing across all string arguments to main().

The fundamental abstraction in the Argv library is an Argument, which binds all the information about an argument to f() in one location. That is, an Argument class binds together the type of some argument to f(), such as an enumerated value, along with the logic required to parse that argument from command-line strings, even if the value is spread among more than one string. For example, the -a/-b enumerated argument in the getopt() example would be represented by a single Argument instance in the Argv library. Again, contrast this with the typical getopt() loop in which the logic is associated with the command-line string, not the variable.
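To make the contrast concrete, here is a minimal sketch of how the -a/-b choice could live in one object. The Argument interface below is a hypothetical stand-in reduced to a single method, and ModeArgument, Mode, and hasConflict() are illustrative names, not classes from the Argv library.

```java
import java.util.Arrays;
import java.util.List;

class ModeArgumentSketch {
    /** Hypothetical stand-in for the Argument abstraction described in the text. */
    interface Argument {
        int parse(List<String> values); // returns the number of strings consumed
    }

    enum Mode { NONE, A, B }

    /** Owns both switches and the rule that only one of them may be given. */
    static class ModeArgument implements Argument {
        private Mode value = Mode.NONE;
        private boolean conflict = false;

        @Override
        public int parse(List<String> values) {
            String head = values.get(0);
            Mode seen;
            if (head.equals("-a")) {
                seen = Mode.A;
            } else if (head.equals("-b")) {
                seen = Mode.B;
            } else {
                return 0; // not ours; consume nothing
            }
            if (value != Mode.NONE && value != seen) {
                conflict = true; // -a and -b were both given
            } else {
                value = seen;
            }
            return 1;
        }

        Mode getValue() { return value; }
        boolean hasConflict() { return conflict; }
    }

    public static void main(String[] args) {
        ModeArgument mode = new ModeArgument();
        mode.parse(Arrays.asList("-a"));
        mode.parse(Arrays.asList("-b"));
        System.out.println(mode.getValue() + " conflict=" + mode.hasConflict());
        // prints "A conflict=true"
    }
}
```

The mutual-exclusion rule that Listing One splits across two case labels and two flag variables sits here in one method, next to the value it protects.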

The library contains implementations for a number of common types with parameterized command-line string values, including:

Boolean. False by default. True if the corresponding command-line switch is specified one or more times. Typical examples include -h for help and -v for verbose output.
String. An argument with a String value and an optional default value. Values are specified on the command line after a specified switch; for example, -f /dev/null.
Pair. An argument with two String values; for instance, -x a b. I've found this useful for programs that transform (or otherwise process) a named input to a named output.
Number. An argument with a BigDecimal value and an optional default value. Values are specified on the command line following a specified switch; for example, -n 10. BigDecimal is used instead of a floating-point type because it captures both the value and the precision specified by users.
List. Vacuums up all command-line strings it sees, and returns them as a list of strings. It's useful at the end of a command line for, say, a list of input files.
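The remark about BigDecimal in the Number entry above can be checked with the standard library alone. This small sketch does not use Argv; typedScale() is a hypothetical helper introduced only for illustration.

```java
import java.math.BigDecimal;

class NumberPrecisionSketch {
    /** Digits the user typed after the decimal point, as BigDecimal records them. */
    static int typedScale(String text) {
        return new BigDecimal(text).scale();
    }

    public static void main(String[] args) {
        String typed = "10.50";
        // BigDecimal keeps the trailing zero the user entered.
        System.out.println(new BigDecimal(typed) + " scale=" + typedScale(typed)); // prints "10.50 scale=2"
        // A double keeps only the numeric value.
        System.out.println(Double.parseDouble(typed)); // prints "10.5"
    }
}
```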

Argv's parsing process is managed by an instance of the ArgumentParser class. Before parsing begins, each Argument to the program is registered with an instance of ArgumentParser via the addArgument() method. Order is important in that Arguments are given the opportunity to parse the command line in the same order as they are added to the ArgumentParser. This typically only has an effect, however, for the List argument type.

The actual command-line string array is passed to the ArgumentParser.parse() method. When this method completes, each registered Argument has, in turn, been invoked to apply its own logic to parsing these switches; see Listing Two.

The parse() method is the heart of the refactoring and consists of a double loop (Listing Three). The outer loop iterates through each command-line switch; this is similar to the getopt() solution. However, on each pass through the loop, the current argument list is dispatched to the parse() method of each individual Argument. Thus, the complicated parsing code that previously cluttered the getopt() loop has been cleanly factored out.

There are a couple of key steps in these loops that may not be immediately obvious. First, the outer loop is invoked not once for each element of the command-line argument array, but once for each element in the array that could be a command-line switch. For example, consider the command line -o /dev/null -a in Listing One. The first iteration considers -o. When this argument is processed, both the -o and /dev/null are consumed. Thus, the second pass considers -a.
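The -o /dev/null -a walk-through can be traced with a small self-contained sketch. Everything here is an assumption made for illustration: Argument is a one-method stand-in for the library's interface, Flag and Valued are hypothetical argument types, and passHeads() is a simplified version of the double loop that skips unmatched strings instead of collecting them.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class DispatchLoopSketch {
    /** Hypothetical stand-in for the Argument interface. */
    interface Argument {
        int parse(List<String> values);
    }

    /** A switch with no value, such as -a. */
    static class Flag implements Argument {
        private final String name;
        boolean set;
        Flag(String name) { this.name = name; }
        @Override
        public int parse(List<String> values) {
            if (!values.get(0).equals(name)) return 0;
            set = true;
            return 1;
        }
    }

    /** A switch followed by one value, such as -o /dev/null. */
    static class Valued implements Argument {
        private final String name;
        String value;
        Valued(String name) { this.name = name; }
        @Override
        public int parse(List<String> values) {
            if (values.size() < 2 || !values.get(0).equals(name)) return 0;
            value = values.get(1);
            return 2;
        }
    }

    /** Runs the double loop and records the string each outer pass started on. */
    static List<String> passHeads(List<Argument> arguments, String[] argv) {
        List<String> values = Arrays.asList(argv);
        List<String> heads = new ArrayList<>();
        while (!values.isEmpty()) {
            heads.add(values.get(0));
            int consumed = 0;
            for (Argument argument : arguments) {
                consumed = argument.parse(values);
                if (consumed > 0) break; // first match wins
            }
            // Unmatched strings are skipped one at a time in this sketch.
            values = values.subList(Math.max(consumed, 1), values.size());
        }
        return heads;
    }

    public static void main(String[] args) {
        List<Argument> arguments = new ArrayList<>();
        arguments.add(new Valued("-o"));
        arguments.add(new Flag("-a"));
        System.out.println(passHeads(arguments, new String[] {"-o", "/dev/null", "-a"}));
        // prints "[-o, -a]"
    }
}
```

The outer loop runs twice for three strings, because the first pass consumes both -o and its value.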

Second, each Argument is given an opportunity to process each possible command-line switch, even if an argument has previously accepted a value. This is necessary because only the Argument itself knows what to do should its command-line switch happen to appear more than once. Arguments are passed command-line switches in the order in which the Argument instances are added to the ArgumentParser. The Argv package does not check for duplicated command-line switches or similar inconsistencies. It is up to the invoking code to ensure that the overall argument set is consistent.

The parse() method returns a list of any arguments not consumed by the parse. In some applications, extra arguments may indicate an error condition; if so, the application can simply return an error if this list is not empty. The application may also elect to use this list as additional arguments. For example, such arguments might be a list of source files being passed to a compiler. However, extra arguments collected by ArgumentParser may have appeared anywhere in the argument list, and variable-length lists of input files typically appear only at the end of the command-line switches. If you have such a list that can appear only at the end of a command line, registering a ListArgument as the last registered argument gives you the desired behavior.
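The difference between leftover extras and a list argument registered last can be sketched as follows. This is not the library's code: Argument, Flag, and Rest are hypothetical stand-ins, Rest only approximates the List argument type described earlier, and parse() mirrors the shape of Listing Three.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class TrailingListSketch {
    /** Hypothetical stand-in for the Argument interface. */
    interface Argument {
        int parse(List<String> values);
    }

    static class Flag implements Argument {
        private final String name;
        boolean set;
        Flag(String name) { this.name = name; }
        @Override
        public int parse(List<String> values) {
            if (!values.get(0).equals(name)) return 0;
            set = true;
            return 1;
        }
    }

    /** Hypothetical list argument: takes every string still unparsed. */
    static class Rest implements Argument {
        final List<String> items = new ArrayList<>();
        @Override
        public int parse(List<String> values) {
            items.addAll(values);
            return values.size();
        }
    }

    /** Same shape as Listing Three: returns whatever no argument claimed. */
    static List<String> parse(List<Argument> arguments, String... argv) {
        List<String> values = Arrays.asList(argv);
        List<String> extras = new ArrayList<>();
        while (!values.isEmpty()) {
            int consumed = 0;
            for (Argument argument : arguments) {
                consumed = argument.parse(values);
                if (consumed > 0) break;
            }
            if (consumed == 0) {
                extras.add(values.get(0));
                consumed = 1;
            }
            values = values.subList(consumed, values.size());
        }
        return extras;
    }

    public static void main(String[] args) {
        // Without a list argument, strays from anywhere on the line land in extras.
        List<Argument> flagOnly = new ArrayList<>();
        flagOnly.add(new Flag("-v"));
        System.out.println(parse(flagOnly, "a.c", "-v", "b.c")); // prints "[a.c, b.c]"

        // With the list argument registered last, it takes the tail of the line.
        Rest files = new Rest();
        List<Argument> withRest = new ArrayList<>();
        withRest.add(new Flag("-v"));
        withRest.add(files);
        List<String> extras = parse(withRest, "-v", "a.c", "b.c");
        System.out.println(files.items + " " + extras); // prints "[a.c, b.c] []"
    }
}
```

Because Rest is offered the list only after Flag has declined it, switches that precede the file names are still parsed normally; anything after the first file name goes to Rest.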

Once the parse is completed, each Argument instance can be queried for the value it parsed. By convention, this value is obtained from a getValue() method returning the appropriate type. More sophisticated argument types might, if appropriate, provide additional methods for dealing with the argument value.

If an error condition does occur during the parse, whether it be a required argument with no value, extra arguments that weren't parsed, or anything else, the ArgumentParser.printUsage() method can be used to help construct an appropriate usage message. This method requires a PrintWriter as input and simply passes it to the printUsage() method of each registered Argument. Each Argument type in the Argv library follows the same formatting convention for its printUsage() method, resulting in an easily readable result.

Extending Argv

The Argv library can easily be extended with new types of Arguments by creating new classes that implement the Argument interface. This interface contains only two methods requiring implementation.

The first method, parse(), is where the parsing of individual arguments occurs. This method receives, as its sole argument, the list of command-line argument strings that have not yet been parsed. The method implementation must examine the first item in this list and determine whether that string matches this argument type. The method may examine as many additional elements in the list as necessary. For example, an argument taking both a switch, like -x, and a separate value, like fubar, would examine the first two elements in the list.

This method must return the number of items in the list consumed by this argument. If the first string in this list did not match, this argument must return zero. Elements in the list reported as processed by a parse() method will not be available for any other argument to parse.

Remember that the parse() method will be invoked once for each position in the command-line argument array at which an argument could begin. What this means to the argument value depends on the argument semantics you've chosen to define: Subsequent values could be ignored or they could replace earlier values. Specifying an argument more than once could be an error condition or change the value of the associated Argument object.
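The return-value contract and two possible repeat policies might look like the sketch below. The Argument interface is again a hypothetical one-method stand-in, and LastWins and Verbosity are illustrative classes, not part of Argv. The small loop in main() imitates the parser by offering each argument the unparsed tail of the command line.

```java
import java.util.Arrays;
import java.util.List;

class RepeatSemanticsSketch {
    /** Hypothetical stand-in for the Argument interface. */
    interface Argument {
        int parse(List<String> values);
    }

    /** -x VALUE: consumes two strings; a later occurrence replaces an earlier one. */
    static class LastWins implements Argument {
        String value;
        @Override
        public int parse(List<String> values) {
            if (values.size() < 2 || !values.get(0).equals("-x")) return 0;
            value = values.get(1);
            return 2;
        }
    }

    /** -v: consumes one string; every occurrence raises the level. */
    static class Verbosity implements Argument {
        int level;
        @Override
        public int parse(List<String> values) {
            if (!values.get(0).equals("-v")) return 0;
            level++;
            return 1;
        }
    }

    public static void main(String[] args) {
        List<String> line = Arrays.asList("-x", "fubar", "-v", "-x", "again", "-v");
        LastWins x = new LastWins();
        Verbosity v = new Verbosity();
        int position = 0;
        while (position < line.size()) {
            List<String> rest = line.subList(position, line.size());
            int used = x.parse(rest);
            if (used == 0) used = v.parse(rest);
            position += Math.max(used, 1); // skip a string nobody claimed
        }
        System.out.println(x.value + " " + v.level); // prints "again 2"
    }
}
```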

The second method you must implement is printUsage(). This method is invoked by the ArgumentParser if ArgumentParser.printUsage() is invoked to generate a usage message on the specified PrintWriter. Arguments in the Argv library all indent each line of their usage message two spaces; if you follow the same convention, your usage messages will look that much better.
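A printUsage() implementation that follows the two-space convention could be as small as the sketch below. The article does not show the exact layout the library's own arguments print, so the single-line format here is an assumption, as are the names HelpFlag and render() and the two-method stand-in for the Argument interface.

```java
import java.io.PrintWriter;
import java.io.StringWriter;
import java.util.List;

class UsageSketch {
    /** Hypothetical stand-in for the two-method Argument interface. */
    interface Argument {
        int parse(List<String> values);
        void printUsage(PrintWriter out);
    }

    static class HelpFlag implements Argument {
        boolean set;

        @Override
        public int parse(List<String> values) {
            if (!values.get(0).equals("-help")) return 0;
            set = true;
            return 1;
        }

        @Override
        public void printUsage(PrintWriter out) {
            // Two-space indent, as the text recommends; the rest of the layout is assumed.
            out.println("  -help  Describe command line args");
        }
    }

    /** Captures one argument's usage text in a String. */
    static String render(Argument argument) {
        StringWriter buffer = new StringWriter();
        PrintWriter out = new PrintWriter(buffer);
        argument.printUsage(out);
        out.flush();
        return buffer.toString();
    }

    public static void main(String[] args) {
        System.out.print(render(new HelpFlag()));
    }
}
```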

Implementing the Argument interface does not require any explicit specifications of things like switch values or usage text; this is all handled by the parse() and printUsage() methods. When implementing a new argument type, you can choose whether you wish to parameterize these values. Because they are intended for general use, the argument types included with the Argv library allow switches and usage strings to be set via constructor arguments. Thus, for example, the BooleanArgument class can be used to register any Boolean argument controlled by a single flag. Complicated or one-off argument types, however, might reasonably embed the flag and usage strings within the argument implementation itself.

While the small set of argument types currently in the Argv library has proven useful for a wide variety of programs I have written, there are also many possible command-line features it does not yet support. If you develop additional Argument types or functionality that you feel may be generally useful and would like to contribute to this library, please feel free to contact me.

Conclusion

Processing command-line arguments involves taking an untyped list of strings and transforming it into a usable set of typed values. Although it is typical (and even encouraged by library calls such as getopt()) to deal with command-line arguments as strings, this is an error-prone process and tends to clutter even the cleanest programs. The Argv library presented here takes a different approach, immediately transforming the command-line argument strings into fully typed information. Once this transformation is accomplished, the remainder of the program can be written in a clean, object-oriented fashion. Argv is easy to use and easy to get right.

DDJ

Listing One

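#include <stdio.h>   /* fprintf(), NULL */
#include <unistd.h>  /* getopt(), optarg, optind */

int main( int argc, char **argv ) {
    int c;
    extern char *optarg;
    extern int optind;
    int aflg = 0;
    int bflg = 0;
    int errflg = 0;
    char *ofile = NULL;

    /* getopt() returns -1 once every option has been processed. */
    while ((c = getopt(argc, argv, "abo:")) != -1) {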
        switch (c) {
          case 'a':
            if (bflg)
              errflg++;
            else
              aflg++;
            break;
          case 'b':
            if (aflg)
              errflg++;
            else
              bflg++;
            break;
          case 'o':
            ofile = optarg;
            break;
          case '?':
            errflg++;
        }
        if (errflg) {
           fprintf( stderr, "usage: cmd [-a|-b] [-o <filename>] files...\n" );
           return 2;
        }
    }
    ...
}

Listing Two

import com.charliedog.argv.*;
import java.io.PrintWriter;
import java.util.List;

public static void main( String[] argv ) {

    // Initialize arguments that may appear in the command line. By convention,
    // '-help' simply prints the command line usage and exits.
    StringArgument destination = 
       new StringArgument( "-dest", "localhost", "Destination for requests" );
    BooleanArgument help = 
       new BooleanArgument( "-help", "Describe command line args" );
    // Initialize and invoke the parser. Arguments not consumed during parse
    // are returned in case they may be subject to additional processing, etc.
    // Variable 'argv' contains the String array passed to main().

    ArgumentParser parser = new ArgumentParser();
    parser.addArgument( destination );
    parser.addArgument( help );
    List extra = parser.parse( argv );

    // For this application, extra arguments will be treated as a usage error.
    if( !extra.isEmpty() || help.getValue()) {
        PrintWriter out = new PrintWriter( System.out );
        parser.printUsage( out );
        out.close();
        System.exit( 0 );
    }
    // Continue, using destination.getValue()...
}

Listing Three

public List parse( String args[] ) {
    List values = Arrays.asList( args );
    List extras = new LinkedList();
perValue: while( !values.isEmpty()) {
        // Give each Argument a shot at parsing the list in its
        // current form. Stop on the first match.

        Iterator i = arguments.iterator();
        while( i.hasNext()) {
            Argument arg = (Argument)( i.next());
            int numArgsConsumed = arg.parse( values );
            if( numArgsConsumed > 0 ) {
                values = values.subList( numArgsConsumed, values.size());
                continue perValue;
            }
        }
        // If no matches were found, move the first value to the extras
        // list and try again. Don't use values.remove( 0 ) here because
        // it is an optional method.
        extras.add( values.get( 0 ));
        values = values.subList( 1, values.size());
    }
    return extras;
}
