Program Configuration in Python


Computer programs are made of code. However, most nontrivial programs can be configured to behave in different ways without changing the code. There are many ways of configuring a program such as: command-line arguments, environment variables, configuration files, reading configuration information from a database, and reading configuration data over the network. Each form of configuration is appropriate for certain situations. Many programs combine several forms of configuration. In this article, I explore the spectrum of configuration options for single programs, distributed processes (same program running on multiple cores and/or machines), and distributed systems (a collection of different programs running on multiple cores and/or machines). I will also present a Python package that can help with managing configuration when dealing with systems composed of multiple configurable components.

While the sample code is in Python, the concepts apply just the same to any programming language including compiled languages.

Configuration is typically information provided to the program that controls operational aspects at runtime. Some examples are:

  • How much memory to use
  • The size of the thread pool
  • The database URL
  • Number of retries when an operation fails
  • The delay in seconds between retry attempts
  • What port to listen on

There are many configuration mechanisms. Let's consider the pros and cons of each one and how multiple mechanisms can be used together to provide several options to configure the same program.

Command-Line Arguments

Command-line arguments are very flexible. You can launch different instances of the same program with different command-line arguments. Most programming languages also have good modules or libraries for documenting and parsing command-line arguments. In Python, the argparse module is all the rage these days.
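As a rough illustration of argparse, here is a minimal sketch for a program invoked like `python sum.py input.txt 3`. The argument names `filename` and `delay`, and the choice to make the delay optional with a default of 0, are assumptions made for this example.

```python
import argparse


def build_parser():
    """Build a parser for: sum.py FILENAME [DELAY]."""
    parser = argparse.ArgumentParser(
        prog="sum.py",
        description="Print the sum of the integers in a file, then wait.",
    )
    parser.add_argument("filename", help="file with whitespace-separated integers")
    # Assumption: the delay is optional and defaults to no waiting.
    parser.add_argument("delay", type=float, nargs="?", default=0.0,
                        help="seconds to wait after printing the sum")
    return parser


# Parse an explicit argument list; a real program would call parse_args()
# with no arguments so that sys.argv is used.
args = build_parser().parse_args(["input.txt", "3"])
print(args.filename, args.delay)
```

argparse converts the delay to a float and generates the `--help` text from the descriptions, which covers both the parsing and the documenting mentioned above.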

However, command-line arguments have disadvantages. You have to enter them every time you run the program, which is tedious and error-prone, especially when there are many of them. They are awkward when a value contains spaces or quotes, because it must be escaped to avoid mis-parsing. They are a poor fit for hierarchical data, because an argument list is flat. The user is the one who specifies them, which can mean a steep learning curve, and consequent errors, for non-technical users. Finally, command-line arguments can be passed only once: There is no way to update them while the program is running.

Using Launchers with Fixed Command-Line Arguments

If you launch a program with the same command-line arguments over and over again, you may work around the drudgery by defining a launcher (shell script, alias, shortcut, or .bat file on Windows) that encodes the command-line arguments.

For example, if I always want to run sum.py with the same argument, I could write a little shell launcher called sum.sh:

python sum.py input.txt 3

Now, I can run it anytime just by typing sum.sh (assuming it is executable).

Launchers can also encode just some of the command-line arguments and let the user enter additional arguments. This allows sysadmins more control over the operation of the program by providing a launcher that exposes to the user only input arguments, such as the name of the input file, but encodes operational arguments, such as how long to wait after the program is done.

Environment Variables

Environment variables are often used for configuration because you can set them once per user and they will persist for all invocations of the program. Here is how to read from an environment variable in Python:

import os
# Get the delay from the environment variable SUM_DELAY and convert to a float
delay = float(os.environ['SUM_DELAY'])

There are various ways for sysadmins to control the user environment, and often certain programs are assigned to specific accounts with a managed environment or launchers set the environment variables before invoking the target program.

But environment variables are not a configuration panacea either. They suffer from many of the issues associated with command-line arguments. They are not suitable for large amounts of configuration or hierarchical configuration (although there are no parsing issues, as each environment variable is separate).

A potentially big problem with environment variables involves naming conflicts. Suppose I named the environment variable DELAY, and another program also used an environment variable called DELAY. If both programs run in the same environment, only one environment variable called DELAY can be set and both programs will share its value. Thus, a common practice is to give each environment variable name a (hopefully) unique prefix.
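The prefix convention can be sketched as a small helper. The prefix `SUM_`, the helper name `env_setting`, and the fallback to a default when the variable is unset are assumptions for illustration, not part of any library.

```python
import os

PREFIX = "SUM_"  # hypothetical prefix for this program's variables


def env_setting(name, default, convert=str, environ=os.environ):
    """Return PREFIX + name from environ, converted, or default if unset."""
    raw = environ.get(PREFIX + name)
    if raw is None:
        return default
    return convert(raw)


# A fake environment in which another program's DELAY is also set.
fake_env = {"SUM_DELAY": "2.5", "DELAY": "99"}
delay = env_setting("DELAY", 0.0, float, environ=fake_env)
print(delay)
```

Only `SUM_DELAY` is consulted, so the unrelated `DELAY` variable has no effect. Passing the environment as a parameter also makes the helper easy to test without touching the real environment.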

A final note on environment variable limitations: The environment is a shared resource and polluting it with lots of variables needed by a specific program is just bad form.

Configuration Files

The configuration file is the workhorse of configuration management. It is the preferred method of configuration for lots of programs that have many configuration options or hierarchical configuration. Configuration files come in an assortment of formats.

In the early days of Windows, .ini files were king (very simple, but only one native level of nesting). Then XML came along and it became a popular configuration file format (sadly, still with us on platforms like Android).

The Web 2.0 and Ajax revolution brought JSON as a more popular alternative. On the UNIX side, simple key = value .conf files are very common.

In Python, a common option that blurs the lines between code and configuration is to import a Python file. I will not dwell on it because it is solely a dynamic language trick, but check the configuration documentation for Web frameworks like Django or Flask if you're curious.

Listing One reads the delay from a configuration file. I'll use my favorite format, YAML. Here is the config file, conf.yaml:

{delay: 3}

Listing One

import sys
import time
import yaml

# Get the name of the input file
filename = sys.argv[1]

# Get the delay from conf.yaml and convert to a float
with open('conf.yaml') as f:
    delay = float(yaml.safe_load(f)['delay'])

# Read the numbers from file
with open(filename) as f:
    numbers = [int(x) for x in f.read().split()]

# Print the sum
print(sum(numbers))

# Wait the designated number of seconds
time.sleep(delay)

Configuration files can provide a lot of flexibility via search paths. For example, a default configuration file for some component X may be in some standard location such as /etc/X/X.conf. But, if there is a configuration file in the current user's directory ~/.X/X.conf, then it overrides the configuration in the default X.conf; and if a configuration filename is provided via an environment variable or command-line argument (as in the example above), then it takes precedence.
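The search-path idea can be sketched as follows. To stay within the standard library, this example assumes the files are JSON rather than YAML. The override variable name `X_CONF` is hypothetical, and merging key by key (rather than replacing the whole file) is one possible reading of "overrides".

```python
import json
import os
import tempfile


def search_path(component, environ=os.environ):
    """Candidate config files, lowest precedence first."""
    return [
        f"/etc/{component}/{component}.conf",                    # system default
        os.path.expanduser(f"~/.{component}/{component}.conf"),  # per-user file
        environ.get(f"{component.upper()}_CONF"),                # hypothetical override variable
    ]


def load_config(paths):
    """Merge the files that exist; later files override earlier keys."""
    config = {}
    for path in paths:
        if path and os.path.isfile(path):
            with open(path) as f:
                config.update(json.load(f))
    return config


# Demonstration with temporary files standing in for the real locations.
with tempfile.TemporaryDirectory() as tmp:
    system_conf = os.path.join(tmp, "system.conf")
    user_conf = os.path.join(tmp, "user.conf")
    with open(system_conf, "w") as f:
        json.dump({"delay": 3, "retries": 5}, f)
    with open(user_conf, "w") as f:
        json.dump({"delay": 1}, f)
    merged = load_config([system_conf, user_conf, None])

print(merged)
```

In a real program the call would be `load_config(search_path("X"))`. Missing files and an unset override variable are simply skipped, so the program still starts with whatever configuration is available.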

Configuration files are handy, but they have to be managed carefully. There are several issues that can arise with configuration files, especially when dealing with large deployments.

The first issue is, where do you keep them? Some configuration files contain sensitive information such as user names, passwords, API keys, etc. Storing them in plaintext in your version-control system may not be appropriate.

Another issue is that different environments may require different configurations, so while there may be a single version of the code running in each environment (dev, staging, production), there may be different configuration files per environment or even per machine. This issue exists with other forms of configuration like command-line arguments and environment variables, too, but the difference is that people often assume that you can just treat configuration files like source code files.

Another issue with very modular systems is that there may be many configuration files for different components/modules/libraries. A single program may be composed of multiple components and each one may have its own configuration file. There may be other programs using the same component, but with a different configuration. For example, consider a logging package, where you need to provide a log filename and other information via a configuration file. Multiple programs will need to use this logging component and each one may want to have its own log filename and other parameters (such as minimal log level). In large systems composed of multiple programs that share multiple configurable components, this can lead to a combinatorial explosion of configuration files, which is a nightmare to manage.

Distributed Configuration

Distributed configuration is an approach for managing configuration in a distributed system. It addresses many of the problems associated with using configuration files. The concept is that the configuration data is stored in a centrally available store. You could roll your own using a database or a shared file system, but it is probably in your best interest to use an existing distributed configuration store. A distributed configuration store has an API and/or client libraries you can use from anywhere to get and set configuration information (if you have access). In this article, I'll use etcd as a distributed configuration store.

The benefits are that a single store can be globally maintained and provide a solid foundation to solve hard problems like synchronization, rolling updates, and versioning. Because the configuration store is a critical component of the system, you have to make sure it is sufficiently available and redundant.

Combining Options

Sometimes configuration uses a combination of multiple mechanisms: some distributed configuration, some configuration files, some environment variables, and some command-line arguments. In reality, every nontrivial enterprise with multiple programs and services running is pretty much guaranteed to be combining options. There are several reasons for this situation:

  • Some organizations don't have an explicit policy regarding configuration and every developer or team comes up with their own solution.
  • Different mechanisms are appropriate in different situations as discussed earlier.
  • Legacy code has staying power. Even if there is a good policy in place it doesn't mean that all existing software should be modified.
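When several mechanisms feed the same program, the practical question is which one wins. One way to make that explicit is a layered lookup, sketched here with the standard library's ChainMap. The precedence order (command line, then environment, then file, then built-in defaults) and all the setting names and values are assumptions for illustration.

```python
from collections import ChainMap


def effective_config(cli, env, file_conf, defaults):
    """Layer the sources; the first mapping that has a key wins."""
    return ChainMap(cli, env, file_conf, defaults)


defaults = {"delay": 0.0, "retries": 3, "port": 8080}
file_conf = {"delay": 3.0, "port": 9000}   # e.g. loaded from a config file
env_conf = {"delay": 1.5}                  # e.g. read from prefixed variables
cli_conf = {}                              # nothing given on the command line

config = effective_config(cli_conf, env_conf, file_conf, defaults)
print(config["delay"], config["port"], config["retries"])
```

Each source stays a separate mapping, so it remains possible to tell where a value came from, which helps when debugging a surprising setting.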

Distributed Configuration vs. One Big Configuration File

Some people have a natural aversion to distributed configuration, even for distributed systems. The argument goes like this: We have a distributed system and we can already deploy code reliably to N servers. Why can't we just have one big configuration file per application, and whenever needed, just deploy it to all the N servers the application is running on?

There are several drawbacks to this approach:

  • A big configuration update can partially fail, leaving some servers running the old configuration and others running the new one.
  • Some configuration settings (such as a database URL) are shared by multiple applications. You will need to duplicate these settings in the big configuration file of each application.
  • Different environments (dev, staging, production) will differ in some of their settings (for example, that same database URL). You will have to maintain different configuration files for each environment.

If you have three or four environments and tens or hundreds of applications that share different subsets of their configuration, things become very cumbersome.

