Parsing XML Files in .NET Using C#

Parsing XML files is an unglamorous task that can be time consuming and tricky. In the days before .NET, programmers were forced to read XML as a text file line by line and then use string functions and possibly regular expressions. This is a time-consuming and error-prone process, and just not very much fun.

While I was writing .NET test automation that had test case data stored in XML files, I discovered that the .NET Framework provides powerful new ways of parsing XML. But in conversations with colleagues, I also discovered that there are a variety of opinions on which way of parsing XML files is the best.

I set out to determine how many different ways there are to parse XML using .NET and to understand the pros and cons of each technique. After some experimentation, I learned that there are five fundamentally different ways to parse XML, and that the "best" method depends both on the particular development situation you are in and on the style of programming you prefer.

In the sections that follow, I will demonstrate how to parse a testCases.xml file using five different techniques. Each technique is based on a different .NET Framework class and its associated methods:

  • XmlTextReader
  • XmlDocument
  • XPathDocument
  • XmlSerializer
  • DataSet

I explain each technique in enough detail that you can modify my examples to suit your needs, and then give guidance on which technique to use in which situation. Knowing these five methods for parsing XML files will be a valuable addition to your .NET skill set. I'm assuming that you're familiar with C#, VS.NET, and the creation and use of class libraries, and that you have a working knowledge of XML files.

The XML File to Parse and the Goal

Let's examine the testCases.xml file that we will use for all five parsing examples. The file contents are shown in Listing One.

Listing One: XML file to parse

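<?xml version="1.0" encoding="utf-8" ?>
<suite>
  <testcase id="001" kind="bvt">
    <inputs>
      <arg1>...</arg1>
      <arg2>...</arg2>
    </inputs>
    <expected>...</expected>
  </testcase>
  <testcase id="002" kind="drt">
    <inputs>
      <arg1>...</arg1>
      <arg2>...</arg2>
    </inputs>
    <expected>...</expected>
  </testcase>
  <testcase id="003" kind="bvt">
    <inputs>
      <arg1>...</arg1>
      <arg2>...</arg2>
    </inputs>
    <expected>...</expected>
  </testcase>
</suite>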

Note that each of the three test cases has five data items: id, kind, arg1, arg2, and expected. Some of the data is stored as XML attributes (id and kind), and arg1 and arg2 are stored as XML elements two levels deep relative to the root node (suite). Extracting attribute data and dealing with nested elements are key tasks regardless of which parsing strategy we use.

The goal is to parse our XML test cases file and extract the data into memory in a form that we can use easily. The memory structure we will use for four of the five parsing methods is shown in Listing Two. (The method that employs an XmlSerializer object requires a slightly different memory structure and will be presented later.)

Listing Two: CommonLib.dll definitions

using System;
using System.Collections;

namespace CommonLib
{
  // Holds the five data items of a single test case
  public class TestCase
  {
    public string id;
    public string kind;
    public string arg1;
    public string arg2;
    public string expected;
  } // class TestCase

  // Holds a collection of TestCase objects and can display them
  public class Suite
  {
    public ArrayList items = new ArrayList();
    public void Display()
    {
      foreach (TestCase tc in items)
      {
        Console.Write(tc.id + " " + tc.kind + " " + tc.arg1 + " ");
        Console.WriteLine(tc.arg2 + " " + tc.expected);
      }
    }
  } // class Suite
} // ns CommonLib

Because four of the five techniques will use these definitions, for convenience we can put the code in a .NET class library named "CommonLib." A TestCase object will hold the five data parts of each test case, and a Suite object will hold a collection of TestCase objects and provide a way to display it.
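As a rough illustration of the memory structure just described, the sketch below fills a Suite by hand with one TestCase and displays it. The two classes are repeated from Listing Two (without the namespace) so that the sketch compiles on its own. SuiteDemo and BuildSample are hypothetical names, and the arg1, arg2, and expected values are invented for illustration; they are not taken from testCases.xml.

```csharp
using System;
using System.Collections;

// Same shape as the CommonLib definitions in Listing Two
public class TestCase
{
    public string id;
    public string kind;
    public string arg1;
    public string arg2;
    public string expected;
}

public class Suite
{
    public ArrayList items = new ArrayList();

    public void Display()
    {
        foreach (TestCase tc in items)
        {
            Console.Write(tc.id + " " + tc.kind + " " + tc.arg1 + " ");
            Console.WriteLine(tc.arg2 + " " + tc.expected);
        }
    }
}

// Hypothetical helper class, not part of CommonLib
public static class SuiteDemo
{
    public static Suite BuildSample()
    {
        Suite s = new Suite();
        TestCase tc = new TestCase();
        tc.id = "001";       // attribute values as in the first test case
        tc.kind = "bvt";
        tc.arg1 = "4";       // invented values, for illustration only
        tc.arg2 = "7";
        tc.expected = "11";
        s.items.Add(tc);     // a parser would do this once per <testcase>
        return s;
    }

    public static void Main()
    {
        BuildSample().Display(); // prints "001 bvt 4 7 11"
    }
}
```

Each parsing technique has the same job: create one TestCase per testcase element, fill in its five fields, and add it to the items collection.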

Once the XML data is parsed and stored, the result can be represented as shown in Figure 1. The data can now be easily accessed and manipulated.

Figure 1 XML data stored in memory

Parsing XML with XmlTextReader

Of the five ways to parse an XML file, the most traditional technique is to use the XmlTextReader class. The example code is shown in Listing Three.

Listing Three: Parsing XML using XmlTextReader

using System;
using System.Xml;
using CommonLib;

namespace Run
{
  class Class1
  {
    static void Main(string[] args)
    {
      CommonLib.Suite s = new CommonLib.Suite();
      XmlTextReader xtr = new XmlTextReader("..\\..\\..\\..\\testCases.xml");
      xtr.WhitespaceHandling = WhitespaceHandling.None;
      xtr.Read(); // read the XML declaration node; the <suite> tag is next

      while (!xtr.EOF) //load loop
      {
        if (xtr.Name == "suite" && !xtr.IsStartElement()) break;

        while (xtr.Name != "testcase" || !xtr.IsStartElement() )
          xtr.Read(); // advance to <testcase> tag

        CommonLib.TestCase tc = new CommonLib.TestCase();
        tc.id = xtr.GetAttribute("id");
        tc.kind = xtr.GetAttribute("kind");
        xtr.Read(); // advance to <inputs> tag
        xtr.Read(); // advance to <arg1> tag
        tc.arg1 = xtr.ReadElementString("arg1"); // consumes the </arg1> tag
        tc.arg2 = xtr.ReadElementString("arg2"); // consumes the </arg2> tag
        xtr.Read(); // advance to <expected> tag
        tc.expected = xtr.ReadElementString("expected"); // consumes the </expected> tag
        // we are now at a </testcase> tag
        s.items.Add(tc); // store the completed test case in the suite
        xtr.Read(); // and now either at <testcase> tag or </suite> tag
      } // load loop

      xtr.Close(); // release the file
      s.Display(); // show the suite of TestCases

    } // Main()
  } // class Class1
} // ns Run

After creating a new C# Console Application Project in Visual Studio .NET, we add a Project Reference to the CommonLib.dll file that contains definitions for TestCase and Suite classes. We start by creating a Suite object to hold the XML data and an XmlTextReader object to parse the XML file.

The key to understanding this technique is to understand the Read() and ReadElementString() methods of XmlTextReader. To an XmlTextReader object, an XML file is a sequence of nodes. For example,

<?xml version="1.0" ?>
<foo>
  <bar>99</bar>
</foo>

has 6 nodes: the XML declaration, <foo>, <bar>, 99, </bar>, and </foo>.

The Read() method advances one node at a time. Unlike many Read() methods in other classes, System.Xml.XmlTextReader.Read() does not return the data it reads; it returns only a bool that indicates whether another node was read. The ReadElementString() method, on the other hand, returns the text between the begin and end tags of the element named by its argument, and advances to the next node after the end tag. Because XML attributes are not nodes in this sequence, we have to extract attribute data using the GetAttribute() method.
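To make the node sequence concrete, here is a minimal sketch that walks the six-node example with Read() and records what the reader sees at each stop. NodeWalk and Describe are hypothetical names, the XML is supplied through a StringReader rather than a file, and List<string> assumes .NET 2.0 or later.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Xml;

public static class NodeWalk
{
    // Returns one "NodeType|Name|Value" entry for each node Read() visits.
    public static List<string> Describe(string xml)
    {
        List<string> nodes = new List<string>();
        XmlTextReader xtr = new XmlTextReader(new StringReader(xml));
        xtr.WhitespaceHandling = WhitespaceHandling.None;

        // Read() returns true while there is another node to visit
        while (xtr.Read())
        {
            nodes.Add(xtr.NodeType + "|" + xtr.Name + "|" + xtr.Value);
        }

        xtr.Close();
        return nodes;
    }

    public static void Main()
    {
        string xml = "<?xml version=\"1.0\" ?><foo><bar>99</bar></foo>";
        foreach (string node in Describe(xml))
        {
            Console.WriteLine(node);
        }
    }
}
```

The reader reports the begin tag, the text, and the end tag of an element as three separate nodes, which is why the listing needs so many positioning calls.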

Figure 2 shows the output of running this program. You can see that we have successfully parsed the data from testCases.xml into memory.

Figure 2 Output from the XmlTextReader technique

The statement xtr.WhitespaceHandling = WhitespaceHandling.None; is important because without it you would have to Read() over newline characters and blank lines.
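The effect of that setting can be checked with a small sketch that counts the nodes Read() visits under each WhitespaceHandling value. WhitespaceDemo and CountNodes are hypothetical names, and the indented sample document is an assumption made for illustration.

```csharp
using System;
using System.IO;
using System.Xml;

public static class WhitespaceDemo
{
    // Counts how many nodes Read() visits under the given whitespace setting.
    public static int CountNodes(string xml, WhitespaceHandling handling)
    {
        XmlTextReader xtr = new XmlTextReader(new StringReader(xml));
        xtr.WhitespaceHandling = handling;

        int count = 0;
        while (xtr.Read())
        {
            count++;
        }

        xtr.Close();
        return count;
    }

    public static void Main()
    {
        // Indented the way a hand-edited XML file usually is
        string xml = "<?xml version=\"1.0\" ?>\n<foo>\n  <bar>99</bar>\n</foo>\n";

        Console.WriteLine("None: " + CountNodes(xml, WhitespaceHandling.None));
        Console.WriteLine("All:  " + CountNodes(xml, WhitespaceHandling.All));
    }
}
```

With WhitespaceHandling.None the count is the six nodes listed earlier. With WhitespaceHandling.All it is larger, because the line breaks and indentation between tags are reported as nodes of their own, and a fixed sequence of Read() calls would land on the wrong node.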

The main loop control structure that I used is not elegant but is more readable than the alternatives:

while (!xtr.EOF) //load loop
  if (xtr.Name == "suite" && !xtr.IsStartElement()) break;

It exits when we are at EOF or a </suite> tag.

When marching through the XML file, you can either Read() your way one node at a time or get a bit more sophisticated with code like the following:

while (xtr.Name != "testcase" || !xtr.IsStartElement() )
  xtr.Read(); // advance to <testcase> tag

The choice of technique you use is purely a matter of style.
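For comparison, the one-node-at-a-time style might look like the sketch below, which uses a single Read() loop and decides what to do from the name of the current element. It is not the code from Listing Three. OneNodeAtATime and Parse are hypothetical names, the TestCase class is repeated so the sketch stands alone, the values in the sample document are invented, and the sketch assumes well-formed input in which arg1, arg2, and expected only appear inside a testcase element.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Xml;

public class TestCase
{
    public string id;
    public string kind;
    public string arg1;
    public string arg2;
    public string expected;
}

public static class OneNodeAtATime
{
    public static List<TestCase> Parse(TextReader input)
    {
        List<TestCase> items = new List<TestCase>();
        TestCase tc = null; // the test case currently being filled in

        XmlTextReader xtr = new XmlTextReader(input);
        xtr.WhitespaceHandling = WhitespaceHandling.None;

        while (xtr.Read())
        {
            if (xtr.NodeType == XmlNodeType.Element)
            {
                switch (xtr.Name)
                {
                    case "testcase":
                        tc = new TestCase();
                        tc.id = xtr.GetAttribute("id");
                        tc.kind = xtr.GetAttribute("kind");
                        break;
                    // ReadString() returns the element text and stops on
                    // the end tag, so the next Read() skips nothing.
                    case "arg1": tc.arg1 = xtr.ReadString(); break;
                    case "arg2": tc.arg2 = xtr.ReadString(); break;
                    case "expected": tc.expected = xtr.ReadString(); break;
                }
            }
            else if (xtr.NodeType == XmlNodeType.EndElement && xtr.Name == "testcase")
            {
                items.Add(tc); // </testcase> closes the current test case
            }
        }

        xtr.Close();
        return items;
    }

    public static void Main()
    {
        // Invented sample data with the same shape as testCases.xml
        string xml =
            "<?xml version=\"1.0\" encoding=\"utf-8\" ?>" +
            "<suite>" +
            "<testcase id=\"001\" kind=\"bvt\">" +
            "<inputs><arg1>4</arg1><arg2>7</arg2></inputs>" +
            "<expected>11</expected>" +
            "</testcase>" +
            "</suite>";

        foreach (TestCase t in Parse(new StringReader(xml)))
        {
            Console.WriteLine(t.id + " " + t.kind + " " + t.arg1 + " " + t.arg2 + " " + t.expected);
        }
    }
}
```

This version does not depend on the exact order of Read() calls, at the cost of a switch statement and a variable that tracks the current test case.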

Parsing an XML file with XmlTextReader has a traditional, pre-.NET feel. You walk sequentially through the file using Read(), and extract data with ReadElementString() and GetAttribute(). Using XmlTextReader is straightforward and effective and is appropriate when the structure of your XML file is relatively simple and consistent. Compared to other techniques we will see in this article, XmlTextReader operates at a lower level of abstraction, meaning it is up to you as a programmer to keep track of where you are in the XML file and Read() correctly.
