Dr. Dobb's | Parsing XML Files in .NET Using C#

Parsing XML Files in .NET Using C#

The .NET Framework provides several ways to extract data from an XML file into memory. We'll demonstrate the best uses of five fundamentally different techniques.

July 01, 2003
URL:http://www.drdobbs.com/windows/continuous-linq/windows/parsing-xml-files-in-net-using-c/184416669


Parsing XML files is an unglamorous task that can be time-consuming and tricky. In the days before .NET, programmers were forced to read XML as a text file line by line and then use string functions and possibly regular expressions. This was a time-consuming and error-prone process, and just not very much fun.

While I was writing .NET test automation that had test case data stored in XML files, I discovered that the .NET Framework provides powerful new ways of parsing XML. But in conversations with colleagues, I also discovered that there are a variety of opinions on which way of parsing XML files is the best.

I set out to determine how many different ways there are to parse XML using .NET and to understand the pros and cons of each technique. After some experimentation, I learned that there are five fundamentally different ways to parse XML, and that the "best" method depends both on the particular development situation you are in and on the style of programming you prefer.

In the sections that follow, I will demonstrate how to parse a testCases.xml file using five different techniques. Each technique is based on a different .NET Framework class and its associated methods:

XmlTextReader
XmlDocument
XPathDocument
XmlSerializer
DataSet

After I explain each technique so you can modify my examples to suit your needs, I will give you guidance on which technique should be used in which situation. Knowing these five methods for parsing XML files will be a valuable addition to your .NET skill set. I'm assuming that you're familiar with C#, VS.NET, and the creation and use of class libraries, and that you have a working knowledge of XML files.

The XML File to Parse and the Goal

Let's examine the testCases.xml file that we will use for all five parsing examples. The file contents are shown in Listing One.

Listing One: XML file to parse


<?xml version="1.0" encoding="utf-8" ?> 
<suite>

  <testcase id="001" kind="bvt">
    <inputs>
      <arg1>4</arg1>
      <arg2>7</arg2>
    </inputs>
    <expected>11.00</expected>
  </testcase>

  <testcase id="002" kind="drt">
    <inputs>
      <arg1>9</arg1>
      <arg2>6</arg2>
    </inputs>
    <expected>15.00</expected>
  </testcase>

  <testcase id="003" kind="bvt">
    <inputs>
      <arg1>5</arg1>
      <arg2>8</arg2>
    </inputs>
    <expected>13.00</expected>
  </testcase>
</suite>

Note that each of the three test cases has five data items: id, kind, arg1, arg2, and expected. Two of them (id and kind) are stored as XML attributes of <testcase>, expected is a child element of <testcase>, and arg1 and arg2 are elements nested one level further down, inside <inputs>. Extracting attribute data and dealing with nested elements are key tasks regardless of which parsing strategy we use.

The goal is to parse our XML test cases file and extract the data into memory in a form that we can use easily. The memory structure we will use for four of the five parsing methods is shown in Listing Two. (The method that employs an XmlSerializer object requires a slightly different memory structure and will be presented later.)

Listing Two: CommonLib.dll definitions


using System;
using System.Collections;

namespace CommonLib
{
  public class TestCase
  {
    public string id;
    public string kind;
    public string arg1;
    public string arg2;
    public string expected;
  }  

  public class Suite
  {
    public ArrayList items = new ArrayList();
    public void Display()
    {
      foreach (TestCase tc in items)
      {
        Console.Write(tc.id + " " + tc.kind + " " + tc.arg1 + " ");
        Console.WriteLine(tc.arg2 + " " + tc.expected);
      }
    }
  } // class Suite
} // ns

Because four of the five techniques will use these definitions, for convenience we can put the code in a .NET class library named "CommonLib." A TestCase object will hold the five data parts of each test case, and a Suite object will hold a collection of TestCase objects and provide a way to display it.

Once the XML data is parsed and stored, the result can be represented as shown in Figure 1. The data can now be easily accessed and manipulated.

Figure 1 XML data stored in memory

Parsing XML with XmlTextReader

Of the five ways to parse an XML file, the most traditional technique is to use the XmlTextReader class. The example code is shown in Listing Three.

Listing Three: Parsing XML using XmlTextReader


using System;
using System.Xml;
using CommonLib;

namespace Run
{
  class Class1
  {
    [STAThread]
    static void Main(string[] args)
    {
      CommonLib.Suite s = new CommonLib.Suite();
            
      XmlTextReader xtr = new XmlTextReader("..\\..\\..\\..\\testCases.xml");
      xtr.WhitespaceHandling = WhitespaceHandling.None;
      xtr.Read(); // move to the first node (the XML declaration); the inner loop below advances to <testcase>

      while (!xtr.EOF) //load loop
      {
        if (xtr.Name == "suite" && !xtr.IsStartElement()) break;

        while (xtr.Name != "testcase" || !xtr.IsStartElement() ) 
          xtr.Read(); // advance to <testcase> tag

        CommonLib.TestCase tc = new CommonLib.TestCase();
        tc.id = xtr.GetAttribute("id");
        tc.kind = xtr.GetAttribute("kind");
        xtr.Read(); // advance to <inputs> tag
        xtr.Read(); // advance to <arg1> tag
        tc.arg1 = xtr.ReadElementString("arg1"); // consumes the </arg1> tag
        tc.arg2 = xtr.ReadElementString("arg2"); // consumes the </arg2> tag
        xtr.Read(); // advance to <expected> tag
        tc.expected = xtr.ReadElementString("expected"); // consumes the </expected> tag
        // we are now at a </testcase> tag
        s.items.Add(tc);
        xtr.Read(); // and now either at <testcase> tag or </suite> tag
      } // load loop

      xtr.Close();
      s.Display(); // show the suite of TestCases

    } // Main()
  } // class Class1
 } // ns Run

After creating a new C# Console Application Project in Visual Studio .NET, we add a Project Reference to the CommonLib.dll file that contains definitions for TestCase and Suite classes. We start by creating a Suite object to hold the XML data and an XmlTextReader object to parse the XML file.

The key to understanding this technique is to understand the Read() and ReadElementString() methods of XmlTextReader. To an XmlTextReader object, an XML file is a sequence of nodes. For example,

<?xml version="1.0" ?>
<foo>
  <bar>99</bar>
</foo>

has 6 nodes: the XML declaration, <foo>, <bar>, 99, </bar>, and </foo>.

The Read() method advances one node at a time. Unlike many Read() methods in other classes, System.Xml.XmlTextReader.Read() does not return the data it reads; it returns only a bool that indicates whether another node was read. The ReadElementString() method, on the other hand, returns the text between the begin and end tags of the element named by its argument, and advances to the next node after the end tag. Because Read() does not stop on XML attributes, we have to extract attribute data using the GetAttribute() method.
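To see the node sequence for yourself, a minimal sketch along these lines lists every node that Read() stops on. It assumes the small <foo> document above is supplied as an in-memory string rather than a file, and NodeDump is a hypothetical name used only for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Xml;

public static class NodeDump
{
    // Returns one "NodeType:Name=Value" string per node that Read() stops on.
    public static List<string> Dump(string xml)
    {
        List<string> nodes = new List<string>();
        XmlTextReader xtr = new XmlTextReader(new StringReader(xml));
        xtr.WhitespaceHandling = WhitespaceHandling.None;
        while (xtr.Read()) // Read() returns false once there are no more nodes
        {
            nodes.Add(xtr.NodeType + ":" + xtr.Name + "=" + xtr.Value);
        }
        xtr.Close();
        return nodes;
    }

    public static void Main()
    {
        // Assumption: the <foo> fragment is held in a string instead of a file.
        string xml = "<?xml version=\"1.0\" ?>\n<foo>\n  <bar>99</bar>\n</foo>";
        foreach (string line in Dump(xml))
            Console.WriteLine(line);
    }
}
```

Run against the <foo> fragment, this should report six nodes, with the text 99 appearing as its own Text node between the <bar> start and end tags. Attributes never show up in the list, which is why GetAttribute() is needed.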

Figure 2 shows the output of running this program. You can see that we have successfully parsed the data from testCases.xml into memory.

Figure 2 Output from the XmlTextReader technique

The statement xtr.WhitespaceHandling = WhitespaceHandling.None; is important because without it you would have to Read() over newline characters and blank lines.
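The effect of that setting can be checked with a small sketch that counts the nodes Read() visits under each whitespace mode. The class name WhitespaceSketch and the inline sample document are assumptions made for illustration.

```csharp
using System;
using System.IO;
using System.Xml;

public static class WhitespaceSketch
{
    // Counts how many nodes Read() stops on for the given whitespace setting.
    public static int CountNodes(string xml, WhitespaceHandling handling)
    {
        XmlTextReader xtr = new XmlTextReader(new StringReader(xml));
        xtr.WhitespaceHandling = handling;
        int count = 0;
        while (xtr.Read())
            count++;
        xtr.Close();
        return count;
    }

    public static void Main()
    {
        // Assumption: a small indented document held in a string.
        string xml = "<?xml version=\"1.0\" ?>\n<foo>\n  <bar>99</bar>\n</foo>";
        Console.WriteLine("None: " + CountNodes(xml, WhitespaceHandling.None));
        Console.WriteLine("All:  " + CountNodes(xml, WhitespaceHandling.All));
    }
}
```

With WhitespaceHandling.All, the newlines and indentation between tags are reported as extra nodes, so the second count should be larger than the first.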

The main loop control structure that I used is not elegant but is more readable than the alternatives:

while (!xtr.EOF) //load loop
{
  if (xtr.Name == "suite" && !xtr.IsStartElement()) break;

It exits when we are at EOF or a </suite> tag.

When marching through the XML file, you can either Read() your way one node at a time or get a bit more sophisticated with code like the following:

while (xtr.Name != "testcase" || !xtr.IsStartElement())
  xtr.Read(); // advance to <testcase> tag

The choice of technique you use is purely a matter of style.
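For comparison, here is a rough sketch of the pure one-node-at-a-time style, which inspects NodeType on every Read() instead of relying on ReadElementString(). It is not the listing's approach, only an alternative under stated assumptions: the XML arrives as a string, rows are returned as display strings rather than TestCase objects, and ReadLoopSketch is a hypothetical name.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Xml;

public static class ReadLoopSketch
{
    // Assumption: a two-case sample in the testCases.xml shape, using single-quoted attributes.
    public const string Sample =
        "<suite>" +
        "<testcase id='001' kind='bvt'><inputs><arg1>4</arg1><arg2>7</arg2></inputs><expected>11.00</expected></testcase>" +
        "<testcase id='002' kind='drt'><inputs><arg1>9</arg1><arg2>6</arg2></inputs><expected>15.00</expected></testcase>" +
        "</suite>";

    public static List<string> Parse(string xml)
    {
        List<string> rows = new List<string>();
        string id = "", kind = "", arg1 = "", arg2 = "", expected = "";
        string current = ""; // name of the most recently opened element
        XmlTextReader xtr = new XmlTextReader(new StringReader(xml));
        xtr.WhitespaceHandling = WhitespaceHandling.None;
        while (xtr.Read())
        {
            switch (xtr.NodeType)
            {
                case XmlNodeType.Element:
                    current = xtr.Name;
                    if (current == "testcase")
                    {
                        id = xtr.GetAttribute("id");
                        kind = xtr.GetAttribute("kind");
                    }
                    break;
                case XmlNodeType.Text: // text belongs to the element opened just before it
                    if (current == "arg1") arg1 = xtr.Value;
                    else if (current == "arg2") arg2 = xtr.Value;
                    else if (current == "expected") expected = xtr.Value;
                    break;
                case XmlNodeType.EndElement:
                    if (xtr.Name == "testcase") // one complete test case has been seen
                        rows.Add(id + " " + kind + " " + arg1 + " " + arg2 + " " + expected);
                    break;
            }
        }
        xtr.Close();
        return rows;
    }

    public static void Main()
    {
        foreach (string row in Parse(Sample))
            Console.WriteLine(row);
    }
}
```

This style tolerates elements appearing in a different order, at the cost of carrying more state between calls to Read().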

Parsing an XML file with XmlTextReader has a traditional, pre-.NET feel. You walk sequentially through the file using Read(), and extract data with ReadElementString() and GetAttribute(). Using XmlTextReader is straightforward and effective and is appropriate when the structure of your XML file is relatively simple and consistent. Compared to other techniques we will see in this article, XmlTextReader operates at a lower level of abstraction, meaning it is up to you as a programmer to keep track of where you are in the XML file and Read() correctly.

Parsing XML with XmlDocument

The second of five ways to parse an XML file is to use the XmlDocument class. The example code is shown in Listing Four.

Listing Four: Parsing XML using XmlDocument


using System;
using System.Xml;
using CommonLib;

namespace Run
{
  class Class1
  {
    [STAThread]
    static void Main(string[] args)
    {
      CommonLib.Suite s = new CommonLib.Suite();

      XmlDocument xd = new XmlDocument();
      xd.Load("..\\..\\..\\..\\testCases.xml");
      
      XmlNodeList nodelist = xd.SelectNodes("/suite/testcase"); // get all <testcase> nodes

      foreach (XmlNode node in nodelist) // for each <testcase> node
      {
        CommonLib.TestCase tc = new CommonLib.TestCase();
        
        tc.id = node.Attributes.GetNamedItem("id").Value;
        tc.kind = node.Attributes.GetNamedItem("kind").Value;

        XmlNode n = node.SelectSingleNode("inputs"); // get the one <inputs> node
        tc.arg1 = n.ChildNodes.Item(0).InnerText;
        tc.arg2 = n.ChildNodes.Item(1).InnerText;

        tc.expected = node.ChildNodes.Item(1).InnerText;

        s.items.Add(tc);
      } // foreach <testcase> node
      
      s.Display();

    } // Main()
  } // class Class1

} // ns Run

XmlDocument objects are based on the notion of XML nodes and child nodes. Instead of sequentially navigating through a file, we select sets of nodes with the SelectNodes() method or individual nodes with the SelectSingleNode() method. Notice that because XML attributes are not child nodes of an element, we must get their data through the node's Attributes collection, here with Attributes.GetNamedItem().

After loading the XmlDocument, we fetch all the test case nodes at once with:

XmlNodeList nodelist = xd.SelectNodes("/suite/testcase");

Then we iterate through this list of nodes and fetch each <inputs> node with:

XmlNode n = node.SelectSingleNode("inputs");

and then extract the arg1 (and similarly arg2) value using:

tc.arg1 = n.ChildNodes.Item(0).InnerText;

In this statement, n is the <inputs> node; ChildNodes.Item(0) is the first child element of <inputs>, i.e., <arg1>; and InnerText is the value between <arg1> and </arg1>.
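Selecting children by position works here because the file layout is fixed. As a hedged alternative, a sketch like the following selects the same values by name with relative XPath expressions, so it does not depend on child order. It assumes the XML is supplied as a string, and XmlDocumentSketch is a hypothetical name.

```csharp
using System;
using System.Collections.Generic;
using System.Xml;

public static class XmlDocumentSketch
{
    // Assumption: a two-case sample in the testCases.xml shape, using single-quoted attributes.
    public const string Sample =
        "<suite>" +
        "<testcase id='001' kind='bvt'><inputs><arg1>4</arg1><arg2>7</arg2></inputs><expected>11.00</expected></testcase>" +
        "<testcase id='002' kind='drt'><inputs><arg1>9</arg1><arg2>6</arg2></inputs><expected>15.00</expected></testcase>" +
        "</suite>";

    public static List<string> Parse(string xml)
    {
        List<string> rows = new List<string>();
        XmlDocument xd = new XmlDocument();
        xd.LoadXml(xml); // LoadXml parses a string; Load reads a file or stream
        foreach (XmlNode node in xd.SelectNodes("/suite/testcase"))
        {
            string id = node.Attributes["id"].Value;
            string kind = node.Attributes["kind"].Value;
            // Relative XPath: evaluated from the current <testcase> node.
            string arg1 = node.SelectSingleNode("inputs/arg1").InnerText;
            string arg2 = node.SelectSingleNode("inputs/arg2").InnerText;
            string expected = node.SelectSingleNode("expected").InnerText;
            rows.Add(id + " " + kind + " " + arg1 + " " + arg2 + " " + expected);
        }
        return rows;
    }

    public static void Main()
    {
        foreach (string row in Parse(Sample))
            Console.WriteLine(row);
    }
}
```

Note that SelectSingleNode() returns null when nothing matches, so production code would check the result before reading InnerText.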

The output from running this program is shown in Figure 3. Notice it is identical to the output from running the XmlTextReader technique and, in fact, all the other techniques presented in this article.

Figure 3 Output from the XmlDocument technique

The XmlDocument class is modeled on the W3C XML Document Object Model and has a different feel to it than many .NET Framework classes that you are familiar with. Using the XmlDocument class is appropriate if you need to extract data in a nonsequential manner, or if you are already using XmlDocument objects and want to maintain a consistent look and feel to your application's code.

Let me note that in discussions with my colleagues, there was often some confusion about the role of the XmlDataDocument class. It is derived from the XmlDocument class and is intended for use in conjunction with DataSet objects. So, in this example, you could use the XmlDataDocument class but would not gain anything.

Parsing XML with XPathDocument

The third technique to parse an XML file is to use the XPathDocument class. The example code is shown in Listing Five.

Listing Five: Parsing XML using XPathDocument


using System;
using System.Xml.XPath;
using CommonLib;

namespace Run
{
  class Class1
  {
    [STAThread]
    static void Main(string[] args)
    {
      CommonLib.Suite s = new CommonLib.Suite();

      XPathDocument xpd = new XPathDocument("..\\..\\..\\..\\testCases.xml");
      XPathNavigator xpn = xpd.CreateNavigator();
      XPathNodeIterator xpi = xpn.Select("/suite/testcase");
      
      while (xpi.MoveNext()) // each testcase node
      {
        CommonLib.TestCase tc = new CommonLib.TestCase();
        tc.id = xpi.Current.GetAttribute("id", xpn.NamespaceURI);
        tc.kind = xpi.Current.GetAttribute("kind", xpn.NamespaceURI);

        XPathNodeIterator tcChild = xpi.Current.SelectChildren(XPathNodeType.Element);
        while (tcChild.MoveNext()) // each part (<inputs> and <expected>) of <testcase>
        {
          if (tcChild.Current.Name == "inputs")
          {
            XPathNodeIterator tcSubChild = tcChild.Current.SelectChildren(XPathNodeType.Element);
            while (tcSubChild.MoveNext()) // each part (<arg1>, <arg2>) of <inputs>
            {
              if (tcSubChild.Current.Name == "arg1")
                tc.arg1 = tcSubChild.Current.Value;
              else if (tcSubChild.Current.Name  == "arg2")
                tc.arg2 = tcSubChild.Current.Value;
            }
          }
          else if (tcChild.Current.Name == "expected")
            tc.expected = tcChild.Current.Value;
        }
        s.items.Add(tc);

      } // each testcase node
      
      s.Display();
      
    } // Main()
  } // class Class1

} // ns Run

Using an XPathDocument object to parse XML has a hybrid feel that is part procedural (as in XmlTextReader) and part functional (as in XmlDocument). You can select parts of the document using the Select() method of an XPathNavigator object and also move through the document using the MoveNext() method of an XPathNodeIterator object.

After loading the XPathDocument object, we create an XPathNavigator and use it to select all the <testcase> nodes into an XPathNodeIterator object, which starts out positioned just before the first match:

XPathNavigator xpn = xpd.CreateNavigator();
XPathNodeIterator xpi = xpn.Select("/suite/testcase");

Because XPathDocument does not maintain "node identity," we must iterate through each <testcase> node with this loop:

while (xpi.MoveNext())

Similarly, we have to iterate through the children with:

while (tcChild.MoveNext())

The XPathDocument class is optimized for XPath data model queries. So using it is particularly appropriate when the XML file to parse is deeply nested or has a complex structure. You might also consider using XPathDocument if other parts of your application code use that class so that you maintain a consistent coding look and feel.
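Because the navigator can evaluate XPath expressions relative to its current node, the nested child loops can be replaced by direct queries. The sketch below is one possible shape under stated assumptions: the XML is passed as a string, the XPath string() function is used to pull text values, and XPathSketch is a hypothetical name.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Xml.XPath;

public static class XPathSketch
{
    // Assumption: a two-case sample in the testCases.xml shape, using single-quoted attributes.
    public const string Sample =
        "<suite>" +
        "<testcase id='001' kind='bvt'><inputs><arg1>4</arg1><arg2>7</arg2></inputs><expected>11.00</expected></testcase>" +
        "<testcase id='002' kind='drt'><inputs><arg1>9</arg1><arg2>6</arg2></inputs><expected>15.00</expected></testcase>" +
        "</suite>";

    public static List<string> Parse(string xml)
    {
        List<string> rows = new List<string>();
        XPathDocument xpd = new XPathDocument(new StringReader(xml));
        XPathNavigator xpn = xpd.CreateNavigator();
        XPathNodeIterator xpi = xpn.Select("/suite/testcase");
        while (xpi.MoveNext()) // each <testcase> node
        {
            XPathNavigator tc = xpi.Current;
            string id = tc.GetAttribute("id", "");   // "" means no namespace
            string kind = tc.GetAttribute("kind", "");
            // Evaluate() runs the expression with the current <testcase> as context node.
            string arg1 = (string)tc.Evaluate("string(inputs/arg1)");
            string arg2 = (string)tc.Evaluate("string(inputs/arg2)");
            string expected = (string)tc.Evaluate("string(expected)");
            rows.Add(id + " " + kind + " " + arg1 + " " + arg2 + " " + expected);
        }
        return rows;
    }

    public static void Main()
    {
        foreach (string row in Parse(Sample))
            Console.WriteLine(row);
    }
}
```

The trade-off is that each Evaluate() call parses an XPath string at run time; the explicit loops in Listing Five avoid that, while this version is shorter and closer to how the data is described.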

Parsing XML with XmlSerializer

The fourth technique we will use to parse an XML file is the XmlSerializer object. The example code is shown in Listing Six.

Listing Six: Parsing XML using XmlSerializer

using System;
using System.Xml.Serialization;
using System.IO;
using SerializerLib; // defines a Suite class compatible with testCases.xml

namespace Run
{
  class Class1
  {
    [STAThread]
    static void Main(string[] args)
    {
      XmlSerializer xs = new XmlSerializer(typeof(Suite));
      StreamReader sr = new StreamReader("..\\..\\..\\..\\testCases.xml");
      SerializerLib.Suite s = (SerializerLib.Suite)xs.Deserialize(sr);
      sr.Close();
      s.Display();
    } 
  } // class Class1
} // ns Run

Using the XmlSerializer class is significantly different from using any of the other classes because the in-memory data store is different from the CommonLib.Suite we used for all other examples. In fact, observe that pulling the XML data into memory is accomplished in a single statement:

SerializerLib.Suite s = (SerializerLib.Suite)xs.Deserialize(sr);

I created a class library named "SerializerLib" to hold the definition for a Suite class that corresponds to the testCases.xml file so that the XmlSerializer object can store the XML data into it. The trick, of course, is to set up this Suite class.

Creating the Suite class is done with the help of the xsd.exe command-line tool. You will find it in your Program Files\Microsoft Visual Studio .NET\FrameworkSDK\bin folder. I used xsd.exe to generate a Suite class and then modified it slightly by changing some names and adding a Display() method.

The screen shot in Figure 4 shows how I generated the file testCases.cs, which contains a Suite definition that you can use directly or modify as I did. Listings Seven and Eight show the classes generated by XSD and my modified classes in the SerializerLib library.

Figure 4 Generating testCases.cs definitions using XSD

Listing Seven: XSD-generated suite definition


// This source code was auto-generated by xsd, Version=1.0.3705.288.
// 
using System.Xml.Serialization;

[System.Xml.Serialization.XmlRootAttribute("suite", Namespace="", IsNullable=false)]
public class suite {
    [System.Xml.Serialization.XmlElementAttribute("testcase")]
    public suiteTestcase[] Items;
}

public class suiteTestcase {
    public string expected;
    [System.Xml.Serialization.XmlElementAttribute("inputs")]
    public suiteTestcaseInputs[] inputs;
    [System.Xml.Serialization.XmlAttributeAttribute()]
    public string id;
    [System.Xml.Serialization.XmlAttributeAttribute()]
    public string kind;
}

public class suiteTestcaseInputs {
    public string arg1;
    public string arg2;
}

Listing Eight: Modified suite definition

using System;
using System.Xml.Serialization;

namespace SerializerLib
{
  [XmlRootAttribute("suite")]
  public class Suite 
  {
    [XmlElementAttribute("testcase")]
    public TestCase[] items; // changed name from xsd-generated code
    public void Display() // added to xsd-generated code
    {
      foreach (TestCase tc in items)
      {
        Console.Write(tc.id + " " + tc.kind + " "  + tc.inputs.arg1 + " ");
        Console.WriteLine(tc.inputs.arg2 + " " + tc.expected);
      }
    }
  }

  public class TestCase  // changed name from xsd-generated code
  {
    [XmlAttributeAttribute()]
    public string id;
    [XmlAttributeAttribute()]
    public string kind;
    [XmlElementAttribute("inputs")]
    public Inputs inputs; // change from xsd-generated code: no array
    public string expected;
  }

  public class Inputs // changed name from xsd-generated code
  {
    public string arg1;
    public string arg2;
  }
}

Using the XmlSerializer class gives a very elegant solution to the problem of parsing an XML file. Compared with the other four techniques in this article, XmlSerializer operates at the highest level of abstraction, meaning that the algorithmic details are largely hidden from you. But this gives you less control over the parsing and lends an air of magic to the process.

Most of the code I write is test automation, and using XmlSerializer is my default technique for parsing XML. XmlSerializer is most appropriate for situations not covered by the other four techniques in this article: fine-grained control is not required, the application program does not use other XmlDocument objects, the XML file is not deeply nested, and the application is not primarily an ADO.NET application (as we will see in our next example).
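To experiment with the serializer without setting up a class library, the whole round trip can be sketched in one file. The classes below are trimmed-down copies of the Listing Eight definitions (without Display()), the XML is an inline string instead of testCases.xml, and SerializerSketch is a hypothetical name; treat it as a sketch rather than the article's project layout.

```csharp
using System;
using System.IO;
using System.Xml.Serialization;

[XmlRoot("suite")]
public class Suite
{
    [XmlElement("testcase")] // each <testcase> element becomes one array entry
    public TestCase[] items;
}

public class TestCase
{
    [XmlAttribute] public string id;
    [XmlAttribute] public string kind;
    public Inputs inputs;    // maps to the <inputs> child element
    public string expected;  // maps to the <expected> child element
}

public class Inputs
{
    public string arg1;
    public string arg2;
}

public static class SerializerSketch
{
    // Assumption: a two-case sample in the testCases.xml shape, using single-quoted attributes.
    public const string Sample =
        "<suite>" +
        "<testcase id='001' kind='bvt'><inputs><arg1>4</arg1><arg2>7</arg2></inputs><expected>11.00</expected></testcase>" +
        "<testcase id='002' kind='drt'><inputs><arg1>9</arg1><arg2>6</arg2></inputs><expected>15.00</expected></testcase>" +
        "</suite>";

    public static Suite Load(string xml)
    {
        XmlSerializer xs = new XmlSerializer(typeof(Suite));
        using (StringReader sr = new StringReader(xml))
        {
            return (Suite)xs.Deserialize(sr);
        }
    }

    public static void Main()
    {
        Suite s = Load(Sample);
        foreach (TestCase tc in s.items)
            Console.WriteLine(tc.id + " " + tc.kind + " " + tc.inputs.arg1 + " " +
                              tc.inputs.arg2 + " " + tc.expected);
    }
}
```

The attributes do all the mapping work: XmlAttribute marks fields stored as XML attributes, and unmarked public fields are matched to child elements of the same name.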

Parsing XML with DataSet

The fifth and final method we will use to parse an XML file into memory uses the DataSet class. The example code is shown in Listing Nine.

Listing Nine: Parsing XML using DataSet

using System;
using System.Xml;
using System.Data;
using CommonLib; // Suite class definition
using InfoLib; // DisplayInfo() method 

namespace Run
{
  class Class1
  {
    [STAThread]
    static void Main(string[] args)
    {
      DataSet ds = new DataSet();
      ds.ReadXml("..\\..\\..\\..\\testCases.xml");
    
      InfoLib.DataSetInfo.DisplayInfo(ds); // show table, column, relation names

      CommonLib.Suite s = new CommonLib.Suite();
      foreach (DataRow row in ds.Tables["testcase"].Rows)
      {
        CommonLib.TestCase tc = new CommonLib.TestCase();
        tc.id = row["id"].ToString();
        tc.kind = row["kind"].ToString();
        tc.expected = row["expected"].ToString();

        DataRow[] children = row.GetChildRows("testcase_inputs"); // relation name

        tc.arg1 = (children[0]["arg1"]).ToString(); // there is only 1 row in children
        tc.arg2 = (children[0]["arg2"]).ToString();
        
        s.items.Add(tc);
      }

      s.Display();
 
    } // Main()

  } // class Class1
} // ns

We start by reading the XML file directly into a System.Data.DataSet object using the ReadXml() method. A DataSet object can be thought of as an in-memory relational database. The XML data ends up in two tables, "testcase" and "inputs," that are related through a relation "testcase_inputs." The key to using this DataSet technique is knowing how to determine the way the XML data gets stored in the DataSet object.

Although we could create a custom DataSet object with completely known characteristics, it is much quicker to let the ReadXml() method do the work and then examine the result. I wrote a helper function DisplayInfo() that accepts a DataSet as an argument and displays the information we need to extract the data from the DataSet's tables.

To keep the main parse program uncluttered, I put DisplayInfo() into a class library named "InfoLib." The code is shown in Listing Ten. The output from running the parse program is shown in Figure 5.

Listing Ten: Code to display DataSet information


using System;
using System.Data;

namespace InfoLib
{
  public class DataSetInfo
  {
    public static void DisplayInfo(DataSet ds) // names of tables, columns, relations in ds
    {
      foreach (DataTable dt in ds.Tables)
      {
        Console.WriteLine("\n===============================================");
        Console.WriteLine("Table = " + dt.TableName + "\n");
        foreach (DataColumn dc in dt.Columns)
        {
          Console.Write("{0,-14}", dc.ColumnName);
        }
        Console.WriteLine("\n-----------------------------------------------");

        foreach (DataRow dr in dt.Rows)
        {
          foreach (object data in dr.ItemArray)
          {
            Console.Write("{0,-14}", data.ToString());

          }
          Console.WriteLine();
        }
        Console.WriteLine("===============================================");
      } // foreach DataTable

      foreach (DataRelation dr in ds.Relations)
      {
        Console.WriteLine("\n\nRelations:");
        Console.WriteLine(dr.RelationName + "\n\n");
      }

    } // DisplayInfo()
  } // class DataSetInfo
} // ns InfoLib

Figure 5 Output from the DataSet technique

The first table, "testcase," holds the data attached directly to each <testcase> element: id, kind, and expected. The second table, "inputs," holds the data nested one level further down: arg1 and arg2. In general, ReadXml() infers a table for each element that has attributes or child elements, so the more levels of nesting your XML file has, the more tables you get.

Extracting the data from the parent test case table is easy. We just iterate through each row of the table and access by column name. To get the data from the child table inputs, we get an array of rows using the GetChildRows method:

DataRow[] children = row.GetChildRows("testcase_inputs");  // relation name

Because each <testcase> node has only one <inputs> child node, the children array will only have one row.

The trickiest aspect of this technique is to extract the child data:

tc.arg1 = (children[0]["arg1"]).ToString();  // there is only 1 row in children
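Indexing children[0] assumes every <testcase> has an <inputs> element. A slightly more defensive sketch checks the array length first. It assumes the XML is read from a string through a StringReader, relies on the table and relation names reported above ("testcase", "inputs", "testcase_inputs"), and uses the hypothetical name DataSetSketch.

```csharp
using System;
using System.Collections.Generic;
using System.Data;
using System.IO;

public static class DataSetSketch
{
    // Assumption: a two-case sample in the testCases.xml shape, using single-quoted attributes.
    public const string Sample =
        "<suite>" +
        "<testcase id='001' kind='bvt'><inputs><arg1>4</arg1><arg2>7</arg2></inputs><expected>11.00</expected></testcase>" +
        "<testcase id='002' kind='drt'><inputs><arg1>9</arg1><arg2>6</arg2></inputs><expected>15.00</expected></testcase>" +
        "</suite>";

    public static List<string> Parse(string xml)
    {
        List<string> rows = new List<string>();
        DataSet ds = new DataSet();
        ds.ReadXml(new StringReader(xml)); // schema is inferred from the data
        foreach (DataRow row in ds.Tables["testcase"].Rows)
        {
            DataRow[] children = row.GetChildRows("testcase_inputs");
            if (children.Length == 0)
                continue; // skip a <testcase> that has no <inputs> child
            rows.Add(row["id"] + " " + row["kind"] + " " +
                     children[0]["arg1"] + " " + children[0]["arg2"] + " " +
                     row["expected"]);
        }
        return rows;
    }

    public static void Main()
    {
        foreach (string row in Parse(Sample))
            Console.WriteLine(row);
    }
}
```

Skipping incomplete test cases is only one possible policy; depending on the application, reporting an error may be more appropriate.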

Using the DataSet class to parse an XML file has a very relational database feel. Compared with other techniques in this article, it operates at a middle level of abstraction. The ReadXml() method hides a lot of details but you must traverse through relational tables.

Using DataSet to parse XML files is particularly appropriate when your application program is using ADO.NET classes so that you maintain a consistent look and feel. Using a DataSet object has high overhead and would not be a good choice if performance is an issue. Because each additional level of nesting in an XML file generates another table, using DataSet would also not be a good choice if your XML file is deeply nested.

Further Discussion

There are several related issues not yet covered: namespaces, generalization, error handling, validation, filtering, and performance. In the context of parsing XML data files, XML namespaces are a mechanism to prevent name clashes. Each of the techniques we've used can deal with namespaces. The MSDN Library will give you all the information you need to handle XML files with namespaces.
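As one illustration of namespace handling, the XmlDocument technique can be adapted by registering a prefix with an XmlNamespaceManager and using it in the XPath expressions. In this sketch the namespace URI urn:example:tests, the prefix t, and the class name NamespaceSketch are all made up for the example.

```csharp
using System;
using System.Collections.Generic;
using System.Xml;

public static class NamespaceSketch
{
    // Assumption: the same test case data, but in a hypothetical default namespace.
    public const string Sample =
        "<suite xmlns='urn:example:tests'>" +
        "<testcase id='001' kind='bvt'><inputs><arg1>4</arg1><arg2>7</arg2></inputs><expected>11.00</expected></testcase>" +
        "</suite>";

    public static List<string> Parse(string xml)
    {
        List<string> rows = new List<string>();
        XmlDocument xd = new XmlDocument();
        xd.LoadXml(xml);

        // XPath has no notion of a default namespace, so a prefix must be registered.
        XmlNamespaceManager ns = new XmlNamespaceManager(xd.NameTable);
        ns.AddNamespace("t", "urn:example:tests");

        foreach (XmlNode node in xd.SelectNodes("/t:suite/t:testcase", ns))
        {
            // Unprefixed attributes are not in the default namespace.
            string id = node.Attributes["id"].Value;
            string kind = node.Attributes["kind"].Value;
            string arg1 = node.SelectSingleNode("t:inputs/t:arg1", ns).InnerText;
            string arg2 = node.SelectSingleNode("t:inputs/t:arg2", ns).InnerText;
            string expected = node.SelectSingleNode("t:expected", ns).InnerText;
            rows.Add(id + " " + kind + " " + arg1 + " " + arg2 + " " + expected);
        }
        return rows;
    }

    public static void Main()
    {
        foreach (string row in Parse(Sample))
            Console.WriteLine(row);
    }
}
```

Without the namespace manager, the query "/suite/testcase" would match nothing in this document, which is a common source of confusion.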

The techniques we have seen were not written to be particularly general. If you have a different XML structure, you will have to write different code. There is always a trade-off between writing code for a specific situation and making the code more generalized.

The code in this article does not have any error handling. Parsing XML files is quite error prone and in a production scenario, you would need to add lots of try-catch blocks to create a robust parser.
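A first step toward that robustness is catching XmlException, which the System.Xml parsers throw for malformed input. The following sketch wraps XmlDocument.LoadXml(); the method name TryLoad, its out parameters, and the class name ErrorHandlingSketch are illustrative choices, not part of the article's code.

```csharp
using System;
using System.Xml;

public static class ErrorHandlingSketch
{
    // Returns true and the loaded document on success;
    // returns false and a short message when the XML is not well formed.
    public static bool TryLoad(string xml, out XmlDocument doc, out string error)
    {
        doc = new XmlDocument();
        try
        {
            doc.LoadXml(xml);
            error = null;
            return true;
        }
        catch (XmlException ex)
        {
            doc = null;
            error = "line " + ex.LineNumber + ", position " + ex.LinePosition + ": " + ex.Message;
            return false;
        }
    }

    public static void Main()
    {
        XmlDocument doc;
        string error;

        // Assumption: a deliberately broken document (<testcase> is never closed).
        if (!TryLoad("<suite><testcase></suite>", out doc, out error))
            Console.WriteLine("Parse failed at " + error);

        if (TryLoad("<suite><testcase /></suite>", out doc, out error))
            Console.WriteLine("Parsed root: " + doc.DocumentElement.Name);
    }
}
```

A file-based parser would also need to handle I/O failures such as a missing file, and the code that walks the document would still need null checks for absent elements and attributes.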

Additionally, I didn't address XML validation with schema files, but once again, in a production environment you would need to generate XML schema files and validate your XML data files against them before attempting to parse. It is possible to add validation to your parsing code, but I recommend validating before parsing.
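A validate-first pass might look like the sketch below. It uses XmlReaderSettings and XmlSchemaSet, which arrived in versions of the .NET Framework later than the one this article was written against, and the schema shown is a hypothetical one written for this example, not a file that ships with the article.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Xml;
using System.Xml.Schema;

public static class ValidationSketch
{
    // Hypothetical schema for the testCases.xml layout (an assumption for this sketch).
    public const string Xsd = @"<xs:schema xmlns:xs='http://www.w3.org/2001/XMLSchema'>
  <xs:element name='suite'>
    <xs:complexType><xs:sequence>
      <xs:element name='testcase' maxOccurs='unbounded'>
        <xs:complexType>
          <xs:sequence>
            <xs:element name='inputs'>
              <xs:complexType><xs:sequence>
                <xs:element name='arg1' type='xs:string'/>
                <xs:element name='arg2' type='xs:string'/>
              </xs:sequence></xs:complexType>
            </xs:element>
            <xs:element name='expected' type='xs:string'/>
          </xs:sequence>
          <xs:attribute name='id' type='xs:string' use='required'/>
          <xs:attribute name='kind' type='xs:string' use='required'/>
        </xs:complexType>
      </xs:element>
    </xs:sequence></xs:complexType>
  </xs:element>
</xs:schema>";

    // Returns the validation error messages; an empty list means the XML is valid.
    public static List<string> Validate(string xml)
    {
        List<string> errors = new List<string>();
        XmlReaderSettings settings = new XmlReaderSettings();
        settings.ValidationType = ValidationType.Schema;
        settings.Schemas.Add(null, XmlReader.Create(new StringReader(Xsd)));
        settings.ValidationEventHandler += delegate(object sender, ValidationEventArgs e)
        {
            errors.Add(e.Message);
        };
        using (XmlReader reader = XmlReader.Create(new StringReader(xml), settings))
        {
            while (reader.Read()) { } // reading through the document drives validation
        }
        return errors;
    }

    public static void Main()
    {
        // Assumption: a test case that is missing its <expected> element.
        string bad = "<suite><testcase id='001' kind='bvt'>" +
                     "<inputs><arg1>4</arg1><arg2>7</arg2></inputs></testcase></suite>";
        foreach (string message in Validate(bad))
            Console.WriteLine(message);
    }
}
```

The handler collects schema violations instead of throwing, but XML that is not well formed will still raise an XmlException from Read().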

In every example, we have read all the XML data into memory. In many cases, you will want to filter and just read in some data. All the techniques in this article can be modified to provide front-end filtering. The XPathDocument class has especially nice filtering capabilities by way of XPath syntax.
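For instance, restricting the XPathDocument technique to build verification tests only takes a predicate in the XPath expression. The sketch below returns the ids of the matching test cases; the inline sample and the names FilterSketch and SelectIds are assumptions for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Xml.XPath;

public static class FilterSketch
{
    // Assumption: three cases with the same id/kind values as testCases.xml.
    public const string Sample =
        "<suite>" +
        "<testcase id='001' kind='bvt'><inputs><arg1>4</arg1><arg2>7</arg2></inputs><expected>11.00</expected></testcase>" +
        "<testcase id='002' kind='drt'><inputs><arg1>9</arg1><arg2>6</arg2></inputs><expected>15.00</expected></testcase>" +
        "<testcase id='003' kind='bvt'><inputs><arg1>5</arg1><arg2>8</arg2></inputs><expected>13.00</expected></testcase>" +
        "</suite>";

    // Returns the id of every <testcase> whose kind attribute equals the given value.
    // The kind value is spliced into the expression, so it must not contain a single quote.
    public static List<string> SelectIds(string xml, string kind)
    {
        List<string> ids = new List<string>();
        XPathDocument xpd = new XPathDocument(new StringReader(xml));
        XPathNavigator xpn = xpd.CreateNavigator();
        XPathNodeIterator xpi = xpn.Select("/suite/testcase[@kind='" + kind + "']");
        while (xpi.MoveNext())
            ids.Add(xpi.Current.GetAttribute("id", ""));
        return ids;
    }

    public static void Main()
    {
        foreach (string id in SelectIds(Sample, "bvt"))
            Console.WriteLine(id);
    }
}
```

The same predicate syntax works with XmlDocument.SelectNodes(), so the filtering idea carries over to that technique as well.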

If performance is an issue, usually in the case where you are parsing many small XML files, you will have to run some timing measurements to determine if your chosen technique is fast enough. Performance depends on too many factors to support many general statements, and the only way to know if your performance is acceptable is to try your code. As a guideline, however, XmlTextReader has the best performance characteristics.
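A timing measurement can be as simple as the sketch below, which counts <testcase> elements in a generated document with two of the techniques and reports the elapsed time for each. It uses System.Diagnostics.Stopwatch, which postdates the framework version this article targets; the document size and the name TimingSketch are arbitrary assumptions, and the numbers it prints are only meaningful on your own machine and data.

```csharp
using System;
using System.Diagnostics;
using System.IO;
using System.Text;
using System.Xml;

public static class TimingSketch
{
    // Builds a synthetic suite with n identical test cases (an assumption for benchmarking).
    public static string BuildXml(int n)
    {
        StringBuilder sb = new StringBuilder("<suite>");
        for (int i = 0; i < n; i++)
            sb.Append("<testcase id='" + i + "' kind='bvt'><inputs><arg1>4</arg1>" +
                      "<arg2>7</arg2></inputs><expected>11.00</expected></testcase>");
        sb.Append("</suite>");
        return sb.ToString();
    }

    public static int CountWithReader(string xml)
    {
        int count = 0;
        XmlTextReader xtr = new XmlTextReader(new StringReader(xml));
        while (xtr.Read())
            if (xtr.NodeType == XmlNodeType.Element && xtr.Name == "testcase")
                count++;
        xtr.Close();
        return count;
    }

    public static int CountWithDocument(string xml)
    {
        XmlDocument xd = new XmlDocument();
        xd.LoadXml(xml);
        return xd.SelectNodes("/suite/testcase").Count;
    }

    public static void Main()
    {
        string xml = BuildXml(10000);

        Stopwatch sw = Stopwatch.StartNew();
        int fromReader = CountWithReader(xml);
        sw.Stop();
        Console.WriteLine("XmlTextReader: " + fromReader + " cases, " + sw.ElapsedMilliseconds + " ms");

        sw = Stopwatch.StartNew();
        int fromDocument = CountWithDocument(xml);
        sw.Stop();
        Console.WriteLine("XmlDocument:   " + fromDocument + " cases, " + sw.ElapsedMilliseconds + " ms");
    }
}
```

For a fairer comparison, run each measurement several times and discard the first run, since one-time startup costs can dominate short timings.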

A Key Skill

XML data files are a key component of Microsoft's .NET developer environment. The ability to parse data from XML files into memory is a key skill in a .NET setting. Each of the five techniques, based on the XmlTextReader, XmlDocument, XPathDocument, XmlSerializer, and DataSet classes, is significantly different in terms of coding mechanics, coding mindset, and scenarios for usage. The .NET Framework gives you great flexibility in parsing XML data files and makes this essential task much easier and less error prone than using non-.NET techniques.

References

XML in .NET Overview, http://msdn.microsoft.com/msdnmag/issues/01/01/xml/xml.asp

Consume XML C# app, http://msdn.microsoft.com/library/en-us/vcedit/html/vcwlkVisualCApplicationsConsumingXMLData.asp

XML Schema, http://msdn.microsoft.com/msdnmag/issues/02/04/xml/xml0204.asp

XML Namespaces, http://msdn.microsoft.com/msdnmag/issues/01/07/xml/default.aspx

Dr. James McCaffrey works for Volt Information Sciences Inc. where he manages technical training for software engineers working at Microsoft's Redmond, WA campus. He has worked on several Microsoft products, including Internet Explorer and MSN Search.