Parsing XML Files in .NET Using C#


Parsing XML with DataSet

The fifth and final method we will use to parse an XML file into memory uses the DataSet class. The example code is shown in Listing Nine.

Listing Nine: Parsing XML using DataSet

using System;
using System.Xml;
using System.Data;
using CommonLib; // Suite class definition
using InfoLib; // DisplayInfo() method 

namespace Run
{
  class Class1
  {
    [STAThread]
    static void Main(string[] args)
    {
      DataSet ds = new DataSet();
      ds.ReadXml("..\\..\\..\\..\\testCases.xml");
    
      InfoLib.DataSetInfo.DisplayInfo(ds); // show table, column, relation names

      CommonLib.Suite s = new CommonLib.Suite();
      foreach (DataRow row in ds.Tables["testcase"].Rows)
      {
        CommonLib.TestCase tc = new CommonLib.TestCase();
        tc.id = row["id"].ToString();
        tc.kind = row["kind"].ToString();
        tc.expected = row["expected"].ToString();

        DataRow[] children = row.GetChildRows("testcase_inputs"); // relation name

        tc.arg1 = (children[0]["arg1"]).ToString(); // there is only 1 row in children
        tc.arg2 = (children[0]["arg2"]).ToString();
        
        s.items.Add(tc);
      }

      s.Display();
 
    } // Main()

  } // class Class1
} // ns

We start by reading the XML file directly into a System.Data.DataSet object using the ReadXml() method. A DataSet object can be thought of as an in-memory relational database. The XML data ends up in two tables, "testcase" and "inputs," that are related through a relation named "testcase_inputs." The key to this technique is knowing how ReadXml() maps the XML data onto the tables, columns, and relations of the DataSet.
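To see that mapping without the test case file on disk, here is a minimal sketch that loads XML from a string and lists what ReadXml() inferred. The sample XML is an assumption: it follows the structure described in this article (id and kind as attributes, an inputs child holding arg1 and arg2), but the values are made up, and the class and method names are hypothetical.

```csharp
using System;
using System.Data;
using System.IO;

public static class InferenceDemo
{
    // Assumed sample data shaped like the test case file; values are invented.
    public const string SampleXml =
        "<suite>" +
        "<testcase id='001' kind='bvt'>" +
        "<inputs><arg1>4</arg1><arg2>7</arg2></inputs>" +
        "<expected>11.00</expected>" +
        "</testcase>" +
        "<testcase id='002' kind='bvt'>" +
        "<inputs><arg1>1</arg1><arg2>2</arg2></inputs>" +
        "<expected>3.00</expected>" +
        "</testcase>" +
        "</suite>";

    // Loads XML text into a DataSet; the schema is inferred because none is supplied.
    public static DataSet Load(string xml)
    {
        DataSet ds = new DataSet();
        using (StringReader reader = new StringReader(xml))
        {
            ds.ReadXml(reader);
        }
        return ds;
    }

    public static void Main()
    {
        DataSet ds = Load(SampleXml);

        // List the inferred tables with their row counts.
        foreach (DataTable dt in ds.Tables)
        {
            Console.WriteLine("table: " + dt.TableName + " (" + dt.Rows.Count + " rows)");
        }

        // List the inferred parent/child relations.
        foreach (DataRelation rel in ds.Relations)
        {
            Console.WriteLine("relation: " + rel.RelationName);
        }
    }
}
```

Reading from a StringReader instead of a file name is only a convenience for experimenting; the inference is the same either way.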

Although we could create a custom DataSet object with completely known characteristics, it is much quicker to let the ReadXml() method do the work and then examine the result. I wrote a helper function DisplayInfo() that accepts a DataSet as an argument and displays the information we need to extract the data from the DataSet's tables.

To keep the main parse program uncluttered, I put DisplayInfo() into a class library named "InfoLib." The code is shown in Listing Ten. The output from running the parse program is shown in Figure 5.

Listing Ten: Code to display DataSet information


using System;
using System.Data;

namespace InfoLib
{
  public class DataSetInfo
  {
    public static void DisplayInfo(DataSet ds) // names of tables, columns, relations in ds
    {
      foreach (DataTable dt in ds.Tables)
      {
        Console.WriteLine("\n===============================================");
        Console.WriteLine("Table = " + dt.TableName + "\n");
        foreach (DataColumn dc in dt.Columns)
        {
          Console.Write("{0,-14}", dc.ColumnName);
        }
        Console.WriteLine("\n-----------------------------------------------");

        foreach (DataRow dr in dt.Rows)
        {
          foreach (object data in dr.ItemArray)
          {
            Console.Write("{0,-14}", data.ToString());

          }
          Console.WriteLine();
        }
        Console.WriteLine("===============================================");
      } // foreach DataTable

      foreach (DataRelation dr in ds.Relations)
      {
        Console.WriteLine("\n\nRelations:");
        Console.WriteLine(dr.RelationName + "\n\n");
      }

    } // DisplayInfo()
  } // class DataSetInfo
} // ns InfoLib

Figure 5 Output from the DataSet technique


The first table, "testcase," holds the data that is one level deep from the XML root: id, kind, and expected. The second table, "inputs," holds data that is two levels deep: arg1 and arg2. In general, ReadXml() generates a table for each kind of element that itself contains child elements, so a file with n such levels of nesting below the root produces at least n tables.

Extracting the data from the parent "testcase" table is easy. We just iterate through each row of the table and access each value by column name. To get the data from the child "inputs" table, we get an array of rows using the GetChildRows() method:

DataRow[] children = row.GetChildRows("testcase_inputs");  // relation name

Because each <testcase> node has only one <inputs> child node, the children array will only have one row.

The trickiest aspect of this technique is to extract the child data:

tc.arg1 = (children[0]["arg1"]).ToString();  // there is only 1 row in children
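Indexing children[0] assumes every <testcase> really has an <inputs> child. The following sketch shows one way to guard that assumption; the helper FirstChildValue() is hypothetical and not part of the original listings, and the sample XML is invented to include a test case with no inputs.

```csharp
using System;
using System.Data;
using System.IO;

public static class ChildRowDemo
{
    // Returns the named column from the first child row, or null when the
    // parent row has no child rows for the given relation.
    public static string FirstChildValue(DataRow parent, string relation, string column)
    {
        DataRow[] children = parent.GetChildRows(relation);
        if (children.Length == 0)
        {
            return null;
        }
        return children[0][column].ToString();
    }

    public static void Main()
    {
        // Assumed sample data; the second test case deliberately has no <inputs>.
        string xml =
            "<suite>" +
            "<testcase id='001' kind='bvt'>" +
            "<inputs><arg1>4</arg1><arg2>7</arg2></inputs>" +
            "<expected>11.00</expected>" +
            "</testcase>" +
            "<testcase id='002' kind='bvt'>" +
            "<expected>0.00</expected>" +
            "</testcase>" +
            "</suite>";

        DataSet ds = new DataSet();
        ds.ReadXml(new StringReader(xml));

        foreach (DataRow row in ds.Tables["testcase"].Rows)
        {
            string arg1 = FirstChildValue(row, "testcase_inputs", "arg1");
            Console.WriteLine(row["id"] + ": arg1 = " + (arg1 ?? "(missing)"));
        }
    }
}
```

Whether a missing child should be reported as null, as a default value, or as an error depends on what your test harness expects; the sketch simply makes the choice explicit.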

Using the DataSet class to parse an XML file has a very relational database feel. Compared with other techniques in this article, it operates at a middle level of abstraction. The ReadXml() method hides a lot of details but you must traverse through relational tables.

Using DataSet to parse XML files is particularly appropriate when your application already uses ADO.NET classes, because your code keeps a consistent look and feel. A DataSet object has high overhead, so it is not a good choice if performance is an issue. And because each level of an XML file generates a table, DataSet is also a poor choice for deeply nested XML files.

Further Discussion

There are several related issues not yet covered: namespaces, generalization, error handling, validation, filtering, and performance. In the context of parsing XML data files, XML namespaces are a mechanism to prevent name clashes. Each of the techniques we've used can deal with namespaces. The MSDN Library will give you all the information you need to handle XML files with namespaces.
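As one illustration of namespace handling, here is a small sketch that uses XmlDocument with an XmlNamespaceManager. The namespace URI urn:example:tests and the sample data are hypothetical; the test case file used in this article is assumed to have no namespace.

```csharp
using System;
using System.Xml;

public static class NamespaceDemo
{
    // Counts <testcase> elements that live in the given XML namespace.
    public static int CountTestCases(string xml, string namespaceUri)
    {
        XmlDocument doc = new XmlDocument();
        doc.LoadXml(xml);

        // The prefix "t" is local to the query; it need not appear in the document.
        XmlNamespaceManager ns = new XmlNamespaceManager(doc.NameTable);
        ns.AddNamespace("t", namespaceUri);

        return doc.SelectNodes("/t:suite/t:testcase", ns).Count;
    }

    public static void Main()
    {
        // Hypothetical namespaced document, for illustration only.
        string xml =
            "<suite xmlns='urn:example:tests'>" +
            "<testcase id='001'/><testcase id='002'/>" +
            "</suite>";

        Console.WriteLine(CountTestCases(xml, "urn:example:tests"));
    }
}
```

The point to notice is that an unprefixed XPath such as /suite/testcase would match nothing here, because the elements belong to a default namespace.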

The techniques we have seen were not written to be particularly general. If you have a different XML structure, you will have to write different code. There is always a trade-off between writing code for a specific situation and making the code more generalized.

The code in this article does not have any error handling. Parsing XML files is quite error-prone, and in a production scenario you would need to add try-catch blocks around each read and lookup to create a robust parser.
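A minimal sketch of what such handling might look like for the DataSet technique follows. The TryLoad() helper and its error messages are hypothetical; it covers only two failure modes (malformed XML and a missing "testcase" table), and a real parser would need more.

```csharp
using System;
using System.Data;
using System.IO;
using System.Xml;

public static class SafeLoadDemo
{
    // Tries to load XML text into a DataSet. On failure, returns null and
    // describes the problem in 'error' instead of letting the exception escape.
    public static DataSet TryLoad(string xml, out string error)
    {
        error = null;
        DataSet ds = new DataSet();
        try
        {
            using (StringReader reader = new StringReader(xml))
            {
                ds.ReadXml(reader);
            }
        }
        catch (XmlException ex)
        {
            error = "Malformed XML at line " + ex.LineNumber + ": " + ex.Message;
            return null;
        }

        // Well-formed XML can still have the wrong shape for our parser.
        if (!ds.Tables.Contains("testcase"))
        {
            error = "No 'testcase' table was inferred from the XML.";
            return null;
        }
        return ds;
    }

    public static void Main()
    {
        string error;
        DataSet ds = TryLoad("<suite><testcase>", out error); // truncated on purpose
        Console.WriteLine(ds == null ? "load failed: " + error : "load succeeded");
    }
}
```

When reading from a file name rather than a string, you would also want to catch IOException, which covers a missing or locked file.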

Additionally, I didn't address XML validation with schema files, but once again, in a production environment you would need to generate XML schema files and validate your XML data files against them before attempting to parse. It is possible to add validation to your parsing code, but I recommend validating before parsing.
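To give a feel for validating before parsing, here is a sketch that checks XML text against an XSD using XmlReaderSettings, which requires .NET Framework 2.0 or later. The schema is my own guess at what one for the test case file might look like; this article does not include one, so treat every element and attribute declaration in it as an assumption.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Xml;
using System.Xml.Schema;

public static class ValidationDemo
{
    // Hypothetical schema for the test case file; not taken from the article.
    public const string Xsd = @"
<xs:schema xmlns:xs='http://www.w3.org/2001/XMLSchema'>
  <xs:element name='suite'>
    <xs:complexType><xs:sequence>
      <xs:element name='testcase' maxOccurs='unbounded'>
        <xs:complexType>
          <xs:sequence>
            <xs:element name='inputs'>
              <xs:complexType><xs:sequence>
                <xs:element name='arg1' type='xs:string'/>
                <xs:element name='arg2' type='xs:string'/>
              </xs:sequence></xs:complexType>
            </xs:element>
            <xs:element name='expected' type='xs:string'/>
          </xs:sequence>
          <xs:attribute name='id' type='xs:string' use='required'/>
          <xs:attribute name='kind' type='xs:string' use='required'/>
        </xs:complexType>
      </xs:element>
    </xs:sequence></xs:complexType>
  </xs:element>
</xs:schema>";

    // Returns the list of validation error messages; an empty list means valid.
    public static List<string> Validate(string xml)
    {
        List<string> problems = new List<string>();

        XmlReaderSettings settings = new XmlReaderSettings();
        settings.ValidationType = ValidationType.Schema;
        settings.Schemas.Add(null, XmlReader.Create(new StringReader(Xsd)));
        settings.ValidationEventHandler +=
            delegate(object sender, ValidationEventArgs e) { problems.Add(e.Message); };

        using (XmlReader reader = XmlReader.Create(new StringReader(xml), settings))
        {
            while (reader.Read()) { } // reading the document drives validation
        }
        return problems;
    }

    public static void Main()
    {
        // Missing <expected>, so at least one problem should be reported.
        string xml =
            "<suite><testcase id='001' kind='bvt'>" +
            "<inputs><arg1>4</arg1><arg2>7</arg2></inputs>" +
            "</testcase></suite>";

        foreach (string problem in Validate(xml))
        {
            Console.WriteLine(problem);
        }
    }
}
```

Running validation as a separate pass like this keeps the parsing code free of schema concerns, which is the ordering recommended above.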

In every example, we have read all the XML data into memory. In many cases, you will want to filter and just read in some data. All the techniques in this article can be modified to provide front-end filtering. The XPathDocument class has especially nice filtering capabilities by way of XPath syntax.
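The XPath-based filtering mentioned above can be sketched as follows. The sample data and the IdsOfKind() helper are hypothetical, and the kind values "bvt" and "stress" are invented for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Xml.XPath;

public static class FilterDemo
{
    // Returns the id of every <testcase> whose kind attribute matches.
    // Assumes 'kind' contains no quote characters, since it is spliced
    // directly into the XPath expression.
    public static List<string> IdsOfKind(string xml, string kind)
    {
        List<string> ids = new List<string>();

        XPathDocument doc = new XPathDocument(new StringReader(xml));
        XPathNavigator nav = doc.CreateNavigator();
        XPathNodeIterator it = nav.Select("/suite/testcase[@kind='" + kind + "']");

        while (it.MoveNext())
        {
            ids.Add(it.Current.GetAttribute("id", ""));
        }
        return ids;
    }

    public static void Main()
    {
        // Assumed sample data with two different kinds of test case.
        string xml =
            "<suite>" +
            "<testcase id='001' kind='bvt'/>" +
            "<testcase id='002' kind='stress'/>" +
            "<testcase id='003' kind='bvt'/>" +
            "</suite>";

        foreach (string id in IdsOfKind(xml, "bvt"))
        {
            Console.WriteLine(id);
        }
    }
}
```

Note that XPathDocument still reads the whole document; the filter decides which nodes your code turns into objects, not how much XML is loaded.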

If performance is an issue, which is usually the case when you are parsing many small XML files, you will have to run some timing measurements to determine if your chosen technique is fast enough. Performance depends on too many factors to allow many general statements, and the only way to know if your performance is acceptable is to measure your code. As a guideline, however, XmlTextReader has the best performance characteristics.
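One simple way to take such measurements is with the Stopwatch class (.NET Framework 2.0 or later). The sketch below times two of the techniques on the same in-memory XML; the helper names, sample data, and iteration count are all assumptions, and the numbers you get will depend on your machine and your files.

```csharp
using System;
using System.Data;
using System.Diagnostics;
using System.IO;
using System.Xml;

public static class TimingDemo
{
    // Counts <testcase> elements with a forward-only reader.
    public static int CountWithReader(string xml)
    {
        int count = 0;
        using (XmlTextReader reader = new XmlTextReader(new StringReader(xml)))
        {
            while (reader.Read())
            {
                if (reader.NodeType == XmlNodeType.Element && reader.Name == "testcase")
                {
                    count++;
                }
            }
        }
        return count;
    }

    // Counts <testcase> rows after loading everything into a DataSet.
    public static int CountWithDataSet(string xml)
    {
        DataSet ds = new DataSet();
        ds.ReadXml(new StringReader(xml));
        return ds.Tables["testcase"].Rows.Count;
    }

    // Runs 'parse' repeatedly and returns the total elapsed time.
    public static TimeSpan Time(Func<string, int> parse, string xml, int iterations)
    {
        parse(xml); // warm-up run so JIT compilation is not measured
        Stopwatch sw = Stopwatch.StartNew();
        for (int i = 0; i < iterations; i++)
        {
            parse(xml);
        }
        sw.Stop();
        return sw.Elapsed;
    }

    public static void Main()
    {
        // Assumed sample data shaped like the test case file.
        string xml =
            "<suite>" +
            "<testcase id='001' kind='bvt'><inputs><arg1>4</arg1><arg2>7</arg2></inputs>" +
            "<expected>11.00</expected></testcase>" +
            "<testcase id='002' kind='bvt'><inputs><arg1>1</arg1><arg2>2</arg2></inputs>" +
            "<expected>3.00</expected></testcase>" +
            "</suite>";

        Console.WriteLine("XmlTextReader: " + Time(CountWithReader, xml, 1000));
        Console.WriteLine("DataSet:       " + Time(CountWithDataSet, xml, 1000));
    }
}
```

For a fair comparison, time each technique on files that resemble your real data in size and nesting, and repeat the runs enough times that the totals are stable.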

A Key Skill

XML data files are a key component of Microsoft's .NET developer environment. The ability to parse data from XML files into memory is a key skill in a .NET setting. Each of the five techniques, based on the XmlTextReader, XmlDocument, XPathDocument, XmlSerializer, and DataSet classes, is significantly different in terms of coding mechanics, coding mindset, and scenarios for usage. The .NET Framework gives you great flexibility in parsing XML data files and makes this essential task much easier and less error-prone than using non-.NET techniques.

References

XML in .NET Overview, http://msdn.microsoft.com/msdnmag/issues/01/01/xml/xml.asp

Consume XML C# app, http://msdn.microsoft.com/library/en-us/vcedit/html/vcwlkVisualCApplicationsConsumingXMLData.asp

XML Schema, http://msdn.microsoft.com/msdnmag/issues/02/04/xml/xml0204.asp

XML Namespaces, http://msdn.microsoft.com/msdnmag/issues/01/07/xml/default.aspx


Dr. James McCaffrey works for Volt Information Sciences Inc. where he manages technical training for software engineers working at Microsoft's Redmond, WA campus. He has worked on several Microsoft products, including Internet Explorer and MSN Search.

