Parsing Big Records with Json.NET


Several token types can be returned by the reader. These are all enumerated in the JsonToken enum and include StartObject, StartArray, PropertyName, String, Integer, Float, EndObject, EndArray, and Null. HandleToken looks at the token type that is passed to it and takes appropriate action.

For each of the token types, I created a corresponding C# class. For a String token, there is a corresponding JsonString class. For an Integer or Float token, there is a JsonNumber class. For objects, there is the JsonObject class. For properties, there is the JsonProperty class. And for arrays, there is the JsonArray class. All these classes are derived from a base class that I call JsonRoot. Because jsonStack is a stack of JsonRoot objects, any subtype can be pushed onto the stack.
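The class family described above might look roughly like the following. This is only a sketch: the class names, AddValue, and GetValues come from the text, while the member names (values, Text, Number, Name, Value) are assumptions made for illustration. The text calls AddValue on JsonProperty "overloaded"; here it is modeled as an override. Leaf nodes return an empty list from GetValues rather than null.

```csharp
using System;
using System.Collections.Generic;

// Base class: every node keeps a list of child JsonRoot items.
public abstract class JsonRoot
{
    protected readonly List<JsonRoot> values = new List<JsonRoot>();

    // Default behavior: append the child to this node's list.
    public virtual void AddValue(JsonRoot value) { values.Add(value); }

    // Children of this node; leaf nodes return an empty list in this sketch.
    public List<JsonRoot> GetValues() { return values; }
}

public class JsonString : JsonRoot
{
    public string Text { get; }
    public JsonString(string text) { Text = text; }
}

public class JsonNumber : JsonRoot
{
    public double Number { get; }
    public JsonNumber(double number) { Number = number; }
}

public class JsonObject : JsonRoot { }   // children are JsonProperty items
public class JsonArray : JsonRoot { }    // children are the array elements

public class JsonProperty : JsonRoot
{
    public string Name { get; }
    public JsonRoot Value { get; private set; }
    public JsonProperty(string name) { Name = name; }

    // A property holds exactly one value, so a new value replaces the old one.
    public override void AddValue(JsonRoot value)
    {
        Value = value;
        values.Clear();
        values.Add(value);
    }
}
```

Because every class derives from JsonRoot, a single Stack<JsonRoot> can hold any mix of objects, arrays, properties, and scalar values.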

When HandleToken sees a StartObject token, it creates a JsonObject and pushes it onto jsonStack. If HandleToken sees a StartArray token, it creates a JsonArray and pushes it onto jsonStack. If a PropertyName token is seen, a JsonProperty is created and pushed onto the stack.

If a String is seen, a JsonString is created. However, rather than immediately pushing the string onto the stack, I peek at the top of the stack and add the string to whatever JsonObject, JsonArray, or JsonProperty is there by calling the base class method AddValue (defined in JsonRoot). The string is then added to a list of JsonRoot items maintained by the object, array, or property.

If the top of the stack is a JsonProperty, the overloaded AddValue method sets the string as the property's value. At that point, the property has both a name and a value. It is popped off the stack and added to whatever is now on top of the stack, which in well-formed JSON is the JsonObject that owns the property.

EndObject tokens (and, by the same logic, EndArray tokens) cause the stack to be popped. The freshly popped item is then added to whatever is on top of the stack. If the item on top is a JsonProperty, it, too, is popped off the stack and added to the next item on the stack.

Integers and Floats are handled similarly to Strings. Both are converted to doubles and stored in a JsonNumber, which is then added to whatever is on top of the stack. If the top is a property, it is also popped and added to the next item on the stack.
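The token handling described above can be sketched as follows. This is not the author's code. JsonTextReader, JsonToken, Read, TokenType, Value, and DateParseHandling are real Json.NET members, but TreeBuilder, Parse, and Attach are hypothetical names, and the node classes are trimmed stand-ins. Several choices are assumptions: EndArray is treated like EndObject, the root is left on the stack, and Null and Boolean tokens are stored as JsonString text.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using Newtonsoft.Json;

public abstract class JsonRoot
{
    protected readonly List<JsonRoot> values = new List<JsonRoot>();
    public virtual void AddValue(JsonRoot value) { values.Add(value); }
    public List<JsonRoot> GetValues() { return values; }
}
public class JsonObject : JsonRoot { }
public class JsonArray : JsonRoot { }
public class JsonString : JsonRoot
{
    public string Text { get; }
    public JsonString(string text) { Text = text; }
}
public class JsonNumber : JsonRoot
{
    public double Number { get; }
    public JsonNumber(double number) { Number = number; }
}
public class JsonProperty : JsonRoot
{
    public string Name { get; }
    public JsonProperty(string name) { Name = name; }
}

public class TreeBuilder
{
    private readonly Stack<JsonRoot> jsonStack = new Stack<JsonRoot>();

    // Reads tokens one at a time and returns the root of the finished tree.
    public JsonRoot Parse(TextReader text)
    {
        jsonStack.Clear();
        using (var reader = new JsonTextReader(text))
        {
            // Keep date-like strings as plain String tokens.
            reader.DateParseHandling = DateParseHandling.None;
            while (reader.Read())
                HandleToken(reader.TokenType, reader.Value);
        }
        return jsonStack.Count > 0 ? jsonStack.Peek() : null;
    }

    private void HandleToken(JsonToken token, object value)
    {
        switch (token)
        {
            case JsonToken.StartObject: jsonStack.Push(new JsonObject()); break;
            case JsonToken.StartArray: jsonStack.Push(new JsonArray()); break;
            case JsonToken.PropertyName: jsonStack.Push(new JsonProperty((string)value)); break;
            case JsonToken.String: Attach(new JsonString((string)value)); break;
            case JsonToken.Integer:
            case JsonToken.Float:
                // Assumes the value fits in a long or double.
                Attach(new JsonNumber(Convert.ToDouble(value)));
                break;
            case JsonToken.Boolean: Attach(new JsonString((bool)value ? "true" : "false")); break;
            case JsonToken.Null: Attach(new JsonString("null")); break;
            case JsonToken.EndObject:
            case JsonToken.EndArray:
                // Leave the root on the stack; attach everything else to its parent.
                if (jsonStack.Count > 1) Attach(jsonStack.Pop());
                break;
            // Comments and other token types are ignored in this sketch.
        }
    }

    // Adds item to the node on top of the stack; a completed property is
    // popped and added to the node beneath it.
    private void Attach(JsonRoot item)
    {
        if (jsonStack.Count == 0) { jsonStack.Push(item); return; }
        jsonStack.Peek().AddValue(item);
        if (jsonStack.Peek() is JsonProperty && jsonStack.Count > 1)
        {
            JsonRoot property = jsonStack.Pop();
            jsonStack.Peek().AddValue(property);
        }
    }
}
```

Calling new TreeBuilder().Parse(new StringReader(json)) on a small document returns a JsonObject or JsonArray whose children can be inspected with GetValues. For a large file, a StreamReader could be passed instead so that the input is read incrementally.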

An Example

After the final token is read, the top of the stack will contain a single object that is the root of a tree of objects (assuming that the JSON was well formed to begin with). Let's take a look at a small example. Here is some GIS data formatted as GeoJSON:

{
	"type": "FeatureCollection",
	"features": 
	[
		{ 
			"type": "Feature", 
			"properties": 
			{
				"PIN": "123456789",
				"OWNER_NAME": "JOHN DOE",
				"OWNER_ADDR": "201 W MAIN STREET",
				"OWNER_CITY": "DENVER",
				"OWNER_STAT": "CO",
				"OWNER_ZIP": "80202",
				"LAND_VALUE": 1500
			},
			"geometry": 
			{ 
				"type": "Polygon", 
				"coordinates": 
				[ 
					[ 
						[ -104.93463194899994, 39.74666194100007 ], 
						[ -104.93472410299989, 39.746661817000074 ], 
						[ -104.9347782879999, 39.746661743000061 ], 
						[ -104.9348690409999, 39.746661620000054 ], 
						[ -104.93486923799992, 39.746668218000082 ], 
						[ -104.93487055899993, 39.746712491000039 ], 
						[ -104.9347868239999, 39.746712625000043 ], 
						[ -104.93477841599992, 39.746712639000066 ], 
						[ -104.9347542669999, 39.746712677000062 ], 
						[ -104.93472422899993, 39.746712725000066 ], 
						[ -104.9347148899999, 39.746712740000078 ], 
						[ -104.93463268599993, 39.746712871000057 ], 
						[ -104.9346320439999, 39.746668489000058 ], 
						[ -104.93463194899994, 39.74666194100007 ] 
					] 
				] 
			} 
		}
	]
}

The outermost curly braces indicate that the data is enclosed within a single object. That object has two properties: type and features. The type property is the simple string FeatureCollection, while features is an array as indicated by the square bracket. Although it is an array, the features property in this instance contains a single object. That object, in turn, contains three properties.

The first property is type and its associated value indicates that the type is a feature. The second property is called properties (which admittedly is a bit confusing, but such is life). The third property is called geometry.

The properties property is an object whose members give the PIN, OWNER_NAME, OWNER_ADDR, and so forth; that is, basic information about the feature (here, a parcel of land). The geometry property is a polygon that describes the boundaries of that parcel using the given coordinates (each pair lists longitude first, then latitude).

Parsing this GeoJSON results in a tree where the root is a JsonObject. This JsonObject has a list of two JsonRoot items, both of which are JsonProperty objects. The name of the first JsonProperty is type and its value is the string FeatureCollection.

The name of the second JsonProperty is features and its value is a JsonArray. Within this JsonArray is a single JsonObject that, in turn, has a list of three JsonProperties. The first corresponds to the type property. The second corresponds to the properties property. And the third corresponds to the geometry property. The value of the geometry property is a JsonObject holding two JsonProperties: the type (a polygon) and the coordinates (nested arrays of longitude and latitude pairs).

Using the Tree

Once I have constructed the tree, I can traverse it to satisfy the application's queries. Just as a stack was used to build the tree, a stack can also be used to traverse it. The root of the tree will be either a JsonObject or a JsonArray. JsonObjects have properties, while JsonArrays have elements. In both cases, these are modeled as a List<JsonRoot> and the method GetValues returns the list.

To start a traversal, push the root of the tree onto the stack. Then pop the stack and save the popped item in a variable that we can call node. Do something with the node. Then call node.GetValues() and push all the items in the list (the node's children) onto the stack. Repeat the loop until the stack is empty. The general code is as follows:

            if (_root == null) return;

            jsonStack.Clear();
            jsonStack.Push(_root);
            while (jsonStack.Count > 0)
            {
                JsonRoot node = jsonStack.Pop();

                // *** do something with the node ***

                // push the node's children so they are visited later
                List<JsonRoot> children = node.GetValues();
                if (children != null)
                {
                    foreach (JsonRoot child in children)
                        jsonStack.Push(child);
                }
            }

Very large trees of JSON objects can be created quickly with a 64-bit .NET application. The only limitation is the amount of available memory (because the tree is held entirely in memory). Since all the objects are in memory, traversing the tree can be done very quickly. However, the tree is not indexed and further improvements can be made. One possibility would be to traverse the tree once and build an index. Future searches could then consult the index and retrieve nodes directly, thus avoiding the time for traversing through the tree to locate the node.
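One way the indexing idea could be sketched is shown below: a single stack-based traversal records every JsonProperty under its name. The names TreeIndex and Build are hypothetical, the choice to index by property name is an assumption, and the node classes are minimal stand-ins for the ones described earlier.

```csharp
using System;
using System.Collections.Generic;

public abstract class JsonRoot
{
    protected readonly List<JsonRoot> values = new List<JsonRoot>();
    public virtual void AddValue(JsonRoot value) { values.Add(value); }
    public List<JsonRoot> GetValues() { return values; }
}
public class JsonObject : JsonRoot { }
public class JsonArray : JsonRoot { }
public class JsonString : JsonRoot
{
    public string Text { get; }
    public JsonString(string text) { Text = text; }
}
public class JsonProperty : JsonRoot
{
    public string Name { get; }
    public JsonProperty(string name) { Name = name; }
}

public static class TreeIndex
{
    // Walks the tree once and maps each property name to the nodes that carry it.
    public static Dictionary<string, List<JsonProperty>> Build(JsonRoot root)
    {
        var index = new Dictionary<string, List<JsonProperty>>();
        if (root == null) return index;

        var stack = new Stack<JsonRoot>();
        stack.Push(root);
        while (stack.Count > 0)
        {
            JsonRoot node = stack.Pop();

            var property = node as JsonProperty;
            if (property != null)
            {
                List<JsonProperty> hits;
                if (!index.TryGetValue(property.Name, out hits))
                {
                    hits = new List<JsonProperty>();
                    index[property.Name] = hits;
                }
                hits.Add(property);
            }

            // Push the children so they are visited on later iterations.
            foreach (JsonRoot child in node.GetValues())
                stack.Push(child);
        }
        return index;
    }
}
```

After the index is built, a lookup such as index["OWNER_NAME"] returns the matching nodes directly, without walking the tree again. The trade-off is the extra memory for the dictionary, which matters when the tree itself already fills much of the available RAM.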

Conclusion

Data is getting bigger, and getting that data from file to memory can be challenging. But once it is in memory, you can begin to manipulate it and do new and useful things with it for your customers and clients. Json.NET, as shown here, is a useful tool in this process.


David Cox is a Principal Scientist at ABB Corporate Research in Raleigh, NC.

