


Weather Data Distribution & System Monitoring


Jan02: Programmer's Toolchest

An open-source ORB for a rainy day

Chris is a consultant for Transparent Technologies working as the Director of Software Architecture at The Weather Channel. He can be reached at [email protected].


One of the hallmarks of The Weather Channel (http://www.weather.com/) is that it localizes the display of weather data to thousands of geographic regions of the United States, whether a region is large or small. The result is a system with clients in the field at cable head ends across the country; the current point of localization is the cable television head end. Clearly, distributing data to these remote clients is a problem worth solving.

To provide the overall localization, there are several issues to understand:

  • Clients are connected via a one-way satellite link.
  • There are various generations of clients, most with incompatible protocols.

  • There are various sources of weather data, and these sources change over time.

In this article, I examine how data is processed and transmitted from Weather Channel systems to the satellite. This is the back-end system that actually takes weather data and translates it to a format that the end clients understand. Discussion of the protocol and synchronization of the clients is beyond the scope of this article.

Although the back-end system has a lot of interesting facets, I'll focus on the two most significant areas — data distribution and monitoring.

The Client

The client is where the visual magic takes place. In a nutshell, the client receives audio, video, and data. Its primary job is to take weather information and use it to modify the audio and video appropriately. Home viewers see the output as modified by the client; this is why I say the point of localization is at the cable head end. One client serves an entire head end, so all viewers attached to that head end see the same modified audio and video signals. Fortunately, each head end serves a small enough area that we can show viewers their local weather. Digital Broadcast Satellite currently suffers in this arena: Because it looks like one large virtual head end, with one client serving the entire country, we can't play the localization trick as easily. In short, it's hard to show local weather for every region in the country in under two minutes.

Essentially, The Weather Channel is operating many virtual TV networks. I say this because during the times when localized weather information is displayed, people in different areas of the country see a broadcast made just for them.

MICO/CORBA

Early on in the development of the back-end system, we decided to use CORBA to ease our data-distribution troubles. Luckily, there happen to be several high-quality open-source ORBs, and the one we chose was MICO (http://www.mico.org/). (For more information on MICO, see "The MICO CORBA-Compliant System," by Arno Puder, DDJ, November 1998; and "Examining CORBA Interoperability," by Eric Ironside, Letha Etzkorn, and David Zajac, DDJ, June 2001.) It has turned out to be quite solid for us, and CORBA (and thus MICO) has become the foundation of our data-distribution and monitoring capabilities.

By making the decision to use CORBA early on, we were able to use CORBA facilities in our design phase. It also took a big chunk out of the implementation time of the system because the communication framework didn't have to be coded.

The Back-End System

While The Weather Channel back-end system consists of any number of important subsystems (embedded scripting, priority-based packet transmission, human notification of critical errors, and the like), I'll focus here on two key components — data distribution and monitoring.

The basic problem we're solving with data distribution is actually simple in concept — we need to be able to receive data from our data providers, then send it to clients in the field. Of course, reality is seldom simple and it turns out that we have a few other problems to tackle as well. First, we change data providers fairly often. Second, we have multiple generations of clients with different protocols. Third, different types of data have different priorities.

As Figure 1 illustrates, the overall processing that must be performed on the data includes:

  • Data must be received from providers.
  • Data must be validated.

  • Data must be translated into a form clients can understand.

  • Translated data must be sent based on its priority.

An important fact to consider is that data does have priority: A tornado warning, for instance, is considered much more important than tomorrow's forecast. The system has to be able to discriminate what's most important to send at any given time. Fortunately, this problem is solved in the transmission phase: There are various transmission queues of different priorities, and all a program must do is put its packets in the right queue.
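To make the idea concrete, here is a minimal sketch of what such a priority-queue arrangement might look like. The class, type, and priority names are hypothetical; the article does not show the actual transmission subsystem.

// Hypothetical sketch of priority-based transmission queues; names are
// illustrative only and do not come from the actual system.
#include <map>
#include <queue>
#include <string>
#include <utility>

enum class Priority { Urgent = 0, Normal = 1, Bulk = 2 };   // e.g., warnings vs. forecasts

struct Packet {
    std::string payload;   // already translated into the client protocol
};

class TransmissionQueues {
public:
    // A program simply drops each packet into the queue matching its priority.
    void enqueue(Priority p, Packet pkt) { queues_[p].push(std::move(pkt)); }

    // The transmitter drains higher-priority queues before lower-priority ones.
    bool dequeue(Packet& out) {
        for (auto& entry : queues_) {
            std::queue<Packet>& q = entry.second;
            if (!q.empty()) { out = std::move(q.front()); q.pop(); return true; }
        }
        return false;
    }

private:
    std::map<Priority, std::queue<Packet>> queues_;   // ordered Urgent -> Bulk
};

A translator, for example, would enqueue a tornado warning as Urgent and tomorrow's forecast as Normal; the transmitter always empties the Urgent queue before touching the others.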

The most basic principle is that some data provider sends weather data to the back end, which then translates the data as appropriate and sends it on to the clients. But what happens if we need a new provider? Generally, weather data is weather data; only the format changes when you add a new provider. Perhaps one provider changes its format, or we find a better one; either way, we need to isolate those changes from the code that does the translation. And what happens if someone wants to do something with the data besides translate it? It would be nice to reuse the data-reception and validation code, right?

CORBA comes to our rescue in this case. As Figure 2 shows, we can use event channels to put data through the system.

The idea is that weather providers receive weather data and put it into a neutral format, one that can be pushed over event channels. The thinking is that the data content won't change that often; thus IDL changes will be minimal. It's okay if the format of the data changes, since we only have to change the code for the data provider. It doesn't matter where we get something such as a forecast from — it still has to have the forecast data in it. The neutral ground, then, is the data description in IDL. The listeners, on the other hand, are the entities that act on the data. Providers don't care what the listeners do — they can do anything they want.

What you get in this case is reuse and isolation. Every listener in our system is guaranteed that the data is validated upon receipt. Thus, any listener who wishes to hear weather data can rely on the fact that it has all been validated. This keeps us from having to write validation code for every listener in the system. Listeners and sources can change independently of each other, so long as there are no IDL changes. Also, because we are using CORBA, there is no reason why any of these entities must be on the same physical machine.

Data Providers

What do data providers do in our system? They have several responsibilities. First, they receive the data in whatever form the actual provider sends it in; perhaps it's a file, or maybe over a socket. Next, they have to validate the data they're getting. Then they generally separate weather data by major type onto different event channels. This lets listeners attach to channels that carry the data types they wish to hear, rather than having to attach to one channel and throw away a ton of data searching for the few types they are interested in.
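As a rough illustration of the provider side, the sketch below connects to an event channel as a push supplier and pushes one data item, using the standard CosEventChannelAdmin and CosEventComm interfaces that MICO implements. The WeatherData::Forecast type, the channel name, and the header paths are assumptions for the sake of the example; the article's actual IDL is not shown.

// Hedged sketch of a provider pushing data onto a CORBA event channel.
// The Forecast type and channel name are hypothetical stand-ins for the
// IDL-described weather data.
#include <CORBA.h>                       // header paths vary by ORB
#include <coss/CosNaming.h>
#include <coss/CosEventChannelAdmin.h>
#include "WeatherData.h"                 // hypothetical IDL-generated header

int main(int argc, char* argv[])
{
    CORBA::ORB_var orb = CORBA::ORB_init(argc, argv);

    // Look up the per-data-type event channel in the naming service.
    CORBA::Object_var nsObj = orb->resolve_initial_references("NameService");
    CosNaming::NamingContext_var nc = CosNaming::NamingContext::_narrow(nsObj);
    CosNaming::Name name;
    name.length(1);
    name[0].id   = CORBA::string_dup("ForecastChannel");   // hypothetical name
    name[0].kind = CORBA::string_dup("");
    CORBA::Object_var chanObj = nc->resolve(name);
    CosEventChannelAdmin::EventChannel_var channel =
        CosEventChannelAdmin::EventChannel::_narrow(chanObj);

    // Connect as a push supplier.
    CosEventChannelAdmin::SupplierAdmin_var admin = channel->for_suppliers();
    CosEventChannelAdmin::ProxyPushConsumer_var proxy = admin->obtain_push_consumer();
    proxy->connect_push_supplier(CosEventComm::PushSupplier::_nil());

    // Validate, fill in the neutral IDL structure, and push it.
    WeatherData::Forecast fc;            // hypothetical IDL struct
    fc.locationId = CORBA::string_dup("ATL");
    fc.highTempF  = 72;
    CORBA::Any event;
    event <<= fc;                        // insertion operator generated by the IDL compiler
    proxy->push(event);

    proxy->disconnect_push_consumer();
    return 0;
}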

In this arrangement, providers are isolated from what listeners do. When listeners of the data need changes, their changes do not affect providers, and vice versa. Providers only pump out data; they don't care what happens to it. Unless there is an IDL change, one data provider can be swapped out with another without touching listeners. In general, listeners are ignorant of even this level of change.

Providers are isolated from each other as well. For example, a data provider in the system is ignorant of other providers. There are no dependencies between them.

Data Listeners

The listeners' job seems obvious — they take action on the data they are listening for. What is done with the data is up to individual listeners; it can be anything they want. Listeners can reside anywhere on the network. They can rely on the distribution of data coming from data providers because the data has been structured and validated. An IDL compiler generates the code necessary to receive weather data elements. This makes life easier on systems that wish to receive a data feed.
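In practice, a listener is just an implementation of the standard CosEventComm::PushConsumer interface whose push() operation is handed each event. Here is a hedged sketch under the same assumptions as before (WeatherData.h and translateAndQueue() are hypothetical); the skeleton class name follows the standard POA C++ mapping.

// Hedged listener sketch: the event channel invokes push() for every event a
// provider sends; the listener extracts the IDL-described data and acts on it.
#include <CORBA.h>                    // header paths vary by ORB
#include <coss/CosEventComm.h>
#include "WeatherData.h"              // hypothetical IDL-generated header

class ForecastListener : public virtual POA_CosEventComm::PushConsumer {
public:
    void push(const CORBA::Any& event) {
        const WeatherData::Forecast* fc = 0;
        if (event >>= fc) {           // extraction operator generated by the IDL compiler
            translateAndQueue(*fc);   // e.g., translate to a client protocol and enqueue
        }
        // Events of other types on the channel are simply ignored.
    }

    void disconnect_push_consumer() {
        // The channel has gone away; clean up if necessary.
    }

private:
    void translateAndQueue(const WeatherData::Forecast& fc) {
        // Hand the translated packet to the transmission subsystem (not shown).
    }
};

Connecting the listener is the mirror image of the provider: obtain a ProxyPushSupplier from the channel's ConsumerAdmin (for_consumers(), then obtain_push_supplier()) and call connect_push_consumer() with the listener's object reference.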

With this arrangement, new listeners can easily be added and they can reside on any machine (thanks to CORBA). Listeners can do anything they need to do with the data pushed to them; providers don't care. For instance, we can add another set of listeners that transform the data to yet another protocol; they just "plug" into the system.

Listeners are isolated from each other as well. There are no dependencies between listeners; each acts as if it were the only listener around.

Currently, we have two sets of listeners. One set does the translation to the protocol that a block of clients can understand. They receive the data from the event service, translate it, and hand off the results to the transmission subsystem. The other set is for data analysis. Another system needed to have access to the same data the back-end system was dealing with for analysis purposes. We discovered that instead of writing a bunch of parsing code that had to change when providers changed, the analysis system could become a listener of our event channels. Since the data is expressed in IDL, interfacing with the analysis system was easy. It was definitely a win: The analysis system got to reuse the receipt and validation of our system. And, as data providers change, the analysis system keeps right up to date.

Putting It All Together

Figure 3 illustrates a potential hookup of providers and listeners. The idea here is that a provider can put data into multiple channels and that a listener can listen to more than one channel. There can even be no listeners on a channel. The providers put data out onto the channels, and anyone who needs the data listens for it. In the end it is a simple idea, but it leads to a lot of flexibility.

Since the system has been in operation for over a year, a few things can be said about this arrangement. On the good side:

  • High degree of modularity. We can swap out listeners and providers with no impact on the others. We have already successfully swapped out providers with no changes to listeners.
  • Other systems can leverage data reception and validation by becoming listeners. The data is packaged up for them and described via IDL.

  • We can send data to future clients even if they have a different protocol, merely by adding more listeners. We don't have to rearchitect the whole system — we just bolt on the new listeners.

On the bad side, however:

  • IDL changes can be painful. You've got to synchronize IDL with all constituent systems. The IDL isn't supposed to change often, but when it does, all affected systems have to get the new IDL and be able to handle it.
  • MICO's event service could exhaust memory if a listener hangs or stops processing pushes.

  • Listeners that aren't attached to event channels miss pushes. The CORBA event service is not like messaging middleware. The upside for weather data is that it expires. This means it doesn't make too much sense to try and store missed data messages. So if a listener crashes and comes back up, we have some missed data, but if the listener is down for any length of time, any data it missed could very well be useless.

What about the actual results? We have swapped data-distribution components in and out with little or no impact on the rest of the system. MICO's event service can enforce queue limits, so you can keep from exhausting memory when there is an errant listener. Missing data hasn't been a problem; weather data expires, so queuing up missed data is not really useful.

Other systems inside The Weather Channel have successfully become data listeners to get structured and validated data. This has saved those systems from having to write their own data parsing and validation, and they don't have to worry about data-provider changes. In practice, the changing of providers has not led to changes in listeners thus far.

System Monitoring

System monitoring is another important part of the system, since the data we do (or don't) send will be visible all across the nation on television. Consequently, it's nice to be able to alert operations staff when something is going wrong — hopefully in time to fix the problem before it shows up on everyone's television.

The kinds of things we wish to monitor are straightforward: Are we getting data? Is the data good? Are the processes executing normally or are they hung? As it turns out, the monitoring solution — which also utilizes CORBA — is simple, yet extremely powerful.

The basic idea behind our system monitoring is that our CORBA servers publish data through a simple interface. The data is published by name, and the interface is called MonitorTarget. Example 1 presents the definition of the interface.

Each CORBA server that wishes to support external monitoring implements the MonitorTarget interface. The CORBA server in effect publishes named data items. An external program can use the GetValues function to ask for a list of values by name. Each server registers a monitor target with the naming service (which MICO provides). Clients can then get access to these monitor targets by name. Typically, a target represents an entire system process and thus any CORBA objects handled by that process. Usually the name of the target will be MonitorTarget.process_name. (The dot in the name represents different name components in the naming service.)

It turns out that the implementation of the MonitorTarget interface is simple and can be reused. A C++ library object implements the interface. There are calls on the C++ object that allow one to publish data items. Items are stored via pointers so they can be instantly referenced. You can think of the class as a symbol table: pointers are stored by name. The implementation is straightforward: We use an STL map of strings to pointers.

In other words, all a server needs to do is instantiate the library object implementation of MonitorTarget. By doing so, the object registers with the name service and is ready to answer queries. To publish data, the server makes one function call per data item. The item can be a string, an integer, or a functor. The implementation converts the data item to a string when returning it via CORBA. There is also a call, of course, to stop publishing an item.
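The details of that library object are not reproduced here, but its core idea, a symbol table mapping published names to live data, can be sketched roughly as follows. The class and function names are hypothetical, and the CORBA servant plumbing and naming-service registration are omitted.

// Rough sketch of the publish-side "symbol table": items are stored by
// pointer (or as a functor), so queries always see the current value, and
// everything is converted to a string when returned via CORBA. All names
// are hypothetical.
#include <functional>
#include <map>
#include <sstream>
#include <stdexcept>
#include <string>

class MonitorRegistry {
public:
    void publish(const std::string& name, const std::string* v) { strings_[name] = v; }
    void publish(const std::string& name, const int* v)         { ints_[name] = v; }
    void publish(const std::string& name, std::function<std::string()> f) { functors_[name] = f; }

    void unpublish(const std::string& name) {
        strings_.erase(name); ints_.erase(name); functors_.erase(name);
    }

    // Called when answering a GetValues request.
    std::string value(const std::string& name) const {
        auto s = strings_.find(name);
        if (s != strings_.end()) return *s->second;
        auto i = ints_.find(name);
        if (i != ints_.end()) { std::ostringstream os; os << *i->second; return os.str(); }
        auto f = functors_.find(name);
        if (f != functors_.end()) return f->second();
        throw std::out_of_range("unknown item: " + name);   // would map to a CORBA exception
    }

private:
    std::map<std::string, const std::string*> strings_;
    std::map<std::string, const int*>         ints_;
    std::map<std::string, std::function<std::string()>> functors_;
};

A server might publish a counter once at startup, for example publish("Transmitter.numBadPackets", &numBadPackets), then simply update numBadPackets as it runs; queries always return the instantaneous value.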

In short, here's what the interface does:

  • GetValues takes a list of strings. Each item in the list is the name of a published data item. It returns a list of values corresponding to each name asked for. The names typically look like "ObjectName.objectAttribute"; for example, Transmitter.numBadPackets. If any names aren't known, a list of the unknown names is made and thrown as an exception. (A client-side sketch follows this list.)
  • Ping's primary purpose is to allow an external entity to see if the program is alive. A successful return means the process is running; a CORBA exception is generated if it's not. For convenience, Ping also returns the last error logged by the server.

  • GetUpTime returns the number of seconds the process has been up. For efficiency, we made this a separate function rather than forcing people to go through GetValues.
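To show how these operations fit together, here is a hedged sketch of a client querying a server. The production monitoring rules are written in a scripting language (see "Monitoring Rules" below); this C++ client is purely illustrative, and the Monitoring module, the NameList/ValueList typedefs, and the registered name used here are assumptions, since the IDL in Example 1 is not reproduced in this text.

// Hedged sketch of an external program querying a MonitorTarget.
// Module, typedef, and registered names are assumptions.
#include <CORBA.h>                    // header paths vary by ORB
#include <coss/CosNaming.h>
#include <iostream>
#include "Monitoring.h"               // hypothetical IDL-generated header

int main(int argc, char* argv[])
{
    CORBA::ORB_var orb = CORBA::ORB_init(argc, argv);

    // Resolve the process's monitor target from the naming service,
    // e.g. MonitorTarget.transmitter (one target per process).
    CORBA::Object_var nsObj = orb->resolve_initial_references("NameService");
    CosNaming::NamingContext_var nc = CosNaming::NamingContext::_narrow(nsObj);
    CosNaming::Name name;
    name.length(2);
    name[0].id = CORBA::string_dup("MonitorTarget"); name[0].kind = CORBA::string_dup("");
    name[1].id = CORBA::string_dup("transmitter");   name[1].kind = CORBA::string_dup("");
    CORBA::Object_var obj = nc->resolve(name);
    Monitoring::MonitorTarget_var target = Monitoring::MonitorTarget::_narrow(obj);

    // Ping doubles as a liveness check: a CORBA exception means the process is down.
    CORBA::String_var lastError = target->Ping();

    // Ask for published values by name; unknown names would raise an exception.
    Monitoring::NameList names;
    names.length(1);
    names[0] = CORBA::string_dup("Transmitter.numBadPackets");
    Monitoring::ValueList_var values = target->GetValues(names);
    std::cout << "Transmitter.numBadPackets = " << values[0] << std::endl;

    std::cout << "up for " << target->GetUpTime() << " seconds" << std::endl;
    return 0;
}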

Monitoring Rules

The ability to publish values is nice, but it is only half the story. There must also be some rules that tell you whether the system is healthy. Consequently, monitoring is done externally; that is, external programs query the published data to assess the health of the system. The monitoring rules are expressed in a scripting language. This makes it possible to change the rules without a full system rebuild, and it gives system administrators the power to customize the rules completely.

Initially, the system utilized S-Lang (http://www.s-lang.org/) as the scripting language. For a variety of reasons, we later opted for Python (http://www.python.org/). Since there is an ORB with native Python bindings (omniORB, http://www.uk.research.att.com/omniORB/omniORB.html), we could get rid of most of the glue code that tied S-Lang to CORBA.

Monitoring rules are expressed in Python. The Python scripts can query published data values from servers to assess the health of the system. If it is determined that there are errors requiring attention, human beings are notified. The amount of monitoring that can be done is staggering: A full-featured, object-oriented scripting language is available for expressing the monitoring rules. Not only that, it is highly accessible, as there are many books and web pages available for learning the language.

The Consequences

Although the monitoring system is powerful, it is nonetheless easy to deal with. One reason for this is that there is no real burden on server programmers, who can just use a precanned object and publish data right away. Once they publish data, they can then deal with the data as they normally would — no calls are necessary to "update" the data. When someone calls to query a value, the caller gets the instantaneous value of the data. So to the server programmer, adding items for monitoring is generally nonintrusive.

Second, the separation between the monitoring rules and what is actually being monitored is clear cut. Servers aren't cluttered with monitoring code per se; in general, they just publish data during startup. This partitioning means that we can make the monitoring rules arbitrarily complex while keeping that complexity out of the servers, which greatly simplifies system maintenance; it is hard to convey just how much.

Finally, the extensibility of the system makes it easy to add to this arrangement. For instance, if someone wanted to do SNMP queries against the system, we wouldn't have to rewrite anything. Instead, we would put an SNMP agent in the system that could map from an OID (Object Identifier) to a monitor target value. The agent would simply become another client of the MonitorTarget interface. The rest of the system would not need to change to support such queries; a new process would need to be written, but that's it.

The system, which has been in production for over a year, has clearly proven itself. In most cases, the operations staff has been notified of errors that home viewers would also notice, and has been able to fix the problems before viewers actually notice them. Also, as the system has matured, more and more monitoring rules have been added, leading to better preventive maintenance.

DDJ

