RESTful Web Service in Go Powered by the Google App Engine


Cloud computing has been gaining popularity over the last few years. Moreover, it is getting cheaper and cheaper, and even individual developers can afford to use this technology, often for free, within certain usage limits.

Google App Engine (GAE) is one cloud option. If you need a robust and scalable Web service, but you don't want to build the infrastructure yourself, GAE might be exactly what you need. With GAE, you focus only on the business side of your application, and GAE does the rest (load balancing, caching, backup, scheduling, etc.).

However, your application has to be designed according to GAE's rules. GAE provides special APIs for database access, caching, networking, and logging. You can still use the standard libraries of your programming language (with minor exceptions related to peculiarities of cloud deployment). For example, the state of the application should not be stored in the memory of your app instance; services such as the datastore should be used to persist state.

At the moment, GAE provides runtime and APIs for Java, Python, and Go. In this article, I build a sample Web service in Go and deploy it to GAE.

"Houston, We Have A Problem"

I recently faced a problem that I will use for an example Web service. The problem is real, useful, and not too sophisticated from the business-side perspective.

In the past year, I applied for a U.S. visa a couple of times at the embassy in London. Unfortunately, each time, my application required what they call "administrative processing," which can take a few weeks or even months. The updates regarding a visa are published in a PDF file on the embassy's website. The contents of that file are a table indexed by batch numbers (they assign you a batch number when you submit your application). So, to get an update of visa processing status, you download the PDF file, press CTRL-F, and search for a particular row containing your batch number.

I came up with the idea of putting a simple RESTful API on top of this manual process. Instead of fiddling with the PDF file, I'd simply go to a specific URL — for example, http://some-visa-service/batch/123456789 (where "123456789" would be the batch number) — hit enter, and it would print out the status of the application. Moreover, such an API can be easily adopted by any third-party software. The service could load the PDF in advance, parse it, and keep the contents in an internal representation. It would allow doing batch status retrievals without processing the entire PDF on each request.

This problem with U.S. visas occurred before I started playing with GAE. Initially, I implemented a simple standalone Web server in Go that parsed the PDF and exposed the information via a RESTful API. Luckily, in Go, a concurrent Web server can be implemented quite elegantly in just a few lines of code. Then I started thinking about hosting (a VPS, for instance) where I could have my server permanently running. But such an approach gave me nothing new or exciting, and the idea of using GAE emerged.

Dissecting the Problem

There are two subproblems here. The first is to programmatically parse the PDF file and extract the information; the second is the GAE deployment.

1. Parsing the PDF

Based on the PDF specifications available on the Adobe website, the contents of my particular PDF can be parsed this way:

  • Load up the PDF into memory.
  • Each fragment of the user information in a PDF is identified by "stream\r\n" and "endstream\r\n" markers. The blocks of data embraced by these markers are compressed with zlib (the deflate algorithm). So we need to find all such blocks using the markers and inflate (decompress) their contents.
  • Each decompressed block of data is now plain text containing PDF markup tags. We need to locate blocks of the text embraced by "BT\r\n" (Begin Text) and "ET\r\n" (End Text) tags. All these blocks will be collected into a list of strings.
  • Inside each block of text identified in the previous step (the elements of the collected list of strings), we need to find all substrings embraced by parentheses, the "(" and ")" characters. We join all these substrings, and the resulting text is a plain-text representation of the user information we are looking for. In other words, this step processes each string found in the previous step by removing everything that is not surrounded by parentheses.
  • Finally, we have to identify the structure of the plain text. The batch status information in our PDF is formatted as a three-column table. The columns are: batch number, status, and date. Unfortunately, the PDF contains some extra text (headers, footers, etc.) that will appear in our parsed plain text along with the useful information, so the list of strings from the previous step contains not only the useful payload but also some junk. We identify the table fields by iterating sequentially over the list of strings: if the current string looks like a batch number (11 decimal digits), we take it as the batch number, interpret the following two strings as the status update and the date, and then skip past them to look for the next potential batch number.

It all looks a little complicated, but with regular expressions, it takes only a few lines of code to implement.

Of course, we pretty much hard-coded the format, so if clerks in the U.S. embassy change their format, it may not work anymore. But at the moment, these four steps work fine for parsing the existing PDF.

2. GAE Deployment

Now, let's define our Web interface. Our application will be a Web service employing the RESTful approach in a very simple form. Let's say our service domain is "some-service.com." We will have three functions:

  1. http://some-service.com/batch/<number> to retrieve the status of a particular batch ("<number>" could be something like "20121234567")
  2. http://some-service.com/action/print/ to print out the entire table.
  3. http://some-service.com/action/refresh/ to reload the PDF and update the information in our Web service. This URL will also be invoked as a scheduled task on a regular basis, keeping our information up-to-date.

The last two functions should be available only to the administrator.
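On GAE, both of these requirements are handled by configuration rather than code: admin-only URLs are declared in app.yaml, and the periodic refresh is declared in cron.yaml. The following is only a sketch under my own assumptions (the application ID usvisa, the hourly schedule, and the classic Go SDK's _go_app script handler are illustrative):

	# app.yaml (sketch): route /action/* to the Go app, admins only.
	application: usvisa
	version: 1
	runtime: go
	api_version: go1

	handlers:
	- url: /action/.*
	  script: _go_app
	  login: admin
	- url: /.*
	  script: _go_app

	# cron.yaml (sketch): reload the PDF every hour via /action/refresh/.
	cron:
	- description: refresh the visa status table
	  url: /action/refresh/
	  schedule: every 1 hours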

Our internal representation of the data will be a map (a dictionary), in which the key is a string representing the batch number, and the value is a list of pairs holding the status and date. I found that the PDF may contain more than one record per batch number, so we need to extract all of them.

The pair is a structure defined in Go as:

    type BatchUpdate struct {
      Status, Date string
    }

And the map is defined as:

    type BatchTable map[string][]BatchUpdate

When the service is started, the storage (the map) is empty. When the first request comes in to the /batch function, it downloads the PDF, parses it, populates the map, and replies to the request. Subsequent requests will already have all the data in storage. I can maintain a timestamp indicating when the PDF was last parsed, and within each request, I can check whether the data is older than a certain period of time (and if so, reload the PDF). I can also reload the data at regular intervals; for instance, every hour. It is important to remember that because reloads may be initiated concurrently, I must synchronize access to the storage to avoid race conditions. This is simple to manage when the storage is in the memory of the same process, but if external persistent storage is used, things can get tricky. I will come back to this question of synchronization shortly.
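For the standalone-server variant, such synchronized in-memory storage could be sketched as follows (the names here are mine, not the service's, and the snippet assumes the BatchTable type defined above plus the standard sync and time packages):

	// A sketch of synchronized in-memory storage with a reload timestamp.
	// This applies to the standalone-server variant; on GAE, the state
	// lives in Memcache instead (see below).
	type cachedTable struct {
	  mu       sync.RWMutex
	  batches  BatchTable
	  loadedAt time.Time
	}

	// lookup returns the updates for one batch number under a read lock.
	func (ct *cachedTable) lookup(batch string) ([]BatchUpdate, bool) {
	  ct.mu.RLock()
	  defer ct.mu.RUnlock()
	  updates, ok := ct.batches[batch]
	  return updates, ok
	}

	// refresh atomically replaces the table and records the reload time.
	func (ct *cachedTable) refresh(batches BatchTable) {
	  ct.mu.Lock()
	  defer ct.mu.Unlock()
	  ct.batches = batches
	  ct.loadedAt = time.Now()
	}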

After the PDF is parsed, I place its contents in a variable of the BatchTable type.

There is another problem here. According to GAE requirements, the application must be stateless across HTTP requests because they can be processed on physically different machines. The only way to persist data is to use the Datastore API, Blobstore API, or Memcache API. Datastore is a semi-structured store that allows queries using a SQL-like approach. We could have a table in this store where the primary key is the batch number, and the status and the date are columns. Unfortunately, once a piece of data is written, it becomes visible to other instances of your application, which may also be updating the storage at the same time. If we treat the batch records individually, we could easily end up with an inconsistency because records from multiple concurrent refreshing processes can be mixed in the same database. Another issue is that when we load an updated PDF, we have to delete all existing records first, which, again, is not an atomic operation. Of course, transactional updates are still possible, but let's take a look at "lighter" alternatives that may be more suitable in our particular case.
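Before moving on, for illustration, the per-record Datastore approach just described might look roughly like this (a sketch only: the "Batch" kind, the BatchRecord struct, and the putRecord helper are my own naming, assuming the classic appengine/datastore package):

	// A sketch of the Datastore alternative: one entity per record.
	type BatchRecord struct {
	  Number string
	  Status string
	  Date   string
	}

	func putRecord(c appengine.Context, r BatchRecord) error {
	  // An auto-generated key; several records may share a batch number.
	  key := datastore.NewIncompleteKey(c, "Batch", nil)
	  _, err := datastore.Put(c, key, &r)
	  return err
	}

Each such call is an independent write, which is exactly why a refresh that deletes and re-inserts hundreds of entities is hard to keep consistent without transactions.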

Alternatively, we can serialize our map object to a byte blob and store it in Blobstore. Blobstore can hold large amounts of binary data (images, for instance). With Blobstore, we no longer face the concurrency issue because we read and write the table as a whole, atomically. But because Blobstore also keeps data on disk, we still have the overhead of reading and writing the entire table to and from disk.

Eventually, I implemented a hybrid approach using Memcache. Memcache is quite similar to Blobstore in the sense of dealing with blobs addressed by keys. Also, Memcache data is visible to all instances of the application over the network. The principal difference is that Memcache keeps data in memory only. Note that the GAE documentation says the cloud may decide to restart your application (for example, moving it to another cluster) or purge some content from Memcache to free space. So our application should expect that content may disappear from the Memcache store at any time. That would be a major problem if we stored individual records in Memcache, and this is where the "hybrid" comes in: I store all our data as a single record in Memcache, and if it disappears, I simply reload and parse the PDF again. Reads and writes to Memcache are atomic, so there is no concurrency issue.

The GAE Memcache API has one more limitation: the length of a record value cannot exceed 1MB. The PDF itself is about 1.6MB, but the information represented as a map container is only about 200KB, so for the moment I am far below that limit.

Here is a structure I will store in Memcache:

    type Table struct {
        UpdateTime string
        Batches    BatchTable
    }

It incorporates BatchTable and has a field called UpdateTime that lets us track the age of the data. The record key will be the hard-coded string "table". If the table record exists in Memcache, we retrieve it, deserialize it, and use it. If not, we load the PDF, parse it, serialize the result, and store it in Memcache.
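Putting these pieces together, the read-or-reload flow can be sketched like this (an illustration under my own assumptions: the getTable name is mine, and I use the memcache package's gob codec for serialization; LoadBatchTable and PdfUrl appear in the code that follows):

	// getTable returns the cached table, reloading and re-parsing the
	// PDF if Memcache has no (or an evicted) copy. A sketch, not the
	// article's final code.
	func getTable(c appengine.Context) (*Table, error) {
	  var t Table
	  // memcache.Gob transparently gob-decodes the stored value.
	  if _, err := memcache.Gob.Get(c, "table", &t); err == nil {
	    return &t, nil // a copy was found in Memcache
	  }
	  // Cache miss (or eviction): rebuild the table from the PDF.
	  batches, err := LoadBatchTable(c, PdfUrl)
	  if err != nil {
	    return nil, err
	  }
	  t = Table{UpdateTime: time.Now().String(), Batches: batches}
	  // Best effort: if the write fails, we still return the parsed data.
	  if err := memcache.Gob.Set(c, &memcache.Item{Key: "table", Object: t}); err != nil {
	    c.Errorf("memcache set failed: %v", err)
	  }
	  return &t, nil
	}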

Talk Is Cheap, Show Me Your Code

Now let's write some code. I'll be describing a fully functional and complete implementation of the service.

In the first lines, I declare the packages required for our application. Note that the first three packages are not standard Go packages; they are available only in the GAE SDK.

	package usvisa

	import (
	  "appengine"
	  "appengine/memcache"
	  "appengine/urlfetch"
	  "bytes"
	  "compress/zlib"
	  "errors"
	  "fmt"
	  "io/ioutil"
	  "net/http"
	  "regexp"
	  "strconv"
	  "strings"
	  "text/template"
	  "time"
	)

Next is the init() function, which runs at program initialization. Here, I register the handlers corresponding to the application's URLs (functions).

	func init() {
	  http.HandleFunc("/batch/", Batch)
	  http.HandleFunc("/action/refresh/", Refresh)
	  http.HandleFunc("/action/print/", Print)
	}
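To give an idea of what such a handler looks like, here is a rough sketch of Batch (not the article's final code; error handling is simplified, and it reloads the PDF on every request instead of consulting Memcache first):

	// A sketch of the Batch handler: extract the batch number from the
	// URL, load the table, and print any matching records.
	func Batch(w http.ResponseWriter, r *http.Request) {
	  // The context ties this request to the GAE APIs (see below).
	  c := appengine.NewContext(r)
	  number := r.URL.Path[len("/batch/"):]
	  table, err := LoadBatchTable(c, PdfUrl)
	  if err != nil {
	    http.Error(w, "failed to load data", http.StatusInternalServerError)
	    return
	  }
	  for _, u := range table[number] {
	    fmt.Fprintf(w, "%s: %s (%s)\n", number, u.Status, u.Date)
	  }
	}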

In the next lines, I declare a constant holding a URL of the PDF file.

	const (
	  PdfUrl = "http://photos.state.gov/libraries/unitedkingdom/164203/cons-visa/admin_processing_dates.pdf"
	)

Next, I declare a data structure for the contents of our PDF. BatchTable is a map, where the key is a batch number (a string) and the value is an array of pairs. Each pair (BatchUpdate type) holds a status and date.

	type BatchUpdate struct {
	  Status, Date string
	}
      	
	type BatchTable map[string][]BatchUpdate

The next lines contain regular expressions to dissect the PDF. TextBlockRE matches blocks of text surrounded by "BT\r\n" and "ET\r\n" tags. TextRE does the same for substrings in parentheses.

	var (
	  TextBlockRE = regexp.MustCompile(`(?ms)BT\r\n(.+?)ET\r\n`)
	  TextRE      = regexp.MustCompile(`\((.+?)\)`)
	)

Next come two string constants marking data streams in a PDF document.

	const (
	  StreamStartMarker = "stream\x0D\x0A"
	  StreamEndMarker   = "endstream\x0D\x0A"
	)

Now the GAE fun begins. I emphasize again that a GAE application does not keep state in its own memory; instead, it talks to GAE services through a context, which GAE derives from the request object. We will see where the context comes from a bit further down. Moreover, instead of the standard HTTP client in Go, I must use the urlfetch API provided by GAE, because App Engine applications don't access network sockets directly; they must use the App Engine APIs to communicate with the outside world.

Next is a function that loads a file from a given URL. After the file is read, it calls the parse() function. Note the lines where I do logging: instead of using the standard log package, I use the logging facilities provided by the cloud, addressable through the application context (the c variable).

	func LoadBatchTable(c appengine.Context, url string) (BatchTable, error) {
	  c.Infof("Started downloading")
	  duration, _ := time.ParseDuration("1m")
	  client := &http.Client{
	    Transport: &urlfetch.Transport{
	      Context:                       c,
	      Deadline:                      duration,
	      AllowInvalidServerCertificate: true,
	    },
	  }
	  response, err := client.Get(url)
	  if err != nil {
	    c.Errorf("GET failed, [%v]", err)
	    return nil, errors.New("GET failed")
	  }
	  defer response.Body.Close()
	  contents, err := ioutil.ReadAll(response.Body)
	  if err != nil {
	    c.Errorf("GET read failed, [%v]", err)
	    return nil, errors.New("ReadAll failed")
	  }
	  c.Infof("Loaded %d bytes\n", len(contents))
	  return parse(c, contents)
	}

Next is the parse() function, which implements the parsing steps described earlier. It locates the nearest data stream in the file; if no markers are found, we exit the loop. Then we copy out the block and cut it off from the remaining data.

	func parse(c appengine.Context, pdf []byte) (BatchTable, error) {
	  table := make(BatchTable)
	  for {
	    begin := bytes.Index(pdf, []byte(StreamStartMarker))
	    if begin == -1 {
	      break
	    }
	    pdf = pdf[begin+len(StreamStartMarker):]
	    end := bytes.Index(pdf, []byte(StreamEndMarker))
	    if end == -1 {
	      break
	    }
	    section := pdf[0:end]
	    pdf = pdf[end+len(StreamEndMarker):]

Next, I inflate (unzip) the data; if decompression fails, the function returns whatever has been collected so far, along with an error.

	    buf := bytes.NewBuffer(section)
	    zr, err := zlib.NewReader(buf)
	    if err != nil {
	      c.Errorf("Unzip initialization failed, [%v]", err)
	      return table, errors.New("Unzip initialization failed")
	    }
	    unzipped, err := ioutil.ReadAll(zr)
	    if err != nil {
	      c.Errorf("Unzip failed, [%v]", err)
	      return table, errors.New("Unzip failed")
	    }

Then, I apply the two regular expressions and accumulate the extracted strings in the records variable, which is a slice of strings.

	    var records []string
	    for _, group := range TextBlockRE.FindAllSubmatch(unzipped, -1) {
	      var lines [][]byte
	      for _, group := range TextRE.FindAllSubmatch(group[1], -1) {
	        lines = append(lines, group[1])
	      }
	      records = append(records, string(bytes.Join(lines, []byte{})))
	    }
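The listing is cut off at this point. To round the function out, a plausible continuation (a sketch of the final parsing step, not necessarily the author's exact code) scans records for 11-digit batch numbers, treats the two following strings as the status and the date, closes the outer loop, and returns the table:

	    // A sketch of the final step: an 11-digit string is a batch
	    // number; the next two strings are its status and date.
	    batchRE := regexp.MustCompile(`^\d{11}$`)
	    for i := 0; i+2 < len(records); i++ {
	      if batchRE.MatchString(records[i]) {
	        update := BatchUpdate{Status: records[i+1], Date: records[i+2]}
	        table[records[i]] = append(table[records[i]], update)
	        i += 2 // skip the status and date we just consumed
	      }
	    }
	  }
	  return table, nil
	}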

