Cloud computing has grown steadily more popular over the last few years. It is also getting cheaper: even individual developers can use the technology for free, within certain usage quotas.
Google App Engine (GAE) is one cloud option. If you need a robust and scalable Web service, but you don't want to build the infrastructure yourself, GAE might be exactly what you need. With GAE, you focus only on the business side of your application, and GAE does the rest (load balancing, caching, backup, scheduling, etc.)
However, your application has to be designed according to GAE's rules. GAE provides special APIs for the database, caching, networking, and logging. You can still use the standard libraries of your programming language (with minor exceptions related to peculiarities of the cloud deployment). For example, the state of the application should not be stored in the memory of your app instance, and services such as the datastore should be used to persist state.
At the moment, GAE provides runtime and APIs for Java, Python, and Go. In this article, I build a sample Web service in Go and deploy it to GAE.
"Houston, We Have A Problem"
I recently faced a problem that I will use for an example Web service. The problem is real, useful, and not too sophisticated from the business-side perspective.
In the past year, I applied for a U.S. visa a couple of times at the embassy in London. Unfortunately, each time, my application required what they call "administrative processing," which can take a few weeks or even months. The updates regarding a visa are published in a PDF file on the embassy's website. The contents of that file are a table indexed by batch numbers (they assign you a batch number when you submit your application). So, to get an update of visa processing status, you download the PDF file, press CTRL-F, and search for a particular row containing your batch number.
I came up with the idea of putting a simple RESTful API on top of this manual process. Instead of fiddling with the PDF file, I'd simply go to a specific URL — for example, http://some-visa-service/batch/123456789 (where "123456789" would be the batch number) — hit enter, and it would print out the status of the application. Moreover, such an API can be easily adopted by any third-party software. The service could load the PDF in advance, parse it, and keep the contents in an internal representation. It would allow doing batch status retrievals without processing the entire PDF on each request.
This problem with U.S. visas occurred before I started playing with GAE. Initially, I implemented a simple standalone Web server in Go that parsed the PDF and exposed the information via a RESTful API. Luckily, in Go, an advanced concurrent Web server can be implemented quite elegantly in very few lines of code. Then I started thinking about hosting (a VPS, for instance) where I could have my server permanently running. But such an approach gave me nothing new or exciting, and the idea of using GAE emerged.
Dissecting the Problem
There are two subproblems here. The first is to programmatically parse our PDF file and extract the information. The second is the GAE deployment.
1. Parsing the PDF
Based on the PDF specifications available on the Adobe website, the contents of my particular PDF can be parsed this way:
- Load up the PDF into memory.
- Each fragment of the user information in a PDF is delimited by "stream\r\n" and "endstream\r\n" markers. The blocks of data between these markers are compressed with zlib (the deflate algorithm). So we need to find all such blocks using the markers and inflate (decompress) their contents.
- Each inflated (unzipped) block of data is now plain text containing PDF markup tags. We need to locate the blocks of text enclosed by "BT\r\n" (Begin Text) and "ET\r\n" (End Text) tags. All these blocks will be collected in a list of strings.
- Inside each block of text identified in the previous step (the elements of the collected list of strings), we need to find all substrings enclosed in parentheses, that is, between "(" and ")" characters. We join all these substrings, and the resulting text is a plain text representation of the user information we are looking for. In other words, in this step we process each string found in the previous step by removing everything that is not surrounded by parentheses.
- Finally, we have to identify the structure of the plain text. The batch status information in our PDF is formatted as a three-column table. The columns are: batch number, status, and date. Unfortunately, the PDF contains some extra text (headers, footers, etc.) that will also appear in our parsed plain text along with the useful information (the list of strings from the previous step will contain not only useful payload, but also some junk). We will identify the table column fields by iterating sequentially along the list of strings. If the current string looks like a batch number (11 decimal digits), we take it as the batch number, and the following two strings will be interpreted as the status update and the date, respectively. We take these three strings and skip past them to look for the next potential batch number.
It all looks a little complicated, but with regular expressions, it takes just a few lines of code to implement.
Of course, we pretty much hard-coded the format, so if clerks in the U.S. embassy change their format, it may not work anymore. But at the moment, these steps work fine for parsing the existing PDF.
2. GAE deployment
Now, let's define our Web interface. Our application will be a Web service employing the RESTful approach in a very simple form. Let's say our service domain is "some-service.com." We will have three functions:
- http://some-service.com/batch/<number> to retrieve the status of a particular batch ("<number>" could be something like "20121234567")
- http://some-service.com/action/print/ to print out the entire table.
- http://some-service.com/action/refresh/ to reload the PDF and update the information in our Web service. This URL also will be a scheduled task executed automatically on a regular basis, keeping our information up-to-date.
The last two functions should be available only to the administrator.
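To make the first URL concrete, here is a small sketch of how a handler might pull the batch number out of the request path. The helper name batchNumberFromPath is hypothetical, and the check for exactly 11 digits is an assumption taken from the batch number format described earlier.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// batchPrefix is the URL prefix of the batch lookup function.
const batchPrefix = "/batch/"

// elevenDigits encodes the assumed batch number format: 11 decimal digits.
var elevenDigits = regexp.MustCompile(`^[0-9]{11}$`)

// batchNumberFromPath (hypothetical helper) returns the batch number found
// after "/batch/" and reports whether the path held a well-formed number.
func batchNumberFromPath(path string) (string, bool) {
	if !strings.HasPrefix(path, batchPrefix) {
		return "", false
	}
	// Tolerate an optional trailing slash after the number.
	number := strings.TrimSuffix(strings.TrimPrefix(path, batchPrefix), "/")
	if !elevenDigits.MatchString(number) {
		return "", false
	}
	return number, true
}

func main() {
	for _, p := range []string{"/batch/20121234567", "/batch/abc", "/action/print/"} {
		number, ok := batchNumberFromPath(p)
		fmt.Printf("%q -> %q, %v\n", p, number, ok)
	}
}
```

Rejecting malformed numbers up front keeps the lookup in the table a plain map access.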
Our internal representation of the data will be a map (a dictionary), in which the key is a string representing the batch number, and the value is a list of pairs holding the status and date. I found that the PDF may contain more than one record per batch number, so we need to extract all of them.
The pair is a structure defined in Go as:
type BatchUpdate struct {
	Status, Date string
}
And the map is defined as:
type BatchTable map[string][]BatchUpdate
When the service is started, the storage (map) is empty. When the first request comes in to the /batch function, the service downloads the PDF, parses it, populates the map, and replies to the request. Subsequent requests will already have all the data in storage. I can maintain a timestamp indicating when the PDF was last parsed, and within each request, I can check whether the data is older than a certain period of time (and if so, reload the PDF). I can also reload the data at regular intervals; for instance, every hour. It is important to remember that because the reloads may be initiated concurrently, I must synchronize access to the storage to avoid race conditions. This is simple to manage when the storage is in the memory of the same process, but if external persistent storage is used, things can get tricky. I will come back to this question of synchronization shortly.
After the PDF is parsed, I place its contents in a variable of the
BatchTable type.
There is another problem here. According to GAE requirements, the application must be stateless across HTTP requests because they can be processed on physically different machines. The only way to persist is to use the Datastore API, Blobstore API, or Memcache API. Datastore is a semi-structured store allowing queries using a SQL-like approach. We could have a table in this store where the primary key is the batch number, and the status and the date are columns. Unfortunately, once a piece of data is written, it becomes visible to other instances of your application, which in turn also may update the storage at the same time. If we treat the batch records individually, we may easily end up with an inconsistency because the records from multiple concurrent refreshing processes can be mixed in the same database. Another issue is that when we load an updated PDF, we have to delete all existing records first, which, again, is not an atomic operation. Of course, transactional updates are still possible, but let's take a look at "lighter" alternatives that may be more suitable in our particular case.
Alternatively, we can serialize our map object to a byte blob and stick it in Blobstore. Blobstore can hold a lot of binary data (for instance, images). With Blobstore, we don't face the concurrency issue anymore because we will read and write the table as a whole atomically. But here, because Blobstore also keeps data on disk, we still have the redundant overhead of reading and writing the table on the disk back and forth.
Eventually, I implemented a hybrid approach using Memcache. Memcache is quite similar to Blobstore in the sense of dealing with blobs addressed by keys. Memcache data is also visible to all instances of the application over the network. The principal difference is that Memcache keeps data in memory only. Note that the GAE documentation says that the cloud may decide to restart your application (for example, moving it to another cluster) or purge some content from Memcache to free space. So our application should expect that content may disappear from the Memcache store at any time. This could be a major problem if we stored individual records in Memcache. This is where the "hybrid" comes in. I will store all our data as a single record in Memcache. If it disappears, I will simply reload and parse the PDF again. Reads and writes to Memcache are atomic, so there is no concurrency issue. The GAE Memcache API has one more limitation: The length of a record value cannot be more than 1MB. The length of our PDF is about 1.6MB, but the length of the information represented as a map container is only about 200KB, so for the moment I am far below that limit.
Here is a structure I will store in Memcache:
type Table struct {
	UpdateTime string
	Batches    BatchTable
}
It incorporates BatchTable and adds a field called UpdateTime, which
makes it possible to track the age of the data. The record key will be
the hard-coded value "table". If the "table" record exists in Memcache,
we retrieve it, deserialize it, and use it. If not, we load the PDF,
parse it, serialize the result, and store it in Memcache.
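The serialize and deserialize halves of that round trip might look like the sketch below. It deliberately leaves out the Memcache calls themselves and shows only how the Table structure can be turned into a byte blob and back. Using encoding/gob is an assumption made for this illustration, as are the names encodeTable, decodeTable, and maxValueSize; the 1MB figure is the limit quoted above.

```go
package main

import (
	"bytes"
	"encoding/gob"
	"fmt"
)

type BatchUpdate struct{ Status, Date string }

type BatchTable map[string][]BatchUpdate

type Table struct {
	UpdateTime string
	Batches    BatchTable
}

// maxValueSize reflects the 1MB record limit mentioned in the article.
const maxValueSize = 1 << 20

// encodeTable serializes the table with gob (an assumption of this sketch)
// and refuses blobs that would not fit into a single cache record.
func encodeTable(t Table) ([]byte, error) {
	var buf bytes.Buffer
	if err := gob.NewEncoder(&buf).Encode(t); err != nil {
		return nil, err
	}
	if buf.Len() > maxValueSize {
		return nil, fmt.Errorf("encoded table is %d bytes, over the %d-byte limit", buf.Len(), maxValueSize)
	}
	return buf.Bytes(), nil
}

// decodeTable is the inverse of encodeTable.
func decodeTable(blob []byte) (Table, error) {
	var t Table
	err := gob.NewDecoder(bytes.NewReader(blob)).Decode(&t)
	return t, err
}

func main() {
	in := Table{
		UpdateTime: "2013-01-20 10:00",
		Batches:    BatchTable{"20121234567": {{Status: "Pending", Date: "01-Jan-2013"}}},
	}
	blob, err := encodeTable(in)
	if err != nil {
		fmt.Println("encode failed:", err)
		return
	}
	out, err := decodeTable(blob)
	fmt.Println(len(blob), out.UpdateTime, out.Batches, err)
}
```

Checking the size before storing turns a silent cache failure into an explicit error if the table ever grows past the limit.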
Talk Is Cheap, Show Me Your Code
Now let's write some code. I'll be describing a fully functional and complete implementation of the service.
In the first lines, I declare the packages required for our application. Note, the first three packages are not standard Go packages: They are available only in the GAE.
package usvisa

import (
	"appengine"
	"appengine/memcache"
	"appengine/urlfetch"
	"bytes"
	"compress/zlib"
	"fmt"
	"io/ioutil"
	"net/http"
	"regexp"
	"strconv"
	"strings"
	"text/template"
	"time"
	"errors"
)
Next comes an init() function, which runs at program initialization.
Here, I register the handlers corresponding to the application URLs
(functions).
func init() {
	http.HandleFunc("/batch/", Batch)
	http.HandleFunc("/action/refresh/", Refresh)
	http.HandleFunc("/action/print/", Print)
}
In the next lines, I declare a constant holding a URL of the PDF file.
const (
	PdfUrl = "http://photos.state.gov/libraries/unitedkingdom/164203/cons-visa/admin_processing_dates.pdf"
)
Next, I declare a data structure for the contents of our PDF.
BatchTable is a map, where the key is a batch number (a string)
and the value is an array of pairs. Each pair (BatchUpdate type)
holds a status and date.
type BatchUpdate struct {
	Status, Date string
}
type BatchTable map[string][]BatchUpdate
The next lines contain regular expressions to dissect the PDF.
TextBlockRE matches blocks of text surrounded by "BT\r\n" and
"ET\r\n" tags. TextRE does the same for substrings in
parentheses.
var (
	TextBlockRE = regexp.MustCompile(`(?ms)BT\r\n(.+?)ET\r\n`)
	TextRE      = regexp.MustCompile(`\((.+?)\)`)
)
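To see what these two expressions do, here is a small standalone sketch that runs them over a hand-made fragment of text. The fragment is invented for illustration and is far simpler than a real PDF content stream; the helper name extractText is hypothetical, and the sketch does not handle escaped parentheses inside PDF strings.

```go
package main

import (
	"bytes"
	"fmt"
	"regexp"
)

var (
	// Same patterns as in the article, under lowercase names.
	textBlockRE = regexp.MustCompile(`(?ms)BT\r\n(.+?)ET\r\n`)
	textRE      = regexp.MustCompile(`\((.+?)\)`)
)

// extractText (hypothetical helper) returns one string per BT/ET block,
// built by joining every parenthesized substring found inside the block.
func extractText(content []byte) []string {
	var records []string
	for _, block := range textBlockRE.FindAllSubmatch(content, -1) {
		var parts [][]byte
		for _, m := range textRE.FindAllSubmatch(block[1], -1) {
			parts = append(parts, m[1])
		}
		records = append(records, string(bytes.Join(parts, nil)))
	}
	return records
}

func main() {
	// Invented sample: two text blocks separated by unrelated drawing commands.
	content := []byte("BT\r\n/F1 9 Tf\r\n(2012)Tj\r\n(1234567)Tj\r\nET\r\n" +
		"q\r\n0 0 m\r\nQ\r\n" +
		"BT\r\n(Pending)Tj\r\nET\r\n")
	fmt.Printf("%q\n", extractText(content)) // prints ["20121234567" "Pending"]
}
```

Note how the batch number arrives split over two parenthesized pieces in the first block and is stitched back together by the join.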
Next come two string constants marking data streams in a PDF document.
const (
	StreamStartMarker = "stream\x0D\x0A"
	StreamEndMarker   = "endstream\x0D\x0A"
)
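The way these markers are used together with zlib can be tried out in isolation. The sketch below builds a tiny fake document containing one compressed and one uncompressed stream, then finds the streams by their markers and inflates them. The helper names deflate and inflateStreams are hypothetical, the fake document is not a valid PDF, and skipping a stream that does not inflate is a choice made for this sketch.

```go
package main

import (
	"bytes"
	"compress/zlib"
	"fmt"
	"io/ioutil"
)

const (
	streamStart = "stream\r\n"
	streamEnd   = "endstream\r\n"
)

// deflate compresses data with zlib. It is used only to build the sample.
func deflate(data []byte) []byte {
	var buf bytes.Buffer
	w := zlib.NewWriter(&buf)
	w.Write(data)
	w.Close()
	return buf.Bytes()
}

// inflateStreams returns the decompressed payload of every stream that
// inflates cleanly. Streams that are not zlib data are skipped.
func inflateStreams(doc []byte) [][]byte {
	var out [][]byte
	for {
		begin := bytes.Index(doc, []byte(streamStart))
		if begin == -1 {
			break
		}
		doc = doc[begin+len(streamStart):]
		end := bytes.Index(doc, []byte(streamEnd))
		if end == -1 {
			break
		}
		section := doc[:end]
		doc = doc[end+len(streamEnd):]

		zr, err := zlib.NewReader(bytes.NewReader(section))
		if err != nil {
			continue // not a zlib stream
		}
		data, err := ioutil.ReadAll(zr)
		zr.Close()
		if err != nil {
			continue // corrupted stream
		}
		out = append(out, data)
	}
	return out
}

func main() {
	doc := append([]byte("1 0 obj\r\nstream\r\n"), deflate([]byte("BT\r\n(Pending)Tj\r\nET\r\n"))...)
	doc = append(doc, []byte("\r\nendstream\r\nendobj\r\n2 0 obj\r\nstream\r\nnot compressed\r\nendstream\r\nendobj\r\n")...)
	for _, s := range inflateStreams(doc) {
		fmt.Printf("%q\n", s)
	}
}
```

Only the first stream is printed, because the second one does not start with a valid zlib header.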
Now the GAE fun begins. I emphasize again that GAE applications must keep
their state via the context. The context is passed by GAE along with the
request object. We will see where the context is coming from a bit further
down. Moreover, instead of the standard HTTP client in Go, I must use
the urlfetch API provided by GAE because App Engine applications don't access
network sockets. They must use the App Engine APIs to communicate with the
outside world.
In the next lines of code, there is a function loading up a file from a
given URL. After the file is read, it calls the parse() function.
Note the lines where I do logging. Again, instead of using the standard log
package, I use logging facilities provided by the cloud and addressable
through the application context (the c variable).
func LoadBatchTable(c appengine.Context, url string) (BatchTable, error) {
	c.Infof("Started downloading")
	duration, _ := time.ParseDuration("1m")
	// The HTTP client goes through the urlfetch transport provided by GAE.
	client := &http.Client{
		Transport: &urlfetch.Transport{
			Context:                       c,
			Deadline:                      duration,
			AllowInvalidServerCertificate: true,
		},
	}
	response, err := client.Get(url)
	if err != nil {
		c.Errorf("GET failed, [%v]", err)
		return nil, errors.New("GET failed")
	}
	defer response.Body.Close()
	contents, err := ioutil.ReadAll(response.Body)
	if err != nil {
		c.Errorf("GET read failed, [%v]", err)
		return nil, errors.New("ReadAll failed")
	}
	c.Infof("Loaded %d bytes\n", len(contents))
	return parse(c, contents)
}
In the next lines, there is a function implementing the parsing steps described earlier, starting with the search for data streams. On each pass, it locates the next data stream in the file. If no markers are found, we exit the loop. Otherwise, we copy out the block and cut it from the data.
func parse(c appengine.Context, pdf []byte) (BatchTable, error) {
	table := make(BatchTable)
	for {
		begin := bytes.Index(pdf, []byte(StreamStartMarker))
		if begin == -1 {
			break
		}
		pdf = pdf[begin+len(StreamStartMarker):]
		// Look for the end marker, not a second start marker.
		end := bytes.Index(pdf, []byte(StreamEndMarker))
		if end == -1 {
			break
		}
		section := pdf[0:end]
		pdf = pdf[end+len(StreamEndMarker):]
Next, I unzip the data. If unzipping fails, I log the error and return
the table collected so far together with an error.
		buf := bytes.NewBuffer(section)
		zr, err := zlib.NewReader(buf)
		if err != nil {
			c.Errorf("Unzip initialization failed, [%v]", err)
			return table, errors.New("Unzip initialization failed")
		}
		unzipped, err := ioutil.ReadAll(zr)
		if err != nil {
			c.Errorf("Unzip failed, [%v]", err)
			return table, errors.New("Unzip failed")
		}
Then, I apply the two regular expressions and accumulate found strings in
the records variable, which is a list of strings.
		var records []string
		for _, block := range TextBlockRE.FindAllSubmatch(unzipped, -1) {
			var lines [][]byte
			// Collect every parenthesized substring inside this text block.
			for _, match := range TextRE.FindAllSubmatch(block[1], -1) {
				lines = append(lines, match[1])
			}
			records = append(records, string(bytes.Join(lines, []byte{})))
		}




