Often during my consulting engagements I run into people who say, "some things just can’t be made asynchronous," even after they agree about the inherent scalability that asynchronous communication patterns bring. One often-cited example is user authentication - taking a username and password combo and authenticating it against some back-end store. For the purpose of this post, I’m going to assume a database. Also, I’m not going to be showing more advanced features like ETags to further improve the solution.
Just so that the example is in itself secure, we’ll assume that the password is one-way hashed before being stored. Also, given a reasonable network infrastructure our web servers will be isolated in the DMZ and will have to access some application server which, in turn, will communicate with the DB. There’s also a good chance for something like round-robin load-balancing between web servers, especially for things like user login.
Before diving into the meat of it, I wanted to preface with a few words. One of the commonalities I’ve found when people dismiss asynchrony is that they don’t consider a real deployment environment, or scaling up a solution to multiple servers, farms, or datacenters.
The Synchronous Solution
In the synchronous solution, each one of our web servers will be contacting the app server for each user login request. In other words, the load on the app server and, consequently, on the database server will be proportional to the number of logins. One property of this load is its data locality, or rather, the lack of it. Given that user U logged in, the DB won’t necessarily gain any performance benefits by loading all username/password data into memory for the same page as user U. Another property is that this data is very non-volatile - it doesn’t change that often.
I won’t go too far into the synchronous solution since it’s been analysed numerous times before. The bottom line is that the database is the bottleneck. You could use sharding solutions. Many of the large sites have numerous read-only databases for this kind of data, with one master for updates - replicating out to the read-only replicas. That’s great if you’re using a nice cheap database like MySQL (of LAMP), not so nice if you’re running Oracle or MS SQL Server.
Regardless of what you’re doing in your data tier, every login still ends up there. Wouldn’t it be nice to close the loop in the web servers? Even if you are using Apache, that’s going to be less iron, electricity, and cooling all around. That’s what the asynchronous solution is all about - capitalizing on the low cost of memory to save on other things.
The Asynchronous Solution
In the asynchronous solution, we cache username/hashed-password pairs in memory on our web servers, and authenticate against that. Let’s analyse how much memory that takes:
Usernames are usually 12 characters or less, but let’s take an average of 32 to be sure. Using 2-byte Unicode characters we get to 64 bytes for the username. Hashed passwords can run between 256 and 512 bits depending on the algorithm; divide the larger figure by 8 and you have 64 bytes. That’s about 128 bytes altogether, so we can safely cache 8 million of these with 1GB of memory per web server. If you’ve got a million users, first of all, good for you. Second, that’s just 128 MB of memory - relatively nothing even for a cheap 2GB web server.
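The arithmetic above can be checked with a few lines of Python. This is only a back-of-the-envelope sketch using the figures from the text; it ignores whatever per-entry overhead the cache's data structure adds, which depends on the runtime.

```python
# Back-of-the-envelope cache sizing, using the figures from the text.
USERNAME_CHARS = 32   # generous average username length
BYTES_PER_CHAR = 2    # 2-byte Unicode characters
HASH_BITS = 512       # upper end of the hash sizes mentioned

username_bytes = USERNAME_CHARS * BYTES_PER_CHAR    # 64 bytes
hash_bytes = HASH_BITS // 8                         # 64 bytes
entry_bytes = username_bytes + hash_bytes           # 128 bytes per user

entries_per_gb = (1024 ** 3) // entry_bytes         # entries that fit in 1GB
bytes_for_million_users = 1_000_000 * entry_bytes   # memory for a million users

print(entry_bytes, entries_per_gb, bytes_for_million_users)
```

That works out to a little over 8 million entries per gigabyte, and 128 million bytes for a million users.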
Also, consider the fact that when registering a new user we can check if such a username is already taken at the web server level. That doesn’t mean it won’t be checked again in the DB to account for concurrency issues, but that the load on the DB is further reduced. Other things to notice include no read-only replicas and no replication. Simple. Our web servers are the "replicas".
The Authentication Service
What makes it all work is the "Authentication Service" on the app server. This was always there in the synchronous solution. It is what used to field all the login requests from the web servers, and, of course, allowed them to register new users and all the regular stuff. The difference is that now it publishes a message when a new user is registered (or rather, is validated - all a part of the internal long-running workflow). It also allows subscribers to receive the list of all username/hashed-password pairs. It’s also quite likely that it would keep the same data in memory too.
When using NServiceBus, the same message can be used both to publish single updates and to return the full list. Let’s define the message:
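A minimal sketch of that message, written in Python rather than against the NServiceBus C# API. The field names `username` and `hashed_password` are assumptions made for illustration; only the message name comes from the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class UsernameInUseMessage:
    """One username/hashed-password pair.

    Published once for each newly validated user, and also sent in bulk
    when a web server asks for the full list.
    """
    username: str
    hashed_password: bytes  # one-way hash only, never the clear-text password
```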
And the message that the web server sends when it wants the full list:
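Sketched the same way in Python, the request can be an empty marker message. That it carries no fields is an assumption here; the reply address would normally be handled by the messaging infrastructure rather than by the message body.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GetAllUsernamesMessage:
    """Request from a web server for every username/hashed-password pair.

    Deliberately empty in this sketch: the type alone says what is wanted.
    """
```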
And the code that the web server runs on startup looks like this (assuming constructor injection):
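A rough Python sketch of that startup step follows. The `Bus` interface with `subscribe` and `send` is hypothetical and stands in for whatever the real bus exposes; it is not the NServiceBus API. `RecordingBus` exists only so the sketch can be exercised without any infrastructure.

```python
from dataclasses import dataclass
from typing import Protocol


@dataclass(frozen=True)
class UsernameInUseMessage:
    username: str
    hashed_password: bytes


@dataclass(frozen=True)
class GetAllUsernamesMessage:
    pass


class Bus(Protocol):
    """Hypothetical bus interface, assumed for illustration."""
    def subscribe(self, message_type: type) -> None: ...
    def send(self, message: object) -> None: ...


class WebServerStartup:
    def __init__(self, bus: Bus) -> None:
        # The bus is supplied through the constructor (constructor injection).
        self._bus = bus

    def run(self) -> None:
        # Subscribe before asking for the full list, so that updates published
        # while the list is in flight are not missed.
        self._bus.subscribe(UsernameInUseMessage)
        self._bus.send(GetAllUsernamesMessage())


class RecordingBus:
    """In-memory stand-in that just records what was asked of it."""
    def __init__(self) -> None:
        self.calls = []

    def subscribe(self, message_type: type) -> None:
        self.calls.append(("subscribe", message_type))

    def send(self, message: object) -> None:
        self.calls.append(("send", message))


bus = RecordingBus()
WebServerStartup(bus).run()
```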
And the code that runs in the Authentication Service when the GetAllUsernamesMessage is received:
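On the service side, a handler along these lines would do. This is a Python sketch under assumptions: the `users` dictionary stands for the service's in-memory copy of the data, and `reply` is a hypothetical callback that sends a batch back to the web server that asked.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class UsernameInUseMessage:
    username: str
    hashed_password: bytes


@dataclass(frozen=True)
class GetAllUsernamesMessage:
    pass


class GetAllUsernamesHandler:
    def __init__(
        self,
        users: Dict[str, bytes],
        reply: Callable[[List[UsernameInUseMessage]], None],
    ) -> None:
        self._users = users  # username -> hashed password, held in memory
        self._reply = reply  # hypothetical: replies to the requesting server

    def handle(self, message: GetAllUsernamesMessage) -> None:
        # Build one logical message per user, then send them all in a
        # single reply to the web server that asked.
        batch = [UsernameInUseMessage(u, h) for u, h in self._users.items()]
        self._reply(batch)


sent = []
handler = GetAllUsernamesHandler({"alice": b"a", "bob": b"b"}, sent.append)
handler.handle(GetAllUsernamesMessage())
```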
And the class on the web server that handles a UsernameInUseMessage when it arrives:
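A sketch of such a handler in Python, together with the cache it feeds. `UserCache` and its `put`/`get` methods are names invented for this illustration; the lock is there on the assumption that message handling and page requests run on different threads.

```python
import threading
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class UsernameInUseMessage:
    username: str
    hashed_password: bytes


class UserCache:
    """Thread-safe username -> hashed-password map held in web server memory."""
    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._entries: Dict[str, bytes] = {}

    def put(self, username: str, hashed_password: bytes) -> None:
        with self._lock:
            self._entries[username] = hashed_password

    def get(self, username: str) -> Optional[bytes]:
        with self._lock:
            return self._entries.get(username)


class UsernameInUseHandler:
    def __init__(self, cache: UserCache) -> None:
        self._cache = cache

    def handle(self, message: UsernameInUseMessage) -> None:
        # The same handler serves single updates and the initial full list.
        self._cache.put(message.username, message.hashed_password)


cache = UserCache()
handler = UsernameInUseHandler(cache)
# A batch of logical messages is dispatched to the handler one at a time.
for msg in [UsernameInUseMessage("alice", b"a"), UsernameInUseMessage("bob", b"b")]:
    handler.handle(msg)
```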
When the app server sends the full list, multiple objects of the type UsernameInUseMessage are sent in one physical message to that web server. However, the bus object that runs on the web server dispatches each of these logical messages one at a time to the message handler above.
So, when it comes time to actually authenticate a user, this is what the web page (or controller, if you’re doing MVC) would call:
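In Python, the check could look roughly like this. The cache is shown as a plain dictionary, and SHA-512 is only a stand-in for whichever one-way hash the system actually uses; both are assumptions. A salted, deliberately slow password hash would mean caching the salt alongside the hash.

```python
import hashlib
import hmac
from typing import Dict


def hash_password(password: str) -> bytes:
    # Stand-in one-way hash (assumption): SHA-512 gives a 64-byte digest,
    # which matches the size estimate used earlier.
    return hashlib.sha512(password.encode("utf-8")).digest()


def authenticate(cache: Dict[str, bytes], username: str, password: str) -> bool:
    """Check a login against the local cache only; no call to the app server."""
    stored = cache.get(username)
    if stored is None:
        return False  # unknown user
    # Constant-time comparison, so the check does not leak how many bytes matched.
    return hmac.compare_digest(stored, hash_password(password))


cache = {"alice": hash_password("s3cret")}
print(authenticate(cache, "alice", "s3cret"))
print(authenticate(cache, "alice", "wrong"))
```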
When registering a new user, the web server would of course first check its cache, and then send a RegisterUserMessage that contained the username and the hashed password.
When the RegisterUserMessage arrives at the app server, a new long-running workflow is kicked off to handle the process:
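The shape of that workflow can be sketched in Python as below. Everything beyond the message names is an assumption: the fields, the validation email that carries the workflow Id, and the three injected callbacks (`send_validation_email`, `save_user`, `publish`) are hypothetical stand-ins for the real infrastructure.

```python
import uuid
from dataclasses import dataclass


@dataclass(frozen=True)
class RegisterUserMessage:
    username: str
    email: str
    hashed_password: bytes


@dataclass(frozen=True)
class UserValidatedMessage:
    workflow_id: uuid.UUID


@dataclass(frozen=True)
class UsernameInUseMessage:
    username: str
    hashed_password: bytes


class RegistrationWorkflows:
    def __init__(self, send_validation_email, save_user, publish) -> None:
        self._pending = {}  # workflow id -> original RegisterUserMessage
        self._send_validation_email = send_validation_email
        self._save_user = save_user
        self._publish = publish

    def handle_register(self, msg: RegisterUserMessage) -> uuid.UUID:
        # Start a new workflow instance and ask the user to validate.
        workflow_id = uuid.uuid4()
        self._pending[workflow_id] = msg
        self._send_validation_email(msg.email, workflow_id)
        return workflow_id

    def handle_validated(self, msg: UserValidatedMessage) -> None:
        pending = self._pending.pop(msg.workflow_id, None)
        if pending is None:
            return  # unknown id: a cheap lookup, no DB write, no publish
        self._save_user(pending.username, pending.email, pending.hashed_password)
        self._publish(UsernameInUseMessage(pending.username, pending.hashed_password))


emails, saved, published = [], [], []
workflows = RegistrationWorkflows(
    lambda email, wid: emails.append((email, wid)),
    lambda *user: saved.append(user),
    published.append,
)
wid = workflows.handle_register(RegisterUserMessage("alice", "a@example.com", b"h"))
workflows.handle_validated(UserValidatedMessage(uuid.uuid4()))  # ignored
workflows.handle_validated(UserValidatedMessage(wid))           # completes
```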
That UsernameInUseMessage would eventually arrive at all the web servers subscribed.
When looking deeper into this workflow we realize that it could be implemented as two separate message handlers, and have the email address take the place of the workflow Id. The problem with this alternate, better performing solution has to do with security. By removing the dependence on the workflow Id, we’ve in essence stated that we’re willing to receive a UserValidatedMessage without having previously received the RegisterUserMessage.
Since the processing of the UserValidatedMessage is relatively expensive (writing to the DB and publishing messages to all web servers), a malicious user could perform a denial of service (DoS) attack without that many messages, thus flying under the radar of many detection systems. Spoofing a GUID that would result in a valid workflow instance is much more difficult. Also, since workflow instances would probably be stored in some in-memory, replicated data grid, the relative cost of a lookup would be quite small - small enough to avoid a DoS until a detection system picked it up.
Improved Bandwidth & Latency
The bottom line is that you’re getting much more out of your web tier this way, rather than hammering your data tier and having to scale it out much sooner. Also, notice that there is much less network traffic this way. Not such a big deal for usernames and passwords, but other scenarios built in the same way may need more data. Of course, the time it takes us to log a user in is much shorter as well since we don’t have to cross back and forth from the web server (in the DMZ) to the app server, to the DB server.
The important thing to remember in this solution is doing pub/sub. NServiceBus merely provides a simple API for designing the system around pub/sub. And publishing is where you get the serious scalability. As you get more users, you’ll obviously need to get more web servers. The thing is that you probably won’t need more database servers just to handle logins. In this case, you also get lower latency per request since all work needed to be done can be done locally on the server that received the request.
ETags make it even better
For the more advanced crowd, I’ll wrap it up with the ETags. Since web servers do go down, and the cache will be cleared, what we can do is to write that cache to disk (probably in a background thread), and "tag" it with something that the server gave us along with the last UsernameInUseMessage we received. That way, when the web server comes back up, it can send that ETag along with its GetAllUsernamesMessage so that the app server will only send the changes that occurred since. This drives down network usage even more at the insignificant cost of some disk space on the web servers.
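One simple way to picture the server side of this is an append-only log where the "tag" is just the position of the last entry a web server has seen. The Python sketch below assumes exactly that; `UserChangeLog`, its methods, and the use of an integer as the tag are all illustrative choices rather than anything the text prescribes.

```python
from typing import List, Optional, Tuple

Entry = Tuple[str, bytes]  # (username, hashed_password)


class UserChangeLog:
    """Append-only log on the app server; the tag is simply the log position."""
    def __init__(self) -> None:
        self._entries: List[Entry] = []

    def append(self, username: str, hashed_password: bytes) -> int:
        self._entries.append((username, hashed_password))
        return len(self._entries)  # tag to hand out with this update

    def changes_since(self, tag: Optional[int]) -> Tuple[List[Entry], int]:
        start = tag or 0
        if not 0 <= start <= len(self._entries):
            start = 0  # unrecognised tag: fall back to sending everything
        return self._entries[start:], len(self._entries)


log = UserChangeLog()
log.append("alice", b"a")
tag = log.append("bob", b"b")   # the web server persists this tag, then goes down
log.append("carol", b"c")       # registered while the web server was offline
delta, new_tag = log.changes_since(tag)
```

A web server restarting with a saved tag gets only the entries added since; one with no tag, or one the service does not recognise, gets the full list.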
And in closing…
Even if you don’t have anything more than a single physical server today, and it acts as your web server and database server, this solution won’t slow things down. If anything, it’ll speed it up. Regardless, you’re much better prepared to scale out than before - no need to rip and replace your entire architecture just as you get 8 million Facebook users banging down your front door.
So, go check out NServiceBus and get the most out of your iron.