The Road to Our Scripting Future

Lightweight languages are primed to make huge inroads into the corporate market.


February 02, 2007
URL:http://www.drdobbs.com/web-development/the-road-to-our-scripting-future/197002917

Peter is Founder and CTO of ActiveGrid, and formerly CTO of Sun Microsystems' Application Server Division. He can be contacted at www.activegrid.com.


To execute, applications must be presented to the computer as binary-coded machine instructions specific to a given CPU model or family. However, programmers have a number of language options for generating those machine instructions. Perhaps most relevant here is the degree of abstraction a language provides. More abstraction means fewer operations for developers to direct.

Machine languages are the native languages of computers, the only languages directly understood by CPUs. With the exception of programmable microcode, machine languages are the lowest level of programming language. As such, they offer no abstraction whatsoever. Consisting entirely of numbers, machine languages are rarely used to write programs because developers must manually code—in numerical code—each and every instruction associated with the application's business logic, as well as its underlying machinery: sockets, registers, memory addresses, and call stacks.

Considering the labor associated with machine languages, developers desiring complete control over all aspects of application performance normally use assembly language instead. Assembly languages encode the same instruction sets as machine languages, but add a thin layer of abstraction by presenting instructions as mnemonic names. These mnemonics make programs easier to write; assemblers then translate them into machine language.

Midlevel programming languages provide the next level of abstraction, while letting programmers maintain a high degree of overall control. Typified by C, midlevel languages provide low-level access to memory and require you to explicitly code much of the application's underlying services. Yet these languages can also relieve you of other duties, such as coding functions, variables, and expression evaluation.

Perhaps the most significant advantage offered by most midlevel languages is portability, which enables machine-independent coding. Unlike high-level languages, though, the portability enabled by midlevel languages is not based on a virtual machine or a common machine-independent environment. Rather, the application is compiled for different computer platforms and operating systems with minimal change to its source code.

High-level programming languages allow an even greater degree of abstraction, so you can more fully focus attention on the application's business logic instead of the services required to support the CPU. High-level languages often handle thread management, garbage collection, APIs, and other services natively. Java, for example, relies on a virtual machine that abstracts all operating system functions to provide its famous "write once, run anywhere" capability. Other high-level languages include a variety of interpreted and compiled languages such as Basic, C++, C#, Cobol, Perl, PHP, and Python.

Finally, natural languages deserve mention. Simply put, natural languages overwhelm the human/machine interface. Huge, continually expanding vocabularies with shifting meanings, along with byzantine grammar that is inconsistently employed, render natural languages unsuitable for computers.

High-level languages simplify complex programming while low-level languages tend to produce more efficient code. Using high-level languages, you can break up a complex application into smaller components, although the trade-off for convenience is most often code efficiency. Consequently, when applications must meet certain performance standards, developers may forego the ease of coding in high-level languages and opt for lower level languages.

Computing Architecture Continuum

From the mainframe to grid computing, each computing architecture has developed in response to the demands organizations place on their IT departments. Similar to the range of programming language options, each computing architecture in Figure 1 presents a unique environment.

Figure 1: Computing architecture continuum.

Mainframe computers use a host/terminal architecture, whereby all of the application processing executes on the mainframe host. Multiple users can simultaneously access the mainframe via local or remote "dumb" terminals (or terminal emulation software), which simply display queries and results. Introduced in the 1950s, mainframes remain popular in large organizations needing extreme reliability, availability, and serviceability.

The mainframe is ideal for mission-critical applications that process bulk data such as credit-card processing, bank account management, market trading, and ERP. Applications that require high security are another mainframe strength. Today's leading mainframe vendors include IBM, Hewlett-Packard, and Unisys.

Minicomputers employ the same host/terminal architecture as mainframes but typically serve a smaller user population. Launched in 1959, the minicomputer era was ushered in by Digital Equipment Corporation and the introduction of its PDP-1. Selling for an amazingly low $120,000, the PDP-1 extended the reach of computing to a broader audience.

Over time, minicomputers basically morphed into midrange systems and servers, but their function remains the same—processing applications for multiple users. In small and midsize businesses, midrange systems usually run general business applications. Large enterprises generally use them for department-level operations. Vendors include IBM, Hewlett-Packard, and Sun Microsystems.

Moving away from the hosting model, client/server architecture splits the application-processing load between one or more servers and the user's client computer. Client/server encourages IT departments to select the appropriate hardware and software platforms for client and server functions. For example, database management system servers frequently run on platforms specially designed and configured to perform queries, while file servers usually run on platforms with special elements for managing files.

Client/server was a response to monolithic, isolated applications running on minicomputers and mainframes. Seeking integrated, responsive, and comprehensive applications, companies turned to client/server architecture to support the complete range of their business processes—from call centers to CRM and beyond. Leading client/server vendors include Oracle and PowerSoft.

The Internet computing model defined by the World Wide Web introduced a new twist to client/server's distributed processing model. Whereas client/server relied on dedicated client-side software to run applications, Internet computing relies on one client application—the web browser—to present the GUIs of countless applications while back-end servers process the bulk of the application.

The shift from the client/server "fat client" to the Internet "thin client" brings huge benefits. Software upgrades are made solely at the server and no longer include a client-side component that has to be distributed to the user base. Meanwhile, applications both inside and outside the firewall give authorized users ready access to any web-enabled application, from company newsletters and HR benefits to e-commerce and financial services. Leading Internet vendors include Sun and BEA.

Grid computing architecture is emerging to let companies flatten the dominant three-tier Internet architecture. Today, the back end of a standard web application is processed by a low-end web server, a high-end application server, and a high-end database server or some other data store. Grids can meld the web server and application server tiers into a single tier of parallel, commodity servers running Linux.

A web application deployed on a grid architecture offers significant dollar savings over the same application running on three tiers. In different hardware configurations, grid computing is being used successfully on a variety of applications, from modeling financial markets and simulating earthquakes to serving millions of web pages per day. In the grid arena, the leading technologies are Linux and x86-based computers.

Programming Language Progression

As computing platforms shift (see Figure 2), languages of choice shift as well. While Cobol dominated the mainframe and minicomputer eras, the client/server era may have presented developers with the most language options. Developers could choose from a number of popular languages, including Microsoft's Visual Basic, Borland's Delphi, PowerSoft's PowerScript, and others. These languages were all essentially somewhat-typed, pseudo-interpreted languages. And they were all replaced with Java, a strongly typed, pseudo-interpreted language, and Visual Basic .NET, a somewhat-typed, pseudo-interpreted language.

Figure 2: Programming language progression.

During the Internet era, organizations ran a variety of server operating systems in the middle tier, including Solaris, AIX, HP-UX, Irix, and Windows NT. In many companies, a strong requirement was that applications be portable across two or more platforms to avoid being locked into a single vendor. If an organization's applications only ran on one platform, the organization lost much of its bargaining leverage and the vendor could price gouge in the next upgrade cycle.

Java was originally designed to run on set-top clients and then on PC clients, so the language and its runtime were designed to be portable and undoubtedly met this goal. Using servers from NetDynamics and KIVA, some companies had already started running Java on the server. In addition, Java offered some of the benefits developers enjoyed in languages from their client/server days, such as garbage collection and higher level APIs to operating system features that abstracted complexity.

Java soon gained a critical mass of vendors who supported the platform. Everything under the sun soon had a Java API, including Oracle, SAP, Tibco, CICS, MQSeries, and so on. Over a couple of years, these applications and services were all accessible via standardized APIs that grew into J2EE, which went on to dominate the corporate computing environment of the Internet era.

What Java failed to provide was 4GL-type tools. However, no other language had 4GL-type tools for web applications, so their absence was no surprise. Unfortunately, years have passed, and the vast majority of J2EE applications are still built by hand. A lesson that Microsoft has learned is that for APIs to be toolable, they need to be developed concurrently with the tool. Moreover, both APIs and tools should depend on easily externalized metadata. Java APIs were always written on the merits of the APIs themselves, and subsequent tools were predominantly code generators shunned by programmers.

The Java APIs grew into a morass of inconsistent and incomprehensible APIs. Programming even the simplest application proved to be complicated. According to Gartner, more than 80 percent of J2EE deployments are servlet/JSP-to-JDBC applications. That is, the vast majority of these applications are basically HTML front-ends to relational databases. Ironically, much of what makes Java complicated is the myriad of band-aid extensions, such as generics and JSP templates, which were added to simplify development of such basic applications.

Despite these issues, Java and J2EE have come to completely dominate the Internet era of corporate computing. These technologies will remain dominant until companies begin their migration to next-generation grid architectures and their related languages.

The Rise of Grid

Grid computing takes advantage of networked computers, creating a virtual environment that distributes application processing across a parallel infrastructure. Grids can employ a number of computational models to achieve their goal of high throughput.

Heterogeneous grid computing relies on a mix of different, geographically distributed computers to solve massive computational problems such as simulating earthquakes. Mainframes in California and Massachusetts may work with clusters of midrange systems in China and thousands of PCs across Europe to solve a single problem.

The drive to heterogeneous grid computing arose from sheer frustration. With limited access to scarce, expensive resources such as supercomputers, users recognized that compute-intensive problems could be broken up and distributed across multiple, lower cost machines that were readily available. Typically, the resulting calculations could be delivered faster and more cost effectively.

With the advantages of grid computing, the appearance of homogeneous grids simply reflects the fact that clusters of low-cost, homogenous PCs running Linux can be a genuine alternative to higher priced computer architectures. Numerous Wall Street firms now run complex financial simulations such as Monte Carlo calculations on large clusters of Linux machines.
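The divide-and-aggregate pattern those simulations use can be sketched in miniature in Python. This is a toy Monte Carlo estimate of pi rather than any firm's actual financial workload; the function names, batch sizes, and the process-pool stand-in for a cluster are all illustrative:

```python
import random


def sample_batch(args):
    """Count how many of n pseudorandom points land inside the unit quarter-circle."""
    seed, n = args
    rng = random.Random(seed)
    hits = 0
    for _ in range(n):
        x, y = rng.random(), rng.random()
        if x * x + y * y <= 1.0:
            hits += 1
    return hits


def estimate_pi(workers=4, samples_per_worker=100_000, map_fn=map):
    """Divide-and-aggregate: split the sampling into independent batches,
    run them through whatever map is supplied, and combine the results."""
    tasks = [(seed, samples_per_worker) for seed in range(workers)]
    hits = sum(map_fn(sample_batch, tasks))
    return 4.0 * hits / (workers * samples_per_worker)


if __name__ == "__main__":
    # Locally, a process pool stands in for the cluster; on a grid,
    # each batch would be dispatched to a separate machine instead.
    from multiprocessing import Pool
    with Pool(4) as pool:
        print(estimate_pi(map_fn=pool.map))
```

Because the batches share no state, adding machines (or processes) scales the work linearly, which is exactly why such simulations moved so readily onto Linux clusters.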

On the Web, the massive throughput offered by grid computing takes on a new meaning. Rather than focus on solving a single problem—sequencing the human genome, for example—a grid can focus on executing a single task, such as serving web pages. Web portals such as Google, Yahoo, and Amazon all have demonstrated the efficacy of running thousands of commodity Linux machines as web servers.

Clearly, the grid architecture works for many well-known Internet companies. Now, users are starting to move transactional applications onto the grid architecture. The all-or-nothing nature of transactional applications can make moving to commodity grid computing a delicate matter for companies that are used to running these applications on high-end architectures that are perceived as more robust and reliable. On the other hand, the grid advantages can prove to be an irresistible lure.

Grid Languages

Regardless of when transactional applications ultimately wind up on a grid, IT is already engaging in a subtle paradigm shift, moving away from larger SMP boxes running proprietary flavors of UNIX, and moving toward large grids of one- to two-processor x86 machines running Linux. These machines already dominate the front-tier web server market. Now, they are starting to appear on the back end with products like Oracle RAC, the grid-enabled version of Oracle. The transition to grid will soon affect the middle tier, but it is held back by J2EE implementations. These applications were built to run on small clusters of multiprocessor machines rather than large clusters of uniprocessor machines.

Unlike earlier architectures, grid has no pressing requirement for portability. Companies are no longer locked in by a vendor when they run Linux on x86 white boxes. Consequently, they have no problem with applications that only run on Linux/x86. The footnote to this portability rule concerns corporations that require applications be developed on Windows-based machines. For these companies, the only portability requirement is the ability to develop on Windows and deploy on Linux.

Basically, today's corporate applications all produce text, whether HTML for web browsers or XML for other applications. With the onslaught of web services, all back-end resources will soon be providing XML rather than binary data. The average corporate application will be a big text pump, taking in XML from the back end, transforming it somewhat, and producing either HTML or XML.
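The "text pump" pattern can be sketched in a few lines of Python. The order feed, element names, and helper function below are invented for illustration; in practice the XML would arrive from a back-end web service rather than a literal string:

```python
import xml.etree.ElementTree as ET

# Hypothetical back-end XML feed standing in for a web-service response.
ORDERS_XML = """
<orders>
  <order id="1001"><customer>Acme</customer><total>250.00</total></order>
  <order id="1002"><customer>Globex</customer><total>99.50</total></order>
</orders>
"""


def orders_to_html(xml_text):
    """Take XML in from the back end, transform it somewhat, pump HTML out."""
    root = ET.fromstring(xml_text)
    rows = []
    for order in root.findall("order"):
        rows.append("<tr><td>%s</td><td>%s</td><td>%s</td></tr>" % (
            order.get("id"),
            order.findtext("customer"),
            order.findtext("total"),
        ))
    return "<table>\n%s\n</table>" % "\n".join(rows)


print(orders_to_html(ORDERS_XML))
```

Note that nothing here depends on strong typing or binary interfaces; the whole job is parsing, reshaping, and emitting text.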

With this in mind, clear requirements emerge for a programming language best suited to support corporate applications in a grid environment. Considering these requirements, Java does not fare well.

For applications deployed on grid architectures, Java does not suffice. What developers need is a scripting language that is loosely typed to facilitate XML encapsulation and that can efficiently process text. The language should be very well suited for specifying control flow: It should be a thin veneer over the operating system.

Most Linux distributions already bundle three such languages—PHP, Python, and Perl. PHP is by far the most popular. Python is considered the most elegant, if a bit idiosyncratic. Perl is the tried-and-true workhorse. All three languages are open source and free. As Figure 3 illustrates, PHP use has skyrocketed over the past few years.

Figure 3: Distribution of languages for web pages (Source: Google, June 2006).

Grid Concerns

Like the computing architectures and languages that came before it, grid comes with its own set of challenges and trade-offs. For example, there are various additional semantics and failure modes associated with grid's asynchronous programming model, especially in large-scale distributed applications.

Perhaps the biggest difference is that the software needs to expect that machines will fail, and fail regularly. This means redundancy must be built into the software layer. When invoking logic, programmers should not be thinking about calling a specific machine, which is the traditional synchronous RPC model. Instead, programmers should think about invoking a service. At runtime, that service could in fact be running on the same machine or on different machines.
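The service-rather-than-machine idea can be sketched as a toy registry in Python. `ServiceRegistry`, the handlers, and the simulated dead node are all hypothetical, not an API from any real grid product:

```python
import random


class ServiceRegistry:
    """Toy registry that maps a service name to interchangeable replicas.
    In a real grid, this lookup would go through a directory or load balancer."""

    def __init__(self):
        self._replicas = {}

    def register(self, name, handler):
        self._replicas.setdefault(name, []).append(handler)

    def invoke(self, name, *args):
        # Callers name a service, not a machine: try replicas until one answers.
        replicas = list(self._replicas[name])
        random.shuffle(replicas)
        last_error = None
        for handler in replicas:
            try:
                return handler(*args)
            except ConnectionError as err:  # stands in for a dead node
                last_error = err
        raise RuntimeError("all replicas for %r failed" % name) from last_error


registry = ServiceRegistry()
registry.register("price", lambda sku: {"WIDGET": 9.99}[sku])


def flaky_replica(sku):
    raise ConnectionError("node down")


registry.register("price", flaky_replica)

# The call succeeds even if the flaky replica happens to be tried first.
print(registry.invoke("price", "WIDGET"))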

The biggest conceptual shift programmers need to make is that applications must evolve into sets of services so they can be spread across a grid. An application built around a main event loop running sequential logic can be deployed to a grid, but that model scales vertically on hardware, not horizontally.

Clearly, the massively scalable web sites in use today represent the first large-scale use of this type of architecture. Web interactions are inherently atomic. For instance, user A's shopping cart has nothing to do with user B's shopping cart. User A's credit card can be processed independently of user B's credit card. In an e-commerce site, the only resource these users really have to share is the inventory system and an external shipper's tracking service.

Grid has evolved from numerical computing—where things like airbag simulations could be split up among numerous machines—to serving multitudes of web users. The next level up is servicing requests on shared data; for example, web search and social networks. Google has it a bit easier because search results need not be exactly the same from one query to the next, so indexes can be updated gradually across clusters. The problem is fuzzier, and users don't notice.

Social networks are a bit different, and they have had a lot of problems scaling. Essentially, the entire object graph of social relationships has to be accessible in real time by all of the independent web users. Friendster (www.friendster.com/) solved this problem by using PHP to service the web requests and using a back-end service that had the object graph in memory. In this hybrid model, the social network construct is essentially considered a back-end service, like an inventory system.

A final concern revolves around maintenance. These systems are incredibly hard to debug, and while many homegrown tools exist, debugging grids remains an emerging area. From a monitoring, administration, and analysis perspective, however, all of the major systems-management vendors have had solutions for managing large clusters of commodity machines for years. These tools are still improving, and there are plenty of choices.

The Scripting Future

PHP, Python, and Perl are still somewhat immature in terms of their enterprise libraries, and their web services capabilities are nascent. Regardless, they have the necessary ingredients to meet the requirements of the next corporate computing phase of "text pump" applications.

In addition to being free and open source, these languages are easy to learn and use. PHP, Python, and Perl are primed to follow the trail blazed by Linux and Apache and make huge inroads into the corporate market. The latest version of PHP is virtually indistinguishable from Java, to the point of almost identical syntax and keywords.

Outside of the open-source arena, Microsoft has created Xen, previously named "X#" (http://research.microsoft.com/~emeijer/Papers/XML2003/xml2003.html), an XML-native language for its Common Language Runtime (CLR). Visual Basic is arguably the most popular scripting language in the world, and Windows is well tuned for one- to two-processor machines. As long as Microsoft remains in the picture, developers will most likely be able to choose among .NET, Java, and PHP/Python/Perl. However, when the application is on a grid architecture, the open-source scripting languages will rule.
