Dr. Dobb's | Inline Redux | November 01, 2003

Inline Redux

When is inlining performed? And is it possible to write a function that is guaranteed to never be inlined? This month, Herb considers the many and varied opportunities for inlining, including many that are likely to surprise you.

November 01, 2003
URL:http://www.drdobbs.com/inline-redux/184403879

Inline Redux

Quick! How would you answer the following questions?

When is inlining performed?

at coding time

at compile time

at link time

at application install time

at runtime

at some other time

For extra marks: What kinds of functions are guaranteed to never be inlined?

Please stop now and think about these questions before reading on.

* * * * *

Which answer did you pick? If you picked A or B, you're not alone. Those are the most common answers, and you might have thought of Item 12 of [1] where I gave some detailed discussion about the C++ inline keyword. If that were all there was to it, we could stop right here, declare a spontaneous editorial holiday, and take the rest of this column off. But we won't do that this time, because it turns out that there is more to say. Much more.

The reason I'm writing this article is to show why the most accurate answer to the primary question is "any or all of the above," and the general answer to the for-extra-marks question is "none." Wonder why? Read on.

Brief Recap: Inlining

If you've already read Item 12 of [1], the next few sections are review and you can safely skim them while skipping ahead to the discussion of Answer C.

In short, "inlining" means replacing a function call with an "in-place" expansion of the function's body. For example, consider the following code:

//  Example 1
//
double Square( double x ) { return x * x; }

int main() {
  double d = Square( 3.14159 * 2.71828 );
}

The idea of inlining the function call is to treat it (conceptually) as though it were written instead as something like this:

int main() {
  const double __temp = 3.14159 * 2.71828;
  double d = __temp * __temp;
}

This inlining eliminates the overhead of performing the function call, namely of pushing the parameter onto the stack and then having the CPU jump elsewhere in memory to execute the function's code, thus losing some locality of reference. This inlining is also not the same thing as treating Square as a macro, because an inlined function call is still a function call and its arguments are only evaluated once. With a macro they could be evaluated multiple times, such as in a macro like #define SquareMacro(x) ((x)*(x)), where the call SquareMacro( 3.14159 * 2.71828 ) would expand to 3.14159 * 2.71828 * 3.14159 * 2.71828 (that's multiplying ð by e twice, not once).

Incidentally, did you notice that this Example 1 illustrates inlining but does not use the inline keyword? That's intentional. We'll come back to this interesting twist several times.

Answer A: At Coding Time

At coding time, developers can incant the inline keyword in their programs. That's not really "performing" inlining in the sense of actually moving code around to eliminate a function call, but it is an attempt to choose the appropriate places for inlining to take place, so we'll consider that as the earliest opportunity to make decisions about inlining. [2]

When you're tempted to write inline in your code, there are three important things to remember.

By default, don't do it. Premature optimization is evil, and you shouldn't be tempted to write inline until after profiling demonstrates the need in specific cases. See Item 12 of [1], and see also [3], for further harangues and dire warnings about premature optimization in general and premature inlining in particular.
It only means "pretty please." As described at length in Item 12 of [1], the inline keyword is merely a hint to the compiler, a special hook in language to let you (try to) sweet-talk the compiler. (See below for the drawbacks of sweet talk.) And that's all it's good for: The inline keyword is not required to have any semantic effect whatsoever in a C++ program. It doesn't affect other parts of the standard language, in that writing inline on a function does not change how you use the function (for example, you can still take the function's address), and it is not possible for standards-conforming C++ code to programmatically detect whether a given function was declared as inline or not.
3. It's often at the wrong level of granularity. We write inline on a function, but when inlining is performed it actually is done to a function call. This distinction is important because the very same function can (and often should) be inlined at one call site but not at another; writing inline does not give you any way to express that fact. Because we can only write inline on a function itself, when we do so we are implicitly saying that we think we know that it is appropriate to inline this function at all possible call sites. That kind of prescience is rarely accurate. We often colloquially speak of "inlining a function," but to be accurate it would be better to change our vocabulary to consistently talk about "inlining a function call."

Answer B: At Compile Time

At compile time, compilers routinely perform precisely the kind of inlining described in Example 1.

What does the compiler do when we try to sweet-talk it by declaring certain functions to be inline? It depends. Not all compilers respond well to sweet talk, even when it's accompanied by chocolate and flowers. Your compiler (or other tools, as we will see) may ignore you, in three interesting ways:

By refusing to inline calls to functions that you declared inline.
By inlining calls to functions you didn't declare inline.
By inlining some calls, but not others, to the same function (whether or not that function was declared inline).

Note again that Example 1 doesn't say inline anywhere. This was deliberate, because I wanted to illustrate that inlining can still happen. Indeed, don't be surprised if the compiler you're using today will in fact inline the function call in Example 1. Because you can't write a conforming program that can tell the difference, this falls into the category of perfectly legitimate optimizations that a compiler can (and often should) perform on your behalf.

Modern compilers are usually better than programmers are at deciding which function calls to inline, including whether to perform inline expansion of the same function at some call sites but not at others. Why? The simplest reason is that the compiler has more context, because it knows the "real" structure of the call site--the machine code actually generated for the call site after other optimizations, such as loop unrolling and dead branch elimination, have already been performed. For example, the compiler may be able to detect that inlining a function call in a certain inner loop would make the loop too large to fit into cache which would slow down performance, and elect not to inline that call site while still inlining other calls to the same function.

Answer C: At Link Time

Now we start to get into the more interesting, and more modern, aspects of inlining.

Question: Can a function be inlined at link time? Answer: Yes. This gets to the heart of the for-extra-marks question posed at the outset: What kinds of functions are guaranteed to never be inlined? The reason I posed this extra question is because there's a common belief that certain kinds of functions can't be inlined. In particular, functions whose definitions are not put into header files, but put into separate modules, are commonly thought to be uninlineable.

So let's try to make inlining as hard as we can. Consider this slight variation to Example 1:

//  Example 2: Put the function in a different module,
// and make the function's source definition unavailable
//

//--- file main.cpp ---
//
double Square( double x );

int main() {
  double d = Square( 3.14159 * 2.71828 );
}

//--- file square.obj (or .o) ---
//
// contains the compiled definition for:
// double Square( double x ) { return x * x; }

The idea here is that the implementation of Square has been moved out of the main.cpp translation unit. In fact, more than that: Square is not even available in source form, only in object form. "Certainly this call to Square is guaranteed to never be inlined!" some may be tempted to exclaim. As far as the compiler goes, they would be correct. While compiling main.cpp, not even an inordinately precocious compiler could not possibly peek at the definition of Square while compiling main.cpp.

A precocious linker, however, could, and some popular commercial implementations do. Several compiler products have supported such cross-module inlining; one product that supports it today is Microsoft Visual C++ version 7.0 (aka ".NET") and higher, using the /LTCG switch which stands for "link time code generation." One real advantage of doing such late inlining is, again, that the tool knows more of the actual context of each call site, and can make smarter choices about where and when it's worth inlining a given call.

But wait, there's more: Did you notice that there's nothing in the above that says Square is even written in C++? This illustrates a second advantage of post-compilation inlining: It is language-neutral. Square could just as easily have been written in Fortran or Pascal. Sweet. All the linker has to do is be aware of the parameter-passing convention, and then remove the parameter-pushing and -popping code along with the program jump instruction.

But wait (again), there's more—much more, because as it turns out that even now we're still just getting started.

Answer D: At Application Install Time

Fast-forward now to the joyous day when we've compiled and linked our application, bundled it all happily into a tarball or setup, and proudly shipped the shrink-wrapped CD to our first customer. Surely now it's time to: a) break open the morale fund; b) have a ship party; and c) declare that all opportunities for inlining are past. Right?

Yes, yes, and no, respectively.

Particularly since the mid-1990s, increasing numbers of shipping applications are targeted for managed runtime environments. That is, instead of being compiled to machine code that's specific to a particular chip and to API calls that are specific to a particular operating system, they are compiled to a bytecode stream that will be interpreted or compiled by a runtime environment on the user's machine that abstracts away some or all CPU and OS facilities. Common examples include, but are not limited to, a Java Virtual Machine (JVM) and the .NET Common Language Runtime (CLR) (and its cousins, such as Mono and Rotor, that also implement the ISO Common Language Infrastructure (CLI) standard which specifies a subset of the CLR). When targeting these environments, the compiler translates the original C++ source code into said bytecode stream that represents a program written as opcodes in the runtime environment's instruction set.

Aside: Some of these environments have an instruction set so rich that high-level object-oriented concepts, including classes and inheritance and virtual functions, have direct first-class support. A compiler that is targeting such a platform can (and many do) translate the source program into the instruction language on a straight class-for-class, function-for-function basis, possibly after applying some optimizations of their own including performing some function call inlining early at compile time. If the compiler does this, then a C++ function in the source code can be more or less directly represented by a function having the same signature, in the target instruction set. Of course, a compiler doesn't have to mirror things that closely, and even if it doesn't necessarily do it that way the following inlining notes will still apply.

What does this have to do with inlining at application installation time? Even in managed environments, eventually the CPU on the user's machine has to be fed instructions in its native instruction set. Therefore the managed environment is responsible for translating the instruction language into native machine code that the local CPU can understand. This is often done when the application is first installed, and at this point, just as in any other compilation-like process, further optimizations can be (and frequently are) applied. In particular, the .NET CLR performs some inlining at application installation time when some or all of the installed IL is translated into native instructions ready for execution.

Again, it's worth noting that some of these managed environments are language-neutral, so that the optimizations (including inlining) being performed at application install time can be applied across language boundaries. Don't be too surprised if your C# program makes a call to a C++ function, and the call ends up being inlined.

So, when is it too late to perform inlining? Never say never, because even now the story is not quite over…

Answer E: At Runtime

All right, enough is enough: Surely by the time we hit runtime, we must be well beyond opportunities for code optimizations. Right?

It might seem impossible that inlining can still be performed at runtime, but in fact there are several ways it can be done. In particular, I want to mention profile-directed optimization and guarded inlining. Like a managed environment (see previous and next sections), this requires some tool support to exist on the user's machine at runtime.

The idea behind profile-directed optimization is that when the application is actually run, instrumentation hooks inserted into the executing program can gather data about how the program is actually being used, in particular what functions are being called heavily and under what conditions (e.g., the size of the working set compared to the total cache memory when the function is called). The data gathered from these instrumented hooks can be used to modify the executable image so that selected function calls can be inlined to tune the application to its target environment based on actual runtime performance measurements.

Guarded inlining is another example of how aggressive the runtime inline optimizations can be. In particular, [4] and [5] document the Jikes Research Virtual Machine (RVM), née the Jalapeño dynamic optimizing compiler, for JVM targets. Among other things, this compiler is able to inline virtual member functions by assuming that the receiver of the virtual call will be of a given declared type (in order to avoid the cost not only of the function call but of the extra expense of virtual dispatch). Now, compilers can already routinely nonvirtualize (and therefore also optionally inline) certain virtual function calls today, if the type of the target is statically known. What's new here is that the Jikes/Jalapeño environment can nonvirtualize and inline calls to virtual functions even if the static type of the target is not known. Because the guess might not be right, it inserts a guard that performs a runtime check that validates that the target object's type is what was expected; if it's not, it falls back to a normal virtual function call.

Answer F: At Some Other Time

Finally, I'll add one "other time" example I can think of, which is similar to some of the others but distinct enough that I'll give it its own section.

Recall that in Answer D we considered inlining that happens when installing applications on certain managed runtime environments, such as a JVM or .NET CLR environment. Of course, Astute Readers will already have noticed that earlier I only mentioned translation from bytecode to native machine instructions at application installation time, but there's another and more common time when that translation takes place, namely at JIT time, where JIT refers to "just-in-time" compilation.

The idea behind JITting is to compile functions "just in time," just before they're about to be used. This has the advantage of amortizing the cost of compiling the program down to native code, because of instead of one big compilation step you get lots of little compilations for individual parts of the code, just as they're about to be used. It has the corresponding disadvantages of potentially making the first runs of a program somewhat slower, and of reducing the quality of optimization because the JIT has to be fast and can't afford to spend a lot of time analyzing inlining and other optimization opportunities. A JITter can still perform optimizations like inlining, but we can generally get better results by doing the same work earlier, say at application installation time (see Answer D above) when we're not so time-sensitive and the compiler can spend the cycles to do more analysis.

Summary

Like all optimizations, inlining is frequently better when performed by tools that are aware of the generated code and/or the execution environment rather than by the programmer. The later it is performed, the more specific and targeted it can be.

We talk about inlining functions, but it's more correct to say that we perform inlining on function calls. After all, the same function may be inlined in one place but not in others. And, because of the many opportunities that exist for inlining even well after initial compilation has finished, the same function can be inlined, not only in different places, but by different tools in each place.

There's more to inlining than the inline keyword alone.

References

[1] H. Sutter. More Exceptional C++ (Addison-Wesley, 2002).
[2] Arguably, another interpretation of "at coding time" is that some developers literally inline at coding time by physically moving blocks of source code around. That's not the usual meaning of "inlining" and so I'll ignore it.
[3] http://www.google.com/search?q=site:www%2Egotw%2Eca+premature (various references to premature optimization in articles that I've written).
[4] M. Arnold et al. "Adaptive Optimization in the Jalapeño JVM" (Proceedings of the conference on Object-oriented programming, systems, languages, and applications, ACM Press, 2000).
[5] Jikes RVM home page.