EXPERIMENTAL DATA

This section reports our optimization's effect on code size and execution time. We applied our optimization to the C++ programs listed in Table 1. The original and optimized programs were each complied on two machines: a DECStation 5000 with a MIPS R3000 33Mhz processor running Ultrix 4.2A (the DEC machine), and a Gateway 2000 with a Pentium 90Mhz processor running Linux 1.0.9 (the Intel machine). Each program was compiled with GNU's C++ compiler (version 2.3.1 on the DEC machine and version 2.5.8 on the Intel machine) using the -O3 optimization level², which does (limited) interprocedural optimization, (limited) inlining, and substantial intraprocedural optimization.

Table 1: Descriptions of Programs Used for Data Collection

Program	Description
city simulation	A toy example that simulates cars, trucks, roads, and stop lights.
employee database	An example with different kinds of workers (i.e., hourly, salary, by-piece, ...)
geqn	The equation formatter from the Groff text formatting package.

The standard Unix time utility was used to collect the number of CPU seconds used by each program. Each run was done when the machine was unloaded, so it received an average of 99.9% of the CPU time. As a result, elapse time and CPU time were virtually identical.

The choice of execution time--a direct measure of performance--rather than an indirect measure of performance such as instruction count is intentional.³Pipeline effects, cache effects, and the like impact the average number of cycles-per-instruction, which, in turn, impact the computed execution time [7]. This occurs in our last example where the computed execution times before and after our optimization are nearly identical while the observed execution times are not.

Each program was run thirty times at random physical memory locations to account for cache effects (see [2] for a description of why this is necessary and the pitfalls associated with not accounting for memory location induced cache effects when timing programs on modern architectures). We report in Table 2 the average times for the thirty runs including the margin of error and the object code size before and after optimization. Note that the percentage growth in the resulting binary files (which include library code) was considerably less: the first two programs showed essentially zero growth, while geqn showed about 50% growth. We now discuss each program in more detail.

	Size			Time
Program	Unopt	Optimized	Change	Unoptimized	Optimized	Change
Intel:
city simulation	12618	12762	1.14%	9.650 $\pm$ 0.067	9.235 $\pm$ 0.074	-4.29%
employee	7867	8715	10.78%	95.500 $\pm$ 2.706	90.978 $\pm$ 1.268	-4.74%
geqn	132331	264739	100.06%	51.265 $\pm$ 0.170	52.610 $\pm$ 0.156	2.62%
Dec:
city simulation	25348	24052	-5.11%	41.865 $\pm$ 0.874	41.621 $\pm$ 0.766	-0.58%
employee	21852	24056	10.09%	295.613 $\pm$ 11.606	282.953 $\pm$ 13.364	-4.28%
geqn	302868	586416	93.62%	101.884 $\pm$ 0.162	85.878 $\pm$ 0.248	-15.71%

Size represents object code size in bytes. Time represents CPU time in seconds.

The city simulation simulates traffic flow on the roads of a city. The program has only one virtual function and was chosen to illustrate the effects on programs that contain very few virtual functions. On the Intel machine, our optimization caused about a 4% decrease in execution time and a nominal 1% increase in object code size. On the DEC machine, the optimization made only a nominal difference in the execution time. However, it is interesting to note that it decreased the object code size. This provides indirect evidence that our optimization is enabling other optimizations.

The employee program, based on code from [5], has a high proportion of virtual functions and virtual function calls; it was chosen to provide expected results for programs that use virtual functions heavily. The program has a class hierarchy of eight classes and virtual functions that (1) print the employee's data (name, current wages, etc.), (2) report the employee's earnings for the current week, (3) give the employee a raise, and (4) reset the employee for a new week. Each type of worker implements these functions differently.

We created a test input file with 1000 employees (200 of each kind), and then iterated for 300 weeks, printing the data and the earnings for each employee each week and giving each employee a raise every ten weeks. For this example our results were very promising. On the Intel machine, we achieved an almost 4% decrease in execution time with only a 10% increase in code size. On the DEC machine, we achieved slightly over 4% decrease in execution time with a similar 10% increase in code size.

Finally, geqn is the equation formatter distributed with GNU's groff text formatting package. It was chosen for its size and to give results for a large ``real world'' program. The code contains 6400 source lines of which 400 are virtual function calls. Geqn was written with a single class (called box) from which its other seventeen classes are derived. Class box declares fourteen virtual functions (on average only 7 are overridden). This class hierarchy leads to very large inlined functions and consequently large growth in code size. Better object-oriented analysis and design could lead to better class organization, and this explosive code size increase should be averted.

We used a large (3.4 MB) input file to obtain our data. For the Intel machine, our optimization provided no help. In contrast, on the DEC machine, we obtained a significant improvement (15%). The explanation lies in the cache sizes on the two machines. The Pentium processor in the Intel machine has separate 2-way associative 8K code and data caches. In the first two test programs, the size of the optimized program was small enough that cache conflicts were minimal. With geqn, the large code size leads to excessive cache misses; thus, leading to an increase in execution time. The MIPS processor in the DEC machine has a larger 64K cache. Thus, the optimized version of geqn did not encounter a restrictive number of cache misses.

The experimental results shown in Table 2 represent a promising step towards the efficient compilation of C++ programs containing virtual functions. Since our optimization enables inlining, we expected a code size increase. This was not a problem except for the largest program running on the Intel machine, which has the smallest cache. While execution time reduction was not universal, the employee program with its high percentage of virtual function calls showed a consistent performance improvement. Improved compiler optimization, for example interprocedural constant propagation (of types), should improve these results.

Verbatim copying and distribution of this entire paper is permitted in any medium, provided this notice is preserved.