Dhrystone Benchmark on PowerPC 440 CPU

Performance Assessment using the Dhrystone v2.1 Benchmark on the Xilinx ML507 FPGA Platform
 

Table of Contents

Overview
Test Setup
Hardware Architecture
   Clocking Infrastructure
   Glossary
The Dhrystone Benchmark
IBM PowerPC Performance Libraries
Test Results
   Measurement Results
   Inter-/Extrapolated Results
Cheated Results
   Measurement Results
   Inter-/Extrapolated Results
Summary & Conclusions
References
 

Overview

This project describes the performance assessment of the IBM PowerPC 440 CPU using the Dhrystone version 2.1 benchmark. The tests were executed on the Xilinx Virtex-5 ML507 FPGA prototyping platform with an embedded PowerPC 440 CPU running at a clock frequency of 400 MHz. The main objectives of this project are summarized as follows:

Well, I'm aware of the saying 'fake, lie, benchmark'; therefore I seek to follow best practices by clearly specifying the overall setup, including the hardware system, the software and compiler versions, and the compilation flags employed. Moreover, I do not intend to tweak any system specifics in order to reach the highest scores, but rely on ordinary compiler flags such as '-O2' and '-O3'. The corresponding results are listed below; any interpretation is left to the reader.

Note that in general, benchmark numbers are meaningless without proper specification of compiler settings and benchmarking conditions [4].

Last but not least, keep in mind Dilbert:

"A misleading benchmark test can accomplish in minutes what years of good engineering can never do."
 

Test Setup

Hardware      IBM PowerPC 440 CPU running at 400 MHz
              CPM & PLB running at 133 MHz
              Peripherals: 32 kB BRAM, RS232 UART, interrupt controller, timer
Software      Dhrystone benchmark version 2.1 (Language: C) [5]
              Selected tests employ the IBM PowerPC Performance Libraries [6]
Compiler      powerpc-eabi-gcc (GCC) 4.1.1 20060524
Compilation   Program compiled without 'register' attribute
              Separate compilation of files dhry_1.c and dhry_2.c, as intended by the Dhrystone benchmark

Hardware Architecture

The hardware architecture is depicted below. Only essential peripheral blocks have been attached to the CPU.

Figure: Hardware architecture for the performance assessment of the embedded IBM PowerPC 440 CPU using the Dhrystone benchmark. The attached peripheral blocks are 32 kB of on-board memory, a timer, an interrupt controller, and an RS232 UART for serial communication.

Clocking Infrastructure

The clocking scheme of the system architecture has to adhere to a number of rules imposed by the interconnect and by device specifics. The applied clocking scheme and the most important clock ratios are listed below.

Clock              Frequency   Clock Ratio
CPU core clock     400 MHz     -
CPM clock          133 MHz     CPU:CPM 3:1
MPLB (PLB_v46_0)   133 MHz     CPU:MPLB 3:1, CPM:MPLB 1:1

Glossary

APU Auxiliary Processing Unit
BRAM Block Random Access Memory
CPM Clock and Power Management
DAC Digital-to-Analog Converter
DCR Device Control Register
DMA Direct Memory Access
FCB Fabric Coprocessor Bus
FCM Fabric Coprocessor Module
FPU Floating-Point Unit
GPIO General Purpose Input/Output
MCI Memory Controller Interface
MPLB Processor Local Bus Master
SPLB Processor Local Bus Slave
PLB Processor Local Bus
PPC PowerPC

The Dhrystone Benchmark

The Dhrystone benchmark is a synthetic benchmark program that performs a series of CPU-centric operations such as integer arithmetic, comparisons, logic operations, and string operations. Version 2.1 of the benchmark, released in 1988 and written in C, relies mostly on standard C library functions such as strcmp(), strcpy(), and memcpy(), and does not involve any multiply-accumulate or floating-point operations. The benchmark was mainly intended to characterize the integer performance of CPUs during the dawn of the Internet age in the 1980s and 1990s. The measured performance is reported in a benchmark-specific metric called Dhrystones per second, which can then be converted to Dhrystone MIPS (DMIPS).

What does 1 DMIPS stand for? The VAX-11/780 was selected as the 'reference 1 MIPS machine'. It scores 1757 Dhrystones per second in the Dhrystone benchmark and hence constitutes the reference for 1 DMIPS of compute performance. As the measurement results below show, the examined IBM PowerPC 440 CPU reaches up to 889 DMIPS, i.e., the equivalent performance of 889 VAX-11/780 machines. Did you get it? Well, even for me, the VAX-11/780 is an obscure, ancient computing device from Digital Equipment Corporation (DEC), introduced in October 1977. It is a kind of computer dinosaur I haven't had the pleasure of becoming acquainted with...

Figure: The DEC VAX-11/780 from Digital Equipment Corporation (DEC), introduced in 1977. It ran at 5 MHz and incorporated 32-bit addressing, 16 registers, a 2 kB cache, and 128 kB to 8 MB of ECC RAM. This system achieves 1757 Dhrystones per second in the Dhrystone benchmark and serves as the reference for the benchmark's metric: its performance is defined as 1 Dhrystone MIPS (DMIPS).
(Source: en.wikipedia.org / Digital Equipment Corporation)
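
The conversion from Dhrystones per second to DMIPS described above is a single division by the VAX-11/780 reference score. The following is a minimal Python sketch of that arithmetic; the function and constant names are illustrative and not part of the Dhrystone sources.

```python
# Dhrystones per second achieved by the VAX-11/780 reference machine (= 1 DMIPS).
VAX_11_780_DHRYSTONES_PER_SECOND = 1757


def to_dmips(dhrystones_per_second: float) -> float:
    """Convert a Dhrystones-per-second score to Dhrystone MIPS (DMIPS)."""
    return dhrystones_per_second / VAX_11_780_DHRYSTONES_PER_SECOND


# Highest separately compiled score reported in the measurement results.
print(f"{to_dmips(1_562_376):.1f} DMIPS")
```

For the score of 1'562'376 Dhrystones per second this yields 889.2 DMIPS, i.e., the "889 VAX-11/780 machines" mentioned in the text.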

IBM PowerPC Performance Libraries

Selected tests employ the compilation flag '-mppcperflib', which links in the IBM PowerPC Performance Libraries for optimized low-level integer and floating-point emulation and for optimized string handling routines [6]. According to Xilinx, the IBM PowerPC Performance Libraries may yield an average threefold speedup for applications that make heavy use of these routines.

Caution: The IBM PowerPC Performance Libraries are only intended for improving the execution of emulated floating-point arithmetic and hence cannot be used in conjunction with floating-point hardware, i.e., with an active '-mfpu' switch.

Test Results

The results in this section were obtained with separate compilation of the files dhry_1.c and dhry_2.c, as intended by the Dhrystone benchmark.
Any interpretation is left to the reader.

Measurement Results

IBM PowerPC 440 @ 400 MHz

Columns marked with (*) represent compiler settings violating the intended compilation rules of the synthetic benchmark.

Compiler optimization flags                     -O3 -mppcperflib¹º (*)   -O3¹¹ (*)   -O2 -mppcperflib²º   -O2²¹
Execution time for 100'000'000 iterations [s]   64.01                    75.75       98.8                 110.8
Dhrystones per second                           1'562'376                1'320'147   1'012'608            902'930
Dhrystone MIPS (DMIPS)                          889.2                    751.4       576.3                513.9
Time for one run through Dhrystone [us]         0.64                     0.76        0.99                 1.11
DMIPS/MHz                                       2.22                     1.88        1.44                 1.28

¹º CFLAGS := -g -Wall -Werror -std=c99 -O3 -mcpu=440 -mppcperflib -DTIME
¹¹ CFLAGS := -g -Wall -Werror -std=c99 -O3 -mcpu=440 -DTIME
²º CFLAGS := -g -Wall -Werror -std=c99 -O2 -mcpu=440 -mppcperflib -DTIME
²¹ CFLAGS := -g -Wall -Werror -std=c99 -O2 -mcpu=440 -DTIME
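
The rows of the measurement table all follow from two raw quantities, the iteration count and the execution time, plus the CPU clock. The sketch below shows how they relate; it is an illustration in Python, and the class and function names are assumptions rather than anything taken from the benchmark sources.

```python
from dataclasses import dataclass

VAX_DHRYSTONES_PER_SECOND = 1757  # VAX-11/780 reference score (= 1 DMIPS)


@dataclass
class DhrystoneMetrics:
    dhrystones_per_second: float
    microseconds_per_run: float
    dmips: float
    dmips_per_mhz: float


def derive_metrics(iterations: int, seconds: float, cpu_mhz: float) -> DhrystoneMetrics:
    """Derive the table rows from a raw timing measurement."""
    dps = iterations / seconds
    dmips = dps / VAX_DHRYSTONES_PER_SECOND
    return DhrystoneMetrics(
        dhrystones_per_second=dps,
        microseconds_per_run=seconds / iterations * 1e6,
        dmips=dmips,
        dmips_per_mhz=dmips / cpu_mhz,
    )


# -O3 -mppcperflib column: 100'000'000 iterations in 64.01 s at 400 MHz.
m = derive_metrics(100_000_000, 64.01, 400.0)
print(f"{m.dmips:.1f} DMIPS, {m.dmips_per_mhz:.2f} DMIPS/MHz, {m.microseconds_per_run:.2f} us/run")
```

Because the table lists the execution time rounded to two decimals, the Dhrystones-per-second value recomputed this way differs from the tabulated one in the last few digits; the DMIPS and DMIPS/MHz figures agree to the precision shown.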

Inter-/Extrapolated Results

Using 889 DMIPS @ 400 MHz as reference data:

CPU clock [MHz]          100       200       300         400         500         550
Dhrystones per second    390'594   781'188   1'171'782   1'562'376   1'952'970   2'148'267
Dhrystone MIPS (DMIPS)   222       445       667         889         1112        1223
DMIPS/MHz                2.22      2.22      2.22        2.22        2.22        2.22

Using 576 DMIPS @ 400 MHz as reference data:

CPU clock [MHz]          100       200       300       400         500         550
Dhrystones per second    253'152   506'304   759'456   1'012'608   1'265'760   1'392'336
Dhrystone MIPS (DMIPS)   144       288       432       576         720         793
DMIPS/MHz                1.44      1.44      1.44      1.44        1.44        1.44
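
The inter-/extrapolated values above are consistent with scaling the 400 MHz measurement linearly with the CPU clock, i.e., a constant DMIPS/MHz. A minimal Python sketch of that scaling follows; the function name is hypothetical, and the linear model is an assumption that only holds while the code runs from cache and the bus clock ratios do not become a bottleneck.

```python
def scale_with_clock(reference_score: float, reference_mhz: float, target_mhz: float) -> float:
    """Scale a benchmark score linearly with the CPU clock (constant score per MHz)."""
    return reference_score * target_mhz / reference_mhz


# Reference: 1'562'376 Dhrystones per second measured at 400 MHz.
for mhz in (100, 200, 300, 400, 500, 550):
    dps = scale_with_clock(1_562_376, 400, mhz)
    print(f"{mhz:3d} MHz: {dps:9.0f} Dhrystones/s, {dps / 1757:5.0f} DMIPS")
```

Only the 400 MHz row is a measurement; every other row is a projection under the linear-scaling assumption.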

Cheated Results

I was curious how the Dhrystone performance changes when the files dhry_1.c and dhry_2.c are merged into a single file, which you are not supposed to do [2]. The results are listed below. Again, any interpretation is left to the reader.

Measurement Results

IBM PowerPC 440 @ 400 MHz

Columns marked with (*) represent compiler settings violating the intended compilation rules of the synthetic benchmark.

Compiler optimization flags                     -O3 -mppcperflib¹º (*)   -O3¹¹ (*)   -O2 -mppcperflib²º   -O2²¹
Execution time for 100'000'000 iterations [s]   49.25                    61.51       94.0                 105.5
Dhrystones per second                           2'030'390                1'625'766   1'063'775            947'798
Dhrystone MIPS (DMIPS)                          1155.6                   925.3       605.4                539.4
Time for one run through Dhrystone [us]         0.49                     0.62        0.94                 1.06
DMIPS/MHz                                       2.89                     2.31        1.51                 1.35

¹º CFLAGS := -g -Wall -Werror -std=c99 -O3 -mcpu=440 -mppcperflib -DTIME
¹¹ CFLAGS := -g -Wall -Werror -std=c99 -O3 -mcpu=440 -DTIME
²º CFLAGS := -g -Wall -Werror -std=c99 -O2 -mcpu=440 -mppcperflib -DTIME
²¹ CFLAGS := -g -Wall -Werror -std=c99 -O2 -mcpu=440 -DTIME
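
To put the merged-file numbers next to the separately compiled ones, the relative gain per flag set can be computed directly from the two measurement tables. The Python sketch below does only that arithmetic; the dictionary layout and the function name are illustrative choices.

```python
# DMIPS per flag set, copied from the two measurement tables.
separate = {"-O3 -mppcperflib": 889.2, "-O3": 751.4, "-O2 -mppcperflib": 576.3, "-O2": 513.9}
merged = {"-O3 -mppcperflib": 1155.6, "-O3": 925.3, "-O2 -mppcperflib": 605.4, "-O2": 539.4}


def relative_gain(before: float, after: float) -> float:
    """Return the improvement of `after` over `before` in percent."""
    return (after / before - 1.0) * 100.0


gains = {flags: relative_gain(separate[flags], merged[flags]) for flags in separate}
for flags, gain in gains.items():
    print(f"{flags:18s} +{gain:.1f} %")
```

On these figures, merging the files gains roughly 30 % and 23 % for the two '-O3' builds but only about 5 % for the '-O2' builds. One plausible reading, not verified here, is that '-O3' can inline calls that previously crossed the file boundary.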

Inter-/Extrapolated Results

Using 1156 DMIPS @ 400 MHz as reference data:

CPU clock [MHz]          100       200         300         400         500         550
Dhrystones per second    507'598   1'015'195   1'522'793   2'030'390   2'537'988   2'791'787
Dhrystone MIPS (DMIPS)   289       578         867         1156        1445        1589
DMIPS/MHz                2.89      2.89        2.89        2.89        2.89        2.89

Using 605 DMIPS @ 400 MHz as reference data:

CPU clock [MHz]          100       200       300       400         500         550
Dhrystones per second    265'944   531'888   797'831   1'063'775   1'329'719   1'462'691
Dhrystone MIPS (DMIPS)   151       303       454       605         757         833
DMIPS/MHz                1.51      1.51      1.51      1.51        1.51        1.51

Summary & Conclusions

Using gcc 4.1.1 and the IBM PowerPC Performance Libraries on an embedded PowerPC 440 CPU running at 400 MHz, it was possible to achieve up to 576 Dhrystone MIPS (DMIPS) with cache-adjusted code size and with compiler optimization flags that are officially allowed by the Dhrystone benchmark. Extrapolated to 550 MHz, the PowerPC 440 would achieve 793 DMIPS for the version 2.1 benchmark, which is well below the single-core performance of 1100+ DMIPS advertised by Xilinx [1]. If aggressive compiler optimization techniques beyond the intention of the benchmark are used, the Dhrystone performance increases dramatically, ultimately because the synthetic Dhrystone benchmark was neither intended nor designed to cope with such compiler-based code optimizations [4].

However, the relevance of these performance numbers is questionable in general: the code must run entirely from cache, without any I/O transfers, to show the best results. As soon as larger code size, costly I/O transfers, and different compiler options are involved, these numbers are merely theoretical.

Last but not least, it is impressive to see how strongly the compiler's different code optimization techniques influence the execution time of an identical piece of code. This makes it obvious that a CPU performance measuring tool such as a benchmark needs to be designed with hardware, software, and compiler architectures and capabilities clearly in mind.
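
The gap between the extrapolated 793 DMIPS and the advertised 1100+ DMIPS can be expressed in two ways: as a fraction of the advertised figure, and as the clock that the rule-conforming 1.44 DMIPS/MHz would need to reach it. The Python sketch below works through both; it assumes linear scaling with clock, makes no claim that the device can actually be clocked that high, and uses a hypothetical function name.

```python
def required_clock_mhz(target_dmips: float, dmips_per_mhz: float) -> float:
    """Clock needed to reach target_dmips, assuming DMIPS scales linearly with clock."""
    return target_dmips / dmips_per_mhz


advertised_dmips = 1100  # lower bound of the "1100+ DMIPS" figure cited from [1]
measured_per_mhz = 1.44  # -O2 -mppcperflib, separate compilation

fraction = 793 / advertised_dmips
clock = required_clock_mhz(advertised_dmips, measured_per_mhz)
print(f"reached at 550 MHz: {fraction:.0%} of the advertised figure")
print(f"clock needed for {advertised_dmips} DMIPS: {clock:.0f} MHz")
```

Under these assumptions the rule-conforming build reaches about 72 % of the advertised figure at 550 MHz, and matching it would take a purely hypothetical clock of roughly 764 MHz.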
 

References

[1] Xilinx Inc., Virtex-5 Family Brochure, Dec 2008

[2] Xilinx Inc., Virtex-5 Website

[3] Paul Glover, Running the Dhrystone 2.1 Benchmark on a Virtex-II Pro PowerPC Processor, Xilinx Application Note (XAPP507), July 2005

[4] Alan R Weiss, Dhrystone Benchmark - History, Analysis, "Scores" and Recommendations, White Paper, Nov. 1, 2002

[5] netlib.org, Benchmark Programs and Reports

[6] sourceforge.net, IBM PowerPC Performance Libraries

 

Last updated: 2012/12/30
