Mike’s Blog

March 31, 2006

Java Performance in Dual Core/Multiprocessor Environment

Filed under: Software Development — mlee888 @ 7:39 pm

By: Mike K Lee

March 30, 2006

The Shocker

The same Java program can actually run a lot faster on a single-CPU machine than on a multiprocessor/multi-core machine!!

The Story

I was recently lucky enough to have spec’ed myself a pretty nifty development machine at work. Here is the summary: Pentium D 955EE processor[1], 3GB of 667MHz DDR2, dual 74GB Western Digital Raptors @ 10K RPM configured in RAID 0, and an Nvidia Quadro FX1400 with dual DVI outputs. I know AMD still edges it out in many of the benchmarks[2] and yes, I know it was already obsolete even before I booted the machine up in February of 2006. With Intel’s announcement in March, Conroe and Woodcrest are only going to make this happen even sooner.[3]


In any case, this is not your grandma’s PC and certainly you would think it should beat anything that I have been using for more than 2 years. After all, this CPU is being touted as one of the fastest Intel CPUs – at least during February 2006.


I quickly set up and replicated my development environment from my IBM T41 notebook[4]. The new box is fast.[5] In no time, I had installed Microsoft Visual Studio 2005, Eclipse 3.1.1 and Java 5, plus several previous versions of the VM, all under Windows XP x64.


I fired up a build and not surprisingly, it built and ran the included tests successfully. A build with all the test runs does take some time, so I was paying close attention to the performance of this on the new box.


I went to look at some of the performance metrics of the tests. I immediately compared them to the build that I had just done on my 2-year-old notebook[6]. I did not believe what I saw.


The new box was not any faster in many of the tests. Not only was the result not any faster than the notebook[7], it was a lot slower!!! I was seeing performance numbers like 180K records/sec processing speed on the notebook compared to 130K records/sec on the new box.


This cannot be right!!


What the #*@*@?

Did I just waste $10,000[8] on four of these new machines?

Damn, I should have gotten an AMD box.

Is it Microsoft? Windows x64 ?

What the #*@*@? (many more times)

Something is wrong with the Java VM ?


This would not have been a very surprising result in a number of different scenarios. For instance, to use multiple CPUs optimally, the threads need to be designed and coordinated properly so that one thread is not waiting around for another. And I do not mean deadlock situations. I have also read that Hyper-Threading can be slower in some situations[9].


The test was not that complicated. It was in fact single threaded. It opens a file and does some very simple processing while it reads the file in its entirety.


Although most benchmarks suggest Windows x64 has little overhead running 32-bit applications, maybe the 32-bit Java VM has a known issue with this. I contemplated repartitioning the drive and installing a regular XP to find out. Instead, I took a more efficient approach: I downloaded Sun’s latest Java 5 64-bit VM for Windows and ran the tests against it.


The good news is that the 64-bit VM did run faster, but it was still a lot slower than the notebook. What the #@(#@@? I did some more research on Windows x64 and by this time concluded that the problem has nothing to do with the OS[10].


I had ruled out one of many environmental factors, the OS. The same class files were being run on the notebook and the new box, so I naturally wanted to examine the parts of the environment that I could control. Sun has had the reputation of not producing the fastest VM. There are in fact a number of other VMs available, including BEA JRockit(r) Java 5 (32 and 64-bit versions) and IBM’s.


There are also several performance tuning parameters available when starting the VM. The choice of garbage collection algorithm and strategy can also affect performance[11]. I tried an endless array of VM options and different vendors’ VMs, but none of them was able to bring the new box up to the speed of the notebook, let alone make it run faster.


Could there be an issue with dual cores? I turned off Hyper-Threading and the second core, effectively turning the new box into a single-CPU machine. With this done, performance surged! The new box turned in about 400K records/second, about double that of the notebook.


This was both good and bad news. It was great that the new box could finally run faster than the notebook. But come on, am I supposed to keep the second core disabled? I am sure this was not part of Intel’s plan! Nor mine!


By now, I had no choice but to dive into the code. I began to look into the call graphs with a profiler and eventually came down to a few blocks of code in BufferedInputStream that took up most of the time. The prime suspect was the ‘synchronized’ keyword and, as further tests would reveal, it was in fact the culprit.


With synchronized taken out of the BufferedInputStream, the read performance of the test on the new box skyrocketed to about 360K records/second with the dual core enabled. With the “-server” option, it went up further to about 400K records/second. At last, this was about double the fastest performance I got from the notebook, which was 200K records/second with the -server option.
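
To make the idea concrete, here is a minimal sketch of a buffered reader whose read() is not synchronized. It is not the modified class from the test above; the class name UnsyncBufferedInputStream and the helper countRecords are hypothetical, and the "record" is assumed to be a newline-terminated line. It is only safe when a single thread uses the stream.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical class for illustration: buffering without any locking.
class UnsyncBufferedInputStream extends InputStream {
    private final InputStream in;
    private final byte[] buf;
    private int pos;    // index of the next byte to hand out
    private int count;  // number of valid bytes currently in buf

    UnsyncBufferedInputStream(InputStream in, int size) {
        this.in = in;
        this.buf = new byte[size];
    }

    @Override
    public int read() throws IOException {   // deliberately not synchronized
        if (pos >= count) {
            // Buffer exhausted: refill it from the underlying stream.
            count = in.read(buf, 0, buf.length);
            pos = 0;
            if (count <= 0) {
                count = 0;
                return -1;                    // end of stream
            }
        }
        return buf[pos++] & 0xff;
    }

    @Override
    public void close() throws IOException {
        in.close();
    }

    // Counts newline-terminated records by reading one byte at a time,
    // which is the access pattern that pays the per-call locking cost.
    static int countRecords(byte[] data, int bufferSize) {
        try (InputStream s = new UnsyncBufferedInputStream(
                new ByteArrayInputStream(data), bufferSize)) {
            int records = 0;
            for (int b = s.read(); b != -1; b = s.read()) {
                if (b == '\n') records++;
            }
            return records;
        } catch (IOException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        byte[] data = "a,1\nb,2\nc,3\n".getBytes(StandardCharsets.US_ASCII);
        System.out.println(countRecords(data, 4)); // prints 3
    }
}
```

Reading in larger chunks with read(byte[], int, int) is another way to cut down the number of lock acquisitions without giving up the stock class.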

Test Results


In light of my findings, a set of very rudimentary tests were done. These are not comprehensive tests.



32-bit Java 1.5.0_06-b05

New Box (Pentium D 955E)
New Box (Pentium D 955EE), Dual Core
New Box (1.7GHz Pentium M)

StringBuffer (bunch of appends)
StringBuilder (same op as StringBuffer)
Vector operations (add & removes)
ArrayList operations (same op as vector)
data file (write) with sync
data file (write) without sync
data file (read) with sync
data file (read) without sync

New Box, Dual+Hyper Threading
BEA 64-bit VM
New Box
Sun 64-bit VM

StringBuffer (bunch of appends)
StringBuilder (same op as StringBuffer)
Vector operations
ArrayList operations
data file (write) with sync
data file (write) without sync
data file (read) with sync
data file (read) without sync

Ubuntu – AMD64 w/ 64-bit JVM

New Box (Pentium D 955E), Ubuntu AMD64 Live-Boot (read/write against memory), Uniprocessor kernel
New Box (Pentium D 955EE), SMP kernel[12]

StringBuffer (bunch of appends)
StringBuilder (same op as StringBuffer)
Vector operations
ArrayList operations
data file (write) with sync
data file (write) without sync
data file (read) with sync
data file (read) without sync


It is not news that synchronized has a negative performance impact. You probably know it is best to use StringBuilder instead of StringBuffer whenever you can. The difference between StringBuffer and StringBuilder, 2.123 seconds versus 0.961, is somewhat expected. What is really striking here is that the same code using StringBuffer is much slower on a machine with multiple cores enabled!
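
For readers who want to try the comparison themselves, here is a minimal sketch of a "bunch of appends" timing. It is not the original test source; the class name AppendTimer, the append count, and the warm-up loop are assumptions. Timings will vary by VM and hardware, and a newer JIT may remove uncontended locks altogether, so treat the output as a relative comparison only.

```java
// Hypothetical micro-benchmark: same appends, with and without synchronized methods.
class AppendTimer {
    // StringBuffer.append is synchronized, so every call acquires a monitor.
    static int appendWithBuffer(int n) {
        StringBuffer sb = new StringBuffer();
        for (int i = 0; i < n; i++) {
            sb.append('x');
        }
        return sb.length();
    }

    // StringBuilder has the same API but no locking.
    static int appendWithBuilder(int n) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < n; i++) {
            sb.append('x');
        }
        return sb.length();
    }

    public static void main(String[] args) {
        int n = 5000000; // assumed workload size

        // Warm up so the JIT has compiled both methods before timing.
        for (int i = 0; i < 5; i++) {
            appendWithBuffer(n);
            appendWithBuilder(n);
        }

        long t0 = System.nanoTime();
        appendWithBuffer(n);
        long t1 = System.nanoTime();
        appendWithBuilder(n);
        long t2 = System.nanoTime();

        System.out.println("StringBuffer:  " + (t1 - t0) / 1000000 + " ms");
        System.out.println("StringBuilder: " + (t2 - t1) / 1000000 + " ms");
    }
}
```

Running it once with all cores enabled and once with the process pinned to a single core is one way to look for the effect described here.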


I would not be surprised if acquiring a monitor in a multiprocessor environment is more expensive than in the single-CPU case. However, why would Vector behave so well compared to ArrayList? Maybe I am missing something here; I have not had the time to investigate this further. If you have any ideas, feel free to comment.
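
Here is a minimal sketch of how the Vector versus ArrayList comparison could be reproduced. It is not the original test; the class name ListOpsTimer, the choice of removing from the end of the list, and the run counts are all assumptions.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Vector;

// Hypothetical micro-benchmark: identical add/remove work on Vector and ArrayList.
class ListOpsTimer {
    // Adds n integers, then removes them one by one from the end.
    // Returns the total number of add and remove calls made.
    static int addThenRemove(List<Integer> list, int n) {
        int ops = 0;
        for (int i = 0; i < n; i++) {
            list.add(i);
            ops++;
        }
        while (!list.isEmpty()) {
            list.remove(list.size() - 1); // remove(int index), not remove(Object)
            ops++;
        }
        return ops;
    }

    // Averages the elapsed time over several runs, each on a fresh list.
    static long averageNanos(boolean useVector, int n, int runs) {
        long total = 0;
        for (int r = 0; r < runs; r++) {
            List<Integer> list = useVector ? new Vector<Integer>() : new ArrayList<Integer>();
            long start = System.nanoTime();
            addThenRemove(list, n);
            total += System.nanoTime() - start;
        }
        return total / runs;
    }

    public static void main(String[] args) {
        int n = 1000000; // assumed workload size
        int runs = 5;    // assumed number of timed runs

        // Untimed warm-up pass for both implementations.
        averageNanos(true, n, 2);
        averageNanos(false, n, 2);

        System.out.println("Vector:    " + averageNanos(true, n, runs) / 1000000 + " ms");
        System.out.println("ArrayList: " + averageNanos(false, n, runs) / 1000000 + " ms");
    }
}
```

Both lists box every int into an Integer, so allocation cost may well swamp the locking cost in a test like this, which is one possible reason the two could end up closer than expected.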


You should note that the above numbers are merely averages of several runs (more than 1 but fewer than 10). They are not meant to establish absolute performance differences, only relative comparisons within the same run. Naturally, they were run while the system was otherwise in a steady idle state.


The answer, as usual, is: “it depends.” For applications where performance is not a factor, you are unlikely to need to worry about any of this. After all, reading an 80MB file takes only around 2 seconds. However, if performance is important to you, paying close attention to where you might be implicitly using synchronized is very worthwhile. A few extra milliseconds here and there can mean better response time for your users or, as in my case, 2 hours of processing time instead of 5.


Based on my observations, it is not clear that synchronized alone is the cause of the performance bottleneck. Given that synchronized is used in the same way in both Vector and StringBuffer, one would expect the performance of Vector to be worse than that of ArrayList. Given the way the JIT works, the code surrounding the synchronized block and the execution pattern are also likely keys to performance. Different VM implementations also show significant performance differences on the exact same hardware, as I found when comparing BEA’s VM with Sun’s. I have yet to try IBM’s Java 5 VM, but I expect it to be better than Sun’s based on past experience. The 64-bit VM also seems to be noticeably faster for my application, without any code change.


None of the tests here examine the effect of multiple threads contending for the same resources. Nonetheless, common sense would suggest that you should avoid such resource sharing as much as possible anyway.


The important thing here is that before you introduce parallel processing, it would be wise to know the performance characteristics of your application on both a single-CPU machine and a multiprocessor machine.

General Multi-Processing and Threading


When you look at what is available on Dell’s or HP’s websites, you almost cannot buy an x86[13] based computer with a single logical CPU these days. Whether it is through plain old multiple physical CPUs, Hyper-Threading[14], or multi-core CPUs, I do not need to convince you that this is or will become the most common environment that you will run into.


Even on a single-CPU box, multithreading can significantly enhance performance and simplify programming. Much like multitasking at the OS level, a multithreaded application can take advantage of idle time[15] to do other tasks. Imagine you are asked to ping 100,000 hosts. Each ping is a test to see whether the host is alive or not. A ping can take anywhere from milliseconds to several seconds. Pinging all 100,000 hosts sequentially will take a very long time. While one thread is waiting for the response to come back from a ping, separate threads can send other pings. With a pool of threads for pinging, you will end up doing a lot less waiting. Certainly, threads can be used to effectively reduce idle CPU time while waiting for IO to happen.
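
The ping scenario can be sketched with a fixed thread pool from java.util.concurrent. This is only an illustration under stated assumptions: the class name PingPool is made up, fakePing is a stand-in that sleeps instead of touching the network, and the rule that even-numbered hosts are "alive" exists purely so the result is predictable.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical demo: overlapping the waiting time of many slow "pings".
class PingPool {
    // Stand-in for a real ping: sleeps to simulate network latency.
    // Assumption for the demo: hosts with an even index are "alive".
    static boolean fakePing(int host, long latencyMillis) {
        try {
            Thread.sleep(latencyMillis);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
        return host % 2 == 0;
    }

    // Pings hosts 0..hostCount-1 using a pool of worker threads
    // and returns how many of them answered.
    static int countAlive(int hostCount, int threads, final long latencyMillis) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<Boolean>> results = new ArrayList<Future<Boolean>>();
            for (int h = 0; h < hostCount; h++) {
                final int host = h;
                Callable<Boolean> task = () -> fakePing(host, latencyMillis);
                results.add(pool.submit(task));
            }
            int alive = 0;
            for (Future<Boolean> f : results) {
                if (f.get()) alive++; // blocks until that ping has finished
            }
            return alive;
        } catch (Exception e) {
            throw new IllegalStateException(e);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        int hosts = 200;     // assumed number of hosts
        long latency = 10;   // assumed latency per ping, in milliseconds

        long t0 = System.nanoTime();
        int aliveSequential = 0;
        for (int h = 0; h < hosts; h++) {
            if (fakePing(h, latency)) aliveSequential++;
        }
        long t1 = System.nanoTime();
        int alivePooled = countAlive(hosts, 50, latency);
        long t2 = System.nanoTime();

        System.out.println("sequential: " + aliveSequential + " alive in " + (t1 - t0) / 1000000 + " ms");
        System.out.println("pooled:     " + alivePooled + " alive in " + (t2 - t1) / 1000000 + " ms");
    }
}
```

Because the threads spend nearly all their time sleeping, the pooled version should finish far sooner even on a single CPU; the pool size of 50 is an arbitrary choice for the sketch.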


In a multiprocessor environment, it becomes physically possible for your application to do parallel processing, that is, to do more than one thing at a time. This is in contrast to a single-processor environment, where there is only an appearance of parallel processing with the support of preemptive multitasking.[16] Multithreaded programming models that leverage this true parallel processing capability are generally not trying to minimize idleness but rather to squeeze out as many raw CPU cycles as possible. Invariably your threads are likely to need some common resource, and this is of course where you want to pay close attention.
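
As a rough sketch of CPU-bound parallel work that avoids a shared resource, the example below splits an array into one chunk per thread and lets each thread compute its own partial sum. Nothing is shared while the threads run, and the partial results are combined only at the end. The class name ParallelSum and the chunking scheme are assumptions made for the illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical demo: one private partial sum per thread, no locks in the hot loop.
class ParallelSum {
    // Sums data[from..to) using only local state.
    static long sumRange(int[] data, int from, int to) {
        long sum = 0;
        for (int i = from; i < to; i++) {
            sum += data[i];
        }
        return sum;
    }

    // Splits the array into roughly equal chunks, sums each chunk on its
    // own pool thread, then adds the partial sums on the calling thread.
    static long parallelSum(final int[] data, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<Long>> parts = new ArrayList<Future<Long>>();
            int chunk = (data.length + threads - 1) / threads; // ceiling division
            for (int t = 0; t < threads; t++) {
                final int from = Math.min(t * chunk, data.length);
                final int to = Math.min(from + chunk, data.length);
                Callable<Long> task = () -> sumRange(data, from, to);
                parts.add(pool.submit(task));
            }
            long total = 0;
            for (Future<Long> f : parts) {
                total += f.get();
            }
            return total;
        } catch (Exception e) {
            throw new IllegalStateException(e);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        int[] data = new int[1000];
        for (int i = 0; i < data.length; i++) {
            data[i] = i + 1; // 1..1000
        }
        int threads = Runtime.getRuntime().availableProcessors();
        System.out.println(parallelSum(data, threads)); // prints 500500
    }
}
```

Had every thread added into one synchronized counter instead, each addition would have needed the monitor, which is exactly the kind of shared resource worth watching on a multiprocessor box.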


Your comments are welcome. I have not had time to post the simple source code to the test but if you are interested, I can email it to you. See email address at the end.


About Me

Email: Email Address


Director of Advanced Development and Chief Architect




[1] Pentium D 955EE is the latest Extreme Edition using the 65nm manufacturing process, dual core, Hyper-threading capable, 3.46GHz, 1066MHz FSB, with 2x 2MB L2 Cache: http://www.tomshardware.com/2005/12/28/intels_65_nm_process_breathes_fire_into_double_core_extreme_edition/


[2] The Athlon 64 X2 4800+ and dual FX-60 series beat the Intel D 955EE in a number of benchmarks. http://www.tomshardware.com/2005/12/28/intels_65_nm_process_breathes_fire_into_double_core_extreme_edition/page23.html


[3] http://www.tomshardware.com/2006/03/13/idf_spring_2006/


[4] The T41 Notebook has a Pentium M 1.7GHz, 60 GB 7200 RPM hard drive, and 1GB RAM

[5] Yes, I know you can get a faster box still. But for about $2500, it was a great deal.

[6] Do not quote me on the age of the notebook, it may only be a year old, but it feels like 2 or 3 years old for sure.

[7] The T41 Notebook has a Pentium M 1.7GHz, 60 GB 7200 RPM hard drive, and 1GB RAM

[8] We got 4 of these, at roughly $2500 each.

[9] Hyper-Threading speeds Linux Multiprocessor performance on a single processor http://www-128.ibm.com/developerworks/linux/library/l-htl/

[10] 64-bit vs. 32-bit Windows, http://www.extremetech.com/article2/0,1697,1857522,00.asp


[11] In particular, there are a number of new garbage collector options available for multiprocessor machines

[12] Unable to run this test. I did not get a chance to find and install a Linux distribution that supports the Intel Matrix Storage driver with RAID enabled, and I did not have a “Live” CD/image with SMP support.

[13] Market leaders in the x86 platform are Intel and AMD with the Pentium 4, Pentium M, Xeon, Core Duo, Athlon 64, Sempron, and Turion 64, and so on. http://en.wikipedia.org/wiki/X86

[14] http://arstechnica.com/articles/paedia/cpu/hyperthreading.ars/1

[15] Common idle time occurs when the program has to wait for IO to complete

[16] Preemptive multitasking switches the CPU context from one task to another many times each second, producing the effect of parallel processing.


Other References

http://www.research.ibm.com/journal/sj/391/christ.html – even though dated, it is still a very good article.


D. Bacon, R. Konuru, C. Murthy, and M. Serrano, “Thin Locks: Featherweight Synchronization for Java,” ACM Conference on Programming Language Design and Implementation, Montreal, Canada (June 17-19, 1998).


Jon Stokes, Introduction to Multithreading, Superthreading and Hyperthreading, http://arstechnica.com/articles/paedia/cpu/hyperthreading.ars/1



Java theory and practice: More flexible, scalable locking in JDK 5.0 http://www-128.ibm.com/developerworks/java/library/j-jtp10264/



