Mike’s Blog

March 31, 2006

Java Performance in Dual Core/Multiprocessor Environment

Filed under: Software Development — mlee888 @ 7:39 pm

By: Mike K Lee

March 30, 2006

The Shocker

The same Java program can actually run a lot faster on a single-CPU machine than on a multiprocessor/multi-core machine!!

The Story

I was recently lucky enough to have spec’ed myself a pretty nifty development machine at work. Here is the summary: Pentium D 955EE processor[1], 3GB of 667MHz DDR2, dual 74GB Western Digital Raptors @ 10K RPM configured in RAID 0, and an Nvidia Quadro FX1400 with dual DVI outputs. I know AMD still edges out Intel in many of the benchmarks[2], and yes, I know the machine was already obsolete even before I booted it up in February of 2006. With Intel’s announcement in March, Conroe and Woodcrest are only going to make that happen even sooner.[3]

 

In any case, this is not your grandma’s PC, and you would certainly think it should beat anything I have been using for more than 2 years. After all, this CPU was being touted as one of the fastest Intel CPUs, at least during February 2006.

 

I quickly set up and replicated my development environment from my IBM T41 notebook[4]. The new box is fast.[5] In no time, I had installed Microsoft Visual Studio 2005, Eclipse 3.1.1, and Java 5 – and several previous versions of the VM – all under Windows XP x64.

 

I fired up a build and, not surprisingly, it built and ran the included tests successfully. A build with all the test runs does take some time, so I was paying close attention to its performance on the new box.

 

I went to look at some of the performance metrics of the tests and immediately compared them to the build I had just done on my 2-year-old notebook[6]. I did not believe what I saw.

 

The new box was not any faster in many of the tests. Not only were the results not faster than the notebook’s[7], they were a lot slower!!! I was seeing performance numbers like 180K records/sec processing speed on the notebook compared to 130K records/sec on the new box.

 

This cannot be right!!

 

What the #*@*@?

Did I just waste $10,000[8] on four of these new machines?

Damn, I should have gotten an AMD box.

Is it Microsoft? Windows x64 ?

What the #*@*@? (many more times)

Something is wrong with the Java VM ?

 

This would not have been a very surprising result in a number of other scenarios. For instance, to use multiple CPUs optimally, each thread needs to be designed and coordinated properly, so that one thread is not waiting around for another – and I do not mean deadlock situations. I have also read that Hyper-Threading can be slower in some situations[9].

 

The test was not that complicated. It was in fact single threaded. It opens a file and does some very simple processing while reading it in its entirety.
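A minimal sketch of a test with that shape (my reconstruction, not the original source; the per-byte checksum is a placeholder for the “very simple processing”):

```java
import java.io.*;

public class ReadTest {
    // Read a stream byte-by-byte through BufferedInputStream, doing trivial
    // per-byte work. Note that BufferedInputStream.read() is synchronized in
    // the JDK, so this loop acquires and releases a monitor once per byte.
    public static long process(InputStream raw) throws IOException {
        BufferedInputStream in = new BufferedInputStream(raw);
        long checksum = 0;
        int b;
        while ((b = in.read()) != -1) {
            checksum += b; // stand-in for the "very simple processing"
        }
        return checksum;
    }

    public static void main(String[] args) throws IOException {
        long start = System.currentTimeMillis();
        long sum = process(new FileInputStream(args[0]));
        long elapsed = System.currentTimeMillis() - start;
        System.out.println("checksum=" + sum + " in " + elapsed + " ms");
    }
}
```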

 

Although most benchmarks suggest Windows x64 has little overhead when running 32-bit applications, maybe the 32-bit Java VM has a known issue with it. I contemplated repartitioning the drive and installing a regular 32-bit XP to find out. Instead, I took a more efficient approach: I downloaded Sun’s latest 64-bit Java 5 VM for Windows and ran the tests against it.

 

The good news is that the 64-bit VM did run faster, but it was still a lot slower than the notebook. What the #@(#@@? I did some more research on Windows x64 and by this time concluded that it had nothing to do with the OS[10].

 

I had ruled out one of many environmental factors, the OS. The same class files were being run on the notebook and the new box, so I naturally wanted to examine the environments that I could control. Sun has had a reputation for not producing the fastest VM, and there are in fact a number of other VMs available, including BEA JRockit® Java 5 (32- and 64-bit versions) and IBM’s.

 

There are also several performance tuning parameters available when starting the VM. The choice of garbage collection algorithm and strategy can also affect performance[11]. I tried an endless array of VM options and different vendors’ VMs, but none of them brought the new box up to the notebook’s speed, let alone past it.

 

Could there be an issue with dual cores? I turned off Hyper-Threading and the second core, effectively turning the new box into a single-CPU machine. With this done, performance surged! The new box turned in about 400K records/second, about double that of the notebook.

 

This was both good and bad news. It was great that the new box could finally run faster than the notebook. But come on, am I supposed to keep the second core disabled? I am sure this was not part of Intel’s plan! Nor mine!

 

By now, I had no choice but to dive into the code. I began to look at the call graphs with a profiler and eventually came down to a few blocks of code in BufferedInputStream that were taking up most of the time. The prime suspect was the ‘synchronized’ keyword, and as further tests would reveal, it was in fact the culprit.

 

With synchronized taken out of BufferedInputStream, the test read performance on the new box skyrocketed to about 360K records/second with dual core enabled. With the “-server” option, it went up further to about 400K records/second. At last, this was about double the fastest performance I got from the notebook, which was 200K records/second with -server.
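Patching the JDK’s BufferedInputStream is not something you would ship. A safer way to get a similar effect – a sketch of mine, not what the original tests did – is to buffer manually with bulk read(byte[]) calls, so that any stream monitor is taken once per block rather than once per byte:

```java
import java.io.*;

public class BulkReadTest {
    // Manual buffering: one read(byte[]) call per 8KB block instead of one
    // synchronized read() call per byte, amortizing monitor acquisition.
    public static long process(InputStream in) throws IOException {
        byte[] buf = new byte[8192];
        long checksum = 0;
        int n;
        while ((n = in.read(buf)) != -1) {
            for (int i = 0; i < n; i++) {
                checksum += buf[i]; // same trivial per-byte work as before
            }
        }
        return checksum;
    }
}
```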

Test Results

 

In light of my findings, I ran a set of very rudimentary tests. These are not comprehensive tests.

 

 

32-bit Java 1.5.0_06-b05, times in seconds:

Test                                        New Box       New Box       New Box       Notebook
                                            (955EE)       (955EE)       (955EE)       (1.7GHz
                                            Single Core   Dual Core     Dual + HT     Pentium M)
StringBuffer (bunch of appends)             1.427         5.780         5.829         2.123
StringBuilder (same ops as StringBuffer)    0.470         0.484         0.469         0.961
Vector operations (adds & removes)          9.593         10.071        10.218        18.474
ArrayList operations (same ops as Vector)   9.579         9.586         9.999         18.547
data file (write) with sync                 0.894         1.438         1.563         4.687
data file (write) without sync              0.878         0.813         0.860         3.496
data file (read) with sync                  1.906         2.066         2.172         2.393
data file (read) without sync               1.750         1.580         1.922         2.163

 

 

64-bit VMs on the New Box (955EE), Dual + Hyper-Threading, times in seconds:

Test                                        BEA 64-bit VM   Sun 64-bit VM
StringBuffer (bunch of appends)             1.078           2.953
StringBuilder (same ops as StringBuffer)    0.874           0.234
Vector operations                           4.859           2.328
ArrayList operations                        4.640           1.828
data file (write) with sync                 1.687           1.624
data file (write) without sync              0.890           0.563
data file (read) with sync                  1.547           1.344
data file (read) without sync               1.187           1.062

 

Linux: Ubuntu AMD64 Live-Boot with 64-bit JVM (read/write against memory), times in seconds:

Test                                        New Box (955EE)       New Box (955EE)
                                            Uniprocessor kernel   SMP kernel[12]
StringBuffer (bunch of appends)             0.827                 n/a
StringBuilder (same ops as StringBuffer)    0.288                 n/a
Vector operations                           1.925                 n/a
ArrayList operations                        8.215                 n/a
data file (write) with sync                 0.504                 n/a
data file (write) without sync              0.398                 n/a
data file (read) with sync                  0.535                 n/a
data file (read) without sync               0.420                 n/a

 

 

It is not news that synchronized has a negative performance impact, and you probably know it is best to use StringBuilder instead of StringBuffer whenever you can. The 2.123-second vs. 0.961-second comparison between StringBuffer and StringBuilder on the notebook is somewhat expected. What is really striking here is that the same StringBuffer code on a multi-core-enabled machine is much slower!
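The StringBuffer and StringBuilder rows can be reproduced with a microbenchmark along these lines (a sketch of mine; the append count and character are arbitrary, not taken from the original test source):

```java
public class AppendBench {
    // Identical append loops: StringBuffer.append() is synchronized,
    // StringBuilder.append() is not. That is the only difference measured.
    static String withBuffer(int n) {
        StringBuffer sb = new StringBuffer();
        for (int i = 0; i < n; i++) sb.append('x');
        return sb.toString();
    }

    static String withBuilder(int n) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < n; i++) sb.append('x');
        return sb.toString();
    }

    public static void main(String[] args) {
        int n = 5000000;
        long t0 = System.nanoTime();
        withBuffer(n);
        long t1 = System.nanoTime();
        withBuilder(n);
        long t2 = System.nanoTime();
        System.out.println("StringBuffer:  " + (t1 - t0) / 1.0e9 + " s");
        System.out.println("StringBuilder: " + (t2 - t1) / 1.0e9 + " s");
    }
}
```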

 

I would not be surprised if the cost of acquiring a monitor in a multiprocessor environment were higher than in the single-CPU case. However, why would Vector behave so well compared to ArrayList? Maybe I am missing something here; I have not had time to investigate further. If you have any ideas, feel free to comment.

 

You should note that the numbers above are merely averages of several runs (more than 1 but fewer than 10) and are not meant to establish absolute performance differences, only relative comparisons within the same run. Naturally, they were run while the system was otherwise in a steady, idle state.

Recommendation

The answer, as usual, is: “it depends.” For applications where performance is not a factor, you likely do not need to worry about any of this. After all, reading an 80MB file takes only around 2 seconds. However, if performance is important to you – where extra milliseconds here and there mean better response time for your users, or where, as in my case, 2 hours of processing time is a lot better than 5 hours – paying close attention to where you might be implicitly using synchronized is very worthwhile.

 

Based on my observations, it is not clear that synchronized alone is the cause of the bottleneck. Given that the same kind of synchronized code is used in both Vector and StringBuffer, one would expect Vector’s performance to be correspondingly worse than ArrayList’s. Given the way the JIT works, the code surrounding the synchronized block and the execution pattern are also likely keys to performance. Different VM implementations also show significant performance differences on the exact same hardware, as the BEA vs. Sun comparison shows. I have yet to try IBM’s Java 5 VM, but based on past experience I expect it to be better than Sun’s. The 64-bit VMs also seem noticeably faster for my application, without any code change.

 

None of the tests here examine the effect of multiple threads contending for the same resources. Nonetheless, common sense suggests that you should avoid such resource sharing as much as possible anyway.
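One common way to avoid that sharing, sketched here with a trivial counting task of my own invention, is to give each thread its own accumulator and merge the results only after the threads finish:

```java
public class SharedVsLocal {
    // Each worker counts into its own slot; no lock is ever contended in the
    // hot loop. Thread.join() guarantees the per-thread writes are visible
    // to the merging loop afterwards.
    public static long countInParallel(final long perThread, int threads)
            throws InterruptedException {
        final long[] tallies = new long[threads];
        Thread[] workers = new Thread[threads];
        for (int t = 0; t < threads; t++) {
            final int id = t;
            workers[t] = new Thread(new Runnable() {
                public void run() {
                    long local = 0;              // thread-private, no sharing
                    for (long i = 0; i < perThread; i++) local++;
                    tallies[id] = local;         // single write at the end
                }
            });
            workers[t].start();
        }
        for (int t = 0; t < threads; t++) workers[t].join();
        long total = 0;
        for (int t = 0; t < threads; t++) total += tallies[t];
        return total;
    }
}
```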

 

The important thing here is that before you introduce parallel processing, it is wise to know the performance characteristics of your application on both a single-CPU machine and a multiprocessor machine.

General Multi-Processing and Threading

 

When you look at what is available on Dell’s or HP’s websites, you almost cannot buy an x86[13]-based computer with a single logical CPU these days. Whether through plain old multiple physical CPUs, Hyper-Threading[14], or multi-core CPUs, I do not need to convince you that this is, or will become, the most common environment you will run into.

 

Even on a single-CPU box, multithreading can significantly enhance performance and simplify programming. Much like multitasking at the OS level, a multi-threaded application can take advantage of idle time[15] to do other tasks. Imagine you are asked to ping 100,000 hosts, each ping testing whether a host is alive. A ping can take anywhere from milliseconds to several seconds, so pinging all 100,000 hosts sequentially would take a very long time. While the program is waiting for the response to one ping, separate threads can send others. With a pool of pinging threads, you end up doing a lot less waiting. Certainly, threads can be used to effectively reduce idle CPU time while waiting for IO to happen.
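A sketch of such a pool using java.util.concurrent from Java 5 (the pool size and timeout are arbitrary choices of mine, and InetAddress.isReachable() is only a rough pure-Java stand-in for a real ICMP ping):

```java
import java.net.InetAddress;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class PingPool {
    // Pings each host on a pool of worker threads; while one thread blocks
    // waiting for a reply, the others keep sending.
    public static List<String> aliveHosts(List<String> hosts, final int timeoutMs)
            throws InterruptedException, ExecutionException {
        ExecutorService pool = Executors.newFixedThreadPool(50);
        List<Future<Boolean>> results = new ArrayList<Future<Boolean>>();
        for (final String host : hosts) {
            results.add(pool.submit(new Callable<Boolean>() {
                public Boolean call() throws Exception {
                    return InetAddress.getByName(host).isReachable(timeoutMs);
                }
            }));
        }
        List<String> alive = new ArrayList<String>();
        for (int i = 0; i < hosts.size(); i++) {
            if (results.get(i).get()) alive.add(hosts.get(i));
        }
        pool.shutdown();
        return alive;
    }
}
```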

 

In a multiprocessor environment, it becomes physically possible for your application to do parallel processing – more than one thing at a time. This is in contrast to a single-processor environment, where there is only the appearance of parallelism, supported by preemptive multitasking.[16] Multithreaded programming models that leverage true parallel processing are generally not trying to minimize idleness but rather to squeeze out as many raw CPU cycles as possible. Invariably, your threads are likely to need some common resource, and this is of course where you want to pay close attention.

 

Your comments are welcome. I have not had time to post the simple source code for the tests, but if you are interested, I can email it to you. See the email address at the end.

 

About Me

Email: Email Address

 

Director of Advanced Development and Chief Architect

www.evidentsoftware.com

 

 


[1] Pentium D 955EE is the latest Extreme Edition using the 65nm manufacturing process, dual core, Hyper-threading capable, 3.46GHz, 1066MHz FSB, with 2x 2MB L2 Cache: http://www.tomshardware.com/2005/12/28/intels_65_nm_process_breathes_fire_into_double_core_extreme_edition/

 

[2] The Athlon 64 X2 4800+ and dual-core FX-60 beat the Intel Pentium D 955EE in a number of benchmarks. http://www.tomshardware.com/2005/12/28/intels_65_nm_process_breathes_fire_into_double_core_extreme_edition/page23.html

 

[3] http://www.tomshardware.com/2006/03/13/idf_spring_2006/

 

[4] The T41 Notebook has a Pentium M 1.7GHz, 60 GB 7200 RPM hard drive, and 1GB RAM

[5] Yes, I know you can get a faster box still. But for about $2500, it was a great deal.

[6] Do not quote me on the age of the notebook; it may be only a year old, but it feels like 2 or 3 years old for sure.

[7] The T41 Notebook has a Pentium M 1.7GHz, 60 GB 7200 RPM hard drive, and 1GB RAM

[8] We got 4 of these, at roughly $2500 each.

[9] Hyper-Threading speeds Linux Multiprocessor performance on a single processor http://www-128.ibm.com/developerworks/linux/library/l-htl/

[10] 64-bit vs. 32-bit Windows, http://www.extremetech.com/article2/0,1697,1857522,00.asp

 

[11] In particular, there are a number of new garbage collector options available for multiprocessor machines

[12] Unable to run this test. I did not get a chance to find/install a Linux distribution that supports Intel Matrix Storage with a RAID-enabled driver, and I have not found a “Live” CD/image with SMP support.

[13] Market leaders in the x86 platform are Intel and AMD, with the Pentium 4, Pentium M, Xeon, Core Duo, Athlon 64, Sempron, Turion 64, and so on. http://en.wikipedia.org/wiki/X86

[14] http://arstechnica.com/articles/paedia/cpu/hyperthreading.ars/1

[15] Common idle time occurs when the program has to wait for IO to complete

[16] Preemptive multitasking switches the CPU context from one task to another many times each second, producing the effect of parallel processing.

 

Other References

http://www.research.ibm.com/journal/sj/391/christ.html – even though dated, it is still a very good article.

 

D. Bacon, R. Konuru, C. Murthy, and M. Serrano, “Thin Locks: Featherweight Synchronization for Java,” ACM Conference on Programming Language Design and Implementation, Montreal, Canada (June 17­19, 1998).

 

Jon Stokes, Introduction to Multithreading, Superthreading and Hyperthreading, http://arstechnica.com/articles/paedia/cpu/hyperthreading.ars/1

 

http://www-128.ibm.com/developerworks/eserver/library/es-JavaVirtualMachinePerformance.html

Java theory and practice: More flexible, scalable locking in JDK 5.0 http://www-128.ibm.com/developerworks/java/library/j-jtp10264/

 

http://www-128.ibm.com/developerworks/java/library/j-jalapeno/

 

15 Comments »

  1. I guess the dual core would only come in handy for applications that do lots of multithreading and at the same time use a lot of CPU. On the other hand, Intel has been known to limit the capacity of desktop chips in order to sell higher-priced server chips…

    Comment by Milton — April 3, 2006 @ 9:22 am

  2. Good test setup, Mike!

    Comment by Anjan — April 3, 2006 @ 9:53 am

  3. Mike – Really good information. My first takeaway is that I’m as surprised as you are by the results. But what else should I takeaway? My perception is that programming for multi-processor, dual core, and/or hyperthreading requires careful attention to optimize performance. I will be interested in the discussion at work about what to do with this information! Thanks, JT

    Comment by Jason — April 4, 2006 @ 11:43 am

  4. Jason, most people don’t think about multiprocessors/cores when they are doing single-threaded development. But the results I have seen show that even in this case, you have to test on and worry about multiprocessors/cores.

    Comment by mlee888 — April 9, 2006 @ 9:00 pm

  5. Have you tried using the -XX:+UseBiasedLocking Sun JVM option?

    Comment by HashiDiKo — April 10, 2006 @ 4:25 pm

  6. Thanks HashiDiKo! I have not tried this on all the various configs yet.
    -XX:+UseBiasedLocking is a great option to try, particularly when your application is single threaded.
    http://java.sun.com/performance/reference/whitepapers/tuning.html

    *-XX:+UseBiasedLocking
    Enables a technique for improving the performance of uncontended synchronization. An object is "biased" toward the thread which first acquires its monitor via a monitorenter bytecode or synchronized method invocation; subsequent monitor-related operations performed by that thread are relatively much faster on multiprocessor machines. Some applications with significant amounts of uncontended synchronization may attain significant speedups with this flag enabled; some applications with certain patterns of locking may see slowdowns, though attempts have been made to minimize the negative impact.

    Based on a quick run – this indeed did improve StringBuffer performance significantly (over 5 seconds to 0.969 sec) – but the times for the buffered read/write tests continued to show better performance for the non synchronized versions.
    From a 955EE, Dual Core + HT
    .1000 [main] INFO SynchronizedTests – stringBufferOperation has taken: 0.969 seconds
    .1484 [main] INFO SynchronizedTests – stringBuilderOperation has taken: 0.484 seconds
    .11061 [main] INFO SynchronizedTests – vectorOp has taken: 9.577 seconds
    .21060 [main] INFO SynchronizedTests – arrayOp has taken: 9.999 seconds
    .22497 [main] INFO SynchronizedTests – writeBinaryFile-sync:true has taken: 1.437 seconds
    24529 [main] INFO SynchronizedTests – readBinaryFile-sync:true has taken: 2.032 seconds
    25357 [main] INFO SynchronizedTests – writeBinaryFile-sync:false has taken: 0.828 seconds
    26997 [main] INFO SynchronizedTests – readBinaryFile-sync:false has taken: 1.64 seconds

    -server -XX:+UseBiasedLocking 

    .484 [main] INFO SynchronizedTests  – stringBufferOperation has taken: 0.453 seconds
    .812 [main] INFO SynchronizedTests  – stringBuilderOperation has taken: 0.328 seconds
    .7062 [main] INFO SynchronizedTests  – vectorOp has taken: 6.25 seconds
    .8625 [main] INFO SynchronizedTests  – arrayOp has taken: 1.563 seconds
    .9906 [main] INFO SynchronizedTests  – writeBinaryFile-sync:true has taken: 1.281 seconds
    11343 [main] INFO SynchronizedTests  – readBinaryFile-sync:true has taken: 1.437 seconds
    12125 [main] INFO SynchronizedTests  – writeBinaryFile-sync:false has taken: 0.782 seconds
    12984 [main] INFO SynchronizedTests  – readBinaryFile-sync:false has taken: 0.859 seconds
     

    Comment by mlee888 — April 10, 2006 @ 4:42 pm

  7. Mike.

    I have a different problem. I cannot get my java app to use more than one of the logical processors. I’ve a dual socket, dual core so including the HT, Linux sees 8 logical processors. My java 64 process will max out on one of these and not span over to the other procs. JConsole is showing Java Hotspot 64-bit Server VM. Any ideas?

    Comment by Dave F — May 23, 2006 @ 6:59 pm

  8. Dave,

    Is this your own app or someone else’s? One way of getting some answers is to attach a debugger/profiler to see what threads are in that app and what they are doing. If your application is not engineered with an effective thread model, it will not likely benefit from having more processors; at worst, they will make it slower.

    Since you are seeing the app only able to use one, using the bias locking option will likely increase your app’s performance – but alas, it will not make it use the other processors.

    -Mike.

    Comment by mlee888 — May 23, 2006 @ 9:29 pm

    This is someone else’s app and, looking at JConsole, it appears to have an effective thread model. I think I might know what is happening though. top is showing the Java thread maxing out a logical processor at 99.9%, and this led me to believe it was not spanning processors. However, I believe it actually is using more than one processor, because the sum of the CPU used is greater than 12.5% by far. So sorry. This looks like user error aided by top. Thanks.

    Comment by Dave F — May 24, 2006 @ 2:12 pm

  10. Mike,

    On MP IA32 and AMD64 systems we have to use locked atomic instructions (e.g., lock:cmpxchg) to acquire locks. On uniprocessors we can do without the lock: prefix. lock:ed instructions have considerable latency on some Intel processors (>500 cycles).

    -Dave (JVM core technology, Sun)

    Comment by Dave Dice — June 5, 2006 @ 11:48 am

  11. Hi Guys,

    I am working on specifying hardware to our customers for a vendor product we deploy and support.

    The product does a lot of large batch processing, reading lots of data out of a MSSQL database, performing a series of financial costing algorithms and inserting results back.

    The APP is for all intents and purposes a big single threaded application (with a Servlet based front end).

    With the limited resources we have for performance testing, it seems the app runs a lot better on a fast single CPU, single core box, than any dual core or dual CPU boxes.

    Using simple monitoring like task manager and perfmon, we only ever see ‘one’ cpu being maxed out.

    We are fine with this, as it is after all not written to be multi-threaded, so we would not expect it to use more than 1 cpu. We are also really stuck, as a total re-write is not possible. (Aside: We’ve obviously learnt from this and our new app under development is VERY multi-threaded )

    Our issue is as you said in your article, it’s getting hard to get beefy single CPU machines.

    We used to be able to recommend a DL380 with a 3.6Ghz single cpu for a reasonable price tag.

    Now the only thing we can get in a reasonable price tag is DL380 with a 3.0Ghz dual core.

    So in effect, out app has gone from running an a 3.6Ghz CPU to a 3.0Ghz cpu.

    My big questions:
    1) Is anyone else suffering from this same issues/challenges for old apps?
    2) Has anyone got any good ideas to speed up our app on lower clock speed, more core/cpu systems?

    Thanks in advance.

    James

    Comment by James — March 19, 2007 @ 10:32 pm

  12. James,

    If your app is fronted by a servlet engine, then that in itself suggests potential parallel processing, where multiple servlets (the same or different ones) may be called at the same time. Although you know more about your app; if it is doing the batch processing single threaded, then single threaded it is, I suppose.

    While the clock speed has been reduced, the actual speed has improved in most cases. You cannot go by clock speed, as architectures from Intel (as well as AMD and others) have been and are constantly changing, such that a 2.3GHz Core Duo processor executes more effective instructions/second than a 3.0GHz original P4.

    Depending on your app, you may be able to split up your batch work into two batch jobs running at the same time, two processes? (I know this sounds obvious)

    Have you tried the flag that I mentioned? Being a single-threaded app, you are likely to gain if you use -XX:+UseBiasedLocking (mileage will vary depending on what CPU you have).

    If you are doing a lot of io (db queries, etc) your older app may also gain from having better/more cached disks/faster disks – and more memory. (64bit java vm, 64 bit database, etc on machines with more than 4gb for example).

    -Mike

    Comment by mlee888 — March 22, 2007 @ 2:53 am



