Mike's Blog

April 21, 2006

Reverse Multithreading CPUs

Filed under: Software Development — mlee888 @ 6:06 pm

http://hardware.slashdot.org/article.pl?sid=06/04/18/206242

'Posted by ScuttleMonkey on Tuesday April 18, @06:26PM
from the quick-geordi-reverse-the-polarity dept.

microbee writes "The register is reporting that AMD is researching a new CPU technology called 'reverse multithreading', which essentially does the opposite of hyperthreading in that it presents multiple cores to the OS as a single-core processor." From the article: "The technology is aimed at the next architecture after K8, according to a purported company mole cited by French-language site x86 Secret. It's well known that two CPUs – whether two separate processors or two cores on the same die – don't generate, clock for clock, double the performance of a single CPU. However, by making the CPU once again appear as a single logical processor, AMD is claimed to believe it may be able to double the single-chip performance with a two-core chip or provide quadruple the performance with a quad-core processor." '

 

April 9, 2006

Java Performance in Dual Core/Multiprocessor Environment with Linux results

Filed under: Software Development — mlee888 @ 8:49 pm

April 9, 2006

 

I ran the tests on a Linux box over the weekend.  My home dual core machine and the 955EE box both use the Intel Matrix Storage RAID controller, and I have not found the time, nor been willing to take the chance, to fiddle with the drivers to get Linux booting on them without potentially killing my RAID configuration.

 

This is from a hyper-threading capable 2.8GHz machine with 1GB RAM and a single SATA disk.  StringBuffer showed similar results: without HT enabled, it ran a lot faster.  The "-server" option has a significant impact, as you might expect.   In general, however, the Vector results on Linux with the -server option are significantly faster than on Windows.    I am not quite sure why the ArrayList operations took a lot longer without HT than with HT.   I suspect it may have to do with garbage collection kicking in immediately after the Vector operations and skewing the ArrayList numbers.   The data file read without sync may be affected similarly.

 

In any case, these results, and more importantly real-world application observation, continue to suggest that you need to pay close attention to synchronized, or unintentionally synchronized, usage in your code when it runs in a multiprocessor environment versus a single processor environment.  If your application is single threaded, it is also very worthwhile to try this option:

-XX:+UseBiasedLocking

With this option, performance improved significantly for some uncontended monitor acquisitions, dramatically improving some of the test results.  With the additional -server option, the tests ran even faster.
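For reference, these are just flags to the java launcher; a typical invocation would look something like the line below (the test class name is only a placeholder, not the actual test):

    java -server -XX:+UseBiasedLocking PerfTest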

 

 

 

Linux – Java 1.5.0_06-b05

Fedora Core 4, kernel 2.6.11-1.1369_FC4smp

Intel 2.8GHz  (average over 5 runs)

 

All times are in seconds.

Test                                       With HT,       Without HT,    With HT,       Without HT,
                                           no -server     no -server     -server        -server
StringBuffer (bunch of appends)            10.415         2.006          5.858          0.776
StringBuilder (same op as StringBuffer)    0.917          0.911          0.617          0.4118
Vector operations                          18.383         17.691         2.857          2.320
ArrayList operations                       17.569         17.791         2.0045         10.088
data file (write) with sync                2.085          1.111          1.907          0.765
data file (write) without sync             1.004          2.470          0.868          0.865
data file (read) with sync                 5.211          1.007          1.860          0.777
data file (read) without sync              4.032          2.707          2.634          0.657

 

Windows XP x64 – 32-bit VM

Intel 955EE, 3.46GHz

-XX:+UseBiasedLocking

All times are in seconds.

Test                                       Dual Core + HT,   Dual Core + HT,
                                           without -server   with -server
StringBuffer (bunch of appends)            0.984             0.453
StringBuilder (same op as StringBuffer)    0.484             0.328
Vector operations                          9.609             6.25
ArrayList operations                       9.780             1.563
data file (write) with sync                1.421             1.281
data file (write) without sync             0.843             0.687
data file (read) with sync                 2.032             1.422
data file (read) without sync              1.484             0.625

 

 

 

March 31, 2006

Java Performance in Dual Core/Multiprocessor Environment

Filed under: Software Development — mlee888 @ 7:39 pm

By: Mike K Lee

March 30, 2006

The Shocker

The same Java program can actually run a lot faster on a single-CPU machine than on a multiprocessor/multi-core machine!!

The Story

I was recently lucky enough to have spec’ed myself a pretty nifty development machine at work. Here is the summary: Pentium D 955EE processor[1], 3GB of 667MHz DDR2, dual 74GB Western Digital Raptors @ 10K RPM configured in RAID 0, and an Nvidia Quadro FX1400 with dual DVI outputs. I know AMD still edges it out in many of the benchmarks[2], and yes, I know it was already obsolete even before I booted the machine up in February of 2006. With Intel’s announcements in March, Conroe and Woodcrest are only going to make that happen even sooner.[3]

 

In any case, this is not your grandma’s PC, and you would certainly think it should beat anything I have been using for the past 2 years. After all, this CPU is being touted as one of the fastest Intel CPUs – at least as of February 2006.

 

I quickly set up and replicated my development environment from my IBM T41 notebook[4]. The new box is fast.[5] In no time, I had installed Microsoft Visual Studio 2005, Eclipse 3.1.1 and Java 5 – and several previous versions of the VM – all under Windows XP x64.

 

I fired up a build and, not surprisingly, it built and ran the included tests successfully. A build with all the test runs takes some time, so I was paying close attention to how it performed on the new box.

 

I went to look at some of the performance metrics of the tests and immediately compared them to the build I had just done on my 2-year-old notebook[6]. I did not believe what I saw.

 

The new box was not any faster in many of the tests. Not only were the results no faster than the notebook’s[7], they were a lot slower!!! I was seeing processing speeds like 180K records/sec on the notebook compared to 130K records/sec on the new box.

 

This cannot be right!!

 

What the #*@*@?

Did I just waste $10,000[8] on four of these new machines?

Damn, I should have gotten an AMD box.

Is it Microsoft? Windows x64 ?

What the #*@*@? (many more times)

Something is wrong with the Java VM ?

 

This would not have been a very surprising result in a number of different scenarios. For instance, to use multiple CPUs optimally, threads need to be designed and coordinated properly so that one thread is not left waiting around for another (and I do not mean deadlock situations). I have also read that Hyper-Threading can be slower in some situations[9].

 

The test was not that complicated. It was in fact single threaded: it opens a file and does some very simple processing while reading it in its entirety.
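The actual test source was never posted, but the shape of it is roughly the loop below. This is only a minimal sketch: the record format and the "very simple processing" (here, counting newlines) are stand-ins.

    import java.io.BufferedInputStream;
    import java.io.FileInputStream;
    import java.io.IOException;

    public class ReadTest {
        public static void main(String[] args) throws IOException {
            long start = System.currentTimeMillis();
            // BufferedInputStream.read() is a synchronized method, so every
            // single-byte read acquires and releases the stream's monitor.
            BufferedInputStream in = new BufferedInputStream(new FileInputStream(args[0]));
            int records = 0;
            int b;
            while ((b = in.read()) != -1) {
                if (b == '\n') {
                    records++;   // stand-in for the "very simple processing"
                }
            }
            in.close();
            long elapsed = System.currentTimeMillis() - start;
            System.out.println(records + " records in " + elapsed + " ms");
        }
    }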

 

Although most benchmarks suggest Windows x64 has little overhead running 32-bit applications, maybe the 32-bit Java VM has a known issue with it. I contemplated repartitioning the drive and installing a regular XP to find out, but took a more efficient approach instead: I downloaded Sun’s latest Java 5 64-bit VM for Windows and ran the tests against it.

 

The good news is that the 64-bit VM did run faster, but it was still a lot slower than the notebook. What the #@(#@@? I did some more research on Windows x64 and by this time concluded that it had nothing to do with the OS[10].

 

I had ruled out one of many environmental factors, the OS. The same class files were being run on the notebook and the new box, so I naturally wanted to examine the parts of the environment that I could control. Sun has had a reputation for not producing the fastest VM, and there are in fact a number of other VMs available, including BEA JRockit® Java 5 (32- and 64-bit versions) and IBM’s.

 

There are also several performance tuning parameters available when starting the VM; the choice of garbage collection algorithm and strategy can affect performance as well[11]. I tried an endless array of VM options and different vendors’ VMs, but none of them was able to bring this new box up to the same speed as the notebook, let alone make it run faster.

 

Could there be an issue with dual cores? I turned off Hyper-Threading and the second core, effectively turning the new box into a single-CPU machine. With this done, performance surged! The new box turned in about 400K records/second, about double that of the notebook.

 

This was both good and bad news. It was great that the new box could finally run faster than the notebook. But come on, am I supposed to keep the second core disabled? I am sure this was not part of Intel’s plan! Nor mine!

 

By now, I had no choice but to dive into the code. I began to look at the call graphs with a profiler and eventually narrowed it down to a few blocks of code in BufferedInputStream that were taking up most of the time. The prime suspect was the ‘synchronized’ keyword, and as further tests would reveal, it was in fact the culprit.

 

With synchronized taken out of the BufferedInputStream, the read performance of the test on the new box skyrocketed to about 360K records/second with dual core enabled. With the "-server" option, it went up further to about 400K records/second. At last, this was about double the fastest performance I got from the notebook, which was 200K records/second with the -server option.
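One way to "take synchronized out" is simply to stop going through BufferedInputStream's synchronized read() for every byte and do the buffering yourself. The sketch below shows that idea; it is not the author's actual fix, just an unsynchronized equivalent of the read loop above.

    import java.io.FileInputStream;
    import java.io.IOException;

    public class UnsyncReadTest {
        public static void main(String[] args) throws IOException {
            long start = System.currentTimeMillis();
            FileInputStream in = new FileInputStream(args[0]);
            byte[] buf = new byte[8192];   // buffer the data ourselves, no monitor involved
            int records = 0;
            int n;
            while ((n = in.read(buf)) != -1) {
                for (int i = 0; i < n; i++) {
                    if (buf[i] == '\n') {
                        records++;         // same stand-in processing as before
                    }
                }
            }
            in.close();
            long elapsed = System.currentTimeMillis() - start;
            System.out.println(records + " records in " + elapsed + " ms");
        }
    }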

Test Results

 

In light of my findings, I put together a set of very rudimentary tests. These are not comprehensive tests.

 

 

32-bit Java 1.5.0_06-b05. New Box = Pentium D 955EE; Notebook = 1.7GHz Pentium M. All times are in seconds.

Test                                       New Box,       New Box,       New Box,         Notebook
                                           single core    dual core      dual core + HT
StringBuffer (bunch of appends)            1.427          5.780          5.829            2.123
StringBuilder (same op as StringBuffer)    0.470          0.484          0.469            0.961
Vector operations (adds & removes)         9.593          10.071         10.218           18.474
ArrayList operations (same op as Vector)   9.579          9.586          9.999            18.547
data file (write) with sync                0.894          1.438          1.563            4.687
data file (write) without sync             0.878          0.813          0.860            3.496
data file (read) with sync                 1.906          2.066          2.172            2.393
data file (read) without sync              1.750          1.580          1.922            2.163

 

 

64-bit VMs on the New Box, dual core + Hyper-Threading. All times are in seconds.

Test                                       BEA 64-bit VM    Sun 64-bit VM
StringBuffer (bunch of appends)            1.078            2.953
StringBuilder (same op as StringBuffer)    0.874            0.234
Vector operations                          4.859            2.328
ArrayList operations                       4.640            1.828
data file (write) with sync                1.687            1.624
data file (write) without sync             0.890            0.563
data file (read) with sync                 1.547            1.344
data file (read) without sync              1.187            1.062

 

Linux – Ubuntu AMD64 with 64-bit JVM, on the New Box (Pentium D 955EE). All times are in seconds. The Live-Boot run used the uniprocessor kernel, reading and writing against memory; the SMP kernel run could not be completed[12].

Test                                       Ubuntu AMD64 Live-Boot     SMP kernel[12]
                                           (uniprocessor kernel)
StringBuffer (bunch of appends)            0.827                      not run
StringBuilder (same op as StringBuffer)    0.288                      not run
Vector operations                          1.925                      not run
ArrayList operations                       8.215                      not run
data file (write) with sync                0.504                      not run
data file (write) without sync             0.398                      not run
data file (read) with sync                 0.535                      not run
data file (read) without sync              0.420                      not run

 

 

It is not news that synchronized has a negative performance impact, and you probably know it is best to use StringBuilder instead of StringBuffer whenever you can. The StringBuffer vs. StringBuilder comparison on the notebook, 2.123 seconds versus 0.961, is about what you would expect. What is really striking here is that the same StringBuffer code is much slower on a multi-core enabled machine!
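The "bunch of appends" test was roughly of the following form. This is a sketch, not the original source; the iteration count is made up.

    public class AppendTest {
        private static final int ITERATIONS = 10000000;   // made-up count

        static long timeStringBuffer() {
            long start = System.currentTimeMillis();
            StringBuffer sb = new StringBuffer();
            for (int i = 0; i < ITERATIONS; i++) {
                sb.append('x');   // synchronized append, acquires the monitor on every call
            }
            return System.currentTimeMillis() - start;
        }

        static long timeStringBuilder() {
            long start = System.currentTimeMillis();
            StringBuilder sb = new StringBuilder();
            for (int i = 0; i < ITERATIONS; i++) {
                sb.append('x');   // identical operation, no synchronization
            }
            return System.currentTimeMillis() - start;
        }

        public static void main(String[] args) {
            System.out.println("StringBuffer:  " + timeStringBuffer() + " ms");
            System.out.println("StringBuilder: " + timeStringBuilder() + " ms");
        }
    }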

 

I would not be surprised if the cost of acquiring a monitor in a multiprocessor environment is higher than in the single-CPU case. However, why would Vector behave so well compared to ArrayList? Maybe I am missing something here; I have not had the time to investigate further. If you have any ideas, feel free to comment.
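For reference, the Vector/ArrayList tests were described only as "adds & removes"; they would have been something along these lines (again a sketch, with a made-up list size):

    import java.util.ArrayList;
    import java.util.List;
    import java.util.Vector;

    public class ListTest {
        private static final int SIZE = 1000000;   // made-up size

        static long time(List<Integer> list) {
            long start = System.currentTimeMillis();
            for (int i = 0; i < SIZE; i++) {
                list.add(i);        // Vector.add() is synchronized, ArrayList.add() is not
            }
            for (int i = SIZE - 1; i >= 0; i--) {
                list.remove(i);     // remove from the end so each removal is O(1)
            }
            return System.currentTimeMillis() - start;
        }

        public static void main(String[] args) {
            System.out.println("Vector:    " + time(new Vector<Integer>()) + " ms");
            System.out.println("ArrayList: " + time(new ArrayList<Integer>()) + " ms");
        }
    }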

 

You should note that the numbers above are merely averages of several runs (more than 1 but fewer than 10); they are not meant to establish absolute performance differences, only relative comparisons within the same run. Naturally, they were taken while the system was otherwise in a steady, idle state.
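The kind of harness implied here is just a warm-up pass followed by a handful of timed runs. A sketch (the run count and the body are arbitrary):

    public class Harness {
        // A single test iteration; plug in any of the tests above.
        interface Test { void run(); }

        static double average(Test test, int runs) {
            test.run();   // warm-up pass so JIT compilation settles before timing
            long total = 0;
            for (int i = 0; i < runs; i++) {
                long start = System.currentTimeMillis();
                test.run();
                total += System.currentTimeMillis() - start;
            }
            return total / (double) runs;
        }

        public static void main(String[] args) {
            double avg = average(new Test() {
                public void run() {
                    StringBuilder sb = new StringBuilder();
                    for (int i = 0; i < 1000000; i++) sb.append('x');
                }
            }, 5);   // 5 runs, matching the Linux table's note
            System.out.println("average: " + avg + " ms");
        }
    }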

Recommendation

The answer, as usual, is: “it depends.”  For applications where performance is not a factor, you are unlikely to need to worry about any of this; after all, reading an 80MB file takes only around 2 seconds.  However, if performance is important to you (those extra milliseconds here and there mean better response time for your users, or, as in my case, 2 hours of processing time instead of 5), then paying close attention to where you might be implicitly using synchronized is very worthwhile.

 

Based on my observations, it is not clear that synchronized alone is the cause of the performance bottleneck. Given that the same kind of synchronized code is used in both Vector and StringBuffer, one would expect the performance of Vector to be worse than that of ArrayList. The way the JIT works, the surrounding code stream and the execution pattern around the synchronized block are also likely keys to performance. Different VM implementations also show significant performance differences on the exact same hardware, as the BEA vs. Sun comparison shows. I have yet to try IBM’s Java 5 VM, but based on past experience I expect it to be better than Sun’s. The 64-bit VM also seems to be noticeably faster for my application, without any code change.

 

None of the tests here examines the effect of multiple threads contending for the same resources. Nonetheless, common sense suggests that you should avoid such resource sharing as much as possible anyway.

 

The important thing is that before you introduce parallel processing, it would be wise to know the performance characteristics of your application on both a single-CPU machine and a multiprocessor machine.

General Multi-Processing and Threading

 

When you look at what is available on Dell’s or HP’s websites, you almost cannot buy an x86-based[13] computer with a single logical CPU these days. Whether it is through plain old multiple physical CPUs, Hyper-Threading[14], or multi-core CPUs, I do not need to convince you that this is, or will become, the most common environment you will run into.

 

Even on a single-CPU box, multithreading can significantly enhance performance and simplify programming. Much like multitasking at the OS level, a multithreaded application can take advantage of idle time[15] to do other tasks. Imagine you are asked to ping 100,000 hosts. Each ping is a test to see whether the host is alive or not, and can take anywhere from milliseconds to several seconds. Pinging all 100,000 hosts sequentially would take a very long time. While the program is waiting for the response to one ping, separate threads can be sending others; with a pool of threads for pinging, you end up doing a lot less waiting. In short, threads can effectively reduce idle CPU time spent waiting for IO.
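A thread-pool version of that ping example might look like this in Java 5. It is only a sketch: the pool size, timeout, and host list are arbitrary choices for illustration.

    import java.net.InetAddress;
    import java.util.List;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.TimeUnit;

    public class PingAll {
        public static void pingAll(List<String> hosts) throws InterruptedException {
            // 50 worker threads; while one thread waits on a slow host,
            // the others keep pinging instead of sitting idle.
            ExecutorService pool = Executors.newFixedThreadPool(50);
            for (final String host : hosts) {
                pool.execute(new Runnable() {
                    public void run() {
                        try {
                            boolean alive = InetAddress.getByName(host).isReachable(2000);
                            System.out.println(host + (alive ? " is alive" : " is not responding"));
                        } catch (Exception e) {
                            System.out.println(host + " failed: " + e);
                        }
                    }
                });
            }
            pool.shutdown();                        // accept no new tasks
            pool.awaitTermination(1, TimeUnit.HOURS);   // wait for outstanding pings
        }
    }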

 

In a multiprocessor environment, it becomes physically possible for your application to do parallel processing – doing more than one thing at a time. This is in contrast to a single processor environment, where there is only the appearance of parallel processing, supported by preemptive multitasking.[16] Multithreaded programming models that leverage this true parallel processing capability are generally not trying to minimize idleness but rather to squeeze out as many raw CPU cycles as possible. Invariably, your threads are likely to need some common resource, and this is of course where you want to pay close attention.

 

Your comments are welcome. I have not had time to post the simple source code for the tests, but if you are interested, I can email it to you. See the email address at the end.

 

About Me

Email: Email Address

 

Director of Advanced Development and Chief Architect

www.evidentsoftware.com.

 

 


[1] Pentium D 955EE is the latest Extreme Edition using the 65nm manufacturing process, dual core, Hyper-threading capable, 3.46GHz, 1066MHz FSB, with 2x 2MB L2 Cache: http://www.tomshardware.com/2005/12/28/intels_65_nm_process_breathes_fire_into_double_core_extreme_edition/

 

[2] The Athlon 64 X2 4800+ and dual-core FX-60 series beat the Intel D 955EE in a number of benchmarks. http://www.tomshardware.com/2005/12/28/intels_65_nm_process_breathes_fire_into_double_core_extreme_edition/page23.html

 

[3] http://www.tomshardware.com/2006/03/13/idf_spring_2006/

 

[4] The T41 Notebook has a Pentium M 1.7GHz, 60 GB 7200 RPM hard drive, and 1GB RAM

[5] Yes, I know you can get a faster box still. But for about $2500, it was a great deal.

[6] Do not quote me on the age of the notebook; it may only be a year old, but it feels like 2 or 3 years old for sure.

[7] The T41 Notebook has a Pentium M 1.7GHz, 60 GB 7200 RPM hard drive, and 1GB RAM

[8] We got 4 of these, at roughly $2500 each.

[9] Hyper-Threading speeds Linux Multiprocessor performance on a single processor http://www-128.ibm.com/developerworks/linux/library/l-htl/

[10] 64-bit vs. 32-bit Windows, http://www.extremetech.com/article2/0,1697,1857522,00.asp

 

[11] In particular, there are a number of new garbage collector options available for multiprocessor machines

[12] Unable to run this test – I did not get a chance to find or install a Linux distribution that supports the Intel Matrix Storage RAID driver, and I have not found a “Live” CD/image with SMP support.

[13] Market leaders in the x86 platform are Intel and AMD, with the Pentium 4, Pentium M, Xeon, Core Duo, Athlon 64, Sempron, Turion 64, and so on. http://en.wikipedia.org/wiki/X86

[14] http://arstechnica.com/articles/paedia/cpu/hyperthreading.ars/1

[15] Common idle time occurs when the program has to wait for IO to complete

[16] Preemptive multitasking switches the CPU context from one task to another many times each second, producing the effect of parallel processing.

 

Other References

http://www.research.ibm.com/journal/sj/391/christ.html – even though dated, it is still a very good article.

 

D. Bacon, R. Konuru, C. Murthy, and M. Serrano, “Thin Locks: Featherweight Synchronization for Java,” ACM Conference on Programming Language Design and Implementation, Montreal, Canada (June 17-19, 1998).

 

Jon Stokes, Introduction to Multithreading, Superthreading and Hyperthreading, http://arstechnica.com/articles/paedia/cpu/hyperthreading.ars/1

 

http://www-128.ibm.com/developerworks/eserver/library/es-JavaVirtualMachinePerformance.html

Java theory and practice: More flexible, scalable locking in JDK 5.0 http://www-128.ibm.com/developerworks/java/library/j-jtp10264/

 

http://www-128.ibm.com/developerworks/java/library/j-jalapeno/

 
