Distributed.net Client Comparison for Different CPU Architectures


Below is an email I wrote to a friend after looking at distributed.net client
performance relative to cpu clock.
A note for the reader: the names used below are those of computers in my
house. Rabbit is a 200MHz Pentium Pro running Win95, roo is a 133MHz mobile
pentium running Win95, owl is a 60MHz pentium running linux, comp apps is a
lab of 133MHz pentiums running Win95, pooh is a 166MHz pentium running linux
(two processors, but these tests only look at one cpu), piglet is a 40MHz
sun4c (ss2) running OpenBSD, heff[alumps] is a 250MHz R4K in an Indigo2
running Irix 6.5.6, and kanga is a Mac IIcx running macos 7.0.1 (the cpu is
20-30MHz, I believe). After the message is a table of rc5-64/des/csc/ogr
performance comparisons of these computers for those interested.

If anyone has other systems to add to this, or has insight into the client
performance differences, I'd love to hear from you!

Thanks and enjoy,
Chris Frost, chris@frostnet.net

-------------------------------------------------------------------------------

Took a look at the 133mhz pentium logs in comp apps today. They are
finishing 2^28 blocks in around 25 minutes. Each processor on pooh takes
slightly over 30. How is that??
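
To put a number on that puzzle, here is a rough sketch that converts the two
block times into clock cycles spent per block. It only uses the clock speeds
and times quoted in this page; the function name is made up for illustration,
and pooh's "slightly over 30" is rounded down to 30, so its figure is a lower
bound. The block size cancels out of the comparison.

```python
# Clock cycles spent per block = clock rate (Hz) * time per block (seconds).
# The block size is the same on both machines, so it cancels out.
def cycles_per_block(mhz, minutes):
    return mhz * 1e6 * minutes * 60

comp_apps = cycles_per_block(133, 25)  # around 25 minutes per block
pooh = cycles_per_block(166, 30)       # "slightly over 30", so a lower bound

ratio = pooh / comp_apps
print(f"comp apps: {comp_apps:.3e} cycles per block")
print(f"pooh:      {pooh:.3e} cycles per block")
print(f"pooh spends at least {ratio:.2f}x the cycles of comp apps")
```

In other words, pooh burns roughly half again as many clock cycles on the
same amount of work, despite being the same cpu family.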

Did notice these interesting statistics when looking at our computers (in
order from most efficient to least):

450 p3:    1300kkeys / 1sec / 450mhz   = 2.889   kkeys/sec/mhz
rabbit:     396kkeys / 1sec / 200mhz   = 1.98    kkeys/sec/mhz
roo:        191kkeys / 1sec / 133mhz   = 1.436   kkeys/sec/mhz
owl:         85kkeys / 1sec / 60mhz    = 1.4167  kkeys/sec/mhz
comp apps: ~140kkeys / 1sec / 133mhz   = ~1.05   kkeys/sec/mhz
piglet:      32kkeys / 1sec / 40mhz    = 0.8     kkeys/sec/mhz
pooh:       115kkeys / 1sec / 166mhz   = 0.69277 kkeys/sec/mhz (and two processors)
heff:       150kkeys / 1sec / 250mhz   = 0.6     kkeys/sec/mhz
kanga:        4kkeys / 1sec / 20-30mhz = 0.2-0.133 kkeys/sec/mhz

The 450MHz pentium three is the most efficient, since the code is using mmx
instructions there, of course. Rabbit (pentium pro) is the second most
efficient, which I would expect.
However, roo, a box running windows, comes in third. Very odd. Owl comes
in ahead of comp apps (which is pretty amazing), and you can see the rest.
I would think the comp apps numbers may be off (same cpu as roo, but a
desktop rather than mobile part), so I'll have to recheck this.
Anyway, the efficiency of roo/comp apps vs pooh is very interesting.
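
For anyone who wants to redo the arithmetic or add their own machine, a
minimal sketch of the calculation follows. The rates and clocks are the ones
listed above; the `efficiency` helper and the tuple layout are just
illustrative choices. Kanga is left out because its clock is only known as a
20-30MHz range.

```python
# (name, kkeys/sec, clock in MHz), numbers taken from the list above.
machines = [
    ("450 p3", 1300, 450),
    ("rabbit", 396, 200),
    ("roo", 191, 133),
    ("owl", 85, 60),
    ("comp apps", 140, 133),  # the rate is approximate
    ("pooh", 115, 166),
    ("piglet", 32, 40),
    ("heff", 150, 250),
]

def efficiency(kkeys_per_sec, mhz):
    """kkeys per second delivered per MHz of clock."""
    return kkeys_per_sec / mhz

# Sort from most efficient to least.
ranked = sorted(machines, key=lambda m: efficiency(m[1], m[2]), reverse=True)
for name, rate, mhz in ranked:
    print(f"{name:10s} {efficiency(rate, mhz):.3f}")
```

Sorting this way puts piglet (0.8) ahead of pooh (about 0.69), with heff
last among the machines whose clock is known.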

Near the bottom you see heffalumps. The R4K was out before the pentium, so
I can see it being less efficient (though at csc it smokes the rest of the
boxes in single cpu mode, which is most interesting; more on that later).
(Update) After looking into the rc5 core, I remembered that rc5
is very dependent on rotates. MIPS cpus of this vintage have no rotate
instruction, only shifts, so work that takes one instruction on a cpu with a
rotate instruction (like intel) has to be built from a pair of shifts and an
or. With this in mind, it seems the mips cpus actually do fairly well
considering. I assume alphas are the same in this respect (see the table at
the bottom of this page). (End Update)
Heff is even less efficient than piglet, though (and piglet's cpu is a bit
older). Could you test your p3 and R10k (and R12k if you could) and give me
the stats?
(Update) I talked with the mips-maintainer for the dnet client and he
has been working on optimizations for R4k-class cpus and dual-pipelined
processors (R1xK and R8k), though I'm not sure if any of this work is
in the mainstream client. (End Update)
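
To illustrate the rotate point above, here is a small sketch of a 32-bit
rotate written with nothing but shifts and an or, which is roughly what a cpu
without a rotate instruction has to do. This is not code from the dnet
client; the function names are invented for the example. The half-round
follows the published RC5 encryption step, A = ((A xor B) <<< B) + S.

```python
MASK32 = 0xFFFFFFFF  # rc5-64 works on 32-bit words

def rotl32(x, n):
    """Rotate the 32-bit word x left by n bits using only shifts and an or."""
    x &= MASK32
    n &= 31                    # the rotate amount is taken modulo the word size
    left = (x << n) & MASK32   # shift 1: bits that stay inside the word
    right = x >> (32 - n)      # shift 2: bits that wrap around to the bottom
    return left | right        # or: glue the two pieces together

def rc5_half_round(a, b, subkey):
    """One half-round of RC5 encryption: A = ((A xor B) <<< B) + S."""
    return (rotl32(a ^ b, b) + subkey) & MASK32

print(hex(rotl32(0x12345678, 8)))
print(hex(rc5_half_round(1, 2, 3)))
```

On a cpu with a rotate instruction the body of rotl32 collapses to a single
instruction. Without one it costs two shifts and an or, plus a subtract to
get 32 - n when the rotate amount is not known in advance, and rc5 does this
in every round.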

Another thing to keep in mind for non-intel (and *esp* mips) platforms is
that the intel cores are optimized like nothing else; I'd be surprised to see
very much mips optimization at all, beyond whatever the mips compiler does (if
they aren't using gcc). This accounts for a great deal, I imagine (look at the
mmx speedups! I don't think mips R4k speedups would be as much, but it
would help. R1xK speedups would be even better than p3 -> p3 mmx, I think,
though). Another thing to keep in mind: MIPS was designed with the following
notion in mind: "We can optimize better at the compiler stage than at the
execution stage." Thus, mips is extremely heavily dependent on the compiler's
optimizations. This eases cpu design some, allows for cheaper cpus, and
allows for greater cpu performance; however, without a good compiler, things
don't do as well as planned. I'm not sure about this, but I believe that
the mips clients are compiled with gcc. gcc is a great compiler, don't get me
wrong, but it hasn't had as much work put into optimizations for mips cpus as
it has for intel, and mips needs that work even more. gcc vs MipsPro might not
make much, or any real, difference, but it would be interesting to see
nonetheless.


Anyway, back to csc, which isn't as heavily optimized as rc5-64 (same
measure as above, key rate divided by clock):

pooh:    0.9036
roo:     0.8947
rabbit:  0.7400
heff:    0.7000
piglet:  0.375

Pooh beats out everyone else, but roo is almost there. Rabbit is further
back (the lack of optimizations is showing here), heffalumps is just
barely behind it, and piglet just stinks. From this, assuming there
aren't any cpu arch differences causing speed changes (which there are,
but I have no idea by how much, or even in which direction!), you can kind
of get a glimpse of where computers stand with less optimization. I know
there is still *significant* intel optimization, and quite a bit of
p6-core optimization, but it's less than for rc5. It's interesting that heff
almost catches rabbit in efficiency; if there were a mips-optimized (rather
than just ported) csc client, things would be interesting, esp an R1xK
client. Piglet had a very poor port, I think (and this would support that).
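
As a rough way to see how the order shifts between the two contests, the
sketch below ranks the five machines that appear in both lists and prints how
far each one moves. The efficiency figures are copied from the two lists
above; the `ranking` helper is illustrative only. Key rates for rc5 and csc
are not comparable in absolute terms, so only the positions are compared.

```python
# Efficiency figures for the machines that appear in both lists above.
rc5 = {"rabbit": 1.98, "roo": 1.436, "piglet": 0.8, "pooh": 0.69277, "heff": 0.6}
csc = {"pooh": 0.9036, "roo": 0.8947, "rabbit": 0.7400, "heff": 0.7000, "piglet": 0.375}

def ranking(scores):
    """Machine names ordered from most to least efficient."""
    return sorted(scores, key=scores.get, reverse=True)

rc5_rank = ranking(rc5)
csc_rank = ranking(csc)

for name in csc_rank:
    before = rc5_rank.index(name) + 1
    after = csc_rank.index(name) + 1
    # A positive move means the machine places better at csc than at rc5.
    print(f"{name:7s} rc5 #{before} -> csc #{after} ({before - after:+d})")
```

Pooh gains the most places and rabbit and piglet lose the most, which is at
least consistent with the idea that the rc5 cores for some platforms have had
far more tuning than the csc ones.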

Anyway, I'd love to see some R1xK numbers if you could get any for
rc5/csc. There aren't really any optimizations, but the arch itself will
help a bit. Also, I wonder if the mips clients were compiled with mipspro
or gcc? If the latter, I wonder what mipspro could do?

hope you feel better,
--
Chris Frost  |  http://www.frostnet.net/chris/
-------------+----------------------------------
Public PGP Key:
   Email chris@frostnet.net with the subject "retrieve pgp key"
   or visit http://www.frostnet.net/chris/about/pgp_key.phtml

-------------------------------------------------------------------------------





Further Note: fly is a dual ev6 alpha running OSF/1. As soon as I know the
speed of each cpu I'll add calculations for this client as well.


Last Updated: $Date: 2000/01/27 01:38:59 $, Version: $Revision: 1.4 $