Clarify commentary in sha512-sparcv9.pl.

author: Andy Polyakov <appro@openssl.org> 2007-09-26 12:16:32 +0000
committer: Andy Polyakov <appro@openssl.org> 2007-09-26 12:16:32 +0000
commit: 79fe664f19489870ae24b4d0e11830a7624ca9d6
tree: 5e479f086d732a14b36534617ab0f2e7e75a3101
parent: 5f0477f47b470e15683f054dc98f0cf20179dc7d
download: openssl-79fe664f19489870ae24b4d0e11830a7624ca9d6.tar.gz
1 files changed, 14 insertions, 6 deletions
diff --git a/crypto/sha/asm/sha512-sparcv9.pl b/crypto/sha/asm/sha512-sparcv9.pl
index 25f80390ac..fa276d258b 100644
--- a/crypto/sha/asm/sha512-sparcv9.pl
+++ b/crypto/sha/asm/sha512-sparcv9.pl
@@ -17,7 +17,7 @@
 # Performance is >75% better than 64-bit code generated by Sun C and
 # over 2x than 32-bit code. X[16] resides on stack, but access to it
 # is scheduled for L2 latency and staged through 32 least significant
-# bits of %l0-%l7. The latter is done to achieve 32-/64-bit bit ABI
+# bits of %l0-%l7. The latter is done to achieve 32-/64-bit ABI
 # duality. Nevetheless it's ~40% faster than SHA256, which is pretty
 # good [optimal coefficient is 50%].
 #
@@ -25,14 +25,22 @@
 #
 # It's not any faster than 64-bit code generated by Sun C 5.8. This is
 # because 64-bit code generator has the advantage of using 64-bit
-# loads to access X[16], which I consciously traded for 32-/64-bit ABI
-# duality [as per above]. But it surpasses 32-bit Sun C generated code
-# by 60%, not to mention that it doesn't suffer from severe decay when
-# running 4 times physical cores threads and that it leaves gcc [3.4]
-# behind by over 4x factor! If compared to SHA256, single thread
+# loads(*) to access X[16], which I consciously traded for 32-/64-bit
+# ABI duality [as per above]. But it surpasses 32-bit Sun C generated
+# code by 60%, not to mention that it doesn't suffer from severe decay
+# when running 4 times physical cores threads and that it leaves gcc
+# [3.4] behind by over 4x factor! If compared to SHA256, single thread
 # performance is only 10% better, but overall throughput for maximum
 # amount of threads for given CPU exceeds corresponding one of SHA256
 # by 30% [again, optimal coefficient is 50%].
+#
+# (*)	Unlike pre-T1 UltraSPARC loads on T1 are executed strictly
+#	in-order, i.e. load instruction has to complete prior next
+#	instruction in given thread is executed, even if the latter is
+#	not dependent on load result! This means that on T1 two 32-bit
+#	loads are always slower than one 64-bit load. Once again this
+#	is unlike pre-T1 UltraSPARC, where, if scheduled appropriately,
+#	2x32-bit loads can be as fast as 1x64-bit ones.
 
 $bits=32;
 for (@ARGV)	{ $bits=64 if (/\-m64/ || /\-xarch\=v9/); }
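
The footnote added by this commit compares one 64-bit load of X[i] with two 32-bit loads of its halves. A minimal C sketch of those two access patterns follows. It is not code from sha512-sparcv9.pl, which emits SPARC assembly; the names load_1x64, load_2x32 and check_loads_agree are hypothetical, and the byte-order probe is there only so the sketch also runs on little-endian hosts. It shows that both forms produce the same value and says nothing about timing, since a C compiler is free to merge or split the loads.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* One 64-bit load of X[i]: the form a 64-bit code generator can use. */
static uint64_t load_1x64(const void *p)
{
    uint64_t v;
    memcpy(&v, p, sizeof v);
    return v;
}

/* Two 32-bit loads of the same X[i], merged into one 64-bit value. */
static uint64_t load_2x32(const void *p)
{
    const unsigned char *b = p;
    const uint16_t probe = 1;
    /* Assumption for portability: detect host byte order at run time. */
    int little_endian = *(const unsigned char *)&probe;
    uint32_t first, second;

    memcpy(&first, b, sizeof first);       /* word at offset 0 */
    memcpy(&second, b + 4, sizeof second); /* word at offset 4 */

    /* On big-endian hosts (e.g. SPARC) the first word is the high half. */
    return little_endian ? ((uint64_t)second << 32) | first
                         : ((uint64_t)first << 32) | second;
}

/* Self-check: both access patterns must yield the same 64-bit value. */
static void check_loads_agree(void)
{
    uint64_t x = UINT64_C(0x0123456789abcdef);
    assert(load_2x32(&x) == load_1x64(&x));
}
```

Per the footnote, the second form costs two strictly in-order loads on T1, whereas on pre-T1 UltraSPARC it can be scheduled to match the single 64-bit load.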