CubeHash software
CubeHash512 is as fast as (and often faster than)
both SHA-256 and SHA-512 on typical CPUs.
Here are some examples of publicly verifiable CubeHash512 speeds measured by the
eBASH project:
CubeHash512 (cycles/byte, long messages) | SHA-256 (cycles/byte, long messages) | SHA-512 (cycles/byte, long messages) | Machine | CubeHash512 implementation |
~9.26 | ~17.99 | 11.20 | amd64, 2401MHz, Intel Xeon E5620 (206c2), 2010, giant4 | amd64-2 |
10.23 | 15.26 | 10.26 | amd64, 2833MHz, Intel Core 2 Quad Q9550 (10677), 2008, berlekamp | amd64-2 |
12.03 | 15.06 | 11.35 | amd64, 3200MHz, AMD Phenom II X4 955 (100f42), 2009, morningstar | amd64-2 |
12.27 | 21.17 | 54.10 | ppc32, 533MHz, Motorola PowerPC G4 7410, 2001, gggg | ppcaltivec |
12.29 | 20.65 | 13.32 | ppc64, 1800MHz, IBM PowerPC G5 970, 2003?, gcc40 | ppcaltivec |
12.39 | 15.36 | 11.71 | amd64, 2405MHz, Intel Core 2 Quad Q6600 (6fb), 2007, utrecht | amd64-2 |
13.01 | 15.57 | 17.51 | x86, 2405MHz, Intel Core 2 Quad Q6600 (6fb), 2007, utrecht | x86xmm |
15.04 | 40.21 | 41.86 | cellspu, 3192MHz, Cell (PS3), 2006, nmi0249 | cellspu |
23.45 | 32.35 | 54.82 | x86, 1667MHz, Intel Atom N280 (106c2), 2009, slim | x86xmm |
27.81 | ~22.31 | ~89.50 | armeabi, 800MHz, Freescale i.MX515 (v7-A, Cortex A8), 2009, gcc33 | armneon |
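To make the cycles/byte figures easier to compare across machines with different clock speeds, they can be converted to approximate throughput. The sketch below is illustrative only and is not part of eBASH or SUPERCOP. It copies three rows from the table above, treats entries marked "~" as plain numbers, and assumes the listed clock frequency is the sustained frequency during hashing. The function names `throughput_mb_per_s` and `slowdown_vs_cubehash` are hypothetical helpers introduced here.

```python
# Illustrative sketch: convert cycles/byte from the table above into
# approximate long-message throughput, and compare SHA-2 against CubeHash512.
# Assumption: the clock frequency in the table is the sustained frequency.

# (machine, clock in MHz, CubeHash512, SHA-256, SHA-512), all in cycles/byte.
ROWS = [
    ("Intel Xeon E5620", 2401, 9.26, 17.99, 11.20),
    ("Intel Atom N280", 1667, 23.45, 32.35, 54.82),
    ("Freescale i.MX515 (Cortex A8)", 800, 27.81, 22.31, 89.50),
]


def throughput_mb_per_s(clock_mhz, cycles_per_byte):
    """Return throughput in MB/s (1 MB = 10**6 bytes).

    MHz is 10**6 cycles/second, so MHz divided by cycles/byte
    gives 10**6 bytes/second.
    """
    return clock_mhz / cycles_per_byte


def slowdown_vs_cubehash(other_cpb, cubehash_cpb):
    """Return how many times slower another hash is than CubeHash512.

    Values below 1.0 mean the other hash is faster on that machine.
    """
    return other_cpb / cubehash_cpb


for machine, mhz, cube, sha256, sha512 in ROWS:
    print(
        f"{machine}: CubeHash512 {throughput_mb_per_s(mhz, cube):.1f} MB/s, "
        f"SHA-256 is {slowdown_vs_cubehash(sha256, cube):.2f}x, "
        f"SHA-512 is {slowdown_vs_cubehash(sha512, cube):.2f}x the CubeHash512 cost"
    )
```

Ratios above 1.0 mark machines where CubeHash512 is faster. The Cortex A8 row shows both effects at once: SHA-256 is somewhat cheaper than CubeHash512 there, while SHA-512 costs more than three times as much.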
Many SHA-3 candidates resemble SHA-512 in providing inconsistent performance:
they slow down dramatically on, for example,
the 32-bit Intel Atom CPUs used in many netbooks
and the 32-bit Apple A4 (ARM Cortex A8) CPU used in the iPad and iPhone.
CubeHash512 provides consistently high performance.
CubeHash512 ranks between #1 and #6 in software performance
among second-round SHA-3 candidates on every modern CPU,
for both long and short messages.
The CubeHash512 implementations mentioned above, and several additional implementations,
are distributed as part of the
SUPERCOP benchmarking framework,
illustrating different aspects of CubeHash optimization:
Portability | Implementation | Word size | Description |
portable | simple | 32 bits | The recommended introduction to CubeHash for programmers. Includes compatibility layer for the NIST API. |
portable | spec | 32 bits | A more verbose version closely tracking the CubeHash specification. Includes compatibility layer for the NIST API. |
portable | bitsliced | 16 bits | Different implementation strategy, useful in some contexts. |
portable | hardware2 | 32 bits | 2-cycle-per-round hardware implementation strategy. The recommended introduction to CubeHash for VHDL/Verilog programmers. |
portable | hardware4 | 16 bits | 4-cycle-per-round hardware implementation strategy. |
portable | hardware8 | 8 bits | 8-cycle-per-round hardware implementation strategy. |
portable | hardware16 | 4 bits | 16-cycle-per-round hardware implementation strategy. |
portable | hardware32 | 2 bits | 32-cycle-per-round hardware implementation strategy. |
portable | unrolled | 32 bits | Similar to simple, plus full unrolling of the inner loops; 32 separate 32-bit variables instead of an array. Includes compatibility layer for the NIST API. |
portable | unrolled2 | 32 bits | Similar to unrolled, plus 2-way unrolling of the main loop. Includes compatibility layer for the NIST API. |
portable | unrolled3 | 32 bits | Similar to unrolled2, but smaller code. |
portable | unrolled4 | 32 bits | Similar to unrolled3, but different scheduling. |
portable | unrolled5 | 32 bits | Similar to unrolled4, but different scheduling. |
amd64, x86 with SSE | mmintrin | 64 bits | Vectorized implementation using MMX registers and PSHUFW. |
amd64, x86 with SSE3 | emmintrin4 | 128 bits | Vectorized implementation using XMM registers. Includes compatibility layer for the NIST API. |
amd64, x86 with SSE3 | emmintrin5 | 128 bits | Similar to emmintrin4, but smaller code. |
amd64 | amd64-32 | 32 bits | Similar to unrolled3, plus improved instruction scheduling. |
amd64 | amd64 | 128 bits | Similar to emmintrin5, plus improved instruction scheduling. |
amd64 | amd64-2 | 128 bits | Similar to amd64, plus improved instruction scheduling. |
amd64 with AVX | amd64avx | 128 bits | Similar to amd64-2, plus some use of AVX instructions. Untested! |
x86 | x86 | 32 bits | Similar to unrolled3, plus improved instruction scheduling. |
x86 with SSE3 | x86xmm | 128 bits | Similar to amd64-2, but for x86 instead of amd64. |
armeabi | arm | 32 bits | Similar to unrolled3, plus improved instruction scheduling. |
armeabi with NEON | armneon | 128 bits | Vectorized implementation using NEON registers. |
ia64 | precompiled/ia64 | 32 bits | Precompiled version of unrolled3. |
mipso32 | mipso32 | 32 bits | Similar to unrolled3, plus improved instruction scheduling. |
mips32, mips64 | mips64 | 32 bits | Similar to unrolled3, plus improved instruction scheduling. |
ppc32, ppc64 with AltiVec | ppcaltivec | 128 bits | Vectorized implementation using AltiVec registers. |
ppc32 | ppc32 | 32 bits | Similar to unrolled3, plus improved instruction scheduling. |
ppc64 under Linux | ppc64 | 32 bits | Similar to unrolled3, plus improved instruction scheduling. |
ppc64 under AIX | ppc64aix | 32 bits | Similar to ppc64, with tweaks needed to compile under AIX. |
cellspu | cellspu | 128 bits | Vectorized implementation using SPE registers. |
sparcv9 | sparcv9 | 32 bits | Similar to unrolled3, plus improved instruction scheduling. |
Version
This is version 2010.12.03 of the software.html web page.