CubeHash: a simple hash function

CubeHash software

CubeHash512 is as fast as (and often faster than) both SHA-256 and SHA-512 on typical CPUs. Here are some examples of publicly verifiable CubeHash512 speeds measured by the eBASH project:

Cycles/byte for long messages
CubeHash512	SHA-256	SHA-512	Machine	CubeHash512 implementation
~9.26	~17.99	11.20	amd64, 2401MHz, Intel Xeon E5620 (206c2), 2010, giant4	amd64-2
10.23	15.26	10.26	amd64, 2833MHz, Intel Core 2 Quad Q9550 (10677), 2008, berlekamp	amd64-2
12.03	15.06	11.35	amd64, 3200MHz, AMD Phenom II X4 955 (100f42), 2009, morningstar	amd64-2
12.27	21.17	54.10	ppc32, 533MHz, Motorola PowerPC G4 7410, 2001, gggg	ppcaltivec
12.29	20.65	13.32	ppc64, 1800MHz, IBM PowerPC G5 970, 2003?, gcc40	ppcaltivec
12.39	15.36	11.71	amd64, 2405MHz, Intel Core 2 Quad Q6600 (6fb), 2007, utrecht	amd64-2
13.01	15.57	17.51	x86, 2405MHz, Intel Core 2 Quad Q6600 (6fb), 2007, utrecht	x86xmm
15.04	40.21	41.86	cellspu, 3192MHz, Cell (PS3), 2006, nmi0249	cellspu
23.45	32.35	54.82	x86, 1667MHz, Intel Atom N280 (106c2), 2009, slim	x86xmm
27.81	~22.31	~89.50	armeabi, 800MHz, Freescale i.MX515 (v7-A, Cortex A8), 2009, gcc33	armneon
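
The cycles/byte figures above can be turned into approximate single-core throughput by dividing the clock rate by the cost per byte (MHz divided by cycles/byte gives millions of bytes per second). The following is a minimal Python sketch of that arithmetic. The function name `throughput_mb_per_s` is made up for illustration and is not part of eBASH or SUPERCOP, and the result is only a rough long-message estimate for one core.

```python
# Rough conversion from eBASH-style cycles/byte to single-core throughput.
# Assumption: the CPU runs at the listed clock rate and the message is long
# enough that per-message overhead is negligible.

def throughput_mb_per_s(clock_mhz, cycles_per_byte):
    """Return approximate throughput in MB/s (1 MB = 10**6 bytes)."""
    if cycles_per_byte <= 0:
        raise ValueError("cycles_per_byte must be positive")
    # (10**6 cycles/s) / (cycles/byte) = 10**6 bytes/s
    return clock_mhz / cycles_per_byte

# A few CubeHash512 rows copied from the table above:
# (machine, clock in MHz, CubeHash512 cycles/byte)
rows = [
    ("Intel Core 2 Quad Q9550", 2833, 10.23),
    ("Cell (PS3)", 3192, 15.04),
    ("Intel Atom N280", 1667, 23.45),
]

for machine, mhz, cpb in rows:
    print(f"{machine}: about {throughput_mb_per_s(mhz, cpb):.0f} MB/s")
```

The same function can be applied to the SHA-256 and SHA-512 columns to compare the hash functions on one machine in throughput terms.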

Like SHA-512, many SHA-3 candidates perform inconsistently across platforms: for example, they slow down dramatically on the 32-bit Intel Atom CPUs used in many netbooks and on the 32-bit Apple A4 (ARM Cortex A8) CPU used in the iPad and iPhone. CubeHash512 provides consistently high performance: among second-round SHA-3 candidates it ranks between #1 and #6 in software performance on every modern CPU, for both long messages and short messages.

The CubeHash512 implementations mentioned above, and several additional implementations, are distributed as part of the SUPERCOP benchmarking framework, illustrating different aspects of CubeHash optimization:

Portability	Implementation	Word size	Description
portable	simple	32 bits	The recommended introduction to CubeHash for programmers. Includes compatibility layer for the NIST API.
portable	spec	32 bits	A more verbose version closely tracking the CubeHash specification. Includes compatibility layer for the NIST API.
portable	bitsliced	16 bits	Different implementation strategy, useful in some contexts.
portable	hardware2	32 bits	2-cycle-per-round hardware implementation strategy. The recommended introduction to CubeHash for VHDL/Verilog programmers.
portable	hardware4	16 bits	4-cycle-per-round hardware implementation strategy.
portable	hardware8	8 bits	8-cycle-per-round hardware implementation strategy.
portable	hardware16	4 bits	16-cycle-per-round hardware implementation strategy.
portable	hardware32	2 bits	32-cycle-per-round hardware implementation strategy.
portable	unrolled	32 bits	Similar to simple, plus full unrolling of the inner loops; 32 separate 32-bit variables instead of an array. Includes compatibility layer for the NIST API.
portable	unrolled2	32 bits	Similar to unrolled, plus 2-way unrolling of the main loop. Includes compatibility layer for the NIST API.
portable	unrolled3	32 bits	Similar to unrolled2, but smaller code.
portable	unrolled4	32 bits	Similar to unrolled3, but different scheduling.
portable	unrolled5	32 bits	Similar to unrolled4, but different scheduling.
amd64, x86 with SSE	mmintrin	64 bits	Vectorized implementation using MMX registers and PSHUFW.
amd64, x86 with SSE3	emmintrin4	128 bits	Vectorized implementation using XMM registers. Includes compatibility layer for the NIST API.
amd64, x86 with SSE3	emmintrin5	128 bits	Similar to emmintrin4, but smaller code.
amd64	amd64-32	32 bits	Similar to unrolled3, plus improved instruction scheduling.
amd64	amd64	128 bits	Similar to emmintrin5, plus improved instruction scheduling.
amd64	amd64-2	128 bits	Similar to amd64, plus improved instruction scheduling.
amd64 with AVX	amd64avx	128 bits	Similar to amd64-2, plus some use of AVX instructions. Untested!
x86	x86	32 bits	Similar to unrolled3, plus improved instruction scheduling.
x86 with SSE3	x86xmm	128 bits	Similar to amd64-2, but for x86 instead of amd64.
armeabi	arm	32 bits	Similar to unrolled3, plus improved instruction scheduling.
armeabi with NEON	armneon	128 bits	Vectorized implementation using NEON registers.
ia64	precompiled/ia64	32 bits	Precompiled version of unrolled3.
mipso32	mipso32	32 bits	Similar to unrolled3, plus improved instruction scheduling.
mips32, mips64	mips64	32 bits	Similar to unrolled3, plus improved instruction scheduling.
ppc32, ppc64 with AltiVec	ppcaltivec	128 bits	Vectorized implementation using AltiVec registers.
ppc32	ppc32	32 bits	Similar to unrolled3, plus improved instruction scheduling.
ppc64 under Linux	ppc64	32 bits	Similar to unrolled3, plus improved instruction scheduling.
ppc64 under AIX	ppc64aix	32 bits	Similar to ppc64, with tweaks needed to compile under AIX.
cellspu	cellspu	128 bits	Vectorized implementation using SPE registers.
sparcv9	sparcv9	32 bits	Similar to unrolled3, plus improved instruction scheduling.
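
For readers who want a feel for what the array-based, 32-bit-word style of the simple implementation looks like before opening the SUPERCOP sources, here is a rough Python sketch of one CubeHash round on a state of 32 words. It is written from the ten-step round description in the CubeHash specification rather than copied from any implementation listed above, so it should be checked against the specification before being relied on. The names `rotl32`, `swap_pairs` and `cubehash_round` are invented for this illustration. Initialization, message injection, finalization and the NIST API layer are all omitted.

```python
# Illustrative sketch of one CubeHash round (assumed to follow the ten steps
# of the CubeHash specification). The state is 32 words of 32 bits;
# x[0..15] are the words x[0jklm] and x[16..31] are the words x[1jklm].

MASK = 0xFFFFFFFF

def rotl32(v, n):
    """Rotate a 32-bit word left ("upwards") by n bits, 0 < n < 32."""
    return ((v << n) | (v >> (32 - n))) & MASK

def swap_pairs(x, base, bit):
    """Within the half starting at `base`, swap word i with word i ^ bit."""
    for i in range(16):
        if not i & bit:
            a, b = base + i, base + (i | bit)
            x[a], x[b] = x[b], x[a]

def cubehash_round(state):
    """Return a new 32-word state after one round; the input is not modified."""
    x = list(state)
    for rot, low_bit, high_bit in ((7, 8, 2), (11, 4, 1)):
        for i in range(16):                 # add x[0jklm] into x[1jklm]
            x[16 + i] = (x[16 + i] + x[i]) & MASK
        for i in range(16):                 # rotate x[0jklm] by 7, then by 11
            x[i] = rotl32(x[i], rot)
        swap_pairs(x, 0, low_bit)           # swap within the low half
        for i in range(16):                 # xor x[1jklm] into x[0jklm]
            x[i] ^= x[16 + i]
        swap_pairs(x, 16, high_bit)         # swap within the high half
    return x

# Example: see how a single set bit spreads after one round.
start = [0] * 32
start[0] = 1
after = cubehash_round(start)
print({i: w for i, w in enumerate(after) if w})
```

In terms of the table above, the unrolled variants would replace the list `x` and the inner loops with 32 separate variables and straight-line code, while the vectorized variants would handle four 32-bit words per 128-bit register.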

Version

This is version 2010.12.03 of the software.html web page.