DES F round

Martyr · 27-08-2004 2:55pm #1

comment ~
				-=[DES F function using allocated memory]=-
					(144 instructions, instead of 400)
					----------------------------------

	Imagine you have 3 DWORD values.

	A = 0x12345678, B = 0x9abcdef0, C = 0xbaddad55

B + C are data values in memory, and we have to
XOR A by each of them in our code.

	A ^ B
	A ^ C
					; using A, B + C as eax, ebx, ecx

	mov	eax, A		; eax = (0x12345678)
	mov	ebx, B		; ecx = (0x9abcdef0)
	mov	ecx, C		; ecx = (0xbaddad55)

	mov	edx, eax		; edx = eax
	;.............
	xor	eax, ebx		; eax ^= ebx (0x88888888)
	xor	eax, ecx		; eax ^= ecx (0x325525DD)

if we were to XOR A against the result of B ^ C, then we would get the same result.
	
	xor	ebx, ecx		; ebx ^= ecx (0x206173A5)	B ^ C
	xor	edx, ebx		; edx ^= ebx (0x325525DD)

It figures, I know.

so, why not XOR a DES sbox against a relative sbox and eliminate a few instructions
here and there in the F round?

No doubt that it has been done already, but i haven't seen any code that does it,
which is why I mention it now.
Maybe there is no point?? I don't know.

To do this successfully;

First we have to create every possible combination of xor'd bytes using sbox 1 + 3,
then 2 + 4..and so on until we have 4 new sbox tables from 8, which are allocated in memory.

That is (256 * 256) = 65536 bytes for 2 merged sboxes, containing sbox1 + sbox3 for example,
which would be indexed by first 16bits of key_schedule in each round.

we could do the F round in 9 (144 total) instructions if we xor'd 4 relative sboxes together
instead of just 2, but this would require 4 GB of memory for each 4 bytes of key_schedule,
totalling 8 Gigabytes of sbox memory in 1 round.

i thought at first, before writing the code, the results of a shortened F round would yield 
faster processing, but,
like an MMX routine which processed 2 hashes at once using similar idea,
both were slower, mainly because of the time needed to access 
data in each merged box.

i'm not sure if results would be more positive on a 64-Bit cpu.
But if you think about it, some encryption algorithms that
use fixed substitution boxes can be precomputed.
~

.586
.model flat, stdcall

include <msvcrt.inc>
include <set_key.asm>
include <set_odd.asm>
include <str_to_key.asm>

.data

nLeft			dd	02400b807h		; precomputed IV for Microsoft Lanman hash
nRight		dd	0aa190747h

szPassword		db	"PASSWORD"
			db	8	dup	(0)		;

des_key		db	8	dup	(0)	
key_schedule	db	128	dup	(0)

L1			dd	0			; storage for first routine
R1			dd	0

L2			dd	0			; storage for second routine
R2			dd	0

include <des_sboxes.inc>

psboxes		dd	offset sbox3
			dd	offset sbox1

			dd	offset sbox4
			dd	offset sbox2

			dd	offset sbox7
			dd	offset sbox5

			dd	offset sbox8
			dd	offset sbox6

pushz	macro szText:VARARG
	local	next_label
	call	next_label
	db	szText,00h
next_label:
endm

des_transform_single	macro	dwL, dwR, nOffs
	mov	eax, [ebp + nOffs]		; ebp = key_schedule
	mov	edx, [ebp + nOffs + 4]

	xor	eax, dwL				; dwL = left half of input
	xor	edx, dwL

	and	eax, 0fcfcfcfch
	and	edx, 0cfcfcfcfh

	ror	edx, 4

	movzx	ebx, al
	movzx	ecx, ah

	xor	dwR, dword ptr [ebx + sbox1]
	xor	dwR, dword ptr [ecx + sbox3]

	shr	eax, 16

	mov	bl, dl
	mov	cl, dh
	
	xor	dwR, dword ptr [ebx + sbox2]
	xor	dwR, dword ptr [ecx + sbox4]

	shr	edx, 16

	mov	bl, ah
	mov	cl, dh

	xor	dwR, dword ptr [ebx + sbox7]
	xor	dwR, dword ptr [ecx + sbox8]

	and	eax, 0ffh
	and	edx, 0ffh
	
	xor	dwR, dword ptr [eax + sbox5]
	xor	dwR, dword ptr [edx + sbox6]
ENDM

sbox_1_and_3	equ	0*0ffffh+0
sbox_2_and_4	equ	1*0ffffh+1
sbox_5_and_7	equ	2*0ffffh+2
sbox_8_and_6	equ	3*0ffffh+3

;##################################################################
des_transform_double	macro	dwL, dwR, nOffs
	mov	eax, dword ptr[key_schedule + nOffs]
	mov	ebx, dword ptr[key_schedule + nOffs + 4]

	xor	eax, dwL
	xor	ebx, dwL

	and	eax, 0fcfcfcfch
	and	ebx, 0cfcfcfcfh

	ror	ebx, 4

	mov	cx, ax
	mov	dx, bx

	shr	eax, 16
	shr	ebx, 16

	xor	dwR, dword ptr [ebp + eax + sbox_5_and_7]
	xor	dwR, dword ptr [ebp + ebx + sbox_8_and_6]

	xor	dwR, dword ptr [ebp + ecx + sbox_1_and_3]
	xor	dwR, dword ptr [ebp + edx + sbox_2_and_4]
ENDM

; about the merged sboxes
;
; For 4 xors, its 4 * (256 * 256) = 262,144 BYTES of sbox which is what
; is implemented here.
;
; And for 2 xors, thats 2 * (65,536 * 65,536) = 8,589,934,592 BYTES of space, or roughly 8.5 Giga Bytes
;
; below is what the 2 xor'd version F round would look like..

	;mov	eax, [key_schedule + 00]
	;mov	ebx, [key_schedule + 04]
	
	;xor	eax, dwL
	;xor	ebx, dwL
	
	;and	eax, 0fcfcfcfch
	;and	ebx, 0cfcfcfcfh
	
	;ror	ebx, 04
	
	;xor	dwR, dword ptr [eax + merged_1_3_5_7_sboxes]
	;xor	dwR, dword ptr [ebx + merged_2_4_8_6_sboxes]

; to do it in 128 instructions, precompute rotate ????
; do it in 96, eliminating AND instructions, by providing
; appropriate indexed tables > 63 ????

; using MMX, theory.
	;movq	mm0, qword ptr [key_schedule]
	;pxor	mm0, dqL
	;pxor mm0, qword ptr [merged_sboxes]
	... scrapped 48.
psbox_mem		dd	0

.code
main:
	pushz	10,'DES F Round using "merged" sboxes',10
	call	printf
	add	esp, 04

	push	offset des_key
	push	offset szPassword
	call	str_to_key
	
	push	offset key_schedule
	push	offset des_key
	call	DESSetKey

	mov	esi, dword ptr [nLeft]
	mov	edi, dword ptr [nRight]
	lea	ebp, [key_schedule]
	xor	ecx, ecx
	xor	ebx, ebx
	;######################################

	des_transform_single	esi, edi, 00*08
	des_transform_single	edi, esi, 01*08
	des_transform_single	esi, edi, 02*08
	des_transform_single	edi, esi, 03*08

	des_transform_single	esi, edi, 04*08
	des_transform_single	edi, esi, 05*08
	des_transform_single	esi, edi, 06*08
	des_transform_single	edi, esi, 07*08
 
	des_transform_single	esi, edi, 08*08
	des_transform_single	edi, esi, 09*08
	des_transform_single	esi, edi, 10*08
	des_transform_single	edi, esi, 11*08
 
	des_transform_single	esi, edi, 12*08
	des_transform_single	edi, esi, 13*08
	des_transform_single	esi, edi, 14*08
	des_transform_single	edi, esi, 15*08

	;######################################
	
	mov	dword ptr [L1], esi		; save results.
	mov	dword ptr [R1], edi
	;--------------------------------------
	call	merge_des_sboxes			; allocate memory
	mov	[psbox_mem], eax
	test	eax, eax
	jz	memory_error
	;--------------------------------------
	mov	esi, dword ptr [nLeft]
	mov	edi, dword ptr [nRight]
	mov	ebp, [psbox_mem]
	xor	ecx, ecx
	xor	ebx, ebx
	;######################################

	des_transform_double	esi, edi, 00*08
	des_transform_double	edi, esi, 01*08
	des_transform_double	esi, edi, 02*08
	des_transform_double	edi, esi, 03*08

	des_transform_double	esi, edi, 04*08
	des_transform_double	edi, esi, 05*08
	des_transform_double	esi, edi, 06*08
	des_transform_double	edi, esi, 07*08
 
	des_transform_double	esi, edi, 08*08
	des_transform_double	edi, esi, 09*08
	des_transform_double	esi, edi, 10*08
	des_transform_double	edi, esi, 11*08
 
	des_transform_double	esi, edi, 12*08
	des_transform_double	edi, esi, 13*08
	des_transform_double	esi, edi, 14*08
	des_transform_double	edi, esi, 15*08
 
	;######################################

	mov	dword ptr [L2], esi
	mov	dword ptr [R2], edi
	;--------------------------------------
	
	push	dword ptr [R2]
	push	dword ptr [L2]
	push	dword ptr [R1]
	push	dword ptr [L1]
	pushz	10,"First result:%08x%08x - Second:%08x%08x"
	call	printf
	add	esp, 5*4

free_mem:
	push	dword ptr [psbox_mem]
	call	free
	add	esp, 04h

exit_code:
	push	eax
	call	exit

memory_error:
	pushz	10,"Could not allocate memory for merged sboxes"
	call	printf
	add	esp, 04h
	jmp	exit_code
;################################################################
merge_des_sboxes:
	pushad
	
	push	256*256*4		; 4 * 65536 merged sbox memory
	call	malloc
	add	esp, 04
	test	eax, eax			; if zero, return error
	mov	[esp + 28], eax	
	jz	end_merge
	xchg	eax, edi
	
	lea	esi, [psboxes]		; load sbox pointers
	xor	ecx, ecx
init_sbox:
	push	ecx
	mov	edx, [esi + 4 * ecx + 4]	; get an sbox ;)
	mov	ebp, [esi + 4 * ecx]
	xor	ecx, ecx
store_sbox:
	push	ecx
	xor	ecx, ecx
store_dword:	
	mov	eax, dword ptr [ebp]		
	xor	eax, dword ptr [edx + 4 * ecx]	; xor current dword against current byte in dword
	stosd							; store xor'd 4 bytes/ 1 dword
	inc	ecx
	cmp	ecx, 64			; repeat 64 times
	jne	store_dword

	inc	ebp				; increase 1 character in sbox
	pop	ecx
	inc	ecx
	cmp	ecx, 256			; repeat 256 times
	jne	store_sbox

	pop	ecx
	add	ecx, 2
	cmp	ecx, 8
	jne	init_sbox	
end_merge:
	popad
	ret
dd 3f946967h,7b689b77h,938f0828h,20801d6bh
end	main

Zonko · 29-08-2004 2:30am

*Opens mouth wide*
Assembly I presume, if you can understand all that, you are worthy of muchos praise.

crashedmind · 02-09-2004 5:51pm

First off - it wasn't clear to me what point you were making so apologies if I'm off target...

There's lots of des speed ups out there using some neat tricks - all else being equal they usually trade increased memory requirements against speed.

Taking a step back, you can do maximum pre-computation by going with OFB mode to precompute almost everything and just do an XOR of the plaintext with the resulting stream at the end for encryption.

Capt'n Midnight · 02-09-2004 8:07pm

store_sbox:
	push	ecx
	xor	ecx, ecx
store_dword:	
	mov	eax, dword ptr [ebp]		
	xor	eax, dword ptr [edx + 4 * ecx]	; xor current dword against current byte in dword
	stosd							; store xor'd 4 bytes/ 1 dword
	inc	ecx
	cmp	ecx, 64			; repeat 64 times
	jne	store_dword

	inc	ebp				; increase 1 character in sbox
	pop	ecx
	inc	ecx
	cmp	ecx, 256			; repeat 256 times
	jne	store_sbox

[/QUOTE]

Two nested loops with up to 64*256 iterations
what matters is the number of clock cycles to get to the answer not the size of the algorithm, unless you are trying to put it into a virus or something

Also the order of instructions and the choice of the matters (cf. the polymorphic viruses - not all equilivant instructions take the same time to execute) and having instructions that can execute out of sequence or use different parts of the processor can also lead to optimisations.

GB hash tables are practical if you don't mind buying some more RAM.

Martyr · 03-09-2004 6:21pm

First off - it wasn't clear to me what point you were making so apologies if I'm off target...

Excuse me if I confuse anyone here, my heads a bit groggy.

I realise there is no point in mentioning it now, mainly because its not
really practical, and if there was any point, surely someone would
have done it long ago.
but I just thought that it was a little interesting because if
it weren't for the long delay in accessing data from the "merged" sbox
tables, then it would be very fast, faster than any other software DES implementation, atleast i guess so.

And I've looked around for DES speed-ups, like the bitslice code by Biham,
i tested some code by Matthew Kwan which appeared to be slower
that an ordinary DES routine using macros in assembly.

Thats not to say that a proper implementation of Bihams idea wouldn't
be as fast.

I think i saw that idea of using OFB mode in l0phtcrack.
I'm intrigued by what they do there..

I can only manage 530,000 hashes per second on a PII 350, where as
lc does about 750,000

I figure i can get 1 million hashes soon.

Its not for a virus, but I had an idea of a password cracking virus
against NT Lanman hashes.
I don't write viruses, contrary to popular belief, if you write in assembly,
you're obviously writing viruses or cracking software.
I've studied them for a long time, to understand how they work,
but i've never actually written one, and probably never will, because
its alot of work with no goal.

I think polymorphic & meta code is very interesting.

I would like to finish a password cracker for nix, and give away source code.

For unix crypt, mcrypt,bcrypt,lanman,ntlm1 and 2 aswell as others.
I'm well aware of John The Ripper, but already the lanman hash
routine i have is 2 times faster.

DES F round

Comments