Advertisement
If you have a new account but are having problems posting or verifying your account, please email us on hello@boards.ie for help. Thanks :)
Hello all! Please ensure that you are posting a new thread or question in the appropriate forum. The Feedback forum is overwhelmed with questions that are having to be moved elsewhere. If you need help to verify your account contact hello@boards.ie
Hi there,
There is an issue with role permissions that is being worked on at the moment.
If you are having trouble with access or permissions on regional forums please post here to get access: https://www.boards.ie/discussion/2058365403/you-do-not-have-permission-for-that#latest

DES F round

  • 27-08-2004 2:55pm
    #1
    Closed Accounts Posts: 1,567 ✭✭✭


    comment ~
    				-=[DES F function using allocated memory]=-
    					(144 instructions, instead of 400)
    					----------------------------------
    
    	Imagine you have 3 DWORD values.
    
    	A = 0x12345678, B = 0x9abcdef0, C = 0xbaddad55
    
    B + C are data values in memory, and we have to
    XOR A by each of them in our code.
    
    	A ^ B
    	A ^ C
    					; using A, B + C as eax, ebx, ecx
    
    	mov	eax, A		; eax = (0x12345678)
    	mov	ebx, B		; ecx = (0x9abcdef0)
    	mov	ecx, C		; ecx = (0xbaddad55)
    
    	mov	edx, eax		; edx = eax
    	;.............
    	xor	eax, ebx		; eax ^= ebx (0x88888888)
    	xor	eax, ecx		; eax ^= ecx (0x325525DD)
    
    if we were to XOR A against the result of B ^ C, then we would get the same result.
    	
    	xor	ebx, ecx		; ebx ^= ecx (0x206173A5)	B ^ C
    	xor	edx, ebx		; edx ^= ebx (0x325525DD)
    
    It figures, I know.
    
    so, why not XOR a DES sbox against a relative sbox and eliminate a few instructions
    here and there in the F round?
    
    No doubt that it has been done already, but i haven't seen any code that does it,
    which is why I mention it now.
    Maybe there is no point?? I don't know.
    
    To do this successfully;
    
    First we have to create every possible combination of xor'd bytes using sbox 1 + 3,
    then 2 + 4..and so on until we have 4 new sbox tables from 8, which are allocated in memory.
    
    That is (256 * 256) = 65536 bytes for 2 merged sboxes, containing sbox1 + sbox3 for example,
    which would be indexed by first 16bits of key_schedule in each round.
    
    we could do the F round in 9 (144 total) instructions if we xor'd 4 relative sboxes together
    instead of just 2, but this would require 4 GB of memory for each 4 bytes of key_schedule,
    totalling 8 Gigabytes of sbox memory in 1 round.
    
    i thought at first, before writing the code, the results of a shortened F round would yield 
    faster processing, but,
    like an MMX routine which processed 2 hashes at once using similar idea,
    both were slower, mainly because of the time needed to access 
    data in each merged box.
    
    i'm not sure if results would be more positive on a 64-Bit cpu.
    But if you think about it, some encryption algorithms that
    use fixed substitution boxes can be precomputed.
    ~
    
    .586
    .model flat, stdcall
    
    include <msvcrt.inc>
    include <set_key.asm>
    include <set_odd.asm>
    include <str_to_key.asm>
    
    .data
    
    nLeft			dd	02400b807h		; precomputed IV for Microsoft Lanman hash
    nRight		dd	0aa190747h
    
    szPassword		db	"PASSWORD"
    			db	8	dup	(0)		;
    
    des_key		db	8	dup	(0)	
    key_schedule	db	128	dup	(0)
    
    L1			dd	0			; storage for first routine
    R1			dd	0
    
    L2			dd	0			; storage for second routine
    R2			dd	0
    
    include <des_sboxes.inc>
    
    psboxes		dd	offset sbox3
    			dd	offset sbox1
    
    			dd	offset sbox4
    			dd	offset sbox2
    
    			dd	offset sbox7
    			dd	offset sbox5
    
    			dd	offset sbox8
    			dd	offset sbox6
    
    pushz	macro szText:VARARG
    	local	next_label
    	call	next_label
    	db	szText,00h
    next_label:
    endm
    
    des_transform_single	macro	dwL, dwR, nOffs
    	mov	eax, [ebp + nOffs]		; ebp = key_schedule
    	mov	edx, [ebp + nOffs + 4]
    
    	xor	eax, dwL				; dwL = left half of input
    	xor	edx, dwL
    
    	and	eax, 0fcfcfcfch
    	and	edx, 0cfcfcfcfh
    
    	ror	edx, 4
    
    	movzx	ebx, al
    	movzx	ecx, ah
    
    	xor	dwR, dword ptr [ebx + sbox1]
    	xor	dwR, dword ptr [ecx + sbox3]
    
    	shr	eax, 16
    
    	mov	bl, dl
    	mov	cl, dh
    	
    	xor	dwR, dword ptr [ebx + sbox2]
    	xor	dwR, dword ptr [ecx + sbox4]
    
    	shr	edx, 16
    
    	mov	bl, ah
    	mov	cl, dh
    
    	xor	dwR, dword ptr [ebx + sbox7]
    	xor	dwR, dword ptr [ecx + sbox8]
    
    	and	eax, 0ffh
    	and	edx, 0ffh
    	
    	xor	dwR, dword ptr [eax + sbox5]
    	xor	dwR, dword ptr [edx + sbox6]
    ENDM
    
    sbox_1_and_3	equ	0*0ffffh+0
    sbox_2_and_4	equ	1*0ffffh+1
    sbox_5_and_7	equ	2*0ffffh+2
    sbox_8_and_6	equ	3*0ffffh+3
    
    ;##################################################################
    des_transform_double	macro	dwL, dwR, nOffs
    	mov	eax, dword ptr[key_schedule + nOffs]
    	mov	ebx, dword ptr[key_schedule + nOffs + 4]
    
    	xor	eax, dwL
    	xor	ebx, dwL
    
    	and	eax, 0fcfcfcfch
    	and	ebx, 0cfcfcfcfh
    
    	ror	ebx, 4
    
    	mov	cx, ax
    	mov	dx, bx
    
    	shr	eax, 16
    	shr	ebx, 16
    
    	xor	dwR, dword ptr [ebp + eax + sbox_5_and_7]
    	xor	dwR, dword ptr [ebp + ebx + sbox_8_and_6]
    
    	xor	dwR, dword ptr [ebp + ecx + sbox_1_and_3]
    	xor	dwR, dword ptr [ebp + edx + sbox_2_and_4]
    ENDM
    
    ; about the merged sboxes
    ;
    ; For 4 xors, its 4 * (256 * 256) = 262,144 BYTES of sbox which is what
    ; is implemented here.
    ;
    ; And for 2 xors, thats 2 * (65,536 * 65,536) = 8,589,934,592 BYTES of space, or roughly 8.5 Giga Bytes
    ;
    ; below is what the 2 xor'd version F round would look like..
    
    	;mov	eax, [key_schedule + 00]
    	;mov	ebx, [key_schedule + 04]
    	
    	;xor	eax, dwL
    	;xor	ebx, dwL
    	
    	;and	eax, 0fcfcfcfch
    	;and	ebx, 0cfcfcfcfh
    	
    	;ror	ebx, 04
    	
    	;xor	dwR, dword ptr [eax + merged_1_3_5_7_sboxes]
    	;xor	dwR, dword ptr [ebx + merged_2_4_8_6_sboxes]
    
    ; to do it in 128 instructions, precompute rotate ????
    ; do it in 96, eliminating AND instructions, by providing
    ; appropriate indexed tables > 63 ????
    
    ; using MMX, theory.
    	;movq	mm0, qword ptr [key_schedule]
    	;pxor	mm0, dqL
    	;pxor mm0, qword ptr [merged_sboxes]
    	... scrapped 48.
    psbox_mem		dd	0
    
    .code
    main:
    	pushz	10,'DES F Round using "merged" sboxes',10
    	call	printf
    	add	esp, 04
    
    	push	offset des_key
    	push	offset szPassword
    	call	str_to_key
    	
    	push	offset key_schedule
    	push	offset des_key
    	call	DESSetKey
    
    	mov	esi, dword ptr [nLeft]
    	mov	edi, dword ptr [nRight]
    	lea	ebp, [key_schedule]
    	xor	ecx, ecx
    	xor	ebx, ebx
    	;######################################
    
    	des_transform_single	esi, edi, 00*08
    	des_transform_single	edi, esi, 01*08
    	des_transform_single	esi, edi, 02*08
    	des_transform_single	edi, esi, 03*08
    
    	des_transform_single	esi, edi, 04*08
    	des_transform_single	edi, esi, 05*08
    	des_transform_single	esi, edi, 06*08
    	des_transform_single	edi, esi, 07*08
     
    	des_transform_single	esi, edi, 08*08
    	des_transform_single	edi, esi, 09*08
    	des_transform_single	esi, edi, 10*08
    	des_transform_single	edi, esi, 11*08
     
    	des_transform_single	esi, edi, 12*08
    	des_transform_single	edi, esi, 13*08
    	des_transform_single	esi, edi, 14*08
    	des_transform_single	edi, esi, 15*08
    
    	;######################################
    	
    	mov	dword ptr [L1], esi		; save results.
    	mov	dword ptr [R1], edi
    	;--------------------------------------
    	call	merge_des_sboxes			; allocate memory
    	mov	[psbox_mem], eax
    	test	eax, eax
    	jz	memory_error
    	;--------------------------------------
    	mov	esi, dword ptr [nLeft]
    	mov	edi, dword ptr [nRight]
    	mov	ebp, [psbox_mem]
    	xor	ecx, ecx
    	xor	ebx, ebx
    	;######################################
    
    	des_transform_double	esi, edi, 00*08
    	des_transform_double	edi, esi, 01*08
    	des_transform_double	esi, edi, 02*08
    	des_transform_double	edi, esi, 03*08
    
    	des_transform_double	esi, edi, 04*08
    	des_transform_double	edi, esi, 05*08
    	des_transform_double	esi, edi, 06*08
    	des_transform_double	edi, esi, 07*08
     
    	des_transform_double	esi, edi, 08*08
    	des_transform_double	edi, esi, 09*08
    	des_transform_double	esi, edi, 10*08
    	des_transform_double	edi, esi, 11*08
     
    	des_transform_double	esi, edi, 12*08
    	des_transform_double	edi, esi, 13*08
    	des_transform_double	esi, edi, 14*08
    	des_transform_double	edi, esi, 15*08
     
    	;######################################
    
    	mov	dword ptr [L2], esi
    	mov	dword ptr [R2], edi
    	;--------------------------------------
    	
    	push	dword ptr [R2]
    	push	dword ptr [L2]
    	push	dword ptr [R1]
    	push	dword ptr [L1]
    	pushz	10,"First result:%08x%08x - Second:%08x%08x"
    	call	printf
    	add	esp, 5*4
    
    free_mem:
    	push	dword ptr [psbox_mem]
    	call	free
    	add	esp, 04h
    
    exit_code:
    	push	eax
    	call	exit
    
    memory_error:
    	pushz	10,"Could not allocate memory for merged sboxes"
    	call	printf
    	add	esp, 04h
    	jmp	exit_code
    ;################################################################
    merge_des_sboxes:
    	pushad
    	
    	push	256*256*4		; 4 * 65536 merged sbox memory
    	call	malloc
    	add	esp, 04
    	test	eax, eax			; if zero, return error
    	mov	[esp + 28], eax	
    	jz	end_merge
    	xchg	eax, edi
    	
    	lea	esi, [psboxes]		; load sbox pointers
    	xor	ecx, ecx
    init_sbox:
    	push	ecx
    	mov	edx, [esi + 4 * ecx + 4]	; get an sbox ;)
    	mov	ebp, [esi + 4 * ecx]
    	xor	ecx, ecx
    store_sbox:
    	push	ecx
    	xor	ecx, ecx
    store_dword:	
    	mov	eax, dword ptr [ebp]		
    	xor	eax, dword ptr [edx + 4 * ecx]	; xor current dword against current byte in dword
    	stosd							; store xor'd 4 bytes/ 1 dword
    	inc	ecx
    	cmp	ecx, 64			; repeat 64 times
    	jne	store_dword
    
    	inc	ebp				; increase 1 character in sbox
    	pop	ecx
    	inc	ecx
    	cmp	ecx, 256			; repeat 256 times
    	jne	store_sbox
    
    	pop	ecx
    	add	ecx, 2
    	cmp	ecx, 8
    	jne	init_sbox	
    end_merge:
    	popad
    	ret
    dd 3f946967h,7b689b77h,938f0828h,20801d6bh
    end	main
    


Comments

  • Closed Accounts Posts: 143 ✭✭Zonko


    *Opens mouth wide*
    Assembly I presume, if you can understand all that, you are worthy of muchos praise.


  • Closed Accounts Posts: 59 ✭✭crashedmind


    First off - it wasn't clear to me what point you were making so apologies if I'm off target...

    There's lots of des speed ups out there using some neat tricks - all else being equal they usually trade increased memory requirements against speed.

    Taking a step back, you can do maximum pre-computation by going with OFB mode to precompute almost everything and just do an XOR of the plaintext with the resulting stream at the end for encryption.


  • Moderators, Recreation & Hobbies Moderators, Science, Health & Environment Moderators, Technology & Internet Moderators Posts: 93,596 Mod ✭✭✭✭Capt'n Midnight


    store_sbox:
    	push	ecx
    	xor	ecx, ecx
    store_dword:	
    	mov	eax, dword ptr [ebp]		
    	xor	eax, dword ptr [edx + 4 * ecx]	; xor current dword against current byte in dword
    	stosd							; store xor'd 4 bytes/ 1 dword
    	inc	ecx
    	cmp	ecx, 64			; repeat 64 times
    	jne	store_dword
    
    	inc	ebp				; increase 1 character in sbox
    	pop	ecx
    	inc	ecx
    	cmp	ecx, 256			; repeat 256 times
    	jne	store_sbox
    
    [/QUOTE]

    Two nested loops with up to 64*256 iterations
    what matters is the number of clock cycles to get to the answer not the size of the algorithm, unless you are trying to put it into a virus or something ;)

    Also the order of instructions and the choice of the matters (cf. the polymorphic viruses - not all equilivant instructions take the same time to execute) and having instructions that can execute out of sequence or use different parts of the processor can also lead to optimisations.

    GB hash tables are practical if you don't mind buying some more RAM.


  • Closed Accounts Posts: 1,567 ✭✭✭Martyr


    First off - it wasn't clear to me what point you were making so apologies if I'm off target...

    Excuse me if I confuse anyone here, my heads a bit groggy.

    I realise there is no point in mentioning it now, mainly because its not
    really practical, and if there was any point, surely someone would
    have done it long ago.
    but I just thought that it was a little interesting because if
    it weren't for the long delay in accessing data from the "merged" sbox
    tables, then it would be very fast, faster than any other software DES implementation, atleast i guess so.

    And I've looked around for DES speed-ups, like the bitslice code by Biham,
    i tested some code by Matthew Kwan which appeared to be slower
    that an ordinary DES routine using macros in assembly.

    Thats not to say that a proper implementation of Bihams idea wouldn't
    be as fast.

    I think i saw that idea of using OFB mode in l0phtcrack.
    I'm intrigued by what they do there..

    I can only manage 530,000 hashes per second on a PII 350, where as
    lc does about 750,000

    I figure i can get 1 million hashes soon.

    Its not for a virus, but I had an idea of a password cracking virus
    against NT Lanman hashes.
    I don't write viruses, contrary to popular belief, if you write in assembly,
    you're obviously writing viruses or cracking software.
    I've studied them for a long time, to understand how they work,
    but i've never actually written one, and probably never will, because
    its alot of work with no goal.

    I think polymorphic & meta code is very interesting.

    I would like to finish a password cracker for nix, and give away source code.

    For unix crypt, mcrypt,bcrypt,lanman,ntlm1 and 2 aswell as others.
    I'm well aware of John The Ripper, but already the lanman hash
    routine i have is 2 times faster.


Advertisement