Multi-Threading with DLL files

  • 10-04-2008 5:37pm
    #1
    Closed Accounts Posts: 1,567 ✭✭✭


    I need to run a thread for each available processor found on windows.

    But each thread needs to access "global data" without the use of spinlocks/mutexes

    Problem is, calling LoadLibrary() multiple times returns the same base address for each thread.

    Current workaround: replicate the DLL file and load each copy separately, so each gets a different base address and its own individual memory space.

    Everything runs fine, but I was hoping there was an alternative method that could be used.

    Anyone any ideas?


Comments

  • Closed Accounts Posts: 1,444 ✭✭✭Cantab.


    I'm not a multi-processor expert, but how do you propose to have separate threads use the same memory without synchronising the shared memory space using standard RT techniques (i.e. mutexing, etc.)? Doesn't seem possible to me...


  • Closed Accounts Posts: 1,567 ✭✭✭Martyr


    i didn't phrase the question very well.

    The main problem is that x86 processors don't have enough free registers.
    The thread uses all 8 general-purpose registers (yes, it's in assembly), including EBP and ESP, which are usually reserved for addressing local variables and function parameters.

    as a result, i stored what needs to be saved in global variables.

    but when multi-threading, each thread is sharing the same data, which obviously won't work.

    to get around this, i put the thread into a DLL file and duplicate it for each processor core available.

    the reason for this is better explained by MSDN's description of LoadLibrary:

    To summarize, the system performs the following steps at load time:
    1. Examines the image and determines its preferred base address and required size.
    2. Finds the address space required and maps the image, copy-on-write, from the file.
    3. Applies internal fixups if the image is not at its preferred base address.
    4. Fixes up all dynamic link imports by placing the correct address for each imported function into the appropriate entry of the Import Address Table. This table stores 32-bit addresses contiguously; to store up to 1024 imported functions requires it to dirty only one page of memory.


    on a dual-core for example, i rename the file.dll as

    file_0.dll
    file_1.dll

    load each individually, so each has its own memory space, get the procedure address, create the thread.

    this allows each thread to have its own private global variables (sort of).
    also, i don't need to use spinlocks or mutexes, which helps the speed of each thread.

    but the current method of duplicating files to make this work isn't the greatest of solutions, and i was hoping there was a more elegant way to achieve this.


  • Closed Accounts Posts: 1,444 ✭✭✭Cantab.


    So you're programming an Intel multi-core at register level? Good for you!

    I'd like to know what kind of an application this is for -- why is it so crucial to hand-optimise the performance? Couldn't you just program as normal and tack on an extra processor or two? How much performance gain do you think your hand-written code will achieve above compiled code?

    Couldn't you use an Intel compiler and let it auto-optimise your high-level thread automatically? It's very smart you know.

    Could it be that your app may be better suited to a more parallel architecture such as GPU/FPGA?

    What ARE you implementing mate?


  • Registered Users, Registered Users 2 Posts: 2,152 ✭✭✭dazberry


    on a dual-core for example, i rename the file.dll as

    file_0.dll
    file_1.dll

    load each individually, which then has its own memory space, get the procedure address, create the thread.

    Have you looked at Thread Local Storage? Alternately since you're doing this in asm, could you not make more use of the stack as the stack should be unique for each thread?

    D.


  • Registered Users, Registered Users 2 Posts: 4,188 ✭✭✭pH


    dazberry wrote: »
    Have you looked at Thread Local Storage? Alternately since you're doing this in asm, could you not make more use of the stack as the stack should be unique for each thread?

    D.
    Absolutely - TLS is the correct way to do this.
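
To make the Thread Local Storage suggestion concrete, here is a minimal sketch of per-thread "globals". It is an assumption-laden stand-in, not the original poster's code: it uses C11 `_Thread_local` and POSIX threads so that it is portable, whereas on Windows the equivalents would be `__declspec(thread)` or `TlsAlloc()`/`TlsGetValue()`. The names `scratch`, `struct job`, `worker` and `run_two_workers` are hypothetical.

```c
#include <assert.h>
#include <pthread.h>
#include <stdint.h>

/* One private copy of this "global" exists per thread. */
static _Thread_local uint32_t scratch;

struct job {
    uint32_t seed;    /* input: starting value for this thread */
    uint32_t result;  /* output: final value of the thread's own scratch */
};

static void *worker(void *arg)
{
    struct job *j = arg;
    assert(j != NULL);
    scratch = j->seed;               /* touches only this thread's copy */
    for (int i = 0; i < 1000; i++)
        scratch += 1;                /* no lock needed: nothing is shared */
    j->result = scratch;
    return NULL;
}

/* Runs two workers concurrently; returns 0 on success, -1 on failure. */
int run_two_workers(uint32_t out[2])
{
    pthread_t t[2];
    struct job jobs[2] = { { 100, 0 }, { 200, 0 } };
    int started = 0, rc = 0;

    for (; started < 2; started++) {
        if (pthread_create(&t[started], NULL, worker, &jobs[started]) != 0) {
            rc = -1;
            break;
        }
    }
    for (int i = 0; i < started; i++)
        pthread_join(t[i], NULL);    /* join only the threads that started */

    if (rc == 0) {
        out[0] = jobs[0].result;
        out[1] = jobs[1].result;
    }
    return rc;
}
```

Both threads write to `scratch` without a mutex, yet neither sees the other's value. Whether the extra addressing cost of TLS is acceptable in a hand-tuned assembly loop is a separate question that this sketch does not measure.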


  • Closed Accounts Posts: 1,567 ✭✭✭Martyr


    Cantab wrote:
    What ARE you implementing mate?
    multi-threaded programs, comparing the difference in speed between an x86 Core 2 and the PS3's Cell B.E., which is PowerPC-based.
    dazberry wrote:
    Have you looked at Thread Local Storage?

    yes, but it would slow down the thread too much unfortunately.
    what i'd hoped for was some function in windows which allowed loading 1 DLL file, multiple times, but each time at a different base address.
    dazberry wrote:
    Alternately since you're doing this in asm, could you not make more use of the stack as the stack should be unique for each thread?

    in some situations, i've found it faster to use global variables rather than the stack.. at least for this process.

    local variables stored more than +127 or -128 bytes from the stack or frame pointer need a 32-bit displacement instead of an 8-bit one, so the instruction encoding grows beyond 3 bytes, which usually takes longer for the processor to decode.
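
The displacement limit mentioned here can be worked through as arithmetic. The sketch below computes the encoded length of a plain `mov r32, [base + disp]` in 32-bit mode, assuming no prefixes and no index register. The function name `mov_load_length` and the enum are hypothetical helpers for illustration only.

```c
#include <assert.h>
#include <stdint.h>

/* Register numbers as used in the ModRM byte (32-bit mode). */
enum base_reg { EAX, ECX, EDX, EBX, ESP, EBP, ESI, EDI };

/* Encoded length in bytes of `mov r32, [base + disp]` in 32-bit mode,
   assuming no prefixes and no index register. */
int mov_load_length(enum base_reg base, int32_t disp)
{
    assert((int)base >= 0 && (int)base <= 7);

    int len = 2;                 /* opcode byte (8B) + ModRM byte */
    if (base == ESP)
        len += 1;                /* ESP as base always needs a SIB byte */

    if (disp == 0 && base != EBP)
        return len;              /* no displacement bytes at all */
    if (disp >= -128 && disp <= 127)
        return len + 1;          /* sign-extended 8-bit displacement */
    return len + 4;              /* full 32-bit displacement */
}
```

For example, `mov_load_length(EBP, 8)` gives 3 bytes, while `mov_load_length(EBP, 128)` gives 6, and an ESP-relative load costs one byte more in each case because of the SIB byte.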

    The thread is broken up into separate routines.
    For this reason, using local variables requires other registers to load the effective address, and/or PUSH/POP instructions, which are avoided because they don't pair.

    an attempt is made to ensure there is only 1 write to a register every 2 instructions, breaking up dependencies - it would be better to do 1 write every 3 or 4 instructions, but again, there aren't enough registers.

    this is why esp and ebp are used, whereas compilers wouldn't normally touch these at all.
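
The idea of spacing out writes to the same register to break up dependencies can be illustrated in C rather than assembly. This is only a sketch under that assumption: `sum_single` and `sum_split` are hypothetical names, and an optimising compiler may reorder or vectorise either version, so any speed difference has to be measured rather than assumed.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One accumulator: every add depends on the result of the previous add. */
uint32_t sum_single(const uint32_t *v, size_t n)
{
    assert(v != NULL || n == 0);
    uint32_t s = 0;
    for (size_t i = 0; i < n; i++)
        s += v[i];
    return s;
}

/* Four accumulators: each one is written only once every four adds,
   giving four independent dependency chains. */
uint32_t sum_split(const uint32_t *v, size_t n)
{
    assert(v != NULL || n == 0);
    uint32_t s0 = 0, s1 = 0, s2 = 0, s3 = 0;
    size_t i = 0;

    for (; i + 4 <= n; i += 4) {
        s0 += v[i];
        s1 += v[i + 1];
        s2 += v[i + 2];
        s3 += v[i + 3];
    }
    for (; i < n; i++)           /* leftover elements */
        s0 += v[i];

    /* Unsigned addition wraps, so the total matches sum_single exactly. */
    return s0 + s1 + s2 + s3;
}
```

The second version needs four live accumulators instead of one, which is exactly the register pressure described in the post.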


  • Registered Users, Registered Users 2 Posts: 1,313 ✭✭✭carveone


    yes, but it would slow down the thread too much unfortunately.
    what i'd hoped for was some function in windows which allowed loading 1 DLL file, multiple times, but each time at a different base address.

    Then does it become like fork()/exec() rather than creating a new thread? It's not sharing the same code space that's for sure...

    I believe there isn't a function that does what you ask. You'd have to write your own. Which would suck rather a lot more than what you're doing now!
    One level of indirection solves your problem but you'd start using LEA. I mean, if the stack is slow for you, perhaps you don't want to be adding computed offsets to alloced memory...

    Amusingly enough, under DOS you could change DS :D Only joking!

    Yeah, none of this helps much, sorry...

    Conor.
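
The "one level of indirection" idea can be sketched as a per-thread context block: everything that was a global moves into a struct, and each thread receives a pointer to its own instance. In assembly this would correspond to dedicating one register to the block's base address, which is the cost discussed above. The sketch uses POSIX threads for portability (on Windows the pointer would be the `lpParameter` argument of `CreateThread()`), and `struct thread_ctx`, `ctx_worker` and `run_with_contexts` are hypothetical names.

```c
#include <assert.h>
#include <pthread.h>
#include <stdint.h>

enum { MAX_THREADS = 64 };

/* Everything that used to be a global lives in one block per thread. */
struct thread_ctx {
    uint32_t saved_a;
    uint32_t saved_b;
    uint32_t result;
};

static void *ctx_worker(void *arg)
{
    struct thread_ctx *ctx = arg;    /* the single level of indirection */
    assert(ctx != NULL);
    for (uint32_t i = 0; i < 1000; i++)
        ctx->saved_a += ctx->saved_b;
    ctx->result = ctx->saved_a;
    return NULL;
}

/* Starts one thread per context block; returns 0 on success, -1 on failure. */
int run_with_contexts(struct thread_ctx *ctx, int count)
{
    pthread_t t[MAX_THREADS];
    int started = 0, rc = 0;

    if (ctx == NULL || count < 0 || count > MAX_THREADS)
        return -1;

    for (; started < count; started++) {
        if (pthread_create(&t[started], NULL, ctx_worker, &ctx[started]) != 0) {
            rc = -1;
            break;
        }
    }
    for (int i = 0; i < started; i++)
        pthread_join(t[i], NULL);    /* join only the threads that started */
    return rc;
}
```

No locks are needed because no two threads touch the same block. In a real benchmark the blocks would probably also be padded so that they sit on separate cache lines.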


  • Closed Accounts Posts: 1,567 ✭✭✭Martyr


    carveone wrote:
    Amusingly enough, under DOS you could change DS Only joking!

    good point tbh, DS is the default for data but you can override it using ES, FS, GS or SS.. Windows in 32-bit mode still recognises segment prefixes :)


  • Registered Users, Registered Users 2 Posts: 1,481 ✭✭✭satchmo


    multi-threaded programs, comparing difference in speed of x86 core2 with PS3 cell b.e, which is powerpc based
    I'd be careful how you compare the two, they're inherently completely different processors. Besides the difference in cache latencies etc, the Cell's PPU uses in-order execution so you can't just execute the same instructions in the same order on both platforms and expect the performance to be comparable.

    Interesting thread (the programming board needs more like this), let us know how you get on.


  • Registered Users, Registered Users 2 Posts: 2,426 ✭✭✭ressem


    Well, he's not using the extra registers available in x64 mode, so it must be a fairly specific benchmark that he is looking to create.

    Actually aren't there about 40 physical general purpose registers available on Core Intel processors, which are swapped between using a register alias table?

    As for the original question, while ReBaseImage() can be used to create an image in memory, perhaps you can find an altered version of LoadLibrary to make use of it, something like
    http://www.joachim-bauch.de/tutorials/load_dll_memory.html


  • Closed Accounts Posts: 1,567 ✭✭✭Martyr


    satchmo wrote:
    I'd be careful how you compare the two, they're inherently completely different processors. Besides the difference in cache latencies etc, the Cell's PPU uses in-order execution so you can't just execute the same instructions in the same order on both platforms and expect the performance to be comparable.

    POWER/POWERPC is a completely new architecture to me - a little more difficult to learn than x86, would you say?

    i'll probably spend time writing code in C, then analysing the assembly generated by GCC to begin with.

    The in-order execution point - algorithms will be running in parallel.
    Would you say fewer dependencies generate faster code?

    The PPU/SPEs both have 32 128-bit vector registers and 32 general purpose registers (not to mention 32 floating point/other special purpose registers), correct?

    I read in different places that the SPE has "128 registers", assuming the writer meant 32 x (4 x 32-bit) / VMX registers - just wanted some clarification.

    for speed, where is best place to store/read data?
    also, what is the maximum amount of memory i can address in one SPE?

    this info is probably all buried in the manuals somewhere, but i know you've experience in this area already - hope you don't mind answering.
    satchmo wrote:
    let us know how you get on.

    that could be some time, but will do.
    ressem wrote:
    Well, he's not using the extra registers available in x64 mode, so it must be a fairly specific benchmark that he is looking to create

    there is both x86 and x64 code.. i've just not installed 64-bit Windows yet.
    Fedora Core 8 Linux is running on the PS3.
    ressem wrote:
    As for the original question, while ReBaseImage() can be used to create an image in memory, perhaps you can find an altered version of LoadLibrary to make use of it, something like

    since the code is all assembly and there are only 1 or 2 API calls during the thread, it might be a good idea to allocate memory using VirtualAlloc(), specifying PAGE_EXECUTE_READWRITE, and copy the code/data there before calling CreateThread() on the address of the code.

    though it would mean having to calculate all the data offsets manually.. so i suppose in-memory execution (ReBaseImage() or something similar) would be the best solution so far.

