
low level implementations of some mpi calls

  • 06-04-2004 5:21pm
    #1
    Closed Accounts Posts: 191 ✭✭


    Say I wanted to do a few low-level implementations of mpi_send, mpi_recv etc. for parallel processing, and I wanted to lower the latency involved. Would it be better to implement my own network driver (it doesn't have to be portable) for a specific setup in kernel space, bypassing as much of the kernel internals as possible, or should I do some fancy virtual address remapping, i.e. remap the device I/O into user space with mmap to allow direct access to the hardware, to maximise performance and lower latency?

    In short: is it better to bring TCP/IP into user space, or to implement what I want at the kernel level, to maximise performance and lower latency for networking?
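
    Something like this is roughly what I mean by the mmap route; the device node, register offsets and window size are all made up for illustration, since the real ones would depend on the driver:

        /* Sketch: map a NIC's register window into user space through a
         * hypothetical character device /dev/mynic exposed by a custom driver.
         * The device name, offsets and sizes are assumptions, not a real API. */
        #include <fcntl.h>
        #include <stdint.h>
        #include <stdio.h>
        #include <sys/mman.h>
        #include <unistd.h>

        #define REG_WINDOW_SIZE 4096     /* assumed size of the mapped BAR    */
        #define TX_DOORBELL     0x40     /* assumed doorbell register offset  */

        int main(void)
        {
            int fd = open("/dev/mynic", O_RDWR | O_SYNC);
            if (fd < 0) { perror("open"); return 1; }

            /* The driver's mmap() handler would back this with the PCI BAR
             * (e.g. via remap_pfn_range), so stores here hit the card directly. */
            volatile uint32_t *regs = mmap(NULL, REG_WINDOW_SIZE,
                                           PROT_READ | PROT_WRITE,
                                           MAP_SHARED, fd, 0);
            if (regs == MAP_FAILED) { perror("mmap"); return 1; }

            regs[TX_DOORBELL / 4] = 1;   /* poke the (pretend) TX doorbell */

            munmap((void *)regs, REG_WINDOW_SIZE);
            close(fd);
            return 0;
        }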


Comments

  • Closed Accounts Posts: 5,564 ✭✭✭Typedef


    For lower-latency networking you might consider UDP, which has a smaller per-packet overhead than TCP, though I'm unconvinced that managing the UDP packets yourself doesn't somewhat negate the fundamental 'performance gain' here.

    But UDP doesn't guarantee packet delivery the way TCP does, so you'd have to benchmark whether, in a high-yield networked environment, UDP actually earns its keep against the seamlessness of TCP.
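
    For what it's worth, the UDP send path itself is tiny; something like this, where the peer address and port are just placeholders:

        /* Minimal UDP sender: no connection setup and no retransmission, so any
         * reliability has to be layered on top by the application.
         * The peer address and port below are placeholders. */
        #include <arpa/inet.h>
        #include <netinet/in.h>
        #include <stdio.h>
        #include <string.h>
        #include <sys/socket.h>
        #include <unistd.h>

        int main(void)
        {
            int s = socket(AF_INET, SOCK_DGRAM, 0);
            if (s < 0) { perror("socket"); return 1; }

            struct sockaddr_in peer;
            memset(&peer, 0, sizeof(peer));
            peer.sin_family = AF_INET;
            peer.sin_port   = htons(9000);                      /* placeholder port */
            inet_pton(AF_INET, "192.168.0.2", &peer.sin_addr);  /* placeholder node */

            const char msg[] = "halo exchange payload";
            if (sendto(s, msg, sizeof(msg), 0,
                       (struct sockaddr *)&peer, sizeof(peer)) < 0)
                perror("sendto");

            close(s);
            return 0;
        }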

    If in doubt, throw network hardware at the problem, since in the real world you'll see much bigger latency gains from a 1000 Mbit network than you'd ever see from allowing mmap of your Ethernet device, obviously, I guess.

    I'd be intrigued to hear/see code, if possible, for what you're trying to accomplish with parallelism here.

    Perhaps a well-designed client/server topology distributed around the network, where at the client end (if SMP or P4) threading is used to delegate tasks locally on the node, would yield better performance gains than hacking at the network interface; a rough sketch of that local fan-out follows.
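
    This is only a toy with a made-up work_chunk() function, but it's the shape of what I mean by delegating locally on an SMP node:

        /* Sketch: an SMP client node fanning a received work unit out to local
         * worker threads; work_chunk() and NCPUS are made up for illustration. */
        #include <pthread.h>
        #include <stdio.h>

        #define NCPUS 2   /* e.g. a dual-CPU node */

        static void *work_chunk(void *arg)
        {
            int id = *(int *)arg;
            printf("thread %d crunching its slice of the data set\n", id);
            return NULL;
        }

        int main(void)
        {
            pthread_t tid[NCPUS];
            int ids[NCPUS];

            for (int i = 0; i < NCPUS; i++) {
                ids[i] = i;
                pthread_create(&tid[i], NULL, work_chunk, &ids[i]);
            }
            for (int i = 0; i < NCPUS; i++)
                pthread_join(tid[i], NULL);
            return 0;
        }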

    Fundamentally, I guess it depends on the nature of the computation you're trying to accomplish, i.e. if it is going to be network-intensive I'd first of all question whether a cluster 'should' have network-intensive operations being performed at all, as against giving each client node a large data set to churn away at, keeping network traffic relatively minimal.


  • Closed Accounts Posts: 191 ✭✭vinks


    Well, the setup I have is four 3 GHz Xeon systems (4 nodes of a 160-node cluster), and each node is currently equipped with one Myrinet NIC (2 Gbit/s), soon to be upgraded to two Myrinet NICs per machine on two switches for testing. The nature of the computations is life sciences, whose data sets may be gigabytes to terabytes in size depending on how well optimised the code is. BTW, I also have the option of running in single-user mode to keep the scheduler out of the way, so threading may or may not help.

    As for the topology of the setup, I'm hoping to get it running in a 3D-torus-type layout, but this project is just at the planning stage and I'm exploring how feasible it's going to be to mess with kernel internals and MPI internals.

    I have considered using UDP as well, since over direct p2p connections I can be almost guaranteed that the packets will be delivered; I may or may not have those direct p2p links (if I can get more NICs) to try this out.


  • Closed Accounts Posts: 5,564 ✭✭✭Typedef


    Hmm.

    Ostensibly, I'm not 100% sure how much of a gain you'd get from hacking at the kernel internals to bypass internal checks. Maybe add an mmap if whatever driver it is doesn't support it, but as regards 'bypassing' internal checks in the kernel... hmm... I think that sort of thing has the potential to backlash very nastily.

    Maybe not.

    Err, you should post the code of anything you change though, so we can steal... I mean... so the community can benefit in kind.


    Beths.


  • Closed Accounts Posts: 191 ✭✭vinks


    Well, currently with LAM/MPI, when you code up a program the layers are generally:

    your program -> mpi_send / mpi_recv -> lamd (communication with the other nodes, node discovery etc... this is a little quicker in MPICH, I think) -> sockets -> TCP/IP -> kernel interface for the networking driver -> PCI bus interface -> the physical layer.

    What I want is "my program -> mpi_send / mpi_recv -> hardware", if possible, or else it may be a case of "my program -> mpi_send/recv at the kernel -> hardware".
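
    To make that concrete, this is the rough shape of the send path I'm imagining: copy the payload into a user-mapped send ring and poke a doorbell. The ring layout, the doorbell and the function name are all made up; this isn't LAM/MPICH or the Myrinet GM API, and the static array here just stands in for something that would really come from mmap():

        /* Sketch of an MPI_Send-shaped call writing straight into a (pretend)
         * user-mapped NIC send ring.  Everything here is illustrative. */
        #include <stdint.h>
        #include <string.h>

        #define RING_SLOTS  64
        #define MAX_PAYLOAD 2048

        struct send_slot {
            volatile uint32_t ready;    /* non-zero while the NIC owns the slot */
            uint32_t          dest;     /* destination node id                  */
            uint32_t          len;
            uint8_t           payload[MAX_PAYLOAD];
        };

        static struct send_slot  send_ring[RING_SLOTS]; /* would be mmap()ed     */
        static volatile uint32_t doorbell;              /* would be a device reg */
        static uint32_t          head;

        int my_mpi_send(const void *buf, uint32_t len, uint32_t dest)
        {
            struct send_slot *slot = &send_ring[head];

            if (slot->ready || len > MAX_PAYLOAD)
                return -1;                   /* ring full or message too big */

            memcpy(slot->payload, buf, len);
            slot->dest  = dest;
            slot->len   = len;
            slot->ready = 1;                 /* hand the slot to the NIC     */
            doorbell    = head;              /* poke the (pretend) hardware  */

            head = (head + 1) % RING_SLOTS;
            return 0;
        }

        int main(void)
        {
            const char msg[] = "hello node 1";
            return my_mpi_send(msg, sizeof(msg), 1) == 0 ? 0 : 1;
        }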

    I dunno if my code is going to be of any use to anyone; it's going to be very hardware-specific ;)


  • Closed Accounts Posts: 5,564 ✭✭✭Typedef


    Err.

    But if you cut out the lamd node-discovery phase, how will you run the code on the dnet/cluster?


  • Closed Accounts Posts: 191 ✭✭vinks


    I may have direct p2p connections, with some hard-coded routing tables etc... I generally know how the machines are going to be wired up, how many there are, etc., so discovery isn't an issue.
         m - m
         |   |    <- this is the basic unit, to build my
         m - m        grid of machines.
    

    The edges don't wrap; if I can get more NICs, I might see about getting the grid to wrap around itself.
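
    For that 2x2 unit, the hard-coded routing table would look something like this (the node numbering and the next_hop convention are just made up for illustration):

        /* Sketch of a hard-coded neighbour table for one 2x2 unit:
         *     0 - 1
         *     |   |
         *     2 - 3
         * Node ids and the next_hop convention are made up. */
        #include <stdio.h>

        #define NNODES 4

        /* neighbour[i][j] = 1 if node i has a direct link to node j */
        static const int neighbour[NNODES][NNODES] = {
            /*        0  1  2  3 */
            /* 0 */ { 0, 1, 1, 0 },
            /* 1 */ { 1, 0, 0, 1 },
            /* 2 */ { 1, 0, 0, 1 },
            /* 3 */ { 0, 1, 1, 0 },
        };

        /* Next hop from src towards dst: use the direct link if there is one,
         * otherwise bounce through the first common neighbour. */
        static int next_hop(int src, int dst)
        {
            if (neighbour[src][dst])
                return dst;
            for (int via = 0; via < NNODES; via++)
                if (neighbour[src][via] && neighbour[via][dst])
                    return via;
            return -1;
        }

        int main(void)
        {
            printf("0 -> 3 goes via node %d\n", next_hop(0, 3));
            return 0;
        }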


  • Closed Accounts Posts: 5,564 ✭✭✭Typedef


    I suppose it depends on whether or not you have sanction to spend the time on that kind of hacking?

    If you do, then I don't see why you should deny yourself that sort of fun, do you?

