-
Notifications
You must be signed in to change notification settings - Fork 0
MPI bindings for distributed parallel calculations with GAP (BROKEN, NEEDS NEW MAINTAINER)
License
gap-packages/pargap
Folders and files
Name | Name | Last commit message | Last commit date | |
---|---|---|---|---|
Repository files navigation
The ParGAP package The ParGAP (Parallel GAP) package provides a way of writing parallel programs using the GAP language. Former names of the package were ParGAP/MPI and GAP/MPI; the word MPI refers to Message Passing Interface, a well-known standard for parallelism. ParGAP is based on the MPI standard, and this distribution includes a subset implementation of MPI, to provide a portable layer with a high level interface to BSD sockets. Since knowledge of MPI is not required for use of this software, we now refer to the package as simply ParGAP. For more information visit the author's ParGAP home page at: http://www.ccs.neu.edu/home/gene/pargap.html ParGAP may be obtained as `pargap-XXX.zoo' (for some version number XXX) from the same places as GAP. `pargap' is available for download via the GAP www page at http://www-gap.mcs.st-and.ac.uk/Packages/packages.html or, alternatively, `pargap-XXX.tar.gz' (which is assured to be the most recent version) can be obtained from the author's ftp site: ftp://ftp.ccs.neu.edu/pub/people/gene/pargap/ ParGAP has been tested on Linux (ELF), Solaris 2.6, OSF 1 (alpha) and OS X. MPI Libraries The ParGAP package uses an MPI (Message Passing Interface) library to communicate between processes. One such library, called MPINU, is included with this package, but you can also run ParGAP using a version of MPI that is already present on your system. When you install ParGAP it will try to find and use a system MPI implementation, and if not then it will use its own MPINU. MPINU only works on Unix variants such as Linux, Solaris and OS X, but it should be possible to port it to Windows under Cygwin. Using a system MPI implementation is recommended: they can have better performance, support more systems and are more robust. Two popular implementations that are known to work with ParGAP are MPICH2 http://www.mcs.anl.gov/research/projects/mpich2/ Open MPI http://www.open-mpi.org/ They can be downloaded from their website, or may be available via your operating system's standard package management mechanism. This version of ParGAP has an issue on Macs when when using both MPINU2 and a system MPI implementation, so we recommend using the original MPINU library on these systems. Installing the ParGAP package To install the ParGAP package, move the file `pargap-XXX.zoo' or `pargap-XXX.tar.gz' into the `pkg' directory in which you plan to install ParGAP. Usually, this will be the directory `pkg' in the hierarchy of your version of GAP 4. Also note that currently it is not possible to have the `pkg' directory separate from GAP's `pkg' directory; we hope to remedy this in future versions of ParGAP (so that it will also possible to keep an additional `pkg' directory in your private directories; section "ref:Installing GAP Packages" of the GAP 4 reference manual gives details on how to do this, when it's possible.) (If you are not a system administrator and your system administrator won't install ParGAP for you on the system and you don't have enough disk space in your own directory to create a whole new GAP, what you can do is create the illusion of having a complete version of GAP in your own directory using symbolic links (sorry! currently that's all we can offer.) Now change into the `pkg' directory in which you plan to install ParGAP. If you got a `.zoo' file, unpack it with: unzoo -x pargap-XXX If you got a `.tar.gz' file and your `tar' command supports the `z' option, unpack it with: tar zxf pargap-XXX.tar.gz or otherwise unpack in two steps with: gunzip pargap-XXX.tar tar xvf pargap-XXX.tar Whether you got the `.zoo' or `.tar.gz' archive you should now have a new directory `pargap'. As for a generic GAP package, do: cd pargap ./configure make If you have a system-wide implementation of MPI (such as MPICH2 or Open MPI) then GAP will also need to be rebuilt, and configure will stop with a message about this. If you are content for GAP to be rebuilt with no special configure options then run the ParGAP configure with ./configure --with-basic-gap-configure otherwise, follow the instructions given after running the ParGAP ./configure. Your ParGAP should now be ready to use. In the `bin' subdirectory there will be a script pargap.sh which you should use to start ParGAP. Edit the script if necessary, copy it to a standard path and rename it according to how you intend to call ParGAP (e.g. rename it: `pargap'). Running ParGAP To run ParGAP when built with a system MPI library, you need to use your system's MPI launcher. With both MPICH and Open MPI this is called mpiexec, and you can type mpiexec -n 3 pargap to run three copies of ParGAP, i.e. one master and two slaves (this assumes that you have renamed the script as `pargap'). If ParGAP was built using MPINU then you should run ParGAP by calling `pargap' directly. In this case, it looks for a `procgroup' file which defines the master and slave processes that will be used by ParGAP. A sample `procgroup' file can be found in the `bin' subdirectory, and when ParGAP is started this should be in the current directory, or the full path to the file supplied using the `-p4pg' option. Thus if you renamed your shell script `pargap', the following are valid ways of starting ParGAP: pargap (if current directory contains the file: `procgroup'), or pargap -p4pg myprocgroupfile (where `myprocgroupfile' is the complete path of your procgroup file - there is no restriction on how you name it). If you had trouble installing ParGAP, please see the next section of this file. Otherwise, try it out: gap> # This assumes your procgroup file includes two slave processes. gap> PingSlave(1); #a `true' response indicates Slave 1 is alive true gap> # Print() on slave appears on standard output gap> # i.e. after the master's prompt. gap> SendMsg( "Print(3+4)" ); gap> 7 gap> # A <return> was input above to get a fresh prompt. gap> # gap> # To get special characters (including newline: `\n') gap> # into a string, escape them with a `\'. gap> SendMsg( "Print(3+4,\"\\n\")" ); gap> 7 gap> # Again, a <return> was input above after the 7 and new-line gap> # were printed to get a fresh prompt. gap> # gap> # Each SendMsg() is normally balanced by a RecvMsg(). gap> SendMsg( "3+4", 2); gap> RecvMsg( 2 ); 7 gap> # The following is equivalent to the two previous commands. gap> SendRecvMsg( "3+4", 2); 7 gap> # The two SendMsg() commands that were sent to Slave 1 earlier have gap> # responses that are waiting in the message queue from that slave. gap> # Check that there is a message waiting. With some MPI implementations gap> # the message is not immediately available, but when ProbeMsg() does gap> # return true then RecvMsg() is guaranteed to succeed. gap> ProbeMsgNonBlocking( 1 ); false gap> ProbeMsgNonBlocking( 1 ); true gap> # Print() is a `no-value' functions, and so the result of a RecvMsg() gap> # in both these cases is "<no_return_val>". gap> RecvMsg( 1 ); "<no_return_val>" gap> RecvMsg( 1 ); "<no_return_val>" gap> # As with Print() the result of Exec() appears on standard gap> # output, and the result is "<no_return_val>". gap> SendRecvMsg( "Exec(\"pwd\")" ); # Your pwd will differ :-) /home/gene "<no_return_val>" gap> # Define a variable on a slave gap> SendRecvMsg( "a:=45; 3+4", 1 ); 7 gap> # Note "a" is defined on slave 1, not slave 2. gap> SendMsg( "a", 2 ); # Slave prints error, output on master gap> Variable: 'a' must have a value gap> # <return> entered to get fresh prompt. gap> RecvMsg( 2 ); # No value for last SendMsg() command "<no_return_val>" gap> RecvMsg( 1 ); 45 gap> # Execute analogue of GAP's List() in parallel on slaves. gap> squares := ParList( [1..100], x->x^2 ); [ 1, 4, 9, 16, 25, 36, 49, 64, 81, 100, 121, 144, 169, 196, 225, 256, 289, 324, 361, 400, 441, 484, 529, 576, 625, 676, 729, 784, 841, 900, 961, 1024, 1089, 1156, 1225, 1296, 1369, 1444, 1521, 1600, 1681, 1764, 1849, 1936, 2025, 2116, 2209, 2304, 2401, 2500, 2601, 2704, 2809, 2916, 3025, 3136, 3249, 3364, 3481, 3600, 3721, 3844, 3969, 4096, 4225, 4356, 4489, 4624, 4761, 4900, 5041, 5184, 5329, 5476, 5625, 5776, 5929, 6084, 6241, 6400, 6561, 6724, 6889, 7056, 7225, 7396, 7569, 7744, 7921, 8100, 8281, 8464, 8649, 8836, 9025, 9216, 9409, 9604, 9801, 10000 ] gap> # Send a large, local (non-remote) data structure to a slave gap> Concatenation("x := ", PrintToString([1..10]*2)); "x := [ 2, 4, 6, 8, 10, 12, 14, 16, 18, 20 ]\n\000" gap> SendMsg( Concatenation("x := ", PrintToString([1..10]*2)) ); gap> RecvMsg(); [ 2, 4, 6, 8, 10, 12, 14, 16, 18, 20 ] gap> # Send a local (non-remote) function to a slave gap> myfnc := function() return 42; end;; gap> # Use PrintToString() to define myfnc on all slave processes gap> BroadcastMsg( PrintToString( "myfnc := ", myfnc ) ); gap> SendRecvMsg( "myfnc()", 1 ); 42 gap> # Ensure problem shared data is read into master and slaves. gap> # Try one of your GAP program files instead. gap> ParRead( "/home/gene/myprogram.g"); The ParGAP package was designed and written by: Gene Cooperman College of Computer Science Northeastern University, Boston, MA, U.S.A. If you use ParGAP to solve a problem then please send a short email to `gene@ccs.neu.edu' about it, and reference the ParGAP package as follows: \bibitem[Coo99]{Coo99} Cooperman, Gene, {\sl Parallel GAP/MPI (ParGAP/MPI)}, Version 1, College of Computer Science, Northeastern University, 1999, \verb|http://www.ccs.neu.edu/home/gene/pargapmpi.html|. ========================================================================= Troubleshooting General problems 0. If you are using ParGAP on a Mac with MPINU2 or a system MPI implementation then {\ParGAP} may consistently crash on startup. If this is the case then try using MPINU instead by reconfiguring ParGAP with ./configure --with-mpi=MPINU This is a known issue which will be fixed in a forthcoming version. 1. Do you have enough swap space to support multiple GAP processes? A simple way to check this is with the UNIX command, `top'. The Linux version of `top' sorts by memory usage if you type `M'. 2. `make' tries to automatically create: pkg/pargap/bin/pargap.sh and copy the parameters from `<GAP_ROOT>/bin/gap.sh'. <GAP_ROOT> was specified when you executed `./configure <GAP_ROOT>' to install ParGAP. This can be error-prone if your site has an unusual setup. If you execute `<GAP_ROOT>/bin/gap.sh', does gap come up? If so, compare it with `pargap.sh' and check for correct settings in `.../pkg/pargap/bin/pargap.sh'? 3. Were the remote slave processes able to start up? If so, could they connect back to the master? To test connectivity problems, try manually starting a remote slave by executing a line in the script. Try a simple `ssh remote_hostname' to see if the issue is with security. 4. If the previous step failed due to security issues, such as requesting a password, you have several options. `man ssh' tells you the security model at your site. Then read "Problems with Passwords (Getting Around Security)" in the ParGAP manual in the `doc' directory. 5. Is `pargap' listed in `.../pkg/ALLPKG'? [It's needed to autostart slaves.] 6. Inside ParGAP, has MPI been successfully initialized? Try: gap> MPI_Initialized(); 7. A remote (slave) ParGAP process starts in your home directory and tries to cd to a directory of the same name as your local directory. Check your assumptions about the remote machine. Try: gap> SendRecvMsg("Exec(pwd)"); SendRecvMsg("UNIX_Hostname()"); gap> SendRecvMsg("UNIX_Getpid()"); 8. Every ParGAP slave process displays its GAP banner and startup messages on the terminal of the master process. If you have many slaves and do not wish to see these messages, then pass the `-b' and/or `-q' switches to {\ParGAP} when it starts, to disable the banner or all messages respectively. See the GAP Reference Manual for further details. 9. Read the documentation for further possible problems. Problems when using MPINU 1. Did ParGAP find your `procgroup' file? [It looks in the current directory for `procgroup', or for: ... -p4pg PATH/procgroup on the command line.] 2. Is the `procgroup' file in your current directory set correctly? Test it. If you are calling it on a remote host, manually type: ssh <HOSTNAME> <ParGAP> where <HOSTNAME> and <ParGAP> appear exactly as in `procgroup', e.g. ssh denali.ccs.neu.edu /usr/local/gap4r3/bin/pargap.sh In some cases, `exec' is used to save process overhead. Also try: ssh <HOSTNAME> exec <ParGAP> If you plan to call it on localhost, try just: <ParGAP> Note that if not all the slave processes succeed in connecting to the master, then ParGAP writes out a file: /tmp/pargapmpi-ssh.$$ where $$ is replaced by the the process id of the ParGAP process. 3. If the connection dies at random, after some period of time: You can experiment with SO_KEEPALIVE and variants. (man setsockopt) This periodically sends *null messages* so the remote machine does not think that the originating machine is dead. However, if the remote machine fails to reply, the local process sends a SIGPIPE signal to notify current processes of a broken socket, even though there might have been only a temporary lapse in connectivity. `ssh' specifies `KeepAlive yes' by default, but setting `KeepAlive no' might get you through some transient lapses in connectivity due to high congestion. You may also want to experiment with: `setenv SSH "ssh -n"' 4. If a host is on multiple networks, it will have multiple IP addresses and usually multiple hostnames. In this case, the master process cannot always guess correctly which IP address (which internet address) should be passed to the slave process, so that the slave process can call back to the master. In such cases, you may need to tell {\ParGAP} which hostname or IP address to use for the callback. This is done by setting the UNIX environment variable, `CALLBACK_HOST', as in the example below. # [ in sh/bash/... ] CALLBACK_HOST=denali.ccs.neu.edu; export CALLBACK_HOST # [ in csh/tcsh/... ] setenv CALLBACK_HOST=denali.ccs.neu.edu The appropriate line for your shell can be placed in your shell initialization file. Alternatively, you can set this up for all users by placing the Bourne shell version (for `sh') somewhere between the first and last line of `.../pkg/pargap/bin/pargap.sh'. 5. ParGAP is supplied with two different versions of MPINU: the original MPINU and a later version, MPINU2, and it will also work with other MPI libraries if they are present on your system. By default, if you do not have a system MPI implementation then MPINU2 is used. If you have problems which appear to be MPI-related, try rebuilding ParGAP with a different MPI library. For example, to use MPINU instead of MPINU2 then run configure using ./configure --with-mpi=MPINU Problems when using a system MPI library 1. Line editing at the GAP command prompt is unlikely to work when ParGAP is invoked with an MPI launcher, since they tend to do their own processing of the terminal I/O (stdin/stdout/stderr) which does not work well either the readline library used in newer versions of GAP or the in-built terminal editing in earlier versions of GAP. It may be useful to run ParGAP through the `rlwrap' utility, if available. For example, if ParGAP is run using `mpiexec', then try rlwrap mpiexec -n 3 pargap This should restore some of the line editing, although tab completion is limited to commands that `rlwrap' has already seen you use. For more information, try `man rlwrap'. 2. The command `FlushAllMsgs()' is not available when using a system MPI implementation, since it tests show that `ProbeMsgNonBlocking()', which it uses cannot be relied upon to always return `true' the first time that it is called after a message has been sent. If your system MPI implementation does exhibit this desired behaviour for `ProbeMsgNonBlocking()' then you can install your own local copy of `FlushAllMsgs()' by copying the code for this function from `lib/slavelist.g', removing the `if' statement and renaming the function. 3. The command `ParReset()' (see~"ParReset") is not available when using a system MPI implementation. When using a MPINU library, the slaves are launched by ParGAP itself and so can be contacted and restarted, but with a system MPI library the slaves are launched by `mpiexec' (or whichever MPI launcher you use) and so cannot be reset from within ParGAP. There is no known workaround for this. 4. GAP and, in particular, the \package{IO} Package install handlers for the SIGCHLD signal. Many implementations of MPI also install their own SIGCHLD handler, which may then conflict with {\ParGAP}. Testing has revealed no issues, but we cannot guarantee that there will be no interaction between the two. In particular, this may result in temporary files not being cleaned up properly. 5. The GAP memory manager, GASMAN, can run into problems extending the GAP workspace if external libraries use `malloc' to allocate their own memory. MPINU avoids the use of `malloc' as much as possible, but system MPI implementations may not be as careful. This can be resolved by starting ParGAP with the `-s' command-line switch, which asks ParGAP to pre-allocate memory before it starts. You can safely pre-allocate more memory than you will actually need since physical memory will only be mapped when it is actually used, so for example you could allocate 3Gb: mpiexec -n 3 pargap -s 3g The `-a' and `-m' switches can also be used to control memory usage. See the GAP Reference Manuel for further information. News of any other issues or solutions would be gratefully accepted. ========================================================================= Final Notes Note that this package modifies the GAP `src' and `bin' files, and creates a new GAP kernel. This new GAP kernel can be shared by traditional users of the old, sequential GAP kernel, and by those doing parallel processing. The GAP kernel will have identical behavior to the old GAP kernel when invoked through the gap.sh script or the `bin/@GAParch@/gap' binary. The new ParGAP variables will appear to the end user _ONLY_ if the GAP binary was invoked as `pargapmpi': a symbolic link to the actual GAP binary. The script, `pargap.sh', does this. So, in a multi-user environment, traditional users can continue to use `gap.sh' without noticing any difference. Only an invocation as `pargap.sh' will add the new features. Comments and contributions to a ParGAP user library, or any other type of assistance, are gratefully accepted. Gene Cooperman gene@ccs.neu.edu
About
MPI bindings for distributed parallel calculations with GAP (BROKEN, NEEDS NEW MAINTAINER)
Resources
License
Stars
Watchers
Forks
Packages 0
No packages published