Notes on Lab #12

In this lab we will run timing experiments on a nontrivial parallel programming application that uses MPI: the Barnes-Hut algorithm for the N-Body Problem. We will also gain some experience running CUDA on an NVIDIA Graphics Processing Unit (GPU).

Part I: GalaxSee

Fortunately for us, the Shodor Foundation has developed a nice simulation of the N-Body Problem, called GalaxSee. As its name suggests, this simulation allows you to see a galaxy forming from a group of N stars, where (like the number of digits of π in the previous lab) N is a model parameter that you can control. Like many HPC simulations, GalaxSee runs on a completely different time scale from the events it is simulating: millions or billions of years of activity are simulated in a matter of seconds or minutes (in contrast with e.g. molecular dynamics simulations, where simulating a few nanoseconds of activity can take an entire day).

To get started, spend a few minutes playing with the web version (Java applet) of GalaxSee, which will give you some idea of a serial implementation of the algorithm. The New Galaxy button allows you to specify the number of stars, and if you click the Model Parameters button, you'll see that, just like Vensim, GalaxSee allows you to choose Euler or Runge-Kutta 4 for doing the numerical integration. For this warmup exercise, I'd suggest leaving all the parameters at their defaults, except for the number of stars, which you should set to 1000. Then use your wristwatch or the wall clock to get a rough sense of how many seconds it takes the galaxy to flatten out into a disk.

Once you've seen GalaxSee in action, it's time to download and run the MPI version. As usual with Unix-based projects, this will involve some hacking to get the code to run on our system. First, download this file, containing a zipped-up version of the HPC GalaxSee code, to the Downloads folder on your Mac. Now we need to transfer this file to hbar. There are popular WIMPy tools for doing this sort of file transfer, but we're not about popularity here, are we? Instead, we're going to learn and use another powerful Unix command, scp. As with the ssh command we learned last week, the s stands for secure. The cp stands for copy, so scp is a command that allows you to copy a file across the internet securely. Before we can use scp, however, we're going to need one more Unix command: cd, which stands for change directory. Why? Because by default the terminal program starts us in our home directory (like your H: drive, it's the one where all your personal stuff lives), and we saved the GalaxSee file to our Downloads directory. So, as you did last week, launch the Terminal application, but this time enter the equivalent (for your computer and username) of the following commands, shown in bold:

    32858-biolab09:~ levys$ cd Downloads
    32858-biolab09:Downloads levys$ scp Gal.tar levys@hbar:
    32858-biolab09:Downloads levys$ ssh -X levys@hbar

A few important things to note here: first, the mandatory colon at the end of the scp command, which lets scp know that you're specifying another computer (and not just another file) as the destination. Second, the -X in the ssh command, which tells your Mac to allow hbar to display graphics on it using the Unix X Window System. Finally, you have to type your hbar password twice: once for scp, and once for ssh.

So now we're on the hbar cluster, and we've got the Galaxy project copied to our home directory there. The project is packed into a single file (like a zip file), which we can unpack using our next Unix command, tar. The original meaning of tar was tape archive, but it's come to be associated with the term tarball, like a big ball of tar that packs all your files into one. To unpack the tar file, you can issue this command:

    [levys@HBAR ~]$ tar xvf Gal.tar

(The xvf part means "extract, verbosely, from the following file": x for extract, v for verbose, f for file.) This will create a new directory, Gal, which you can change to using the Unix cd command. This directory contains the source code (human-written/readable program code) for the GalaxSee project, which you can see using another Unix command, ls (list):

    [levys@HBAR ~]$ cd Gal
    [levys@HBAR Gal]$ ls


Now it's time to run our next Unix command, make, to build the "executable" GalaxSee program. Make is a wonderful Unix command that allows you to manage a project built from multiple source files, so you don't have to keep issuing the mpicc command by hand for each one. So:

    [levys@HBAR Gal]$ make

If you're watching carefully, you'll see make automatically issuing the mpicc command (the one you used in last week's lab) to compile the source files, but then you'll see a large number of errors. WTF?! Welcome to the world of open-source software, where there's no guarantee that anything works correctly "out of the box". If you were to copy-and-paste part of the error message (undefined reference to `std::ios_base::Init::Init()') into Google, you'd eventually see that the problem arises from using a C compiler to compile a program written in C++. In this case, if we edit the Makefile using vi, we see exactly that: the first line, CC = mpicc, tells make to use the MPI C compiler mpicc to compile our project. Warm up your vi fingers, grab a vi cheat sheet, and use vi to change this line to CC = mpic++. Save the Makefile, exit vi, and reissue the make command, and the program should compile without errors.1 Issuing the ls command again should reveal some new files, including the executable GalaxSee program, which will probably be highlighted with an asterisk, special color, or some other indication that it is an executable command that can be run.
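
For reference, the edit amounts to a single line near the top of the Makefile; after you save it, that line should read something like this (it originally said CC = mpicc, and the exact amount of whitespace around the = doesn't matter to make):

    CC = mpic++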

Now we're ready to start some timing experiments. GalaxSee is like the cpi program from last week's lab: being an MPI program, it allows you to specify the number of processes, and it also allows you to specify the problem size (for cpi, the number of digits of π; for GalaxSee the number of stars). So look at how you invoked cpi in last week's lab, and do the same thing for GalaxSee. As with cpi and the Java applet you ran, 1000 is a reasonably sized problem to work on, and as usual we'll start with one process. Time the run as you did with cpi last week, and put the time in an Excel spreadsheet for plotting later.
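
If you've forgotten the exact incantation, it will look something like the following. (I'm assuming here that GalaxSee takes the number of stars as its first command-line argument; running it with no arguments, or peeking at the source, will show you what it actually expects.)

    [levys@HBAR Gal]$ time mpirun -np 1 ./GalaxSee 1000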

With the cpi program we kept increasing the problem size by a factor of 10 until we got significant differences in run time. So try increasing the number of stars to 10000 and see what happens. Does it look like the program is going to terminate in a reasonable amount of time? If not, hit CTRL-C (hold down the Control key and press the C key), which is Unix's way of forcing a program to quit. It looks like we'll have to try a different strategy to determine how running time scales with problem size. I'll leave it to you to come up with a set of problem sizes that gives you plottable results in a reasonable amount of time, for a single process; one way to organize those runs is sketched below. In tomorrow's class we will discuss why this problem scales so differently from the digits-of-π problem.
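
A small shell loop keeps a sweep like this organized. The sizes below are just placeholders for whatever set you decide on, and as before I'm assuming the number of stars is GalaxSee's first argument:

    [levys@HBAR Gal]$ for n in 500 1000 2000 4000; do time mpirun -np 1 ./GalaxSee $n; done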

As you did last week, once you've found a problem size that takes a reasonable amount of time, experiment with the number of processes, and produce a second plot showing running time (or speedup) as a function of this number.
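
Again, a loop is a convenient way to run the whole sweep. Here the list of process counts is only a suggestion (adjust it to what hbar supports), and you should substitute the problem size you settled on:

    [levys@HBAR Gal]$ N=1000    # substitute your problem size here
    [levys@HBAR Gal]$ for p in 1 2 4 8; do time mpirun -np $p ./GalaxSee $N; done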

Part II: CUDA

Now that we've played a little with a message-passing architecture, let's try a shared-memory, multi-threaded architecture. As we discussed in class, Graphics Processing Units are the hot multi-threaded platform right now, and the CUDA architecture2 from NVIDIA is currently the most popular and widely available. Some Macs have an NVIDIA GPU on them and so can support CUDA, but the ones you're using right now do not -- so we'll do this part of the lab on hbar as well.

To get started, download cudatest.cu and serialtest.c to the Downloads folder on your Mac, and scp them to your hbar account as you did with the GalaxSee file. As their names suggest, these two programs will allow you to compare CUDA against ordinary (serial) execution of the same algorithm. To keep things simple, all this algorithm does is generate an array of random values and perform the same trivial operation (multiply by 1 and add 0) on each value a specified number of times (a sketch of this idea appears below, after the compile commands). As you did with the MPI program, you will have to compile these programs on hbar. So once you're back on hbar, issue the following commands:

    [levys@HBAR ~]$ cd
    [levys@HBAR ~]$ nvcc -o cudatest cudatest.cu
    [levys@HBAR ~]$ gcc -o serialtest serialtest.c

The cd command by itself puts you back in your home directory, where the two source files are. Note the parallel structure among all the Unix compiler commands you've been using (mpicc, mpic++, nvcc, gcc): all of them are either wrappers around, or rely on, the GNU Compiler Collection (GCC), which, until Linux came along, was probably the biggest open-source project in history.
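
To make the pattern concrete, here is a minimal sketch of the idea in CUDA. This is not the actual contents of cudatest.cu (read that for yourself); the kernel name, block size, and other details here are just illustrative, and the real programs may differ:

    /* A sketch only -- NOT the actual cudatest.cu.  Fill an array with random
       values, then have one GPU thread per element apply a trivial operation
       (multiply by 1, add 0) nops times. */
    #include <stdio.h>
    #include <stdlib.h>

    __global__ void trivial(float *a, int n, int nops)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) {
            float x = a[i];
            for (int k = 0; k < nops; k++)
                x = x * 1.0f + 0.0f;            /* the "work" */
            a[i] = x;
        }
    }

    int main(int argc, char **argv)
    {
        if (argc < 3) { printf("usage: sketch nrands nops\n"); return 1; }

        int n    = atoi(argv[1]);               /* nrands */
        int nops = atoi(argv[2]);               /* nops   */
        size_t bytes = n * sizeof(float);

        /* generate the random values on the host (CPU) side */
        float *h = (float *)malloc(bytes);
        for (int i = 0; i < n; i++)
            h[i] = rand() / (float)RAND_MAX;

        /* copy to the GPU, launch one thread per element, copy back */
        float *d;
        cudaMalloc((void **)&d, bytes);
        cudaMemcpy(d, h, bytes, cudaMemcpyHostToDevice);
        trivial<<<(n + 255) / 256, 256>>>(d, n, nops);
        cudaMemcpy(h, d, bytes, cudaMemcpyDeviceToHost);

        printf("first element: %f\n", h[0]);    /* keep the result "live" */
        cudaFree(d);
        free(h);
        return 0;
    }

The serial program presumably does the same work with an ordinary loop over the array on the CPU, which is exactly what makes the comparison interesting; reading the two source files side by side will show you what the <<<blocks, threads>>> launch syntax and the cudaMemcpy calls are for.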

Now that you've compiled the two programs, it's time to compare their performance. Naturally, you will do this by invoking the Unix time command:

    [levys@HBAR ~]$ time ./serialtest nrands nops
    [levys@HBAR ~]$ time ./cudatest nrands nops

where nrands is the number of random values to use and nops is the number of operations (again, just multiply by 1 and add 0) to perform on each value. Note the ./ in front of the program name (better yet, just copy-and-paste the relevant part), which tells Unix to look in the current directory for these programs (unnecessary with mpirun). Experiment with different large values of the two parameters to get a non-trivial execution time. Then experiment with changing one or both of the parameters until you see an obvious benefit from the CUDA version. Plot the execution times for the two programs under various values of these parameters. What do you see? Based on the messages output by the two programs at various stages, can you explain the pattern in your plots?
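
For instance, something along these lines is a reasonable place to start (the numbers here are arbitrary; scale them up or down until the serial run takes at least a few seconds):

    [levys@HBAR ~]$ time ./serialtest 10000000 100
    [levys@HBAR ~]$ time ./cudatest 10000000 100
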
1 You may be wondering why someone would release code that doesn't work. The answer is probably that the code did work on the developer's system when it was released, but it doesn't work on our system for one of several possible reasons -- the likeliest of which is a difference between the way the mpicc and mpic++ commands were set up on the original system and the way they are set up on ours. As we say in Unix, YMMV.

2 Is there some deep connection between HPC, classic rock, and muscle cars? You decide.