Notes on Lab #12
In this lab we will run timing experiments on a nontrivial parallel programming application: an MPI
simulation of the N-Body Problem. We will also
gain some experience in running CUDA on the NVIDIA Graphics Processing Unit.
Part I: GalaxSee
Fortunately for us, the Shodor Foundation has developed a nice
simulation of the N-Body Problem, called GalaxSee. As its name suggests, this simulation
allows you to see a galaxy forming from a group of N stars, where (like the number of digits
of π in the previous lab) N is a model parameter that you
can control. Like many HPC simulations, GalaxSee runs on a completely different time scale from
the events it is simulating: millions or billions of years of activity are simulated in a matter of
seconds or minutes (in contrast with e.g.
molecular dynamics simulations,
where simulating a few nanoseconds of activity can take an entire day).
To get started, spend a few minutes playing with the
Java applet version of GalaxSee, which will give you some idea of a serial implementation of the algorithm.
The New Galaxy button allows you to specify the
number of stars, and if you click the Model Parameters button, you'll see that, just like Vensim, GalaxSee allows you to choose
Euler or Runge-Kutta 4 for doing the numerical integration. For this warmup exercise, I'd suggest leaving all the parameters
at their defaults, except for the number of stars, which you should set to 1000. Then use your wristwatch or the wall clock to get a rough
sense of how many seconds it takes the galaxy to flatten out to a disk.
Once you've seen GalaxSee in action, it's time to download and run the MPI version. As usual with Unix-based projects, this will involve
some hacking to get the code to run on our system. First, download the
Gal.tar file, containing a zipped-up version of the HPC GalaxSee code, to the Downloads folder
on your Mac. Now we need to transfer this file to hbar. There are popular
WIMPy tools for doing this sort of file transfer, but we're not about
popularity here, are we? Instead, we're going to learn and use another powerful Unix command,
scp. As with the ssh command we learned last week, the s stands for secure. The cp stands for copy,
so scp is a command that allows you to copy a file across the internet securely. Before we can use scp, however, we're going to need one more
Unix command: cd, which stands for change directory. Why? Because by default the terminal program starts us in our home directory
(like your H: drive, it's the one where all your personal stuff lives), and we saved the GalaxSee file to our Downloads directory. So, as you did last week,
launch the Terminal application, but this time enter the equivalent (for your computer and username) of the following commands, shown in bold:
32858-biolab09:~ levys$ cd Downloads
32858-biolab09:Downloads levys$ scp Gal.tar levys@hbar:
32858-biolab09:Downloads levys$ ssh -X levys@hbar
A few important things to note here: first, the mandatory colon at the end of the
scp command, which lets scp know that you're specifying another computer
(and not just another file) as the destination. Second, the -X in the
ssh command, which tells your Mac to allow hbar to display graphics on it using the
Unix X Window System. Finally,
you have to type your hbar password twice: once for scp, and once for ssh.
So now we're on the hbar cluster, and we've got the Galaxy project copied to our home directory there. The project is packed into a single file (like a zip
file), which we can unpack using our next Unix command, tar. The original meaning of tar was tape archive, but it's come to be
associated with the term tarball, like a big ball of tar that packs all your files into one. To unpack the tar file, you can issue this command:
[levys@HBAR ~]$ tar xvf Gal.tar
(The xvf flags mean "extract, verbosely, from the following file".) This will create a new
directory, Gal, which you can change to using the Unix cd (change directory)
command. This directory contains the source code (human-written/readable program code)
for the GalaxSee project, which you can see using ls:
[levys@HBAR ~]$ cd Gal
[levys@HBAR Gal]$ ls
Now it's time to run our next Unix command, make, to build the "executable" GalaxSee
program. Make is a wonderful Unix command that allows you to manage a project
built from multiple source files, so you don't have to keep issuing the mpicc command yourself, for example.
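To give you a feel for what make reads, here is a minimal, purely illustrative Makefile -- not GalaxSee's actual one, and the source file names here are made up:

```makefile
# Illustrative only -- not the actual GalaxSee Makefile.
CC = mpicc                  # which compiler command to use

# target: the files it depends on
GalaxSee: main.o force.o
	$(CC) -o GalaxSee main.o force.o    # command that builds the target

main.o: main.cpp
	$(CC) -c main.cpp

force.o: force.cpp
	$(CC) -c force.cpp
```

When you type make, it rebuilds only the pieces whose source files have changed, then relinks the program.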
[levys@HBAR Gal]$ make
If you're watching carefully, you'll see make automatically issuing the mpicc command
(the one you used in last week's lab) to compile the source files, but then you'll see a large
number of errors. WTF?! Welcome to the world of open-source, where there's no guarantee that
anything works correctly "out of the box". If you were to copy-and-paste part of the error message
(undefined reference to `std::ios_base::Init::Init()') to Google, you'd eventually
see that the problem arises from using a C compiler to compile a program written in C++. In this
case, if we edit the Makefile using vi we see exactly that: the first line,
CC = mpicc, tells Unix to use the MPI C compiler mpicc to compile
our project. Warm up your vi fingers, and use vi to change this line to
CC       = mpic++. Save the Makefile, exit vi, and reissue the make command, and the program should compile
without errors.1 Issuing the ls command again should
reveal some new files, including the executable GalaxSee program, which will probably
be highlighted with an asterisk, special color, or some other indication that it is an executable
command that can be run.
Now we're ready to start some timing experiments. GalaxSee is like the cpi program
from last week's lab: being an MPI program, it allows you to specify
the number of processes, and it also allows you to specify the problem size (for cpi, the
number of digits of π; for GalaxSee the number of stars). So look at how you invoked
cpi in last week's lab, and do the same thing for GalaxSee. As with cpi
and the Java applet you ran, 1000 is a reasonably sized problem to work on, and as usual we'll
start with one process. Time the run as you did with cpi last week, and put the time
in an Excel spreadsheet for plotting later.
With the cpi program we kept increasing the problem size by a factor of 10 until we got
significant differences in run time. So try increasing the number of stars to 10000 and see
what happens. Does it look like the program is going to terminate in a reasonable amount of
time? If not, hit CTRL-C (the Control key at lower left and the C key
simultaneously), which is Unix's way of forcing a program to quit. It looks like we'll have to
try a different strategy to determine how running time scales with problem size. I'll leave it
to you to come up with a set of problem sizes that gives you plottable results in a reasonable
amount of time, for a single process. In tomorrow's class we will discuss why this problem
scales so differently from the digits-of-π problem.
As you did last week, once you've found a problem size that takes a reasonable amount of time,
experiment with the number of processes, and produce a second plot showing running time
(or speedup) as a function of this number.
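For reference, the standard definition of speedup on p processes is the one-process running time divided by the p-process running time:

```latex
S(p) = \frac{T(1)}{T(p)}
```

Perfect (linear) speedup would be S(p) = p; communication overhead among the MPI processes will usually keep your measured speedup below that line.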
Part II: CUDA
Now that we've played a little with a message-passing architecture, let's try a shared-memory,
multi-threaded architecture. As we discussed in class, Graphics Processing Units are the
hot multi-threaded platform right now, and the CUDA
architecture 2 from NVIDIA
is currently the most popular and widely available. Some Macs have an NVIDIA GPU on them
and so can support CUDA, but the ones you're using right now do not --
so we'll do this part of the lab on hbar as well.
To get started, download
cudatest.cu and serialtest.c onto your
Mac, and scp them to your hbar account as you did with the Gal.tar file.
As their names suggest, these two programs will allow you to compare CUDA against ordinary
(serial) execution of the same algorithm. To keep things simple, all this algorithm does is
generate an array of random values and perform the same trivial operation (multiply by 1 and
add 0) to each value a specified number of times. As you did with the MPI programs, you will
have to compile these programs on hbar. So once you're back on hbar, issue the following commands:
[levys@HBAR ~]$ cd
[levys@HBAR ~]$ nvcc -o cudatest cudatest.cu
[levys@HBAR ~]$ gcc -o serialtest serialtest.c
The cd command by itself puts you back in your home directory, where the two source files are.
Note the parallel structure among all the Unix compiler commands you've been using
(mpicc, mpic++, nvcc, gcc): all these compilers are built
on the Gnu Compiler Collection
(GCC), which, until Linux came along, was probably the biggest open-source project in history.
Now that you've compiled the two programs, it's time to compare their performance. Naturally, you
will do this by invoking the Unix time command:
[levys@HBAR ~]$ time ./serialtest nrands nops
[levys@HBAR ~]$ time ./cudatest nrands nops
where nrands is the number of random values to use and nops is the number of
operations (again, just multiply by 1 and add 0) to perform on each value. Note the
./ in front of the program name (better yet, just copy-and-paste the
relevant part), which tells Unix to look in the current directory for these programs
(unnecessary with mpirun). Experiment with different large values of the two parameters
to get a non-trivial execution time. Then experiment with changing one or both of the parameters
until you see an obvious benefit from the CUDA version. Plot the execution times for the
two programs under various values of these parameters. What do you see? Based on the messages
output by the two programs at various stages, can you explain the pattern in your plots?
1 You may be wondering why someone would release code that doesn't
work. The answer is probably that the code did work on the developer's system when it was released,
but it doesn't work on our system for one of several possible reasons -- the likeliest of which is
a difference between the way the mpicc and mpic++ commands were set up on the
original system and the way they are set up on ours.
2 Is there some deep connection between HPC and CUDA? You decide.