Parallel Programming with Split-C

(CS 267, Feb 7 1995)

Split-C was designed at Berkeley and is intended for distributed memory multiprocessors. It is a small SPMD extension to C, meant to support programming in data parallel, message passing, and shared memory styles. Like C, Split-C is "close" to the machine, so understanding performance is relatively easy. Split-C is portable, and runs on the

  • Thinking Machines CM-5,
  • Intel Paragon,
  • IBM SP-2,
  • Meiko CS-2,
  • Cray T3D,
  • Sun multiprocessors (e.g. a quad-processor SS-10 or SS-20 running Solaris), and
  • NOW (network of workstations).

    The best document from which to learn Split-C is the tutorial Introduction to Split-C. There is a debugger available as well: Mantis.

    Extensions of Split-C to include features of multithreading (as introduced in Lecture 6) and C++ classes are under development, and will be released soon.

    We begin with a general discussion of Split-C features, and then discuss the solution to Sharks & Fish problem 1 in detail. The most important features of Split-C are

  • An SPMD programming style. There is one program text executed by all processors. (A minimal example appears just after this list.)
  • A 2-dimensional address space for the entire machine's memory. Every processor can access every memory location via addresses of the form (processor number, local address). Thus, we may view the machine memory as a 2D array with one row per processor, and one column per local memory location. For example, in the following figure we have shaded in location (1,4).

  • Global pointers. These pointers are global addresses of the form just described, (processor number, local address), and can be used much as regular C pointers are used. For example, the assignment
    *local_pointer = *global_pointer
    
    gets the data pointed to by the global_pointer, wherever it resides, and stores the value at the location indicated by local_pointer.
  • Spread Arrays. These are 2- (or more-) dimensional arrays that are spread across processor memories. For example, A[i][j] may refer to word j on processor i. Spread arrays and global pointers together support a kind of shared memory programming style.
  • Split phase assignment. In the example above, "*local_pointer = *global_pointer", execution of the statement must complete before the program continues. If this requires interprocessor communication, the processor remains idle until *global_pointer is fetched. It is possible to overlap computation and communication by beginning this operation, doing other useful work, and waiting later for it to complete. This is done as follows:
          *local_pointer := *global_pointer
          ... other work not requiring *local_pointer ...
          synch()
    
    The "split-phase" assignment operator := initiates the communication, and synch() waits until it is complete.
  • Atomic Operations are short subroutines which are guaranteed to be executed by one processor at a time. They provide an implementation of mutual exclusion, and the body of the subroutine is called a critical section.
  • A library, including extensive reduction operations, bulk memory moves, etc.
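
    To make these features concrete, here is a minimal sketch of an SPMD Split-C program (not taken from the notes; the entry point splitc_main and the omitted Split-C header are assumptions about the programming environment, and spread arrays are explained in more detail below). Every processor runs the same program text; MYPROC names the executing processor and PROCS is the number of processors:

        #include <stdio.h>
        /* The Split-C header defining PROCS, MYPROC, barrier(), etc. is assumed
           to be provided by the compilation environment.  splitc_main replaces
           main() as the entry point run on every processor.                    */

        static int val[PROCS]::[1];     /* one word per processor (a spread array) */

        splitc_main(int argc, char **argv)
        {
            int p, sum;

            val[MYPROC][0] = MYPROC;    /* each processor writes its own word      */
            barrier();                  /* wait until every processor has written  */

            if (MYPROC == 0) {          /* processor 0 reads everyone's word       */
                sum = 0;
                for (p = 0; p < PROCS; p++)
                    sum += val[p][0];   /* remote reads via the 2D address space   */
                printf("sum of processor numbers = %d\n", sum);
            }
            barrier();
        }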

    Pointers to Global Data

    There are actually three kinds of pointers in Split-C:
  • local pointers,
  • global pointers, and
  • spread pointers.

    Local pointers are standard C pointers, and refer to data only on the local processor. The other pointers can point to any word in any memory, and consist of a pair (processor number, local pointer). Spread pointers are associated with spread arrays, and will be discussed below. Here are some simple examples to show how global pointers work. First, pointers are declared as follows:
        int *Pl, *Pl1, *Pl2;                        /*  local pointers   */
        int *global Pg, *global Pg1, *global Pg2;   /*  global pointers  */
        int *spread Ps, *spread Ps1, *spread Ps2;   /*  spread pointers  */
    
    The following assignment sends messages to fetch the data pointed to by Pg1 and Pg2, brings them back, and stores their sum locally:
        *Pl = *Pg1 + *Pg2
    
    Execution does not continue until the entire operation is complete. Note that the programs on the processors owning the data pointed to by Pg1 and Pg2 do not have to cooperate in this communication in any explicit way. Thus, it is very much like a shared memory operation, although it is implemented on a distributed memory machine: in effect the processors owning the remote data are interrupted, the data is fetched and sent back to the requesting processor, and the owners continue. In particular, there is no notion of matched sends and receives as in a message passing programming style. Rather than calling this a send or receive, the operation is called a get, to emphasize that the processor owning the data need not anticipate the request for it.

    The following assignment stores data from local memory into a remote location:

        *Pg = *Pl
    
    As before, the processor owning the remote data need not anticipate the arrival of the message containing the new value. This operation is called a put.

    Global pointers permit us to construct distributed data structures which span the whole machine. For example, the following declares a binary tree which spans processors. The nodes of this tree can reside on any processor, and traversing the tree in the usual fashion, following pointers to child nodes, works without change.

         typedef struct global_tree *global gt_ptr;
         typedef struct global_tree {
             int value;
             gt_ptr left_child;
             gt_ptr right_child;
         } g_tree;
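
    As an illustration (a sketch, not from the notes, assuming the tree is kept ordered as a binary search tree), a search routine can follow child pointers across processors exactly as it would follow ordinary pointers; each access through the global pointer performs a get from whichever processor owns the node:

         /* Search the distributed tree for a key; a node may live on any processor. */
         int tree_search(gt_ptr node, int key)
         {
             int v;
             while (node != NULL) {
                 v = node->value;       /* get the value of the (possibly remote) node */
                 if (v == key)
                     return 1;          /* found */
                 node = (key < v) ? node->left_child : node->right_child;
             }
             return 0;                  /* not found */
         }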
    
    We will discuss how to design good distributed data structures later when we discuss the Multipol library.

    Global pointers offer us the ability to write more complicated and flexible programs, but also introduce new kinds of bugs. The following code illustrates a race condition, where the answer depends on which processor executes "faster". Initially, processor 3 owns the data pointed to by global pointer i, and its value is 0:

            Processor 1             Processor 2
            *i = *i + 1             *i = *i + 2
            barrier()               barrier()
            print 'i=', *i
    
    It is possible to print out i=1, i=2 or i=3, depending on the order in which the 4 global accesses to *i occur. For example, if
      processor 1 gets *i (=0)
      processor 2 gets *i (=0)
      processor 1 puts *i (=0+1=1)
      processor 2 puts *i (=0+2=2)
    
    then processor 1 will print "i=2". We will discuss programming styles and techniques that attempt to avoid this kind of bug.

    A more interesting example of a potential race condition is a job queue, a data structure for distributing chunks of work of unpredictable sizes to different processors. We will discuss this example below after we present more features of Split-C.

    Global pointers may be incremented like local pointers: if Pg = (processor,offset), then Pg+1 = (processor,offset+1). This lets one index through a remote part of a data structure. Spread pointers differ from global pointers only in this respect: if Ps = (processor,offset), then

       Ps+1 = (processor+1, offset)   if processor < PROCS-1, or
            = (0, offset+1)           if processor = PROCS-1
    
    where PROCS is the number of processors. In other words, viewing the memory as a 2D array, with one row per processor and one column per local memory location, incrementing Pg moves the pointer across a row, and incrementing Ps moves the pointer down a column. Incrementing Ps past the end of a column moves Ps to the top of the next column.
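
    As a small illustration (a sketch, not from the notes), suppose the global pointer Pg declared earlier points at the first of ten consecutive ints on some processor, and the spread pointer Ps points at an offset on processor 0 at which every processor owns a word. Then the two loops below walk a row and a column of the 2D memory, respectively:

       int k, sum_within = 0, sum_across = 0;

       for (k = 0; k < 10; k++)
           sum_within += *(Pg + k);    /* Pg+k stays on Pg's processor, at offset+k    */

       for (k = 0; k < PROCS; k++)
           sum_across += *(Ps + k);    /* Ps+k lives on processor k, at the same offset */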

    The local part of a global or spread pointer may be extracted using the function to_local.

    Only local pointers may be used to point to procedures; neither global nor spread pointers may be used this way. There are also some mild restrictions on the use of dereferenced global and spread pointers; see the last section of the Split-C tutorial.

    Spread Arrays and Spread Pointers

    A spread array is declared to exist across all processor memories, and is referenced the same way by all processors. For example,
        static int A[PROCS]::[10]
    
    declares an array of 10 integers in each processor memory. The double colon is called the spreader, and indicates that subscripts to its left index across processors, and subscripts to its right index within processors. So for example A[i][j] is stored in location to_local(A)+j on processor i. In other words, the 10 words on each processor reside at the same local memory locations.
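
    For example (a sketch, not from the notes), each processor can initialize its own 10 words by indexing the spread dimension with its own processor number MYPROC, so that every access is local:

        static int A[PROCS]::[10];
        int j;

        for (j = 0; j < 10; j++)
            A[MYPROC][j] = MYPROC;    /* each processor writes only its own row          */
        barrier();                    /* ensure all rows are written before anyone reads */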

    The declaration

        static double A[PROCS][m]::[b][b]
    
    declares a total of PROCS*m*b^2 double precision words. You may think of this as PROCS*m groups of b^2 doubles being allocated to the processors in round robin fashion. The memory per processor is b^2*m double words. A[i][j][k][l] is stored in processor
         (i*m+j) mod PROCS, 
    
    and at offset
         to_local(A) + b^2*floor( (i*m+j)/PROCS ) + k*b+l
    
    In the figure below, we illustrate the layout of A[4][3]::[8][8] on 4 processors. Each wide light-gray rectangle represents 8*8=64 double words. The two wide dark-gray rectangles represent wasted space. The two thin medium-gray rectangles are the very first word, A[0][0][0][0], and A[1][2][7][7], respectively.
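
    As a check on the formula, for this A[4][3]::[8][8] example (PROCS=4, m=3, b=8) the element A[1][2][7][7] mentioned above lands on

         processor:  (1*3+2) mod 4 = 5 mod 4 = 1
         offset:     to_local(A) + 64*floor(5/4) + 7*8+7 = to_local(A) + 64 + 63
                   = to_local(A) + 127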

    In addition to declaring static spread arrays, one can malloc them:

       int *spread w = all_spread_malloc(10, sizeof(int))
    
    This is a synchronous, or blocking, subroutine call (like the first kind of send and receive we discussed in Lecture 6), so all processors must participate, and should do so at about the same time, since every processor waits until all of them have called it; otherwise processors sit idle. The value returned in w is a pointer to the first word of the array on processor 0:
         w = (0, local address of first word on processor 0).
    

    (A nonblocking version, int *spread w = spread_malloc(10, sizeof(int)), executes on just one processor, but allocates the same space as before, on all processors. Some internal locking is needed to prevent allocating the same memory twice, or even deadlock. However, this only works on the CM-5 implementation and its use is discouraged.)
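
    As a usage sketch (not from the notes, and assuming PROCS is at most 10 so that each processor owns at least one of the 10 allocated ints at the same local offset), the returned spread pointer can be indexed with the spread-pointer arithmetic described earlier:

       int *spread w = all_spread_malloc(10, sizeof(int));
       int k;

       if (MYPROC == 0)                 /* processor 0 initializes the first word    */
           for (k = 0; k < PROCS; k++)  /* owned by each processor: w+k lives on     */
               *(w + k) = k;            /* processor k, at the same local offset     */
       barrier();                       /* other processors wait for initialization  */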

    Split Phase Assignment

    The split phases referred to are the initiation of a remote read (or write), and blocking until its completion. This is indicated by the assignment operator ":=". The statement
         c := *global_pointer                 ...   c is a local variable
         ... other work not involving c ...
         synch()
         b = b + c                            ...   b is a local variable
    
    initiates a get of the data pointed to by global_pointer, does other useful work, and only waits for c's arrival when c is really needed, by calling synch(). This is also called prefetching, and permits communication (getting c) and computation to run in parallel. The statement
         *global_pointer := b
    
    similarly launches a put of the local data b into the remote location global_pointer, and immediately continues computing. One can also wait until an acknowledgement is received from the processor receiving b, by calling synch().

    Being able to initiate a remote read (or get) and remote write (or put), go on to do other useful work while the network is busy delivering the message and returning any response, and only waiting for completion when necessary, offers several speedup opportunities.

  • It allows one to compute and communicate in parallel, as illustrated by the above example, hiding the latency of the communication network by prefetching.
  • Split-phase assignment lets one do many communications in parallel, if this is supported by the network (it often is). For example,
       /* lxn and sum are local variables; Gxn is a global pointer */
       lx1 := *Gx1             
       lx2 := *Gx2
       lx3 := *Gx3
       lx4 := *Gx4
       synch()
       sum = lx1 + lx2 + lx3 + lx4
    
    can have up to 4 gets running in parallel in the network, and hides the latency of all but the last one.
  • By avoiding the need to have processors synchronize on a send and receive, idle time spent waiting for another processor to send or receive data can be avoided by simply getting the data when it is needed.
  • The total number of messages in the system is decreased compared to using send and receive. A synchronous send and receive actually requires 3 messages to be sent (see the figure), where only the last message contains the data. In contrast, a put requires one message with the data and one acknowledgement, and a get similarly requires just 2 messages instead of 3. For small messages, this is 2/3 as much message traffic. This is illustrated in the figure. Here, time is the vertical axis in each picture, and the types of arrows indicate what the processor is doing during that time.

    Instead of synching on all outstanding puts and gets, it is possible to synch just on a selected subset of puts and gets, by associating a counter just with those puts and gets of interest. The counter is automatically incremented whenever a designated put or get is initiated, and automatically decremented when an acknowledgement is received, so one can test if all have been acknowledged by comparing the counter to zero. See section 10.5 of Introduction to Split-C for details.

    The freedom afforded by split-phase assignment also offers the freedom for new kinds of bugs. The following example illustrates a loss of sequential memory consistency. Sequential consistency means that the outcome of the parallel program is consistent with some interleaved sequential execution of the PROCS different sequential programs. For example, if there are two processors, where processor 1 executes instructions instr1.1, instr1.2, instr1.3, ... in that order, and processor 2 similarly executes instr2.1, instr2.2, instr2.3 ... in order, then the parallel program must be equivalent to executing both sets of instructions in some interleaved order such that instri.j is executed before instri.(j+1). The following are examples of consistent and inconsistent orderings:

        Consistent      Inconsistent
         instr1.1         instr1.1
         instr2.1         instr2.2   *out of order
         instr1.2         instr1.2
         instr2.2         instr2.1   *out of order
         instr1.3         instr1.3
         instr2.3         instr2.3
         ...              ...
    
    Sequential consistency, or having the machine execute your instructions in the order you intended, is obviously an important tool if you want to predict what your program will do by looking at it. Sequential consistency can be lost, and bugs introduced, when the program mistakenly assumes that the network delivers messages in the order in which they were sent, when in fact the network (like the post office) does not guarantee this.

    For example, consider the following program, where data and data_ready_flag are global pointers to data owned by processor 2, both of which are initially zero:

            Processor 1              Processor 2
            *data := 1               while (*data_ready_flag != 1) {/* wait for data*/}
            *data_ready_flag := 1    print 'data=',*data
    
    From Processor 1's point of view, first *data is set to 1, then the *data_ready_flag is set. But Processor 2 may print either data=0 or data=1, depending on which message from Processor 1 is delivered first. If data=0 is printed, this is not sequentially consistent with the order in which Processor 1 has executed its instructions, and probably will result in a bug. Note that this bug is nondeterministic, i.e. it may or may not occur on any particular run, because it is timing dependent. These are among the hardest bugs to find!
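
    One way to restore the intended order (a sketch, not from the notes) is for Processor 1 to wait for the data put to complete before setting the flag, using the synch() described above; a blocking assignment *data = 1 would work equally well:

            /* Processor 1, corrected */
            *data := 1
            synch()                  /* wait until *data has actually been written */
            *data_ready_flag := 1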

    This sort of hazard is not an artifact of Split-C; it also occurs when programming several shared memory machines, as discussed in Lecture 3. So it is a fact of life in parallel computing.