CS 267 – Applications of Parallel Computing
Assignment 0
Jike Chong
January 26, 2005

Self introduction: I am a first-year Ph.D. candidate in
electrical engineering. My research interest lies in the disciplined approach
to embedded systems design. In
this class, I am interested in learning about the effects of underlying
system architecture on applications in parallel systems, and their implications
for embedded systems design. My background is in the areas of
VLSI micro-architecture design, analog and digital integrated circuit design,
and integrated circuit design automation.

Applications of Parallel Computing in Integrated Circuit Design Automation

Digital CMOS integrated circuits can be conveniently modeled at the logic level as Boolean networks. There exist various classes of algorithms that can be applied to Boolean networks analytically or topologically. Parallel computers are often used for synthesis and optimization in this domain. [1] As VLSI circuits become more complex, the computational requirements for performing various CAD tasks increase almost exponentially. It has become clear that new multiprocessor computers are required to solve the problems ahead. [2] There has been extensive research on parallel synthesis for VLSI design. Professor Prith Banerjee at

Figure 1. ProperCAD Overview [2]

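To make the Boolean-network model described above concrete, here is a minimal sketch, assuming a hypothetical four-gate network stored as a dictionary, of how such a network can be represented as a directed acyclic graph of gates and evaluated in topological order. It uses Python's standard `graphlib` module (Python 3.9 or later); the gate names, the `NETWORK` layout, and the operator set are illustrative only and are not ProperCAD's data structures.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical toy network: each gate maps to (operator, list of fanin signals).
# Any signal that is not a gate name is a primary input.
NETWORK = {
    "n1": ("AND", ["a", "b"]),
    "n2": ("OR", ["b", "c"]),
    "n3": ("NOT", ["n1"]),
    "f": ("AND", ["n3", "n2"]),
}

OPS = {
    "AND": lambda xs: all(xs),
    "OR": lambda xs: any(xs),
    "NOT": lambda xs: not xs[0],
}

def evaluate(network, inputs):
    """Evaluate every gate exactly once, visiting gates in topological order."""
    # TopologicalSorter expects {node: predecessors}. Primary inputs appear only
    # as predecessors; they are visited too, and we simply skip them.
    graph = {gate: set(fanins) for gate, (_, fanins) in network.items()}
    values = dict(inputs)
    for node in TopologicalSorter(graph).static_order():
        if node in network:
            op, fanins = network[node]
            values[node] = OPS[op]([values[x] for x in fanins])
    return values

if __name__ == "__main__":
    # f = NOT(a AND b) AND (b OR c), so a = b = 1, c = 0 gives f = 0.
    out = evaluate(NETWORK, {"a": True, "b": True, "c": False})
    print(out["f"])  # False
```

Visiting gates in topological order guarantees that every fanin value is available before the gate that reads it, which is the same ordering constraint a real output-function evaluation has to respect.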
Many existing sequential logic synthesis algorithms are based on recursion (e.g., the unate recursive paradigm) or on branch-and-bound mechanisms (e.g., implicit enumeration in automatic test pattern generation). Parallel algorithms have been developed for placement, routing, layout verification, logic synthesis, test generation, fault simulation, and behavioral simulation. Most of them are derived from existing sequential algorithms by extracting the parallelism inherent in those algorithms and explicitly mapping it onto multiprocessor architectures. Impressive speedups have been reported when recursive and branch-and-bound algorithms are mapped to shared-memory multiprocessor systems. For example, Galivanche and Reddy [3] proposed a parallel implementation of two-level network synthesis based on ESPRESSO. They presented parallel versions of the two-level synthesis procedures, such as reduction, expansion, and complementation of cubes, and reported speedups of about 7 on an eight-processor shared-memory multiprocessor, i.e., a parallel efficiency of roughly 7/8 ≈ 88%. However, shared-memory multiprocessor systems offer limited scalability in terms of total memory bandwidth.

The ProperCAD suite includes CAD packages that are implemented on the CHARM runtime system, so that they can run on a variety of parallel machines without any changes to the program. CHARM is a parallel programming system that lets users write portable parallel programs for MIMD multiprocessors without losing efficiency. It supports an explicitly parallel language that helps control the complexity of parallel program design by imposing a separation of concerns between the user program and the system. It also provides target-machine-independent abstractions for information sharing, which are implemented differently on different types of parallel machines.
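The recursive structure mentioned above is what makes these algorithms comparatively easy to parallelize: each Shannon cofactor is an independent subproblem. The following is a minimal sketch, assuming covers are written in positional cube notation (one character per variable, `'-'` meaning "don't care"), of a tautology check by Shannon expansion. It is a simplified relative of the unate recursive paradigm (it omits the unate-reduction rules and splitting heuristics used in ESPRESSO) and is not the Galivanche and Reddy implementation; all function names are hypothetical. Threads are used only to show the decomposition into independent tasks: because of CPython's global interpreter lock they will not speed up this CPU-bound work, and a process pool would be needed for that.

```python
from concurrent.futures import ThreadPoolExecutor

# A cube is a string over {'0', '1', '-'}, one character per variable x0, x1, ...
# A cover is a list of cubes, read as a sum of products.

def cofactor(cover, i, value):
    """Cofactor of the cover with respect to variable i fixed to value ('0' or '1')."""
    result = []
    for cube in cover:
        if cube[i] == "-" or cube[i] == value:
            # The cube survives; variable i no longer matters inside the cofactor.
            result.append(cube[:i] + "-" + cube[i + 1:])
    return result

def is_universal(cube):
    """A cube with no literals covers every minterm."""
    return all(ch == "-" for ch in cube)

def split_var(cover):
    """First variable that some cube depends on, or None if no cube has a literal."""
    for i in range(len(cover[0])):
        if any(cube[i] != "-" for cube in cover):
            return i
    return None

def is_tautology(cover):
    """Sequential recursion: f == 1 iff both Shannon cofactors are identically 1."""
    if not cover:
        return False  # an empty cover is the constant 0
    if any(is_universal(cube) for cube in cover):
        return True
    i = split_var(cover)  # not None here: a non-universal cube has a literal
    return is_tautology(cofactor(cover, i, "1")) and is_tautology(cofactor(cover, i, "0"))

def is_tautology_parallel(cover, workers=2):
    """Solve the two top-level cofactors as independent tasks."""
    if not cover:
        return False
    i = split_var(cover)
    if i is None:
        return True  # non-empty cover whose cubes have no literals
    halves = [cofactor(cover, i, "1"), cofactor(cover, i, "0")]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return all(pool.map(is_tautology, halves))

if __name__ == "__main__":
    # x0 + x0'x1 + x0'x1' reduces to x0 + x0' = 1, so it is a tautology.
    print(is_tautology_parallel(["1-", "01", "00"]))  # True
    # x0 + x0'x1 equals x0 + x1, which is 0 when x0 = x1 = 0.
    print(is_tautology_parallel(["1-", "01"]))  # False
```

Splitting deeper (on k variables, giving 2^k tasks) would expose more parallelism; the two-way split is kept here only for brevity.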
Supported systems include shared-memory multiprocessors such as the Sequent Symmetry, Encore Multimax, SGI Challenge, and Sun SPARCcenter 1000; distributed-memory multicomputers such as the Intel Paragon, IBM SP-2, and Thinking Machines CM-5; and networks of workstations. [4]

Example Application: Transduction

One particular application in this domain is logic synthesis using transduction. [5] Transduction uses the external don’t-care set of a Boolean network to find a compatible set of permissible functions (CSPF), which is then used to simplify the network. A suite of transformations and optimizations can be applied to the network with the help of the CSPFs of its gates and connections. The transduction method consists of four main transformations, namely gate substitution, gate merging, generalized gate substitution, and gate input reduction. It also prunes redundant connections in the network.

Algorithm-level parallelism is extracted by partitioning the Boolean network to distribute the workload. Many small partitions are created and submitted to the CHARM runtime system, which balances the load among processors. The computation required by each partition takes roughly an order of magnitude longer than sending a message between objects, so communication overhead stays small relative to useful work. Memory-system parallelism and OS-level parallelism are handled by the CHARM runtime system.
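The benefit of creating many small partitions can be seen with a minimal sketch, assuming each partition comes with a hypothetical cost estimate in arbitrary units. CHARM balances load dynamically at run time; the greedy longest-processing-time (LPT) assignment below is a simpler, static stand-in, swapped in only to show why fine-grained partitions balance better than a few coarse ones. The function names are hypothetical, and the sketch ignores communication cost, which the roughly 10:1 computation-to-message ratio reported above keeps small.

```python
import heapq

def make_partitions(gates, size):
    """Over-decompose: split a list of gate names into partitions of at most `size` gates."""
    return [gates[i:i + size] for i in range(0, len(gates), size)]

def balance(costs, num_procs):
    """Greedy longest-processing-time (LPT) assignment of tasks to processors.

    Tasks are taken in decreasing cost order and each goes to the currently
    least-loaded processor. Returns (assignment, loads), where assignment[p]
    lists the task indices placed on processor p and loads[p] is their total cost.
    """
    heap = [(0.0, p) for p in range(num_procs)]  # (current load, processor id)
    assignment = [[] for _ in range(num_procs)]
    for task in sorted(range(len(costs)), key=lambda t: costs[t], reverse=True):
        load, p = heapq.heappop(heap)  # least-loaded processor
        assignment[p].append(task)
        heapq.heappush(heap, (load + costs[task], p))
    loads = [0.0] * num_procs
    for load, p in heap:
        loads[p] = load
    return assignment, loads

if __name__ == "__main__":
    # Two coarse partitions leave one processor idle for part of the run ...
    print(balance([12, 8], 2)[1])  # [12.0, 8.0]
    # ... while the same total work split into smaller pieces balances evenly.
    print(balance([5, 4, 3, 3, 2, 2, 1], 2)[1])  # [10.0, 10.0]
```

With coarse partitions the finishing time is set by the largest piece; splitting the same work more finely gives the scheduler room to even out the loads, which is the motivation for submitting many small partitions.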
The following subroutines are parallelized:
· Evaluation of Output Function
· Evaluation of CSPF
· Gate Substitution
· Gate Merging
· Pruning
· Generalized Gate Substitution
· Gate Input Reduction

Runtimes and speedups of the algorithm are reported for different benchmark circuits on different parallel machines. Table I presents the results obtained on an Encore Multimax, a shared-memory MIMD machine with eight processors. Table II presents the results obtained on a Sequent Symmetry, another shared-memory MIMD machine. Table III presents the results obtained on an Intel iPSC/860, a distributed-memory MIMD machine with eight processors connected in a hypercube configuration. Table IV presents the results in a network-of-workstations environment, which behaves as a distributed processing system: a host workstation distributes work to the other workstations. The workstations used in these experiments are SUN4 workstations with SPARC I processors.
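To illustrate the idea behind the gate substitution step in the list of parallelized subroutines, here is a minimal sketch, assuming the care set of the target gate is already known and that functions are small enough to store as truth-table bitmasks. It only checks whether an existing gate's function is permissible at the target, meaning it agrees with the target on every care minterm. It does not compute CSPFs (which transduction derives from the network's external don't-care set), and it omits the check that the substitute gate is not in the target's transitive fanout. All names and the don't-care minterm are hypothetical.

```python
def truth_table(func, n):
    """Pack func's outputs over all 2**n minterms into an integer bitmask.

    Bit m of the result is func evaluated at minterm m, where input x_k is bit k of m.
    """
    table = 0
    for m in range(2 ** n):
        bits = [(m >> k) & 1 for k in range(n)]
        if func(*bits):
            table |= 1 << m
    return table

def is_permissible(candidate, original, care):
    """True if candidate agrees with original on every minterm in the care set."""
    return ((candidate ^ original) & care) == 0

def find_substitute(target, gates, care):
    """Name of another gate whose function is permissible at target, or None."""
    for name, table in gates.items():
        if name != target and is_permissible(table, gates[target], care):
            return name
    return None

if __name__ == "__main__":
    gates = {
        "A": truth_table(lambda x0, x1: x0 | x1, 2),  # OR  -> 0b1110
        "B": truth_table(lambda x0, x1: x0 ^ x1, 2),  # XOR -> 0b0110
        "C": truth_table(lambda x0, x1: x0 & x1, 2),  # AND -> 0b1000
    }
    # Hypothetical care set: minterm 3 (x0 = x1 = 1) is a don't care for A.
    print(find_substitute("A", gates, care=0b0111))  # B
    # With no don't cares, no other gate can replace A.
    print(find_substitute("A", gates, care=0b1111))  # None
```

The example shows why don't cares matter: OR and XOR differ only on minterm 3, so once that minterm is a don't care at gate A, the XOR gate is an acceptable replacement.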
The results show excellent speedups on all the machines, with almost no degradation in the quality of the synthesized network compared with the uniprocessor algorithm. Sometimes the algorithm even produces “super-linear” speedups. This is because synthesis is a search problem, and parallel implementations of search problems can exhibit speedup anomalies: if the parallel execution happens to apply one transformation before another, the search space may be reduced considerably, which can result in a “super-linear” speedup.

[1] Susan L. Graham, Marc Snir, and Cynthia A. Patterson, Getting Up to Speed: The Future of Supercomputing, The National Academies Press.
[2] ProperCAD problem definition. [www.ece.northwestern.edu/cpdc/ProperCAD/pcad.html]
[3] R. Galivanche and S. M. Reddy, “A parallel PLA minimization program,” in Proc. Design Autom. Conf., 1987.
[4] L. V. Kalé, B. Ramkumar, A. B. Sinha, and A. Gürsoy, “The Charm Parallel Programming Language and System.” [charm.cs.uiuc.edu/papers/CharmSys1TPDS94.www/]
[5] Kaushik De, Balkrishna Ramkumar, and Prithviraj Banerjee, “A Portable Parallel Algorithm for Logic Synthesis Using Transduction,” IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 13, no. 5, May 1994.