Chapter 5: The Randomized System

In earlier chapters I discussed at length the importance of randomization in a testing system, but Arctic's testing system, as I have described it, has few random testing features. To remedy this situation, I designed a second system devoted entirely to random testing. This system is based on the basic structure and modules of the older system, but it is kept separate, because it, unlike the tester-debugger of the previous section, is designed to be run with almost no human intervention.

We have already seen a number of reasons why some kind of random tester is necessary. Perhaps the most important is that it alleviates the burden of generating tests. With a random tester, a design team can let the testing system look for obscure bugs while the team members concern themselves with debugging known trouble spots. Also, randomized testing is desirable for theoretical reasons. The number of possible states in a large chip like Arctic is huge, and it is nearly impossible for a human to pick out a set of tests that will uncover every design bug. A carefully built randomized tester can find this set with a fairly good probability. Douglas Clark expresses this idea well in his paper on hardware verification strategies.

The method I advocate for dealing with the problem of a huge space of behaviors is to sample randomly (really pseudo-randomly) from one or more sets of possible tests. A random selection can pick any test with some probability, and the longer a random test-selector runs, the more likely it is that any particular test will be found. [2]

In this chapter I will present the design and implementation of Arctic's random testing system. This part of the testing project is much more recent, and as a result, important parts of the system have not yet been implemented. Therefore, I will give only a brief overview of the design and where I cannot give examples of problems encountered with the system, I will try to predict the problems that would have been encountered if the system had been finished. To remain consistent with Chapter 4, I will begin by discussing changes to the system's ``hardware'' and then discuss changes to the ``software.''

5.1 Hardware

In the new system, each stub has been modified to hold some kind of randomizer. Each of the stubs connected to the input ports, for example, has a random packet generator that is capable of creating a completely new packet whenever it is needed. The user is given the ability to specify the distribution of these random packets by controlling certain constants for each port, such as the probability that a packet is a high-priority packet, the mean length of a packet (modeled as an exponential random variable), and the mean wait time between receiving a packet and sending a free buffer signal (both of which are also modeled as exponential random variables). This is a very loose control mechanism, allowing the user to have only a general influence on the traffic through the chip. For the most part, the simulations execute without any outside influence at all.

Because all packets could be generated randomly in this system, I chose to create completely new packet management mechanisms that would be contained in the input and output port's stubs. In this system, packets were generated when needed and did not need to be kept in files and loaded at the beginning of a simulation. Also, records of received packets were not written to any file. Instead, each input port's stub maintained records of the packets which it sent. An output port's stub that received a packet would immediately check the sender's record to see if the packet was sent out the correct port, and if it was, the sender's record would be deleted. Managed correctly, this system could continue to send packets forever using a constant amount of storage.

Let's take a closer look at how each sender manages the transmission of packets. When a new packet is created it is given a unique identifier in the form of a timestamp generated with the Verilog command $time. This timestamp is not completely unique, since other input port stubs might be generating packets at exactly the same time, but packets can also be distinguished by their sender, so this does not really matter. Note, however, that I have sacrificed a bit of generality here by requiring each packet to have correct timestamp and point of origin information in its payload.

After the packet has been generated, the input port's stub stores its timestamp, length, and checksum in a queue corresponding to the output port of Arctic from which the packet should leave. When an output port's stub receives a packet, it looks in the queue that should hold a record for the packet. Since packets are sent in FIFO order between any input port and output port, the output port's stub is guaranteed to find the needed record at the front of the queue, as long as the system is working. The stub checks the timestamp, length, and checksum of the packet with the values in the queue, and if they are not the same, it signals an error. Otherwise, it pops the element off the queue and waits for the next packet.

Since there is a limited amount of storage inside Arctic, this system knows the maximum number of outstanding packets there can be at any given time. Therefore, the size of these queues is rather small and constant, a considerable improvement over the old system, which had to hold every packet in a buffer for the entire length of a simulation.

The random packet management system described above is almost all the randomization ``hardware'' that is required for this system. This part of the system has been completed; it generates and receives packets continually, modeling ``correct behavior'' for Arctic by simply keeping track of the packets sent. However, there are two other parts of the system that have not been implemented yet. One of these is, of course, the randomizer connected to Arctic's maintenance interface. This seems, at first glance, to be simple to implement; the transaction functions from the previous system can be used to send randomized configuration and control words to Arctic. The other randomizer that is needed is some kind of error generator to create link errors, bad packets, incorrect maintenance interface operations, and other detectable errors. This seems a bit more complicated, but still manageable since there is a clear definition of what kinds of errors Arctic can detect. These will not be simple to implement, however, because the system must not only generate these inputs, but also predict how those inputs will affect Arctic.

Both of these randomizers can have a dramatic effect on Arctic. The configuration and control words can affect the flow of packets in ways that are very difficult to deal with. Randomly changing the routing information in a port's configuration register, for example, can extremely complicate the problem of tracking randomly generated packets. Also, some detectable errors can create a confusing avalanche of errors. Some link errors can cause this by fooling an input port into thinking that a packet is being transmitted when the port is really idle, thereby creating a flood of errors that are detected on a packet that does not actually exist.

There are a few tricks that can be played to simplify this problem. Maintenance interface operations can be managed by structuring the testing ``software'' so that the effects of each of the operations has a more limited scope of effects. This will leave a larger space of tests untried, but some of these untried tests may not be needed, and those that are can be tested with the older, more flexible system. If we restrict configuration changes to ones that do not change routing control bits, for instance, we can keep packet management simple and still avoid routing bugs by conducting extensive tests on the routing bits with the older testing system. More complex behavior can be avoided by restricting the moments when Arctic's control register can be changed. Changes in the control register that take effect when there is still packet traffic in the system are very hard to model. These situations can be overlooked in the testing system on the grounds that, in most cases, changes in control information take place when packets are not flowing.

Random errors present a slightly different problem. Ideally, we would not like to limit the kinds of errors that can be generated, because error counter tests are hard to create with the old system (as seen in Appendix Section A.3), and we cannot easily claim that the test cases we will miss are unlikely to occur in normal situations. Perhaps an avalanche of errors can be tolerated, though. If, when an error is generated, the testing system stops worrying about all state in Arctic except for the error counter that should reflect the detection of that error, then any avalanche of errors can be tolerated. Once the system generates the error, the system will determine if the error was detected, and then reset Arctic and the testing system to begin from a known state. This is a valid approach because it actually models how Arctic will be used. Arctic is designed to detect and not recover from errors.

Designing these randomizers is not altogether trivial, then. The structure borrowed from the earlier system makes it easier to decide where certain parts can be placed and does provide some functions for managing maintenance interface transactions, but predicting how these random generators affect the state of Arctic can be hard. The benefits are considerable, however. This system is capable of generating a wide variety of tests that a human would never be able to create.

5.2 Software

The ``software'' in this system coordinates the activities of the randomizers. The randomizers are capable of running without any central controller, but the system still needs to occasionally stop sending inputs and determine if Arctic is in a correct state. For this reason, the concept of a configure-execute-check loop is still useful. The system should put Arctic in a known state, send random inputs for a certain amount of time, and then pull state information out of Arctic to determine if Arctic's state matches the predicted state. Also, if a bug is detected, it is convenient to have a checkpoint from which the simulation can be restarted so that the conditions under which the bug appeared can be repeated without re-running the entire simulation. Tests are therefore organized in test groups, just as in the earlier system. Again, the group is the smallest unit of a test, and any group can be run in isolation.

The difference with this system is, of course, that very little human intervention is needed before the simulation can begin. Some constants need to be set to define the average length of packets, the frequency of packets, the probability of generating certain errors, the number of cycles in a test group, etc., but these are provided mostly for the purpose of tweaking the system so that a nice balance of inputs can be found. The only piece of information in the system that need change for each simulation is a single number that serves as the random seed for the entire simulation. All random numbers in the entire system are derived from this seed, and it need change by only one bit to generate a completely new random simulation. The register containing the value for this seed is modified after every random number is generated, and the seed is output at the beginning of every test group. A test group can be re-started simply by using the given seed as the initial seed for the simulation.

Let's look at each of the phases of a test group, individually. At the beginning of a new group, the configuration phase is entered and a new configuration is chosen. As mentioned in the previous section, this configuration is not completely random, because the bits corresponding to routing rules are always the same. All other configuration data that affects packet flow or statistics is randomizable. A random initial value for the control register is also chosen in this phase. When all this is complete, the execute phase begins.

The execute phase is very simple in this system. A counter is set to a user-defined initial value, and begins to count down to the moment when the system will begin to enter the check phase. While the system is running, each input port's stub generates packets, and each output port's stub processes the packets it receives, signalling errors when it detects them. The errors and maintenance interface randomizers cause various errors and non-destructive control transactions as well. Thus, the system can imitate the widely varied interactions that would happen during the normal operation of Arctic.

The check phase begins when the aforementioned counter reaches 0. First, the system waits long enough for all packets that are in transit to emerge from Arctic and for all outstanding maintenance interface operations to complete. Then, the system begins to check the internal state of Arctic. During the execute phase, the system maintained what it believed to be appropriate values for the statistics counters, error counters, control register, and any other registers that could be read out of Arctic. During the check phase, each of these values is read out of the chip, and if any discrepancies are noted, then an error is signalled. At the end of the check phase, some statistics reporting the actions taken on the chip during the test group are printed in the simulation log file, so that the test system itself can be monitored, and the system re-enters the configure phase.

This system is rather pretty, as described above, but there are a few ugly details that have not quite been worked out yet. This biggest of these problems is deciding exactly how to manage the changes in packet flow that are caused by some changes to Arctic's control register. Arctic has some functions, manipulated through the control register, that can block, re-route, or flush packets. The behavior of each of these functions can depend on whether or not any one of the other functions preceded it and whether or not packets were flowing at the time the function went into effect. Even if we decide to simplify the problem by not allowing packets to be flowing when a function goes into effect, there are still problems deciding how to stop the flow of packets to make the change in the control register and how to embed knowledge into the system about how one function should transition into another function. At present, I see no solution other than brute force -- building into the system all the information it needs to know to modify the control register safely and to predict the resulting changes in packet flow.

It seems then, that building this randomized testing system is no simple task. The structure of the earlier system did, however, give me a clear starting place for this work and was flexible enough to accommodate all of my changes. Though this system has not been completed, I believe it could be extremely helpful in eliminating subtle design bugs. Remember, the advantage of this system is that it has some probability of catching almost any bug. The bugs that it is not capable of catching can still be found by focusing some directed tests in the areas that the system cannot reach. The benefit of the random tester is not that it takes care of all testing, but that it narrows the amount of extra testing that needs to be done to something more tractable. Thus, this random tester does not replace the older system, but merely adds to it, because the older system is still necessary as a debugging tool and tester for those functions not testable with the randomized system.