One proposed method for reducing the energy consumed by on-chip buses is to avoid retransmitting frequently sent values in full. This is accomplished by inserting a transmitter circuit on one side of the bus and a receiver circuit on the other side. These circuits reduce the number of bit transitions by caching previously sent values: if a data entry is already in the cache, only its cache index is sent across the bus.
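The caching idea can be illustrated with a small behavioral model. This is only a sketch under stated assumptions, not a description of our circuit: the 32-bit bus width, the 4-entry cache, the round-robin replacement, the extra "hit" line, and the choice to hold the unused data lines steady on a hit are all hypothetical, as are the names ValueCache, encode, decode and bus_toggles.

```python
WIDTH = 32  # assumed data bus width in bits (values must fit in WIDTH bits)


class ValueCache:
    """Small cache of previously sent values; round-robin replacement is an assumption."""

    def __init__(self, size):
        self.slots = [None] * size
        self.write_ptr = 0

    def find(self, value):
        """Return the slot index holding value, or None on a miss."""
        return self.slots.index(value) if value in self.slots else None

    def insert(self, value):
        self.slots[self.write_ptr] = value
        self.write_ptr = (self.write_ptr + 1) % len(self.slots)


def index_mask(cache_size):
    """Mask covering the low data lines that carry a cache index."""
    return (1 << max(1, (cache_size - 1).bit_length())) - 1


def encode(values, cache_size=4):
    """Transmitter side: turn a value trace into (hit, data_lines) bus words."""
    cache, mask, data, words = ValueCache(cache_size), index_mask(cache_size), 0, []
    for v in values:
        idx = cache.find(v)
        if idx is not None:
            # Hit: only the index lines change, the other data lines hold their state.
            data = (data & ~mask) | idx
            words.append((1, data))
        else:
            # Miss: send the full value and remember it.
            cache.insert(v)
            data = v
            words.append((0, data))
    return words


def decode(words, cache_size=4):
    """Receiver side: mirror the transmitter's cache to recover the trace."""
    cache, mask, out = ValueCache(cache_size), index_mask(cache_size), []
    for hit, data in words:
        if hit:
            out.append(cache.slots[data & mask])
        else:
            cache.insert(data)
            out.append(data)
    return out


def bus_toggles(words):
    """Count line transitions over a sequence of bus words, starting from all zeros."""
    total, prev = 0, 0
    for hit, data in words:
        cur = (hit << WIDTH) | data
        total += bin(prev ^ cur).count("1")
        prev = cur
    return total


trace = [0xDEADBEEF, 0x00000000] * 3
raw = bus_toggles([(0, v) for v in trace])   # every value sent in full
coded = bus_toggles(encode(trace))           # repeated values sent as indices
print(raw, coded)
```

For a trace that alternates between two values, the coded bus toggles far fewer lines than the raw bus, because after the first two misses only the hit line and the index lines change. The model counts transitions only; it says nothing about the energy spent inside the transmitter and receiver themselves.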
We have put together a layout for the bus traffic encoder and performed a netlist extraction with full parasitics. The layout is shown in Figure 1. The netlist was then modified to allow us to measure the energy consumed by individual components.
One of the first surprising findings was that the majority of the energy was consumed by the control circuitry, i.e., the enable and clock lines that run to all CAM cells. Optimizing the control circuits was therefore our first priority. So far we have made the following modification. The main CAM cell was originally designed to require 5 control lines (shift, enA, enA_bar, enB, enB_bar). The storage element was redesigned to require only 1 control signal (a simple level-sensitive latch built with 6 transistors). A schematic of the CAM cell follows.
The graph below shows the amount of energy consumed by our circuit to process 50 cycles of the gcc benchmark. The first column is the result prior to any modification and the second column is the result after modification. It is important to note that the layout was not "compressed" after the unnecessary elements were removed. In other words, once we create a new layout, each cell can be even smaller and require less energy than these results show.
Currently, we are formulating a strategy to reduce the energy consumption even further. We plan to address the following questions and determine their effect on overall performance.
We are investigating how to determine the optimal configuration for the matching circuit with the "inverted" behavior, i.e., one in which only the matched entry discharges the match_bar line and consumes energy. See the paper summaries below for further discussion.
While this paper provides a good overview of a variety of techniques employed to reduce power in SRAM circuits, it offered only a limited amount of information useful to us. Of particular interest were two techniques for reducing capacitance on word-lines and bit-lines: the Divided Word-Line structure (DWL) and Single Bit-line Cross Point cell activation (SCPA). The first structure results from splitting a word into sub-words (typically 4 to 8) and allowing a global decoder to drive only "sub-word enable" signals. These in turn activate a set of local sub-word decoders. This technique reduces the driven capacitance on the bit-lines by preventing the sub-words that are not accessed from driving their values onto the bit-lines. It is well suited to a structure where storage cells are organized in an array and share the bit-lines.
Single Bit-line Cross Point cell activation pursues a similar goal. Instead of a single pass transistor controlled by a word-line, two pass transistors are attached to each cell. They are connected in series: the first is controlled by a word-line and the second by a "vertical" decoded line. This allows the designer to control the number of bits accessed (and thus the energy consumed to drive the bit-lines) down to the level of a single bit.
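As a rough way to compare the two schemes above with a conventional array, one can count how many cells are connected to their bit-lines on a single access. The sketch below is a first-order illustration only: the function name, the scheme labels and the 128-bit row are hypothetical, and the count ignores decoder overhead and the capacitance of the extra control lines.

```python
def cells_driving_bitlines(row_bits, scheme, sub_words=4, bits_accessed=1):
    """First-order count of cells that drive bit-lines during one access.

    row_bits      -- number of cells on one physical row (assumed value)
    scheme        -- "conventional", "dwl" or "scpa" (labels are assumptions)
    sub_words     -- number of sub-words a DWL row is divided into
    bits_accessed -- number of bits selected by the vertical lines in SCPA
    """
    if scheme == "conventional":
        # One word-line turns on every cell in the row.
        return row_bits
    if scheme == "dwl":
        # Only the selected sub-word has its local word-line enabled.
        if row_bits % sub_words != 0:
            raise ValueError("row_bits must be a multiple of sub_words")
        return row_bits // sub_words
    if scheme == "scpa":
        # A cell is accessed only where the word-line and vertical line cross.
        return bits_accessed
    raise ValueError("unknown scheme: " + scheme)


for scheme in ("conventional", "dwl", "scpa"):
    print(scheme, cells_driving_bitlines(128, scheme))
```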
Relevant to the project we are working on, this paper presents an implementation of a simple CAM cell structure consisting of a 6T SRAM cell and 2T differential pass-gate XOR logic, which performs the matching. The circuit is simple. It uses either a pseudo-nMOS or a dynamic NOR gate, such that each bit in a word is compared against the contents of an SRAM cell, and if there is a mismatch, an NMOS transistor pulls down the "match" line. The paper describes several techniques to lower the energy consumption of the CAM cell.
The primary problem described in the paper, which is also a serious problem for our work, is that energy in the CAM cell is wasted the majority of the time. Since the structure described above discharges the "match" line whenever the word does not match, all of the charge on that line is wasted. Unfortunately, in structures where CAMs are employed, typically only one of the entries matches a given word, so all but that entry throw away the energy that was used to pre-charge their match lines.
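The scale of this waste can be estimated with a first-order model in which recharging one fully discharged line draws C * Vdd^2 from the supply. The sketch below rests on assumptions that are not from the paper: the function name and parameters are hypothetical, every match line is given the same capacitance in both schemes, and the energy of the search lines and control signals is ignored.

```python
def match_line_energy(num_entries, num_matches, c_line, vdd, inverted=False):
    """Energy drawn per search to recharge the match lines that were discharged.

    num_entries -- number of CAM entries searched in parallel
    num_matches -- entries that match the search word (typically 1)
    c_line      -- capacitance of one match line in farads (assumed equal for all)
    vdd         -- supply voltage in volts
    inverted    -- False: mismatching lines discharge; True: matching lines discharge
    """
    discharged = num_matches if inverted else num_entries - num_matches
    return discharged * c_line * vdd ** 2


# Hypothetical numbers: 16 entries, one match, normalized C and Vdd.
conventional = match_line_energy(16, 1, c_line=1.0, vdd=1.0)
inverted = match_line_energy(16, 1, c_line=1.0, vdd=1.0, inverted=True)
print(conventional, inverted)
```

With a single matching entry, the conventional scheme recharges all but one line on every search, while a scheme that discharges only on a match recharges one. In a real circuit the line capacitance differs between the two schemes, so the ratio from this model is only an upper-level guide.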
The proposed solution is to "invert" the logic: the output line is discharged only when the match succeeds, and it stays at logical 1 for the more frequent unsuccessful matches. The problem with this approach is the "sequential" nature of the equality comparison: all bits must match in order to pull down the "not_match" line. In our case, with a word size of 32, this would require a chain of 32 NMOS transistors, which would severely degrade performance. The paper presents a compromise: partition the word into sub-words, match each sub-word individually, and then simply OR the results. If the output is logical 0, we have a successful match. In our work, we will look for an optimal configuration of a word-level CAM block that conserves energy on the words that do not match.
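The sub-word compromise can be expressed as a short behavioral model. This is a logic-level sketch only, with hypothetical function names and an assumed sub-word width of 4 bits; it does not model timing or energy. Each sub-word has its own local not_match line, which goes to 0 only if every bit in that sub-word matches, and the word-level output is the OR of the local lines.

```python
def subword_not_match_lines(stored, search, width=32, sub_width=4):
    """Return the local not_match lines, least significant sub-word first.

    A local line is 0 when all bits of its sub-word match (the series chain
    of sub_width transistors conducts), and stays at 1 otherwise.
    """
    if width % sub_width != 0:
        raise ValueError("width must be a multiple of sub_width")
    mask = (1 << sub_width) - 1
    lines = []
    for shift in range(0, width, sub_width):
        stored_part = (stored >> shift) & mask
        search_part = (search >> shift) & mask
        lines.append(0 if stored_part == search_part else 1)
    return lines


def word_not_match(stored, search, width=32, sub_width=4):
    """OR of the local lines: 0 means the whole word matched."""
    return int(any(subword_not_match_lines(stored, search, width, sub_width)))


print(word_not_match(0x12345678, 0x12345678))           # whole word matches
print(subword_not_match_lines(0x12345678, 0x12345679))  # lowest sub-word differs
```

With 4-bit sub-words the longest series chain is 4 transistors instead of 32, at the cost of 8 local lines and an OR stage. Only the local lines of matching sub-words discharge, so a word that differs in every sub-word discharges nothing, while a word that differs in a single sub-word still discharges the other local lines.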