International Journal for Research in Applied Science and Engineering Technology (IJRASET)
Authors: Lone Foziya, Er. Ritesh Kumar Ojha
DOI Link: https://doi.org/10.22214/ijraset.2022.41407
We discover the optimum decoder structure for fast low-power SRAMs. The results are ideal when the decoder, omitting the predecoder, is designed as a binary tree. We find that skewed circuits with self-resetting gates work well, and we explore some basic sizing algorithms for minimal delay and power in the SRAM data path. Signal swings on high-capacitance nodes like bit lines and data lines are decreased, resulting in low-power operation. Clocked voltage sense amplifiers are required for low sensing power, and accurate generation of their sense clock is required for high-speed operation. To limit bit line and I/O line swings and enable clocked sense amplifiers, we examine tracking circuits that aid in the generation of the sense clock. The tracking circuits use a replica memory cell and a replica bit line to successfully track the memory cell's delay across a wide variety of manufacturing and operating conditions. Experimental results from two different prototypes are reported. Finally, we investigate the speed and power scaling trends of SRAMs as a function of size and technology, finding that if the interconnect delay is negligible, the SRAM delay grows as the logarithm of its size. Wire delay becomes increasingly important for SRAMs beyond the fifth generation. The wire delay worsens as the process shrinks, requiring wire redesign to keep the wire delay proportionate to the gate delay. Hierarchical SRAM topologies provide enough array space for fat wires, which can be used to keep the wire delay under control for 4Mb and smaller designs throughout process shrinks.
I. Introduction
Fast low-power SRAMs are now a critical component of many VLSI circuits. This is especially true in the case of microprocessors, where on-chip cache capacities are increasing with each generation to close the gap between processor and main memory speeds [1-2]. Simultaneously, power dissipation has become a critical issue because of increased integration and operating speed, as well as the exponential rise of battery-operated goods [3]. This paper looks at the design of SRAMs, with a focus on reducing delay and power consumption. While process [4-5] and supply [6-11] scaling remain the primary drivers of fast, low-power designs, this paper examines a variety of circuit approaches that may be used in combination to achieve quick, low-power operation.
Conceptually, the architecture of an SRAM is shown in Figure 1. It is made up of memory cells organised in a matrix of 2^m rows and 2^n columns. In each memory cell of an SRAM, a pair of cross-coupled inverters forms a bi-stable element. These inverters are connected to a pair of bit lines via nMOS pass transistors, giving differential read and write access. An SRAM comprises row and column circuitry to access these cells. The address input is separated into m row address bits and n column address bits, which identify the cell to be accessed. The row decoder activates one of the 2^m word lines, connecting the memory cells of that row to their corresponding bit lines. The column decoder uses a pair of column switches to connect one of the 2^n bit-line columns to the peripheral circuitry.
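As a concrete illustration of this address split, the small sketch below (our own, not part of the original design) divides an (m + n)-bit address into its row and column fields:

```python
# Illustrative sketch: splitting an SRAM address into row and column fields.
# For a 2^m-row by 2^n-column array, the top m bits drive the row decoder
# and the bottom n bits drive the column decoder (names are ours).

def split_address(addr: int, m: int, n: int) -> tuple[int, int]:
    """Return (row, column) indices for an (m + n)-bit address."""
    assert 0 <= addr < (1 << (m + n)), "address out of range"
    row = addr >> n               # upper m bits select one of 2^m word lines
    col = addr & ((1 << n) - 1)   # lower n bits select one of 2^n columns
    return row, col

# Example: a 1Kb array with 32 rows (m=5) and 32 columns (n=5).
print(split_address(0b1011000101, m=5, n=5))  # -> (22, 5)
```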
During a read operation, the bit lines are precharged to some reference voltage, usually close to the positive supply. When the word line goes high, the access nFET coupled to the cell node holding a '0' begins discharging the bit line.
During a write operation, the write data is communicated to the relevant columns by grounding either the bit line or its complement, forcing the data onto the bitline pair. If the cell data differs from the write data, the '1' node is discharged when the access nFET connects it to the discharged bitline, causing the cell to be written with the bitline value.
The core SRAM structure can be greatly modified to reduce latency and power at the cost of some area overhead. The RAM cell's design and layout are the first steps in the optimisation process, done with the assistance of process technologists [4]. This paper, for the most part, assumes that a RAM cell has been appropriately designed and investigates how to connect the cells effectively.
In the SRAM data path, switching the bitlines and I/O lines, as well as biasing the sense amplifiers, consumes a significant fraction of the overall power, especially in wide-access-width memories. Tracking circuits are used to limit bitline and I/O line swing and assist in the generation of the sense clock, allowing for clocked sensing.
We then take a step back from circuit design specifics to look at SRAM speed and power as size and technology scale. For the decoder and data path, we use basic analytical models for delay and power as a function of size, organisation, and technology. The models are then used to find the organisations that minimise delay and power at each SRAM size and technology generation, allowing us to visualise scaling trends.
II. Literature Review
Initial low-power SRAM solutions advocated using dual V_t to reduce leakage power, along with some decoding improvements [15]. Most contemporary SRAM cells already use these approaches and nevertheless have large leakage currents. This is because in nanoscale nodes most of these leakage control approaches become ineffective. Leakage power in 6T SRAM cells may be decreased by lowering the gate voltage and body biasing the access transistors.
Dual biasing and pMOS transistors can also be used to reduce leakage [4]. SRAM can be power gated to a lower nominal supply voltage at sub-array granularity to reduce idle power consumption [16]. This approach takes advantage of the data retention features of SRAM cells by keeping the supply voltage above the data retention voltage (DRV) of most of the cells. The data retention voltage is the voltage above which an SRAM cell's data integrity is almost guaranteed. Within a big cluster of cells, however, the DRV value tends to fluctuate.
As the technology node shrinks, the DRV grows, triggering bit flips in SRAM cells and retention failures. As a result, modern last-level caches use ECC to provide some error protection [11]. Many manufacturers leverage DRV characteristics to implement fault-tolerant SRAM systems at the system level [9].
III. Methodology
The delay and power of practical SRAMs have been reduced over time as array organisation and circuit design have improved. This section delves into both topics, as well as the issues they raise.
A. SRAM Partitioning
For large SRAMs, partitioning the cell array into smaller sub-arrays, rather than using the single monolithic array shown in Figure 1, yields considerable improvements in latency and power. A large array is frequently partitioned into multiple smaller sub-arrays (known as macros), each of which carries a fraction of the accessed word, referred to as the sub-word; all macros are activated at the same time to access the complete word [2]. Low-power SRAMs usually contain only one macro, but high-performance SRAMs can have many [6]. The macros can be regarded as separate RAMs, except that part of the decoder may be shared.
Every macro follows the same basic structure as the one shown in Figure 1. The word line activates all of the cells in a row when the row is accessed, and the column multiplexers are utilised to access the relevant sub-word. This style has two limitations for macros with many columns: the bit line power increases linearly with the number of columns, and the word line RC delay increases as the square of the number of cells in the row.
The Divided Word Line (DWL) technique, initially presented by Yoshimoto et al. in [17], can be used to remedy these faults by subdividing macros into smaller blocks of cells. The DWL method breaks a long word line in a typical array into k pieces, each of which is triggered independently, reducing the word line length by a factor of k and hence the RC delay by k². The DWL architecture divides a 256-column macro into four blocks, each with 64 columns (see Figure 2). The row selection procedure is now divided into two stages: first, a global word line is activated.
Due to its reduced capacitive loading, the global word line has a smaller RC delay than a full-length word line, while being almost the width of the macro. Instead of seeing all 256 cells, it sees only the input loading of the four word-line drivers. Its resistance may also be lowered, since it can use wider wires on a higher-level metal layer. The word line RC delay is reduced by another factor of four by keeping the word drivers in the middle of the word line segments, shortening each segment. Because just 64 cells in the block are active, rather than all 256 cells in the undivided array, the column current is reduced by a factor of four. For large RAMs, the Hierarchical Word Decoding (HWD) technique [45] is based on the concept of recursively dividing the word line on the global word line (and the block select line). As detailed in the following section, partitioning can also be utilised to reduce bitline heights.
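The k² saving follows because a distributed RC line's delay grows with the square of its length. A minimal sketch of that scaling, with per-cell resistance and capacitance values chosen purely for illustration:

```python
# Minimal sketch of the Divided Word Line (DWL) benefit: a distributed RC
# line's 50% delay is roughly 0.38 * R_total * C_total, i.e. quadratic in
# length, so cutting the word line into k segments cuts the RC delay ~k^2.
# The per-cell r and c values below are arbitrary illustrative numbers.

def wordline_rc_delay(cells: int, r_per_cell: float = 5.0,      # ohms
                      c_per_cell: float = 2e-15) -> float:       # farads
    """Elmore-style delay of a distributed RC word line (seconds)."""
    return 0.38 * (r_per_cell * cells) * (c_per_cell * cells)

full = wordline_rc_delay(256)           # undivided 256-column word line
segment = wordline_rc_delay(256 // 4)   # one of four DWL segments (k = 4)
print(full / segment)                   # -> 16.0, i.e. a k^2 = 16x reduction
```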
B. Circuit Techniques in SRAMs
The SRAM access path is made up of two parts: a decoder and a data path. The decoder includes the circuits that connect the address input to the word line. The data path refers to the circuits that link the cells to the I/O ports.
The logical function of the decoder is that of 2^m m-input AND gates, with a hierarchical implementation of the large fan-in AND operation. A partially decoded product term such as f_i = A0·A1·A2·A3, one of the 16 combinations of a set of four address inputs and their complements, is decoded first to activate one of the 16 predecoder output wires. The predecoder outputs are then combined to activate the word line in the following stage.
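A behavioural sketch of this two-stage decode is given below; the 4-bit grouping matches the text, while the function names and bit layout are our own illustrative choices:

```python
# Behavioural sketch of two-stage row decoding: address bits are first
# predecoded in groups of 4 into 16 one-hot wires each, and the word line
# driver ANDs one wire from each group. Grouping and names are illustrative.

def predecode(addr_bits: int, group: int) -> list[int]:
    """One-hot decode a 4-bit address field into 16 predecode wires."""
    field = (addr_bits >> (4 * group)) & 0xF
    return [1 if i == field else 0 for i in range(16)]

def word_line_active(addr_bits: int, row: int) -> bool:
    """The word line for `row` fires iff both predecode wires it taps are high."""
    lo, hi = predecode(addr_bits, 0), predecode(addr_bits, 1)
    return bool(lo[row & 0xF] and hi[(row >> 4) & 0xF])

# 8 address bits -> 2 predecode groups -> 256 word lines; row 0xA7 fires:
print(word_line_active(0xA7, 0xA7))   # True
print(word_line_active(0xA7, 0xA8))   # False
```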
The decoder delay can be significantly lowered by optimising the circuit style used to build the decoder gates. To perform the decode logic function, older designs employed a simple combinational approach using static CMOS circuits (Figure 3) [17-19]. At any one time in this design, one of the 2^m word lines will be active.
IV. System Architecture
Using some simple analytical models for the various components of the SRAM, this section examines the scaling of delay and power of SRAMs with size and technology. Exploring the enormous design space with traditional SPICE circuit simulation would take a long time, hence simplified analytical models are quite useful. Such models are useful not only for developing SRAMs in the present generation, but also for predicting trends in future generations.
Two effects stand out as feature size shrinks by a factor of two every 18 months: interconnect delay is worsening relative to transistor delay, and transistor threshold mismatches are not scaling with supply voltage [14, 13]. Because SRAMs require information to be broadcast globally throughout the whole array, and part of the signal path inside the array employs small signal swings followed by sense amplification, both effects are expected to have a significant impact on SRAMs. With the use of analytical models, this section explores both of these impacts.
A. SRAM Partitioning
At the top level, the partitioning may be represented by three variables: the number of macros (n_m) that make up the array, and the block width (b_w) and block height (b_h) of each of the sub-blocks that make up a macro. Figure 4 illustrates how a 1024x1024 array of cells in a 1Mb SRAM may be partitioned for a 64-bit access.
B. Modelling of the SRAM
We make certain simplifying assumptions about key features of the design to explore the enormous SRAM design space in a tractable way. In the next section, we list all the assumptions and explain and justify the essential ones. The SRAM path may be split into two components, as discussed: the row decoder and the RAM data path. We create basic analytical models for delay, area, and power, and compare them against HSPICE circuit simulations.
Since the final stage is sized to have a fanout of 4, the total delay of the stage is the sum of a fanout-of-4 inverter delay (τ_fo4) and the RC delay of the local word line (R_wl·C_wl/8 if the word drivers drive the local word line from the centre of the line).
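As a numeric sketch of this stage delay, assuming placeholder values for τ_fo4 and the word line parasitics:

```python
# Sketch of the word-driver stage delay from the text: one fanout-of-4
# inverter delay plus the local word line's RC delay, reduced to
# R_wl * C_wl / 8 when the line is driven from its centre (each half has
# half the R and half the C, and (1/2) * (1/2) * (1/2) = 1/8).
# All numeric values are assumed placeholders, not the paper's figures.

TAU_FO4 = 90e-12   # fanout-of-4 inverter delay, assumed for a 0.25 um process

def word_driver_stage_delay(r_wl: float, c_wl: float,
                            centre_driven: bool = True) -> float:
    rc = r_wl * c_wl / 8 if centre_driven else r_wl * c_wl / 2
    return TAU_FO4 + rc

print(word_driver_stage_delay(r_wl=640.0, c_wl=130e-15))  # ~1.0e-10 s (~100 ps)
```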
The area of the word drivers is modelled as a linear function of the total device width inside them. Fitting this function to the areas obtained from the layouts of six distinct word drivers yielded the constants [11, 12]. When fanout-of-4 sizing is employed for the gates, the overall device width within the driver is estimated to be 1.25 times the size of the final buffer, since the active area of the predriver can be approximated as a quarter of that of the final inverter. The overall decode area is also increased to accommodate the vertical predecode and block select wires. As an illustration, the areas for the 64 local word drivers account for the increase in SRAM array width owing to the 12-to-4096 decoder of Figure 5. There is just one global word driver, and 16 predecode wires and 64 block select wires on vertical wiring tracks.
2. Output Mux
The output mux comprises the bitline mux, which passes the cell data into the sense amplifiers, as well as the data line mux, which connects the sense amplifiers to the output.
Because the signal levels in both muxes are low (less than 100 mV), both muxes' input signal sources may be treated as ideal current sources.
For a current source input, the delay degradation over an RC network is different than for a voltage source input. Consider an ideal current source, as depicted in Figure 6, driving an RC π network. Figure 6(b) depicts the voltage waveforms of nodes 1 and 3, as well as the waveform when the resistance is zero (dashed line).
Nodes 1, 2, and 3 slew at the same rate in steady state (t >> τ_RC), and the latency to achieve a given swing at node 3 equals the delay when there is no resistance plus the network's time constant, τ_RC. Figure 7 shows a bitline circuit, for which the bitline RC network's time constant can be estimated. This formula is used to calculate the bit-line and data-line muxes' delays.
A single-level bitline mux is depicted, which is treated as an ideal current source driving an RC network. The capacitances and resistances in the network are contributed by the local and global bitline wires, as well as the mux switches. Equation 7 calculates the bitline delay to develop a signal swing of δV as the sum of the delay to generate the voltage swing with no resistance and the RC network's time constant. Because of the line resistance, long local word lines might take a long time to rise. We must include the rise time in the delay model, since it influences the cell delay. The influence of the rise time (T_r) can be reflected by adding a proportionate extra term to the delay equation [7]. The proportionality constant α is determined by the ratio of the cell's access device's threshold voltage to the supply voltage, and we estimate it to be about 0.3 over a wide variety of block widths based on simulations. In the bitline delay equation, the RC time constant, τ_RC, is calculated as in Equation 4. Figure 8 depicts the estimated and HSPICE-measured delay through the local word driver, resistive word line, and bitline, up to the sense amps' input. When the bitline height is at least 32 rows, the predicted delay is within 2.4 percent of the HSPICE latency for both short (16 columns) and long (1024 columns) word lines.
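Putting these pieces together, a sketch of the bitline delay model follows; the α ≈ 0.3 coefficient is from the text, while the electrical values are illustrative placeholders:

```python
# Sketch of the bitline delay model (Equation 7 in the text): the delay to
# develop a swing dV equals the zero-resistance charging time (the cell acts
# as an ideal current source), plus the RC network's time constant, plus a
# word-line rise-time term alpha * T_r with alpha ~ 0.3 (from the text).
# All electrical values below are illustrative placeholders.

ALPHA = 0.3  # rise-time proportionality constant, ~constant per the text

def bitline_delay(dv: float, c_bl: float, i_cell: float,
                  tau_rc: float, t_rise: float) -> float:
    t_no_r = c_bl * dv / i_cell   # time to swing dV with zero resistance
    return t_no_r + tau_rc + ALPHA * t_rise

# e.g. 100 mV swing, 200 fF bitline, 50 uA cell current, 30 ps RC, 80 ps rise
print(bitline_delay(0.1, 200e-15, 50e-6, 30e-12, 80e-12))  # ~0.454 ns
```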
Figure 9 shows the sense amplifier buffer chain, which consists of a basic cross-coupled latch, a chain of inverters, and a pair of nMOS drivers [6, 2]. Both the local and global sense amplifiers employ the latch to transform the small-swing input signal to a full-swing CMOS signal; the local sense amplifiers are one example.
The inverter chain buffers the latch output and drives the gates of the output nMOS drivers. Both calculations and circuit simulations show that, at a gain of roughly 20 with just the self-loading of the amplifier, the latch delay is about 2 τ_fo4. If all the latch's transistors are scaled in the same proportion, the output resistance and input capacitance may be written as simple functions of the size of the latch's cross-coupled nMOS, w_s, as illustrated in Figure 9. The nMOS drivers are modelled as current sources, with current output proportional to w_n.
To minimise the overall output mux delay, optimal sizes w_s, w_b, and w_n are calculated, just as they are in the decoders. Equation 7 sums the delays of the bitline mux, the latch sense amp, the buffers, and the nMOS drivers to capture the essential components of the output mux delay needed for this optimisation. To simplify the determination of the optimal sizes, the effect of the latch sense amp size on the bitline mux time constant is neglected, and only the cell delay is considered.
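The sketch below illustrates this style of sizing optimisation with toy delay expressions; the functional forms and constants are simplified stand-ins for the paper's Equation 7, not the actual model:

```python
# Sketch of the output-mux sizing optimisation described in the text: total
# delay = bitline mux + latch sense amp + buffer chain + nMOS drivers, each
# a simple function of the device sizes, minimised by brute-force search.
# The functional forms and constants are toy stand-ins, in arbitrary units.

def total_delay(w_s: float, w_n: float,
                c_bl: float = 200.0, c_out: float = 400.0) -> float:
    bitline = c_bl / 10.0 + 2.0 * w_s     # sense amp loads the bitline
    latch   = 20.0 / w_s                  # a bigger latch resolves faster
    buffers = 4.0 * (w_n / w_s) ** 0.5    # fanout-of-4-style buffer chain
    driver  = c_out / w_n                 # driver charges the data line
    return bitline + latch + buffers + driver

best = min(((total_delay(ws, wn), ws, wn)
            for ws in range(1, 41) for wn in range(1, 201)),
           key=lambda t: t[0])
print(best)  # (delay, w_s, w_n) at the optimum of this toy model
```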
The vertical area of the SRAM array is increased by the switches in the bitline mux and the circuitry of the sense amplifiers, precharge, and write drivers (Figure 10). These components' area estimates are based on data from a prior design [13]. The write driver, precharge, and mux transistors have not been optimised; we add an overhead of 4, 1, and 2 memory cells for them, respectively.
The area of the local sense amps is modelled as a linear function of the total device width within the sense amp. The model's parameters are determined by fitting it to data from five distinct designs [2, 4]. The size parameters w_s, w_n, and w_b are used to estimate the entire device width inside the sense amp structure. The total of all device widths within the latch is calculated as w_s × 8.7, with the factor of 8.7 taken from [3]. With fanout-of-4 sizing, the active area of the buffers preceding each nMOS output driver is no more than 1/3 of the driver width w_n. As a result, 2 × w_n × 1.33 is the active area of the two nMOS drivers and their corresponding buffers.
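A sketch of this area bookkeeping; the 8.7 and 1.33 factors are from the text, while the linear-fit coefficients a0 and a1 are placeholders for the fitted values:

```python
# Sketch of the sense-amp area model: total device width inside the amp is
# latch width (w_s * 8.7, factor from [3]) plus two nMOS output drivers and
# their buffers (2 * w_n * 1.33), mapped to area by a linear fit. The fit
# coefficients a0, a1 are placeholders for the values fitted in the text.

def sense_amp_area(w_s: float, w_n: float,
                   a0: float = 12.0, a1: float = 0.9) -> float:
    total_width = w_s * 8.7 + 2 * w_n * 1.33   # total device width (um)
    return a0 + a1 * total_width               # linear area model (um^2)

print(sense_amp_area(w_s=4.0, w_n=20.0))  # area for one illustrative sizing
```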
The outcomes of utilising these models to analyse multiple RAM organisations of varying sizes in various technological generations will be described next.
V. Results
Using the basic models presented before, we enumerate all RAM organisations and estimate their area, delay, and energy. This enables us to identify the most efficient organisations based on a weighted objective function of delay, area, and energy.
Figure 11 shows the delay of SRAMs organised for minimum delay, with and without wire resistance, in the 0.25µm technology, with sizes ranging from 64Kb to 16Mb at a 64-bit access width. The SRAM delay without wire resistance is roughly 15 τ_fo4 for a 64Kb design, and it is proportional to the log of the capacity, as seen in [6]. For every doubling of RAM capacity, the latency rises by around 1.2 τ_fo4, which may be explained in terms of the delay scaling of the row decoder and the output path. In an optimally organised SRAM, the delays for both of these are displayed in the same graph and are almost equal. In the case of the row decoder, each address bit selects half the array, and hence the loading seen by the address bit is proportional to S/2, where S is the total number of bits in the array. With the fanout-of-4 sizing algorithm, the number of stages in the decoder will be proportional to the logarithm to base 4 of the total load, with each stage having a delay of around one τ_fo4. As a result, each doubling of the number of bits adds half a τ_fo4 of delay. The wire capacitance in the data line mux rises by around 1.4× for every doubling of the size of the output path, since it is proportional to the array's perimeter, and the local sense amps' delay rises by around 0.25 τ_fo4.
The remaining increase is due to the doubling of the multiplexer size for the bitline and data line muxes, and its precise amount is determined by the memory cell's unit drain junction capacitance and unit saturation current, as well as the nMOS output drivers' unit saturation current.
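This trend can be restated as a one-line model, roughly 15 τ_fo4 at 64Kb plus about 1.2 τ_fo4 per doubling of capacity, sketched below under those assumptions:

```python
# Sketch of the no-wire-resistance delay trend from the text: ~15 tau_fo4
# at 64Kb, growing by ~1.2 tau_fo4 per doubling of capacity (log scaling).
import math

def sram_delay_fo4(bits: int) -> float:
    doublings = math.log2(bits / (64 * 1024))
    return 15.0 + 1.2 * doublings

for size in (64 * 1024, 1 << 20, 4 << 20, 16 << 20):
    print(f"{size >> 10}Kb: {sram_delay_fo4(size):.1f} tau_fo4")
# 64Kb: 15.0, 1024Kb: 19.8, 4096Kb: 22.2, 16384Kb: 24.6
```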
Figure 11 shows the SRAM delay with wire resistance as the final curve. The width of the global wires for this curve is assumed to be 10λ (7.5 Ω/mm). Because wire RC delay increases with wire length, the wire delay for global wires in the SRAM scales with SRAM size and becomes dominant for big SRAMs.
To lessen the impact of interconnect delay, wire width optimisation can be done. Figure 12 depicts the overall delay for a 4Mb SRAM for two distinct wire widths across four technology generations. The metallisation at 0.18µm and below is assumed to be copper. The lowest curve depicts the delay when the wire resistance is taken to be zero.
Because the voltage difference required for sensing remains constant as technology advances, the signal swings on the bitline and data lines do not scale in proportion to the supply voltage, and their delays will worsen in comparison to the remainder of the RAM.
As previously stated, a skewed gate's delay is around 70% that of a non-skewed gate and grows more slowly with increasing load. When skewed circuits are used in the predecoder and global word driver, the decoder latency in a 64Kb RAM is reduced to roughly 6 τ_fo4 instead of the static implementation's 8 τ_fo4, resulting in a net decoder gain of up to 1 τ_fo4 when losses due to redundancy and a smaller local word driver are considered. Moreover, when the RAM capacity is doubled, the decoder latency increases by roughly 0.3 τ_fo4 instead of the 0.5 τ_fo4 of the static case. In the output path, the sense clock for the local sense amplifiers is frequently padded with additional delay to guarantee safe operation. With these modifications, due to the higher decoder performance, we would anticipate the overall delay to start at around the same point in Figure 11 but grow at a slower pace with size than projected. Figure 11 shows that the decoder and output path delays are almost equal. To minimise the local bitline delay, the memory has been partitioned to have the smallest feasible block, so the output path latency cannot be decreased further by partitioning.
As a result, when dynamic decoders are used, the output path delay will not be able to keep up with the decoder delay, at least for the circuits we have considered.
Partitioning allows one to balance delay, area, and power. Equation 9 may be used to generate trade-off curves by varying the values of the parameters α and β. For β equal to 0, Figure 13 shows the resulting delay-area trade-off for a 4Mb SRAM in a 0.25µm process. Any point on this curve reflects the smallest area that may be achieved by RAM reorganisation at the given delay. Starting from a finely partitioned least-delay design, significant area savings are attainable by reducing the amount of partitioning and incurring a minor delay cost. As the partitioning is reduced further, the area improvement diminishes while the delay penalty grows. The figure depicts the partitioning parameters for three points A, B, and C. When compared to the fastest implementation, points A and B are in the sweet part of the curve, with A being around 22 percent slower with 22 percent smaller area and B being 14 percent slower with 20 percent smaller area.
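A sketch of how such trade-off curves can be traced: each candidate organisation is scored with a weighted objective and the best one kept as the weights vary. The delay and area numbers below are derived from the three points just discussed (fastest ≈ 22.2 τ_fo4, A 22% slower and 22% smaller, B 14% slower and 20% smaller), while the objective form is a simplified stand-in for Equation 9:

```python
# Sketch of an Equation-9-style trade-off: score every candidate
# organisation with delay + alpha*area + beta*energy, and sweep alpha with
# beta = 0 to trace the delay-area curve. In a full version, `candidates`
# would be enumerated over (n_macros, block_w, block_h) using the models.

def objective(delay: float, area: float, energy: float,
              alpha: float, beta: float) -> float:
    return delay + alpha * area + beta * energy

def best_organisation(candidates, alpha: float, beta: float = 0.0):
    return min(candidates,
               key=lambda c: objective(c["delay"], c["area"], c["energy"],
                                       alpha, beta))

candidates = [  # illustrative designs, finely -> coarsely partitioned
    {"name": "fastest", "delay": 22.2, "area": 100.0, "energy": 1.0},
    {"name": "A",       "delay": 27.1, "area": 78.0,  "energy": 0.9},
    {"name": "B",       "delay": 25.3, "area": 80.0,  "energy": 0.95},
]
for alpha in (0.0, 0.3, 1.0):
    print(alpha, best_organisation(candidates, alpha)["name"])
# -> fastest, then B, then A as area is weighted more heavily
```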
The RAM delay is the most sensitive to the block height among the different organisation parameters, and faster access times are attained by employing lower block heights. Figure 14 displays the delay and area for a 4Mb SRAM for varying block heights, while utilising the best settings for the remaining organisation parameters. Small block heights minimise the latency of the bitlines but increase the delay of the global wires, because the RAM size grows owing to the overhead of bitline partitioning.
As a result, greater block heights are desirable in low-energy designs, as indicated in [8-9]. Because the junction capacitances of the memory cell's access transistor are very small compared to the junction capacitances in the data line mux, most of the multiplexing may be done in the bitline mux. The energy consumption of efficiently organised SRAMs may also be stated as a sum of two components, according to our findings.
One component is unaffected by capacity and is determined only by the access width; the local word line, the precharge signal, and the local and global sense amps all contribute to it. The other component, which corresponds to the power dissipation in the global wires and decoders, scales as the square root of the capacity.
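A sketch of this two-component energy model; the coefficients e0 and e1 are placeholders for values that would be fitted to a real design:

```python
# Sketch of the two-component energy model from the text: one term fixed by
# the access width (local word line, precharge, sense amps), one growing as
# sqrt(capacity) (global wires and decoders). e0/e1 are placeholder fits.
import math

def access_energy(bits: int, width: int,
                  e0: float = 0.04, e1: float = 0.002) -> float:
    return e0 * width + e1 * math.sqrt(bits)   # arbitrary energy units

for size in (64 * 1024, 1 << 20, 16 << 20):
    print(f"{size >> 10}Kb: {access_energy(size, width=64):.2f}")
# quadrupling the capacity only doubles the second component
```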
VI. Conclusion
We investigated design trade-offs for fast low-power SRAMs in this study. The SRAM access path consists of two parts: the read data path and the row decoders. Techniques for optimising both were investigated. We sketched out the best decoder structure for fast low-power SRAMs. Optimal decoder implementations arise when the decoder, omitting the predecoder, is implemented as a binary tree. As a result, power dissipation is minimised, since only a small number of the long decode wires switch. Beyond the fifth generation, the wire delay becomes increasingly critical for RAMs. As the process shrinks, the wire delay worsens, necessitating wire redesign to keep the wire delay in proportion to the gate delay. A divided word line structure for the decoders and column muxing for the bitline path create enough space across the array for fat wires to be employed to control wire delay for 4Mb and smaller designs throughout process shrinks. The wire delay has a lower bound set by the speed of light; this is about 1.75 τ_fo4 for the 4Mb SRAM and doubles with every quadrupling of capacity. As a result, new high-performance RAM architectures are required at the 16Mb level and above. Instead of the existing technique, where signals are broadcast globally over the array, the RAM architecture has to be altered to employ address and data routing (see for example [6]). Because wire delay is related to cell area, cell designs with smaller areas, even if the cells are weaker, will win out for big RAMs. As a result, the DRAM cell, multi-valued cells, TFT-based cells, and other innovative cell designs will be worth investigating for future high-performance, high-capacity RAMs.
References
[1] P. Barnes, "A 500MHz 64b RISC CPU with 1.5Mb On-Chip Cache", 1999 IEEE International Solid-State Circuits Conference, Digest of Technical Papers, pp. 86-87.
[2] S. Hesley, et al., "A 7th-Generation x86 Microprocessor", 1999 IEEE International Solid-State Circuits Conference, Digest of Technical Papers, pp. 92-93.
[3] Special issue on low power electronics, Proceedings of the IEEE, vol. 83, no. 4, April 1995.
[4] S. Subbanna, et al., "A High-Density 6.9 sq. um Embedded SRAM Cell in a High-Performance 0.25 um-generation CMOS Logic Technology", IEDM Technical Digest, pp. 275-278, 1996.
[5] G. G. Shahidi, et al., "Partially depleted SOI technology for digital logic", ISSCC Digest of Technical Papers, Feb. 1999, pp. 426-427.
[6] A. P. Chandrakasan, et al., "Low-Power CMOS Digital Design", IEEE Journal of Solid-State Circuits, vol. 27, no. 4, pp. 473-484, April 1992.
[7] W. Lee, et al., "A 1V DSP for Wireless Communications", 1997 IEEE International Solid-State Circuits Conference, Digest of Technical Papers, pp. 92-93.
[8] M. Izumikawa, et al., "A 0.25-um CMOS 0.9-V 100-MHz DSP Core", IEEE Journal of Solid-State Circuits, vol. 32, no. 1, pp. 52-61, Jan. 1997.
[9] K. Ishibashi, et al., "A 1V TFT-Load SRAM Using a Two-Step Word Voltage Method", 1992 IEEE International Solid-State Circuits Conference, Digest of Technical Papers, pp. 206-207.
[10] H. Yamauchi, et al., "A 0.5V/100MHz Over-Vcc Grounded Data Storage (OVGS) SRAM Cell Architecture with Boosted Bit-Line and Offset Source Over-Driving Schemes", 1996 IEEE International Symposium on Low Power Electronics and Design, Digest of Technical Papers, pp. 49-54.
[11] K. Itoh, A. R. Fridi, A. Bellaouar and M. I. Elmasry, "A deep sub-V, single power-supply SRAM cell with multi-Vt, boosted storage node and dynamic load", 1996 Symposium on VLSI Circuits, Digest of Technical Papers, pp. 132-133.
[12] R. C. Jaeger, "Comments on 'An optimized output stage for MOS integrated circuits'", IEEE Journal of Solid-State Circuits, vol. SC-10, no. 3, pp. 185-186, June 1975.
[13] C. Mead and L. Conway, Introduction to VLSI Systems, Reading, MA: Addison-Wesley, 1980.
[14] N. C. Li, et al., "CMOS tapered buffer", IEEE Journal of Solid-State Circuits, vol. 25, no. 4, pp. 1005-1008, August 1990.
[15] J. Choi, et al., "Design of CMOS tapered buffer for minimum power-delay product", IEEE Journal of Solid-State Circuits, vol. 29, no. 9, pp. 1142-1145, September 1994.
Copyright © 2022 Lone Foziya, Er. Ritesh Kumar Ojha. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
Paper Id : IJRASET41407
Publish Date : 2022-04-12
ISSN : 2321-9653
Publisher Name : IJRASET