Design and Optimization of the VLSI Architecture for Discrete Cosine Transform Used in Image Compression

Mario Kovač¹, Mario Žagar¹ and N. Ranganathan²

¹ Faculty of Electrical Engineering and Computing, University of Zagreb, Zagreb, CROATIA
² Center for Microelectronics Research, Department of Computer Science and Engineering, University of South Florida, Tampa, USA

Keywords: DCT, data compression, VLSI, computer architecture, algorithms, multimedia.

1. Introduction

Presentation of images plays a significant role in today’s information exchange. Numerous applications that have been introduced in last few years, such as video teleconferencing, HDTV, wirephoto, fax, computer tomography, interactive visualization, multimedia and other, are based on image presentation and distribution procedures. Disadvantage of using digital images in these applications is enormous amount of space needed for image storage. For example, a 1024 × 1024 color image with 24 bits per pixel requires 3.15 M bytes in the raw form. It is obvious that efficient handling of images is possible only with the introduction of data compression techniques. Data compression is the reduction or elimination of redundancy in order to achieve savings in storage and communication costs. Data compression techniques can be classified in many ways, but the most common classification defines two basic categories: lossless and lossy. In lossless methods, the exact original data is recovered while in lossy methods only a close approximation of the original data can be obtained. Lossless methods are used in applications where no loss of information is allowed, such as text compression, medical imaging, and similar. Lossy methods are mostly used in image and audio compression where designers or users can select the quality of the restored data. Disadvantage of lossless methods is that the compression ratio is relatively small and runs between 3:1 and 4:1, while lossy methods can achieve compression ratios of up to 100:1. Even with compression ratios of that magnitude, due to the large volume of data, real-time operation of image using applications requires development of very high speed implementation of image compression/decompression techniques.

In recent years, a working group known as Joint Photographic Expert Group (JPEG) consisting of three international standard organizations, International Telegraph and Telephone Consultative Committee (CCITT), International Organization for Standardization (ISO) and International Electrotechnical Commission (IEC), has established an international standard for coding and compression of continuous-tone still images. This standard is commonly referred to as the JPEG standard. The primary aim of the JPEG standard is to propose an application independent image compression algorithm that would be and aid VLSI implementation of data compression [32].

In this paper we describe design and optimization of the most critical part of the efficient single chip architecture for JPEG image compr-
sion: the Discrete Cosine Transform (DCT) module. To maximize the throughput and reduce the silicon area required the architecture has been designed on the basis of the novel hardware algorithms, and later optimized to reduce any redundancy in the logic.

The paper is organized as follows. In the second chapter we briefly describe JPEG Compression Standard. The third chapter reviews previous work in the field. The fourth chapter presents detailed description of the design, optimization and implementation of the DCT architecture and it is followed by conclusions and references.

2. JPEG Compression Standard

The basic model for the JPEG encoder is shown in Fig. 1. The encoder model transforms the input image into an abstract representation which is more suitable for further processing. To achieve this transformation, the encoder model may require parameters stored in some model tables. The entropy encoder is a compression procedure which converts the output of the encoder model into a compressed form. Also, the entropy encoder may use tables for storing the entropy codes. Four distinct coding processes were derived based on the above described JPEG model: (i) baseline process, (ii) extended DCT-based process, (iii) lossless process and (iv) hierarchical process.

The baseline and the extended processes are also known as DCT-based processes since they use DCT within the encoder model. The lossless process uses prediction based methods within the encoder model. The hierarchical process uses the encoder model from the extended process or the lossless process. The baseline process uses Huffman codes for entropy encoding, while the other three processes use either Huffman or arithmetic. Since the focus of this paper is on VLSI implementation of the baseline process, we describe the baseline process in detail in the rest of this section. For a complete overview of the JPEG standard and the various processes, the reader is referred to [32,44].

The encoder model for the baseline process is shown in Fig. 2. The input image is divided
into nonoverlapping blocks of $8 \times 8$ pixels and input to the baseline encoder. The pixel values are converted from unsigned integer format to signed integer format and DCT computation is performed on each block. DCT transforms the pixel data into a block of spatial frequencies that are called the DCT coefficients. Since the pixels in the 8x8 neighborhood typically have small variations in gray levels, the output of DCT will result in most of the block energy being stored in the lower spatial frequencies. On the other hand, the higher frequencies will have values equal to or close to zero and hence, can be ignored during encoding without significantly affecting the image quality. Selection of frequencies based on their importance can affect the quality of the final image. JPEG allows for this by letting the user redefine the quantization tables used in the quantization step following the DCT computation. The selection of quantization values is critical, since it affects both the compression efficiency and the reconstructed image quality.

The block of DCT coefficients output by the encoder model is rearranged into one dimensional data using zig-zag reordering as shown in Fig. 3. The location $(0, 0)$ of each block $I$ contains the DC coefficient for the block, represented as $DC_i$. This DC coefficient is replaced by the value $\Delta DC_i$ which is the difference between the DC coefficients of block $I$ and block $I-1$. Since the pixels of adjacent blocks are likely to have similar average energy levels, only the difference between the current and previous DC coefficients is used, which is commonly known as differential pulse code modulation (DPCM) technique. It should be noted that due to the zig-zag reordering the high frequency coefficients that are more likely to be zeroes, get grouped at the end of the one dimensional data.

The entropy encoder details are shown in Fig. 4. In order to encode the rearranged DCT coefficients the entropy encoder uses variable length encoding based on a statistical model. In the entropy encoder the quantized DCT coefficients are converted into a stream of [runlength, count, category] pairs. For each pair, there is a corresponding variable length Huffman code, which will be used by the Huffman encoder to perform the compression. The Huffman codes are stored in a table.

In order to achieve better compression results, input images are transformed very often to a
different color space (or color coordinates) representation before being input to the encoder. Although the JPEG algorithm is unaffected by the color, since it processes each color independently, it has been shown that by changing the color space, the compression ratio can be significantly improved.

3. Related work

Discrete Cosine Transform, a key function in the JPEG compression/decompression process, has been widely used in many applications and hence there is a vast amount of research work published on this topic. Of particular interest are the papers discussing hardware implementation approaches. It is well known that two-dimensional DCT computation can be implemented as a sequence of two one-dimensional DCT's which is commonly referred to as the separability property. It is simpler to implement this approach in hardware. It was shown by Haralick [21] that the DCT of N points can be computed using two N-point FFT's by exploiting the symmetry of the inputs. Later, Tseng and Miller [49] showed that the DCT can be obtained more efficiently by just computing the real part of the first N coefficients of the 2N-point DFT. The computation of 8-point DCT needed for JPEG can be replaced by 16-point DFT computation followed by scaling. An optimum form for 16-point DFT was developed by Winograd [53]. Arai, Agui and Nakagima adapted Winograd's solution for 8-point DCT reducing the computation by means of the symmetry property [7]. The hardware implementation of one-dimensional scaled DCT in our architecture is based on the algorithm by Arai et al. [7]. Their computational flow graph requires 5 multiplications, 29 additions and 16 two’s complement operations (referred to as multiplications by −1 by Arai et al. [7]).

In our paper [38] we have described a new algorithm for one dimensional DCT computation which further reduces the number of operations. By using this approach one dimensional, scaled DCT introduces a 25% reduction in the number of two’s complement operations compared to Arai et al. In this paper we will describe the design and optimization process of the VLSI architecture that is based on that algorithm.

A few special purpose VLSI chips implementing the JPEG baseline compression standard have been built and successfully commercialized. The Intel's i750 video processor [23,26] consists of two chips, the 82 750PB pixel processor and the 82 750DB display processor. The pixel processor can be programmed to implement the JPEG compression standard. The C-CUBE CL550 is a single chip processor for JPEG image compression and decompression [38]. The core of the chip is a compression/decompression unit consisting of the FDCT/IDCT, the quantizer, the run-length encoder/decoder and the Huffman encoder/decoder. The chip can operate at up to 35 MHz. The chip can draw the data at rates up to 17.5 million pixels per second and produce compressed data at a rate of approximately 2 million bytes per second. Since the entropy encoder in the chip operates at a slower speed than the DCT module, a FIFO buffer is used between the two modules to avoid overflow during compression. Whenever the amount of data in the buffer reaches a certain level a delay signal is generated, which stalls the DCT computation and the data input to the system. LSI Logic announced a chipset for JPEG compression that consists of L64735 DCT processor, L64745 JPEG coder and L74765 color and raster-block converter. The chipset operates at the maximum rate of 35 MHz and processes still image data at up to 30 million bytes per second. LSI Logic offers a single chip JPEG coprocessor L64702 designed for graphics and video applications in personal computers, engineering workstations and laser printers [40]. The chip is capable of compressing and decompressing the data at rates up to 8.25 million bytes per second with an operating frequency of 33 MHz.

Design and optimization of high speed, pipeline architecture for DCT and category selection was one of the most demanding tasks in the development of the single chip for the JPEG compression. Our goal was to design the architecture that would be able to accept new data and compute new result every clock cycle. In this paper we will present the details of both DCT and category selection architectures that are based on the novel algorithms developed by the authors. Efficient logic that controls the operation of the circuitry is described as well.
4. DCT Architecture

Efficient hardware implementation of two dimensional DCT is feasible using the separability property of the transform. A sequence of two one-dimensional DCT's will produce the same result but the amount of logic needed to implement two 1D DCT's is significantly smaller. Another reduction in computation can be achieved if DCT coefficients are allowed to be scaled by some constant factor (which is the case with the DFT based methods, as explained in previous chapter). Since DCT is followed by the quantization procedure, as per JPEG standard, all scaling factors can be combined with the quantization factors, resulting in the notable reduction in computation. We have chosen this approach in the development of the DCT algorithm. For the sake of reference this algorithm is restated in Table 1.

Based on the above algorithm we have developed the efficient, fully pipelined, VLSI architecture with minimized control logic. It is seen from the algorithm, that if appropriate logic and control signals are designed all steps can be computed parallelly within the 8 clock cycles. Number of 8 clock cycles was chosen on the grounds of the efficiency calculation for the whole circuit.

Maximum throughput for the above algorithm, i.e. 8 pixels/clock cycle can be achieved if all computations in a step are performed parallelly (Fig. 5). In that case, DCT architecture would consist of up to 8 arithmetic units for each step. Although attractive, this approach is inappropriate for most of the applications for two major reasons: (1) JPEG decoders should be cost effective and an unreasonable amount of logic (arithmetic units) would significantly increase the chip cost, and (ii) entropy encoding portion of the JPEG architecture processes data in a serial fashion and, using the same technology, is not capable of processing 8 coefficients/clock cycle.

\[
\begin{align*}
\text{Step 1:} & \quad b_0 = a_0 + a_7; \quad b_1 = a_1 + a_6; \quad b_2 = a_3 - a_4; \quad b_3 = a_1 - a_6; \\
& \quad b_4 = a_2 + a_5; \quad b_5 = a_3 + a_4; \quad b_6 = a_2 - a_5; \quad b_7 = a_0 - a_7; \\
\text{Step 2:} & \quad c_0 = b_0 + b_5; \quad c_1 = b_1 - b_4; \quad c_2 = b_2 + b_6; \quad c_3 = b_1 + b_4; \\
& \quad c_4 = b_0 - b_5; \quad c_5 = b_3 + b_7; \quad c_6 = b_3 + b_6; \quad c_7 = b_7; \\
\text{Step 3:} & \quad d_0 = c_0 + c_3; \quad d_1 = c_0 - c_3; \quad d_2 = c_2; \quad d_3 = c_1 + c_4; \\
& \quad d_4 = c_2 - c_5; \quad d_5 = c_4; \quad d_6 = c_5; \quad d_7 = c_6; \\
\text{Step 4:} & \quad e_0 = d_0; \quad e_1 = d_1; \quad e_2 = m_3 \times d_2; \quad e_3 = m_1 \times d_1; \\
& \quad e_4 = m_4 \times d_6; \quad e_5 = d_5; \quad e_6 = m_1 \times d_3; \quad e_7 = m_2 \times d_4; \\
\text{Step 5:} & \quad f_0 = e_0; \quad f_1 = e_1; \quad f_2 = e_5 + e_6; \quad f_3 = e_5 - e_6; \\
& \quad f_4 = e_3 + e_8; \quad f_5 = e_8 - e_3; \quad f_6 = e_2 + e_7; \quad f_7 = e_4 + e_7; \\
\text{Step 6:} & \quad S_0 = f_0; \quad S_1 = f_4 + f_7; \quad S_2 = f_2; \quad S_3 = f_5 - f_6; \\
& \quad S_4 = f_1; \quad S_5 = f_3 + f_6; \quad S_6 = f_3; \quad S_7 = f_4 - f_7;
\end{align*}
\]

where: \( a_i \) input elements (0 i 7) \( m_i \) fixed multipliers:

\[
\begin{align*}
m_1 &= \cos(4/16); \\
m_2 &= \cos(6/16); \\
m_3 &= \cos(2/16) - \cos(6/16); \\
m_4 &= \cos(2/16) + \cos(6/16);
\end{align*}
\]

Table 1.
We have chosen another approach which is not computationally intensive and which maximizes performance/silicon ratio. Since entropy encoding logic can't accept more than one element per clock cycle, DCT was designed to have exactly the same throughput. The algorithm allows that all operations are performed within the common time frame of eight clock cycles by using only one arithmetic unit per step. This is an optimal performance since it provides maximal output rate that further logic can handle (max. performance); at the same time, any further reduction of the number of arithmetic units would create that output rate (min. silicon needed). The algorithm was developed for the VLSI implementation, with special care to remove any operand feedbacks. While designing the architecture, several rules have been followed to enable maximum throughput:

- All stages should follow the same speed of operation (1 pixel/clock cycle)
- No intermediate result of the operation should be forwarded from one step to the next one until all operations in that particular step are completed. This rule allows fine optimization of the architecture for each step, maintaining the consistency of the overall architecture.

- Based on the statistical information, arithmetic units within each stage should be optimized for speed and silicon.

Following the above rules we have designed a 1D-DCT module that maps our DCT algorithm in the hardware. The architecture consists of six partitions as shown in the Fig. 6, so that each step in the algorithm corresponds to a partition in the architecture. Each partition contains a register set (RS), an arithmetic unit and the associated control logic. To enable parallel execution of operations each register set was designed to be able to concurrently evaluate algorithm formulas and receive new operands to be used later in the process. For that reason each register set contains several register pairs where a register pair is added for every input variable.
in the algorithm step. As mentioned earlier, all the computations for a single step are performed within 8 clock cycles time frame. The circuit accepts one pixel per clock cycle and the entire processing is performed as a linear pipe. During each clock cycle output pixels are stored in one of the eight registers from the left column of the register set RS-a. After eight clock cycles, the left column is filled with eight data elements that are now ready for processing, and the entire column is copied onto the corresponding registers in the right column. During the next eight clock cycles the adder logic performs the computations as in step 1 of the algorithm while the left column keeps receiving new input data. A similar process occurs in each of the partitions simultaneously. By using this approach we have successfully created the architecture that utilizes arithmetic units to the maximum extent, providing the required output data rate.

Another goal in the design process was to optimize arithmetic units (adders, multiplier) to

a) reduce DCT architecture latency
b) reduce silicon area

Both of the above tasks are closely related. Reduction of the circuit latency can be achieved by the logic reduction, which will then result in a smaller silicon area. Reduction of logic was based on the theoretical analysis performed by the authors. During the analysis we were looking for max. values that can be expected as a result of every operation. This gave us the maximum number of bits needed for the result. In addition, we have intentionally introduced reductions in the output precision and calculated the error caused by that reduction. Since JPEG allows small variations in the results, we performed over 300 simulations with different reductions to derive the optimal size of the arithmetic units. After the sizes were defined we focused our efforts on the design of high speed CLA’s (Carry Lookahead Adders) with one, two, three and four bit inputs. These optimized adders were then used for designing larger adders and a multiplier. Every adder within the DCT architecture was designed as a cluster of several CLA’s which resulted in the high speed operation. All adders, including the largest adder (14 bit) that is part of the second 1D-DCT module, were extensively simulated and it was shown, using the worst-case timing analysis, that they could perform operations within a single clock cycle.

Multiplication unit has been designed as a reduced Wallace tree multiplier with seven segments. As known from the literature, the most complex operation in the Wallace tree multiplier takes place during the final addition of intermediate results, while other steps are simple and straightforward. Hence, our 7 segment Wallace tree multiplier requires 4 clock cycles to perform multiplication. First six steps in the process of the multiplication are performed us-
ing the CSA’s (Carry Save Adders) during 3 clock cycles, while the final step is performed in the 4th clock cycle. Since the optimization of the multiplier requires significant design time, it was decided to design and optimize one multiplier and reuse the design in both 1D-DCT modules. Fig. 7 shows pipelined organization of the multiplier stages. Due to the limited page width and size of the multiplier schematics it was not possible to show all details in this figure.

In one multiplier design we also addressed the fact that second operands can be chosen from the set of only 4 constants \((m_1, m_2, m_3 \text{ and } m_4)\) listed in the Table 2. Using the simulation we have slightly reduced the number of bits for these constants to reduce multiplier logic. The principle was to try to eliminate 1’s from certain bit position for all four constants. If this was possible then this bit position was excluded from the list of partial sums in the multiplication. Without significantly affecting the precision we have arrived to the new constants that are listed in Table 3. In this table we show final values for the constants, together with the error introduced and the ratio of 1’s after and before this rounding.

After the simulation of a circuitry was performed and arithmetic units optimized in size, we were able to define the final architecture of the DCT module. Table 4 lists final sizes for elements of the first 1D-DCT circuit. Latency of the arithmetic units introduced ‘phase shifting’ of the control signals between register sets. For certain algorithm steps it was possible to compensate this delay as the circuitry is not

\[
\begin{align*}
\phi_1 & = 0.70710678 = (0.10110101000001001111)_2 = (0.B504F)_{16} \\
\phi_2 & = 0.38268343 = (0.0110000111110111000)_2 = (0.61F78)_{16} \\
\phi_3 & = 0.5411961 = (0.10001010100010111101101)_2 = (0.8A8ED)_{16} \\
\phi_4 & = 1.30656296 = (1.01001110011011011101)_2 = (1.4E7AE)_{16}
\end{align*}
\]

Table 2. Constants used for DCT

\[
\begin{align*}
m_1 &= (0.B50)_{16} & \Delta &= (0.0001)_{10}, & 5/10 \\
m_2 &= (0.620)_{16} & \Delta &= (0.0003)_{10}, & 3/11 \\
m_3 &= (0.8A9)_{16} & \Delta &= (0.0001)_{10}, & 5/10 \\
m_4 &= (1.4E8)_{16} & \Delta &= (0.00005)_{10}, & 6/13
\end{align*}
\]

Table 3. New DCT
used in all 8 clock cycles and it was possible to execute concurrently two operations from a single step. Fig. 8 shows all details of the control needed for the high speed operation of the proposed 1D-DCT VLSI architecture. Bold X-boxes represent the moment when values are copied from input to the output registers within the register set. As it can be seen from the figure control logic for the whole DCT module can be done using the single three-bit counter and 3-to-8 decoder. The counter is directly connected to the system clock while decoder outputs are connected to the appropriate control lines of every register set and arithmetic unit. Such efficient control approach significantly improved the DCT performance and allowed faster clock speeds. DCT circuit has a latency of 50 clock cycles calculated as 9 clock cycles for stages RS-a and RS-b and 8 clock cycles for all other stages. The same architecture is replicated for column-wise DCT computation.

After the design and simulation under worst-case timing conditions, we verified whether the architecture was capable of processing one pixel in every clock cycle at the frequency of 100 MHZ resulting in the compression speed of 30 1k by 1k color images per second. By means of the Cadence Opus design tools the architecture was mapped into the VLSI chip using the 2 micron CMOS technology. The chip was designed using the 2-phase non-overlapping clocking scheme and fitted on the 6.8 by 6.9 mm² MOSIS frame. Prototype samples of the
chip were fabricated under MOSIS[38] (Fig. 9).

5. Conclusions

DCT architecture represents the key module in the JPEG compression/decompression chip. In this paper we have presented design, optimization and implementation issues of an efficient, highly parallelized architecture that performs DCT computations. The architecture is organized as a multistage linear pipeline with minimal control logic using only one 3 bit counter and a 3-to-8 decoder. The circuitry within the architecture has been optimized for both speed and size, resulting in only few major computing elements. Intensive simulation using Verilog and other Cadence design tools has shown that the architecture can operate with clock frequency of more than 100 Mhz. As a result, DCT module can transform 30 color images of 1024 × 1024 pixels per second. A prototype chip implementing the proposed architecture has been designed and fabricated under MOSIS.

References


[31] ISO/IEC FTC 1/SC 29/WG 10, personal communications


[50] U.S. picture coding committee X3L3, personal communications


MARIO ŽAGAR received the B.S. and M.S. degrees in Electrical Engineering and Doctorate degree in Computer Science all from the University of Zagreb, Faculty of Electrical Engineering and Computing (FER) in 1975, 1978, and 1985 respectively. In 1977 he joined FER where he is currently Associate Professor. He received British Council fellowship (UMIST — Manchester, 1983) and Fulbright fellowship (UCSB Santa Barbara, 1983/84). His research interests include Computer Architectures, Design Automation, Real-time Microcomputer Systems, Operating Systems and Open Computing. M. Žagar is author and co-author of more than eighty articles and several books: UNIX and how to use it, Introduction to microcomputers, ATLAS — microcomputer architecture simulation. Prof. Žagar is a member of the IEEE computer society committee of the Croatian Open Systems Users Group — HiOpen and EuroOpen Forum executive board member.

N. RANGANATHAN received the B.E. (Honors) degree in Electrical and Electronics Engineering from Regional Engineering College, Trichy, Tamilnadu, University of Madras, India in 1983, and the Ph.D. degree in computer science from the University of Central Florida, Orlando in 1988.

He is currently an Associate Professor in the Department of Computer Science and Engineering and the Center for Microelectronics Research at the University of South Florida, Tampa. His research interests include VLSI design and hardware algorithms, computer architecture and parallel processing. He is currently involved in the design of and implementation of VLSI architectures for computer vision, image processing, pattern recognition, databases, data compression and signal processing applications. He has published widely in reputed journals and conferences and owns four U.S. patents with one pending.

Dr. N. Ranganathan is a Senior Member of IEEE, member of IEEE Computer Society, IEEE CS Technical Committee on VLSI, ACM and VLSI Society of India. He served as the Program Co-Chair for VLSI Design’94 and as the General Co-Chair for VLSI Design’95. He has served on the program committees of international conferences such as ICCD, ICPP, IPPS, SPDP, VLSI Design, CAMP and ICHPC. Currently, he serves as an Associate Editor of IEEE Transactions on VLSI Systems. He is on the editorial boards of Pattern Recognition and VLSI Design journals. He guest edited a special issue of International Journal of Pattern Recognition and Artificial Intelligence on VLSI for PR and AI to be published in April 1995. He is the editor of a two-volume series titled VLSI Algorithms and Architectures published by IEEECS Press in June 1993.

Dr Ranganathan was the recipient of IEEE Region X Outstanding Student Award in 1982, the Rotary Foundation Fellowship in 1984, IEEE Computer Society R. E. Merwin Scholarship Award in 1987, University of South Florida Outstanding Researcher Award in 1996 and the Prof. A. K. Chowdhury Best Paper Award in International Conference on VLSI Design, 1995.