Post-Placement Functional Decomposition for FPGAs

Valavan Manohararajah, Deshanand P. Singh, Stephen D. Brown and Zvonko G. Vranesic
Altera Toronto Technology Center
151 Bloor Street West, Suite 200
Toronto, Ontario, CANADA
{vmanohar, dsingh, sbrown, zvranesi}@altera.com

ABSTRACT

This work explores the effect of adding a simple functional decomposition step to the traditional field programmable gate array (FPGA) CAD flow. Once placement has completed, alternative decompositions of the logic on the critical path are examined for potential delay improvements. The placed circuit is then modified to use the best decompositions found. Any placement illegalities introduced by the new decompositions are resolved by an incremental placement step. Experiments conducted on Altera's Stratix chips indicate that this functional decomposition technique can provide a performance improvement of 7.6% on average, and up to 26.3% on a set of industrial designs.

1. INTRODUCTION

Recent research effort in physical synthesis has strived to eliminate the artificial separation that exists between the various steps in CAD. Most of the existing work is applicable to ASIC CAD flows [2, 4, 5, 6, 7, 8]. However, a few of the more recent efforts explore FPGA CAD flows [3, 9, 10]. Some have concentrated on making the synthesis step more aware of what happens during placement and routing [2, 3], while others have explored the use of synthesis type algorithms during placement and routing [4, 5, 6, 7, 8, 9, 10]. Our work falls into the latter category. It considers the effect of a simple functional decomposition algorithm that is used after placement.

Most of the delays in an FPGA circuit are due to the programmable routing network [1]. These delays will not be known for certain until the routing step completes. It would be advantageous to perform local optimizations once routing has completed and accurate routing delays are available. However, making changes to the circuit during the routing step is extremely complicated. Here we choose to perform local optimizations at the placement step, which is sufficiently close to the routing step that reasonably accurate delays are known. Furthermore, small changes to the circuit can still be made without much difficulty.

2. THE STRATIX ARCHITECTURE

Altera's Stratix chips were used as the target device for the experiments. At the highest level, the chip is made up of LABs, memory elements and DSP blocks that are connected to a programmable routing network. Each LAB contains a set of LEs which are the basic logic elements in the Stratix architecture. An LE consists of a four-input lookup table (4-LUT) and a programmable register. The construction of arithmetic circuits is facilitated by the presence of carry chain circuitry that links adjacent LEs together. The LEs also contain cascade chain circuitry that can be used to link registers in adjacent LEs. We do not consider memory elements and DSP blocks any further, as our work applies only to LABs. A detailed description of the internal structure of the Stratix chips can be found in [11].

3. A FRAMEWORK FOR POST-PLACEMENT OPTIMIZATION

Figure 1 illustrates the CAD flow used in our work. In the first step, design entry, the design is described in terms of a hardware description language such as VHDL or Verilog. Logic synthesis optimizes the circuit obtained from design entry. During logic synthesis the netlist is represented in terms of a generic gate library. The technology mapping stage converts the netlist so that it uses logic elements available in the target FPGA architecture. In order to reduce the size of the problem the placer has to deal with, a clustering step is used to decompose the technology mapped circuit into a set of clusters. In the Stratix architecture, the clustering step creates a set of LABs. Following clustering, placement determines a position for each cluster in the circuit.
Once placement is completed, various local optimization techniques are used to improve the circuit's critical path. Functional decomposition is one of the many local optimization techniques that can be used during this step. The local optimization techniques may make changes that result in an invalid placement. For example, a functional decomposition algorithm may create new wires that violate the constraint on the number of wires entering a LAB. A logic duplication algorithm may create new LEs which would then require placement. Incremental placement is used to integrate the changes made by the local optimization techniques into the existing placement. It uses a greedy algorithm whose goal is to integrate the changes while disturbing the existing placement as little as possible. A detailed description of the incremental placement step can be found in [12]. The final step in the CAD flow, routing, determines the wires that will be used to connect the elements that make up the circuit.

[Figure 1: Post-placement optimizations in the CAD flow. Stages shown: Design Entry, Synthesis, Technology Mapping, Packing, Placement, Local Optimizations, Incremental Placement, Routing.]

4. FUNCTIONAL DECOMPOSITION

4.1 Overview

An overview of the functional decomposition algorithm is presented in Figure 2. Timing analysis is the first major task performed by the algorithm. This analysis is only approximate as actual wire delays are not available until the routing step has completed. The analysis computes wire delay assuming that the best possible routes will be used for each connection in the circuit. Timing analysis establishes both the length of the critical path and the slack [16] for each sink. Slack is defined to be the amount of delay that can be added to a sink before it becomes critical. A slack of 0 indicates that the sink is on the critical path.

The algorithm performs a number of decomposition iterations until a user specified maximum is reached. Each iteration attempts to find alternative decompositions of the logic around the near-critical sinks returned by CriticalSinks. A near-critical sink is one that has a slack below some threshold. In the experiments, the threshold was set to a multiple of the average delay through an LE.

For each near-critical sink, s, BestDecomp is called to determine the best possible decomposition. The function examines both the LE that drives the sink and the LE that is driven by the sink. Alternative decompositions exist if both LEs are being used as 4-LUTs. Figure 3 illustrates the situation.
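The slack and near-critical-sink definitions above can be made concrete with a small sketch. It is not the timing analyzer used in the paper: the function names, the graph encoding, and the assumption that every primary output shares one required time equal to the critical path length are all illustrative choices made here.

```python
from graphlib import TopologicalSorter
from typing import Dict, List, Tuple

Wire = Tuple[str, str]  # (driving LE, driven LE); hypothetical encoding


def connection_slacks(node_delay: Dict[str, float],
                      wires: Dict[Wire, float]) -> Tuple[float, Dict[Wire, float]]:
    """Return (critical path length, slack of every connection)."""
    preds: Dict[str, List[Tuple[str, float]]] = {n: [] for n in node_delay}
    succs: Dict[str, List[Tuple[str, float]]] = {n: [] for n in node_delay}
    for (u, v), w in wires.items():
        preds[v].append((u, w))
        succs[u].append((v, w))

    # Visit each node only after all of its fan-in nodes.
    order = list(TopologicalSorter(
        {n: [u for u, _ in preds[n]] for n in node_delay}).static_order())

    # Arrival time at a node output: latest input arrival plus the node delay.
    arrival: Dict[str, float] = {}
    for n in order:
        arrival[n] = node_delay[n] + max(
            (arrival[u] + w for u, w in preds[n]), default=0.0)
    critical = max(arrival.values())

    # Required time at a node output: assumed equal to the critical path
    # length at every output, then propagated backwards.
    required: Dict[str, float] = {}
    for n in reversed(order):
        required[n] = min(
            (required[v] - node_delay[v] - w for v, w in succs[n]),
            default=critical)

    # Slack of a connection: delay that could be added before it is critical.
    slack = {(u, v): required[v] - node_delay[v] - w - arrival[u]
             for (u, v), w in wires.items()}
    return critical, slack


def near_critical(slack: Dict[Wire, float], threshold: float) -> List[Wire]:
    """Connections whose slack is below the threshold."""
    return [c for c, s in slack.items() if s < threshold]
```

With four LEs of unit delay and wires a→c (delay 2), b→c (delay 1) and c→d (delay 1), the critical path has length 6, the connections a→c and c→d have zero slack, and b→c has a slack of 1.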
BestDecomp considers every possible decomposition of the 7-input function $f(i_1, i_2, i_3, i_4, i_5, i_6, i_7)$ formed by the two 4-LUTs, A and B, connected by the near-critical sink s. The best decomposition, as determined by a cost function (see Section 4.4), is then returned.

 1: initt, bestt ← TimingAnalysis()
 2: initc, bestc ← circuit
 3: for i ← 1 upto MaxIterations
 4:   D ← ∅
 5:   for s ∈ CriticalSinks()
 6:     D ← D ∪ BestDecomp(s)
 7:   end for
 8:   for d ∈ Sort(D)
 9:     DoDecomp(d)
10:   end for
11:   t ← TimingAnalysis()
12:   if t < bestt then
13:     bestt ← t, bestc ← circuit
14:   end if
15: end for
16: circuit ← bestc
17: IncrementalPlacement()
18: if TimingAnalysis() > initt then
19:   circuit ← initc
20: end if

Figure 2: An overview of the functional decomposition algorithm.

[Figure 3: Two cascaded LEs. LE A has inputs $i_1$ to $i_4$ and drives LE B through the near-critical sink s; LE B has the further inputs $i_5$, $i_6$ and $i_7$ and produces f.]
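The control flow of Figure 2 can be sketched in Python as follows. This is a minimal illustration, not the Quartus II implementation: the `Decomp` record, the callback parameters and the convention that circuits are immutable values (so that keeping `best_c` is a plain assignment) are assumptions introduced here. The overlap check mirrors the LUT marking that the text attributes to DoDecomp.

```python
from dataclasses import dataclass
from typing import Any, Callable, FrozenSet, Iterable, List, Optional


@dataclass(frozen=True)
class Decomp:
    """A candidate decomposition (hypothetical record, not from the paper)."""
    cost: float            # value of the cost function; higher is better
    luts: FrozenSet[str]   # names of the LUTs the decomposition rewrites
    payload: Any = None    # whatever apply_decomp needs to carry it out


def post_placement_decomposition(
    circuit: Any,
    timing: Callable[[Any], float],
    critical_sinks: Callable[[Any], Iterable[Any]],
    best_decomp: Callable[[Any, Any], Optional[Decomp]],
    apply_decomp: Callable[[Any, Decomp], Any],
    incremental_placement: Callable[[Any], Any],
    max_iterations: int,
) -> Any:
    init_t = best_t = timing(circuit)            # lines 1-2 of Figure 2
    init_c = best_c = circuit
    for _ in range(max_iterations):              # line 3
        found: List[Decomp] = []                 # line 4
        for sink in critical_sinks(circuit):     # lines 5-7
            d = best_decomp(circuit, sink)
            if d is not None:
                found.append(d)
        used: set = set()
        # Lines 8-10: apply from best to worst, skipping any decomposition
        # that touches a LUT already rewritten in this iteration.
        for d in sorted(found, key=lambda x: x.cost, reverse=True):
            if used & d.luts:
                continue
            circuit = apply_decomp(circuit, d)
            used |= d.luts
        t = timing(circuit)                      # line 11
        if t < best_t:                           # lines 12-14
            best_t, best_c = t, circuit
    circuit = incremental_placement(best_c)      # lines 16-17
    if timing(circuit) > init_t:                 # lines 18-20
        circuit = init_c
    return circuit
```

Passing the analysis and transformation steps as callbacks keeps the sketch independent of any particular netlist representation; a toy model in which the circuit is just its critical path delay is enough to exercise every branch.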

Tests for decomposability and the generation of new functions for A and B are handled by standard textbook methods [13, 14]. Figure 4 illustrates a 6-variable decomposition chart. The variables feeding LE A (the bound set) are arranged along the columns and the variables feeding LE B (the free set) are arranged along the rows. A decomposition exists as long as there are no more than two distinct column patterns in the chart (a column multiplicity of two).

                           i1 i2 i3
i5 i6 i7   000  001  010  011  100  101  110  111
  000       1    1    1    1    1    1    1    1
  001       0    1    1    0    1    1    1    1
  010       0    1    1    0    1    1    1    1
  011       1    1    1    1    1    1    1    1
  100       0    0    0    0    0    0    0    0
  101       1    0    0    1    0    0    0    0
  110       1    0    0    1    0    0    0    0
  111       0    0    0    0    0    0    0    0

Figure 4: A 6-variable decomposition chart.

Functions for LE A and LE B can be derived from the chart as well. A function that selects between the two distinct columns is implemented in A and a function that outputs the required column is implemented in B. Encoding the first column type as 0 and the second column type as 1, a function for LE A is obtained. An example of this function for the chart of Figure 4 is denoted $f_A$ and is illustrated in Figure 5. The function for LE B, presented in Figure 6, uses $f_A$ to select between the two distinct columns in Figure 4.

i1 i2 i3   000  001  010  011  100  101  110  111
  f_A       0    1    1    0    1    1    1    1

Figure 5: A function for LE A.

              f_A
i5 i6 i7    0    1
  000       1    1
  001       0    1
  010       0    1
  011       1    1
  100       0    0
  101       1    0
  110       1    0
  111       0    0

Figure 6: A function for LE B.

Once all near-critical sinks have been examined, the set of decompositions, D, is ordered from best to worst. This ordering is necessary because the decompositions present in D are not mutually exclusive: some decompositions interfere with others. For example, consider a set D containing two decompositions, one involving LE A and LE B and one involving LE B and LE C. The B-C decomposition can potentially interfere with the A-B decomposition if the A-B decomposition is no longer valid when the B-C decomposition is applied.
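Returning to the decomposition chart of Figure 4, the column-multiplicity test and the derivation of the functions in Figures 5 and 6 can be sketched as below. The chart values are copied from Figure 4; the function name `decompose` and the list-of-rows encoding are choices made for this illustration, and the sketch handles only fully specified charts (no don't cares).

```python
from typing import List, Optional, Tuple

# Figure 4: rows are indexed by (i5 i6 i7), columns by (i1 i2 i3),
# both running from 000 to 111.
CHART: List[List[int]] = [
    [1, 1, 1, 1, 1, 1, 1, 1],
    [0, 1, 1, 0, 1, 1, 1, 1],
    [0, 1, 1, 0, 1, 1, 1, 1],
    [1, 1, 1, 1, 1, 1, 1, 1],
    [0, 0, 0, 0, 0, 0, 0, 0],
    [1, 0, 0, 1, 0, 0, 0, 0],
    [1, 0, 0, 1, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0, 0, 0],
]


def decompose(chart: List[List[int]]) -> Optional[Tuple[List[int], List[List[int]]]]:
    """Return (f_a, f_b) if the column multiplicity is at most two, else None.

    f_a[c] is the code (0 or 1) of column c, i.e. the truth table of LE A.
    f_b[r][code] is the output of LE B for row r and that code.
    """
    n_cols = len(chart[0])
    columns = [tuple(row[c] for row in chart) for c in range(n_cols)]

    # Collect distinct column patterns in order of first appearance, so the
    # first column type is encoded as 0 and the second as 1.
    patterns: List[Tuple[int, ...]] = []
    for col in columns:
        if col not in patterns:
            patterns.append(col)
    if len(patterns) > 2:
        return None  # column multiplicity exceeds two: no decomposition

    f_a = [patterns.index(col) for col in columns]
    f_b = [[pattern[r] for pattern in patterns] for r in range(len(chart))]
    return f_a, f_b
```

Applied to `CHART`, the sketch finds exactly two column patterns and reproduces the truth tables of Figures 5 and 6; recomposing `f_b[r][f_a[c]]` gives back every entry of the chart.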
The ordering ensures that when interference does occur, a higher ranked decomposition takes precedence over a lower ranked one. The decompositions in set D are applied in sorted order. As the decompositions are applied in DoDecomp, the LUTs involved in the decomposition are marked so that other decompositions which involve the same LUTs will be skipped. Again, this is done to prevent problems where one decomposition interferes with another.

Once the best decompositions have been applied to the circuit, a call to timing analysis determines the effect of the changes on the critical path. This timing analysis will involve additional approximations as the changes made to the circuit have not yet been integrated into a legal placement. The analysis assumes that the modified LEs remain in their original positions. This assumption does not introduce a significant error because the changes made by the decomposition algorithm usually involve wire swaps between LEs, which are almost always legal.

The best circuit seen during the decomposition iterations, bestc, will be made the default circuit before calling IncrementalPlacement to integrate the changes. A final call to timing analysis (line 18) determines if the final placed circuit is an improvement over the initial circuit. The modified circuit is retained only if there is an improvement.

4.2 Special Cases

There are a variety of situations that warrant special case handling during decomposition. Some of these situations arise due to the structure of the Stratix LE. Consider the situation depicted in Figure 3. When LE A is used in arithmetic mode, alternative decompositions do not exist because the LUT inputs are used to compute a hidden carry function that is not connected to the routing network. Any changes to the inputs of A will modify the hidden carry function. A similar problem exists when the Stratix LE's programmable register is used.
If the output of A is registered, then usable decompositions do not exist because any change to A's functionality will result in a wrong value being registered.

Another situation that requires special handling occurs when the output of A in Figure 3 fans out to more than one sink. Once again the functionality of A cannot be changed without changing the functionality seen at all sinks. However, if we duplicate A and attach every sink except s to the duplicate, the decomposition involving A can still be performed. Although this duplication allows the decomposition to occur, it may create area problems in the LABs containing A and B. If the LABs containing A and B are nearly full, the cost function (see Section 4.4) computes the slack improvements for the decomposition assuming that the version of LE A feeding LE B will be moved to a neighboring LAB.

4.3 Nondisjoint Decomposition

The decomposition function, BestDecomp, assumes that the inputs to A and B in Figure 3 are unique. However, there are cases where the best implementation is achieved using nondisjoint decomposition. Figure 7 gives a cost-efficient circuit for the 4-to-1 multiplexer. Inputs $s_0$ and $s_1$ select one of the four inputs $i_0$, $i_1$, $i_2$ and $i_3$ to appear at the output.

[Figure 7: A 4-to-1 multiplexer using two LEs. LE A has inputs $i_0$, $i_1$, $s_0$ and $s_1$ and drives LE B through s; LE B also receives $s_1$, $i_2$ and $i_3$ and produces f.]

LUT A implements the function

$f_A = \bar{s}_1(\bar{s}_0 i_0 + s_0 i_1) + s_1 s_0$

and LUT B implements the function

$f_B = \bar{s}_1 f_A + s_1(\bar{f}_A i_2 + f_A i_3)$.

An alternative decomposition does not exist if the inputs to A and B are assumed to be unique. However, the knowledge that input $s_1$ is shared allows a nondisjoint decomposition such as

$f_A = s_1(\bar{s}_0 i_2 + s_0 i_3) + \bar{s}_1 s_0$

$f_B = s_1 f_A + \bar{s}_1(\bar{f}_A i_0 + f_A i_1)$.

The decomposition function, BestDecomp, discovers these nondisjoint decompositions by using a preprocessing step which detects input sharing and marks off the 7-input cubes that are don't cares when input sharing is considered. Returning to our 4-to-1 multiplexer example, the decomposition function will consider alternative decompositions of the 7-input function $f(i_0, i_1, s_0, s_1, s_1', i_2, i_3)$. Signals $s_1$ and $s_1'$ are different versions of the same signal: one connects to LE A and the other connects to LE B. The preprocessing step will mark those cubes where $s_1$ and $s_1'$ take on differing values. The presence of don't cares in the 7-input function allows the discovery of additional decompositions. Often it is possible to assign values to the don't cares in such a way that the column multiplicity requirement in the decomposition chart is met.

4.4 Cost Function

Alternative decompositions found by BestDecomp will move some of the inputs from LE A to LE B and vice versa. Each of these moves will change the slack of the input being moved. An accurate LE delay model and an approximate routing network delay model (which assumes best case routing) are used to estimate the change in slack for each input that was moved. The cost function is based on the changes to the slack of the inputs being moved and is modeled after the VPR cost function [15].
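As a check on the multiplexer example of Section 4.3, the two LUT pairs can be evaluated exhaustively against a reference 4-to-1 multiplexer. The sketch assumes that $s_1 s_0$, read as a binary number, gives the index of the selected input, and that the complemented literals lost in transcription are as shown in the comments; the function names are illustrative.

```python
from itertools import product


def mux4(i0: int, i1: int, i2: int, i3: int, s0: int, s1: int) -> int:
    """Reference multiplexer: s1 s0 (binary) is the index of the chosen input."""
    return (i0, i1, i2, i3)[2 * s1 + s0]


def original_pair(i0: int, i1: int, i2: int, i3: int, s0: int, s1: int) -> int:
    # LUT A: f_A = ~s1 (~s0 i0 + s0 i1) + s1 s0
    f_a = ((1 - s1) & (((1 - s0) & i0) | (s0 & i1))) | (s1 & s0)
    # LUT B: f_B = ~s1 f_A + s1 (~f_A i2 + f_A i3)
    return ((1 - s1) & f_a) | (s1 & (((1 - f_a) & i2) | (f_a & i3)))


def alternative_pair(i0: int, i1: int, i2: int, i3: int, s0: int, s1: int) -> int:
    # LUT A: f_A = s1 (~s0 i2 + s0 i3) + ~s1 s0
    f_a = (s1 & (((1 - s0) & i2) | (s0 & i3))) | ((1 - s1) & s0)
    # LUT B: f_B = s1 f_A + ~s1 (~f_A i0 + f_A i1)
    return (s1 & f_a) | ((1 - s1) & (((1 - f_a) & i0) | (f_a & i1)))


def equivalent_to_mux(pair) -> bool:
    """True if the LUT pair matches mux4 on all 64 input combinations."""
    return all(pair(*bits) == mux4(*bits) for bits in product((0, 1), repeat=6))
```

In both pairs LUT B never sees $s_0$ directly: when $s_1$ selects the half of the multiplexer that LUT B resolves itself, LUT A simply forwards $s_0$ through $f_A$, which is why the sharing of $s_1$ between the two LUTs is essential.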
The slack of an input i before and after decomposition is denoted $t_i$ and $t_i'$, respectively. The criticality of an input depends on the maximum slack in the circuit, $t_{max}$, and the initial slack as follows:

$r_i = 1 - \frac{t_i}{t_{max}}$

A criticality value of 1 is associated with inputs on the critical path. Using the criticality and the change in slack, the gain of an input i as a result of decomposition is given by

$g_i = r_i^e (t_i' - t_i)$

where e is a parameter that determines the relative importance of criticality to the cost calculations. In addition to considering the effect of the decomposition on individual inputs, the overall effect of the decomposition on the two LEs A and B is also considered. The minimum slack at the inputs of A and B before and after decomposition is denoted $t_{A,B}$ and $t_{A,B}'$, respectively. The criticality and gain are defined similarly:

$r_{A,B} = 1 - \frac{t_{A,B}}{t_{max}}$

$g_{A,B} = r_{A,B}^e (t_{A,B}' - t_{A,B})$

The cost of a decomposition is defined as a weighted sum of the overall effect on the LEs and the individual effect on each input:

$c = \lambda g_{A,B} + (1 - \lambda) \sum_i g_i$

where $\lambda$ is a parameter that determines the relative importance of the two components. Decompositions with higher values of c are preferred and any decomposition with a negative value of c is rejected. The cost function is not particularly sensitive to the values selected for e and $\lambda$. Values for e between 2 and 16 and values for $\lambda$ between 0.25 and 0.95 were found to work well.

5. EXPERIMENTAL RESULTS

In our experiments, the first three steps of the FPGA CAD flow, design entry, synthesis and technology mapping, were performed by a leading third-party synthesis tool capable of targeting Altera devices and the last four steps, including functional decomposition and incremental placement, were performed by a modified version of Quartus II v2.2 [17] (see footnote 1). Table 1 presents the performance of the functional decomposition technique on 22 industrial circuits containing between 5000 and 10000 LEs.
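The cost function of Section 4.4 can be written out directly. The sketch below assumes that the gain is the criticality raised to the power e times the increase in slack, and that the two lists hold the slack of every input of LEs A and B before and after the decomposition; the function names and the default values of `e` and `lam` (chosen inside the ranges quoted in Section 4.4) are illustrative.

```python
from typing import Sequence


def criticality(slack: float, t_max: float) -> float:
    """r = 1 - t / t_max; equals 1 for an input on the critical path."""
    return 1.0 - slack / t_max


def gain(before: float, after: float, t_max: float, e: float) -> float:
    """g = r^e (t' - t); positive when the slack increases."""
    return criticality(before, t_max) ** e * (after - before)


def decomposition_cost(before: Sequence[float], after: Sequence[float],
                       t_max: float, e: float = 8.0, lam: float = 0.5) -> float:
    """c = lam * g_AB + (1 - lam) * sum of per-input gains."""
    per_input = sum(gain(b, a, t_max, e) for b, a in zip(before, after))
    # g_AB uses the minimum slack over the inputs of A and B.
    overall = gain(min(before), min(after), t_max, e)
    return lam * overall + (1.0 - lam) * per_input
```

For example, with $t_{max} = 10$, $e = 2$ and $\lambda = 0.5$, moving two inputs from slacks (0, 4) to (2, 3) gives per-input gains of 2 and -0.36, an overall gain of 2, and a cost of 1.82, so the decomposition would be accepted. Moving them to (0, 2) instead gives a cost of -0.36, and the decomposition would be rejected.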
These results were obtained with the maximum number of decomposition iterations set to 8. This value was selected as each decomposition iteration increases compile time and there were very few circuits that benefited from a larger number of iterations. The functional decomposition and incremental placement steps increase the compile time of Quartus II by approximately 30%. The third column (Speedup) indicates the improvement in the final (post-routing) critical path as a result of applying the functional decomposition technique. An average performance improvement of 7.6% was obtained. Although the technique rejects bad decompositions, there are circuits whose performance is degraded as a result of applying the technique (cct12, cct19 and cct20). These degradations result from approximation errors in the timing analysis performed by the technique. Decompositions that involve duplication will increase the size of the circuit. On average, these duplications increase a circuit's size by 0.3%.

Circuit   Size (LEs)   Speedup (%)
cct1         5162          4.1
cct2         5308         14.0
cct3         5342          5.0
cct4         5677          5.3
cct5         5775          8.5
cct6         5872          6.5
cct7         5947          4.2
cct8         6036          7.9
cct9         6201          4.9
cct10        6404         26.3
cct11        6482         13.3
cct12        7267         -1.4
cct13        7269          9.1
cct14        7277          0.0
cct15        7494          9.5
cct16        7685          4.2
cct17        8107         17.7
cct18        8278          8.0
cct19        8346         -2.6
cct20        8768         -2.0
cct21        8886         12.4
cct22        9768         11.6
Average                    7.6

Table 1: Performance of the post-placement functional decomposition.

Footnote 1: The functional decomposition technique described in this paper is part of Quartus II's physical synthesis feature. In addition to the technique described here, the physical synthesis feature includes several other techniques that target both combinational and sequential logic. Refer to [18] for a description of how the physical synthesis feature can be used to speed up a design.

6. CONCLUSION

We described a method for post-placement optimization based on functional decomposition. The technique is iterative in nature and attempts to find alternative decompositions of the logic around near-critical sinks. This technique, in concert with incremental placement, produces an average speedup of 7.6% while incurring an area penalty of 0.3%.

7. REFERENCES

[1] M. Sheng and J. Rose. Mixing Buffers and Pass Transistors in FPGA Routing Architectures. In Proceedings of the ACM Int. Symposium on FPGAs, Monterey, CA, Feb. 2001, pp. 75-84.
[2] M. Pedram and N. Bhat. Layout Driven Logic Restructuring/Decomposition. In Proceedings of the Int. Conf. on Computer-Aided Design, San Jose, CA, Nov. 1991, pp. 134-137.
[3] J. Y. Lin, A. Jagannathan and J. Cong. Placement-Driven Technology Mapping for LUT-Based FPGAs. In Proceedings of the ACM Int. Symposium on FPGAs, Monterey, CA, Feb. 2003, pp. 121-126.
[4] Y. Jiang, A. Krstic, K. Cheng and M. Marek-Sadowska. Post-Layout Logic Restructuring for Performance Optimization. In Proceedings of the Design Automation Conference, Anaheim, CA, June 1997, pp. 662-665.
[5] Y. Lian and Y. Lin. Layout-based Logic Decomposition for Timing Optimization. In Proceedings of the Asia Pacific Design Automation Conference, Hong Kong, Hong Kong, Jan. 1999.
[6] G. Stenz, B. Riess, B. Rohfleisch and F. Johannes. Timing Driven Placement in Interaction with Netlist Transformations. In International Symposium on Physical Design, Napa Valley, CA, 1997, pp. 36-41.
[7] T. Tien, H. Su and Y. Tsay. Integrating Logic Retiming and Register Placement. In Proceedings of the Int. Conf. on Computer-Aided Design, San Jose, CA, 1998, pp. 136-139.
[8] L. Kannan, P. Suaris and H. Fang. A Methodology and Algorithms for Post-Placement Delay Optimization. In Proceedings of the Design Automation Conference, San Diego, CA, June 1994, pp. 327-332.
[9] D. Singh and S. Brown. Integrated Retiming and Placement for Field Programmable Gate Arrays. In Proceedings of the ACM Int. Symposium on FPGAs, Monterey, CA, Feb. 2002, pp. 67-76.
[10] K. Schabas and S. D. Brown. Using Logic Duplication to Improve Performance in FPGAs. In Proceedings of the ACM Int. Symposium on FPGAs, Monterey, CA, Feb. 2003, pp. 136-142.
[11] Altera. Stratix Device Handbook. Vol. 1, v2.0, July 2003.
[12] D. P. Singh and S. D. Brown. Incremental Placement for Layout-Driven Optimizations on FPGAs. In Proceedings of the Int. Conf. on Computer-Aided Design, San Jose, CA, 2002, pp. 752-759.
[13] R. Ashenhurst. The Decomposition of Switching Functions. In Int. Symposium on Theory of Switching Functions, 1959, pp. 74-116.
[14] H. Curtis. A Generalized Tree Circuit. Journal of the ACM, 1961, 8:484-496.
[15] A. Marquardt, V. Betz and J. Rose. Timing-Driven Placement for FPGAs. In Proceedings of the ACM Int. Symposium on FPGAs, Monterey, CA, Feb. 2000, pp. 203-213.
[16] R. Hitchcock, G. Smith and D. Cheng. Timing Analysis of Computer-Hardware. In IBM Journal of Research and Development, Jan. 1983, pp. 100-105.
[17] Altera. Quartus II Version 2.2 Release Notes. v1.0, Dec. 2002.
[18] Altera. Quartus II Development Software Handbook v4.0 (Three-Volume Set). v1.0, Feb. 2004.