Friday, October 17, 2008

Timing Closure

Doug Amos from Synopsys wrote an article in FPGA Journal about the timing closure problem, comparing it to whack-a-mole.

I just got through a painful timing closure process. I feel like I came out of a month-long coma. I'll share some anecdotes about the experience to help people in the same situation.

The "Post-Map" results are never good enough for timing. You will want to run Place-and-Route so the tools can take a really long time (sometimes multiple hours) to get through compilation. You will want to automate the tools using scripts and run the EDA toolchain on many computers. See the catalystac.com/trac how to use the "subprocess" module in Python with a dictionary specifying the parameter list for all of the Xilinx toolchain (I found similar code on the net and hacked it to pull the project constraints and synthesis options from Excel). Combine this with SimpleXMLRPCServer and you can manage a botnet running multiple tool instances. It is possible to comment out multiple sections of your code using a simple preprocessor to automate many compilation processes to narrow in on which segments of your code need to be fixed for timing. Smaller code compiles faster too so you can run many permutations over a weekend. (I'll add these scripts to the trac after I have a chance to make them non-specific to this particular project).

Modular partitioning and floorplanning make the compilation and timing closure process a lot easier, but if you want to optimize across hierarchy you can't use them. This is one aspect of behavioral synthesis that needs some serious consideration: how can we avoid running the entire toolchain for minor code modifications? Also, since synthesis and optimization can generate weird names for internal signals, it is not obvious which paths are causing the timing errors when they are reported. Usually you can figure out the bad paths by reading the report, but I really wish there were some better way to tie a timing bug back to the code so you know what to modify to fix it. There doesn't seem to be a more elegant solution than the brute-force method of commenting out sections of the code described in the previous paragraph.
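
For completeness, here is a sketch of that brute-force preprocessor. The BISECT/END marker comments are made up for illustration; any unambiguous begin/end convention in the HDL source would do:

```python
# Sketch of the brute-force "comment out sections" preprocessor.  The
# BISECT/END marker comments are made up for illustration.
import itertools
import re

BEGIN = re.compile(r"//\s*BISECT\s+(\w+)")
END   = re.compile(r"//\s*END\s+(\w+)")

def strip_sections(source, disabled):
    """Return `source` with every marked section in `disabled` commented out."""
    out, skipping = [], None
    for line in source.splitlines():
        begin = BEGIN.search(line)
        if skipping is None and begin and begin.group(1) in disabled:
            skipping = begin.group(1)
        out.append("// " + line if skipping else line)
        end = END.search(line)
        if skipping and end and end.group(1) == skipping:
            skipping = None
    return "\n".join(out)

def permutations(sections):
    """Every subset of sections to disable, biggest subsets (smallest designs) first."""
    for k in range(len(sections), 0, -1):
        for combo in itertools.combinations(sections, k):
            yield set(combo)

# Usage: write each stripped variant out, compile it with the flow above,
# and see which disabled subsets make the timing error disappear.
```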

Now let me explain some of the timing bugs I found and how I fixed them. My PDP-11 board has an 8 ns clock period for SRAM access, with 4 ns per 18-bit word on each half-clock = 4.32 Gbps read and 4.32 Gbps write (full-duplex). The first word arrives before the rising edge and is used as a 16-bit data word; the second word is used as a 16-bit parity word. Each data nibble has 4 parity bits, allowing single-bit correction and double-bit detection. A modular test of the ECC would have us believe that we could determine whether there were errors in the 4 ns between the parity word arriving and the next clock edge. Unfortunately, we discovered that the parity path was causing a timing bug between one of the Data pins and the SingleBitError / ResultValid signals.
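
For reference, the per-nibble check is just an extended Hamming (8,4) SECDED code. Here is a behavioral sketch in Python; the parity-bit ordering and the names in the return value are my assumptions for illustration, not necessarily what the board uses:

```python
# Behavioral model of the per-nibble check: an extended Hamming (8,4)
# SECDED code.  The bit ordering here is an assumption for illustration.

def encode_nibble(d):
    """d: list of 4 data bits -> check bits [p1, p2, p4, overall parity]."""
    p1 = d[0] ^ d[1] ^ d[3]
    p2 = d[0] ^ d[2] ^ d[3]
    p4 = d[1] ^ d[2] ^ d[3]
    overall = p1 ^ p2 ^ p4 ^ d[0] ^ d[1] ^ d[2] ^ d[3]
    return [p1, p2, p4, overall]

def check_nibble(d, p):
    """Return (corrected_nibble, single_bit_error, result_valid)."""
    e1, e2, e4, eo = encode_nibble(d)
    # Syndrome is the Hamming position (1..7) of a single flipped bit.
    syndrome = (e1 ^ p[0]) | ((e2 ^ p[1]) << 1) | ((e4 ^ p[2]) << 2)
    overall_bad = (eo ^ p[3]) != 0
    if syndrome == 0 and not overall_bad:
        return d, False, True                 # clean word
    if overall_bad:
        # Single-bit error: positions 3, 5, 6, 7 hold the data bits.
        pos_to_data = {3: 0, 5: 1, 6: 2, 7: 3}
        if syndrome in pos_to_data:
            d = list(d)
            d[pos_to_data[syndrome]] ^= 1     # flip the bad data bit back
        return d, True, True                  # SingleBitError, but still valid
    return d, False, False                    # two bit flips: detect, don't correct
```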

The simple solution is to burn a clock cycle for parity correction, but that costs us a cycle on every memory read. It would be nice to have an error-correcting memory that doesn't consume an entire clock cycle to compute correctness, so we could use the data word for 8 ns and simply reject the result using the parity check done in 4 ns. But if we don't know that the word is valid for an entire cycle, we cannot speculatively use the data word unless we are willing to rewind the pipeline an entire clock cycle. That is certainly possible, but it's a scary proposition nonetheless (better to burn a cycle in this design, since we prefer reliability and simplicity).
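
For what it's worth, the speculative alternative we rejected would look roughly like this toy model (the class and signal names are purely illustrative):

```python
# Toy model of the speculative alternative we rejected: use the data word
# right away and rewind a cycle later if the parity check comes back bad.
# The class and signal names are purely illustrative.

class SpeculativePipeline(object):
    def __init__(self):
        self.in_flight = None   # result computed from a not-yet-checked word

    def tick(self, new_word, prev_word_ok):
        """Advance one clock.  prev_word_ok is the parity verdict (arriving a
        cycle late) for the word we consumed speculatively last cycle."""
        committed, replay = None, False
        if self.in_flight is not None:
            if prev_word_ok:
                committed = self.in_flight    # speculation paid off: commit
            else:
                replay = True                 # rewind: redo the whole cycle
            self.in_flight = None
        if new_word is not None:
            self.in_flight = ("computed from", new_word)  # used before checked
        return committed, replay
```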

It turns out the reason for this timing bug was that one of the data pins was attached to a different I/O bank than the other pins, so the routing delay from that pin became the critical path. This was discovered by tracing the path in the floorplanner. The solution here was to turn off that pin, since the memory gives us 18 data bits and we only use 16 for our word. The next board revision will also fix this.

The next major timing bug was in the operand fetch pipeline, somewhere between the Register and Mode fields and the Virtual Address register (see here for PDP-11 architecture info). The error here was very small (300 picoseconds) and would go away whenever the fan-out and fan-in of some of the signals were decreased by commenting out some functions.

We have 8 nanoseconds to decode an opcode from the arrival of the memory word. This is enough time to read the register from the register file and decode the address mode, but not enough time to guarantee that the virtual address is ready in the address modes where we must pre-decrement the register to generate the virtual address. I fixed the timing of the operand fetch stage by only decoding the address mode and fetching the register word during the opcode decode stage. The operand fetch stage now uses the decoded address mode and register word to generate the virtual address in a second cycle. It would be possible to optimize this so that in the majority of cases, where the register is the virtual address, we start with that assumption and then invalidate the result when we discover that another cycle is required to decode the address mode.
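
A rough model of the split, in Python for clarity (only the register and autodecrement modes are shown; the stage functions and dictionary are illustrative):

```python
# Rough model of the two-cycle operand fetch.  Only the register and
# autodecrement modes are shown; the stage functions are illustrative.

MODE_REGISTER      = 0   # operand is the register itself
MODE_AUTODECREMENT = 4   # register is pre-decremented, then used as the address

def decode_stage(instruction, registers):
    """Cycle 1 (opcode decode): split out mode/register and read the register file."""
    mode = (instruction >> 3) & 0o7   # destination field: mode in bits 5:3
    reg  = instruction & 0o7          # register in bits 2:0
    return {"mode": mode, "reg": reg, "reg_word": registers[reg]}

def operand_fetch_stage(decoded, registers):
    """Cycle 2 (operand fetch): any arithmetic on the register word happens here."""
    mode, reg, word = decoded["mode"], decoded["reg"], decoded["reg_word"]
    if mode == MODE_AUTODECREMENT:
        word = (word - 2) & 0xFFFF    # word-sized pre-decrement
        registers[reg] = word
    return word                        # the virtual address (or the operand, in mode 0)
```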

The most pernicious timing bug involved our floating-point multiplier, which was partitioned into multiple DSP48 blocks. The multiplier core was generated by coregen with enough pipeline stages to meet timing. Compiling it in its own project revealed a minimum clock period of 2.3 ns, but it just barely broke timing when compiled with everything else: it was off by roughly the clock uncertainty. We thought that perhaps the tool was retiming the multiplier pipeline to just barely meet timing, and that clock jitter and uncertainty were added afterwards, causing it to break the constraint. We tried multiple permutations of synthesis options and recompiled, to no avail. We also added stages to the pipeline, also to no avail.

To solve this problem, we just created an entirely new project without any timing constraints, let it do the best it could, and met timing with 35 picoseconds to spare. Hallelujah!

If you are in the middle of a painful timing closure, I'm sorry for you, and I hope you can find something useful from this post.