Monday, June 05, 2006

genetic algorithm for genetic research on FPGA

In anticipation of my impending internship at Xilinx, I have been searching for an appropriate algorithm to accelerate on an FPGA. A good friend of mine in the Broad Institute at MIT sent me an email last week describing the problem of understanding the binding strategies of the transcription factors of human DNA. On many occasions in the past, we have distracted ourselves from homework by talking about FPGAs (my interest) and bioinformatics (his interest). It now seems some collaboration is in order.

Here is the best I can describe the problem: my understanding of the biology is extremely limited, but the underlying algorithmic problem should be clear (my friend is clearly responsible for any biological aspects of this work). Suppose a set of proteins regulate DNA transcription for genes A, B and C but not genes D, E and F, then there must be some a common pattern in the DNA for genes A, B and C that is not in genes D, E, and F. Unfortunately this DNA pattern matching is not necessarily straightforward: it is not simply the case that genes A, B, and C have the sequence "GGACT" in their regulatory region, which is not in the regulatory region of genes D, E and F. The codes may have a more complex "logic" to them. For example, it may be the case that the pattern is of the form "Does not contain (GGATTC or ACCTAG) within 100 base pairs of a code with a Hamming distance of 2 from ACGGTCCGT." If A, B and C match this pattern and D, E and F do not, then this would explain why the transcription factors in question regulate A, B and C and not D, E and F.

After discussing the problem for a while, we came to the perhaps not coincidental conclusion that a genetic algorithm would be best suited for this problem and that exploiting the parallelism and granularity benefits of an FPGA implementation would offer a lot of acceleration potential. For such a genetic algorithm we test populations of "binding strategies" in an environment consisting of a set of genes joined with a 1 or a 0 depending on if the transcription factors in question regulate that gene or not. The fitness of a binding strategy is measured by the number of times the strategy correctly predicts regulation (1) or no regulation (0). As the algorithm progresses, we select the better binding strategies, breed them and mutate them. We may breed two seperate strategies by combining them with a logical combinator or the "within k base pairs of" combinator. We may mutate the strategies by negating them, by changing, adding or removing characters, by adding or removing logical combinators, or by adding "hamming distance of m" conditions (negating the worst performing binding strategy in a population actually makes sense since it is going to produce a more fit member in the next generation). Generally, we would prefer it if our binding strategies were "simple" to avoid overfitting, and there will be a preference for simplicity in the selection strategy.

This problem lends itself to an FPGA implementation since we may implement an entire population of binding strategies and evaluate their fitness in parallel. We may also perform the population breeding process in parallel and reconfigure the system to evalute the next generation. An well designed approach will be able to share logic among binding strategies in a similar fashion as how the Rete algorithm avoids repeating sub-pattern matching. FPGAs have been shown to produce 1000x speedups over CPU implementations for similar genetic algorithms and also shown equivalently absurd speed-ups for pattern matching algorithms. Certainly we will have to test an FPGA impementation against the cluster supercomputer at the Broad Institute.

One neat idea for accelerting genetic algorithms on an FPGA is to use partial reconfiguration to allow dynamic mutation and breeding during execution. If a member of the population is consistently failing after examining only 10% of the environment data, it may as well be killed off and replaced with a child of more successful members of the population. For this specific problem, after a binding strategies fails to predict the transcription factors behavior on 123 genes, we can recycle the hardware it occupies and replace it with the child of a more successful binding strategy. Alternatively, if a binding strategy succeeds 10% more than any other, it may kill off a "weaker" binding strategy and recycle its hardware for one of its children. This may prove a much better genetic algorithm strategy than using discrete generations.