Tuesday, October 03, 2006

profiling and load balancing on heterogeneous architectures

I am working on a paper. Here is a draft.

Most profiling tools only consider execution time on a single CPU or a homogeneous array. What metrics are useful for profiling an application on a heterogeneous and reconfigurable platform?

I offer the unit GOPS/$ as a metric for computational efficiency. GOPS stands for “billions of operations per second” and is dependent on the application. TeraFlops, for contrast, is a measure of explicitly floating-point operations. The cost function is dependent both on the type of operation being performed and the computational fabric.

To quantify GOPS I will use the concept of information entropy. If a boolean random variable is evenly distributed, p(0) = p(1) = .5, then we gain 1 bit of information from resolving its value. Consider a two input NOR gate with evenly distributed inputs. The output of the NOR gate has p(0) = .75 and p(1) = .25. Resolving the output of the NOR gate provides us with .811 bits of information.

Consider now if both of the inputs to the NOR gate are independent and have p(1) = .99 and p(0) = .01. The output of our NOR gate now has p(1) = .0001 and p(0) = .9999. Resolving the NOR gate only provides us with .0015 bits of information, substantially less than before. Yet the circuitry providing us with this information has the same cost as in the previous case.

"Information entropy" provides a "physical" basis for measuring an operation. If GOPS is billions of operations per second, then it's "physical" unit is (information entropy / time). GOPS/$ = "information entropy / (time * cost)" If the information entropy of an output pin is small, then it may not be worth the cost of implementing the hardware.

For example, consider an adder whose inputs have high probability of being small and low probability of being large. The information entropy of the output bits is very low for the high order bits. Depending on the costs and probabilities, it may be worthwhile to use an 8 bit adder instead of a 32 bit adder. If there is some finite probability of inputs being larger than 8 bits then we will need some detection circuit to handle this case. This adds a fixed cost to the circuitry. We can quantify the cost as follows:

$(8 bit adder) = [p(<> 8 bit input) * $(> 8 bit addition)] + [$(detection circuitry) + $(adder circuitry)]

If we compare this cost function across a variety of bit widths we can deduce an optimal bit width for our adder. The cost functions don't look exactly like this for all bit widths: if we had used a 4 bit adder for example, the cost for performing 4 bit, 8 bit, 12 bit, and 16 bit addition would all be different and would have to be taken into account.

We also want to consider profiling across multiple co-processors. Suppose we wish to perform N FFT operations of size s and we have the option of using either our CPU, a GPGPU or an FPGA. Let's suppose we only wish to perform 1 FFT of size 128. In this case it may not be worth the overhead of offloading the process to the GPGPU or the FPGA since we only need to perform 1 operation. As a result, GOPS/$ is maximized by using the CPU to compute the FFT.

Consider now that we have 128 FFT operations of size 128. In this case, the throughput benefits associated with offloading the process ammortizes the cost of doing so. We may offload the task to either the FPGA or the GPGPU. If the FPGA already has FFT circuitry configured and assuming it performs FFTs substantially better than the GPGPU, then the task should be performed in the FPGA. However, if the FPGA is not configured with an FFT, then for a problem of this size the overhead associated with configuring the FPGA may preclude using it for this operation. Thus we will use the GPGPU if the FPGA does not already contain an FFT. Now suppose that we want to perform 2048 FFTs of size 2048. The cost of configuring the FPGA for this task is ammortized by the size of the job and thus it will always be beneficial to perform the FFT on the FPGA.

The result of this discourse is that choosing an accelerator methodology in a heterogeneous reconfigurable fabric may be a runtime consideration depending on the size of the operation to be performed and the configuration of the system. A load balancing subsystem will need to simplify the task of profiling an application by determining some high dependency variables. To keep the overhead associated with a run-time load balancer extremely low, we will want to generate a "condition set" at profile-time and link each condition with a particular configuration and methodology.

To manage such a load balancer, I propose using a financial model in which processes get "paid" for performing their tasks and use this money to pay for the resources they use. A well designed economic system will have its basis in meaningful cost metrics. Some primary factors for the cost function is the power, time, thermal dissipation and area required to perform the computation. Remember that GOPS/$ has units of (bits / (time*cost)). We put time into the denominator and into the cost function since all things being equal we would prefer a faster solution for the same cost. If speed costs substantial amounts of energy we will need to take that into consideration. The cost associated with time is split between to factors: the ammortized cost over the device life of the hardware and the urgency of the computation.

The urgency factor of the time cost of an operation is highly dependent on it's location in the pipeline. For example, if task A and task B are both prerequisites for task C, then we will want to accomplish A and B as fast as possible. Suppose that A takes 4 times longer than B if we solely optimize for time. We now have flexibility to minimize the cost of B. For instance we may lower the voltage of the circuitry processing B which will slow the circuit, but may save us substantially in terms of power. Suppose B is a 32 bit addition, we may decide to transition it to an 8 bit adder to save on space though require four times as long to produce a 32 bit result. Depending on the cost functions we may choose to go with a middle-ground: a 16 bit adder with a slightly lower voltage that still completes the task in time. This decision may be made to avoid the opportunity cost associated with not using the circuitry at it's full compute capacity,

Alternatively, we may find that task B is common in our process schedule that we wish to share the resources to perform B with different processes. We may choose among various methods to share B. If task C is highly critical, we will want to use a dedicated-priority sharing manager that will only share B if there is no request for C. Similarly a non-dedicated-priority sharing manager will assign priorities to each of the possible tasks who may want to use its resources. Presumably a task could pay more to have higher priority. A non-priority sharing manager offers its resources at the same price to everyone, with no guarantee that a given task will receive priority, though there will be some guarantee on latency.

An adaptive profiling and load balancing mechanism will also need to be optimized to determine how to minimize the overhead costs associated with profiling and optimization. In order to do this, we will want to keep a strategy database to provide the load balancer with information about how to manage a process topology (a set of processes to be executed simultaneously). We can ascribe a set of modes for a task dispatched by the load-balancer. In the simple "do nothing unless told to" mode, the load-balancer only dispatches based on specific directives from the application. In "simple management mode" the load balancer will use it's strategy database to manage only those process topologies it has encountered before. In "agressive management mode" the load balancer will make assumptions about the process topologies (such as bit width assumptions or timing assumptions) to relate the topology to previously encountered topologies. Presumably there is some gradient of optoins between simple and agressive management modes. We will prefer the simple management mode or the "do nothing unless told to" mode for "semi-constant" (mostly the same process topology through all time) or diagnostic applications which we want to have lower level control over the hardware.

The aggressive mode will be preferable when we have flexibility to tinker with the application while it is running to determine more optimal partitioning configurations. If we take the aggressive mode to it's logical extreme we have "profile-mode," in which execution is extremely slow, yet the load balancer will be able to produce an analysis of the execution of the task on across a variety of platforms and topologies. We would probably want to enter "profile mode" during idle time and we will want to consider process topologies that we have encountered in the past to build up the strategy database.


Anonymous said...

Random comments FWIW: (1) There are some metrics in common between run-time load balancing and design-time partitioning. There are also some important differences in precision of result and run-time overhead. There may be a more immediate need for good exposition on partitioning, IMO. (2) The entropy approach smacks of similar (yet very different) work in value predition by Lipasti that has an intriguing premise but yielded more intrigue and papers than realization in time. But papers are good, right? (3) The notion of "speculate and detect" is currently getting attention in research regarding operating circuits near there failure region wrt supply voltage for the sake of low-power in the presence of process variation. Similar yet very different. (4) FYI, see PeakStream startup for a GPGPU programming approach out the Stanford Brook project

Kamana said...

Have you already published this paper? I am interested to read the paper.