Tuesday, October 24, 2006
The Economics of Sharing
The internet has made the "cost of distribution" of accessing extra computing power essentially free for things like SETI@Home and file-sharing networks. In these cases the resource being distributed is essentially free. In contrast, an open source developer must dedicate substantial time to his work. The article poses the question: why would a developer freely distribute open source works? Developing trust and reciprocity, as well as gaining the respect of the programming community, are offered as reasons. This is true, but I think a major reason why programmers release open source code is that they enjoy the products of their coding and wish to share the fruits of their labor with others. Open source software is like the graffiti murals of the software industry.
The essay on sharing is by Yochai Benkler of the Yale Law School. Benkler argues that sharing is a new mode of economic production, using distributed computing and carpooling as case studies. Carpooling and distributed computing are similar in that they are very wide-scale sharing practices of privately owned goods whose excess capacity has established market models. Benkler writes: "In these characteristics carpooling and distributed computing are like peer-to-peer networks, or ad hoc wireless mesh networks, or like the labor individual programmers put into free software."
Benkler uses the notion of "bumpiness" to describe the quantization of capacity. A car is often designed to accommodate multiple passengers and thus generally has excess capacity. Similarly, computers are designed to run media-intensive applications extremely quickly, but are often used simply for word processing. In both domains, cars or computers, there is excess capacity due to the quantization of the supplied good.
There are established market models that capitalize on quantization inefficiencies. Bus systems meet transportation demand and data centers meet computational demand. In both of these models we consolidate multiple subsystems in order to achieve high utilization. The added efficiency can often present an economic advantage to the bus rider and data-center customer alike. Another relevant example of consolidation and sharing reducing quantization-induced capacity inefficiencies is the fabless semiconductor company, which eliminates extremely large capital expenditures by outsourcing fabrication.
A great way to identify opportunities for business innovation is by identifying such quantization inefficiencies. To determine whether there is a market for resource sharing, we must consider the model of both the consumer and the requisite material providers. How must our consumers adapt to apply our innovation? How much dependency is there on prerequisite component providers? In the case of the fabless semiconductor industry these questions become more specific: how must the engineer specify the IC design? Can fabrication facilities be made "exchangeable"? If there is variance between fabrication facilities, can we measure the tradeoffs affecting the design? For example, suppose we may decide between using a 90nm and a 65nm fabrication facility to implement an HDL specification. The resulting 65nm chip may have higher unit costs at the anticipated volume, yet run faster and consume less dynamic power, so it may be sold at a higher price. Depending on the cost analysis, we may need to decide among a variety of different options.
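To make that concrete, here is a toy comparison in Python with invented numbers (the unit costs, sale prices, and volume are all hypothetical) showing how the per-unit economics of the two processes might be weighed:

# Hypothetical numbers for comparing two fabrication options for the same
# HDL design; all figures are made up for illustration.

options = {
    "90nm": {"unit_cost": 12.00, "sale_price": 20.00},  # cheaper to make, sells for less
    "65nm": {"unit_cost": 15.00, "sale_price": 26.00},  # faster, lower power, premium price
}

volume = 100_000  # anticipated unit volume

for name, o in options.items():
    profit = (o["sale_price"] - o["unit_cost"]) * volume
    print(f"{name}: profit = ${profit:,.0f}")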
The very same decision must be made by airline consumers, who must choose between faster or more convenient, but also more expensive, flights. If time is a strong constraint, as it may be for a person flying on a tight schedule, then the person may be willing or required to pay a premium price for a particularly convenient flight. An airline must optimize its flight scheduling to make sure each type of airplane is maximally shared in order to maximize profit. The pricing and scheduling model for an airline company reflects demand for different flights. Jet Blue has been able to dramatically reduce operating costs by managing a homogeneous fleet. By reducing operating costs and serving high-demand markets, they can reduce prices to maximize quantity sold and thus maximize profit. Yet there is clearly a market space for alternatives to Jet Blue, and we would not want to use a passenger airliner to ship cargo. Similarly, some degree of heterogeneity will maximize the efficiency of a computing array. Could the management costs of a heterogeneous computing array present an overhead such that we would prefer to view computational resources as homogeneous clusters?
In order to have an effective process scheduling system for a heterogeneous reconfigurable computer, processes must share resources with varying costs and constraints. In a heterogeneous array there may be x86s, GPUs, and FPGAs, which overlap in functionality but are each optimal for specific workloads, just as planes, buses, cars, bikes, and feet can all get you there. By incentivizing "car pooling," we may find ways to maximize efficiency. It's not even a distant metaphor to suggest that shared resources should get to travel in the fast lane.
An algorithm to optimize computational resource sharing should inherit semantics and structure from a general economic model. Such an economic model would provide a method for multiple locally optimizing (greedy) process scheduling agents to globally optimize the system.
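As a rough illustration of what I mean (this is my own toy sketch, not Benkler's model or a real scheduler), imagine each process as a greedy agent that simply buys the cheapest resource available, while resources raise their prices as they load up; the purely local decisions end up spreading work across the array:

# Toy congestion-pricing scheduler: greedy agents, globally reasonable outcome.

class Resource:
    def __init__(self, name, base_price):
        self.name = name
        self.base_price = base_price
        self.load = 0

    def price(self):
        # Congestion pricing: the busier the resource, the more it charges.
        return self.base_price * (1 + self.load)

    def accept(self, task):
        # Register the task; subsequent bidders see a higher price.
        self.load += 1
        return f"{task} -> {self.name} at price {self.price():.2f}"

def greedy_schedule(tasks, resources):
    # Each task is a locally optimizing agent: it only looks at current prices.
    return [min(resources, key=lambda r: r.price()).accept(t) for t in tasks]

resources = [Resource("x86", 1.0), Resource("GPU", 0.8), Resource("FPGA", 0.6)]
for line in greedy_schedule([f"task{i}" for i in range(6)], resources):
    print(line)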
Examining strategies derived from anthropological and economic bases provides a good perspective for exploring complex multi-agent management models. Benkler's essay presents a case study on "dynamic ad-hoc carpooling." One example relevant to security: a woman is much less likely to carpool with two men she does not know (page 286 of the law journal). Thus "security" and "trust" are relevant motivational factors in the decision to share. That same woman would probably have no objection to sharing an airplane regardless of the gender ratio. This is based on a trust relationship with the service provider: each agent sharing the plane has been authenticated prior to entering the shared arrangement. In a shared, distributed computer system, methods of authenticating agents and forming trust relationships between agents must be established to guarantee some level of security.
Security issues related to technological innovations are usually not unbridgeable gaps in the system design, but rather psychological gaps. Such situations similarly require that the client and vendor have a means of forming a trust relationship or authenticating each other through other relationships. There may be a psychological barrier to convincing someone to outsource their data center due to security concerns, yet Google's mass appeal demonstrates that trust relationships can develop between a user and a computing service provider.
Distributed computing networks present two separate security issues, one for the requisite provider and one for the computation consumer. Suppose I were to sell a service using spare cycles for distributed analytics. I buy the spare computing cycles from companies A, B, C, D, and E, and use them to perform distributed analytics for company A. I would have to convince A that its analysis will be secure and not revealed to B through E. Similarly, I would have to convince B through E that their internal networks are secure while they are servicing requests from A. This is a psychological problem rather than an inherent limitation of the computing system, since all of the computers involved could already be securely consolidated to provide A through E with their computation requirements. It is likely that this will be the biggest limiting factor for utility computing.
Finally, on the economics of sharing and open source programming. During my experience at Xilinx, I got a better understanding of the motivation for closed source tools. It has everything to do with dependency and part lock-in, less to do with support, and is only psychologically related to security (one of the great one-liners I took out of DEFCON this year was "The system is secure because it is only accessible by our proprietary readers"). Today I read a thread from 2000 on comp.arch.fpga about FPGA openness. The FPGA hardware market is hindering its own growth by each manufacturer structuring its toolset around closed specifications. There is an enormous waste of potential in the goal of disincentivizing interchangeability, which runs contrary to the natural tendency toward commoditization of computing components. Xilinx has been enjoying high margins even with respectably high volume, though I fully believe they could increase volume by a factor of 5 without causing margins to drop by more than 1/5. This argument is clearly hand-waving, but the point is that they could take on new customers without losing anything substantial. The first FPGA company (there aren't many to choose from) to open critical parts of its system and toolset will see an appreciable increase in its computing market share because its parts will be more easily adopted as accelerator co-processors. This may in fact be a case where 1024 chickens (open source programmers) would plow the field faster if only the two strong oxen (Xilinx and Altera) would get their proprietary formats out of the way.
With or without Xilinx, the market for accelerator co-processors will continue to grow as consolidated heterogeneous data centers become more and more commonplace. I'm willing to bet the next seven years of my life on it.
Thursday, October 19, 2006
Globally Asynchronous Locally Synchronous
Anyway, so more on GALS. GALS is nice because it eases clock distribution, which consumes a large portion of the power on a chip. With multiple local clocks you can control dynamic voltage and frequency scaling in a localized manner to power-optimize a process load balancer. Ring oscillators provide a clock for a small region of the chip, which communicates asynchronously with other regions.
Asynchronous communication schemes are pretty well established, especially on the Internet, and asynchronous processing methods are emerging now with AJAX applications dispatching client-side programs. On-chip asynchronous processing should be able to inherit its programming metaphor from how people run distributed applications on the Internet.
As for running distributed applications with an asynchronous protocol, Joe and I have been cooking up a system which provides an AJAX connection to a server-side pipe. We've connected a Scheme interpreter to the server pipe (we also put Python on there for a bit). We also took jsScheme, a client-side read-eval-print loop (REPL) implemented in Javascript, and put it up on the same page as the server-side REPL. We haven't connected the server to the client yet, and there are a number of security and protocol issues that need to be worked out before we provide a demo of a massively distributed application over AJAX. We do have the ability to launch multiple client-side or server-side REPL environments, which is interesting for distributed multiprocessing. I am currently implementing a job queue and a communication protocol.
When the client webpage loads, it says:
(tell server '(repl-requested))
The server that receives the '(repl-requested) message responds with a client-side (Javascript) REPL environment and a command to contact an available authentication agent:
(tell client `(authenticate ,(pop-authentication-agent!)))
The authentication agent profiles the client REPL (how long does it take to respond? who is the user? do we trust this anonymous user?). After authenticating the client REPL and determining its privileges, it tells the client REPL to report to a task manager.
The task manager provides the client with a set of definitions and instructs it to compute some process. At the end of the process the client returns the result and requests a new process. If the client side comes upon an undefined symbol, it can make a server request to define the symbol. I think we'll get to this point in a few weeks.
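To give a sense of the job queue I'm working on (everything here is a hypothetical Python sketch, not our actual code), the server-side task manager might look something like this:

# Rough sketch of a task manager that hands jobs to authenticated client
# REPLs, serves definitions on demand, and collects results.

from collections import deque

class TaskManager:
    def __init__(self, definitions):
        self.definitions = definitions   # symbol -> s-expression source
        self.pending = deque()           # jobs waiting for a client
        self.results = {}                # job id -> result

    def submit(self, job_id, expr):
        self.pending.append((job_id, expr))

    def next_job(self, client_id):
        # Called when a client REPL reports in or finishes a job;
        # client_id would be used for per-client bookkeeping and trust.
        return self.pending.popleft() if self.pending else None

    def define(self, symbol):
        # The client hit an undefined symbol; send back its definition if we have it.
        return self.definitions.get(symbol)

    def report(self, job_id, result):
        self.results[job_id] = result

tm = TaskManager({"square": "(define (square x) (* x x))"})
tm.submit(1, "(square 7)")
print(tm.next_job("client-42"))   # -> (1, '(square 7)')
tm.report(1, 49)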
What will be really interesting is to see what kinds of applications can attract a lot of user time and benefit from distributed computation. Online games with distributable AI seem like a good fusion of these characteristics and might alleviate the scalability issues associated with high traffic.
This web programming seems like a departure from reconfigurable computing, but the same GALS programming model provides a strong metaphor for FPGA control, and a lot of the process management must be "reconfigurable" in order to tolerate faulty or defective clients.
Here's a well-written article on AJAX.
Joe in the news
Wednesday, October 11, 2006
why reconfigurable computing hasn't caught on yet
The problem is that reconfigurable computing isn't a hardware problem. It's a software problem, and it's a really big and hairy software problem too. There's no clear starting point either: part of the software problem is just identifying which problems need solving. Most articles about "The Next Big Thing in Computing" have a skeptic keenly observe that tool support for reconfigurable computing is lacking. Lacking tool and system standards, we are unlikely to truly produce any "next big thing."
So several companies have come along to deliver compiler tools that make development easier in some way or another. These companies fare better since they can gain traction without enormous expenditure. But reconfigurable computing still hasn't made its promised impact, because tools are only a small slice of the software problem. What we have is a chicken-and-egg problem: do the tools create the demand for applications, or does demand for applications create the tools? The real software problem is to produce an important application that relies on reconfigurable computing technology. Driving demand for reconfigurable computing applications will have the effect of growing the toolset for developing such applications.
This feedback is nice, but it cannot be forced by pumping up compiler tools alone. Without substantial attention put into runtime environments and user interfaces it is unlikely that reconfigurable computing applications will take off. If you want to know how I think I can overcome these barriers, please email me.
Tuesday, October 03, 2006
profiling and load balancing on heterogeneous architectures
Most profiling tools only consider execution time on a single CPU or a homogeneous array. What metrics are useful for profiling an application on a heterogeneous and reconfigurable platform?
I offer the unit GOPS/$ as a metric for computational efficiency. GOPS stands for "billions of operations per second" and is dependent on the application. TeraFLOPS, by contrast, measures explicitly floating-point operations. The cost function depends both on the type of operation being performed and on the computational fabric.
To quantify GOPS I will use the concept of information entropy. If a boolean random variable is evenly distributed, p(0) = p(1) = .5, then we gain 1 bit of information from resolving its value. Consider a two input NOR gate with evenly distributed inputs. The output of the NOR gate has p(0) = .75 and p(1) = .25. Resolving the output of the NOR gate provides us with .811 bits of information.
Consider now if both of the inputs to the NOR gate are independent and have p(1) = .99 and p(0) = .01. The output of our NOR gate now has p(1) = .0001 and p(0) = .9999. Resolving the NOR gate only provides us with .0015 bits of information, substantially less than before. Yet the circuitry providing us with this information has the same cost as in the previous case.
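A quick script to check these numbers (just the standard binary-entropy calculation):

# Verify the NOR-gate entropy figures above.
from math import log2

def entropy(p1):
    p0 = 1 - p1
    return -sum(p * log2(p) for p in (p0, p1) if p > 0)

# Inputs uniform: output is 1 only when both inputs are 0, so p(1) = 0.25.
print(entropy(0.25))     # ~0.811 bits
# Inputs with p(1) = 0.99: p(output = 1) = 0.01 * 0.01 = 0.0001.
print(entropy(0.0001))   # ~0.0015 bits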
"Information entropy" provides a "physical" basis for measuring an operation. If GOPS is billions of operations per second, then it's "physical" unit is (information entropy / time). GOPS/$ = "information entropy / (time * cost)" If the information entropy of an output pin is small, then it may not be worth the cost of implementing the hardware.
For example, consider an adder whose inputs have a high probability of being small and a low probability of being large. The information entropy of the high-order output bits is very low. Depending on the costs and probabilities, it may be worthwhile to use an 8-bit adder instead of a 32-bit adder. If there is some finite probability of inputs being larger than 8 bits, then we will need a detection circuit to handle that case. This adds a fixed cost to the circuitry. We can quantify the cost as follows:
$(8-bit adder) = [p(input > 8 bits) * $(> 8-bit addition)] + [$(detection circuitry) + $(adder circuitry)]
If we compare this cost function across a variety of bit widths, we can deduce an optimal bit width for our adder. The cost functions don't look exactly like this for all bit widths: if we had used a 4-bit adder, for example, the costs of performing 4-bit, 8-bit, 12-bit, and 16-bit additions would all be different and would each have to be taken into account.
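Here is a sketch of that comparison with made-up probabilities and cost numbers, following the cost equation above:

# For each candidate width we pay for the adder and the detection circuitry,
# plus the expected cost of falling back to wide addition when an input
# doesn't fit. All figures are invented for illustration.

def expected_cost(p_overflow, adder_cost, detect_cost, fallback_cost):
    return p_overflow * fallback_cost + detect_cost + adder_cost

candidates = {
    # width: (p(input wider than width), adder cost, detection cost)
    8:  (0.05,  1.0, 0.3),
    16: (0.005, 2.0, 0.3),
    32: (0.0,   4.0, 0.0),   # full width: no detection or fallback needed
}
fallback_cost = 10.0  # cost of handling an overflowing input some other way

for width, (p, adder, detect) in candidates.items():
    print(width, expected_cost(p, adder, detect, fallback_cost))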
We also want to consider profiling across multiple co-processors. Suppose we wish to perform N FFT operations of size s and we have the option of using either our CPU, a GPGPU or an FPGA. Let's suppose we only wish to perform 1 FFT of size 128. In this case it may not be worth the overhead of offloading the process to the GPGPU or the FPGA since we only need to perform 1 operation. As a result, GOPS/$ is maximized by using the CPU to compute the FFT.
Consider now that we have 128 FFT operations of size 128. In this case, the throughput benefit associated with offloading the process amortizes the cost of doing so. We may offload the task to either the FPGA or the GPGPU. If the FPGA already has FFT circuitry configured, and assuming it performs FFTs substantially better than the GPGPU, then the task should be performed on the FPGA. However, if the FPGA is not configured with an FFT, then for a problem of this size the overhead associated with configuring the FPGA may preclude using it for this operation. Thus we will use the GPGPU if the FPGA does not already contain an FFT. Now suppose that we want to perform 2048 FFTs of size 2048. The cost of configuring the FPGA for this task is amortized over the size of the job, and thus it will always be beneficial to perform the FFT on the FPGA.
The result of this discourse is that choosing an accelerator methodology in a heterogeneous reconfigurable fabric may be a runtime consideration, depending on the size of the operation to be performed and the configuration of the system. A load balancing subsystem will need to simplify the task of profiling an application by determining some high-dependency variables. To keep the overhead associated with a run-time load balancer extremely low, we will want to generate a "condition set" at profile time and link each condition with a particular configuration and methodology.
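A condition set for the FFT example might boil down to something like the following sketch (the thresholds are invented; in practice they would come from profiling):

# Given the batch size and whether the FPGA already holds an FFT
# configuration, pick a target. Thresholds are placeholders.

def choose_fft_target(num_ffts, fft_size, fpga_has_fft):
    work = num_ffts * fft_size
    if work < 1_000:                 # too small to amortize any offload
        return "CPU"
    if fpga_has_fft:                 # configuration cost already paid
        return "FPGA"
    if work > 1_000_000:             # big enough to amortize reconfiguration
        return "FPGA (reconfigure)"
    return "GPGPU"

print(choose_fft_target(1, 128, False))       # CPU
print(choose_fft_target(128, 128, False))     # GPGPU
print(choose_fft_target(128, 128, True))      # FPGA
print(choose_fft_target(2048, 2048, False))   # FPGA (reconfigure)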
To manage such a load balancer, I propose using a financial model in which processes get "paid" for performing their tasks and use this money to pay for the resources they use. A well designed economic system will have its basis in meaningful cost metrics. Some primary factors for the cost function are the power, time, thermal dissipation, and area required to perform the computation. Remember that GOPS/$ has units of (bits / (time * cost)). We put time into the denominator as well as into the cost function since, all things being equal, we would prefer a faster solution for the same cost. If speed costs substantial amounts of energy, we will need to take that into consideration. The cost associated with time is split between two factors: the amortized cost over the device life of the hardware and the urgency of the computation.
The urgency factor of the time cost of an operation is highly dependent on its location in the pipeline. For example, if task A and task B are both prerequisites for task C, then we will want to accomplish A and B as fast as possible. Suppose that A takes 4 times longer than B if we solely optimize for time. We then have flexibility to minimize the cost of B. For instance, we may lower the voltage of the circuitry processing B, which will slow the circuit but may save us substantially in terms of power. If B is a 32-bit addition, we may decide to transition it to an 8-bit adder to save on space, though it will require roughly four times as long to produce a 32-bit result. Depending on the cost functions, we may choose a middle ground: a 16-bit adder at a slightly lower voltage that still completes the task in time. This decision may be made to avoid the opportunity cost associated with not using the circuitry at its full compute capacity.
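Here is a sketch of that slack-driven choice with invented times and costs: B's deadline is set by A's completion time, so we pick the cheapest implementation of B that still finishes within the slack:

# Pick the cheapest implementation of B that still finishes before A does.
# All figures are invented for illustration.
time_A = 4.0   # A takes 4x as long as the fastest implementation of B

# (name, completion time relative to the fastest B, cost in power/area units)
options_for_B = [
    ("32-bit adder, full voltage",   1.0, 10.0),
    ("16-bit adder, lower voltage",  2.5,  6.0),
    ("8-bit adder, serialized",      4.2,  3.0),   # four passes plus carry handling
]

feasible = [(name, cost) for name, t, cost in options_for_B if t <= time_A]
print(min(feasible, key=lambda pair: pair[1]))   # cheapest option that meets the deadline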
Alternatively, we may find that task B is so common in our process schedule that we wish to share the resources that perform B among different processes. We may choose among various methods to share B. If task C is highly critical, we will want to use a dedicated-priority sharing manager that will only share B's resources if there is no pending request from C. Similarly, a non-dedicated-priority sharing manager will assign priorities to each of the possible tasks that may want to use its resources; presumably a task could pay more to have higher priority. A non-priority sharing manager offers its resources at the same price to everyone, with no guarantee that a given task will receive priority, though there will be some guarantee on latency.
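The three sharing policies might be sketched as follows (the request format and selection rules are invented for illustration):

# Each manager decides which pending request gets the shared resource next.

def dedicated_priority(requests, owner="C"):
    # Serve the owning pipeline (C) if it is waiting; otherwise share with anyone.
    owners = [r for r in requests if r["task"] == owner]
    return (owners or requests)[0] if requests else None

def priced_priority(requests):
    # Highest payment wins; tasks can pay more for higher priority.
    return max(requests, key=lambda r: r["payment"], default=None)

def no_priority(requests):
    # Same price for everyone, no priority, but bounded latency:
    # serve the request that has waited longest.
    return min(requests, key=lambda r: r["arrival"], default=None)

requests = [
    {"task": "other-process", "payment": 3, "arrival": 0},
    {"task": "C",             "payment": 1, "arrival": 2},
]
print(dedicated_priority(requests))   # C wins: the resource is dedicated to its pipeline
print(priced_priority(requests))      # other-process wins by paying more
print(no_priority(requests))          # other-process wins by arriving first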
An adaptive profiling and load balancing mechanism will also need to be optimized to minimize the overhead costs associated with profiling and optimization. To do this, we will want to keep a strategy database to provide the load balancer with information about how to manage a process topology (a set of processes to be executed simultaneously). We can ascribe a set of modes to a task dispatched by the load balancer. In the simple "do nothing unless told to" mode, the load balancer only dispatches based on specific directives from the application. In "simple management mode" the load balancer uses its strategy database to manage only those process topologies it has encountered before. In "aggressive management mode" the load balancer makes assumptions about the process topologies (such as bit-width or timing assumptions) to relate a topology to previously encountered topologies. Presumably there is some gradient of options between the simple and aggressive management modes. We will prefer the simple management mode or the "do nothing unless told to" mode for "semi-constant" applications (mostly the same process topology through all time) or for diagnostic applications for which we want lower-level control over the hardware.
The aggressive mode will be preferable when we have the flexibility to tinker with the application while it is running to determine more optimal partitioning configurations. If we take the aggressive mode to its logical extreme we have "profile mode," in which execution is extremely slow, yet the load balancer is able to produce an analysis of the execution of the task across a variety of platforms and topologies. We would probably want to enter "profile mode" during idle time, and we will want to consider process topologies that we have encountered in the past to build up the strategy database.
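A minimal sketch of how the mode setting might gate use of the strategy database (the mode names come from the discussion above; the topology keys and lookup scheme are invented):

# Decide how to handle an incoming process topology based on the load
# balancer's mode. Topologies are identified by a key; the strategy
# database maps known keys to previously discovered configurations.

strategy_db = {"fft128x128": "gpgpu", "fft2048x2048": "fpga"}

def dispatch(topology_key, mode, similar_key=None):
    if mode == "do-nothing-unless-told-to":
        return "follow application directive"
    if mode == "simple" and topology_key in strategy_db:
        return strategy_db[topology_key]
    if mode == "aggressive":
        # Relate the topology to one we've seen before (e.g. by relaxing
        # bit-width or timing assumptions) and reuse its strategy.
        key = topology_key if topology_key in strategy_db else similar_key
        if key in strategy_db:
            return strategy_db[key]
    return "fall back to default placement"

print(dispatch("fft128x128", "simple"))                    # gpgpu
print(dispatch("fft128x96", "simple"))                     # fall back to default placement
print(dispatch("fft128x96", "aggressive", "fft128x128"))   # gpgpu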
Sunday, October 01, 2006
ramble on
Of course the question of which accelerator to use is extremely dependent on profiling tools and compilers, but more importantly on scalability and "future proof"-ness. Here is where I think FPGAs will win, since they will continue to be technology drivers and will lead the way to 45nm and below. One other nice facet of having a data center full of FPGAs is that you can identify useful hardware structures and provide feedback to manufacturers on which features to "harden" into the next fabric.
More important than developing the framework for accelerating applications is actually delivering on the applications, even using existing tools. I see a convergence between utility computing (data center outsourcing) and the accelerator co-processor markets: accelerators will benefit from utility computing (where higher utilization will amortize the hardware investment) and data centers will benefit from accelerators (increased GOPS/$).