This list is for discussion of the design and implementation of field-programmable gate array based processors and integrated systems. It is also for discussion and community support of the XSOC Project (see http://www.fpgacpu.org/xsoc).
|
(Excuse my poor English.) Mainstream processors cannot deliver more than a few GFlops per chip, yet the technologies needed to build a single-chip supercomputer already exist. Using them, however, means giving up compatibility with current platforms.

Let's get to the problem. A 3 GHz CPU capable of up to three FPU operations per clock can deliver up to 9 GFlops (peak theoretical performance, p.t.p.). With current instruction set architectures we cannot do much better. Array operations and a VLIW instruction set can reach better performance. If we build a 1 GHz VLIW CPU with a 256-bit instruction word, that word can contain up to eight 32-bit instructions packed together. If that CPU has eight array units, each able to work on eight floating-point numbers packed in a 256-bit register, we reach the amazing performance of 64 GFlops (p.t.p.), far better than the previous CPU even though the clock speed is only a third.

The drawback of a VLIW CPU is poor code density, which wastes memory. It also requires a very large L1 cache: at least eight times that of a current CPU to hold the same number of cached instructions. Instruction set compression can save instruction cache space. Suppose you have built that CPU, and then created an OS and a working suite of software as well. Suppose further that all the code fits in a 4 GWord memory (a huge memory of 128 GByte), so that no more than 2^32 different VLIW instructions are used out of the 2^256 possible. This means the CPU can use a hardwired 32-bit to 256-bit instruction decoder. (The set of useful instructions could be selected by a computer.) This way we can create an 8-way VLIW CPU that uses only 32-bit instructions. This improves code density and requires less bandwidth for instruction fetch, leaving more bandwidth for feeding the array units.

Integrating four CPUs like this on one chip and raising the clock to 4 GHz, we reach a p.t.p. of 1 TFlops!! This is the conventional way to step beyond the GFlops scale.
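The peak figures quoted above follow from clock rate times floating-point operations per clock times number of CPUs. A minimal sketch of that arithmetic follows; `peak_gflops` and its parameter names are hypothetical helpers introduced here for illustration, not something from the post.

```python
def peak_gflops(clock_ghz, flops_per_clock, cpus=1):
    """Peak theoretical GFlops = clock (GHz) x FP operations per clock x CPUs."""
    return clock_ghz * flops_per_clock * cpus

# Mainstream CPU: 3 GHz, up to three FPU operations per clock.
mainstream = peak_gflops(3, 3)

# VLIW CPU: 1 GHz, eight array units, each working on eight packed floats.
vliw = peak_gflops(1, 8 * 8)

# Four such CPUs on one chip, clocked at 4 GHz.
quad_vliw = peak_gflops(4, 8 * 8, cpus=4)

# Fetch cost of a 256-bit instruction word relative to a 32-bit one,
# which is where the "eight times larger L1 cache" estimate comes from.
fetch_ratio = 256 // 32

print(mainstream, vliw, quad_vliw, fetch_ratio)
```

These are peak numbers only: they assume every unit issues a floating-point operation on every clock, which is the same assumption the post makes.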
I'm sure there must be at least one other way to reach that speed. But how? Many supercomputer applications execute a short loop of instructions a large number of times. Hence we can create an 'algorithm unit' capable of executing the inner loop in only one clock cycle. If the loop is about one hundred instructions long, we reach 100 GFlops at 1 GHz. The new idea is to execute a large number of sequential instructions at the same time in a programmable pipeline. If we can feed data to this customizable pipeline at one input per clock, we can say that the unit executes all of the loop's instructions in a single clock. So if the 'algorithm unit' contains five hundred instructions, we reach a p.t.p. of 1 TFlops at 2 GHz. Obviously we need a CPU that can feed our 'algorithm unit'.

The problem now is how to make an 'algorithm unit'. My idea is to build a grid of processing elements (PEs), each composed of an FP ALU, four data registers (R0-R3), four I/O links (mapped as registers R4-R7) and a program register (PR). The PR contains the instruction that the processing element must execute. Each processing element has a one-instruction program; no longer programs are needed. The program is a 32-bit, three-field instruction: the first field is the test, and the other two are the operations to execute depending on the outcome of the test. The instruction looks like:

IF Ra (=|<|>|!=) Rb THEN Rc = Rd (+|-|*|/|...) Re ELSE Rf = Rg (+|-|*|/|...) Rh

Reading from an input (R4-R7) requires an interlock: the processing element must wait until the element to which it is connected writes to the link. Each processing element must execute every instruction in the same amount of time. The connected CPU must feed the 'algorithm unit' and collect the results, so it must be able to read and write each data register and the program register of every PE in the unit. A 32x32 toroidal grid (1024 PEs) easily holds a five-hundred-instruction loop.
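To make the one-instruction processing element concrete, here is a minimal software sketch of a single PE executing its IF/THEN/ELSE instruction. The class name, the tuple layout of the instruction, and the choice of operators are assumptions made for illustration; the post does not specify an encoding. The link interlock on R4-R7 is deliberately left out, so those registers behave like ordinary data registers here.

```python
import operator

# Comparison and ALU operations named in the instruction format.
CMP = {"=": operator.eq, "<": operator.lt, ">": operator.gt, "!=": operator.ne}
ALU = {"+": operator.add, "-": operator.sub, "*": operator.mul, "/": operator.truediv}

class ProcessingElement:
    """One PE: eight registers and a single instruction held in PR (hypothetical model)."""

    def __init__(self, program):
        self.regs = [0.0] * 8   # R0-R3 data; R4-R7 stand in for the four I/O links
        self.program = program  # ((ra, cmp, rb), (rc, rd, op, re), (rf, rg, op, rh))

    def step(self):
        """Execute the single instruction once."""
        (ra, cmp, rb), then_part, else_part = self.program
        taken = then_part if CMP[cmp](self.regs[ra], self.regs[rb]) else else_part
        dst, src1, op, src2 = taken
        self.regs[dst] = ALU[op](self.regs[src1], self.regs[src2])

# IF R0 < R1 THEN R2 = R0 + R1 ELSE R2 = R0 - R1
pe = ProcessingElement(((0, "<", 1), (2, 0, "+", 1), (2, 0, "-", 1)))

pe.regs[0], pe.regs[1] = 2.0, 5.0
pe.step()
then_result = pe.regs[2]   # test true: R2 = 2.0 + 5.0

pe.regs[0] = 9.0
pe.step()
else_result = pe.regs[2]   # test false: R2 = 9.0 - 5.0

print(then_result, else_result)
```

A fuller model would replace R4-R7 with blocking queues shared between neighbouring PEs, so that a read stalls until the neighbour has written.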
Now I'm asking whether someone will take up the quest to design a 1 TFlops CPU...

Surmolotto
|
> so not more
> than 2^32 different VLIW instructions are used of the 2^256
> possible. This mean that the CPU can use an hardwired 32 to
> 256 bit instruction decoder.

This decoder will implement a boolean function with 32 inputs and 256 outputs. It can be proven that at least half of the functions in that category cannot be implemented in anything smaller than a 128 GByte ROM. You need a mapping function regular enough to be one of the very few functions that can be implemented in much less than exponential size.

Regards,
Kolja Sulimma
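The 128 GByte figure is simply the size of a full lookup table with 2^32 entries of 256 bits each. The sketch below shows the table-based (dictionary) form of the 32-to-256-bit decoder on a toy program and computes the size of the full table. The function names and the sample instruction words are hypothetical, chosen only for illustration.

```python
def build_dictionary(wide_words):
    """Give each distinct wide instruction word a short index.

    Returns the table (index -> wide word, i.e. the decoder ROM contents)
    and the program re-encoded as a list of indices.
    """
    table = []
    index_of = {}
    encoded = []
    for word in wide_words:
        if word not in index_of:
            index_of[word] = len(table)
            table.append(word)
        encoded.append(index_of[word])
    return table, encoded

def rom_bytes(index_bits, word_bits):
    """Size in bytes of a ROM holding one word for every possible index."""
    return (2 ** index_bits) * word_bits // 8

# Toy program: placeholder values standing in for 256-bit VLIW words.
program = [0xAAAA, 0xBBBB, 0xAAAA, 0xCCCC, 0xBBBB]
table, encoded = build_dictionary(program)
decoded = [table[i] for i in encoded]   # what the decoder would emit

full_rom_gib = rom_bytes(32, 256) // 2**30

print(encoded, decoded == program, full_rom_gib)
```

A table like this works for any choice of instruction set, which is exactly why it is so large; a much smaller decoder is only possible when the mapping has the kind of regular structure described in the reply.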