Automatic Code-Generation Techniques for Micro-Threaded RISC Architectures
The gap between processor and memory speeds has widened steadily, resulting in a 'memory wall' where memory-access time dominates performance. To counter this, architectures that use many very small threads, allowing multiple memory accesses to proceed in parallel, have been under investigation. Examples of these architectures are the CARE (Compiler Aided Reorder Engine) architecture, micro-threading architectures and cellular architectures, such as the IBM Cyclops family, implemented using processors-in-memory (PIM), which is the main architecture discussed in this thesis. PIM architectures achieve high performance by increasing the bandwidth of processor-to-memory communication and reducing its latency, through the use of many processors placed physically close to main memory. These massively parallel architectures may have sophisticated memory models, and I contend that it remains an open question what the ideal approach to implementing parallelism with many threads is from the programmer's perspective. Should the implementation be at the language level, as in UPC, HPF or other language extensions, or within the compiler, using trace scheduling? Or should it be at the library level, for example OpenMP or POSIX threads? Or perhaps within the architecture, as in designs derived from data-flow architectures? In this thesis, DIMES (the Delaware Iterative Multiprocessor Emulation System), which is being developed by CAPSL at the University of Delaware, was used as a hardware evaluation tool for such cellular architectures. As the programming example, I chose a threaded Mandelbrot-set generator with a work-stealing algorithm to evaluate the DIMES cthread programming model. This implementation was used to identify problems that may occur when attempting to implement massive numbers of very short-lived threads.