Project Activities

Background

Modern embedded systems (ES) and high-performance computing (HPC) systems must operate under ever more similar requirements and constraints, which has pushed both domains toward increasingly heterogeneous systems. While ES are put under increasing load by complex applications, HPC systems are struggling to remain power-efficient as they scale [10]. The use of specialized and potentially self-adaptive hardware that targets specific, very demanding tasks has been identified as a possible solution to these issues in both domains [9]. Many applications in both domains have a small number of regular computational kernels that account for most of the execution time and energy consumption. Developing accelerators for these kernels manually requires significant design time and hardware expertise, so it is vital not to compromise developer productivity by demanding manual hardware development and source code alterations.

The PEPCC project grew out of the converging scientific work of its team members, which collectively addresses how to design systems for the ES and HPC domains that provide computation that is both fast and power-efficient, by means of heterogeneous computing architectures and the compilation methods that target them [1-8].

Objectives and Research Plan

PEPCC aims to address the highlighted issues regarding efficient computation by further exploring the research paths of its team members and organizing those efforts into a convergent, coherent overarching research plan. Given the convergence of requirements between ES and HPC, the mapping of computations to specialized or reconfigurable circuits at runtime (dynamic mapping) is of growing importance as a research goal.
An efficient implementation of this capability will allow the hardware resources of reconfigurable circuits to be exploited by any application running in the system, addressing both power-efficiency and performance concerns and contributing to code portability and performance scalability. If effective and efficient, runtime mapping schemes will increase productivity by offloading demanding computation to specialized hardware with minimal developer intervention, based on the actual application workload. Exploiting runtime information is paramount to fulfilling this goal.

The project is divided into several activities, whose joint goal is to improve the performance and power efficiency of regular computational kernels in ES and HPC systems through efficient runtime mapping of computations to specialized coarse-grained reconfigurable arrays (CGRAs), while remaining as transparent as possible to the application developer.

This will be accomplished by researching advanced binary trace analysis methods for Just-in-Time (JIT) hardware generation and (re-)configuration. Information from this process will also be used to dynamically adapt existing CGRAs by hardware modification, or to select existing specialized CGRAs based on the workload to process. To study solutions that balance specialization and programmability, the project will also further develop the existing Versat architecture [5]. Finally, the binary analysis and hardware generation methods developed in these activities will require studying how they can be exploited and integrated into ES and HPC systems.

A1. Advanced Trace Analysis for JIT Hardware Generation

Extracting more information from binary traces will enable the identification and exploitation of more parallelism than is currently exposed by static analysis.
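As a rough illustration of why runtime information can expose parallelism that static analysis must conservatively rule out, the sketch below scans a hypothetical memory-access trace and reports loop-carried read-after-write distances. The `Access` record, the two trace generators, and all function names are assumptions made for this example only; they are not the project's trace representation or its dependency-detection algorithm, and only read-after-write dependencies are considered.

```python
from typing import NamedTuple


class Access(NamedTuple):
    """One memory access observed in an execution trace (hypothetical format)."""
    iteration: int  # loop iteration in which the access occurred
    kind: str       # "load" or "store"
    address: int    # effective address observed at runtime


def carried_dependencies(trace):
    """Return the set of loop-carried read-after-write distances in the trace.

    A distance d means some iteration loaded an address whose most recent
    store happened d iterations earlier.
    """
    last_store = {}    # address -> iteration of the most recent store
    distances = set()
    for acc in trace:
        if acc.kind == "store":
            last_store[acc.address] = acc.iteration
        elif acc.kind == "load":
            writer = last_store.get(acc.address)
            if writer is not None and writer < acc.iteration:
                distances.add(acc.iteration - writer)
    return distances


def trace_of_recurrence(base, n, lag, word=4):
    """Trace of a loop computing a[i] = f(a[i - lag]) for i in lag..n-1."""
    trace = []
    for i in range(lag, n):
        trace.append(Access(i, "load", base + (i - lag) * word))
        trace.append(Access(i, "store", base + i * word))
    return trace


def trace_of_map(src, dst, n, word=4):
    """Trace of a loop computing b[i] = f(a[i]) on two disjoint arrays."""
    trace = []
    for i in range(n):
        trace.append(Access(i, "load", src + i * word))
        trace.append(Access(i, "store", dst + i * word))
    return trace


if __name__ == "__main__":
    print(carried_dependencies(trace_of_recurrence(0x1000, 8, 2)))
    print(carried_dependencies(trace_of_map(0x1000, 0x2000, 8)))
```

For the recurrence trace the function returns `{2}`, meaning each iteration depends on the one two steps before it. For the map trace it returns an empty set: the observed addresses never overlap, so the iterations could be overlapped or pipelined freely, even if a static analyzer could not prove that the two pointers never alias.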
Binary traces called Megablocks [2,3] will be used as the starting point for representing knowledge about repetitive instruction traces; the model will be expanded to include the additional extracted information. The project will address the following specific issues: loop dependency analysis, memory access disambiguation and detection of streaming access patterns, and data specialization.

Milestones
M1.1: Algorithm for constrained loop-carried dependency detection from binary traces
M1.2: Algorithm for memory access disambiguation and detection of streaming memory access patterns
M1.3: Algorithm for data specialization of Megablocks
M1.4: Experimental evaluation of proposed methods

A2. Customized CGRAs for Dynamic Mapping of Computations

The project will develop CGRA architectures that efficiently support dynamic mapping of computations, together with the corresponding scheduling and hardware compilation algorithms and implementations. The aim is to develop CGRA architectures that are suitable targets for JIT hardware generation and (re-)configuration. Two strategies will be employed: 1) designing an architecture for FPGA devices in which a fixed set of computing resources is complemented by Dynamic Partial Reconfiguration to add or modify specialized hardware, and 2) improving existing CGRA architectures with multistage interconnection networks to allow for more versatile data flow at an affordable hardware overhead.

Milestones
M2.1: CGRA Generation 1 - Module for dynamic mapping of Megablocks to customized 1D CGRAs
M2.2: Prototype validation of M2.1
M2.3: CGRA Generation 2 - Enhancement for memory disambiguation and data reuse
M2.4: Prototype validation of M2.3

A3. Programmable CGRA Architecture for Fast Reconfiguration

We aim to improve existing CGRA architectures, namely Versat [5], by employing the results of A1 regarding runtime binary information.
Specifically, improving Versat requires: 1) extracting loop parameters, 2) sequencing configurations in order to reduce CGRA reconfiguration time and hide data movement by pre-fetching, and 3) exposing data-level parallelism, to enhance acceleration beyond pipelining. Another aspect that can be explored with the Versat controller is Thread-Level Parallelism (TLP). Small independent datapaths can be set to run within Versat, since the four dual-port embedded memories present in the architecture have independent address generation units for each port.

Milestones
M3.1: Extracting high-level Intermediate Representations (IRs) from memory access patterns and dependencies
M3.2: Translation of IR into Versat assembly code
M3.3: Integration into HPC and ES environments

A4. Run-time Management for HPC Systems

We will exploit the opportunities offered by access to an HPC system whose nodes have an experimental Intel chip that combines a 12-core Xeon and a large Intel Arria 10 FPGA, a powerful platform that will enable us to measure the impact of the proposed approach in a live environment. The objective is to implement a hardware accelerator infrastructure, based on tailored CGRAs, together with run-time management algorithms that support N customized accelerators, dynamically choosing the most appropriate hardware to execute heavy workloads and mapping the binary representations (Megablocks) of the workload to the chosen hardware. This approach will be compared with the use of the Versat programmable CGRA.

Milestones
M4.1: Creation of a CGRA library based on binary trace analysis
M4.2: Runtime management - Use fixed policies for accelerator (i.e., CGRA) selection
M4.3: Runtime management - Use guided policies for accelerator (i.e., CGRA) selection
M4.4: Experimental evaluation of performance and power efficiency vs.
static accelerators

A5. Run-time Management for Embedded Systems

This activity addresses the implementation of the complete runtime mapping infrastructure for embedded systems. We will follow a hardware/software co-design approach, analyzing which components need to be implemented as custom hardware and which ones can be implemented by simple processors (such as a MicroBlaze or an embedded ARM). Extracting trace information at runtime requires processing large amounts of data with limited resources. Our approach will therefore be centered on stream-based versions of the algorithms developed in A1, in order to cope with system constraints such as the storage of execution traces and intermediate representations. The algorithms will have to operate on a limited scope, using local memories that store windows of execution traces.

Milestones
M5.1: Embedded Runtime Management 1
M5.2: Embedded Runtime Management 2 - Versat Support
M5.3: HLS/GPU Accelerators
M5.4: Experimental evaluation of accelerators

[1] N. Paulino, J. C. Ferreira, and J. M. P. Cardoso, “Dynamic Partial Reconfiguration of Customized Single-Row Accelerators,” IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 27, iss. 1, pp. 116-125, 2019. doi: 10.1109/TVLSI.2018.2874079
[2] J. Bispo and J. M. P. Cardoso, “On Identifying and Optimizing Instruction Sequences for Dynamic Compilation,” in 2010 International Conference on Field-Programmable Technology (FPT), 2010, pp. 437-440. doi: 10.1109/FPT.2010.5681454
[3] J. Bispo and J. M. P. Cardoso, “On Identifying Segments of Traces for Dynamic Compilation,” in 2010 International Conference on Field Programmable Logic and Applications (FPL), 2010, pp. 263-266. doi: 10.1109/FPL.2010.61
[4] N. Paulino, J. C. Ferreira, and J. M. P. Cardoso, “Generation of Customized Accelerators for Loop Pipelining of Binary Instruction Traces,” IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 25, iss. 1, pp. 21-34, 2017. doi: 10.1109/TVLSI.2016.2573640
[5] J. D. Lopes and J. T. de Sousa, “Versat, a Minimal Coarse-Grain Reconfigurable Array,” in High Performance Computing for Computational Science – VECPAR 2016, Cham, 2017, pp. 174-187. doi: 10.1007/978-3-319-61982-8_17
[6] N. Paulino, J. C. Ferreira, and J. M. P. Cardoso, “A Reconfigurable Architecture for Binary Acceleration of Loops with Memory Accesses,” ACM Transactions on Reconfigurable Technology and Systems (TRETS), vol. 7, iss. 4, pp. 29:1-29:20, 2014. doi: 10.1145/2629468
[7] J. Bispo, N. Paulino, J. M. P. Cardoso, and J. C. Ferreira, “Transparent Trace-Based Binary Acceleration for Reconfigurable HW/SW Systems,” IEEE Transactions on Industrial Informatics, vol. 9, iss. 3, pp. 1625-1634, 2013. doi: 10.1109/TII.2012.2235844
[8] N. Paulino, J. C. Ferreira, and J. M. P. Cardoso, “Architecture for Transparent Binary Acceleration of Loops with Memory Accesses,” in Reconfigurable Computing: Architectures, Tools and Applications, Berlin, Heidelberg, 2013, pp. 122-133. doi: 10.1007/978-3-642-36812-7_12
[9] M. Duranton, K. De Bosschere, C. Gamrat, J. Maebe, H. Munk, and O. Zendra, HiPEAC Vision 2017 – HiPEAC High-Performance Embedded Architecture and Compilation, 2017. ISBN 978-90-9030182-2.
[10] M. Malms, J. Nominé, and M. Ostasz, ETP4HPC Computing Strategic Research Agenda 2015 Update – HiPEAC High-Performance Embedded Architecture and Compilation, 2015.