Research Areas

The Chair for Embedded Systems is devoted to research in design and architectures for embedded systems.
Current research focuses on multi-core systems, dependability, and low-power design.
The Chair currently has the following research groups:

Approximate Machine Learning Accelerators

In recent years, Artificial Intelligence (AI) and Machine Learning (ML) have grown rapidly, especially deep learning and Deep Neural Networks (DNNs), which have achieved superhuman levels of accuracy in many tasks such as natural language processing, object detection, and speech recognition. However, these accuracy improvements came at the cost of a vast increase in computational demands, leading to elevated and often prohibitive power consumption and latency.

Approximate computing has emerged in recent years as a design and computing paradigm that exploits the inherent error resilience of several application domains: it deliberately introduces some error into the performed computations in exchange for gains in other metrics such as area, power, and performance [1]. Machine learning applications exhibit an intrinsic error tolerance and can produce results of acceptable quality even when their underlying computations are performed approximately. As a result, machine learning is a natural candidate for approximate computing [1].
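The trade-off described above can be illustrated in software. The following is a minimal sketch, not one of the techniques from the cited papers: it assumes a hypothetical approximate multiplier that zeroes the lowest bits of each operand (in hardware, this would shrink the partial-product array) and compares an exact and an approximate dot product, the core operation of a neural network layer. The function names and the example values are illustrative assumptions.

```python
def approx_mul(a: int, b: int, dropped_bits: int = 4) -> int:
    """Multiply two non-negative integers after zeroing the lowest
    `dropped_bits` bits of each operand (hypothetical approximation)."""
    mask = ~((1 << dropped_bits) - 1)
    return (a & mask) * (b & mask)


def exact_mul(a: int, b: int) -> int:
    """Reference multiplier without any approximation."""
    return a * b


def dot(xs, ws, mul):
    """Dot product of two integer vectors using the given multiplier."""
    return sum(mul(x, w) for x, w in zip(xs, ws))


def relative_error(exact: int, approx: int) -> float:
    """Relative deviation of the approximate result from the exact one."""
    return abs(exact - approx) / abs(exact) if exact else 0.0


if __name__ == "__main__":
    # Illustrative 8-bit activations and weights (assumed values).
    activations = [200, 131, 77, 45]
    weights = [90, 17, 250, 33]
    exact = dot(activations, weights, exact_mul)
    for bits in (0, 2, 4):
        approx = dot(activations, weights,
                     lambda a, b: approx_mul(a, b, bits))
        print(bits, approx, round(relative_error(exact, approx), 4))
```

Dropping more bits increases the error. Whether the resulting accuracy loss is acceptable depends on the application, which is the question that approximate accelerator design has to answer.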

In our chair, we focus on the design of approximate ML accelerators targeting ML inference on ultra-resource-constrained devices [2-3] as well as more complex DNN accelerators [4-6]. Specifically, we employ systematic software-hardware co-design approaches to achieve low power and/or high performance while keeping the introduced accuracy loss small [2-6].

References
[1] Giorgos Armeniakos, Georgios Zervakis, Dimitrios Soudris, Jörg Henkel, “Hardware Approximate Techniques for Deep Neural Network Accelerators: A Survey,” ACM Computing Surveys (CSUR), Mar 2022
[2] Giorgos Armeniakos, Georgios Zervakis, Dimitrios Soudris, Mehdi B. Tahoori, Jörg Henkel, “Cross-Layer Approximation For Printed Machine Learning Circuits”, Design, Automation and Test in Europe Conference (DATE'22), Antwerp, Belgium, Mar 14-23 2022. 
[3] Konstantinos Balaskas, Georgios Zervakis, Kostas Siozios, Mehdi B. Tahoori, Jörg Henkel, “Approximate Decision Trees For Machine Learning Classification on Tiny Printed Circuits,” International Symposium on Quality Electronic Design (ISQED'22), 6-8 April 2022.
[4] Georgios Zervakis, Ourania Spantidi, Iraklis Anagnostopoulos, Hussam Amrouch, and Jörg Henkel, “Control Variate Approximation for DNN Accelerators”, Design Automation Conference (DAC), San Francisco, Dec 5-9 2021.
[5] Ourania Spantidi, Georgios Zervakis, Iraklis Anagnostopoulos, Hussam Amrouch, and Jörg Henkel, “Positive/Negative Approximate Multipliers for DNN Accelerators,” in International Conference On Computer Aided Design (ICCAD), Nov 1-5 2021.
[6] Zois-Gerasimos Tasoulas, Georgios Zervakis, Iraklis Anagnostopoulos, Hussam Amrouch, Jörg Henkel, “Weight-Oriented Approximation for Energy-Efficient Neural Network Inference Accelerators,” IEEE Transactions on Circuits and Systems I: Regular Papers (Volume 67, Issue 12), Sep 2020.

Contact
Dr. Georgios Zervakis
Prof. Dr. Jörg Henkel


Resource-Constrained Machine Learning

The number of devices that autonomously interact with the real world is growing fast, not least because of the rapid growth of the internet of things (IoT). Low manufacturing costs and low power/energy consumption are key for manufacturing such devices in large quantities. As a consequence, the devices have very limited resources w.r.t. computational power, energy, storage, or communication.

Simultaneously, there is a strong trend towards employing machine learning-based algorithms on end devices to process input data such as images, videos, or sensor data in general. This comprises not only on-device inference but, more importantly, also on-device online learning. Online on-device learning enables us to build adaptable and customizable devices that can be employed in dynamically changing environments.

In our chair, we focus on how to enable continuous on-device online learning with limited computational, communication, and data resources. This problem can only be solved by considering all levels of the system. Algorithm-level techniques that, for instance, leverage changes in the topologies and learning parameters of neural networks [1] are as relevant as software-level techniques that enable resource-aware learning and hardware-level techniques such as accelerators that use the available resources efficiently. A major use case for such techniques is distributed systems in which many independent resource-constrained devices learn cooperatively, as in federated learning.
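To make the cooperative-learning use case more concrete, the following is a minimal sketch of the aggregation step of standard federated averaging. It is not the method of [1]. Each device is assumed to report its locally trained weight vector together with the number of samples it trained on, and devices that lacked the resources to train in a round are simply left out. The function name and the data layout are illustrative assumptions.

```python
def federated_average(updates):
    """Combine local models into a global model.

    `updates` is a list of (weights, num_samples) pairs, one per device
    that took part in the round. Each weight vector is a list of floats
    of equal length. The result is the sample-weighted average.
    """
    total_samples = sum(n for _, n in updates)
    if total_samples == 0:
        raise ValueError("no training samples reported in this round")
    dim = len(updates[0][0])
    averaged = [0.0] * dim
    for weights, n in updates:
        if len(weights) != dim:
            raise ValueError("all weight vectors must have the same length")
        for i, w in enumerate(weights):
            averaged[i] += w * n / total_samples
    return averaged


if __name__ == "__main__":
    # Two devices took part; a third one skipped the round (assumed scenario).
    round_updates = [([1.0, 2.0], 1), ([3.0, 4.0], 3)]
    print(federated_average(round_updates))
```

Weighting by sample count means that a device which could only process a little data still contributes, but with proportionally less influence on the global model.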

References
[1] Martin Rapp, Ramin Khalili, Kilian Pfeiffer, Jörg Henkel. DISTREAL: Distributed Resource-Aware Learning in Heterogeneous Systems. in Thirty-Sixth AAAI Conference on Artificial Intelligence (AAAI'22), Vancouver, Canada, Feb 22 - Mar 01 2022.

Contact
Dr. Martin Rapp
Kilian Pfeiffer
Prof. Dr. Jörg Henkel


Machine Learning for Resource Management

Sophisticated resource management is becoming a pressing need in modern computing systems, where many connected devices collaborate towards a specific goal and each of these devices may execute several applications. The goal of resource management is to allocate resources to applications while optimizing system properties, e.g., performance, and satisfying system constraints, e.g., temperature. To reach its full optimization potential, resource management needs to exploit relevant knowledge about both parts of the system, hardware and software, in its decision-making process.

The CES has developed techniques that employ analytical models and/or design-time profiling to obtain knowledge about the system [1-3]. However, building analytical models is not always feasible due to the high complexity of software and hardware. Moreover, building a model depending on design-time profiling is possible only for a-priori known applications and cannot cope with runtime variation of application execution.

Machine learning (ML) can tackle these challenges, as we have shown in [4]. Therefore, CES has started to employ and adapt various ML algorithms to support resource management.

Supervised learning has been used in [5] to build a model that predicts the sensitivity of an application's performance and power to voltage and frequency changes. This model is the basis of a smart boosting technique that maximizes performance under a temperature constraint. In [6], supervised learning has also been used to build a model that predicts the slowdown in application execution induced by cache contention between applications running on the same cluster. This model enables a resource management technique that selects an application-to-cluster mapping satisfying the performance constraints of the applications. Imitation learning has been employed in [7] to migrate applications to different cores at run time such that the temperature of the chip is minimized while the QoS targets of the applications are satisfied. This work shows that imitation learning makes it possible to exploit the optimality of an oracle policy at low runtime overhead. The following video summarizes our research activities w.r.t. ML for resource management within the project Invasive Computing:

 


Overview of our research activities w.r.t. ML for resource management within the project Invasive Computing.

Further investigating and adapting ML methods to support resource management is a key focus of research at the CES.
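The idea of model-driven boosting under a temperature constraint can be sketched as follows. This is a simplified illustration and not the technique of [5]: the trained model is replaced by a hypothetical linear temperature predictor, and performance is assumed to grow with frequency, so the highest feasible frequency is chosen. All names and numbers are illustrative assumptions.

```python
def pick_frequency(freqs_ghz, predict_temp_c, temp_limit_c):
    """Return the highest frequency whose predicted temperature stays
    within the limit, or None if no frequency is feasible."""
    feasible = [f for f in freqs_ghz if predict_temp_c(f) <= temp_limit_c]
    return max(feasible) if feasible else None


def toy_temperature_model(f_ghz, ambient_c=45.0, slope_c_per_ghz=12.0):
    """Hypothetical stand-in for a learned model: predicted steady-state
    temperature rises linearly with frequency."""
    return ambient_c + slope_c_per_ghz * f_ghz


if __name__ == "__main__":
    available = [1.0, 1.5, 2.0, 2.5, 3.0]  # assumed frequency levels in GHz
    print(pick_frequency(available, toy_temperature_model, temp_limit_c=80.0))
```

In a real system the predictor would be application-specific and learned from measurements, which is where the supervised models described above come in.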

References
[1] Santiago Pagani, Heba Khdr, Waqaas Munawar, Jian-Jia Chen, Muhammad Shafique, Minming Li, Jörg Henkel, "TSP: Thermal Safe Power - Efficient power budgeting for many-core systems in dark silicon", International Conference on Hardware - Software Codesign and System Synthesis (CODES+ISSS), New Delhi, India, pp. 1-10, 2014.
[2] Heba Khdr, Santiago Pagani, Muhammad Shafique, Jörg Henkel, “Thermal Constrained Resource Management for Mixed ILP-TLP Workloads in Dark Silicon Chips”, in Design Automation Conference (DAC), San Francisco, CA, USA, Jun 7-11 2015.
[3] Heba Khdr, Santiago Pagani, Éricles Sousa, Vahid Lari, Anuj Pathania, Frank Hannig, Muhammad Shafique, Jürgen Teich, Jörg Henkel, “Power density-aware resource management for heterogeneous tiled multicores”, in IEEE Transactions on Computers (TC), Vol.66, Issue 3, Mar 2017.
[4] Martin Rapp, Anuj Pathania, Tulika Mitra, and Jörg Henkel, “Neural Network-based Performance Prediction for Task Migration on S-NUCA Many-Cores”, in IEEE Transactions on Computers (TC), 2020.
[5] Martin Rapp, Mohammed Bakr Sikal, Heba Khdr, and Jörg Henkel. "SmartBoost: Lightweight ML-Driven Boosting for Thermally-Constrained Many-Core Processors" In Design Automation Conference (DAC), pp. 265-270. IEEE, 2021.
[6] Mohammed Bakr Sikal, Heba Khdr, Martin Rapp, and Jörg Henkel. "Thermal- and Cache-Aware Resource Management based on ML-Driven Cache Contention Prediction" In Design, Automation & Test in Europe Conference & Exhibition (DATE). IEEE, 2022.
[7] Martin Rapp, Nikita Krohmer, Heba Khdr, and Jörg Henkel. "NPU-Accelerated Imitation Learning for Thermal- and QoS-Aware Optimization of Heterogeneous Multi-Cores" In Design, Automation & Test in Europe Conference & Exhibition (DATE). IEEE, 2022.

Contact
Dr. Heba Khdr
Dr. Martin Rapp
Mohammed Bakr Sikal
Prof. Dr. Jörg Henkel


Adaptive and Self-Organizing On-Chip Systems

Embedded systems are no longer designed for and dedicated to a specific, narrow use case. Instead, they are often expected to provide a large degree of flexibility. For instance, users of handhelds can download and start new applications on demand. For embedded multi-/many-core systems, these requirements can be addressed at different levels, i.e. the processor level and the system level. The main goal of this group is to investigate adaptivity at these levels in order to:

  • increase the performance to fulfill the user's expectations or constraints
  • increase the efficiency, e.g. ‘performance per area’ or ‘performance per energy’
  • increase the reliability, i.e. the correct functionality of the system needs to be tested regularly and if
    a component is faulty it needs to be repaired (e.g. by using redundancy) or avoided (i.e. not used any more)

At the processor level, this group investigates processor microarchitectures (e.g. caches and execution control) and different types of reconfigurable processors, which allow parts of their hardware to be changed at run time. Reconfigurable processors can be realized with the hardware structures used in FPGAs (field-programmable gate arrays), as these allow parts of the FPGA configuration to be changed at run time. These run-time reconfigurable processors are often used to provide application-specific accelerators on demand. A run-time system decides which accelerators shall be reconfigured, which is especially challenging in multi-tasking and multi-core scenarios. Run-time reconfiguration can also be used to improve the reliability of a system, which is important in environments with high radiation (space missions) and for devices that are built using unreliable manufacturing processes (as observed and predicted for upcoming nanometer technology nodes). Here, run-time reconfiguration can help to test the functionality of the underlying hardware, to correct temporary faults, and to adapt to permanent faults. In addition to microarchitectures and run-time systems, compiler support for reconfigurable processors is also investigated to give programmers convenient access to the concepts of reconfigurable processors.

For multi-/many-core systems such as Intel's Single-chip Cloud Computer (SCC, providing 48 Pentium-like cores on one chip) or even larger systems, the question arises of how these systems can be managed efficiently. For instance, the decisions ‘which application obtains how many cores’ and ‘on which cores it shall execute’ are crucial for performance and efficiency. A run-time system is required that scales with the number of cores and the number of applications that shall execute. Among other things, it needs to consider the communication between the applications, the available memory bandwidth, and the temperature of the cores. Strategies based on distributed components that negotiate with each other (so-called agents) are investigated for maintaining scalability and flexibility.

References
[1] Lars Bauer, Muhammad Shafique, Simon Kramer, Jörg Henkel, "RISPP: rotating instruction set processing platform", ACM/IEEE/EDA 44th. Design Automation Conference (DAC'07), San Diego, CA, USA, pp. 791-796. June 2007.
[2] Mohammad Abdullah Al Faruque, Thomas Ebi, Jörg Henkel, "Run-time adaptive on-chip communication scheme", IEEE/ACM International Conference on Computer-Aided Design (ICCAD'07), San Jose, California, USA, pp. 26-31, Nov. 2007.
[3] Muhammad Shafique, Lars Bauer, Jörg Henkel, "Adaptive Energy Management for Dynamically Reconfigurable Processors", IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems (TCAD), vol. 33, Issue 1, pp. 50-63, Jan. 2014.


 

Check also: CES PhD Thesis