The time zone for all times mentioned at the DATE website is CEST – Central European Summer Time (UTC+2). AoE = Anywhere on Earth.
DATE 2023 Detailed Programme
The detailed programme of DATE 2023 will continuously be updated.
More information on ASD Initiative, Keynotes, Tutorials, Workshops, Young People Programme
Monday, 17 April 2023
OC Opening Ceremony
Date: Monday, 17 April 2023
Time: 08:30 CET - 09:00 CET
Location / Room: Queen Elisabeth Hall
Session chair:
Ian O’Connor, Ecole Centrale de Lyon, FR
Session co-chair:
Robert Wille, Technical University of Munich, DE, Jürgen Teich, Friedrich-Alexander-Universität Erlangen-Nürnberg (FAU), DE
Time | Label | Presentation Title / Authors |
---|---|---|
08:30 CET | OC.1 | WELCOME ADDRESSES Speaker: Ian O'Connor and Robert Wille, DATE, BE Authors: Ian O'Connor1 and Robert Wille2 1Lyon Institute of Nanotechnology, FR; 2TU Munich, DE Abstract Welcome Addresses from DATE 2023 Chairs |
08:40 CET | OC.2 | PRESENTATION OF AWARDS Speaker: David Atienza, Georges Gielen and Yervant Zorian, DATE, BE Authors: David Atienza1, Georges Gielen2 and Yervant Zorian3 1EPFL, CH; 2KU Leuven, BE; 3Synopsys, US Abstract Presentation of Awards from Chairs |
OK1 Opening Keynote 1
Date: Monday, 17 April 2023
Time: 09:00 CET - 09:45 CET
Location / Room: Queen Elisabeth Hall
Time | Label | Presentation Title / Authors |
---|---|---|
09:00 CET | OK1.1 | BUILDING THE METAVERSE: AUGMENTED REALITY APPLICATIONS AND INTEGRATED CIRCUIT CHALLENGES Presenter: Edith Beigné, Meta Reality Labs, US Author: Edith Beigné, Meta Reality Labs, US Abstract Augmented reality is a set of technologies that will fundamentally change the way we interact with our environment. It represents a merging of the physical and the digital worlds into a rich, context-aware and accessible user interface delivered through a socially acceptable form factor such as eyeglasses. One of the biggest challenges in realizing a comprehensive AR experience is meeting the performance and form-factor requirements, which calls for new custom silicon. Innovations are mandatory to manage power consumption constraints and ensure both adequate battery life and a physically comfortable thermal envelope. This presentation reviews Augmented Reality and Virtual Reality applications and silicon challenges. |
OK2 Opening Keynote 2
Date: Monday, 17 April 2023
Time: 09:45 CET - 10:30 CET
Location / Room: Queen Elisabeth Hall
Time | Label | Presentation Title / Authors |
---|---|---|
09:45 CET | OK2.1 | THE CYBER-PHYSICAL METAVERSE – WHERE DIGITAL TWINS AND HUMANS COME TOGETHER Presenter: Dirk Elias, Robert Bosch GmbH, DE Author: Dirk Elias, Robert Bosch GmbH, DE Abstract The concept of Digital Twins (DTs) has been discussed intensively for the past couple of years. Today we have instances of digital twins that range from static descriptions of manufacturing data and material properties to live interfaces to operational data of cyber physical systems and the functions and services they provide. Currently, there are no standardized interfaces to aggregate atomic DTs (e.g., the twin of the lowest-level function of a machine) to higher-level DTs providing more complex services in the virtual world. Additionally, there is no existing infrastructure to reliably link the DTs in the virtual world to the integrated CPSs in the real world (like a car consisting of many ECUs with even more functions). This keynote will address how the Metaverse can become the virtual world where DTs of humans and machines live and how to reliably connect DTs to the physical world. Insights in current activities of Bosch Research and its academic partners to move towards this vision will be provided. |
ASD1 ASD technical session: Designing fault-tolerant and resilient autonomous systems
Date: Monday, 17 April 2023
Time: 11:00 CET - 12:30 CET
Location / Room: Gorilla Room 1.5.4/5
Session chair:
Selma Saidi, TU Dortmund, DE
Time | Label | Presentation Title / Authors |
---|---|---|
11:00 CET | ASD1.1 | MAVFI: AN END-TO-END FAULT ANALYSIS FRAMEWORK WITH ANOMALY DETECTION AND RECOVERY FOR MICRO AERIAL VEHICLES Speaker: Yu-Shun Hsiao, Harvard University, US Authors: Yu-Shun Hsiao1, Zishen Wan2, Tianyu Jia3, Radhika Ghosal1, Abdulrahman Mahmoud1, Arijit Raychowdhury2, David Brooks1, Gu-Yeon Wei4 and Vijay Janapa Reddi1 1Harvard University, US; 2Georgia Tech, US; 3Peking University, CN; 4Harvard University / Samsung, US Abstract Safety and resilience are critical for autonomous unmanned aerial vehicles (UAVs). We introduce MAVFI, the micro aerial vehicles (MAVs) resilience analysis methodology to assess the effect of silent data corruption (SDC) on UAVs' mission metrics, such as flight time and success rate, for accurately measuring system resilience. To enhance the safety and resilience of robot systems bound by size, weight, and power (SWaP), we offer two low-overhead anomaly-based SDC detection and recovery algorithms based on Gaussian statistical models and autoencoder neural networks. Our anomaly error protection techniques are validated in numerous simulated environments. We demonstrate that the autoencoder-based technique can recover up to all failure cases in our studied scenarios with a computational overhead of no more than 0.0062%. Our application-aware resilience analysis framework, MAVFI, can be utilized to comprehensively test the resilience of other Robot Operating System (ROS)-based applications and is publicly available at https://github.com/harvard-edge/MAVBench/tree/mavfi. |
11:22 CET | ASD1.2 | PHALANX: FAILURE-RESILIENT TRUCK PLATOONING SYSTEM Speaker: Taewook Ahn, Kookmin University, KR Authors: Changjin Koo1, Jaegeun Park2, Taewook Ahn1, Hongsuk Kim1, Jong-Chan Kim1 and Yongsoon Eun2 1Kookmin University, KR; 2DGIST, KR Abstract We introduce Phalanx, a failure-resilient truck platooning system, where trucks in a platoon protect each other from sensor failures despite the lack of redundant sensors. For that, we first emulate the failed sensors by collectively utilizing other sensors across the platoon. If the failed sensor cannot be emulated, the control system is instantaneously reconfigured to a cooperative protection mode using only the live sensors. We take a scenario-based approach considering six scenarios with single and dual failures of the essential sensors (i.e., lidar, encoder, and camera) for platooning control. For each scenario, we present a protection method that enables the safe maneuvering of platoons. For the evaluation, Phalanx is implemented using our scale truck testbed instrumented with fault injection modules, demonstrating safe platooning controls for the failure scenarios. |
11:45 CET | ASD1.3 | EFFICIENT SOFTWARE-IMPLEMENTED HW FAULT TOLERANCE FOR TINYML INFERENCE IN SAFETY-CRITICAL APPLICATIONS Speaker: Uzair Sharif, TU Munich, DE Authors: Uzair Sharif, Daniel Mueller-Gritschneder, Rafael Stahl and Ulf Schlichtmann, TU Munich, DE Abstract TinyML research has mainly focused on optimizing neural network inference in terms of latency, code-size and energy-use for efficient execution on low-power micro-controller units (MCUs). However, distinctive design challenges emerge in safety-critical applications, for example in small unmanned autonomous vehicles such as drones, due to the susceptibility of off-the-shelf MCU devices to soft-errors. We propose three new techniques to protect TinyML inference against random soft errors with the target to reduce run-time overhead: one for protecting fully-connected layers; one adaptation of existing algorithmic fault tolerance techniques to depth-wise convolutions; and an efficient technique to protect the so-called epilogues within TinyML layers. Integrating these layer-wise methods, we derive a full-inference hardening solution for TinyML that achieves run-time efficient soft-error resilience. We evaluate our proposed solution on MLPerf-Tiny benchmarks. Our experimental results show that competitive resilience can be achieved compared with currently available methods, while reducing run-time overheads by ~120% for one fully-connected neural network (NN); ~20% for the two CNNs with depth-wise convolutions; and ~2% for standard CNN. Additionally, we propose selective hardening which reduces the incurred run-time overhead further by ~2x for the studied CNNs by focusing exclusively on avoiding mispredictions. |
12:07 CET | ASD1.4 | FORMAL ANALYSIS OF TIMING DIVERSITY FOR AUTONOMOUS SYSTEMS Speaker: Anika Christmann, TU Braunschweig, DE Authors: Anika Christmann, Robin Hapka and Rolf Ernst, TU Braunschweig, DE Abstract The design of autonomous systems, such as for automated driving and avionics, is challenging due to high performance requirements combined with high criticality. Complex applications demand the full performance of commercial off-the-shelf (COTS) high-performance multi-core systems, with or without accelerators. While these systems are optimized for performance, hard real-time requirements and deterministic timing behavior are major constraints for safety-critical systems. Unfortunately, infrequent timing outliers caused by interleaved hardware-software effects of COTS systems complicate traditional worst-case design. This conflict often prohibits deploying COTS hardware and consequently prevents sophisticated applications, too. Recently, an approach called Timing Diversity was introduced, which proposes to exploit existing dual modular redundant hardware platforms to mask deadline violations. This paper puts Timing Diversity on a theoretical foundation and provides specifications for different implementations. It demonstrates that Timing Diversity needs fast recovery to be effective, proposes a recovery strategy and provides a mathematical model for the reliability of the resulting system. Using experimental data in a Linux-based system, it shows that fast recovery is useful, making Timing Diversity a realistic option for compute-demanding hard real-time applications. |
FS1 Focus session: Embracing uncertainty and exploring non-determinism for efficient implementations of Machine Learning models
Date: Monday, 17 April 2023
Time: 11:00 CET - 12:30 CET
Location / Room: Okapi Room 0.8.1
Session chair:
Lorena Anghel, SPINTEC, Grenoble INP – University Grenoble Alpes, FR
Time | Label | Presentation Title / Authors |
---|---|---|
11:00 CET | FS1.1 | BINARY RERAM-BASED BNN FIRST-LAYER IMPLEMENTATION Speaker: Mona Ezzadeen, CEA-Leti, FR Authors: Mona Ezzadeen1, Atreya Majumdar1, Sigrid Thomas1, Jean-Philippe Noel1, Bastien Giraud1, Marc Bocquet2, Francois Andrieu1, Damien Querlioz3 and Jean-Michel Portal4 1CEA, FR; 2IM2NP - Aix-Marseille University, FR; 3C2N - CNRS, FR; 4Aix-Marseille University, FR Abstract The deployment of Edge AI requires energy-efficient hardware with a minimal memory footprint to achieve optimal performance. One approach to meet this challenge is the use of Binary Neural Networks (BNNs) based on non-volatile in-memory computing (IMC). In recent years, elegant ReRAM-based IMC solutions for BNNs have been developed, but they do not extend to the first layer of a BNN, which typically requires non-binary activations. In this paper, we propose a modified first layer architecture for BNNs that uses k-bit input images broken down into k binary input images with associated fully binary convolution layers and an accumulation layer with fixed weights of {2^{-1}, ..., 2^{-k}}. To further increase energy efficiency, we also propose reducing the number of operations by truncating 8-bit RGB pixel code to the 4 most significant bits (MSB). Our proposed architecture only reduces network accuracy by 0.28% on the CIFAR-10 task compared to a BNN baseline. Additionally, we propose a cost-effective solution to implement the weighted accumulation using successive charge sharing operations on an existing ReRAM-based IMC solution. This solution is validated through functional electrical simulations. |
11:30 CET | FS1.2 | SCALABLE SPINTRONICS-BASED BAYESIAN NEURAL NETWORK FOR UNCERTAINTY ESTIMATION Speaker: Soyed Tuhin Ahmed, Karlsruhe Institute of Technology, DE Authors: Soyed Ahmed1, Kamal Danouchi2, Michael Hefenbrock3, Guillaume PRENAT4, Lorena Anghel5 and Mehdi Tahoori1 1Karlsruhe Institute of Technology, DE; 2University Grenoble Alpes, CEA, CNRS, Grenoble INP, IRIG-Spintec Laboratory, FR; 3RevoAI GMBH, DE; 4University Grenoble Alpes, CEA, CNRS, Grenoble INP, FR; 5Grenoble-Alpes University, Grenoble, France, FR Abstract Typical neural networks are incapable of effectively estimating prediction uncertainty, leading to overconfident predictions. Estimating uncertainty is crucial for safety-critical tasks such as autonomous vehicle driving and medical diagnosis and treatment. Bayesian Neural Networks (BayNNs), which combine the capabilities of neural networks and Bayesian inference, are an effective approach for uncertainty estimation. However, BayNNs are computationally demanding and necessitate substantial memory resources. Computation-in-memory (CiM) architectures utilizing emerging resistive non-volatile memories such as Spin-Orbit Torque (SOT) have been proposed to increase the resource efficiency of traditional neural networks. However, training scalable and efficient BayNNs and implementing them in the CiM architecture presents its own challenges. In this paper, we propose a scalable Bayesian NN framework via Subset-Parameter inference and its Spintronic-based CiM implementation. Our method is evaluated on large datasets and topologies to show that it can achieve comparable accuracy while still being able to estimate uncertainty efficiently at up to 70X lower power consumption and 158.7X lower storage memory requirements. |
12:00 CET | FS1.3 | COUNTERING UNCERTAINTIES IN IN-MEMORY-COMPUTING PLATFORMS WITH STATISTICAL TRAINING, ACCURACY COMPENSATION AND RECURSIVE TEST Speaker: Bing Li, TU Munich, DE Authors: Amro Eldebiky1, Grace Li Zhang2 and Bing Li1 1TU Munich, DE; 2TU Darmstadt, DE Abstract In-memory computing (IMC) has become an efficient solution for implementing neural networks on hardware. However, IMC platforms require that parameters such as weights in neural networks are programmed to exact values. This is a very demanding task due to programming complexity and variations. Accordingly, new methods should be introduced to counter such uncertainties. In this talk, we will first discuss a method to train neural networks statistically with variations modeled as correlated random variables. The statistical effect is incorporated into the cost function during training. Consequently, a neural network after statistical training becomes robust to uncertainties. To deal with variations and noise further, we also introduce a compensation method with weight constraints and extra layers for neural networks. These extra layers are trained after the weights in the original neural network are determined to enhance the inference accuracy. Finally, we discuss a method for testing the effect of process variations in an optical acceleration platform for neural networks. This optical platform uses Mach-Zehnder interferometers (MZIs) to implement the multiply–accumulate operations. However, trigonometric functions in the transformation matrix of an MZI make it very sensitive to variations. To address this problem, we apply a recursive test procedure to determine the properties of MZIs inside an optical acceleration module, so that process variations can be compensated accordingly to maintain the accuracy of neural networks. |
FS2 Focus session: Open-source hardware technologies
Date: Monday, 17 April 2023
Time: 11:00 CET - 12:30 CET
Location / Room: Gorilla Room 1.5.3
Session chair:
Giovanni De Micheli, EPFL, CH
The session will discuss perspectives on the future and transformative implications of open-source hardware technologies.
Time | Label | Presentation Title / Authors |
---|---|---|
11:00 CET | FS2.1 | DEMOCRACY OF SILICON AND INTELLIGENT EDGE Speaker and Author: Naveed Sherwani, RapidSilicon, US Abstract The open-source movement has already tremendously shaken our industry with broad initiatives such as the RISC-V ISA. One of the most remarkable effects of open source is the ability to collaborate at a broad, borderless scale and to foster innovation and education worldwide - a true democracy of silicon. New levels of innovation are expected to meet upcoming intelligent-edge requirements, where the data deluge will have to be handled locally at the sensors with minimal energy requirements. Participation of the largest nation is extremely important to this mission, but enabling larger engagement through education opportunities should be a top priority to create our industry's workforce of tomorrow. |
11:30 CET | FS2.2 | PULP: 10 YEARS OF OPEN SOURCE HARDWARE Speaker and Author: Frank Gürkaynak, ETH Zurich, CH Abstract The Parallel Ultra Low Power (PULP) Platform project kicked off in a small office almost exactly 10 years ago. We wanted to work on energy-efficient computer architectures and realized that we needed the help and cooperation of a larger community if we were to be successful as an academic institution. This is why we had open source as a cornerstone of our project. 10 years and more than 50 ASICs later, open-source hardware is no longer seen as an enthusiast's dream or an academic curiosity, but has established itself in the business plans of companies big and small, as well as receiving funding from governments. I have been lucky enough to have witnessed some of the key events of this development, and in this talk I want to share a bit of this history as seen from our side and provide some insights into the developments we can expect in the near future. |
12:00 CET | FS2.3 | OPENFPGA: BRINGING OPEN-SOURCE HARDWARE TO FPGAS Speaker and Author: Pierre-Emmanuel Gaillardon, University of Utah, US Abstract In this talk, we will introduce the OpenFPGA framework whose aim is to generate highly-customizable Field Programmable Gate Array (FPGA) fabrics and their supporting EDA flows. Following in the footsteps of the RISC-V initiative, OpenFPGA brings reconfigurable logic into the open-source community and closes the performance gap with commercial products. OpenFPGA strongly incorporates physical design automation at its core and enables the generation of FPGA fabrics with 100k+ look-up tables, from specification to layout, in less than 24 hours with a single engineer's effort. |
LBR1 Late Breaking Results: novel computing paradigms
Date: Monday, 17 April 2023
Time: 11:00 CET - 12:30 CET
Location / Room: Okapi Room 0.8.2
Session chair:
Nele Mentens, KU Leuven, BE
Time | Label | Presentation Title / Authors |
---|---|---|
11:00 CET | LBR1.1 | DIGITAL EMULATION OF OSCILLATOR ISING MACHINES Speaker: Jaijeet Roychowdhury, University of California at Berkeley, US Authors: Shreesha Sreedhara1, Jaijeet Roychowdhury1, Joachim Wabnig2 and Pavan Srinath3 1University of California at Berkeley, US; 2Nokia Bell Labs, GB; 3Nokia Bell Labs, FR Abstract The Ising problem is an NP-hard combinatorial optimization problem. Recently, networks of mutually coupled, nonlinear, self-sustaining oscillators known as Oscillator Ising Machines (OIMs) were shown to heuristically solve Ising problems. The phases of the oscillators in OIMs can be modeled as systems of Ordinary Differential Equations (ODEs) known as Generalized Kuramoto (Gen-K) models. In this paper, we solve Gen-K ODE systems efficiently using cleverly designed fixed-point operations. To demonstrate this idea, we fabricated a prototype chip containing 33 spins with programmable all-to-all connectivity. We test this design using Multi-Input Multi-Output decoding problems, and show that the OIM emulator achieves near-optimal Symbol Error Rates (SER). |
11:03 CET | LBR1.2 | ENERGY-EFFICIENT BAYESIAN INFERENCE USING NEAR-MEMORY COMPUTATION WITH MEMRISTORS Speaker: Clement Turck, Universite Paris-Saclay, FR Authors: Clément Turck1, Kamel-Eddine Harabi2, Tifenn Hirtzlin3, Elisa Vianello3, Raphaël Laurent4, Jacques Droulez5, Pierre Bessière6, Jean-Michel Portal7, Marc Bocquet7 and Damien Querlioz2 1Université Paris-Saclay, CNRS, FR; 2Universite Paris-Saclay, CNRS, Centre de Nanosciences et de Nanotechnologies, FR; 3Universite Grenoble-Alpes, CEA-Leti, FR; 4HawAI.tech, FR; 5HawAI.tech, Sorbonne Universite, CNRS, Institut des Systemes Intelligents et de Robotique, FR; 6Sorbonne Universite, CNRS, Institut des Systemes Intelligents et de Robotique, FR; 7Aix-Marseille Universite, CNRS, Institut Matériaux Micro électronique Nanosciences de Provence, FR Abstract Bayesian reasoning is a machine learning approach that provides explainable outputs and excels in small-data situations with high uncertainty. However, it requires intensive memory access and computation and is, therefore, too energy-intensive for extreme edge contexts. Near-memory computation with memristors (or RRAM) can greatly improve the energy efficiency of its computations. Here, we report two fabricated integrated circuits in a hybrid CMOS-memristor process, each featuring sixteen tiny memristor arrays and the associated near-memory logic for Bayesian inference. One circuit performs Bayesian inference using stochastic computing, and the other uses logarithmic computation; these two paradigms fit the area constraints of near-memory computing well. On-chip measurements show the viability of both approaches with respect to memristor imperfections. The two Bayesian machines also operated well at low supply voltages. We also designed scaled-up versions of the machines. Both scaled-up designs can perform a gesture recognition task using orders of magnitude less energy than a microcontroller unit. We also see that if an accuracy lower than 86% is sufficient for this sample task, stochastic computing consumes less energy than logarithmic computing; for higher accuracies, logarithmic computation is more energy-efficient. These results highlight the potential of memristor-based near-memory Bayesian computing, providing both accuracy and energy efficiency. |
11:06 CET | LBR1.3 | TOWARDS A ROBUST MULTIPLY-ACCUMULATE CELL IN PHOTONICS USING PHASE-CHANGE MATERIALS Speaker: Raphael Cardoso, Ecole Centrale de Lyon, FR Authors: Raphael Cardoso1, Clément Zrounba1, Mohab Abdalla1, Paul Jimenez1, Mauricio Gomes1, Benoît Charbonnier2, Fabio Pavanello1, Ian O'Connor3 and Sébastien Le Beux4 1Ecole Centrale de Lyon, FR; 2CEA-Leti, FR; 3Lyon Institute of Nanotechnology, FR; 4Concordia University, CA Abstract In this paper we propose a novel approach to multiply-accumulate (MAC) operations in photonics. This approach is based on stochastic computing and on the dynamic behavior of phase-change materials (PCMs), leading to the unique characteristic of automatically storing the result in non-volatile memory. We demonstrate that, even with perfect look-up tables, the standard approach to PCM scalar multiplication is highly susceptible to perturbations as small as 0.1% of the input power, causing repetitive peaks of 600% relative error. In the same operating conditions, the proposed method achieves an average of 7× improvement in precision. |
11:09 CET | LBR1.4 | LIGHTSPEED BINARY NEURAL NETWORKS USING OPTICAL PHASE-CHANGE MATERIALS Speaker: Taha Michael Shahroodi, TU Delft, NL Authors: Taha Shahroodi1, Raphael Cardoso2, Mahdi Zahedi1, Stephan Wong1, Alberto Bosio3, Ian O'Connor3 and Said Hamdioui1 1TU Delft, NL; 2Ecole Centrale de Lyon, FR; 3Lyon Institute of Nanotechnology, FR Abstract This paper investigates the potential of a compute-in-memory core based on optical Phase Change Materials (oPCMs) to speed up and reduce the energy consumption of the Matrix-Matrix-Multiplication operation. The paper also proposes a new data mapping for Binary Neural Networks (BNNs) tailored for our oPCM core. The preliminary results show a significant latency improvement irrespective of the evaluated network structure and size. The improvement varies from network to network and goes up to 1053X. |
11:12 CET | LBR1.5 | REAL-TIME FULLY UNSUPERVISED DOMAIN ADAPTATION FOR LANE DETECTION IN AUTONOMOUS DRIVING Speaker: Kshitij Bhardwaj, Lawrence Livermore National Lab, US Authors: Kshitij Bhardwaj1, Zishen Wan2, Arijit Raychowdhury2 and Ryan Goldhahn1 1Lawrence Livermore National Lab, US; 2Georgia Tech, US Abstract While deep neural networks are being utilized heavily for autonomous driving, they need to be adapted to new unseen environmental conditions for which they were not trained. We focus on a safety critical application of lane detection, and propose a lightweight, fully unsupervised, real-time adaptation approach that only adapts the batch-normalization parameters of the model. We demonstrate that our technique can perform inference, followed by on-device adaptation, under a tight constraint of 30 FPS on Nvidia Jetson Orin. It shows similar accuracy (avg. of 92.19%) as a state-of-the-art semi-supervised adaptation algorithm but which does not support real-time adaptation. |
11:15 CET | LBR1.6 | A LINEAR-TIME, OPTIMIZATION-FREE, AND EDGE DEVICE-COMPATIBLE HYPERVECTOR ENCODING Speaker: Sercan Aygun, University of Louisiana at Lafayette, US Authors: Sercan Aygun1, M. Hassan Najafi1 and Mohsen Imani2 1University of Louisiana at Lafayette, US; 2University of California, Irvine, US Abstract Hyperdimensional computing (HDC) offers a single-pass learning system by imitating the brain-like signal structure. HDC data structure is in random hypervector format for better orthogonality. Similarly, in bit-stream processing – aka stochastic computing– systems, low-discrepancy (LD) sequences are used for the efficient generation of uncorrelated bit-streams. However, LD-based hypervector generation has never been investigated before. This work studies the utilization of LD Sobol sequences as a promising alternative for encoding hypervectors. The new encoding technique achieves highly-accurate classification with a single-time training step without needing to iterate repeatedly over random rounds. The accuracy evaluations in an embedded environment exhibit a classification rate improvement of up to 9.79% compared to the conventional random hypervector encoding. |
11:18 CET | LBR1.7 | INTERACTIVE TECHNICAL PRESENTATIONS BY THE AUTHORS Speaker: Authors of the session, DATE, BE Author: Session Chairs, DATE, BE Abstract Participants can freely interact with authors during their interactive technical presentations. |
LKS1 Later … with the keynote speakers
Date: Monday, 17 April 2023
Time: 11:00 CET - 12:30 CET
Location / Room: Darwin Hall
Session chair:
Rolf Ernst, TU Braunschweig, DE, Selma Saidi, TU Dortmund, DE
Session co-chair:
Ian O’Connor, Ecole Centrale de Lyon, FR
M01 Modern High-Level Synthesis for Complex Data Science Applications
Date: Monday, 17 April 2023
Time: 11:00 CET - 12:30 CET
Location / Room: Marble Hall
Organisers:
Antonino Tumeo, Pacific Northwest National Laboratory, US
Fabrizio Ferrandi, Politecnico di Milano, IT
Nicolas Bohm Agostini, Pacific Northwest National Laboratory and Northeastern University, US
Serena Curzel, Pacific Northwest National Laboratory, US and Politecnico di Milano, IT
Michele Fiorito, Politecnico di Milano, IT
Presenters:
Antonino Tumeo, Pacific Northwest National Laboratory, US
Fabrizio Ferrandi, Politecnico di Milano, IT
Nicolas Bohm Agostini, Pacific Northwest National Laboratory and Northeastern University, US
Serena Curzel, Pacific Northwest National Laboratory, US and Politecnico di Milano, IT
Michele Fiorito, Politecnico di Milano, IT
Data Science applications (machine learning, graph analytics) are among the main drivers for the renewed interest in designing domain-specific accelerators, both for reconfigurable devices (Field Programmable Gate Arrays, FPGAs) and for Application-Specific Integrated Circuits (ASICs). Today, the availability of new high-level synthesis (HLS) tools to generate accelerators starting from high-level specifications provides easier access to FPGAs or ASICs and preserves programmer productivity. However, the conventional HLS flow typically starts from languages such as C, C++, or OpenCL, heavily annotated with information to guide the hardware generation, still leaving a significant gap with respect to the (Python-based) data science frameworks. This tutorial will discuss HLS to accelerate data science on FPGAs or ASICs, highlighting key methodologies, trends, advantages and benefits, but also the gaps that still need to be closed. The tutorial will provide hands-on experience with the SOftware Defined Accelerators (SODA) Synthesizer, a toolchain composed of SODA-OPT, an open-source front-end and optimizer that interfaces with productive Python-based data science frameworks, and Bambu, the most advanced open-source HLS tool available, able to generate optimized accelerators for data-intensive kernels. We will further show how SODA integrates with the OpenROAD flow, providing a truly automated end-to-end open-source compiler toolchain from high-level machine learning frameworks to silicon.
M01.1 Session 1: Modern High-Level Synthesis for Complex Data Science Applications
Date: Monday, 17 April 2023
Time: 11:00 CET - 12:30 CET
Location / Room: Marble Hall
Time | Label | Presentation Title / Authors |
---|---|---|
11:00 CET | M01.1.1 | AGILE HARDWARE DESIGN FOR COMPLEX DATA SCIENCE APPLICATIONS: OPPORTUNITIES AND CHALLENGES Speaker: Antonino Tumeo, Pacific Northwest National Laboratory, US Abstract Introductory material, context, state-of the art, and research opportunities |
11:20 CET | M01.1.2 | BAMBU: AN OPEN-SOURCE RESEARCH FRAMEWORK FOR THE HIGH-LEVEL SYNTHESIS OF COMPLEX APPLICATIONS. Speaker: Fabrizio Ferrandi, Politecnico di Milano, IT Abstract Advanced materials on High-Level Synthesis methods |
11:45 CET | M01.1.3 | END-TO-END DEMONSTRATION FROM HIGH-LEVEL FRAMEWORKS TO SILICON WITH SODA-OPT, BAMBU, AND OPENROAD Speakers: Nicolas Bohm Agostini1 and Serena Curzel2 1Pacific Northwest National Laboratory and Northeastern University, US; 2Pacific Northwest National Laboratory, US and Politecnico di Milano, IT Abstract Hands-on session on the end-to-end toolchain |
12:10 CET | M01.1.4 | ADVANCED HIGH-LEVEL SYNTHESIS WITH BAMBU Speakers: Serena Curzel1 and Michele Fiorito2 1Pacific Northwest National Laboratory, US and Politecnico di Milano, IT; 2Politecnico di Milano, IT Abstract Hands-on session on advanced High-Level Synthesis with Bambu |
M04 Remote Side-Channel and Fault Attacks in FPGAs
Date: Monday, 17 April 2023
Time: 11:00 CET - 12:30 CET
Location / Room: Okapi Room 0.8.3
Organisers:
Mehdi Tahoori, Karlsruhe Institute of Technology, DE
Jonas Krautter, Karlsruhe Institute of Technology, DE
Dennis Gnad, Karlsruhe Institute of Technology, DE
The shared FPGA platform in the cloud is based on the concept that the FPGA real estate can be shared among various users, possibly even at different privilege levels. Such multi-tenancy comes with new security challenges, in which one user, while being completely logically isolated from another, can cause security breaches to another user on the same FPGA. In addition, such hardware security vulnerabilities do not require physical access to the hardware to perform measurements or fault attacks; they can be exploited completely remotely. The main objective of this tutorial, which consists of three components (an in-depth lecture, a live demo and a hands-on experience), is to introduce the new challenges arising from sharing FPGAs both in the cloud and in state-of-the-art heterogeneous Systems on Chip (SoCs). It explores remote active and passive attacks at the electrical level on multi-tenant FPGAs in the cloud and in SoCs, and discusses possible countermeasures to deal with such security vulnerabilities.
The first part of this tutorial is an in-depth lecture covering the new trends in design of heterogeneous FPGA-SoCs as well as sharing the FPGAs in the clouds and the associated security vulnerabilities. The lecture part is given by Mehdi Tahoori. In this part, the traditional side channel and fault attacks are reviewed. We also show how the power delivery network (PDN) on the chip, board and system level can be utilized as a side channel medium and how the legitimate programmable logic constructs of the FPGA can be exploited for side channel voltage fluctuation measurements as well as injecting faults on the PDN for fault attacks and denial of service. Also, various countermeasures in terms of offline bitstream checking and online approaches based on fencing and sandboxing will be covered.
In the second part of the tutorial, we present live attacks on recent cloud FPGAs, such as the Intel Stratix 10 and the Xilinx Virtex UltraScale+. The respective attacks, namely Correlation Power Analysis as well as a Differential Fault Attack on AES, will be explained in detail to the attendees, who will learn how to derive secret AES keys from faulty ciphertexts and side-channel measurements in a real system. Moreover, we demonstrate how recent FPGAs can be crashed in a Denial-of-Service attack, making recovery without power cycling impossible. This part is administered by Dennis Gnad and Jonas Krautter.
Finally, the third part is a hands-on experience using low cost Lattice iCE40-HX8K breakout boards together with a comprehensive graphical interface, which can be used to control various parameters of the measurement or fault injection process on the FPGA. On this platform, participants of the tutorial are able to perform the demonstrated attacks themselves and learn about the importance of the respective parameters as well as the details of the attacked implementation.
MPP1 Multi-partner projects
Date: Monday, 17 April 2023
Time: 11:00 CET - 12:30 CET
Location / Room: Gorilla Room 1.5.1
Session chair:
Luca Sterpone, Politecnico di Torino, IT
Time | Label | Presentation Title / Authors |
---|---|---|
11:00 CET | MPP1.1 | NIMBLEAI: TOWARDS NEUROMORPHIC SENSING-PROCESSING 3D-INTEGRATED CHIPS Speaker: Xabier Iturbe, Ikerlan, ES Authors: Xabier Iturbe1, Nassim Abderrahmane2, Jaume Abella3, Sergi Alcaide3, Eric Beyne4, Henri-Pierre Charles5, Christelle Charpin-Nicolle6, Lars Chittka7, Angelica Davila1, Arne Erdmann8, Carles Estrada9, Ander Fernandez9, Anna Fontanelli10, Josè Flich11, Gianluca Furano12, Alejandro Hernan-Gloriani13, Erik Isusquiza14, Radu Grosu15, Carles Hernandez16, Daniele Ielmini17, David Jackson18, Maha Kooli19, Nicola Lepri20, Bernabe Linares-Barranco21, Jean-Loup Lachese2, Eric Laurent2, Menno Lindwer22, Frank Linsenmaier13, Mikel Lujan18, Karel Masarik23, Nele Mentens24, Orlando Moreira22, Chinmay Nawghane4, Luca Peres18, Jean-Philippe Noel5, Arash Pourtaherian22, Christoph Posch25, Peter Priller26, Zdenek Prikryl23, Felix Resch27, Oliver Rhodes18, Todor Stefanov28, Moritz Storring4, Michele Taliercio10, Rafael Tornero16, Marcel van de Burgwal4, Geert van der Plas4, Elisa Vianello6 and Pavel Zaykov23 1Ikerlan, ES; 2MENTA, FR; 3Barcelona Supercomputing Center (BSC-CNS), ES; 4IMEC, BE; 5CEA, FR; 6CEA-Leti, FR; 7Queen Mary University Of London, GB; 8RAYTRIX, DE; 9IKERLAN, ES; 10MZ TECHNOLOGIES, IT; 11Associate Professor, Universitat Politècnica de València, ES; 12ESA ESTEC, NL; 13VIEWPOINTSYSTEM, AT; 14ULMA MEDICAL TECHNOLOGIES, ES; 15TU Wien, AT; 16UNIVERSIDAD POLITECNICA DE VALENCIA, ES; 17Politecnico di Milano, IT; 18The University of Manchester, GB; 19CEA/LIST, FR; 20POLITECNICO DI MILANO, IT; 21CSIC, ES; 22GRAI MATTER LABS, NL; 23CODASIP, CZ; 24UNIVERSITY OF LEIDEN, NL; 25PROPHESEE, FR; 26AVL LIST, AT; 27TU WIEN, AT; 28Leiden University, NL Abstract The NimbleAI Horizon Europe project leverages key principles of energy-efficient visual sensing and processing in biological eyes and brains, and harnesses the latest advances in 3D stacked silicon integration, to create an integral sensing-processing neuromorphic architecture that efficiently and accurately runs computer vision algorithms in area-constrained endpoint chips. The rationale behind the NimbleAI architecture is: sense only data with high information value and discard data as soon as they are found not to be useful for the application (in a given context). The NimbleAI sensing-processing architecture is to be specialized after deployment by tuning system-level trade-offs for each particular computer vision algorithm and deployment environment. The objectives of NimbleAI are: (1) 100x performance per mW gains compared to state-of-the-practice solutions (i.e., CPU/GPUs processing frame-based video); (2) 50x processing latency reduction compared to CPU/GPUs; (3) energy consumption in the order of tens of mWs; and (4) silicon area of approx. 50 mm^2. |
11:03 CET | MPP1.2 | OPTIMIZING INDUSTRIAL APPLICATIONS FOR HETEROGENEOUS HPC SYSTEMS: THE OPTIMA PROJECT Speaker: Dimitris Theodoropoulos, Institute of Communication and Computation Systems, GR Authors: Dimitris Theodoropoulos1, Oliver Michel2, PAVLOS MALAKONAKIS3, Konstantinos Georgopoulos4, Giovanni Isotton5, Dionisios Pnevmatikatos6, Ioannis Papaefstathiou7, Gino Perna8, Panagiotis Miliadis9, Mariza Zanotti8, Chloe Alverti9, Aggelos Ioannou10, Max Engelen11, Valeria Bartsch12, Mathias Balzer12 and Iakovos Mavroidis13 1Institute of Communication and Computer Systems, GR; 2Cyberbotics, CH; 3TU Crete, GR; 4Telecommunication Systems Institute, TU Crete, GR; 5M3E, IT; 6National TU Athens & ICCS, GR; 7Aristotle University of Thessaloniki, GR; 8EnginSoft SpA, Trento, IT; 9National TU Athens, GR; 10School of Electrical & Computer Engineering, TU Crete, Chania, Greece, GR; 11Maxeler IoT Labs, Delft, Netherlands, NL; 12Fraunhofer ITWM Kaiserslautern, DE; 13Telecommunication Systems Institute, GR Abstract OPTIMA is an SME-driven project (intermediate stage) that aims to port and optimize industrial applications and a set of open-source libraries into two novel FPGA-populated HPC systems. Target applications are from the domain of robotics simulation, underground analysis and computational fluid dynamics (CFD), where data processing is based on differential equations, matrix-matrix and matrix-vector operations. Moreover, the OPTIMA OPen Source (OOPS) library will support basic linear algebraic operations, sparse matrix-vector arithmetic, as well as computer-aided engineering (CAE) solvers. The OPTIMA target platforms are JUMAX, an HPC system that couples an AMD Epyc Server with Maxeler FPGA-based Dataflow Engines (DFEs), and server-class machines with Alveo FPGA cards installed. Experimental results on applications up to now, show that performance on robotic simulation can be enhanced up to 1.2x, CFD calculations up to 4.7x, and BLAS routines up to 7x compared to optimized software implementations from OpenBLAS. |
11:06 CET | MPP1.3 | DESIGN ENABLEMENT FLOW FOR CIRCUITS WITH INHERENT OBFUSCATION BASED ON RECONFIGURABLE TRANSISTORS Speaker: Jens Trommer, NaMLab gGmbH, DE Authors: Jens Trommer1, Niladri Bhattacharjee1, Thomas Mikolajick2, Sebastian Huhn3, Marcel Merten3, Mohammed Djeridane3, Muhammad Hassan4, Rolf Drechsler5, Shubham Rai6, Nima Kavand6, Armin Darjani6, Akash Kumar6, Violetta Sessi7, Maximilian Drescher7, Sabine Kolodinski7 and Maciej Wiatr7 1Namlab gGmbH, DE; 2NaMLab Gmbh / TU Dresden, DE; 3University of Bremen, DE; 4University of Bremen/Cyber Physical Systems, DFKI, DE; 5University of Bremen | DFKI, DE; 6TU Dresden, DE; 7Globalfoundries Fab 1, DE Abstract Reconfigurable transistors are a new emerging type of device, which promise to improve the resistance of electronic components against know-how theft. In order to enable product development with such an emerging device, a cross-layer design enablement strategy is needed, as emerging technologies are not necessarily compatible with standard tools used in the industry. In 'CirroStrato', we aim at the development of such a complete flow enabling CMOS co-integration of reconfigurable transistors, ranging from process adjustments, device modeling, library characterization, and physical and logical synthesis up to sophisticated hardware security tests. In this multi-partner-project (MPP) paper, our aim is to elucidate the overall design enablement flow, as well as current research challenges at the individual stages. |
11:09 CET | MPP1.4 | SAFEXPLAIN: SAFE AND EXPLAINABLE CRITICAL EMBEDDED SYSTEMS BASED ON AI Speaker: Francisco J Cazorla, BSC, ES Authors: Jaume Abella1, Jon Perez2, Cristofer Englund3, Bahram Zonooz4, Gabriele Giordana5, Carlo Donzella6, Francisco J Cazorla7, Enrico Mezzetti7, Isabel Serra7, Axel Brando7, Irune Agirre2, Fernando Eizaguirre2, Thanh Bui3, Elahe Arani4, Fahad Sarfraz4, Ajay Balasubramaniam4, Ahmed Badar4, Ilaria Bloise5, Lorenzo Feruglio5, Ilaria Cinelli5, Davide Brighenti8 and Davide Cunial8 1Barcelona Supercomputing Center (BSC-CNS), ES; 2Ikerlan, ES; 3RISE, SE; 4Navinfo Europe, NL; 5AIKO s.r.l., IT; 6Exida Development, s.r.l., IT; 7BSC, ES; 8Exida Engineering, s.r.l., IT Abstract Deep Learning (DL) techniques are at the heart of most future advanced software functions in Critical Autonomous AI-based Systems (CAIS), where they also represent a major competitive factor. Hence, the economic success of CAIS industries (e.g., automotive, space, railway) depends on their ability to design, implement, qualify, and certify DL-based software products under bounded effort/cost. However, there is a fundamental gap between Functional Safety (FUSA) requirements on CAIS and the nature of DL solutions. This gap stems from the development process of DL libraries and affects high level concepts such as (1) explainability and traceability, (2) suitability for varying safety requirements, (3) FUSA-compliant implementations, and (4) real-time constraints. As a matter of fact, the data-dependent and stochastic nature of DL algorithms clash with current FUSA practice, which instead builds on deterministic, verifiable, and pass/fail test-based software. The SAFEXPLAIN project tackles these challenges by providing a novel and flexible approach to allow the certification – hence adoption – of DL-based solutions in CAIS building on (1) DL solutions that provide end-to-end traceability, with specific approaches to explain whether predictions can be trusted and strategies to reach (and prove) correct operation, in accordance to certification standards; (2) alternative and increasingly sophisticated design safety patterns for DL with varying requirements of criticality and fault tolerance; (3) DL library implementations that adhere to safety requirements; and (4) computing platform configurations, to regain determinism, and probabilistic timing analyses, to handle the remaining nondeterminism. |
11:12 CET | MPP1.5 | THE FORA EUROPEAN TRAINING NETWORK ON FOG COMPUTING FOR ROBOTICS AND INDUSTRIAL AUTOMATION Speaker: Paul Pop, TU Denmark, DK Authors: Mohammadreza Barzegaran and Paul Pop, TU Denmark, DK Abstract Fog Computing for Robotics and Industrial Automation, FORA, was a European Training Network which focused on future industrial automation architectures and applications based on an emerging technology, called Fog Computing. The research project focused on research related to Fog Computing with applicability to industrial automation and manufacturing. The main outcome of the FORA project was the development of a deterministic Fog Computing Platform (FCP) to be used for implementing industrial automation and robotics solutions for Industry 4.0. This paper reports on the scientific outcomes of the FORA project. FORA has proposed a reference system architecture for Fog Computing, which was published as an open Architecture Analysis Design Language (AADL) model. The technologies developed in FORA include fog nodes and hypervisors, resource management mechanisms and middleware for deploying scalable Fog Computing applications, while guaranteeing the non-functional properties of the virtualized industrial control applications, and methods and processes for assuring the safety and security of the FCP. Several industrial use cases were used to evaluate the suitability of the FORA FCP for the Industrial IoT area, and to demonstrate how the platform can be used to develop industrial control applications and data analytics applications. |
11:15 CET | MPP1.6 | PETAOPS/W EDGE-AI μPROCESSORS: MYTH OR REALITY? Speaker: Manil Dev Gomony, Eindhoven University of Technology, NL Authors: Manil Dev Gomony1, Floran de Putter2, Anteneh Gebregiorgis3, Gianna Paulin4, Linyan Mei5, Vikram Jain5, Said Hamdioui3, Victor Sanchez2, Tobias Grosser6, Marc Geilen2, Marian Verhelst5, Friedemann Zenke7, Frank Gurkaynak4, Barry de Bruin2, Sander Stuijk2, Simon Davidson8, Sayandip De2, Mounir Ghogho9, Alexandra Jimborean10, Sherif Eissa2, Luca Benini11, Dimitrios Soudris12, Rajendra Bishnoi3, Sam Ainsworth13, Federico Corradi2, Ouassim Karrakchou9, Tim Güneysu14 and Henk Corporaal2 1Eindhoven University of Technology, NL; 2Eindhoven University of Technology, NL; 3TU Delft, NL; 4ETH Zurich, CH; 5KU Leuven, BE; 6University of Edinburgh, GB; 7Friedrich Miescher Institute, CH; 8The University of Manchester, GB; 9Universite Internationale de Rabat, MA; 10University of Murcia, ES; 11ETH Zurich, CH | Università di Bologna, IT; 12National Technical University of Athens, GR; 13University of Edinburgh, GB; 14Ruhr-Universität Bochum & DFKI, DE Abstract With the rise of DL, our world braces for AI in every edge device, creating an urgent need for edge-AI SoCs. This SoC hardware needs to support high throughput, reliable and secure AI processing at ULP, with a very short time to market. With its strong legacy in edge solutions and open processing platforms, the EU is well-positioned to become a leader in this SoC market. However, this requires AI edge processing to become at least 100 times more energy-efficient, while offering sufficient flexibility and scalability to deal with AI as a fast-moving target. Since the design space of these complex SoCs is huge, advanced tooling is needed to make their design tractable. The CONVOLVE project (currently in its initial stage) addresses these roadblocks. It takes a holistic approach with innovations at all levels of the design hierarchy. Starting with an overview of SOTA DL processing support and our project methodology, this paper presents 8 important design choices largely impacting the energy efficiency and flexibility of DL hardware. Finding good solutions is key to making smart-edge computing a reality. |
11:18 CET | MPP1.7 | VE-FIDES: DESIGNING TRUSTWORTHY SUPPLY CHAINS USING INNOVATIVE FINGERPRINTING IMPLEMENTATIONS Speaker: Bernhard Lippmann, Infineon Technologies, DE Authors: Bernhard Lippmann1, Joel Hatsch1, Stefan Seidl1, Detlef Houdeau1, Niranjana Papagudi Subrahmanyam2, Daniel Schneider3, Malek Safieh3, Anne Passarelli3, Aliza Maftun3, Michaela Brunner4, Tim Music4, Michael Pehl4, Tauseef Siddiqui4, Ralf Brederlow5, Ulf Schlichtmann4, Bjoern Driemeyer6, Maurits Ortmanns7, Robert Hesselbarth8 and Matthias Hiller8 1Infineon Technologies, DE; 2Siemens AG, DE; 3Siemens, DE; 4TU Munich, DE; 5TUM School of EDA, DE; 6Uni Ulm, DE; 7University of Ulm, DE; 8Fraunhofer AISEC, DE Abstract The VE-FIDES project will contribute a solution based on an innovative multi-level fingerprinting approach to secure electronics supply chains against the threats of malicious modification, piracy, and counterfeiting. Hardware fingerprints are derived from minuscule, unavoidable process variations using the technology of Physical Unclonable Functions (PUFs). The derived fingerprints are processed into a system fingerprint enabling unique identification, not only of single components but also at the PCB level. With the proposed concept, we show how the system fingerprint can enhance the trustworthiness of the overall system. For this purpose, the complete system, including tiny sensors, a secure element and its interface to the application, is considered in VE-FIDES. New insights into methodologies to derive component and system fingerprints are gained. These techniques for the verification of system integrity are complemented by methods for preventing reverse engineering. Two application scenarios are in the focus of VE-FIDES: industrial control systems and an automotive use case are considered, giving insights into a wide spectrum of requirements for products built from components provided by international supply chains. |
11:21 CET | MPP1.8 | INTERACTIVE TECHNICAL PRESENTATIONS BY THE AUTHORS Speaker: Authors of the session, DATE, BE Author: Session Chairs, DATE, BE Abstract Participants can freely interact with authors during their interactive technical presentations. |
LK1 IEEE CEDA Distinguished Lecturer Lunchtime Keynote
Date: Monday, 17 April 2023
Time: 13:00 CET - 14:00 CET
Location / Room: Darwin Hall
Session chair:
Gi-Joon Nam, IBM, IEEE-CEDA President, US
Session co-chair:
Robert Wille, TU Munich, DE
hosted by IEEE CEDA
Time | Label | Presentation Title / Authors |
---|---|---|
13:00 CET | LK1.1 | RESTORING THE MAGIC IN DESIGN Presenter: Jan Rabaey, IMEC / UC Berkeley, US Author: Jan Rabaey, IMEC / UC Berkeley, US Abstract The emergence of "Very Large Scale Integration (VLSI)" in the late 1970s created a groundswell of feverish innovation. Inspired by the vision laid out in Mead and Conway's "Introduction to VLSI Design", numerous researchers embarked on ventures to unleash the capabilities offered by integrated circuit technology. The introduction of design rules, separating manufacturing from design, combined with an intermediate abstraction language (CIF) and a silicon brokerage service (MOSIS), gave access to silicon for a large population of eager designers. The magic, however, expanded well beyond these circuit enthusiasts and attracted a whole generation of software experts to help automate the design process, giving rise to concepts such as layout generation, logic synthesis, and silicon compilation. It is hard to overestimate the impact that this revolution has had on information technology and society at large. About fifty years later, Integrated Circuits are everywhere. Yet, the process of creating these amazing devices feels somewhat tired. CMOS scaling, the engine behind the evolution in complexity over all these decades, is slowing down and will most likely peter out in about a decade. So has innovation in design tools and methodologies. As a consequence, the lure of IC design and design tool development has faded, causing a talent shortage worldwide. Yet, at the same time, this moment of transition offers a world of opportunity and excitement. Novel technologies and devices, integrated in three-dimensional artifacts, are emerging and are opening the door for truly transformational applications such as brain-machine interfaces and swarms of nanobots. Machine learning, artificial intelligence, optical and quantum computing present novel models of computation surpassing the instruction-set processor paradigm. With this comes a need again to re-invent the design process, explicitly exploiting the capabilities offered by this next generation of computing systems. In summary, it is time to put the magic in design again. |
ASD2 ASD special session: Information Processing Factory, Take Two on Self-Aware Systems of MPSoCs
Date: Monday, 17 April 2023
Time: 14:00 CET - 15:30 CET
Location / Room: Gorilla Room 1.5.4/5
Session chair:
Bryan Donyanavard, San Diego State University, US
Session co-chair:
Smail Niar, UPHF, FR
The Information Processing Factory (IPF) project is a collaboration between research teams in the US (UC Irvine) and Germany (TU Munich and TU Braunschweig) looking into self-aware MPSoCs. IPF 1.0 was first introduced at ESWEEK 2016 as a paradigm to master complex dependable systems. The IPF paradigm applies principles inspired by factory management to the continuous operation and optimization of highly-integrated embedded systems. IPF 2.0 is an extension of IPF for recent data-centric approaches and decentralization methodologies. While an IPF 1.0 system can operate independently, IPF 2.0 has a system-of-systems structure in which several IPF 1.0 “factories” interact, thus providing an additional layer of abstraction aimed at this data-centric approach. It horizontally extends core concepts such as self-optimization, self-construction, and runtime verification, while maintaining the strengths of the existing IPF methodology. Four talks in this session highlight the various concepts in IPF 2.0, illustrated through a truck platooning exemplar.
The talks outline the challenges introduced when moving from self-organizing local systems in IPF 1.0 to autonomous systems collaboration in IPF 2.0, using commercial vehicle platooning as a use case. The first talk explains how the self-aware truck control systems collaborate towards a platoon-level runtime verification that continuously supervises the state of a platoon, even under a changing platoon formation and external disturbance, e.g., by intersecting traffic participants. The second talk outlines the challenges related to managing enormous amounts of dynamic data in the system, and discusses how self-aware caching can help in mastering the resulting communication and data management requirements. The third talk proposes approaches to mitigate the energy cost of data management across multiple systems. The fourth talk addresses lack of explainability in the underlying machine learning technology in collaborative autonomous systems.
Time | Label | Presentation Title / Authors |
---|---|---|
14:00 CET | ASD2.1 | TRUST, BUT VERIFY: TOWARDS SELF-AWARE, SAFE, AUTONOMOUS SELF-DRIVING SYSTEMS Presenter: Fadi Kurdahi, University of California, Irvine, US Author: Fadi Kurdahi, University of California, Irvine, US Abstract . |
14:22 CET | ASD2.2 | VEHICLE AS A CACHE – A DATA CENTRIC PLATFORM FOR THE IPF PARADIGM Presenter: Rolf Ernst, TU Braunschweig, DE Author: Rolf Ernst, TU Braunschweig, DE Abstract . |
14:45 CET | ASD2.3 | COMPUTATIONAL SELF-AWARENESS FOR ENERGY-EFFICIENT MEMORY SYSTEMS Presenter: Nikil Dutt, UC Irvine, US Author: Nikil Dutt, UC Irvine, US Abstract . |
15:07 CET | ASD2.4 | LEARNING CLASSIFIER TABLES - TURNING ML DECISION MAKING EXPLAINABLE Presenter: Andreas Herkersdorf, TU Munich, DE Author: Andreas Herkersdorf, TU Munich, DE Abstract . |
BPA6 Logic synthesis and verification
Date: Monday, 17 April 2023
Time: 14:00 CET - 16:00 CET
Location / Room: Okapi Room 0.8.2
Session chair:
Rolf Drechsler, Bremen University, DE
Time | Label | Presentation Title / Authors |
---|---|---|
14:00 CET | BPA6.1 | COMPUTING EFFECTIVE RESISTANCES ON LARGE GRAPHS BASED ON APPROXIMATE INVERSE OF CHOLESKY FACTOR Speaker: Zhiqiang Liu, Tsinghua University, CN Authors: Zhiqiang Liu and Wenjian Yu, Tsinghua University, CN Abstract Effective resistance, which originates from the field of circuits analysis, is an important graph distance in spectral graph theory. It has found numerous applications in various areas, such as graph data mining, spectral graph sparsification, circuits simulation, etc. However, computing effective resistances accurately can be intractable and we still lack efficient methods for estimating effective resistances on large graphs. In this work, we propose an efficient algorithm to compute effective resistances on general weighted graphs, based on a sparse approximate inverse technique. Compared with a recent competitor, the proposed algorithm shows several hundreds of speedups and also one to two orders of magnitude improvement in the accuracy of results. Incorporating the proposed algorithm with the graph sparsification based power grid (PG) reduction framework, we develop a fast PG reduction method, which achieves an average 6.4X speedup in the reduction time without loss of reduction accuracy. In the applications of power grid transient analysis and DC incremental analysis, the proposed method enables 1.7X and 2.5X speedup of overall time compared to using the PG reduction based on accurate effective resistances, without increase in the error of solution. |
14:25 CET | BPA6.2 | FANOUT-BOUNDED LOGIC SYNTHESIS FOR EMERGING TECHNOLOGIES - A TOP-DOWN APPROACH Speaker: Dewmini Sudara Marakkalage, EPFL, CH Authors: Dewmini Marakkalage and Giovanni De Micheli, EPFL, CH Abstract In logic circuits, the number of fanouts a gate can drive is limited, and such limits are tighter in emerging technologies such as superconducting electronic circuits. In this work, we study the problem of resynthesizing a logic network with bounded-fanout gates while minimizing area. We 1) formulate this problem for a fixed target logic depth as an integer linear program (ILP) and present exact solutions for small logic networks, and 2) propose a top-down approach to construct a feasible solution to the ILP which yields an efficient algorithm for fanout bounded synthesis. When using the minimum depth achievable with unbounded fanouts as the target logic depth, our top-down approach achieves 11.82% better area as compared to the state-of-the-art with matching or better delays. |
14:50 CET | BPA6.3 | SYNTHESIS WITH EXPLICIT DEPENDENCIES Speaker: Priyanka Golia, National University of Singapore and Indian Institute of Technology Kanpur, SG Authors: Priyanka Golia1, Subhajit Roy2 and Kuldeep S Meel3 1IIT Kanpur and NUS Singapore, SG; 2IIT Kanpur, IN; 3National University of Singapore, SG Abstract Quantified Boolean Formulas (QBF) extend propositional logic with universal (∀) and existential (∃) quantification over propositional variables. In QBF, an existentially quantified variable is allowed to depend on all universally quantified variables in its scope. Dependency Quantified Boolean Formulas (DQBF) restrict the dependencies of existentially quantified variables. In DQBF, existentially quantified variables have explicit dependencies on a subset of universally quantified variables, called Henkin dependencies. Given a Boolean specification between the set of inputs and outputs, the problem of Henkin synthesis is to synthesize each output variable as a function of its Henkin dependencies such that the specification is met. Henkin synthesis has wide-ranging applications, including verification of partial circuits, controller synthesis, and circuit realizability. In this work, we propose a data-driven approach for Henkin synthesis called Manthan3. On an extensive evaluation of over 563 instances arising from past DQBF solving competitions, we demonstrate that Manthan3 is competitive with state-of-the-art tools. Furthermore, Manthan3 solves 26 benchmarks that none of the current state-of-the-art techniques could solve. |
15:15 CET | BPA6.4 | INTERACTIVE TECHNICAL PRESENTATIONS BY THE AUTHORS Speaker: Authors of the session, DATE, BE Author: Session Chairs, DATE, BE Abstract Participants can freely interact with authors during their interactive technical presentations. |
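For readers less familiar with the quantity that BPA6.1 approximates at scale, the following is a minimal sketch of the textbook definition of effective resistance, computed directly from the pseudoinverse of the graph Laplacian. It is an illustration only, not the authors' sparse approximate-inverse algorithm, and the small example graph is hypothetical.

```python
# Minimal sketch: exact effective resistance between nodes u and v of a small
# weighted graph, from the Moore-Penrose pseudoinverse of the graph Laplacian.
# This is the textbook definition, not the approximate-Cholesky method of BPA6.1.
import numpy as np

def effective_resistance(adjacency: np.ndarray, u: int, v: int) -> float:
    """adjacency: symmetric matrix of non-negative edge weights (conductances)."""
    laplacian = np.diag(adjacency.sum(axis=1)) - adjacency
    l_pinv = np.linalg.pinv(laplacian)
    e = np.zeros(adjacency.shape[0])
    e[u], e[v] = 1.0, -1.0
    return float(e @ l_pinv @ e)  # R_eff(u, v) = (e_u - e_v)^T L^+ (e_u - e_v)

# Hypothetical 3-node chain with unit conductances: the end-to-end effective
# resistance is the series sum 1 + 1 = 2.
chain = np.array([[0.0, 1.0, 0.0],
                  [1.0, 0.0, 1.0],
                  [0.0, 1.0, 0.0]])
print(effective_resistance(chain, 0, 2))  # ~ 2.0
```

The dense pseudoinverse makes this exact version cubic in the number of nodes, which is precisely the scalability limit that approximate methods such as the one in BPA6.1 address.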
BPA9 Memory-centric computing
Add this session to my calendar
Date: Monday, 17 April 2023
Time: 14:00 CET - 16:00 CET
Location / Room: Gorilla Room 1.5.1
Session chair:
Said Hamdioui, TU Delft, NL
Time | Label | Presentation Title Authors |
---|---|---|
14:00 CET | BPA9.1 | MINIMIZING COMMUNICATION CONFLICTS IN NETWORK-ON-CHIP BASED PROCESSING-IN-MEMORY ARCHITECTURE Speaker: Hanbo Sun, Tsinghua University, CN Authors: Hanbo Sun1, Tongxin Xie1, Zhenhua Zhu1, Guohao Dai2, Huazhong Yang1 and Yu Wang1 1Tsinghua University, CN; 2Shanghai Jiao Tong University, CN Abstract Deep Neural Networks (DNNs) have made significant breakthroughs in various fields. However, their enormous computation and parameter requirements seriously hinder their application. Emerging Processing-In-Memory (PIM) architectures provide extremely high energy efficiency for accelerating DNN computing, and Network-on-Chip (NoC) based PIM architectures significantly improve the scalability of PIM architectures. However, the contradiction between high communication volume and limited NoC bandwidth introduces severe communication conflicts, which existing work neglects. On the one hand, neglecting communication conflicts leads to imprecise performance estimations in the mapping process, making it hard to find optimal results. On the other hand, communication conflicts cause low NoC bandwidth utilization in the scheduling process, and they account for a latency gap of over 70% in existing work. This paper proposes communication-conflict-optimized mapping and scheduling strategies for NoC based PIM architectures. The proposed mapping strategy constructs communication conflict graphs to model communication conflicts; based on this graph, we adopt a Graph Neural Network (GNN) as a precise performance estimator. Our scheduling strategy predefines the communication priority and NoC communication behavior tables for target DNN workloads, which improves NoC bandwidth utilization effectively. Compared with existing work, for typical classification DNNs on the CIFAR and ImageNet datasets, the proposed strategies reduce latency by 78% and improve throughput by 3.33x on average with negligible deployment and hardware overhead. Experimental results also show that our strategies decrease the average gaps to the ideal cases without communication conflicts from 80.7% and 70% to 12.3% and 1.26% for latency and throughput, respectively. |
14:25 CET | BPA9.2 | HIERARCHICAL NON-STRUCTURED PRUNING FOR COMPUTING-IN-MEMORY ACCELERATORS WITH REDUCED ADC RESOLUTION REQUIREMENT Speaker: Wenlu Xue, Beihang University, CN Authors: Wenlu Xue1, Jinyu Bai2, Sifan Sun3 and Wang Kang2 1Beihang University, CN; 2Beihang University, CN; 3Beihang University, CN Abstract The crossbar architecture, which is composed of novel nano-devices, enables high-speed and energy-efficient computing-in-memory (CIM) for neural networks. However, the overhead from analog-to-digital converters (ADCs) substantially degrades the energy efficiency of CIM accelerators. In this paper, we introduce a hierarchical non-structured pruning strategy in which value-level and bit-level pruning are performed jointly on neural networks to reduce the required ADC resolution, using the well-known alternating direction method of multipliers (ADMM). To verify its effectiveness, we applied the proposed method to a variety of state-of-the-art convolutional neural networks on two image classification benchmark datasets: CIFAR10 and ImageNet. The results show that our pruning method can reduce the required ADC resolution to 2 or 3 bits with only a slight accuracy loss (∼0.25%), and can thus improve hardware efficiency by 180%. (A generic sketch of the ADMM pruning update follows this session's table.) |
14:50 CET | BPA9.3 | PIC-RAM: PROCESS-INVARIANT CAPACITIVE MULTIPLIER BASED ANALOG IN MEMORY COMPUTING IN 6T SRAM Speaker: Kailash Prasad, IIT Gandhinagar, IN Authors: Kailash Prasad, Aditya Biswas, Arpita Kabra and Joycee Mekie, IIT Gandhinagar, IN Abstract In-Memory Computing (IMC) is a promising approach to enabling energy-efficient Deep Neural Network-based applications on edge devices. However, analog-domain dot products and multiplications suffer accuracy loss due to process variations. Furthermore, wordline degradation limits the minimum wordline pulse width, creating additional non-linearity and limiting IMC's dynamic range and precision. This work presents a complete end-to-end process-invariant capacitive-multiplier-based IMC in 6T SRAM (PIC-RAM). The proposed architecture employs the novel idea of two-step multiplication in column-major IMC to support 4-bit multiplication. PIC-RAM uses an operational-amplifier-based capacitive multiplier to reduce bitline discharge, allowing a sufficiently wide wordline (WL) pulse. Further, it employs a process-tracking voltage reference and a fuse capacitor to tackle dynamic and post-fabrication process variations, respectively. Our design is free of compute disturbance and provides a high dynamic range. To the best of our knowledge, PIC-RAM is the first analog SRAM IMC approach to tackle process variation with a focus on practical implementation. PIC-RAM has a high energy efficiency of about 25.6 TOPS/W for 4-bit x 4-bit multiplication and incurs only 0.5% area overhead due to the capacitance multiplier. We obtain 409 bit-wise TOPS/W, which is about 2X better than the state of the art. For 4-bit x 4-bit multiplication, PIC-RAM achieves TOP-1 accuracies of 89.54% and 98.80% for ResNet-18 on CIFAR10 and MNIST, respectively. |
15:15 CET | BPA9.4 | INTERACTIVE TECHNICAL PRESENTATIONS BY THE AUTHORS Speaker: Authors of the session, DATE, BE Author: Session Chairs, DATE, BE Abstract Participants can freely interact with authors during their interactive technical presentations. |
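As background for BPA9.2, the following is a minimal sketch of one iteration of the classic ADMM weight-pruning update that the abstract refers to. It covers value-level pruning only; the paper's joint value/bit-level scheme and its coupling to ADC resolution are not reproduced, and the hyper-parameters (lr, rho, sparsity) are illustrative assumptions.

```python
# Minimal sketch of one ADMM pruning iteration (value-level only), assuming a
# loss-gradient callback. Not the joint value/bit-level method of BPA9.2.
import numpy as np

def project_to_sparse(w: np.ndarray, sparsity: float) -> np.ndarray:
    """Zero out a `sparsity` fraction of entries, keeping the largest magnitudes."""
    k = int(round(sparsity * w.size))
    if k == 0:
        return w.copy()
    thresh = np.partition(np.abs(w).ravel(), k - 1)[k - 1]
    z = w.copy()
    z[np.abs(z) <= thresh] = 0.0
    return z

def admm_prune_step(w, z, u, grad_loss, lr=1e-2, rho=1e-3, sparsity=0.9):
    # 1) primal step on the augmented loss  f(W) + (rho/2)||W - Z + U||^2
    w = w - lr * (grad_loss(w) + rho * (w - z + u))
    # 2) projection step: Z is the closest sufficiently sparse tensor to W + U
    z = project_to_sparse(w + u, sparsity)
    # 3) dual update
    u = u + w - z
    return w, z, u
```

In a full training flow, the primal step would typically consist of several mini-batch SGD updates on the augmented loss before the projection and dual updates are applied.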
CF1.1 Careers Fair – Company Presentations
Add this session to my calendar
Date: Monday, 17 April 2023
Time: 14:00 CET - 14:45 CET
Location / Room: Marble Hall
Session chair:
Anton Klotz, Cadence Design Systems, DE
This is a Young People Programme event. During the Company Presentation Session, participating companies will introduce themselves and explain their business and working environment. Presenting companies include Cadence Design Systems, Synopsys, imec, X-FAB, Bosch, ICSense, Springer Nature, Siemens and RacyICs.
Time | Label | Presentation Title Authors |
---|---|---|
14:05 CET | CF1.1.1 | INTRODUCING CADENCE DESIGN SYSTEMS Presenter: Ben Woods, Cadence Design Systems, IE Author: Ben Woods, Cadence Design Systems, IE Abstract . |
14:12 CET | CF1.1.2 | INTRODUCING SYNOPSYS Presenter: Xander Bergen Henegouwen, Synopsys, NL Author: Xander Bergen Henegouwen, Synopsys, NL Abstract . |
14:19 CET | CF1.1.3 | INTRODUCING BOSCH Presenter: Matthias Kühnle, Bosch, DE Author: Matthias Kühnle, Bosch, DE Abstract . |
14:26 CET | CF1.1.4 | INTRODUCING XFAB Presenter: Rachid Hamani, X-FAB, FR Author: Rachid Hamani, X-FAB, FR Abstract . |
14:32 CET | CF1.1.5 | INTRODUCING RACYICS Presenter: Florian Bilstein, RacyICs, DE Author: Florian Bilstein, RacyICs, DE Abstract . |
14:38 CET | CF1.1.6 | INTRODUCING SIEMENS Presenter: Jaclyn Krieger, Siemens, DE Author: Jaclyn Krieger, Siemens, DE Abstract . |
FS3 Focus session: Integrated Photonics, a key technology for the future of semiconductor-based systems
Add this session to my calendar
Date: Monday, 17 April 2023
Time: 14:00 CET - 15:30 CET
Location / Room: Okapi Room 0.8.1
Session chair:
Twan Korthorst, Synopsys, NL
In this session, we will discuss silicon/integrated photonics technology and devices, their current use in pluggable optical transceivers, and the roadmap towards optical I/O for high-performance compute, programmable photonics, and optical accelerators for AI/ML. The basics of light, optics and integrated photonics, as well as design tool requirements, solutions and trends, will be discussed, both for photonic ICs and for 3DIC and 3DHI multi-die/multi-domain systems.
Time | Label | Presentation Title Authors |
---|---|---|
14:00 CET | FS3.1 | DESIGN SOLUTIONS FOR PHOTONIC ICS AND HETEROGENEOUS SYSTEMS Presenter: Twan Korthorst, Synopsys, NL Author: Twan Korthorst, Synopsys, NL Abstract This presentation will introduce integrated silicon photonics technology and applications, and dive into more detail on the required design tools and solutions, not only for photonic devices and circuits, but also in the context of a 3D heterogeneously integrated system with digital and analog electrical chips, an interposer and photonic ICs. |
14:30 CET | FS3.2 | PUTTING THE LASER IN SILICON: HETEROGENEOUS OR HYBRID INTEGRATION Presenter: Martijn Heck, Eindhoven University of Technology, NL Author: Martijn Heck, Eindhoven University of Technology, NL Abstract Silicon is an important material for photonic integration, due to its compatibility with mature manufacturing infrastructure. However, owing to its indirect bandgap, it has no native lasers and amplifiers available, which severely limits its applications and increases packaging costs and challenges. In this talk, I will outline the options for laser integration and the associated challenges with respect to the design of such hybrid or heterogeneous photonic integrated circuits. |
15:00 CET | FS3.3 | THE CRUCIAL ROLE OF INTEGRATED PHOTONICS IN THE EVOLUTION TOWARDS LOW-ENERGY OPEN AND PROGRAMMABLE OPTICAL NETWORKS Presenter: Vittorio Curri, Politecnico di Torino, IT Author: Vittorio Curri, Politecnico di Torino, IT Abstract Networking technologies are evolving fast to support the demand for ubiquitous Internet access, which is becoming a fundamental need of a modern and inclusive society. This evolution requires networks to develop into open, disaggregated and programmable systems according to the software-defined networking (SDN) paradigm. To enable it, infrastructure control must be separated from the data networking operations performed by the transceivers (TRXs) for optical circuit deployment and by the optical switches for transparent lightpath routing. Moreover, reducing space occupation and power consumption is a fundamental need. Integrated photonics is the crucial technology to enable this evolution. For TRXs, all commercial solutions have already adopted such technologies, enabling pluggable TRXs currently operating at data rates up to 1.2 Tbps/wavelength with substantial reductions in space occupation and power consumption, besides cost. For switching, solutions are largely still at the prototype level, although preliminary solutions are already available on the market. We will discuss the different integrated photonics solutions, focusing on their potentially revolutionary impact on optical networking. |
LKS2 Later … with the keynote speakers
Add this session to my calendar
Date: Monday, 17 April 2023
Time: 14:00 CET - 15:30 CET
Location / Room: Darwin Hall
Session chair:
Gi-Joon Nam, IBM, IEEE-CEDA President, US
Session co-chair:
Robert Wille, TU Munich, DE
W01 Eco-ES: Eco-design and circular economy of Electronic Systems
Add this session to my calendar
Date: Monday, 17 April 2023
Time: 14:00 CET - 18:00 CET
Location / Room: Nightingale Room 2.6.1/2
Organisers:
Chiara Sandionigi, CEA, FR
Jean-Christophe Crebier, CNRS/G-INP/UGA, FR
Jonas Gustafsson, RISE, SE
David Bol, UCLouvain, BE
The environmental impact of electronics is becoming an important issue, especially because the number of systems is growing exponentially. Eco-design and circular economy applied to Electronic Systems are thus becoming major challenges for our society in responding to the dangers for the environment: the exponential increase in electronic waste generation, the depletion of resources, the contribution to climate change and poor resilience to supply-chain issues. Electronic Systems designers willing to engage in eco-design face several difficulties, related in particular to limited knowledge of the environmental impact at the design phase and to the uncertain extension of the service lifetime of the system, or parts of it, owing to the variability in user behaviour and business models.
At DATE 2023, the workshop Eco-ES is devoted to Eco-design and circular economy of Electronic Systems. The objective of Eco-ES is to gather experts from both academia and industry, covering a wide scope in the environmental sustainability of Electronic Systems. Besides regular sessions with talks, a debate panel will offer a place for the audience to discuss and share ideas.
Workshop topics include:
- Specification and modelling of sustainable Electronic Systems
- Life Cycle Assessment tools and techniques
- Electronic Design Automation tools for eco-design
- Design Space Exploration including environmental aspects
- Eco-reliability techniques to design sustainable systems with extended lifetime
- Reparability methods
- Reuse strategies
- Recycling of Electronic Systems
- Refurbish for a second life of the products
- Sustainable cloud computing and datacenters
- Inter-disciplinary works linking the technology aspects of eco-design and circular economy to social and economic sciences
W01.1 Workshop introduction and Keynote
Add this session to my calendar
Date: Monday, 17 April 2023
Time: 14:00 CET - 14:40 CET
Location / Room: Nightingale Room 2.6.1/2
14:00 - 14:10: Workshop introduction
Chiara Sandionigi, CEA, France
14:10 - 14:40: Transitioning to a Circular Economy for Greener Electronic Systems
Manuel Rei, 3DS, France
W01.2 Circular economy for Electronic Systems
Add this session to my calendar
Date: Monday, 17 April 2023
Time: 14:40 CET - 15:40 CET
Location / Room: Nightingale Room 2.6.1/2
Session chair:
Jonas Gustafsson, RISE, SE
14:40 - 15:00: The environmental footprint of semiconductor manufacturing
Cédric Rolin, IMEC, Belgium
15:00 - 15:20: Ecodesign engineering sticking to actual end-of-life operations
Marc Heude, Thales, France
15:20 - 15:40: A circular economy approach for strategic metals in electronics
Serge Kimbel, Weeecycling, France
W01.3 Poster session & Coffee break
Add this session to my calendar
Date: Monday, 17 April 2023
Time: 15:40 CET - 16:15 CET
Location / Room: Nightingale Room 2.6.1/2
- Aniah: A chip design methodology for Eco-design (Aniah)
- Repair, refurbishment and recycling of electronic devices with Lithium-ion batteries (DTI)
- Energy-efficient hardware reuse for sustainable data centers (LIRMM)
- EECONE: European Ecosystem for green Electronics
- Eco-innovation for Digital Systems and Integrated Circuits (CEA)
W01.4 Open call talks
Add this session to my calendar
Date: Monday, 17 April 2023
Time: 16:15 CET - 17:15 CET
Location / Room: Nightingale Room 2.6.1/2
Session chair:
Jonas Gustafsson, RISE, SE
16:15 - 16:35: Sustainability analysis of indium phosphide technologies for RF applications
Benjamin Vanhouche, IMEC, Belgium
16:35 - 16:55: Eco-design and optimization of the edge cloud
Jonas Gustafsson, RISE, Sweden
16:55 - 17:15: Twinning digital ICT products: the digital product passport
Leandro Navarro, Universitat Politècnica de Catalunya, Spain
W01.5 Debate panel
Add this session to my calendar
Date: Monday, 17 April 2023
Time: 17:15 CET - 18:00 CET
Location / Room: Nightingale Room 2.6.1/2
Session chairs:
David Bol, UC Louvain, BE
Chiara Sandionigi, CEA, FR
This session provides the audience with a place to debate eco-design, circular economy and the end-of-life of electronic systems.
Invited speakers:
- Manuel Rei, 3DS, France
- Cédric Rolin, IMEC, Belgium
- Marc Heude, Thales, France
- Serge Kimbel, Weeecycling, France
W04 3rd Workshop Open-Source Design Automation (OSDA 2023)
Add this session to my calendar
Date: Monday, 17 April 2023
Time: 14:00 CET - 18:00 CET
Location / Room: Okapi Room 0.8.3
Organisers:
Christian Krieg, TU Wien, AT
Claire Xenia Wolf, YosysHQ, AT
Andrea Borga, oliscience, NL
OSDA intends to provide an avenue for industry, academics, and hobbyists to collaborate, network, and share their latest visions and open-source contributions, with a view to promoting reproducibility and re-usability in the design automation space. DATE provides the ideal venue to reach this audience since it is the flagship European conference in this field. This is particularly pertinent given the recent efforts across the European Union (and beyond) that mandate "open access" for publicly funded research, covering both published manuscripts and the software code necessary for reproducing their conclusions.
We invited authors of major tools and flows to talk about their recent activities to promote open-source hardware and open-source design automation. Below you will find the list of speakers who have already kindly accepted our invitation. The list is not yet complete, so hang on and watch out for updates!
The list is given in alphabetical order.
- Andrew Kahng (OpenROAD), University of California San Diego, USA
- Antonino Tumeo (SODA Synthesizer), Pacific Northwest National Laboratory (PNNL), USA
- Claire Xenia Wolf (Yosys), YosysHQ, Austria
- Frans Skarman (Spade), Linköping University, Sweden
- Jean-Paul Chaput (Coriolis2), Sorbonne Université, France
- Jim Lewis (OSVVM), SynthWorks, USA
- Larry Doolittle (vhd2vl), Lawrence Berkeley National Labs, USA
- Matthew Guthaus (OpenRAM), University of California Santa Cruz, USA
- Myrtle Shah (nextpnr, FABulous), Heidelberg University, Germany
- Rishiyur Nikhil (BSV and BH), Bluespec Inc., USA
- Tim Edwards (Caravel), Efabless, Inc., USA
- Tristan Gingold (GHDL), CERN, Switzerland
- Tsung-Wei Huang (TaskFlow), University of Utah, USA
A secondary objective of this workshop is to provide a peer-reviewed forum for researchers to publish “enabling” technology such as infrastructure or tooling as open-source contributions -- standalone technology that would not normally be regarded as novel by traditional conferences -- such that others inside and outside of academia may build upon it.
W04.1 Welcome Session
Add this session to my calendar
Date: Monday, 17 April 2023
Time: 14:00 CET - 14:15 CET
Location / Room: Okapi Room 0.8.3
Workshop opening and poster pitch
W04.2 Front-end and Applications
Add this session to my calendar
Date: Monday, 17 April 2023
Time: 14:15 CET - 16:00 CET
Location / Room: Okapi Room 0.8.3
Time | Label | Presentation Title Authors |
---|---|---|
14:15 CET | W04.2.1 | LARRY DOOLITTLE Abstract vhd2vl is a simple and open-source stand-alone program that converts synthesizable VHDL to Verilog. While it has plenty of limitations, it has proved useful to many developers since its start in 2004. This talk will cover its strengths, weaknesses, and alternatives. |
14:30 CET | W04.2.2 | ANTONINO TUMEO Abstract This talk presents the SODA (Software Defined Accelerators) framework, an open-source modular, multi-level, no-human-in-the-loop, hardware compiler that enables end-to-end generation of specialized accelerators from high-level data science frameworks. SODA is composed of SODA-Opt, a high-level frontend developed in MLIR that interfaces with domain-specific programming environments and allows performing system level design, and Bambu, a state-of-the-art high-level synthesis (HLS) engine that can target different device technologies. The framework implements design space exploration as compiler optimization passes. We show how the modular, yet tight, integration of the high-level optimizer and lower-level HLS tools enables the generation of accelerators optimized for the computational patterns of novel "converged" applications. We then discuss some of the research opportunities that such an open-source framework allows. |
14:45 CET | W04.2.3 | MATTHEW GUTHAUS Abstract In this talk, Prof. Guthaus presents the current status of the OpenRAM project including Skywater 130 tape-out results. In addition, Prof. Guthaus will discuss the future roadmap of the OpenRAM project features and support for newer technologies. |
15:00 CET | W04.2.4 | TSUNG-WEI HUANG Abstract Today's EDA algorithms demand large parallel and heterogeneous computing resources for performance. However, writing parallel EDA algorithms is extremely challenging due to highly complex and irregular patterns. This talk will present a novel programming system to help tackle the parallelization challenges of building high-performance EDA algorithms. |
15:15 CET | W04.2.5 | TIM EDWARDS Abstract This talk explores how hardware projects designed using an open source PDK rely too much on precise data which may not be available, and how problems can be avoided by certain design methodologies such as two-phase clocking, negative-edge clocking, margining, and monte carlo simulation. While open PDK data can be made more reliable by cross validation with multiple tools and, ultimately, measurement, good design practices can achieve working silicon without absolute certainty. |
15:30 CET | W04.2.6 | RISHIYUR NIKHIL Abstract BSV and BH, the Bluespec HLHDLs (High-Level Languages for Hardware Design), emerged from ideas in formal specification (Term Rewriting Systems), functional programming (Haskell), and automatic synthesis of RTL from specifications. BSV has been used in some major commercial ASIC designs and is used widely in FPGA projects. The BSV/BH compiler (written in Haskell) was open-sourced in 2020 (https://github.com/B-Lang-org/bsc) and today's projects are centered around RISC-V design and verification, and on accelerators. |
15:45 CET | W04.2.7 | FRANS SKARMAN Abstract Frans will present Spade, a new open source standalone hardware description language. He will show how Spade's abstractions and tooling, which is inspired by software languages, improves the productivity of an HDL without sacrificing low level control. |
W04.3 Poster Session (coffee break)
Add this session to my calendar
Date: Monday, 17 April 2023
Time: 16:00 CET - 16:30 CET
Location / Room: Okapi Room 0.8.3
- Davide Cieri, Nicolò Vladi Biesuz, Rimsky Alejandro Rojas Caballero, Francesco Gonnella, Nico Giangiacomi, Guillermo Loustau De Linares and Andrew Peck: Hog 2023.1: a collaborative management tool to handle Git-based HDL repository
- Lucas Klemmer and Daniel Grosse: Programming Language Assisted Waveform Analysis: A Case Study on the Instruction Performance of SERV
- Vamsi Vytla and Larry Doolittle: Newad: A register map automation tool for Verilog
- Stefan Riesenberger and Christian Krieg: Towards Power Characterization of FPGA Architectures To Enable Open-Source Power Estimation Using Micro-Benchmarks
W04.4 Back-End and Verification
Add this session to my calendar
Date: Monday, 17 April 2023
Time: 16:30 CET - 18:00 CET
Location / Room: Okapi Room 0.8.3
Time | Label | Presentation Title Authors |
---|---|---|
16:30 CET | W04.4.1 | ANDREW KAHNG Abstract OpenROAD (https://theopenroadproject.org) is an open-source RTL-to-GDS tool that generates manufacturable layout from a given hardware description – in 24 hours, at advanced foundry nodes. OpenROAD lowers the cost, expertise and schedule barriers to hardware design, thus providing a platform for research, education and system innovation. This talk will present current status of the OpenROAD project and the roadmap for OpenROAD as it seeks to enable VLSI/EDA education, early design space exploration for system designers, research on machine learning in EDA, and more. |
16:45 CET | W04.4.2 | JEAN-PAUL CHAPUT Abstract The talk will focus on two major points: why Open Hardware is as important as Open Source Software, and the major challenges in building FOSS EDA tools. |
17:00 CET | W04.4.3 | MYRTLE SHAH Abstract Myrtle will introduce some of the recent developments in nextpnr; including easier ways of prototyping new architectures as well as some core algorithm improvements. They will also introduce FABulous, a highly flexible open source eFPGA fabric generator, and its close integration with nextpnr. |
17:15 CET | W04.4.4 | TRISTAN GINGOLD Abstract GHDL is an open-source VHDL simulator and synthesis tool. This talk will present the recently added features and some ideas for future development (in particular, mixed simulation). |
17:30 CET | W04.4.5 | JIM LEWIS Abstract Open Source VHDL Verification Methodology (OSVVM) provides VHDL with buzz word verification capabilities including Transaction Level Modeling, Constrained Random, Functional Coverage, Scoreboards, FIFOs, Memory Models, Error and Message handling, and Test Reporting that are simple to use and feel like built-in language features. OSVVM has grown rapidly during the COVID years, giving us better capability, better test reporting (HTML and Junit), and scripting that is simple to use (and works with most VHDL simulators). This presentation shows how these advances fit into the overall OSVVM Methodology. |
17:45 CET | W04.4.6 | CLAIRE XENIA WOLF Abstract In her talk, Claire will discuss recent developments in open-source verification tools. Claire will briefly present equivalence checking with Yosys (EQY) and mutation cover with Yosys (MCY), and will highlight potential future directions. |
CF1.2 Careers Fair – Panel on Industry Career Perspectives
Add this session to my calendar
Date: Monday, 17 April 2023
Time: 14:45 CET - 15:30 CET
Location / Room: Marble Hall
Session chair:
Oliver Bringmann, University of Tübingen, DE
Panellists:
Heinz Riener, Cadence Design Systems, DE
Björn Hartmann, Synopsys, DE
Johannes Sanwald, Robert Bosch GmbH, DE
Alessandro Brunetti, iQrypto, BE
Presenter:
Heinz Riener, Cadence Design Systems, DE
This is a Young People Programme event. At the Panel on Industry Career Perspectives, Young Professionals from Companies and startups will talk about their experience changing from academia to industry or starting a startup.
CF2 Careers Fair – Speed Dating
Add this session to my calendar
Date: Monday, 17 April 2023
Time: 15:30 CET - 16:00 CET
Location / Room: Marble Hall
Session chair:
Anton Klotz, Cadence Design Systems, DE
This is a Young People Programme event. At the Speed Dating event, attendees of the Young People Programme can meet the recruiters and exchange business cards and CVs. Recruiters from Cadence Design Systems, X-FAB, Synopsys and Bosch will attend.
CF3 Careers Fair – Academia
Add this session to my calendar
Date: Monday, 17 April 2023
Time: 16:00 CET - 17:30 CET
Location / Room: Marble Hall
Session chair:
Nele Mentens, KU Leuven, BE
Careers Fair Academia brings together researchers from academia who have open positions and enthusiastic students who are looking for a position in academia. The academics can present their exciting research plans, and students can get in touch with them to learn more. In addition, you will get the chance to hear about different academic career paths across Europe: opportunities, challenges, similarities, and differences. Our panelists, Prof. Diana Goehringer (TU Dresden), Prof. Ahmed Hemani (KTH), Prof. Alberto Bosio (ECL), and Prof. Lukas Sekanina (VUTBR), will share their valuable experiences and discuss any questions you may have about your future career path in academia.
Time | Label | Presentation Title Authors |
---|---|---|
16:00 CET | CF3.1 | OPEN POSITIONS Presenter: Careers Fair – Academia Participants, DATE, BE Author: Careers Fair – Academia Participants, DATE, BE Abstract Fair participants advertise new and upcoming research initiatives with academic open positions. |
CF3.2 | PANEL DISCUSSION Presenter: Careers Fair – Academia Panelists, DATE, BE Author: Careers Fair – Academia Panelists, DATE, BE Abstract Panel discussion on academic career paths in different countries. |
ASD3 ASD technical session: Autonomy for systems perception, control and optimization
Add this session to my calendar
Date: Monday, 17 April 2023
Time: 16:30 CET - 18:00 CET
Location / Room: Gorilla Room 1.5.4/5
Session chair:
Rolf Ernst, TU Braunschweig, DE
Time | Label | Presentation Title Authors |
---|---|---|
16:30 CET | ASD3.1 | AUTONOMOUS HYPERLOOP CONTROL ARCHITECTURE DESIGN USING MAPE-K Speaker: Julian Demicoli, TU Munich, DE Authors: Julian Demicoli, Laurin Prenzel and Sebastian Steinhorst, TU Munich, DE Abstract In recent years, passenger transport has trended towards vehicle electrification to reduce greenhouse gas emissions. However, due to the low energy density of battery technology, electrification of airplanes is not possible with current technologies. Here, Hyperloop systems can offer a climate-friendly alternative to short-haul flights, but technical challenges remain to be resolved. In contrast to conventional rail systems, the Hyperloop concept uses magnetic propulsion and levitation and has no physical contact with the environment. Consequently, mechanical backup solutions do not suffice to avoid catastrophic events in case of failure. Software solutions must therefore ensure fail-operational behavior, which requires autonomous adaptability to uncertain states. The MAPE-K approach offers a solution to achieve such adaptability. In this paper, we present a hierarchical architecture that combines the MAPE-K concept with the Simplex concept to achieve self-adaptive behavior. We apply our autonomous architecture to the controller design for the levitation system of a Hyperloop pod and show that this controller, designed using our methodology, outperforms a conventional PID controller by up to 76%. (A generic MAPE-K loop skeleton is sketched after this session's table.) |
16:53 CET | ASD3.2 | REINFORCEMENT-LEARNING-BASED JOB-SHOP SCHEDULING FOR INTELLIGENT INTERSECTION MANAGEMENT Speaker: Shao-Ching Huang, National Taiwan University, TW Authors: Shao-Ching Huang1, Kai-En Lin1, Cheng-Yen Kuo1, Li-Heng Lin1, Muhammed Sayin2 and Chung-Wei Lin1 1National Taiwan University, TW; 2Bilkent University, TR Abstract The goal of intersection management is to organize vehicles so that they pass the intersection safely and efficiently. With the technical advances in connected and autonomous vehicles, intersection management is becoming more intelligent and potentially unsignalized. In this paper, we propose a reinforcement-learning-based methodology to train a centralized intersection manager. We define the intersection scheduling problem with a graph-based model and transform it into the job-shop scheduling problem (JSSP) with additional constraints. To utilize reinforcement learning, we model the scheduling procedure as a Markov decision process (MDP) and train the agent with proximal policy optimization (PPO). A grouping strategy is also developed to apply the trained model to streams of vehicles. Experimental results show that the learning-based intersection manager is especially effective at high traffic densities. This paper is the first work in the literature to apply reinforcement learning to the graph-based model. The proposed methodology can flexibly deal with any conflicting scenario, indicating the applicability of reinforcement learning to intelligent intersection management. |
17:15 CET | ASD3.3 | BIO-INSPIRED AUTONOMOUS EXPLORATION POLICIES WITH CNN-BASED OBJECT DETECTION ON NANO-DRONES Speaker: Lorenzo Lamberti, Università di Bologna, IT Authors: Lorenzo Lamberti1, Luca Bompani1, Victor Kartsch Morinigo1, Manuele Rusci2, Daniele Palossi3 and Luca Benini4 1Università di Bologna, IT; 2KU Leuven, BE; 3ETH Zurich, CH; 4University of Bologna, ETH Zurich, IT Abstract Nano-sized drones, with a palm-sized form factor, are gaining relevance in the Internet-of-Things ecosystem. Achieving a high degree of autonomy for complex multi-objective missions (e.g., safe flight, exploration, object detection) is extremely challenging for the onboard chipset due to tight size, payload (<10g), and power-envelope constraints, which strictly limit both memory and computation. Our work addresses this complex problem by combining bio-inspired navigation policies, which rely on time-of-flight distance sensor data, with a vision-based convolutional neural network (CNN) for object detection. Our field-proven nano-drone is equipped with two microcontroller units (MCUs), a single-core ARM Cortex-M4 (STM32) for safe navigation and exploration policies, and a parallel ultra-low-power octa-core RISC-V (GAP8) for onboard CNN inference, with a power envelope of just 134mW, including image sensors and external memories. The object detection task achieves a mean average precision of 50% (at 1.6 frames/s) on an in-field collected dataset. We compare four bio-inspired exploration policies and identify a pseudo-random policy as achieving the highest coverage area of 83% in a ~36 m^2 unknown room in a 3-minute flight. By combining the detection CNN and the exploration policy, we show an average detection rate of 90% on six target objects in a never-seen-before environment. |
17:38 CET | ASD3.4 | BUTTERFLY EFFECT ATTACK: TINY AND SEEMINGLY UNRELATED PERTURBATIONS FOR OBJECT DETECTION Speaker: Nguyen Anh Vu Doan, Fraunhofer IKS, DE Authors: Nguyen Anh Vu Doan, Arda Yueksel and Chih-Hong Cheng, Fraunhofer IKS, DE Abstract This work aims to explore and identify tiny and seemingly unrelated perturbations of images in object detection that lead to performance degradation. While tininess can naturally be defined using L_p norms, we characterize the degree of "unrelatedness" of an object by the pixel distance between the applied perturbation and the object. Triggering errors in prediction while satisfying the two objectives can be formulated as a multi-objective optimization problem, in which we utilize genetic algorithms to guide the search. The results successfully demonstrate that (invisible) perturbations on the right part of an image can drastically change the outcome of object detection on the left. An extensive evaluation reaffirms our conjecture that transformer-based object detection networks are more susceptible to butterfly effects than single-stage object detection networks such as YOLOv5. |
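As background for ASD3.1, the following is a bare sketch of the MAPE-K control loop (Monitor, Analyze, Plan, Execute over a shared Knowledge base) on which the presented hierarchical architecture builds. All names, set-points and the trivial proportional plan are hypothetical placeholders, not the authors' levitation controller.

```python
# Bare MAPE-K loop skeleton. Sensor/actuator interfaces, the set-point and the
# proportional "plan" are illustrative placeholders, not the ASD3.1 design.
from dataclasses import dataclass, field

@dataclass
class Knowledge:
    history: list = field(default_factory=list)  # shared state across phases
    nominal_gap_mm: float = 8.0                   # assumed levitation set-point

def monitor(sensors, k: Knowledge) -> dict:
    sample = {"gap_mm": sensors.read_gap()}       # collect raw observations
    k.history.append(sample)
    return sample

def analyze(sample: dict, k: Knowledge) -> float:
    return sample["gap_mm"] - k.nominal_gap_mm    # deviation from set-point

def plan(deviation: float, k: Knowledge) -> dict:
    # Trivial proportional plan; a real system would select or adapt controllers,
    # e.g. switching to a verified fallback as in the Simplex pattern.
    return {"coil_current_delta": -0.5 * deviation}

def execute(action: dict, actuators) -> None:
    actuators.adjust_current(action["coil_current_delta"])

def mape_k_step(sensors, actuators, k: Knowledge) -> None:
    execute(plan(analyze(monitor(sensors, k), k), k), actuators)
```

The point of the pattern is that each phase reads and updates the shared knowledge, so the loop can adapt its own plans as operating conditions drift.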
FS4 Focus session: The Past, Present and Future of Chiplets
Add this session to my calendar
Date: Monday, 17 April 2023
Time: 16:30 CET - 18:00 CET
Location / Room: Okapi Room 0.8.1
Session chair:
Krishnendu Chakrabarty, Arizona State University, US
Time | Label | Presentation Title Authors |
---|---|---|
16:30 CET | FS4.1 | THE NEXT ERA FOR CHIPLET INNOVATION Speaker: Gabriel Loh, Advanced Micro Devices, Inc., US Authors: Gabriel Loh and Raja Swaminathan, Advanced Micro Devices, Inc., US Abstract Moore's Law is slowing down and the associated costs are simultaneously increasing. These pressures have given rise to new approaches utilizing advanced packaging and integration such as chiplets, interposers, and 3D stacking. We first describe the key technology drivers and constraints that motivate chiplet-based architectures, exploring several product case studies to highlight how different chiplet strategies have been developed to address different design objectives. We detail multiple generations of chiplet-based CPU architectures as well as the recent addition of 3D stacking options to further enhance processor capabilities. Across the industry, we are still collectively in the relatively early days of advanced packaging and 3D integration. As silicon scaling only gets more challenging and expensive while demand for computation continues to soar, we anticipate the transition to a new generation of chiplet architectures that utilize increasing combinations of 2D, 2.5D, and 3D integration and packaging technologies to continue to deliver compelling SoC solutions. However, this next era for chiplet innovation will face a variety of challenges. We will explore many of these technical topics, which in turn provide rich research opportunities for the community to explore and innovate. |
17:00 CET | FS4.2 | ACHIEVING DATACENTER-SCALE PERFORMANCE THROUGH CHIPLET-BASED MANYCORE ARCHITECTURES Speaker: Partha Pande, Washington State University, US Authors: Harsh Sharma1, Sumit Mandal2, Jana Doppa1, Umit Ogras3 and Partha Pratim Pande1 1Washington State University, US; 2Indian Institute of Science, IN; 3University of Wisconsin - Madison, US Abstract Chiplet-based 2.5D systems that integrate multiple smaller chips on a single die are gaining popularity for executing both compute- and data-intensive applications. While smaller chips (chiplets) reduce fabrication costs, they also provide less functionality. Hence, manufacturing several smaller chiplets and combining them into a single system enables the functionality of a larger monolithic chip without prohibitive fabrication costs. The chiplets are connected through the network-on-interposer (NoP). Designing a high-performance and energy-efficient NoP architecture is essential as it enables large-scale chiplet integration. This paper highlights the challenges and existing solutions for designing suitable NoP architectures targeted for 2.5D systems catered to datacenter-scale applications. We also highlight the future research challenges stemming from the current state-of-the-art to make the NoP-based 2.5D systems widely applicable. |
17:30 CET | FS4.3 | MACHINE LEARNING ACCELERATORS IN 2.5D CHIPLET PLATFORMS WITH SILICON PHOTONICS Speaker: Sudeep Pasricha, Colorado State University, US Authors: Febin Sunny, Ebadollah Taheri, Mahdi Nikdast and Sudeep Pasricha, Colorado State University, US Abstract Domain-specific machine learning (ML) accelerators such as Google's TPU and Apple's Neural Engine now dominate CPUs and GPUs for energy-efficient ML processing. However, the evolution of electronic accelerators is facing fundamental limits due to the limited computation density of monolithic processing chips and the reliance on slow metallic interconnects. We present a vision of how optical computation and communication can be integrated into 2.5D chiplet platforms to drive an entirely new class of sustainable and scalable ML hardware accelerators. We describe how cross-layer design and fabrication of optical devices, circuits, and architectures, and hardware/software codesign can help design efficient photonics-based 2.5D chiplet platforms to accelerate emerging ML workloads. |
SD6 Reconfigurable architectures, machine learning and circuit design
Add this session to my calendar
Date: Monday, 17 April 2023
Time: 16:30 CET - 18:00 CET
Location / Room: Gorilla Room 1.5.3
Session chair:
Jan Moritz Joseph, RWTH Aachen University, DE
16:30 CET until 16:54 CET: Pitches of regular papers
16:54 CET until 18:00 CET: Interactive technical presentations by the authors of regular papers and extended abstracts
Regular Papers
Time | Label | Presentation Title Authors |
---|---|---|
16:30 CET | SD6.1 | TOWARDS EFFICIENT NEURAL NETWORK MODEL PARALLELISM ON MULTI-FPGA PLATFORMS Speaker: David Rodriguez, Universitat Politècnica de València, ES Authors: David Rodriguez Agut1, Rafael Tornero1 and Josè Flich2 1Universitat Politecnica de Valencia, ES; 2Universitat Politècnica de València, ES Abstract Nowadays, convolutional neural networks (CNNs) are common in a wide range of applications. Their high accuracy contrasts with their high computing requirements, motivating the search for efficient hardware platforms. FPGAs are suitable due to their flexibility, energy efficiency and low latency. However, the ever-increasing complexity of CNNs demands higher-capacity devices, forcing the need for multi-FPGA platforms. In this paper, we present a multi-FPGA platform with distributed shared-memory support for the inference of CNNs. In contrast with previous works, our solution enables combining different model parallelism strategies applied to CNNs, thanks to the distributed shared-memory support. For a four-FPGA setting, the platform reduces the execution time of 2D convolutions by a factor of 3.95 compared to a single FPGA. The inference of standard CNN models is improved by factors ranging from 3.63 to 3.87. |
16:33 CET | SD6.2 | HIGH-ACCURACY LOW-POWER RECONFIGURABLE ARCHITECTURES FOR DECOMPOSITION-BASED APPROXIMATE LOOKUP TABLE Speaker: Xingyue Qian, Shanghai Jiao Tong University, CN Authors: Xingyue Qian1, Chang Meng1, Xiaolong Shen2, Junfeng Zhao2, Leibin Ni2 and Weikang Qian1 1Shanghai Jiao Tong University, CN; 22012 Labs, Huawei Technologies Co., Ltd., CN Abstract Storing pre-computed results of frequently used functions in a lookup table (LUT) is a popular way to improve energy efficiency, but its advantage diminishes as the number of input bits increases. A recent work shows that by decomposing the target function approximately, the total number of LUT entries can be dramatically reduced, leading to significant energy savings. However, its heuristic approximate decomposition algorithm leads to sub-optimal approximation quality. Also, its rigid hardware architecture only supports disjoint decomposition and may sometimes incur unnecessary extra power consumption. To address these issues, we develop a novel approximate decomposition algorithm based on beam search and simulated annealing, which reduces the approximation error by 11.1%. We also propose a non-disjoint approximate decomposition method and two reconfigurable architectures. The first has 10.4% less error using 19.2% less energy, and the second has 23.0% less error with the same energy consumption, compared to the state-of-the-art design. (A generic simulated-annealing skeleton is sketched after the Regular Papers table of this session.) |
16:36 CET | SD6.3 | FPGA ACCELERATION OF GCN IN LIGHT OF THE SYMMETRY OF GRAPH ADJACENCY MATRIX Speaker: Gopikrishnan Raveendran Nair, Arizona State University, US Authors: Gopikrishnan Raveendran Nair1, Han-sok Suh1, Mahantesh Halappanavar2, Frank Liu3, Jae-sun Seo1 and Yu Cao1 1Arizona State University, US; 2Pacific Northwest National Laboratory, US; 3Oak Ridge National Lab, US Abstract Graph Convolutional Neural Networks (GCNs) are widely used to process large-scale graph data. Different from deep neural networks (DNNs), GCNs are sparse, irregular, and unstructured, posing unique challenges to hardware acceleration with regular processing elements (PEs). In particular, the adjacency matrix of a GCN is extremely sparse, leading to frequent but irregular memory access, low spatial/temporal data locality and poor data reuse. Furthermore, a realistic graph usually consists of unstructured data (e.g., unbalanced distributions), creating significantly different processing times and an imbalanced workload for each node in GCN acceleration. To overcome these challenges, we propose an end-to-end hardware-software co-design to accelerate GCNs on resource-constrained FPGAs with the following features: (1) A custom dataflow that leverages symmetry along the diagonal of the adjacency matrix to accelerate feature aggregation for undirected graphs. We utilize either the upper or the lower triangular matrix of the adjacency matrix to perform aggregation in GCN to improve data reuse. (2) Unified compute cores for both aggregation and transform phases, with full support for the symmetry-based dataflow. These cores can be dynamically reconfigured to the systolic mode for transformation or as individual accumulators for aggregation in GCN processing. (3) Preprocessing of the graph in software to rearrange the edges and features to match the custom dataflow. This step improves the regularity in memory access and data reuse in the aggregation phase. Moreover, we quantize the GCN precision from FP32 to INT8 to reduce the memory footprint without losing inference accuracy. We implement our accelerator design on an Intel Stratix 10 MX FPGA board with HBM2, and demonstrate a 1.3×-110.5× improvement in end-to-end GCN latency compared to the state-of-the-art FPGA implementations, on the graph datasets of Cora, Pubmed, Citeseer and Reddit. |
16:39 CET | SD6.4 | PR-ESP: AN OPEN-SOURCE PLATFORM FOR DESIGN AND PROGRAMMING OF PARTIALLY RECONFIGURABLE SOCS Speaker: Biruk Seyoum, Columbia University, US Authors: Biruk Seyoum, Davide Giri, Kuan-Lin Chiu, Bryce Natter and Luca Carloni, Columbia University, US Abstract Despite its presence for more than two decades and its proven benefits in expanding the space of system design, dynamic partial reconfiguration (DPR) is rarely integrated into frameworks and platforms that are used to design complex reconfigurable system-on-chip (SoC) architectures. This is due to the complexity of the DPR FPGA flow as well as the lack of architectural and software runtime support to enable and fully harness DPR. Moreover, as DPR designs involve additional design steps and constraints, they often have a higher FPGA compilation (RTL-to-bitstream) runtime compared to equivalent monolithic designs. In this work, we present PR-ESP, an open-source platform for a system-level design flow of partially reconfigurable FPGA-based SoC architectures targeting embedded applications that are deployed on resource-constrained FPGAs. Our approach is realized by combining SoC design methodologies and tools from the open-source ESP platform with a fully-automated DPR flow that features a novel size-driven technique for parallel FPGA compilation. We also developed a software runtime reconfiguration manager on top of Linux. Finally, we evaluated our proposed platform using the WAMI-App benchmark application on Xilinx VC707. |
16:42 CET | SD6.5 | ISOP: MACHINE LEARNING ASSISTED INVERSE STACK-UP OPTIMIZATION FOR ADVANCED PACKAGE DESIGN Speaker: Hyunsu Chae, University of Texas at Austin, US Authors: Hyunsu Chae1, Bhyrav Mutnury2, Keren Zhu1, Douglas Wallace2, Douglas Winterberg2, Daniel de Araujo3, Jay Reddy2, Adam Klivans1 and David Z. Pan1 1University of Texas at Austin, US; 2Dell Infrastructure Solutions Group, US; 3Siemens EDA, US Abstract Future computing calls for heterogeneous integration, e.g., the recent adoption of the chiplet methodology. However, high-speed cross-chip interconnects and packaging shall be critical for the overall system performance. As an example of advanced packaging, a high-density interconnect (HDI) printed circuit board (PCB) has been widely used in complex electronics from cell phones to computing servers. A modern HDI PCB may have over 20 layers, each with its unique material properties and geometrical dimensions, i.e., stack-up, to meet various design constraints and performance optimizations. However, stack-up design is usually done manually in the industry, where experienced designers may devote many hours to adjusting the physical dimensions and materials to meet the desired specifications. This process, however, is time-consuming, tedious, and sub-optimal, largely depending on the designer's expertise. In this paper, we propose to automate the stack-up design with a new framework, ISOP, using machine learning for inverse stack-up optimization for advanced package design. Given a target design specification, ISOP automatically searches for ideal stack-up design parameters while optimizing performance. We develop a novel machine learning-assisted hyper-parameter optimization method to make the search efficient and reliable. Experimental results demonstrate that ISOP is better in figure-of-merit (FoM) than conventional simulated annealing and Bayesian optimization algorithms, with all our design targets met with a shorter runtime. We also compare our fully-automated ISOP with expert designers in the industry and achieve very promising results, with orders of magnitude reduction of turn-around time. |
16:45 CET | SD6.6 | FAST AND ACCURATE WIRE TIMING ESTIMATION BASED ON GRAPH LEARNING Speaker: Yuyang Ye, Southeast University, CN Authors: Yuyang Ye1, Tinghuan Chen2, Yifei Gao1, Hao Yan1, Bei Yu2 and Longxing Shi1 1Southeast University, CN; 2The Chinese University of Hong Kong, HK Abstract Accurate wire timing estimation has become a bottleneck in timing optimization since it needs a long turn-around time using a sign-off timer. Gate timing can be calculated accurately using lookup tables in cell libraries. In comparison, the accuracy and efficiency of wire timing calculation for complex RC nets are extremely hard to trade off. The limited number of wire paths opens the door for graph learning methods in wire timing estimation. In this work, we present a fast and accurate wire timing estimator based on a novel graph learning architecture, namely GNNTrans. It generates wire path representations by aggregating local structure information and global relationships of whole RC nets, which cannot be collected efficiently with traditional graph learning approaches. Experimental results on both tree-like and non-tree nets demonstrate improved accuracy, with the maximum wire delay error below 5 ps. In addition, our estimator can predict the timing of over 200K nets in less than 100 seconds. This fast and accurate estimator can be integrated into incremental timing optimization for routed designs. |
16:48 CET | SD6.7 | DTOC: INTEGRATING DEEP-LEARNING DRIVEN TIMING OPTIMIZATION INTO STATE-OF-THE-ART COMMERCIAL EDA TOOL Speaker: Kyungjoon Chang, Seoul National University, KR Authors: Kyungjoon Chang1, Heechun Park2, Jaehoon Ahn1, Kyu-Myung Choi1 and Taewhan Kim1 1Seoul National University, KR; 2Kookmin University, KR Abstract Recently, deep-learning (DL) models have attracted considerable attention for timing prediction in the placement and routing (P&R) flow. So far, the DL-based prior works have been confined to timing prediction at the time-consuming global routing stage, and very few have addressed the timing prediction problem at the placement, i.e., pre-route, stage. This is because it is not easy to "accurately" predict various timing parameters at the pre-route stage. Moreover, no work has addressed a seamless link between timing prediction at the pre-route stage and the final timing optimization through the use of commercial P&R tools. In this work, we propose a framework called DTOC, to be used at the pre-route stage to this end. Precisely, the framework is composed of two models: (1) a DL-driven arc delay and arc output slew prediction model operating in two levels: (level-1) predicting net resistance (R), net capacitance (C), and arc length (Len), followed by (level-2) predicting arc delay and arc output slew from the R/C/Len predictions obtained in (level-1); (2) a timing optimization model, which uses the inference outcomes of our DL-driven prediction model to enable the commercial P&R tools to calculate full path delays and set updated timing margins on paths, so that the P&R tools use more accurate margins during timing optimization. Experimental results show that, by using our DTOC framework during timing optimization in P&R, we improve the pre-route prediction accuracy on arc delay and arc output slew by 20∼26% on average, and improve the WNS, TNS, and the number of timing violation paths by 50∼63% on average. |
16:51 CET | SD6.8 | RL-LEGALIZER: REINFORCEMENT LEARNING-BASED CELL PRIORITY OPTIMIZATION IN MIXED-HEIGHT STANDARD CELL LEGALIZATION Speaker: Sung-Yun Lee, Pohang University of Science and Technology, KR Authors: Sung-Yun Lee1, Seonghyeon Park2, Daeyeon Kim2, Minjae Kim2, Tuyen Le3 and Seokhyeong Kang2 1Pohang University of Science and Technology (POSTECH), KR; 2Pohang University of Science and Technology, KR; 3AgileSoDA, KR Abstract Cell legalization order has a substantial effect on the quality of modern VLSI designs, which use mixed-height standard cells. In this paper, we propose a deep reinforcement learning framework to optimize cell priority in the legalization phase of various designs. We extract the selected features of movable cells and their surroundings, then embed them into cell-wise deep neural networks. We then determine cell priority and legalize them in order using a pixel-wise search algorithm. The proposed framework uses a policy gradient algorithm and several training techniques, including grid-cell subepisode, data normalization, reduced-dimensional state, and network optimization. We aim to resolve the suboptimality of existing sequential legalization algorithms with respect to displacement and wirelength. On average, our proposed framework achieved 34% lower legalization costs in various benchmarks compared to that of the state-of-the-art legalization algorithm. |
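As background for SD6.2, the following is a generic simulated-annealing skeleton with the standard Metropolis acceptance rule, one of the two search strategies that abstract names. The cost function, neighbour move and cooling parameters are placeholders, not the paper's approximate-decomposition objective.

```python
# Generic simulated-annealing skeleton (Metropolis acceptance, geometric
# cooling). The toy cost and neighbour move are illustrative placeholders.
import math
import random

def simulated_annealing(initial, cost, neighbour,
                        t_start=1.0, t_end=1e-3, alpha=0.95, moves_per_temp=50):
    state, best = initial, initial
    t = t_start
    while t > t_end:
        for _ in range(moves_per_temp):
            candidate = neighbour(state)
            delta = cost(candidate) - cost(state)
            # accept improvements always, worsenings with probability e^(-delta/t)
            if delta <= 0 or random.random() < math.exp(-delta / t):
                state = candidate
                if cost(state) < cost(best):
                    best = state
        t *= alpha  # geometric cooling schedule
    return best

# Toy usage: minimise (x - 3)^2 over the integers with +/-1 moves.
print(simulated_annealing(0, lambda x: (x - 3) ** 2,
                          lambda x: x + random.choice((-1, 1))))
```

Beam search, the other strategy named in SD6.2, would instead keep the best few partial decompositions at each step rather than a single annealed state.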
Extended Abstracts
Time | Label | Presentation Title Authors |
---|---|---|
16:54 CET | SD6.9 | NEURAL NETWORK ON THE EDGE: EFFICIENT AND LOW COST FPGA IMPLEMENTATION OF DIGITAL PREDISTORTION IN MIMO SYSTEMS Speaker: John Dooley, Maynooth University, IE Authors: Yiyue Jiang1, Andrius Vaicaitis2, John Dooley2 and Miriam Leeser1 1Department of Electrical and Computer Engineering at Northeastern University, US; 2Department of Electronic Engineering at Maynooth University, IE Abstract Base stations in cellular networks must operate linearly, power-efficiently, and with ever-increasing flexibility. Recent FPGA hardware advances have demonstrated linearization using neural networks; however, the latency introduced by these solutions is a concern. We present a novel hardware implementation of a low digital cost, high-throughput pipelined Real Valued Time Delay Neural Network (RVTDNN) structure with a hardware-efficient activation function. Network training times are reduced by minimizing the training signal samples used, based on a biased probability density function (pdf). The design has been experimentally validated using an AMD/Xilinx RFSoC ZCU216 board and surpasses the data throughput of conventional RVTDNN-based DPD while using a fraction of their hardware utilization. |
16:54 CET | SD6.10 | QUANTISED NEURAL NETWORK ACCELERATORS FOR LOW-POWER IDS IN AUTOMOTIVE NETWORKS Speaker: Shashwat Khandelwal, Ph.D. Student, Electronic and Electrical Engineering, Trinity College Dublin, IE Authors: Shashwat Khandelwal, Anneliese Walsh and Shreejith Shanker, Trinity College Dublin, IE Abstract In this paper, we explore low-power custom quantised Multi-Layer Perceptrons (MLPs) as an Intrusion Detection System (IDS) for the automotive controller area network (CAN). We utilise the FINN framework from AMD/Xilinx to quantise, train and generate hardware IP of our MLP to detect denial-of-service (DoS) and fuzzing attacks on a CAN network, using the ZCU104 (XCZU7EV) FPGA as our target ECU architecture with integrated IDS capabilities. Our approach achieves significant improvements in latency (0.12 ms per-message processing latency) and inference energy consumption (0.25 mJ per inference) while achieving similar classification performance to state-of-the-art approaches in the literature. |
SD7 Logical and physical analysis and design
Add this session to my calendar
Date: Monday, 17 April 2023
Time: 16:30 CET - 18:00 CET
Location / Room: Gorilla Room 1.5.1
Session chair:
Patrick Groeneveld, CEREBRAS & STANFORD, US
16:30 CET until 16:54 CET: Pitches of regular papers
16:54 CET until 18:00 CET: Interactive technical presentations by the authors of regular papers and extended abstracts
Regular Papers
Time | Label | Presentation Title Authors |
---|---|---|
16:30 CET | SD7.1 | SYNTHESIS AND UTILIZATION OF STANDARD CELLS AMENABLE TO GEAR RATIO OF GATE-METAL PITCHES FOR IMPROVING PIN ACCESSIBILITY Speaker: Jooyeon Jeong, Seoul National University, KR Authors: Jooyeon Jeong, Sehyeon Chung, Kyeongrok Jo and Taewhan Kim, Seoul National University, KR Abstract Traditionally, the synthesis of standard cells invariably assumes that the gear ratio (GR) between the gate poly pitch in the cells and the metal pitch of the first vertical metal layer (to be used for routing) over the gate poly is 1:1 for chip implementation. However, the scaling trend in sub-10nm node CMOS designs is that GR is changing from 1:1 to 3:2 or 4:3, which means the number and location of pin access points vary depending on the cell placement location, making pins hard to access if their access points are aligned with off-track routing patterns. This work overcomes the pin inaccessibility problem caused by non-1:1 GR in chip implementation. Precisely, we propose a non-1:1 GR aware DTCO (design and technology co-optimization) flow to generate cells with pin patterns that are best suited to the implementation of the target design. To this end, we propose two new tasks to be installed in our DTCO framework: (1) from the existing cells optimized for 1:1 GR, we relocate their pin patterns to be amenable to non-1:1 GR, so that maximal pin accessibility is achieved; (2) we incrementally update the pin patterns of the cell instances with routing failures due to pin inaccessibility in the course of the DTCO iterations, to produce cells with pin patterns best fitted to the implementation of the target design. Through experiments with benchmark circuits, it is shown that our DTCO methodology, optimizing pin patterns amenable to non-1:1 GR, is able to produce chip implementations with on average 5.88× fewer routing failures at no additional wirelength, timing, or power cost. |
16:33 CET | SD7.2 | CENTER-OF-DELAY: A NEW METRIC TO DRIVE TIMING MARGIN AGAINST SPATIAL VARIATION IN COMPLEX SOCS Speaker: Christian Lutkemeyer, Marvell Semiconductor, Inc., US Authors: Christian Lutkemeyer1 and Anton Belov2 1Marvell Semiconductor, Inc., US; 2Synopsys, IE Abstract Complex VLSI SOCs are manufactured on large 300mm wafers. Individual SOCs can show significant spatial performance gradients in the order of 10% per 10mm. The traditional approach to handling this variation in STA tools is a margin look-up table indexed by the diagonal of the bounding box around the gates in a timing path. In this paper we propose a new approach based on the concept of the Center-of-Delay of a timing path. We justify this new approach theoretically for linear performance gradients and present experimental data that shows that the new approach is both safe, and significantly less pessimistic than the existing method. |
16:36 CET | SD7.3 | A NOVEL DELAY CALIBRATION METHOD CONSIDERING INTERACTION BETWEEN CELLS AND WIRES Speaker: Leilei Jin, The National ASIC System Engineering Technology Research Center, Southeast University, CN Authors: Leilei Jin, Jia Xu, Wenjie Fu, Hao Yan, Xiao Shi, Ming Ling and Longxing Shi, Southeast University, CN Abstract In advanced technologies, the accuracy of cell and wire delay modeling is a key metric for timing analysis. However, when the supply voltage decreases to the near-threshold regime, the complicated process variation effect makes the cell delay and the wire delay hard to model. Most researchers study cell or wire delay separately, ignoring the interaction between them. In this paper, we propose an N-sigma delay model by characterizing different sigma levels (-3σ to +3σ) of the cell and wire delay distributions. The N-sigma cell delay model is represented by the first four moments and calibrated by the operating conditions (input slew, output load). Meanwhile, based on the Elmore model, the wire delay variability is calculated by considering the effect of the drive and load cells. The delay models are verified on the ISCAS85 benchmarks and the functional units of the PULPino processor with TSMC 28 nm technology. Compared to the SPICE results, the average errors for estimating the ±3σ cell delay are 2.1% and 2.7%, and those of the wire delay are 2.4% and 1.6%, respectively. The errors of path delay analysis stay below 6.6% and the speed is 103X over SPICE MC simulations. |
16:39 CET | SD7.4 | RETHINKING NPN CLASSIFICATION FROM FACE AND POINT CHARACTERISTICS OF BOOLEAN FUNCTIONS Speaker: Jiaxi Zhang, Peking University, CN Authors: Jiaxi Zhang1, Shenggen Zheng2, Liwei Ni3, Huawei Li3 and Guojie Luo1 1Peking University, CN; 2Peng Cheng Laboratory, CN; 3Chinese Academy of Sciences, CN Abstract NPN classification is an essential problem in the design and verification of digital circuits. Most existing works explored variable symmetries and cofactor signatures to develop their classification methods. However, cofactor signatures only consider the face characteristics of Boolean functions. In this paper, we propose a new NPN classifier using both face and point characteristics of Boolean functions, including cofactor, influence, and sensitivity. The new method brings a new perspective to the classification of Boolean functions. The classifier only needs to compute some signatures, and the equality of corresponding signatures is a prerequisite for NPN equivalence. Therefore, these signatures can be directly used for NPN classification, thus avoiding the exhaustive transformation enumeration. The experiments show that the proposed NPN classifier gains better NPN classification accuracy with comparable speed. |
16:42 CET | SD7.5 | EXACT SYNTHESIS BASED ON SEMI-TENSOR PRODUCT CIRCUIT SOLVER Speaker: Hongyang Pan, Ningbo University, CN Authors: Hongyang Pan1 and Zhufei Chu2 1Ningbo University, CN; 2Ningbo University, CN Abstract In logic synthesis, Boolean satisfiability (SAT) is widely used as a reasoning engine, especially for exact synthesis. By representing input formulas as logic circuits instead of conjunctive normal forms (CNFs), as in off-the-shelf CNF-based SAT solvers, circuit-based SAT solvers make decoding a solution easier. An exact synthesis method based on a semi-tensor product (STP) circuit solver is presented in this paper. As opposed to other SAT-based exact synthesis algorithms, in our method synthesized Boolean functions are encoded into STP canonical forms and can be solved by an STP-based circuit SAT solver. It can also obtain all optimal solutions in one pass. In particular, all solutions are expressed as 2-input lookup tables (LUTs), rather than homogeneous logic representations. Hence, different costs can be considered when selecting the optimal circuit. In experiments, we demonstrate that our method accelerates the runtime by up to 225.6X while reducing timeout instances by up to 88%. |
16:45 CET | SD7.6 | AN EFFECTIVE AND EFFICIENT HEURISTIC FOR RATIONAL-WEIGHT THRESHOLD LOGIC GATE IDENTIFICATION Speaker: Ting Yu Yeh, National Taiwan University of Science and Technology, TW Authors: Ting Yu Yeh, Yueh Cho and Yung Chih Chen, National Taiwan University of Science and Technology, TW Abstract In CMOS-based current-mode realization, the threshold logic gate (TLG) implementation with rational weights has been shown to be more cost-effective than the conventional TLG implementation without rational weights. The existing method for rational-weight TLG identification is an integer linear programming (ILP)-based method, which can suffer from inefficiency for a Boolean function with a large number of inputs. This paper presents a heuristic for rational-weight TLG identification. We observe that in the ILP formulation, many variables related to the rational weights are redundant according to the ILP solutions. Additionally, a rational-weight TLG can be transformed from a conventional TLG. Thus, the proposed method aims to identify the conventional TLG that can be transformed into a rational-weight TLG with lower implementation cost. We conducted experiments on a set of TLGs with 4 to 15 inputs. The results show that the proposed method offers competitive quality and is much more efficient compared to the ILP-based method. |
16:48 CET | SD7.7 | FAST STA GRAPH PARTITIONING FRAMEWORK FOR MULTI-GPU ACCELERATION Speaker: Tsung-Wei Huang, University of Utah, US Authors: Guannan Guo1, Tsung-Wei Huang2 and Martin Wong3 1University of Illinois at Urbana-Champaign, US; 2University of Utah, US; 3The Chinese University of Hong Kong, HK Abstract Path-based Analysis (PBA) is a key process in Static Timing Analysis (STA) to reduce excessive slack pessimism. However, PBA can easily become the major performance bottleneck due to its extremely long execution time. To overcome this bottleneck, recent STA research has proposed to accelerate PBA algorithms with many-core CPU and GPU parallelism. However, GPU memory is rather limited when we compute PBA on large industrial designs with millions of gates. In this work, we introduce a new endpoint-oriented partitioning framework that can separate STA graphs and dispatch the PBA workload onto multiple GPUs. Our framework can quickly identify logic overlaps among endpoints and group endpoints based on the size of shared logic. We then recover graph partitions from the endpoint groups and offload independent PBA workloads to multiple GPUs. Experiments show that our framework can greatly accelerate the PBA process on designs with over 10M gates. |
16:51 CET | SD7.8 | TOFU: A TWO-STEP FLOORPLAN REFINEMENT FRAMEWORK FOR WHITESPACE REDUCTION Speaker: Shixiong Kai, Huawei Noah's Ark Lab, CN Authors: Shixiong Kai1, Chak-Wa Pui2, Fangzhou Wang3, Jiang Shougao4, Bin Wang1, Yu Huang5 and Jianye Hao6 1Huawei Noah's Ark Lab, CN; 2UniVista, CN; 3The Chinese University of Hong Kong, HK; 4HiSilicon, CN; 5HiSilicon, CN; 6Tianjin University, CN Abstract Floorplanning, as an early step in physical design, greatly affects the PPA of the later stages. To achieve better performance while maintaining relatively the same chip size, the utilization of the generated floorplan needs to be high, and constraints related to design rules, routability, and power should be honored. In this paper, we propose a two-step framework, called TOFU, for floorplan whitespace reduction with fixed-outline and soft/pre-placed/hard modules modeled. Whitespace is first reduced by iteratively refining the locations of modules. Then the modules near whitespace are changed into rectilinear shapes to further improve the utilization. To ensure the legality and quality of the intermediate floorplan during the refinement process, a constraint graph-based legalizer with a novel constraint graph construction method is proposed. Experimental results show that the whitespace of the initial floorplans generated by Corblivar can be reduced by about 70% on average and by up to 90% in several cases. Moreover, the resulting wirelength is also 3% shorter due to higher utilization. |
Extended Abstracts
Time | Label | Presentation Title Authors |
---|---|---|
16:54 CET | SD7.10 | ROUTABILITY PREDICTION USING DEEP HIERARCHICAL CLASSIFICATION AND REGRESSION Speaker: Daeyeon Kim, Pohang University of Science and Technology, KR Authors: Daeyeon Kim, Jakang Lee and Seokhyeong Kang, Pohang University of Science and Technology, KR Abstract Routability prediction can forecast the locations where design rule violations occur without routing and thus can speed up the design iterations by skipping the time-consuming routing tasks. This paper investigates (i) how to predict routability as a continuous value and (ii) how to improve the prediction accuracy for the minority samples. We propose a deep hierarchical classification and regression (HCR) model that can detect hotspots with the number of violations. The hierarchical inference flow can prevent the model from overfitting to the majority samples in imbalanced data. In addition, we introduce a training method for the proposed HCR model that uses Bayesian optimization to find the ideal modeling parameters quickly and incorporates transfer learning for the regression model. We achieved an R2 score of 0.71 for the regression and increased the F1 score in the binary classification by 94% compared to previous work. |
16:54 CET | SD7.11 | EFFICIENT DESIGN RULE CHECKING WITH GPU ACCELERATION Speaker: Zhenhua Feng, Dalian University of Technology, CN Authors: Wei Zhong1, Zhenhua Feng1, Zhuolun He2, Weimin Wang1, Yuzhe Ma3 and Bei Yu2 1Dalian University of Technology, CN; 2The Chinese University of Hong Kong, HK; 3Hong Kong University of Science and Technology, CN Abstract Design Rule Checking (DRC) is an essential part of the chip design flow, which ensures that manufacturing requirements are met to avoid chip failure. With the rapid increase of design scales, DRC has been suffering from runtime overhead. To overcome this challenge, we propose to accelerate DRC algorithms by harnessing the power of graphics processing units (GPUs). Specifically, we first explore an efficient data transfer approach for the geometry information of a layout. Then we investigate GPU-based scanline algorithms to accommodate both intra-polygon checking and inter-polygon checking based on the characteristics of the design rules. Experimental results show that the proposed GPU-accelerated method can substantially outperform a multi-threaded DRC algorithm using CPUs. Compared with the baseline with 24 threads, we achieve an average speedup of 36 times and 201 times for spacing rule checks and enclosing rule checks on a metal layer, respectively. |
16:54 CET | SD7.12 | MITIGATING LAYOUT DEPENDENT EFFECT-INDUCED TIMING RISK IN MULTI-ROW-HEIGHT DETAILED PLACEMENT Speaker: Li-Chen Wang, National Taiwan University of Science and Technology, TW Authors: Li-Chen Wang and Shao-Yun Fang, National Taiwan University of Science and Technology, TW Abstract With the development of advanced process technology, the electrical characteristic variation of MOSFET transistors has been seriously influenced by layout dependent effects (LDEs). Due to these LDEs, two cells of specific cell types may suffer from timing degradation when they are placed adjacently and closely with specific orientations. To mitigate the timing risk of critical paths and thus optimize the performance of a target design, this work proposes a dynamic programming (DP)-based method for multi-row-height detailed placement with cell flipping and cell shifting. Experimental results show the efficiency and effectiveness of the proposed DP-based approach. |
16:54 CET | SD7.13 | TWO-STAGE PCB ROUTING USING POLYGON-BASED DYNAMIC PARTITIONING AND MCTS Speaker: Youbiao He, Iowa State University, US Authors: Youbiao He1, Hebi Li2, Ge Luo2 and Forrest Sheng Bao3 1Iowa State University, US; 2Iowa State University, US; 3Iowa State University, US Abstract We propose a pad-focused, net-by-net, two-stage printed circuit board (PCB) routing approach comprising global routing using Monte Carlo tree search (MCTS) and detailed routing using A*. Compared with conventional PCB routing algorithms, our approach can route PCB components in both BGA and non-BGA packages. To minimize the gap between the global and detailed routing stages, a polygon-based dynamic routable region partitioning mechanism is introduced. Experimental results show that our approach outperforms state-of-the-art routers such as DeepPCB and FreeRouting in terms of success rate or wirelength. |
16:54 CET | SD7.14 | DEEPTH: CHIP PLACEMENT WITH DEEP REINFORCEMENT LEARNING USING A THREE-HEAD POLICY NETWORK Speaker: Dengwei Zhao, Shanghai Jiao Tong University, CN Authors: Dengwei Zhao, Shuai Yuan, Yanan Sun, Shikui Tu and Lei Xu, Shanghai Jiao Tong University, CN Abstract Modern very-large-scale integrated (VLSI) circuit placement with its huge state space is a critical task for achieving layouts with high performance. Recently, reinforcement learning (RL) algorithms have made a promising breakthrough, dramatically saving design time compared to human effort. However, previous RL-based works either require a large dataset of chip placements for pre-training or produce illegal final placement solutions. In this paper, DeepTH, a three-head policy gradient placer, is proposed to learn from scratch without the need for pre-training and to generate superior chip floorplans. A graph neural network is first adopted to extract the features from nodes and nets of chips for estimating the policy and value. To efficiently improve the quality of floorplans, a reconstruction head is employed in the RL network to recover the visual representation of the current placement, enriching the extracted features of the placement embedding. Besides, the reconstruction error is used as a bonus during training to encourage exploration while alleviating the sparse reward problem. Furthermore, expert knowledge of floorplanning preferences is embedded into the decision process to narrow down the potential action space. Experiment results on the ISPD2005 benchmark show that our method achieves a 19.02% HPWL improvement over the analytic placer DREAMPlace and at least a 19.89% improvement over state-of-the-art RL algorithms. |
SS1 Security of emerging technologies and machine learning
Add this session to my calendar
Date: Monday, 17 April 2023
Time: 16:30 CET - 18:00 CET
Location / Room: Okapi Room 0.8.2
Session chair:
Giorgio Di Natale, TIMA, FR
16:30 CET until 16:57 CET: Pitches of regular papers
16:57 CET until 18:00 CET: Interactive technical presentations by the authors of regular papers and extended abstracts
Regular Papers
Time | Label | Presentation Title Authors |
---|---|---|
16:30 CET | SS1.1 | PRIVACY-PRESERVING NEURAL REPRESENTATION FOR BRAIN-INSPIRED LEARNING Speaker: Mohsen Imani, University of California, Irvine, US Authors: Javier Roberto Rubalcava-Cortes1, Alejandro Hernandez Cano1, Alejandra Citlalli Pacheco Tovarm1, Farhad Imani2, Rosario Cammarota3 and Mohsen Imani4 1Universidad Nacional Autonoma de Mexico, MX; 2University of Connecticut, US; 3Intel Labs, US; 4University of California, Irvine, US Abstract In this paper, we propose BIPOD, a brain-inspired privacy-oriented machine learning approach. Our method rethinks privacy-preserving mechanisms by looking at how the human brain provides effective privacy with minimal cost. BIPOD exploits hyperdimensional computing (HDC) as a neurally-inspired computational model. HDC is motivated by the observation that the human brain operates on high-dimensional data representations. In HDC, objects are encoded with high-dimensional vectors, called hypervectors, which have thousands of elements. BIPOD exploits this encoding as a holographic projection with both cryptographic and randomization-based features. BIPOD encoding is performed using a set of brain keys that are generated randomly. Therefore, attackers cannot obtain encoded data without accessing the encoding keys. In addition, revealing the encoding keys does not directly translate to information loss. We enhance the BIPOD encoding method to mathematically create perturbation on encoded neural patterns to ensure that only a limited amount of information can be extracted from the encoded data. Since BIPOD encoding is part of the learning process, it can be optimized jointly to provide the best trade-off between accuracy, privacy, and efficiency. Our evaluation on a wide range of applications shows that BIPOD privacy-preserving techniques result in 11.3× higher information privacy with no loss in classification accuracy. In addition, at the same quality of learning, BIPOD provides significantly higher information privacy compared to state-of-the-art privacy-preserving techniques. |
16:33 CET | SS1.2 | EXPLOITING SHORT APPLICATION LIFETIMES FOR LOW COST HARDWARE ENCRYPTION IN FLEXIBLE ELECTRONICS Speaker: Nathaniel Bleier, University of Illinois at Urbana-Champaign, US Authors: Nathaniel Bleier1, Muhammad Mubarik1, Suman Balaji2, Francisco Rodriguez2, Antony Sou2, Scott White2 and Rakesh Kumar1 1University of Illinois at Urbana-Champaign, US; 2PragmatIC Semiconductor, GB Abstract Many emerging flexible electronics applications require hardware-based encryption, but it is unclear if practical hardware-based encryption is possible for flexible applications due to the stringent power requirements of these applications and the higher area and power overheads of flexible technologies. In this work, we observe that the lifetime of many flexible applications is so short that often one key suffices for the entire lifetime. This means that, instead of generating keys and round keys in hardware, we can generate the round keys offline and store these round keys directly in the engine. This eliminates the need for hardware for dynamic generation of round keys, which significantly reduces encryption overhead. This significant reduction in encryption overhead allows us to demonstrate the first practical flexible encryption engines. To prevent an adversary from reading out the stored round keys, we scramble the round keys before storing them in the ROM; camouflage cells are used to unscramble the keys before feeding them to logic. In spite of the unscrambling overhead, our encryption engines consume 27.4% lower power than the already heavily area- and power-optimized baselines, while being 21.9% smaller on average. |
16:36 CET | SS1.3 | ATTACKING RERAM-BASED ARCHITECTURES USING REPEATED WRITES Speaker: Biresh Kumar Joardar, University of Houston, US Authors: Biresh Kumar Joardar1 and Krishnendu Chakrabarty2 1University of Houston, US; 2Duke University, US Abstract Resistive random-access memory (ReRAM) is a promising technology both for memory and for in-memory computing. However, these devices have security vulnerabilities that are yet to be adequately investigated. In this work, we identify one such vulnerability that exploits the write mechanism in ReRAMs. Whenever a cell/row is written, a constant bias is automatically applied to the remaining cells/rows to reduce sneak current. We develop a new attack (referred to as WriteHammer) that exploits this process. By repeatedly exposing a subset of cells to this bias, WriteHammer can cause noticeable resistance drift in the victim ReRAM cells. Experimental results indicate that WriteHammer can cause up to a 3.5X change in cell resistance by simply writing to the ReRAM cells. |
16:39 CET | SS1.4 | SECURITY EVALUATION OF A HYBRID CMOS/MRAM ASCON HARDWARE IMPLEMENTATION Speaker: Nathan Roussel, Mines Saint-Etienne, FR Authors: Nathan Roussel, Olivier Potin, Jean-Max Dutertre and Jean-Baptiste Rigaud, Mines Saint-Etienne, FR Abstract As the number of IoT objects grows fast, power consumption and security become a major concern in the design of integrated circuits. Lightweight Cryptography (LWC) algorithms aim to secure the communications of these connected objects at the lowest energy impact. To reduce the energy footprint of cryptographic primitives, several LWC hardware implementations embedding hybrid CMOS/MRAM-based cells have been investigated. These architectures use the non-volatile characteristic of MRAM to store data manipulated in the algorithm computation. We provide in this work a security evaluation of a hybrid CMOS/MRAM hardware implementation of the ASCON cipher, a finalist of the National Institute of Standards and Technology LWC contest. We focus on a simulation flow using current EDA tools capable of carrying out power analysis for side-channel attacks, for the purpose of assessing potential weaknesses of MRAM hybridization. Differential Power Analysis (DPA) and Correlation Power Analysis (CPA) are conducted on the post-route, parasitic-annotated netlist of the design. The results show that the hybrid implementation does not significantly lower the security compared to a reference CMOS implementation. |
16:42 CET | SS1.5 | MANTIS: MACHINE LEARNING-BASED APPROXIMATE MODELING OF REDACTED INTEGRATED CIRCUITS Speaker: Benjamin Carrion Schaefer, University of Texas at Dallas, US Authors: Chaitali Sathe, Yiorgos Makris and Benjamin Carrion Schaefer, University of Texas at Dallas, US Abstract With most VLSI design companies now being fabless, it is imperative to develop methods to protect their Intellectual Property (IP). One approach that has become very popular due to its relative simplicity and practicality is logic locking. One of the problems with traditional locking mechanisms is that the locking circuitry is built into the netlist that the VLSI design company delivers to the foundry, which then has access to the entire design including the locking mechanism. This implies that they could potentially tamper with this circuitry or reverse engineer it to obtain the locking key. One relatively new approach, coined hardware redaction, is to map a portion of the design to an embedded FPGA (eFPGA). The bitstream of the eFPGA now acts as the locking key. The fab receives the design without the bitstream and hence cannot reverse engineer the functionality of the design. The obvious drawbacks are the increase in design complexity and the area and performance overheads associated with the eFPGA. In this work we propose, to the best of our knowledge, the first attack on this type of new locking mechanism by substituting the exact logic mapped onto the eFPGA with a synthesizable predictive model that replicates the behavior of the exact logic. We show that this approach is especially applicable in the context of approximate computing, where hardware accelerators tolerate a certain degree of error at their outputs. Some examples include Digital Signal Processing (DSP) or image processing applications. Experimental results show that our proposed approach is very effective in finding suitable predictive models. |
16:45 CET | SS1.6 | LONG RANGE DETECTION OF EMANATION FROM HDMI CABLES USING CNN AND TRANSFER LEARNING Speaker: Shreyas Sen, Purdue University, US Authors: Md Faizul Bari, Meghna Roy Chowdhury and Shreyas Sen, Purdue University, US Abstract The transition of data and clock signals between high and low states in electronic devices creates electromagnetic radiation according to Maxwell's equations. These unintentional emissions, called emanation, may have a significant correlation with the original information-carrying signal and form an information leakage source, bypassing secure cryptographic methods at both hardware and software levels. Information extraction exploiting compromising emanations poses a major threat to information security. Shielding the devices and cables along with setting a control perimeter for a sensitive facility are the most commonly used preventive measures. These countermeasures raise the research need for the longest detection range of exploitable emanation and the efficacy of commercial shielding. In this work, using data collected from 3 types of commercial HDMI cables (unshielded, single-shielded, and double-shielded) in an office environment, we have shown that the CNN-based detection method outperforms the traditional threshold-based detection method and improves the detection range from 4 m to 22.5 m for an iso-accuracy of ~95%. Also, for an iso-distance of 16 m, the CNN-based method provides ~100% accuracy, compared to ~88.5% using the threshold-based method. The significant performance boost is achieved by treating the FFT plots as images and training a residual neural network (ResNet) with the data so that it learns to identify the impulse-like emanation peaks even in the presence of other interfering signals. A comparison has been made among the emanation power from the 3 types of HDMI cables to judge the efficacy of multi-layer shielding. Finally, a distinction has been made between monitor contents, i.e., still image vs video, with an accuracy of 91.7% at a distance of 16 m. This distinction bridges the gap between emanation-based image and video reconstruction algorithms. |
16:48 CET | SS1.7 | ADVERSARIAL ATTACK ON HYPERDIMENSIONAL COMPUTING-BASED NLP APPLICATIONS Speaker: Sizhe Zhang, Villanova University, US Authors: Sizhe Zhang1, Zhao Wang2 and Xun Jiao1 1Villanova University, US; 2University of Chicago, US Abstract The security and robustness of machine learning algorithms have become increasingly important as they are used in critical applications such as natural language processing (NLP), e.g., text-based spam detection. Recently, the emerging brain-inspired hyperdimensional computing (HDC) has, compared to deep learning methods, shown advantages such as compact model size, energy efficiency, and capability of few-shot learning in various NLP applications. While HDC has been demonstrated to be vulnerable to adversarial attacks on image and audio input, there is currently very limited study of its adversarial security for NLP tasks, which is arguably one of the most suitable applications for HDC. In this paper, we present a novel study on the adversarial attack of HDC-based NLP applications. By leveraging a unique property of HDC, similarity-based inference, we propose similarity-guided approaches to automatically generate adversarial text samples for HDC. Our approach is able to achieve up to an 89% attack success rate. More importantly, compared with an unguided brute-force approach, the similarity-guided attack achieves a speedup of 2.4X in generating adversarial samples. Our work opens up new directions and challenges for future adversarially-robust HDC model design and optimization. |
16:51 CET | SS1.8 | A PRACTICAL REMOTE POWER ATTACK ON MACHINE LEARNING ACCELERATORS IN CLOUD FPGAS Speaker: Russell Tessier, University of Massachusetts, Amherst, US Authors: Shanquan Tian1, Shayan Moini2, Daniel Holcomb2, Russell Tessier2 and Jakub Szefer1 1Yale University, US; 2University of Massachusetts Amherst, US Abstract The security and performance of FPGA-based accelerators play vital roles in today's cloud services. In addition to supporting convenient access to high-end FPGAs, cloud vendors and third-party developers now provide numerous FPGA accelerators for machine learning models. However, the security of accelerators developed for state-of-the-art Cloud FPGA environments has not been fully explored, since most remote accelerator attacks have been prototyped on local FPGA boards in lab settings, rather than in Cloud FPGA environments. To address existing research gaps, this work analyzes three existing machine learning accelerators developed in Xilinx Vitis to assess the potential threats of power attacks on accelerators in Amazon Web Services (AWS) F1 Cloud FPGA platforms, in a multi-tenant setting. The experiments show that malicious co-tenants in a multi-tenant environment can instantiate voltage sensing circuits as register-transfer level (RTL) kernels within the Vitis design environment to spy on co-tenant modules. A methodology for launching a practical remote power attack on Cloud FPGAs is also presented, which uses an enhanced time-to-digital (TDC) based voltage sensor and auto-triggered mechanism. The TDC is used to capture power signatures, which are then used to identify power consumption spikes and observe activity patterns involving the FPGA shell, DRAM on the FPGA board, or the other co-tenant victim's accelerators. Voltage change patterns related to shell use and accelerators are then used to create an auto-triggered attack that can automatically detect when to capture voltage traces without the need for a hard-wired synchronization signal between victim and attacker. To address the novel threats presented in this work, this paper also discusses defenses that could be leveraged to secure multi-tenant Cloud FPGAs from power-based attacks. |
16:54 CET | SS1.9 | SCALABLE SCAN-CHAIN-BASED EXTRACTION OF NEURAL NETWORK MODELS Speaker: Shui Jiang, The Chinese University of Hong Kong, HK Authors: Shui Jiang1, Seetal Potluri2 and Tsung-Yi Ho1 1The Chinese University of Hong Kong, HK; 2North Carolina State University, US Abstract Scan chains have greatly improved hardware testability while introducing security breaches for confidential data. Scan-chain attacks have extended their scope from cryptoprocessors to AI edge devices. The recently proposed scan-chain-based neural network (NN) model extraction attack (ICCAD 2021) made it possible to achieve fine-grained extraction and is multiple orders of magnitude more efficient in both queries and accuracy than its coarse-grained mathematical counterparts. However, both the query formulation complexity and constraint solver failures increase drastically with network depth/size. We demonstrate a more powerful adversary, who is capable of improving scalability while maintaining accuracy, by relaxing high-fidelity constraints to formulate an approximate-fidelity-based layer-constrained least-squares extraction using random queries. We conduct our extraction attack on neural network inference topologies of different depths and sizes, targeting the MNIST digit recognition task. The results show that our method outperforms the scan-chain attack proposed at ICCAD 2021 by an average increase in the extracted neural network's functional accuracy of ≈ 32% and a 2–3 order-of-magnitude reduction in queries. Furthermore, we demonstrate that our attack is highly effective even in the presence of countermeasures against adversarial samples. |
Extended Abstracts
Time | Label | Presentation Title Authors |
---|---|---|
16:57 CET | SS1.10 | COMPREHENSIVE ANALYSIS OF HYPERDIMENSIONAL COMPUTING AGAINST GRADIENT BASED ATTACKS Speaker: Hamza Errahmouni Barkam, University of California, Irvine, US Authors: Hamza Errahmouni Barkam1, SungHeon Jeong2, Calvin Yeung1, Zhuowen Zou1, Xun Jiao3 and Mohsen Imani1 1University of California, Irvine, US; 2University of California, Irvine, US; 3Villanova University, US Abstract Brain-inspired hyperdimensional computing (HDC) has recently shown promise as a lightweight machine learning approach. HDC models could become the solution to the security aspect of critical applications, such as self-driving cars. Despite its success, there are limited studies on the robustness of HDC models to adversarial attacks. In this paper, we introduce the first comprehensive study that compares the robustness of HDC to malicious attacks with that of deep neural network (DNN) models. We develop a framework that enables HDC models to generate gradient-based adversarial examples using state-of-the-art techniques applied to DNNs. We explore different hyperparameters and HDC architectures and design mechanisms to protect HDC models against malicious attacks; these mechanisms include data pre-processing and adversarial training. Our evaluation shows that HDC with a proper neural encoding module provides significantly higher robustness to adversarial attacks than existing DNNs. In addition, HDC models have high robustness to adversarial samples generated for DNNs. Our study also indicates that the proposed defense mechanisms can further protect HDC models and potentially increase this technology's viability in safety-critical applications. Our evaluation shows that our HDC model provides, on average, 19.9% higher robustness than DNNs to adversarial samples. |
YPPK Young People Programme – Keynote on career opportunities
Add this session to my calendar
Date: Monday, 17 April 2023
Time: 17:30 CET - 18:00 CET
Location / Room: Marble Hall
Session chair:
Anton Klotz, Cadence Design Systems, DE
This is a Young People Programme event
Time | Label | Presentation Title Authors |
---|---|---|
17:30 CET | YPPK.1 | 3D INTEGRATION: OPPORTUNITIES & CHALLENGES FOR SYSTEM ARCHITECTURE TECHNOLOGY CO-OPTIMIZATION Presenter: Dragomir Milojevic, IMEC, BE Author: Dragomir Milojevic, IMEC, BE Abstract Today there is a consensus that the future of IC design and manufacturing will combine CMOS scaling, to eventually reach 1nm and beyond, with advanced 3D integration packaging. Individual dies will be manufactured using new device architectures, together with so-called performance boosters, to still enable node-to-node gains even with more modest gate/metal pitch scaling factors. Future ICs will integrate in the same IC package multiple dies manufactured using different processes (heterogeneous integration), each optimized for a given functionality (e.g., analogue, lower cache levels, high-capacity memories, high-performance logic, etc.). To enable die-to-die connectivity, different 3D integration technologies will be required (TSVs, front-side and back-side bumps, optical interconnects, etc.) with properties optimized to match the performance, energy, and bandwidth requirements of the die-to-die interconnect. But the future will probably not limit itself to technology improvements only. The holy grail of next-generation ICs will most likely be that the above-mentioned technology ingredients can be used to re-design the system architecture from scratch, allowing unprecedented gains in power, performance, and area. Thus, the parameter space of traditional SoC design will increase further, making exploration and design choices much harder to make (number and type of cores, memory hierarchy configuration, interconnect design and configuration, etc.). To enable such system design, novel methods will be required to allow so-called System Technology Co-Optimization (STCO), a paradigm in which the good old "divide and conquer" approach should be abandoned in favour of a more holistic system architecture-design-technology interaction. In this talk we will provide an overview of next-generation challenges for system architecture design and practical implementation through EDA and process technology. Ultimately, the goal of the presentation will be to point out the incredible opportunities offered by this paradigm change for future research & development in the field. |
ASD4 ASD Panel session: Autonomous Systems Design as a Driver of Innovation?
Add this session to my calendar
Date: Monday, 17 April 2023
Time: 18:30 CET - 20:00 CET
Location / Room: Gorilla Room 1.5.4/5
Session chair:
Rasmus Adler, Fraunhofer IESE, DE
Panellists:
Karl-Erik Arzen, Lund University, SE
Martin Fränzle, Carl von Ossietzky Universität, DE
Arne Hamann, Robert Bosch GmbH, DE
Davy Pissoort, KU Leuven, BE
Claus Bahlmann, Siemens, DE
Christoph Schulze, The Autonomous, AT
Presenter:
Karl-Erik Arzen, Lund University, SE
Autonomous systems have high potential in many application domains. However, most discussions seem to take place with respect to autonomous road vehicles. The automotive industry promised substantial progress in this field, but many predictions have not come true, and companies stepped back and corrected their predictions. Does this mean that systems autonomy is not ready to drive innovation? Autonomous behavior is obviously not limited to road vehicles: various kinds of systems can benefit from it in domains such as health and pharmaceutics, energy, manufacturing, farming, mining, and so on. In this session, we will thus take a broader perspective on autonomous systems design as a driver of innovation and discuss benefits, challenges, and risks in various application domains.
PhDF PhD Forum
Add this session to my calendar
Date: Monday, 17 April 2023
Time: 18:30 CET - 20:00 CET
Location / Room: Atrium
The PhD Forum is a great opportunity for PhD students to present their work to a broad audience in the system design and design automation community from both industry and academia, and to establish contacts for entering the job market. Representatives from industry and academia get a glance of the state of the art in system design and design automation. The PhD Forum is hosted by EDAA, ACM SIGDA and IEEE CEDA.
Time | Label | Presentation Title Authors |
---|---|---|
18:30 CET | PhDF.1 | ROBUST AND EFFICIENT MACHINE LEARNING FOR EMERGING RESOURCE-CONSTRAINED EMBEDDED SYSTEMS Speaker: Mikail Yayla, TU Dortmund, DE Authors: Mikail Yayla and Jian-Jia Chen, TU Dortmund, DE Abstract This thesis proposes a vision for highly resource-efficient future intelligent systems that are comprised of robust Binary Neural Networks (BNNs) operating with approximate memory and approximate computing units, while being able to be trained on the edge. The studies conducted within the scope of the thesis are summarized in three sections. In the first section, we present how BNNs can be optimized for robustness by margin maximization. In the second section, we present three studies on HW/SW codesign methods that exploit the error tolerance of BNNs for efficient inference. In the third section, we summarize our method for enabling the memory-efficient training of BNNs. |
18:30 CET | PhDF.2 | FORMAL AND PRACTICAL TECHNIQUES FOR THE VIRTUAL PROTOTYPE DRIVEN SYSTEM DESIGN PROCESS Presenter: Pascal Pieper, DFKI, DE Author: Pascal Pieper, DFKI, DE Abstract Modern SoC designs are produced in increasingly faster cycle times, while their complexity rises and costs must continuously decrease. To cope with this high demand and the pressure on a manufacturer's ability to maintain a reliable and secure end-product, a Virtual Prototype (VP) based design process is widely used in industry. A VP creates the possibility to design, evaluate and verify an executable prototype of the system in an early design stage by modelling the future hardware on a behavioral or structural level. In contrast to more traditional design flows like hardware-then-software, this enables both the iterative design evaluation and a parallel development of the (actual) hardware and software very early in the product conception phase. Additionally, after development of the lower-level hardware stages (e.g., register-transfer level, gate level, or physical hardware), VPs can be used as golden reference models with test and verification methods for comparison between the system-level behaviour and the actual hardware. For this to work, however, the VP and its components need to be verified in the first place. In this thesis, several techniques are proposed to improve and strengthen the VP-based design process, covering modeling and verification of security properties and hardware behaviour, as well as novel debugging, analysis and educational tools. The main goal of this thesis is to both improve existing processes and state-of-the-art tools, and to showcase new approaches to handle and verify complex systems at the hardware, software and intermediate levels. |
18:30 CET | PhDF.3 | ON THE ROLE OF RECONFIGURABLE SYSTEMS IN DOMAIN-SPECIFIC COMPUTING Speaker and Author: Davide Conficconi, Politecnico di Milano, IT Abstract Introduction The computer architecture field faces technological and architectural obstacles that limit general-purpose processor scaling in delivered performance at a reasonable energy cost. Therefore, computer architects have to follow novel paths to harvest more energy-efficient computation from the currently available technology, for instance, by employing domain-specialized solutions for a given scenario. The domain specialization path builds on a comprehensive environment where hardware and software are both specialized towards a particular application domain rather than being general purpose. Domain-Specific Architectures (DSAs) are generally the prominent exponent of hardware-centric domain specialization. DSAs leverage an abstraction layer such as an Instruction Set Architecture (ISA) and employ the easiest yet advanced computer architecture techniques to build a fixed datapath with the simplest data type and size. Generally, DSAs are thought to be efficiently implemented as Application-Specific Integrated Circuits (ASICs) or as part of a System on Chip (SoC). However, developing custom silicon devices is a time-consuming and costly process that is not always compatible with the time-to-market and fast evolution of the applications, which may require additional datapath customization. Thus, adaptable computing platforms represent the most viable alternative for these scenarios. Field-Programmable Gate Arrays (FPGAs) are the candidate platforms thanks to their on-field reconfigurable heterogeneous fabric. On top of the reconfigurability, FPGAs can implement large spatial computing designs and are publicly available on cloud computing platforms. Domain-Specific Reconfigurable Architectures FPGAs (and all reconfigurable systems) deserve a deeper analysis of their role in the domain specialization path, despite being the commercial platform closest to the ideal adaptable computing paradigm. Indeed, they can implement domain-specialized architectures that can be updated after field deployment, delivering variable datapaths which are adaptable almost an infinite number of times. Here, I call them Domain-Specific Reconfigurable Architectures (DSRAs). Employing Reconfigurable Computing (RC) systems (such as FPGAs) opens a wide variety of architectural organizations different from traditional CPUs with their fixed datapath. This thesis classifies them along two orthogonal characteristics: level of software programmability and datapath configurability. The most traditional is the DSA based on a "fixed" datapath with a dedicated ISA that communicates with instruction and data memories (simply called DSA from now on). Then, streaming architectures have fixed datapaths for each class of problems, generally devised from a high-level tool that automates the whole process. Finally, the third architecture organization combines a semi-fixed datapath with a streaming architecture, creating a Coarse-Grained Reconfigurable Architecture (CGRA). Figure 1 represents the three main DSRA classes. Although there are exciting research efforts on CGRAs, they are still immature; hence, I will focus mainly on traditional and streaming DSRAs.
This thesis defines and analyzes specialized computer architecture organizations based on reconfigurable platforms, called DSRAs, and addresses three main topics for each specific domain: design methodologies, automation, and usability. The first one (i.e., the design methodologies) is crucial for designing highly energy-efficient architectures; automation is essential for fast iterative development of new solutions and for reproducibility of the achieved results; the last one (i.e., usability) encompasses software programmability in a complete view that spans from hardware-software interfacing to ways of programming the architecture. My dissertation builds on systematic reviews of the latest system-level trends in reconfigurable systems [3] and the latest ways of designing digital systems for FPGAs [5]. Then, I explore two domains that mirror the corresponding DSRA class characteristics: one context-specific streaming architecture that could benefit from an automation toolchain, and another that presents different execution models suited to various application features. In the most synthetic view, my dissertation contributions are: 1) An analysis of the latest reconfigurable system-level trends with a taxonomy of domain-specific reconfigurable computer organizations [3]; 2) A survey with taxonomies and timelines of the most prominent digital design abstractions for FPGAs [5]; 3) An open-source design automation framework for highly customizable streaming-dataflow domain-specialized accelerators, proven on the Image Registration domain [1]; 4) An exploration of different computational models and forms of parallelism for the Regular Expressions (or equivalently Finite State Machines) domain for traditional DSAs (depth-first [2] and breadth-first [4]). Open Source Design Automation Framework For Streaming DSRAs Image Registration (IRG) is an essential pre-processing step of several image processing pipelines. However, it is often neglected for its context-specific nature, which would require a different architecture for different contexts. Therefore, this thesis presents a comprehensive framework based on the streaming architectural pattern with a dataflow MapReduce approach, shown in Figure 2. To complete the DSRAs, a design automation toolchain lowers the effort of adapting the architecture to unexpected contexts or new devices, and a software abstraction layer hides the low-level hardware interfacing mechanisms to expose simpler software APIs. All these components achieve significantly optimized IRG procedures at a lower energy profile [1]. Different Computational Models of Loopback-based DSRAs Particular domains may present more than a single computational pattern that fits the design process of a DSRA and different applications. For instance, the Regular Expressions (REs) domain presents intrinsically sequential computations that can leverage either a depth-first or a breadth-first execution model. Within this context, this thesis presents two different architectures, shown in Figure 3, that explore these different computational patterns and their respective programming abstractions. They exploit the idea of using REs as the programming language of a DSA and share the automation methodology built out of the IRG domain. The two DSAs achieve impressive performance and energy efficiency results, although their improvements are application sensitive [2,4]. |
18:30 CET | PhDF.4 | PHD FORUM ABSTRACT: CO-OPTIMIZATION OF NEURAL NETWORKS AND HARDWARE ARCHITECTURES FOR THEIR EFFICIENT EXECUTION Speaker and Author: Cecilia Latotzke, RWTH Aachen University, DE Abstract Convolutional Neural Networks (CNNs) are ubiquitously used on edge devices, because of their high classification accuracy. However, CNNs with high classification accuracy usually have a high memory footprint. This memory footprint causes high energy costs, which is a challenge for edge devices. Reducing the memory footprint by means of pruning or quantization can reduce accuracy. Meanwhile, most tasks do not accept a degradation in classification accuracy. This dissertation investigates the research question of how to enable the inference of CNNs efficiently and with high accuracy. |
18:30 CET | PhDF.5 | ACCELERATING MEMORY INTENSIVE ALGORITHMS AND APPLICATIONS USING IN-MEMORY COMPUTING Presenter: Ann Franchesca Laguna, De La Salle University, PH Author: Ann Franchesca Laguna, De La Salle University, PH Abstract Data-intensive applications do not fully utilize the computing capabilities of Von Neumann architectures because of the memory bandwidth bottleneck. These memory-bandwidth-limited applications can be accelerated by minimizing the data movement between the memory and the compute units through in-memory computing (IMC). Using IMC, this work accelerated four different types of applications and algorithms. |
18:30 CET | PhDF.6 | LOGIC SYNTHESIS FOR ADIABATIC QUANTUM-FLUX PARAMETRON CIRCUITS CONSIDERING TECHNOLOGY-SPECIFIC COSTS Speaker: Siang-Yun Lee, EPFL, CH Authors: Siang-Yun Lee and Giovanni De Micheli, EPFL, CH Abstract Adiabatic quantum-flux parametron (AQFP) is a next-generation superconducting electronic technology featuring ultra-low energy consumption. While the computation paradigm remains the same as classical digital logic families, the AQFP technology has unconventional properties to be considered in design automation. This thesis is divided into two parts. First, on a technology-independent level, a scalable logic synthesis framework is presented along with a specialized resynthesis algorithm targeting majority-based circuits. Whereas the former is general purpose, the latter is especially important for AQFP circuit optimization because the basic computing unit in AQFP is the majority gate. Second, two design constraints imposed by AQFP, namely, path balancing and fanout branching, are tackled. Additional buffers need to be inserted on shorter paths and splitters have to be inserted at the output of multi-fanout gates to fulfill these constraints, which occupy large area in AQFP circuits. We study the optimality of the buffer and splitter insertion problem and propose both exact and heuristic methods to minimize this additional cost. |
18:30 CET | PhDF.7 | MODERN HIGH-LEVEL SYNTHESIS: IMPROVING PRODUCTIVITY WITH A MULTI-LEVEL APPROACH Speaker and Author: Serena Curzel, Politecnico di Milano, IT Abstract High-Level Synthesis (HLS) tools simplify the design of hardware accelerators by automatically generating Verilog/VHDL code starting from a general purpose software programming language, usually C/C++. They include a wide range of optimization techniques in the process, most of them performed on a low-level intermediate representation (IR) of the code. Because of the mismatch between the requirements of hardware descriptions and the characteristics of input languages, HLS tools often rely on users to add specific directives (pragmas) that augment the input specification to guide the generation of optimized hardware. A good result thus still requires hardware design knowledge and non-trivial design space exploration, which might be an obstacle for domain scientists seeking to accelerate applications written, for example, in Python-based programming frameworks. This thesis proposes a modern approach based on multi-level compiler technologies to bridge the gap between HLS and high-level frameworks, and to use domain-specific abstractions to solve domain-specific problems. The key enabling technology is the Multi-Level Intermediate Representation (MLIR), a framework that supports building reusable compiler infrastructure inspired by (and part of) the LLVM project. The proposed approach uses MLIR to introduce new optimizations at appropriate levels of abstraction outside the HLS tool while still relying on years of HLS research in the low-level hardware generation steps; users and developers of HLS tools can thus increase their productivity, obtain accelerators with higher performance, and not be limited by the features of a specific (possibly closed-source) backend. The presented tools and techniques were designed, implemented, and tested to synthesize machine learning algorithms, but they are broadly applicable to any input specification written in a language that has a translation to MLIR. Generated accelerators can be deployed on Field Programmable Gate Arrays or Application-Specific Integrated Circuits, and they can reach ~10-100 GFLOPS/W efficiency without any manual optimization of the code. |
18:30 CET | PhDF.8 | FAST BAYESIAN ALGORITHMS FOR FPGA PLATFORMS Speaker and Author: Raissa Likhonina, Academy of Sciences, Institute of Information Theory and Automation, CZ Abstract The PhD thesis was devoted to fast Bayesian algorithms, more precisely to the QRD RLS Lattice algorithm combined with hypothesis testing and applied to hand detection problem solution based on ultrasound technology. Due to the proposed structure of regression models and the offered approach to hypothesis testing in the work, the algorithm under consideration is able to solve the problem of noise cancellation and additionally to compute the distance between the hand and the device; thus, potentially enabling to identify simple gestures. Further, the algorithm was implemented in parallel on the HW platform of Xilinx Zynq Ultrascale+ device with a quad-core ARM Cortex A53 processor and FPGA programmable logic and proved to function reliably and accurately in real time using real data from an ultrasound microphone. The work contains an investigation of the state of the art in the corresponding field and gives the theoretical background necessary for the development and modification of the algorithm to fulfill the goals of the thesis. The thesis also includes thorough description of experiments and an analysis of the results including those from simulation and from computation using real ultrasound data both in the MATLAB R2019b environment and on the HW platform of Xilinx Zynq Ultrascale+. |
18:30 CET | PhDF.9 | VIRTUAL PROTOTYPE CENTRIC VERIFICATION FOR EMBEDDED SYSTEM DEVELOPMENT Speaker and Author: Niklas Bruns, University of Bremen, DE Abstract Nowadays, a world without embedded systems cannot be imagined. Embedded systems are widespread in consumer electronics as well as in the automotive sector. The high diversity of products leads to various requirements for the underlying embedded systems. For embedded system development, it is crucial to have a short time-to-market (TTM) to persist in modern markets. In order to reduce the development time, the Virtual Prototype (VP) based design flow was established. The VP-based design flow enables parallelizing the hardware (HW) and software (SW) development. Nevertheless, parallelized development alone is not enough to guarantee a short TTM; efficient verification methodologies are also required. In this work, several novel approaches are proposed to improve the verification of embedded systems that are developed using a VP-based design flow. These approaches concentrate on the transitions between the development steps of the VP-based design flow. Just as in the VP-based development flow, the vital link between specification, HW, and SW is the VP. |
18:30 CET | PhDF.10 | OLYMPUS: DESIGN METHODS FOR SIMPLIFYING THE CREATION OF DOMAIN-SPECIFIC MEMORY ARCHITECTURES Speaker and Author: Stephanie Soldavini, Politecnico di Milano, IT Abstract Recently, hardware accelerators are becoming increasingly important, and the specialization of these accelerators means they can achieve high performance and energy efficiency. This specialization, however, means their design is complex and time-consuming, even more so in the case of modern big data and machine learning applications, where a huge amount of data needs to be processed. This complexity means the designer not only has to optimize the accelerator computation logic, but also has to carefully craft efficient memory architectures, which is not the case in traditional software design. The goal of this work is to address these challenges by reducing the manual steps a designer must perform to accelerate data-intensive applications by means of FPGAs. We aim to create a multi-level compilation flow that specializes a domain-specific memory template to match data, application, and technology requirements in order to simplify the hardware accelerator development process. In this thesis, I am developing Olympus, a set of methods for simplifying the creation of domain-specific memory architectures. With the currently implemented optimizations, Olympus is able to achieve a performance of up to 43 GFLOPS and an efficiency of 1.2 GFLOPS/W while using double-precision data, and up to 103 GOPS and 3.9 GOPS/W when using 32-bit fixed-point data. |
18:30 CET | PhDF.11 | EFFICIENT NEURAL ARCHITECTURES FOR EDGE DEVICES Speaker and Author: Dolly Sapra, University of Amsterdam, NL Abstract The rise of IoT networks, with numerous interconnected edge devices, has led to an increase in demand for intelligent data processing closer to the data source. Deployment of neural networks at the edge is desirable, though challenging, since edge devices have limited resources. The focus of this thesis is on neural architectures for Convolutional Neural Networks (CNNs) that execute on the edge. The thesis presents Evolutionary Piecemeal Training (EPT), an algorithm for an efficient Neural Architecture Search (NAS). This flexible algorithm treats NAS as an optimization problem with a variable number of possible objectives. To highlight the versatility of EPT, three different sets of experiments are shown in the thesis, with one, two and four objectives respectively. The multi-objective algorithm typically involves hardware-specific objectives in addition to the accuracy of the CNN to produce a Pareto-optimal set of neural architectures. Further, the thesis examines the adaptivity of CNN-based applications running at the edge. The first work is the Scenario Based Run-time Switching (SBRS) framework, where every scenario represents an operation mode and has an associated CNN. An application may switch between scenarios to allow synchronous adaptation to environmental changes. Additionally, a framework is presented to efficiently share and reuse CNNs in distributed IoT networks. This framework supports maintenance and adaptation of existing and deployed CNNs at the edge. To conclude, this thesis demonstrates various methodologies to improve the performance of a CNN deployed on a resource-constrained edge device. The key ideas include searching for an efficient neural architecture, adaptive applications with run-time CNN switching, and CNNs as dynamic entities in a distributed IoT network. The thesis is published at https://dare.uva.nl/search?identifier=03eff2c1-b5ab-4fc8-bfe6-046c0a929… |
18:30 CET | PhDF.12 | DESIGN AND IMPLEMENTATION OF PARALLEL AND APPROXIMATE MICROARCHITECTURES FOR TIGHTLY COUPLED PROCESSOR ARRAYS Speaker and Author: Marcel Brand, Friedrich-Alexander-Universität Erlangen-Nürnberg, DE Abstract With the decline of Moore's law, processor architecture design increasingly compensates for stagnating compute power with the parallelism of many- and multi-core systems. It therefore becomes increasingly important to have processing elements that are small yet powerful and that promote efficient coding and memory usage. Our work on Orthogonal Instruction Processing (OIP) and Anytime Instruction Processing (AIP) tackles this problem from several angles. With OIP, in contrast to well-known Very Long Instruction Word (VLIW) processor architectures, we can shrink software-pipelined application code down to 4.6% of the size of software-pipelined VLIW code and thus save memory that would be expensive in both area and power. AIP gives a programmer or compiler control over the accuracy of floating-point (FP) operations: the desired accuracy is encoded at bit granularity into the instruction, so that the executed operation computes only the requested number of most significant bits (MSBs) and may terminate earlier than a full-accuracy computation. The concept exploits the fact that many algorithms do not need to compute every instruction at full accuracy and trades accuracy off against execution time and power consumption. Anytime instructions prove especially useful for iterative algorithms such as square root or Jacobi, but also show benefits in other domains; for example, compared to regular FP operations, they can reduce the energy consumption of Convolutional Neural Network (CNN) inference by up to 62% without increasing the classification error rate. |
18:30 CET | PhDF.13 | OSCILLATORY NEURAL NETWORKS IMPLEMENTED ON FPGA FOR EDGE COMPUTING APPLICATIONS Speaker: Madeleine Abernot, LIRMM - University of Montpellier - CNRS, FR Authors: Madeleine Abernot and Aida Todri-Sanial, LIRMM, University of Montpellier, CNRS, FR Abstract This PhD work focuses on the Oscillatory Neural Network (ONN) computing paradigm for edge artificial intelligence applications. In particular, it uses a digital ONN design implemented on FPGA to explore novel ONN architectures, learning algorithms, and applications. First, a fully connected ONN architecture is used for pattern recognition, applied in this work to various edge applications such as image processing and robotics. Then, the work introduces layered ONN architectures for classification tasks, applied to image edge detection. |
18:30 CET | PhDF.14 | NOVEL CIRCUIT ARCHITECTURES FOR SCALABLE AND ADAPTIVE SENSOR READOUT Speaker and Author: Jonah Van Assche, KU Leuven, BE Abstract This PhD research investigates the design and modelling of new circuit architectures for sensing devices at the extreme edge, with a focus on biomedical sensing (e.g., ECG, EEG). Such edge devices require very long battery life on a limited energy budget provided by a small battery, while long monitoring periods are needed and the devices should be wirelessly connected to a base station or the cloud for further data processing. This research explores sensor systems that compress the signal directly while sampling it, in the mixed-signal domain, in order to lower the data rate and hence the system-level power consumption of edge devices. Two techniques in particular are studied: compressive sensing and event-based sampling. The research has three main objectives. The first is to develop high-level models that estimate power consumption directly while modelling the sensor readout circuits, without the need for circuit-level simulations; these modelling techniques were applied to a compressive sensing system and an event-based level-crossing ADC, showing that both techniques can yield large system-level power savings compared to a traditional sensor system. The second objective was to improve the circuit-level performance of level-crossing ADCs, which resulted in a prototype IC achieving state-of-the-art power efficiency and accuracy. The third objective is to validate this design in an application with a spiking neural network, showing that event-based sampling can lower not only the data rate but also the on-chip processing power. |
18:30 CET | PhDF.15 | A COMPLETE ASSERTION-BASED VERIFICATION FRAMEWORK FROM THE EDGE TO THE CLOUD Speaker and Author: Samuele Germiniani, Università di Verona, IT Abstract Since modern cyber-physical systems (CPSs) are increasingly complex and distributed, it is no longer appropriate to focus the verification process only on single components; instead, it is necessary to embrace holistic approaches that look at the entire system. To this end, it is crucial to consider an ecosystem of integrated tools interconnected in a complete supply chain: from the formalisation of specifications up to their run-time verification. Even though several tools have been proposed in the last few decades, no single framework can be considered an integrated ecosystem, which leads to a number of inefficiencies and holes in the verification process. Assertion-based verification (ABV) is a well-known approach for checking the functional correctness of a system. In ABV, the specifications of the system under verification (SUV) are formalised through assertions, which are logic properties that must hold during the system's execution. Due to the complexity and dynamic nature of the SUV, ABV cannot be applied only in an offline fashion before the deployment of the system. Therefore, it is necessary to extend the verification process to the post-deployment phase, that is, to run checkers during the execution of the system. However, this collides with the issues of dealing with a distributed system affected by unpredictable latency. In this context, the SUV is usually made of several components with limited available resources, and to make things even more challenging, these resources are usually already saturated by the functional tasks. To fill this gap, I propose a complete framework to verify complex distributed systems, from the formalisation of specifications to runtime execution. The proposed framework aims at covering several holes in the verification process of systems executing in an edge-to-cloud computing environment. |
18:30 CET | PhDF.16 | A RESOURCE EFFICIENT ACCELERATION OF NEURAL NETWORKS ON LOW-END FPGAS THROUGH MEMORY SHARING Speaker: Argyris Kokkinis, Aristotle University of Thessaloniki, GR Authors: Argyris Kokkinis1 and Kostas Siozios2 1Aristotle University of Thessaloniki, GR; 2Department of Physics, Aristotle University of Thessaloniki, GR Abstract Hardware acceleration at the deep edge comes with strict constraints on low power and high throughput. In low-end FPGAs, frequent communication with the off-chip memory decreases both the design's performance and its energy efficiency. This research presents a design methodology for accelerating Neural Networks (NNs) on low-end FPGAs by sharing on-chip memory among the implemented accelerators. Experimental analysis indicates that this methodology can increase the size of the on-chip NNs by up to 3.09× without the overhead of continuous off-chip communication. |
18:30 CET | PhDF.17 | PHASE-BASED OSCILLATORY NEURAL NETWORK FOR ENERGY EFFICIENT NEUROMORPHIC COMPUTING Speaker: Corentin Delacour, LIRMM, University of Montpellier, CNRS, FR Authors: Corentin Delacour and Aida Todri-Sanial, LIRMM, University of Montpellier, CNRS, FR Abstract Oscillatory Neural Networks (ONNs) are novel neuromorphic architectures where information is encoded in phases among coupled oscillators. This work introduces the concept of analog ONNs based on beyond-CMOS devices to perform AI tasks with a low energy footprint. Using circuit and TCAD simulations, we investigate the design of compact oscillating neurons made of vanadium dioxide (VO2) and coupled by passive synaptic elements. The ONN energy scaling at the device and architecture level is presented. Finally, we showcase a VO2-ONN for solving NP-hard optimization problems such as finding the maximum cut of a graph. |
18:30 CET | PhDF.18 | CO-DESIGN OF LIGHTWEIGHT EHEALTH APPLICATIONS ON AN IOT EDGE PROCESSOR Speaker: Mingyu Yang, Tokyo Institute of Technology, JP Authors: Mingyu Yang and Yuko Hara-Azumi, Tokyo Institute of Technology, JP Abstract With the development of the Internet of Things (IoT), eHealth applications implemented on embedded systems are offering an easy-to-use IoT eHealth ecosystem. For such applications, a power/energy-efficient computing platform and lightweight algorithms that do not require powerful resources or a large memory footprint are both essential. This work targets a lightweight implementation of a low-power eHealth device using both hardware and software approaches. A memory-conscious dynamic time warping (DTW) algorithm used in various lightweight eHealth applications is deployed on a small, low-power embedded processor. Prototypes of the processor were fabricated in a 65nm low-power process. |
18:30 CET | PhDF.19 | QUALITY-OF-SERVICE AWARE DESIGN AND MANAGEMENT OF EMBEDDED MIXED-CRITICALITY SYSTEMS Speaker and Author: Behnaz Ranjbar, TU Dresden, DE Abstract A wide range of embedded systems found in the automotive and avionics industries are evolving into Mixed-Criticality (MC) systems to meet cost, space, timing, and power consumption requirements. MC applications are real-time, and to ensure their correctness, it is essential to meet strict timing requirements as well as functional specifications. The correct design of such MC systems requires a thorough understanding of the system's functions and their importance to the system. We address the challenges associated with efficient MC system design. We first focus on MC application analysis through Worst-Case Execution Time (WCET) analysis and task scheduling analysis in order to execute more low-criticality tasks in the system, i.e., improving the Quality-of-Service (QoS), while guaranteeing the correct execution of high-criticality tasks. The thesis then addresses the challenge of enhancing QoS by exploiting parallelism on multi-processor hardware platforms. In addition, we study the power and thermal management of multi-core MC systems while guaranteeing their real-time behaviour under all circumstances. |
18:30 CET | PhDF.20 | FUNCTIONAL SYNTHESIS VIA MACHINE LEARNING AND AUTOMATED REASONING Speaker and Author: Priyanka Golia, IIT Kanpur and NUS Singapore, SG Abstract Automated functional synthesis deals with synthesizing programs, functions, and circuits that satisfy the user's requirements. Given a relational specification R(X, Y) over input X and output Y, the task is to synthesize the output Y in terms of X, that is, Y := F(X), such that the given specification is met (a minimal toy illustration of this formulation is sketched after the forum listing below). Given the fundamental importance of synthesis in computer science, recent developments in this area have led to advances in program synthesis, synthesis of safety controllers, circuit design and repair, and cryptanalysis. We propose a novel data-driven approach for functional synthesis that takes advantage of advances in machine learning, constrained sampling, and automated reasoning. The proposed approach is very generic and can be lifted to diverse settings. We further analyze its impact on program synthesis and synthesis with explicit dependencies. The submission summarizes the different works done as part of my thesis. Joint work with Kuldeep S. Meel, Subhajit Roy, and Friedrich Slivovsky. |
18:30 CET | PhDF.21 | APPLICATION REFINEMENT AND MEMORY MANAGEMENT OVER HETEROGENEOUS DRAM/NVM SYSTEMS Speaker: Manolis Katsaragakis, National TU Athens, GR Authors: Manolis Katsaragakis1, Francky Catthoor2 and Dimitrios Soudris3 1National TU Athens, GR; 2IMEC, BE; 3National Technical University of Athens, GR Abstract This PhD focuses on the development of a systematic methodology for source code organization, data structure refinement, exploration, and placement over emerging memory technologies. The goal is to extract alternative solutions that provide multi-criteria trade-offs across different optimization aspects, such as memory footprint, memory accesses, performance, and energy consumption. |
18:30 CET | PhDF.22 | MAXIMIZING THE POTENTIAL OF RISC-V VECTOR EXTENSIONS FOR SPEEDING UP CRYPTOGRAPHY ALGORITHMS Speaker and Author: Huimin Li, TU Delft, NL Abstract RISC-V is an open and freely accessible Instruction Set Architecture (ISA) based on reduced instruction set computer (RISC) principles. It is suitable for direct native hardware implementation, with small base instruction sets (ISA bases) for simple general-purpose computers and rich optional instruction extensions for more comprehensive applications. These optional extensions are designed to work with all ISA bases without conflicts. Additionally, RISC-V allows users to add custom instructions to accelerate specific applications. The RISC-V vector extensions (RISC-V vector ISA) are designed for vector operations: a single instruction applies the same operation to multiple data elements in parallel, improving whole-system performance. This work explores the full potential of the RISC-V vector extensions for cryptography algorithms. |
18:30 CET | PhDF.23 | OPTIMIZING AI AT THE EDGE: FROM NETWORK TOPOLOGY DESIGN TO MCU DEPLOYMENT Speaker and Author: Alessio Burrello, Politecnico di Torino and Università di Bologna, IT Abstract Optimizing and deploying artificial intelligence on edge devices, removing the need for cloud computing systems and for sending data over networks, is vital for reducing energy consumption and improving privacy. This thesis describes two essential knobs for optimizing so-called EdgeAI. The first topic analyzed in the thesis is Neural Architecture Search (NAS), which is quickly becoming the go-to approach to optimize the structure of Deep Learning (DL) models. I focus on two tools that I developed: one to optimize the architecture of Temporal Convolutional Networks (TCNs), a recently emerged convolutional model for time-series processing, and one to optimize the data precision of tensors inside CNNs. The first NAS explicitly targets the most peculiar architectural parameters of TCNs, namely dilation, receptive field, and the number of features in each layer; it is the first NAS that explicitly targets these networks. The second NAS instead focuses on finding the most efficient data format for a target CNN, at the granularity of individual layer filters. Applying these two NASes in sequence allows an "application designer" to minimize the structure of the neural network employed, reducing the number of operations or the memory usage of the network. The second chapter describes the optimization of neural network deployment on edge devices, where exploiting the scarce resources of edge platforms is critical for efficient NN execution on MCUs. To this end, I introduce DORY (Deployment Oriented to memoRY), an automatic tool to deploy CNNs on low-cost MCUs. DORY automatically manages the different levels of memory inside the MCU, offloads the computation workload (i.e., the different layers of a neural network) to dedicated hardware accelerators, and generates ANSI C code that orchestrates off- and on-chip transfers together with the computation phases. On top of this, I introduce two optimized computation libraries that DORY can exploit to deploy TCNs and Transformers efficiently at the edge. In the last chapter of the thesis, I describe two bio-signal analysis applications, namely heart rate tracking and sEMG-based gesture recognition, in which the previously described techniques serve as fundamental blocks for optimizing execution at the edge. |
18:30 CET | PhDF.24 | EFFICIENT AND RELIABLE EDGE VISION ACCELERATOR WITH COMPUTE-IN-MEMORY Speaker: Wantong Li, Georgia Tech, US Authors: Wantong Li and Shimeng Yu, Georgia Tech, US Abstract Compute-in-memory (CIM) has been widely investigated as an attractive candidate to accelerate the extensive multiply-and-accumulate (MAC) workloads in deep learning inference. Analog CIM with non-volatile memories such as resistive random-access memory (RRAM) benefits from low leakage, high capacity, and suppression of data movement, but inference accuracy can deteriorate from nonidealities. This work proposes techniques including on-chip write-verify, in-situ error correction, and temperature-tracking ADC references to combat the process, voltage, and temperature (PVT) variations in RRAM-CIM. A prototype chip with these features has been fabricated and validated in TSMC 40nm technology. The macro achieves competitive compute density of 97.8 GOPS/mm2 and energy efficiency of 44.5 TOPS/W, while guaranteeing high accuracy under low VDD and high temperature. On the application side, vision transformer has become the state-of-the-art for many computer vision tasks, and a digital reconfigurable accelerator (RAWAtten) for its complex window attention is designed. RAWAtten achieves 2.4× speedup over the baseline GPU while consuming only a fraction of GPU power. Having improved the reliability of analog CIM, a hybrid RAWAtten employing analog CIM for its linear layers and digital compute for its intermediate matrix multiplications is under development to combine advantages of both compute schemes. Monolithic 3D integration will be used to further reduce cost of data movements and allow stacking of heterogeneous layers in different technology nodes. |
18:30 CET | PhDF.25 | UNFORGETTABLE!: DESIGNING A NON-VOLATILE PROCESSOR FOR INTERMITTENTLY POWERED EMBEDDED DEVICES Speaker: Satya Jaswanth Badri, Indian Institute of Technology Ropar, IN Authors: Satya Jaswanth Badri1, Mukesh Saini1 and Neeraj Goel2 1Indian Institute of Technology, Ropar, IN; 2IIT Ropar, IN Abstract Battery-less technology has evolved to replace battery usage in space, deep mines, and other environments in order to reduce cost and pollution. A promising alternative to battery-operated devices is energy harvesting, which collects energy from the environment to power IoT devices. The collected energy is stored in a capacitor and used for computation, so power failures may occur frequently in these IoT systems; we refer to this as intermittent computing. Data loss is the major challenge in such intermittently powered IoT devices. Non-volatile memory (NVM) based processors have been explored for saving the system state during a power failure, and a Non-Volatile Processor (NVP) is needed for these devices. We propose three architectures that together form a suitable and efficient NVP for intermittent computing: in the first work, we deploy NVM at the L1 cache; in the second, at the last-level cache (LLC); and in the third, we propose a memory mapping technique for a modern NVP, the MSP430FR6989. |
18:30 CET | PhDF.26 | LANGUAGE SUPPORT AND OPTIMIZATION FOR ENERGY-EFFICIENT AND ADAPTABLE EXECUTION OF MULTIPLE DATAFLOW APPLICATIONS ON EMBEDDED SYSTEMS Speaker and Author: Robert Khasanov, TU Dresden, DE Abstract Many modern computing systems in embedded end-user devices consist of many cores, and the number of cores continues to grow. Embedded devices often process varying workloads, where different kernels may be requested to execute at any time. The system needs to ensure that a required Quality of Service is delivered and the overall energy consumption is minimized. This thesis combines several works which aim at energy-efficient and adaptable execution on soft/firm real-time systems and researches adaptivity both at the application and system levels. More concretely, first, it presents a novel extension to Kahn Process Networks (KPN), which introduces implicit parallelism and a relaxed execution strategy. Despite this relaxation, the introduced extension to the application model still keeps deterministic KPN semantics. Second, the thesis presents a novel energy-efficient runtime resource management algorithm for multi-application mapping. The presented methodology lets runtime applications adapt to available resources by using mapping segments, which allows the manager to consider the upcoming changes in the workload, thereby enlarging the scope of analysis. As a result, the manager better adapts applications to the available resources and produces energy-efficient schedules. Due to low overhead, the approach could also be applied to a use-case of baseband processing, where the incoming requests are processed at the millisecond granularity. The final part of the prospective thesis presents a complete tool flow from adaptive dataflow application code down to the final execution on the embedded system. This tool flow exploits the available adaptivity knobs at both application and system levels in a joint way, thereby better adapting the system to the varied dynamic workload. |
18:30 CET | PhDF.27 | AN INTEGRATED ENVIRONMENT FOR MODELING AND DEPLOYING DIGITAL TWINS Speaker and Author: Charles Steinmetz, Hochschule Hamm-Lippstadt - Campus Lippstadt, DE Abstract The Digital Twin (DT) has been the focus of researchers from academia and industry in the last few years. It is one of the key enablers of the current and next industrial revolutions, such as Industry 4.0, Industry 5.0, and the Metaverse. However, representing real-world systems can be complex, since assets might be represented in several ways and several stakeholders with different backgrounds might be involved. In this context, this work proposes an environment that integrates all these perspectives in a common language that different stakeholders can use, covering all system levels from the device level up to the process and workflow levels. A methodology and elements for creating semantic DT models are provided. Furthermore, a four-layer architecture is presented to help designers identify the responsibilities of each part of the system. |
18:30 CET | PhDF.28 | HARDWARE AND SOFTWARE ARCHITECTURES FOR ENERGY-EFFICIENT SMART HEALTHCARE SYSTEMS Speaker: Bharath Srinivas Prabakaran, TU Wien, AT Authors: Bharath Srinivas Prabakaran1 and Muhammad Shafique2 1TU Wien, AT; 2New York University Abu Dhabi, AE Abstract Wearables are proving to be increasingly influential, reaching most smartphone owners and improving their user experience. They have drastically improved users' quality of life thanks to their ease of use and broad-spectrum functionality, including sensors capable of monitoring various biosignals to estimate the user's health. The collected data is transmitted to the user's device and/or their physician, depending on the requirements, for further processing and information extraction to detect anomalies. This work investigates the research challenges associated with such smart-healthcare systems at both the hardware and software layers and proposes relevant techniques that can ease future deployment and adoption. |
18:30 CET | PhDF.29 | POWER SIDE CHANNELS IN REMOTE FPGAS Speaker and Author: Ognjen Glamocanin, EPFL, CH Abstract The pervasive adoption of field-programmable gate arrays (FPGAs) in both cyber-physical systems and the cloud has raised many security issues. Being integrated circuits, FPGAs are susceptible to fault and power side-channel attacks, which require physical access to the victim device. Recent work demonstrated that physical proximity is no longer required for these attacks, as FPGA logic can be misused to create on-chip voltage sensors or power-wasting circuits. This work explores the impact of on-chip voltage sensors on FPGA security and shows that they can be used to both enhance and compromise the security of FPGA-based systems. In the case of deployed, no longer accessible cyber-physical devices vulnerable to tampering attacks, we show that on-chip sensors allow designers to re-evaluate the power side-channel leakage after deployment, ensuring that security has not been compromised. In the case of shared FPGAs in the cloud, we demonstrate that new security vulnerabilities arise with the use of on-chip sensors: a remote attacker can mount both statistical (correlation power analysis) and machine learning (ML) based attacks with the on-chip sensors, emphasizing the need to deploy countermeasures in multi-tenant FPGAs. Our work also demonstrates new, routing-based sensor architectures that outperform the state of the art. Finally, we evaluate the temperature impact on the on-chip sensors and demonstrate that it can significantly affect the attack effort. |
18:30 CET | PhDF.30 | NOVEL TECHNIQUES FOR TIMING ANALYSIS OF VLSI CIRCUITS Speaker and Author: Dimitrios Garyfallou, University of Thessaly, GR Abstract Timing analysis is an essential and demanding verification method used during the design and optimization of a Very Large Scale Integrated (VLSI) circuit, while it also constitutes the cornerstone of the final signoff that dictates whether the chip can be released to the semiconductor foundry for fabrication. Throughout the last few decades, the relentless push for high-performance and energy-efficient circuits has been met by aggressive technology scaling, which enabled the integration of a vast number of devices into the same die but introduced new challenges to timing analysis. In nanometer technologies, highly resistive interconnects have an ever-increasing impact on timing, while nonlinear transistor and Miller capacitances imply that signals no longer resemble smooth saturated ramps. At the same time, manufacturing process variations have become significantly more pronounced, which calls for sophisticated timing analysis techniques to reduce uncertainty in timing estimation. From another perspective, the timing guardbands enforced to protect circuits from variation-induced timing errors are overly pessimistic since they are estimated using static timing analysis under rare worst-case conditions, leaving extensive dynamic timing margins unexploited. To this end, this research presents several new techniques for accurate and efficient timing analysis of VLSI circuits in advanced technologies, which address different aspects of the problem, including gate and interconnect timing estimation, timing analysis under process variation, and dynamic timing analysis. |
18:30 CET | PhDF.31 | SHARED RESOURCE CONTENTION AWARE SCHEDULABILITY ANALYSIS FOR MULTIPROCESSOR REAL-TIME SYSTEMS Speaker: Jatin Arora, CISTER Research Centre, ISEP, IPP, PT Authors: Jatin Arora, Eduardo Tovar and Claudio Maia, Polytechnic Institute of Porto, PT Abstract Commercial-off-the-shelf (COTS) multicore platforms have the potential to provide raw computing power while being energy-efficient and cost-effective. However, the adoption of multicore platforms in hard real-time systems is still under scrutiny. The main challenge that hinders their use in hard real-time systems is their unpredictability, which originates from the sharing of hardware resources. A task executing on one core of a multicore platform has to compete with co-running tasks (tasks running on other cores) to access shared hardware resources such as the last-level cache (LLC), the interconnect (e.g., the memory bus), and the main memory. This competition is problematic as it can negatively influence the temporal behavior of tasks in a non-deterministic manner, a phenomenon known as shared resource contention. To circumvent this problem, the 3-phase task execution model was proposed, which divides task execution into distinct computation and memory phases. In such a model, the shared resources, i.e., the memory bus and the main memory, are only accessed during the memory phases, and no memory access is allowed during the computation phases. Leveraging such a model, tasks can be scheduled so that while one task executes its memory phase, another task on a different core can concurrently execute its computation phase without suffering shared resource contention. However, if tasks running on multiple cores execute their memory phases at the same time, shared resource contention can occur. To address this issue, this PhD dissertation focuses on analyzing the shared resource contention suffered by 3-phase tasks due to the sharing of the memory bus and main memory. Having analyzed the shared resource contention, a Worst-Case Response Time (WCRT) based schedulability analysis is performed by integrating the maximum shared resource contention suffered by 3-phase tasks. |
18:30 CET | PhDF.32 | SCALABLE HARDWARE-AWARE NEURO-EVOLUTIONARY ALGORITHMS Speaker and Author: Michal Pinos, Faculty of Information Technology, Brno University of Technology, CZ Abstract Recently, there has been growing interest in the use of DNNs in low-power devices with limited resources, such as Internet of Things (IoT) devices, embedded devices, and other battery-powered smart gadgets. The deployment of DNNs in these devices is associated with many restrictions, such as limited power consumption, low memory, or insufficient computing power, which, for example, limits their usage to on-device inference only. In order to deploy modern DNNs on resource-constrained devices, many methods of hardware-aware DNN design have been researched. One of the most frequently used approaches is the manual or semi-automatic optimization of existing DNNs for deployment on the given hardware. Such optimizations usually consist of procedures such as model quantization to reduce the bit width of numeric data types, replacing expensive floating-point operations with fixed-point arithmetic, or model compression using pruning and fine-tuning techniques. Another recently very popular and successful technique is the deployment of approximate computing at different levels of the DNN computing stack. In this research, I focus on the utilization of approximate multipliers in certain layers of DNN models. In particular, excellent trade-offs between energy consumption and accuracy can be achieved by approximating multiplications in the convolutional layers of convolutional neural networks (CNNs). To overcome some of the problems associated with the tedious manual design of DNN architectures, such as time complexity and error-proneness, a technique for the automated design of neural network architectures, called Neural Architecture Search (NAS), is deployed. |
18:30 CET | PhDF.33 | DESIGN AND CODE OPTIMIZATION FOR SYSTEMS WITH NEXT-GENERATION RACETRACK MEMORIES Speaker and Author: Asif Ali Khan, TU Dresden, DE Abstract With the rise of computationally expensive application domains such as machine learning, genomics, and fluid simulation, the quest for performance and energy-efficient computing has gained unprecedented momentum. The significant increase in computing and memory devices in modern systems has resulted in an unsustainable surge in energy consumption, a substantial portion of which is attributed to the memory system. The scaling of conventional memory technologies and their suitability for next-generation systems are also questionable. This has led to the emergence and rise of non-volatile memory (NVM) technologies. Today, several NVM technologies, in different development stages, are competing for rapid access to the market. Racetrack memory (RTM) is one such non-volatile memory technology that promises SRAM-comparable latency, reduced energy consumption, and unprecedented density compared to other technologies. However, RTM is sequential in nature: data in an RTM cell needs to be shifted to an access port before it can be accessed, and these shift operations incur performance and energy penalties. This thesis presents a set of techniques, including optimal, near-optimal, and evolutionary algorithms, for efficient scalar, instruction, and array placement in RTMs. We present an automatic compilation framework that analyzes static control flow programs and transforms the loop traversal order and memory layout to maximize accesses to consecutive RTM locations and minimize shifts. We also develop a simulation framework called RTSim that models various RTM parameters and enables accurate architectural-level simulation. |
18:30 CET | PhDF.34 | BITSTREAM PROCESSING SYSTEMS WITH NEW PERSPECTIVES TOWARD SIMULATION AND LIGHTWEIGHT NEURAL NETWORKS Speaker: Sercan Aygun, University of Louisiana at Lafayette, US Authors: Sercan Aygun1 and Ece Gunes2 1University of Louisiana at Lafayette, US; 2Istanbul TU, TR Abstract Sercan Aygun obtained his Ph.D. degree in Electronics Engineering from Istanbul Technical University, Turkey, in November 2022. He is currently a postdoctoral researcher at the University of Louisiana at Lafayette, USA. The goal of the dissertation is to propose software simulations of stochastic computing (SC) systems with an emphasis on vision and learning machines. A new simulation approach based on the contingency table (CT) construct is proposed, reducing the simulation burden of memory- and runtime-bounded SC: by utilizing only a correlation-aware CT, digital circuits are simulated as if actual bitstreams were used, while only scalar processing is performed (a short sketch of the underlying bitstream arithmetic is given after the forum listing below). In addition, the dissertation proposes a new bitstream-processing neural network architecture based on binarized weights and activations. The bitstream-processing binarized neural network (BSBNN) is presented considering its efficient hardware structure and its robustness to non-idealities such as bit-flip errors. The dissertation was carried out in collaboration with the Université Catholique de Louvain, Belgium (Supervisor: Prof. Christophe De Vleeschouwer, 2018-2019) and the University of Louisiana at Lafayette, USA (Supervisor: Asst. Prof. M. Hassan Najafi, 2021-2022). |
18:30 CET | PhDF.35 | A CROSS-LAYER FRAMEWORK FOR ADAPTIVE PROCESSOR-BASED SYSTEMS REGARDING ERROR RESILIENCE AND POWER EFFICIENCY Speaker and Author: Mitko Veleski, Brandenburg University of Technology, DE Abstract This thesis presents a novel, first-of-its-kind framework for the synergistic optimization of two fundamental but non-complementary requirements in modern computing: error resilience and power consumption. The framework is built on a high degree of configurability and simple integration into a typical processor-based system. Such a framework makes the host system easily adaptable to variations and capable of operating optimally in all conditions. This is achieved by intelligently interchanging techniques for resilient and low-power computing during runtime. As the efficient and timely flow of relevant information is crucial for dynamic system adjustment, the framework building blocks are distributed across several abstraction layers. Moreover, the framework allows a system to preserve its performance at negligible area overhead. |
18:30 CET | PhDF.36 | SECURITY AND INTERPRETABILITY IN AUTOMOTIVE SYSTEMS Speaker and Author: Shailja Thakur, New York University, US Abstract The lack of a sender authentication mechanism in the Controller Area Network (CAN) makes it vulnerable to security threats, such as an attacker impersonating an Electronic Control Unit (ECU) and sending spoofed messages. To address this issue, this thesis proposes a sender authentication technique that utilizes power consumption measurements and a classification model to determine transmitting states. By analyzing the power consumption of each ECU, the technique can identify the actual sender and detect spoofed messages. The method shows good accuracy in real-world settings, making it a promising solution to the problem of CAN security. However, while machine learning-based security controls have shown great potential in improving automotive security, false positives pose a significant challenge. False positive alerts can cause alarm fatigue in operators, leading to incorrect reactions and, ultimately, rendering the system less effective. To address this challenge, the thesis explores explanation techniques for image and time-series inputs. These techniques assign weights to sensitive inputs and quantify variations in explanations. Overall, the thesis proposes methods for addressing security and interpretability in automotive systems. These methods have potential applications in other settings where transparent and reliable decision-making is crucial. |
18:30 CET | PhDF.37 | RESOURCE-AWARE OPTIMIZATION TECHNIQUES FOR MACHINE LEARNING INFERENCE ON HETEROGENEOUS EMBEDDED SYSTEMS Speaker and Author: Ourania Spantidi, Southern Illinois University Carbondale, US Abstract Deep neural networks (DNNs) are being heavily utilized in modern applications, putting energy-constrained devices to the test. To mitigate high energy consumption, approximate computing has been employed in DNN accelerators to balance the accuracy-energy trade-off. However, the approximation-induced accuracy loss can be very high and drastically degrade the performance of the DNN. Therefore, there is a need for a fine-grain mechanism that assigns specific DNN operations to approximation while maintaining acceptable DNN accuracy and achieving low energy consumption. This PhD thesis presents two methods for weight-to-approximation mapping in approximate DNN accelerators. |
18:30 CET | PhDF.38 | A CAD FRAMEWORK FOR AUTOMATED LEARNABILITY ASSESSMENT OF PHYSICALLY UNCLONABLE FUNCTIONS Speaker: Durba Chatterjee, IIT Kharagpur, IN Authors: Durba Chatterjee, Debdeep Mukhopadhyay and Aritra Hazra, IIT Kharagpur, IN Abstract Ever since the emergence of the Physically Unclonable Function (PUF), the hardware primitive has been subjected to various machine learning (ML) attacks. While several design strategies have been proposed to mitigate state-of-the-art attacks, they are subsequently broken by novel attack techniques. One of the reasons is that most designs are adapted to mitigate former attacks and do not consider design strengthening from an architectural perspective. This necessitates the development of a formal methodology to design strong ML-resilient PUF constructions. In this work, we present a CAD framework, PUF-G, to formally represent and evaluate the Probably Approximately Correct (PAC) learnability of silicon PUFs and their compositions. To represent a PUF design, we propose a formal representation language capable of representing any PUF construction or composition upfront. The PUF-G tool parses the design description, translates the design into an interim model, and outputs the PAC-learnability bounds. This tool will help a designer explore various compositional PUF architectures and their resilience to ML attacks automatically before converging on a strong design. |
18:30 CET | PhDF.39 | RELIABILITY MODELING AND MITIGATION IN ADVANCED MEMORY TECHNOLOGIES AND PARADIGMS Speaker and Author: Mahta Mayahinia, Karlsruhe Institute of Technology, DE Abstract Scaling VLSI technology toward more advanced, smaller nodes on the one hand, and emerging new devices such as non-volatile resistive memories on the other, open up new horizons for designing high-performance and energy-efficient computational and memory platforms. However, both the long-term and short-term reliability of these structures is of paramount importance. Moreover, due to the smaller technology nodes, the use of emerging devices, and non-conventional processing units such as computation-in-memory, previous models of both functionality and reliability are no longer sufficiently accurate, and new models need to be developed that consider these new challenges. In this work, we investigate the reliability issues of advanced and emerging memory and processing elements and address them at different levels of abstraction, from low-level circuit-based to higher-level application-oriented solutions. |
18:30 CET | PhDF.40 | MACHINE LEARNING FOR RESOURCE-CONSTRAINED COMPUTING SYSTEMS Speaker and Author: Martin Rapp, Karlsruhe Institute of Technology, DE Abstract Optimizing the management of the limited resources of computing systems such as processors is of paramount importance to achieve goals like maximum performance. In particular, system-level resource management has a major impact on the performance, power, and temperature during application execution by utilizing application mapping, application migration, and dynamic voltage and frequency scaling (DVFS). This work presents novel machine learning (ML)-based resource management techniques. ML-based solutions tackle the involved challenges by predicting the impact of potential resource management actions, by estimating hidden properties of applications (i.e., properties unobservable at run time), or by directly learning a resource management policy. Finally, since ML also needs to run with limited resources, this work presents resource-aware distributed on-device learning. Ultimately, this work shows that ML is a key technology for optimizing system-level resource management by tackling the involved challenges and enabling technical innovations to further exploit the full potential of computing systems. |
18:30 CET | PhDF.41 | COUNTERMEASURES AGAINST FPGA-BASED NON-INVASIVE ATTACKS Speaker: Ali Asghar, Technische Universität Ilmenau, DE Authors: Ali Asghar and Daniel Ziener, TU-Ilmenau, DE Abstract Non-invasive attacks have been known for decades; however, the security community's interest in these attacks has not diminished, which shows their continued relevance to hardware security. In this work, we propose countermeasures for two different types of FPGA-based non-invasive attacks. Our major contribution is a countermeasure against a class of non-invasive physical attacks known as Side Channel Analysis (SCA), for which we have developed and evaluated a dynamically reconfigurable system. The proposed system allows exchanging different realizations of a cryptographic algorithm during run-time. This dynamic behavior renders the static principles of SCA ineffective and consequently increases the overall system security. The second contribution of this work deals with Intellectual Property (IP) piracy, a non-invasive logical attack. We extend an existing idea that establishes the ownership of an IP core using look-up table (LUT) contents as signatures. Our contributions scale the approach to much larger designs, 6-LUT FPGAs, and the associated CAD tools. The results show a 100% core identification rate with no false positives or false negatives. |
18:30 CET | PhDF.42 | ENERGY-EFFICIENT LOCALIZATION ON AUTONOMOUS NANO-UAVS WITH NOVEL MULTIZONE DEPTH SENSORS AND PARALLEL RISC-V PROCESSORS Speaker: Hanna Müller, ETH Zürich, CH Author: Hanna Mueller, ETH Zurich, CH Abstract Unmanned aerial vehicles (UAVs) are nowadays used in many fields, such as monitoring, inspection, surveillance, transportation, and communication. In many of those scenarios, a small form factor brings advantages: smaller drones are more agile, can fly through narrow passages, and allow safe operation close to humans. Miniaturized UAVs in particular (i.e., nano-UAVs that weigh a few tens of grams) often rely on offboard computation in the form of a powerful computer, as onboard computation is strongly limited by power and size constraints. Relying only on onboard sensing and computation, however, has many advantages, such as higher reliability, since the mission does not critically depend on a reliable communication link with a central computer or a pilot, and increased reach, as the drones no longer have to stay close to a base station. To navigate autonomously, a nano-UAV must perform several compute-intensive tasks, such as localization, mapping, and planning, while avoiding obstacles. I identified three main challenges in fully autonomous nano-UAVs: (i) miniaturization of the UAVs, (ii) obstacle avoidance, and (iii) localization. This work addresses these challenges by exploiting, for the first time, novel depth-map sensors from STMicroelectronics (VL53L5CX) and novel processing units that consume only tens of milliwatts while providing tens of GOPS, such as parallel ultra-low power (PULP) Systems-on-Chip (SoCs), as well as optimized algorithms fitted for execution on microcontrollers. |
18:30 CET | PhDF.43 | ENERGY EFFICIENT DOMAIN-SPECIFIC HARDWARE DESIGN Speaker: Kailash Prasad, IIT Gandhinagar, IN Authors: Kailash Prasad and Joycee Mekie, IIT Gandhinagar, IN Abstract The advent of Deep Neural Networks (DNNs) has ushered in a new era of breakthroughs in a wide variety of tasks, including image classification and language translation. However, the complexity of these workloads has led to an enormous increase in computational demands. In recent years, novel paradigms have been proposed for energy-efficient circuits, one of which is approximate computing. This approach aims to exploit the inherent ability of many applications to produce acceptable results, even when there are some errors in their computations. Previous studies on DNN accelerators have shown that on-chip and off-chip memory accounts for a significant portion of the system energy consumption, with data movement being the dominant energy-consuming factor. To overcome this challenge, In-Memory Computing (IMC) has emerged as a promising approach that enables computation within on-chip memory cells, offering numerous benefits in computation time and energy efficiency. In this Ph.D. thesis, we propose approximate circuits, architecture, and evaluation tools to examine their impact on various applications. Additionally, we propose IMC architectures and their evaluation framework to overcome the data movement bottleneck. Our research offers valuable insights into the potential of approximate computing and IMC to improve energy efficiency and performance in a wide range of applications. |
18:30 CET | PhDF.44 | TOWARDS ENERGY-EFFICIENT IN-MEMORY COMPUTING Speaker and Author: Muhammad Rashedul Haq Rashed, University of Central Florida, US Abstract The rapid growth of sensor devices in the Internet of Things (IoT) has caused the amount of available digital data to increase exponentially. This has powered the emergence of data-driven applications such as computer vision, natural language processing, and search. These new applications have endless computing demands that cannot be met by today's high-performance computing systems. Unfortunately, these demands are not expected to be solved by further scaling silicon technology due to the slowdown of Moore's law, the end of Dennard scaling, and the von Neumann bottleneck. Solving this grand computing-efficiency challenge has been the focus of several federal funding agencies, with multiple billion-dollar investments in programs such as the Exascale Computing Project (ECP), the BRAIN Initiative, and the Joint University Microelectronics Program (JUMP). My research is aligned with these efforts and aims at developing future computing systems based on emerging hardware. These computing systems promise substantial (orders of magnitude) improvements in throughput and energy efficiency. The high-level idea of this research direction is to leverage emerging non-volatile memories (NVMs) and perform energy-efficient processing in memory (PIM). This strategy allows otherwise expensive operations such as matrix-vector multiplication to be performed efficiently in the analog domain. Moreover, processing in memory eliminates the expensive data movement between the processor and the memory. Within this research direction, I have made several key contributions towards the robustness, scalability, and energy efficiency of such systems. My five main research contributions are outlined in the attached extended abstract. |
18:30 CET | PhDF.45 | DATE PHD FORUM: MEMRISTOR BASED ARTIFICIAL INTELLIGENCE ACCELERATORS USING IN/NEAR MEMORY PARADIGM Speaker: Kamel-Eddine Harabi, Université Paris-Saclay, FR Authors: Kamel-Eddine Harabi1 and Damien Querlioz2 1C2N, Université Paris Saclay, CNRS, FR; 2Université Paris-Sud, FR Abstract Memristors are a new type of memory technology fully embeddable in CMOS, providing compact, nonvolatile, and fast memory. These devices provide fantastic opportunities to integrate logic and memory tightly and allow low-power computing. It is therefore essential to prototype computing concepts involving memristors experimentally. However, appropriate platforms are extremely complex to fabricate due to the need to co-integrate commercial CMOS and memristor devices on the same die. My PhD thesis concerns the design and development of energy-efficient AI systems using memristors. In our projects, we rely on an in/near-memory computing approach, where memory and computation are co-located. During my PhD, I worked mainly on three projects, two of which were published in Nature Electronics and one presented at ASP-DAC 2023. |
18:30 CET | PhDF.46 | HARDWARE SECURITY ASSURANCE VIA OBFUSCATION AND AUTHENTICATION Speaker and Author: Mohammad Rahman, University of Florida, US Abstract Due to the globalization of IC manufacturing, there have been increased security concerns, notably IP theft. One promising countermeasure is logic locking, which includes programmable elements in a design to obscure the true functionality during manufacturing. In general, logic locking techniques are meant to provide IP security without incurring large overheads. This dissertation contributes in several ways to this goal. We perform an exhaustive security analysis of the existing logic locking techniques, revealing several vulnerabilities. One such vulnerability comes from the satisfiability-based SAT attack, where the circuit under attack (CUA) is represented in a propositional logic form and the response from an unlocked chip is utilized to quickly prune out incorrect keys. Criteria for successful SAT attacks on locked circuits include: (i) the circuit under attack is fully combinational, or (ii) the attacker has scan chain access. These vulnerabilities inform the development of a novel dynamically obfuscated scan chain (DOSC) architecture, whose resilience against SAT attacks is shown both mathematically and experimentally when it is inserted into the scan chain of an obfuscated design. Scan obfuscation methods such as DOSC require that the functional IP core is locked by a functional logic locking method; however, none of the existing logic locking methods is resilient against emerging attacks on logic locking. To strengthen the protection of the underlying functional IP core against these emerging attacks, O'Clock, a clock-gating-based logic locking method, has been proposed that "locks the clock" to protect IP cores in a complex SoC environment. O'Clock obstructs data/control flows and makes the underlying logic dysfunctional for incorrect keys by manipulating the activity factor of the clock tree, with minimal power, performance, and area (PPA) overhead and maximum resiliency against emerging attacks. |
18:30 CET | PhDF.47 | RELIABLE MEMRISTIVE NEUROMORPHIC IN-MEMORY COMPUTING: AN ALGORITHM-HARDWARE CO-DESIGN APPROACH Speaker: Soyed Tuhin Ahmed, KIT - Karlsruhe Institute of Technology, DE Authors: Soyed Tuhin Ahmed1 and Mehdi Tahoori2 1KIT - Karlsruhe Institute of Technology, Karlsruhe, Germany, DE; 2Karlsruhe Institute of Technology, DE Abstract The capability of neural networks (NNs) to tackle difficult cognitive tasks, such as sensor data processing, image recognition, and language modeling, has made them appealing for hardware realization. To obtain high inference accuracy, most NN models increase their depth and breadth, and they require numerous matrix-vector multiplications, which are expensive. NN applications can be efficiently accelerated in neuromorphic compute-in-memory (CiM) architectures based on emerging resistive non-volatile memories (NVMs) such as Spin Transfer Torque Magnetic RAM (STT-MRAM). NVMs offer many benefits, such as fast switching, high endurance, and low power consumption. However, the manufacturing process for NVM memories has not yet matured; as a result, they suffer from various non-ideal behaviors, such as device-to-device process variation, runtime temperature variations, defective devices, and retention problems. Consequently, the reliability of CiM-implemented NNs, both post-manufacturing and post-deployment, is essential and challenging for the proper operation of the NN in safety-critical applications such as medical imaging and autonomous driving. Hardware-only solutions may not be optimal because they may increase hardware overhead. Therefore, in this PhD research, hardware-algorithm co-design-based solutions are explored to address the reliability of NNs implemented in CiM architectures. We also intend to take advantage of the statistical nature of NVM devices and propose statistical NN inference, such as Bayesian inference, that not only provides inherent robustness to variations but also quantifies model uncertainty. |
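The following is a minimal, hypothetical sketch (not the synthesis algorithm or benchmarks of PhDF.20) illustrating the functional-synthesis formulation referenced there: given a relational specification R(X, Y), find a function F such that Y := F(X) satisfies R(X, F(X)) for every input X. The toy relation, the candidate F, and the exhaustive check are all illustrative assumptions.

```python
# Toy illustration of functional synthesis: R(X, Y) is a relational specification,
# and a synthesized Skolem function F must satisfy R(X, F(X)) for all inputs X.
from itertools import product

def R(x1: bool, x2: bool, y: bool) -> bool:
    """Hypothetical toy specification: the output y must equal x1 XOR x2."""
    return y == (x1 ^ x2)

def F(x1: bool, x2: bool) -> bool:
    """Candidate synthesized function for the toy specification."""
    return x1 ^ x2

# A data-driven synthesizer would learn a candidate F from samples of R and then
# verify (or repair) it; here only the verification step is shown, by enumeration.
assert all(R(x1, x2, F(x1, x2)) for x1, x2 in product([False, True], repeat=2))
print("F(X) realizes the specification R(X, Y) on all inputs")
```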
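As background for the bitstream processing discussed in PhDF.34, the sketch below shows standard unipolar stochastic-computing behaviour (it is not the contingency-table simulator proposed in that dissertation): a value in [0, 1] is encoded as the fraction of 1s in a bitstream, a bitwise AND of two uncorrelated streams multiplies the encoded values, and correlation between streams distorts the result, which is why correlation-aware modelling matters.

```python
# Unipolar stochastic computing: encode p in [0, 1] as the probability of 1s in a
# bitstream; AND of two independent streams approximates the product of their values.
import random

def encode(p: float, n: int, rng: random.Random) -> list:
    """Encode probability p as an n-bit unipolar stochastic bitstream."""
    return [1 if rng.random() < p else 0 for _ in range(n)]

def decode(bits: list) -> float:
    """Decode a bitstream back to the value it represents (fraction of 1s)."""
    return sum(bits) / len(bits)

rng = random.Random(0)
n = 4096
a, b = encode(0.5, n, rng), encode(0.25, n, rng)        # independent bitstreams
print(round(decode([x & y for x, y in zip(a, b)]), 3))   # close to 0.125 = 0.5 * 0.25

# Fully correlated streams break the multiplication: AND then yields min(a, b),
# 0.25 here instead of 0.125, which is what correlation-aware models must capture.
ca = [1] * (n // 2) + [0] * (n // 2)                     # 0.5, all 1s at the front
cb = [1] * (n // 4) + [0] * (3 * n // 4)                 # 0.25, all 1s at the front
print(decode([x & y for x, y in zip(ca, cb)]))           # 0.25
```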
REC Welcome Reception
Add this session to my calendar
Date: Monday, 17 April 2023
Time: 18:30 CET - 20:00 CET
Location / Room: Atrium
Tuesday, 18 April 2023
ASD5 ASD focus session 1: Autonomy-driven Emerging Directions in Software-defined Vehicles
Add this session to my calendar
Date: Tuesday, 18 April 2023
Time: 08:30 CET - 10:00 CET
Location / Room: Gorilla Room 1.5.4/5
Session chair:
Enrico Fraccaroli, University of North Carolina, US
Over the past two decades, the volume of electronics and software in cars has grown tremendously. There is now widespread consensus that more than 90% of the innovation in modern vehicles is driven by them. But this growth has also resulted in hardware and software architectures that are proving to be a bottleneck for further innovation and efficient design flows, especially when implementing the compute-intensive functions necessary for autonomous features. Another emerging trend in the domain of automotive software is the need for continuous improvement and continuous deployment (CI/CD) of functionality, which is enabled by Over-The-Air (OTA) capability. The goal of this special session is to discuss these new trends and the resulting challenges, and to explore emerging solutions and directions in the broad area of design, development, and verification of software-defined vehicles. The three talks will highlight different aspects of software-defined vehicle design, the research challenges they pose, and how they will impact the future automotive design ecosystem.
Time | Label | Presentation Title Authors |
---|---|---|
08:30 CET | ASD5.1 | IMPACTS OF SERVICE ORIENTED COMMUNICATION ON SDV ARCHITECTURES Presenter: Prachi Joshi, General Motors, R&D, US Author: Prachi Joshi, General Motors, R&D, US Abstract . |
09:00 CET | ASD5.2 | "SHIFT-LEFT" DEVELOPMENT AND VALIDATION OF SOFTWARE DEFINED VEHICLES WITH A VIRTUAL PLATFORM Presenter: Unmesh D. Bordoloi, Siemens, US Author: Unmesh D. Bordoloi, Siemens, US Abstract . |
09:30 CET | ASD5.3 | DESIGN TOOLS FOR ASSURED AUTONOMY Presenter: Samarjit Chakraborty, UNC Chapel Hill, US Author: Samarjit Chakraborty, UNC Chapel Hill, US Abstract . |
BPA1 Testing
Add this session to my calendar
Date: Tuesday, 18 April 2023
Time: 08:30 CET - 10:30 CET
Location / Room: Gorilla Room 1.5.1
Session chair:
Alberto Bosio, Ecole Centrale de Lyon, FR
Time | Label | Presentation Title Authors |
---|---|---|
08:30 CET | BPA1.1 | DEVICE-AWARE TEST FOR BACK-HOPPING DEFECTS IN STT-MRAMS Speaker: Sicong Yuan, TU Delft, NL Authors: Sicong Yuan1, Mottaqiallah Taouil1, Moritz Fieback1, Hanzhi Xun1, Erik Marinissen2, Gouri Kar2, Siddharth Rao2, Sebastien Couet2 and Said Hamdioui1 1TU Delft, NL; 2IMEC, BE Abstract The development of Spin-transfer torque magnetic RAM (STT-MRAM) mass production requires high-quality dedicated test solutions, for which understanding and modeling of manufacturing defects of the magnetic tunnel junction (MTJ) is crucial. This paper introduces and characterizes a new defect called Back-Hopping (BH) and provides its fault models and test solutions. The BH defect causes the MTJ state to oscillate during write operations, leading to write failures. The characterization of the defect is carried out on manufactured MTJ devices. Due to the observed non-linear characteristics, the BH defect cannot be modeled with a linear resistance. Hence, device-aware defect modeling is applied by considering the intrinsic physical mechanisms; the model is then calibrated with measurement data. Thereafter, fault modeling and analysis are performed based on circuit-level simulations, and new fault primitives/models are derived that accurately describe how the STT-MRAM behaves in the presence of the BH defect. Finally, dedicated march test and Design-for-Test solutions are proposed. |
08:55 CET | BPA1.2 | CORRECTNET: ROBUSTNESS ENHANCEMENT OF ANALOG IN-MEMORY COMPUTING FOR NEURAL NETWORKS BY ERROR SUPPRESSION AND COMPENSATION Speaker: Amro Eldebiky, TU Munich, DE Authors: Amro Eldebiky1, Grace Li Zhang2, Georg Bocherer3, Bing Li1 and Ulf Schlichtmann1 1TU Munich, DE; 2TU Darmstadt, DE; 3Huawei Munich Research Center, DE Abstract The last decade has witnessed the breakthrough of deep neural networks (DNNs) in many fields. With the increasing depth of DNNs, hundreds of millions of multiply-and-accumulate (MAC) operations need to be executed. To accelerate such operations efficiently, analog in-memory computing platforms based on emerging devices, e.g., resistive RAM (RRAM), have been introduced. These acceleration platforms rely on analog properties of the devices and thus suffer from process variations and noise. Consequently, weights in neural networks configured into these platforms can deviate from the expected values, which may lead to feature errors and a significant degradation of inference accuracy. To address this issue, in this paper, we propose a framework to enhance the robustness of neural networks under variations and noise. First, a modified Lipschitz constant regularization is proposed during neural network training to suppress the amplification of errors propagated through network layers. Afterwards, error compensation is introduced at necessary locations determined by reinforcement learning to rescue the feature maps with remaining errors. Experimental results demonstrate that inference accuracy of neural networks can be recovered from as low as 1.69% under variations and noise back to more than 95% of their original accuracy, while the training and hardware cost are negligible. |
09:20 CET | BPA1.3 | ASSESSING CONVOLUTIONAL NEURAL NETWORKS RELIABILITY THROUGH STATISTICAL FAULT INJECTIONS Speaker: Annachiara Ruospo, Politecnico di Torino, IT Authors: Annachiara Ruospo1, Gabriele Gavarini1, Corrado De Sio1, Juan Guerrero Balaguera1, Luca Sterpone1, Matteo Sonza Reorda1, Ernesto Sanchez1, Riccardo Mariani2, Joseph Aribido3 and Jyotika Athavale3 1Politecnico di Torino, IT; 2NVIDIA, IT; 3NVIDIA, US Abstract Assessing the reliability of modern devices running CNN algorithms is a very difficult task. In fact, the complexity of state-of-the-art devices makes exhaustive Fault Injection (FI) campaigns impractical and typically beyond available computational capabilities. A possible solution consists of resorting to statistical FI campaigns, which reduce the number of needed experiments by injecting only a carefully selected small subset of the possible faults. Under specific hypotheses, statistical FIs guarantee an accurate picture of the problem, albeit with a reduced sample size. The main problems today are related to the choice of the sample size, the location of the faults, and the correct understanding of the statistical assumptions. The intent of this paper is twofold: first, we describe how to correctly specify statistical FIs for Convolutional Neural Networks; second, we propose a data analysis on the CNN parameters that drastically reduces the number of FIs needed to achieve statistically significant results without compromising the validity of the proposed method. The methodology is experimentally validated on two CNNs, ResNet-20 and MobileNetV2, and the results show that a statistical FI campaign on about 1.21% and 0.55% of the possible faults provides very precise information on the CNN reliability. The statistical results have been confirmed by exhaustive FI campaigns on the same case studies. |
09:45 CET | BPA1.4 | INTERACTIVE TECHNICAL PRESENTATIONS BY THE AUTHORS Speaker: Authors of the session, DATE, BE Author: Session Chairs, DATE, BE Abstract Participants can freely interact with authors during their interactive technical presentations. |
BPA2 From synthesis to application
Add this session to my calendar
Date: Tuesday, 18 April 2023
Time: 08:30 CET - 10:30 CET
Location / Room: Okapi Room 0.8.3
Session chair:
Mirjana Stojilovic, EPFL, CH
Time | Label | Presentation Title Authors |
---|---|---|
08:30 CET | BPA2.1 | EFFICIENT PARALLELIZATION OF 5G-PUSCH ON A SCALABLE RISC-V MANY-CORE PROCESSOR Speaker: Marco Bertuletti, ETH Zurich, IT Authors: Marco Bertuletti1, Yichao Zhang1, Alessandro Vanelli-Coralli2 and Luca Benini2 1ETH Zurich, CH; 2ETH Zurich, CH | Università di Bologna, IT Abstract 5G Radio access network disaggregation and softwarization pose challenges in terms of computational performance to the processing units. At the physical layer level, the baseband processing computational effort is typically offloaded to specialized hardware accelerators. However, the trend toward software-defined radio-access networks demands flexible, programmable architectures. In this paper, we explore the software design, parallelization and optimization of the key kernels of the lower physical layer (PHY) for physical uplink shared channel (PUSCH) reception on MemPool and TeraPool, two manycore systems with 256 and 1024 small and efficient RISC-V cores, respectively, and a large shared L1 data memory. PUSCH processing is demanding and strictly time-constrained; it represents a challenge for baseband processors and is common to most of the uplink channels. Our analysis thus generalizes to the entire lower PHY of the uplink receiver at gNodeB (gNB). Based on the evaluation of the computational effort (in multiply-accumulate operations) required by the PUSCH algorithmic stages, we focus on the parallel implementation of the dominant kernels, namely fast Fourier transform, matrix-matrix multiplication, and matrix decomposition kernels for the solution of linear systems. Our optimized parallel kernels achieve speedups of 211, 225, and 158 on MemPool and of 762, 880, and 722 on TeraPool over single-core serial execution, at high utilization (0.81, 0.89, 0.71 and 0.74, 0.88, 0.71), moving a step closer toward a full-software PUSCH implementation. |
08:55 CET | BPA2.2 | NARROWING THE SYNTHESIS GAP: ACADEMIC FPGA SYNTHESIS IS CATCHING UP WITH THE INDUSTRY Speaker: Benjamin Barzen, University of California, Berkeley, DE Authors: Benjamin Barzen1, Arya Reais-Parsi1, Eddie Hung2, Minwoo Kang1, Alan Mishchenko1, Jonathan Greene1 and John Wawrzynek1 1University of California, Berkeley, US; 2FPG-eh Research and University of British Columbia, CA Abstract Historically, open-source FPGA synthesis and technology mapping tools have been considered far inferior to industry-standard tools. We show that this is no longer true. Improvements in recent years to Yosys (Verilog elaborator) and ABC (technology mapper) have resulted in substantially better performance, evident in both the reduction of area utilization and the increase in the maximum achievable clock frequency. More specifically, we describe how ABC9, a set of feature additions to ABC, was integrated into Yosys upstream and is available in the latest version. Technology mapping now has a complete view of the circuit, including support for hard blocks (e.g., carry chains) and multiple clock domains for timing-aware mapping. We demonstrate how these improvements culminate in dramatically better synthesis results, with Yosys-ABC9 reducing the delay gap from 30% to 0% on a commercial FPGA target for the commonly used VTR benchmark, thus matching Vivado's performance in terms of maximum clock frequency. We also measured the performance on a selection of circuits from OpenCores as well as the literature, comparing the results produced by Vivado, Yosys-ABC1 (existing work), and the proposed Yosys-ABC9 integration. |
09:20 CET | BPA2.3 | SAGEROUTE: SYNERGISTIC ANALOG ROUTING CONSIDERING GEOMETRIC AND ELECTRICAL CONSTRAINTS WITH MANUAL DESIGN COMPATIBILITY Speaker: Haoyi Zhang, Peking University, CN Authors: Haoyi Zhang, Xiaohan Gao, Haoyang Luo, Jiahao Song, Xiyuan Tang, Junhua Liu, Yibo Lin, Runsheng Wang and Ru Huang, Peking University, CN Abstract Routing is critical to the post-layout performance of analog circuits. As modern analog layouts need to consider both geometric constraints (e.g., design rules and low bending constraints) and electrical constraints (e.g., electromigration (EM), IR drop, symmetry, etc.), exploring the complicated design space of analog routing becomes increasingly challenging. Most previous work has focused only on geometric constraints or basic electrical constraints, lacking holistic and systematic investigation. Such an approach is far from typical manual design practice and cannot guarantee post-layout performance on real-world designs. In this work, we propose SAGERoute, a synergistic routing framework taking both geometric and electrical constraints into consideration. Through Steiner tree based wire sizing and guided detailed routing, the framework can generate high-quality routing solutions efficiently under versatile constraints on real-world analog designs. |
09:45 CET | BPA2.4 | INTERACTIVE TECHNICAL PRESENTATIONS BY THE AUTHORS Speaker: Authors of the session, DATE, BE Author: Session Chairs, DATE, BE Abstract Participants can freely interact with authors during their interactive technical presentations. |
BPA5 Benchmarking and verification
Add this session to my calendar
Date: Tuesday, 18 April 2023
Time: 08:30 CET - 10:30 CET
Location / Room: Okapi Room 0.8.2
Session chair:
Daniel Große, Johannes Kepler University Linz, AT
Time | Label | Presentation Title Authors |
---|---|---|
08:30 CET | BPA5.1 | BENCHMARKING LARGE LANGUAGE MODELS FOR AUTOMATED VERILOG RTL CODE GENERATION Speaker: Shailja Thakur, New York University, US Authors: Shailja Thakur1, Baleegh Ahmad1, Zhenxing Fan1, Hammond Pearce1, Benjamin Tan2, Ramesh Karri1, Brendan Dolan-Gavitt1 and Siddharth Garg1 1New York University, US; 2University of Calgary, CA Abstract Automating hardware design could eliminate a significant amount of human error from the engineering process, leading to fewer design errors. Verilog is a popular hardware description language for modeling and designing digital systems; thus, generating Verilog code is a critical first step. Emerging large language models (LLMs) are able to write high-quality code in other programming languages. In this paper, we characterize the ability of LLMs to generate useful Verilog. For this, we fine-tune pre-trained LLMs on Verilog datasets collected from GitHub and Verilog textbooks. We construct an evaluation framework comprising test-benches for functional analysis and a flow to test the syntax of Verilog code generated in response to problems of varying difficulty. Our findings show that across our problem scenarios, the fine-tuning results in LLMs more capable of producing syntactically correct code (25.9% overall). Further, when analyzing functional correctness, a fine-tuned open-source CodeGen LLM can outperform the state-of-the-art commercial Codex LLM (6.5% overall). Training/evaluation scripts and LLM checkpoints are available as open source contributions. |
08:55 CET | BPA5.2 | PROCESSOR VERIFICATION USING SYMBOLIC EXECUTION: A RISC-V CASE-STUDY Speaker: Niklas Bruns, Group of Computer Architecture of Universität Bremen, DE Authors: Niklas Bruns1, Vladimir Herdt2 and Rolf Drechsler3 1University of Bremen, DE; 2DFKI, DE; 3University of Bremen | DFKI, DE Abstract We propose to leverage state-of-the-art symbolic execution techniques from the Software (SW) domain for processor verification at the Register-Transfer Level (RTL). In particular, we utilize an Instruction Set Simulator (ISS) as a reference model and integrate it with the RTL processor under test in a co-simulation setting. We then leverage the symbolic execution engine KLEE to perform a symbolic exploration that searches for functional mismatches between the ISS and RTL processor. To ensure a comprehensive verification process, symbolic values are used to represent the instructions and also to initialize the register values of the ISS and processor. As a case study, we present results on the verification of the open source RISC-V based MicroRV32 processor, using the ISS of the open source RISC-V VP as a reference model. Our results demonstrate that modern symbolic execution techniques are applicable to a full scale processor co-simulation in the embedded domain and are very effective in finding bugs in the RTL core. |
09:20 CET | BPA5.3 | PERSPECTOR: BENCHMARKING BENCHMARK SUITES Speaker: Sandeep Kumar, IIT Delhi, IN Authors: Sandeep Kumar1, Abhisek Panda2 and Smruti R. Sarangi1 1IIT Delhi, IN; 2Indian Institute of Technology, IN Abstract Estimating the quality of a benchmark suite is a non-trivial task. A poorly selected or improperly configured benchmark suite can present a distorted picture of the performance of the evaluated framework. With computing venturing into new domains, the total number of benchmark suites available is increasing by the day. Researchers must evaluate these suites quickly and decisively for their effectiveness. We present Perspector, a novel tool to quantify the performance of a benchmark suite. Perspector comprises novel metrics to characterize the quality of a benchmark suite. It provides a mathematical framework for capturing some qualitative suggestions and observations made in prior work. The metrics are generic and domain-agnostic. Furthermore, our tool can be used to compare the efficacy of one suite vis-à-vis other benchmark suites, systematically and rigorously create a suite of workloads, and appropriately tune them for a target system. |
09:45 CET | BPA5.4 | INTERACTIVE TECHNICAL PRESENTATIONS BY THE AUTHORS Speaker: Authors of the session, DATE, BE Author: Session Chairs, DATE, BE Abstract Participants can freely interact with authors during their interactive technical presentations. |
BPA8 Machine Learning techniques for embedded systems
Add this session to my calendar
Date: Tuesday, 18 April 2023
Time: 08:30 CET - 10:30 CET
Location / Room: Marble Hall
Session chair:
Bing Li, TU Munich, DE
Time | Label | Presentation Title Authors |
---|---|---|
08:30 CET | BPA8.1 | PRADA: POINT CLOUD RECOGNITION ACCELERATION VIA DYNAMIC APPROXIMATION Speaker: Zhuoran Song, Shanghai Jiao Tong University, CN Authors: Zhuoran Song, Heng Lu, Gang Li, Li Jiang, Naifeng Jing and Xiaoyao Liang, Shanghai Jiao Tong University, CN Abstract Recent point cloud recognition (PCR) tasks tend to utilize deep neural networks (DNNs) for better accuracy. Still, the computational intensity of DNNs keeps them far from real-time processing, given the fast-increasing number of points that need to be processed. Because a point cloud represents 3D-shaped discrete objects in the physical world using a mass of points, the points tend to be unevenly distributed in the view space, which exposes strong clustering potential and similarities between local pairs. Based on this observation, this paper proposes PRADA, an algorithm-architecture co-design that can accelerate PCR while preserving its accuracy. We propose dynamic approximation, which can approximate and eliminate the similar local pairs' computations and recover their results by copying key local pairs' features for PCR speedup without losing accuracy. To preserve accuracy, we further propose an advanced re-clustering technique to maximize the similarity between local pairs. To improve performance, we then propose a PRADA architecture that can be built on any conventional DNN accelerator to dynamically approximate the similarity and skip the redundant DNN computation with memory accesses at the same time. Our experiments on a wide variety of datasets show that PRADA achieves average speedups of 4.2x, 4.9x, 7.1x, and 12.2x over Mesorasi, a V100 GPU, a 1080TI GPU, and a Xeon CPU with negligible accuracy loss. |
08:55 CET | BPA8.2 | FEDERATED LEARNING WITH HETEROGENEOUS MODELS FOR ON-DEVICE MALWARE DETECTION IN IOT NETWORKS Speaker: Sanket Shukla, George Mason University, IN Authors: Sanket Shukla1, Setareh Rafatirad2, Houman Homayoun3 and Sai Manoj Pudukotai Dinakarrao4 1George Mason University, US; 2University of California, Davis, US; 3University of California Davis, US; 4George Mason University, US Abstract IoT devices have been widely deployed in a vast number of applications to facilitate smart technology, increased portability, and seamless connectivity. Despite being widely adopted, security in IoT devices is often considered an afterthought due to resource and cost constraints. Among multiple security threats, malware attacks are observed to be a pivotal threat to IoT devices. Considering the spread of IoT devices and the threats they experience over time, deploying a static malware detector that is trained offline seems to be an ineffective solution. On the other hand, on-device learning is an expensive or infeasible option due to the limited available resources on IoT devices. To overcome these challenges, this work employs 'Federated Learning' (FL), which enables timely updates to the malware detection models for increased security while mitigating the high communication or data storage overhead of centralized cloud approaches. Federated learning allows training machine learning models with decentralized data while preserving its privacy by design. However, one of the challenges with FL is that the on-device models are required to be homogeneous, which may not be true in the case of networked IoT systems. As a panacea, we introduce a methodology to unify the models in the cloud with minimal overheads and minimal impact on on-device malware detection. We evaluate the proposed technique against homogeneous models in networked IoT systems encompassing Raspberry Pi devices. The experimental results and system efficiency analysis indicate that end-to-end training time is just 1.12× higher than traditional FL, testing latency is 1.63× faster, and malware detection performance is improved by 7% to 13% for resource-constrained IoT devices. |
09:20 CET | BPA8.3 | GENETIC ALGORITHM-BASED FRAMEWORK FOR LAYER-FUSED SCHEDULING OF MULTIPLE DNNS ON MULTI-CORE SYSTEMS Speaker: Sebastian Karl, TU Munich, DE Authors: Sebastian Karl1, Arne Symons2, Nael Fasfous3 and Marian Verhelst2 1TU Munich, DE; 2KU Leuven, BE; 3BMW AG, DE Abstract Heterogeneous multi-core architectures are becoming a popular design choice to accelerate the inference of modern deep neural networks (DNNs). This trend allows for more flexible mappings onto the cores, but shifts the challenge to keeping all cores busy due to limited network parallelism. To this extent, layer-fused processing, where several layers are mapped simultaneously to an architecture and executed in a depth-first fashion, has shown promising opportunities to maximize core utilization. However, SotA mapping frameworks fail to efficiently map layer-fused DNNs onto heterogeneous multi-core architectures due to ignoring 1.) on-chip weight traffic and 2.) inter-core communication congestion. This work tackles these shortcomings by introducing a weight memory manager (WMM), which manages the weights present in a core and models the cost of re-fetching weights. Secondly, the inter-core communication (ICC) of feature data is modeled through a limited-bandwidth bus, and optimized through a contention-aware scheduler (CAS). Relying on these models, a genetic algorithm is developed to optimally schedule different DNN layers across the different cores. The impact of our enhanced modeling, core allocation and scheduling capabilities is shown in several experiments and demonstrates a decrease of 52% in latency and 38% in energy when mapping a multi-DNN inference, consisting of ResNet-18, MobileNet-V2 and Tiny YOLO V2, on a heterogeneous multi-core platform compared to iso-area homogeneous architectures. |
09:45 CET | BPA8.4 | INTERACTIVE TECHNICAL PRESENTATIONS BY THE AUTHORS Speaker: Authors of the session, DATE, BE Author: Session Chairs, DATE, BE Abstract Participants can freely interact with authors during their interactive technical presentations. |
MPP2 Multi-partner projects
Add this session to my calendar
Date: Tuesday, 18 April 2023
Time: 08:30 CET - 10:00 CET
Location / Room: Gorilla Room 1.5.3
Session chair:
Paul Pop, TU Denmark, DK
Time | Label | Presentation Title Authors |
---|---|---|
08:30 CET | MPP2.1 | SECURING A RISC-V ARCHITECTURE: A DYNAMIC APPROACH Speaker: Sebastien Pillement, IETR - Nantes University, FR Authors: Sebastien Pillement1, Maria Mendez Real1, Juliette Pottier1, Thomas Nieddu2, Bertrand Le Gall2, Sébastien Faucou3, Jean-Luc Béchennec4, Mikaël Briday5, Sylvain Girbal6, Jimmy Le Rhun6, Olivier Gilles6, Daniel Gracia Pérez7, Andre Sintzoff8 and Jean-Roch Coulon8 1École Polytechnique de l'Université de Nantes, FR; 2IMS, FR; 3Université de Nantes, FR; 4LS2N/CNRS, FR; 5École Centrale de Nantes - LS2N, FR; 6THALES TRT, FR; 7Thales, FR; 8THALES DIS, FR Abstract For decades, the evolution of processors has focused on improving their performance. In recent years, attacks directly exploiting optimization mechanisms have appeared. Exploiting, for example, caches, performance counters or speculation units, they jeopardize the safety and security of processors and the industrial systems that operate them. We can cite SPECTRE and Meltdown as flagship examples. The open-HW approaches, and in particular the RISC-V initiative, are now both an economic reality and an innovation opportunity for European players in the field of processor architecture. The use of this open-source approach requires the design of secure processor cores, and therefore makes it possible to move towards greater independence in the field of cyber-security. The SECURE-V project offers an innovative open-source, secure and high-performance processor core based on the RISC-V ISA. The originality of the approach lies in the integration of a dynamic code transformation unit covering 4 of the 5 NIST functions of cybersecurity, in particular via monitoring (identify, detect), obfuscation (protect), and dynamic adaptation (react). This dynamic management paves the way for online optimizations that improve the security and safety of the micro-architecture without overhauling the software or the architecture of the chip. |
08:33 CET | MPP2.2 | THE ZUSE-KI-MOBIL AI ACCELERATOR SOC: OVERVIEW AND A FUNCTIONAL SAFETY PERSPECTIVE Speaker: Fabian Kempf, Karlsruhe Institute of Technology, DE Authors: Fabian Kempf1, Julian Hoefer1, Tanja Harbaum1, Juergen Becker1, Nael Fasfous2, Alexander Frickenstein3, Hans-Jörg Vögel3, Simon Friedrich4, Robert Wittig4, Emil Matus4, Gerhard Fettweis4, Matthias Lueders5, Holger Blume6, Karl-Heinz Eickel7, Darius Grantz7, Jens Benndorf7, Martin Zeller7 and Dietmar Engelke7 1Karlsruhe Institute of Technology, DE; 2BMW AG, DE; 3BMW Group, DE; 4TU Dresden, DE; 5Leibniz University Hannover, DE; 6Leibniz Universität Hannover, DE; 7Dream Chip Technologies GmbH, DE Abstract The goal of the ZuKIMo project is to develop a new System-on-Chip (SoC) platform and corresponding ecosystem to enable efficient Artificial Intelligence (AI) applications with specific requirements. With ZuKIMo, we specifically target applications from the mobility domain, i.e. autonomous vehicles and drones. The initial ecosystem is built by a consortium consisting of seven partners from German academia and industry. We develop the SoC platform and its ecosystem around a novel AI Accelerator design. The customizable accelerator is conceived from scratch to fulfill the functional and non-functional requirements derived from the ambitious use cases. A tape-out in 22 nm FDX-technology is planned in 2023. Apart from the System-on-Chip hardware design itself, the ZuKIMo ecosystem has the objective of providing software tooling for easy deployment of new use cases and hardware-CNN co-design. Furthermore, AI accelerators in safety-critical applications like our mobility use cases, necessitate the fulfillment of safety requirements. Therefore, we investigate new design methodologies for fault analysis of Deep Neural Networks (DNNs) and introduce our new redundancy mechanism for AI accelerators. |
08:36 CET | MPP2.3 | ZUSE-KI-AVF: APPLICATION-SPECIFIC AI PROCESSOR FOR INTELLIGENT SENSOR SIGNAL PROCESSING IN AUTONOMOUS DRIVING Speaker: Sven Gesper, TU Braunschweig, DE Authors: Gia Bao Thieu1, Sven Gesper2, Guillermo Payá Vayá1, Christoph Riggers3, Oliver Renke3, Till Fiedler3, Jakob Marten3, Tobias Stuckenberg4, Holger Blume3, Christian Weis5, Lukas Steiner5, Chirag Sudarshan5, Norbert Wehn5, Lennart Reimann6, Rainer Leupers6, Michael Beyer7, Daniel Köhler7, Alisa Jauch7, Jan Micha Bormann7, Setareh Jaberansari7, Tim Berthold8, Meinolf Blawat8, Markus Kock8, Gregor Schewior8, Jens Benndorf8, Frederik Kautz9, Hans-Martin Bluethgen10 and Christian Sauer9 1TU Braunschweig, DE; 2TU Braunschweig, Chair of Chip Design for Embedded Computing, DE; 3Leibniz Universität Hannover, DE; 4Leibniz Universität Hannover, Institute of Microelectronic Systems, DE; 5TU Kaiserslautern, DE; 6RWTH Aachen University, DE; 7Robert Bosch GmbH, DE; 8Dream Chip Technologies GmbH, DE; 9Cadence Design Systems, DE; 10Cadence Design System GmbH, DE Abstract Modern and future AI-based automotive applications, such as autonomous driving, require the efficient real-time processing of huge amounts of data from different sensors, like camera, radar, and LiDAR. In the ZuSE-KI-AVF project, multiple university and industry partners collaborate to develop a novel massively parallel processor architecture, based on a customized RISC-V host processor and an efficient high-performance vertical vector coprocessor. In addition, a software development framework is also provided to efficiently program AI-based sensor processing applications. The proposed processor system was verified and evaluated on a state-of-the-art UltraScale+ FPGA board, reaching a processing performance of up to 126.9 FPS, while executing the YOLO-LITE CNN on 224x224 input images. Further optimizations of the FPGA design and the realization of the processor system on a 22nm FDSOI CMOS technology are planned. |
08:39 CET | MPP2.4 | EUFRATE: EUROPEAN FPGA RADIATION-HARDENED ARCHITECTURE FOR TELECOMMUNICATIONS Speaker: Luca Sterpone, Politecnico di Torino - Department of Control and Computer Engineering (DAUIN), IT Authors: Ludovica Bozzoli1, Antonino Catanese1, Emilio Fazzoletto1, Eugenio Scarpa1, Diana Goehringer2, Sergio Pertuz2, Lester Kalms2, Cornelia Wulf2, Najdet Charaf3, Luca Sterpone4, Sarah Azimi4, Daniele Rizzieri4, Salvatore Gabriele La Greca4, David Merodio Codinachs5 and Stephen King5 1Argotec, IT; 2TU Dresden, DE; 3TU Dresden, Faculty of Computer Science, Chair of Adaptive Dynamic Systems, DE; 4Politecnico di Torino, IT; 5European Space Agency, NL Abstract The EuFRATE project aims to research, develop and test radiation-hardening methods for telecommunication payloads deployed for Geostationary-Earth Orbit (GEO) using Commercial-Off-The-Shelf Field Programmable Gate Arrays (FPGAs). This project is conducted by Argotec Group (Italy) with the collaboration of two partners: Politecnico di Torino (Italy) and Technische Universität Dresden (Germany). The idea of the project focuses on high-performance telecommunication algorithms and the design and implementation strategies for connecting an FPGA device into a robust and efficient cluster of multi-FPGA systems. The radiation-hardening techniques currently under development are addressing both device and cluster levels, with redundant datapaths on multiple devices, comparing the results and isolating fatal errors. This paper introduces the current state of the project's hardware design description, the composition of the FPGA cluster node, the proposed cluster topology, and the radiation hardening techniques. Intermediate stage experimental results of the FPGA communication layer performance and fault detection techniques are presented. Finally, a wide summary of the project's impact on the scientific community is provided. |
08:42 CET | MPP2.5 | THE SERRANO PLATFORM: STEPPING TOWARDS SEAMLESS APPLICATION DEVELOPMENT & DEPLOYMENT IN THE HETEROGENEOUS EDGE-CLOUD CONTINUUM Speaker: Argyrios Kokkinis, Aristotle University of Thessaloniki, GR Authors: Aggelos Ferikoglou1, Argyris Kokkinis1, Dimitrios Danopoulos1, Ioannis Oroutzoglou1, Anastasios Nanos2, Stathis Karanastasis3, Marton Sipos4, Javad Ghotbi5, Juan Jose Olmos6, Dimosthenis Masouros1 and Kostas Siozios7 1Aristotle University of Thessaloniki, GR; 2Nubificus LTD, GB; 3INNOV, GR; 4Chocolate Cloud, DK; 5HLRS, DE; 6NVIDIA, DK; 7Department of Physics, Aristotle University of Thessaloniki, GR Abstract The need for real-time analytics and faster decision-making mechanisms has led to the adoption of hardware accelerators such as GPUs and FPGAs within the edge cloud computing continuum. However, their programmability and lack of orchestration mechanisms for seamless deployment make them difficult to use efficiently. We address these challenges by presenting SERRANO, a project for transparent application deployment in a secure, accelerated, and cognitive cloud continuum. In this work, we introduce the SERRANO platform and its software, orchestration, and deployment services, focusing on its methods for automated GPU/FPGA acceleration and efficient, isolated, and secure deployments. By evaluating these services against representative use cases, we highlight SERRANO's ability to simplify the development and deployment process without sacrificing performance. |
08:45 CET | MPP2.6 | EVALUATION OF HETEROGENEOUS AIOT ACCELERATORS WITHIN VEDLIOT Speaker: Rene Griessl, Bielefeld University, DE Authors: Rene Griessl1, Florian Porrmann1, Nils Kucza1, Kevin Mika1, Jens Hagemeyer1, Martin Kaiser1, Mario Porrmann2, Marco Tassemeier2, Marcel Flottmann2, Fareed Qararyah3, Muhammad Azhar3, Pedro Trancoso3, Daniel Odman4, Karol Gugala5 and Grzegorz Latosinksi5 1Bielefeld University, DE; 2Osnabrueck University, DE; 3Chalmers, SE; 4EmbeDL AB, SE; 5Antmicro Ltd, PL Abstract Within VEDLIoT, a project targeting the development of energy-efficient Deep Learning for distributed AIoT applications, several accelerator platforms based on technologies like CPUs, embedded GPUs, FPGAs, or specialized ASICs are evaluated. The VEDLIoT approach is based on modular and scalable cognitive IoT hardware platforms. Modular microserver technology enables the integration of different, heterogeneous accelerators into one platform. Benchmarking of the different accelerators takes into account performance, energy efficiency and accuracy. The results in this paper provide a solid overview of available accelerator solutions and offer guidance on hardware selection for AIoT applications from far edge to cloud. VEDLIoT is an H2020 EU project which started in November 2020. It is currently in an intermediate stage. The focus here is on the performance and energy efficiency of hardware accelerators. Apart from the hardware and accelerator focus presented in this paper, the project also covers toolchain, security and safety aspects. The resulting technology is tested on a wide range of AIoT applications. |
08:48 CET | MPP2.7 | SPHERE-DNA: PRIVACY-PRESERVING FEDERATED LEARNING FOR EHEALTH Speaker: Jari Nurmi, Tampere University, FI Authors: Jari Nurmi1, Yinda Xu1, Jani Boutellier2 and Bo Tan1 1Tampere University, FI; 2University of Vaasa, FI Abstract The rapid growth of chronic diseases and medical conditions (e.g. obesity, depression, diabetes, respiratory and musculoskeletal diseases) in many OECD countries has become one of the most significant wellbeing problems, which also puts pressure on the sustainability of healthcare and economies. Thus, it is important to promote early diagnosis, intervention, and healthier lifestyles. One partial solution to the problem is extending long-term health monitoring from hospitals to natural living environments. It has been shown in laboratory settings and practical trials that sensor data, such as camera images, radio samples, acoustic signals, infrared etc., can be used for accurately modelling activity patterns that are related to different medical conditions. However, due to the rising concern related to private data leaks and, consequently, stricter personal data regulations, the growth of pervasive residential sensing for healthcare applications has been slow. To mitigate public concern and meet the regulatory requirements, our national multi-partner project aims to combine pervasive sensing technology with secure and privacy-preserving distributed frameworks for healthcare applications. The project leverages local differential privacy federated learning (LDP-FL) to achieve resilience against active and passive attacks, as well as edge computing to avoid transmitting sensitive data over networks. Combinations of sensor data modalities and security architectures are explored by a machine learning architecture for finding the most viable technology combinations, relying on metrics that allow balancing between computational cost and accuracy for a desired level of privacy. We also consider realistic edge computing platforms and develop hardware acceleration and approximate computing techniques to facilitate the adoption of LDP-FL and privacy-preserving signal processing on lightweight edge processors. A proof-of-concept (PoC) multimodal sensing system will be developed and a novel multimodal dataset will be collected during the project to verify the concept. |
08:51 CET | MPP2.8 | INTERACTIVE TECHNICAL PRESENTATIONS BY THE AUTHORS Speaker: Authors of the session, DATE, BE Author: Session Chairs, DATE, BE Abstract Participants can freely interact with authors during their interactive technical presentations. |
SpD1 Special Day on Human AI-Interaction: Introduction, innovations and technologies
Add this session to my calendar
Date: Tuesday, 18 April 2023
Time: 08:30 CET - 10:00 CET
Location / Room: Darwin Hall
Computing systems are increasingly entangled with the physical world, such that keyboards and screens are no longer the only way to communicate between humans and computers. More “natural” ways to communicate, such as voice commands, analysis of the environment, and imaging, are increasingly widespread thanks to progress in Artificial Intelligence. To further enhance communication and understanding between humans and machines, the next step for computing systems will be to enable more precise evaluation of all implicit communications, including emotions. In return, they should provide more natural, human-like responses in a trustworthy way. The goal of this special day on Human AI-interaction is to show the latest developments in this field, including “emotional systems”, but also to present the corresponding ethical aspects.
Time | Label | Presentation Title Authors |
---|---|---|
08:30 CET | SpD1.1 | INTRODUCTION AND A QUICK STATE-OF-THE-ART ON HUMAN AI-INTERACTION. Speaker: Marina Zapater, University of Applied Sciences Western Switzerland (HES-SO), CH and Marc Duranton, CEA, FR Authors: Marina Zapater1 and Marc Duranton2 1University of Applied Sciences Western Switzerland (HES-SO), CH; 2CEA, FR Abstract We will give a short introduction to the topic of Human AI-Interaction, which involves the study of how humans and machines can communicate and interact with each other in a more natural and intuitive way, and show existing realizations exhibiting a few aspects of this field. |
09:00 CET | SpD1.2 | THE FUTURE OF BRAIN-MACHINE INTERFACES: AI-DRIVEN INNOVATIONS Presenter: Shoaran Mahsa, EPFL, CH Author: Shoaran Mahsa, EPFL, CH Abstract Implantable neural devices and Brain-Machine Interfaces (BMIs) hold the promise to offer new therapies for brain disorders when symptoms no longer improve with medications and other treatments. Despite significant advances in neural interface microsystems over the past decade, the limited embedded processing and small number of channels in the existing technologies remain a barrier to their therapeutic potential. In this talk, I will provide an overview of the state-of-the-art research on BMIs and our recent efforts to integrate modern machine learning techniques on neural microchips for various neurological and psychiatric disorders. I will also discuss how AI can improve next-generation BMIs to restore movement and communication for paralyzed patients. |
09:30 CET | SpD1.3 | PRIVACY-PRESERVING EDGE FEDERATED LEARNING. Presenter: Amir Aminifar, Lund University, SE Author: Amir Aminifar, Lund University, SE Abstract We are now entering the era of intelligent Internet of Things (IoT) systems. The bar is set high. Despite the inherently complex nature of human interactions, we would like these systems to react to our inputs, and perhaps even to our emotions, in real time. We also expect such systems to be self-adaptive, i.e., continuously learn and evolve over time in interaction with humans. At the same time, we would like these systems to be trustworthy, e.g., to ensure privacy with respect to our personal data. In this talk, we discuss how edge federated learning could address such challenges and pave the way for the development of intelligent IoT systems. |
W03 Workshop on Nano Security: From Nano-Electronics to Secure Systems
Add this session to my calendar
Date: Tuesday, 18 April 2023
Time: 08:30 CET - 12:30 CET
Location / Room: Nightingale Room 2.6.1/2
Organisers:
Ilia Polian, University of Stuttgart, DE
Nan Du, Friedrich Schiller University Jena, Germany, DE
Shahar Kvatinsky, Technion – Israel Institute of Technology, IL
Ingrid Verbauwhede, KU Leuven, BE
Today’s societies critically depend on electronic systems. The security of such systems is facing completely new challenges due to the ongoing transition to radically new types of nano-electronic devices, such as memristors, spintronics, or carbon nanotubes. The use of such emerging nano-technologies is inevitable in order to address essential needs related to energy efficiency, computing power and performance. Therefore, the entire industry is switching to emerging nano-electronics alongside scaled CMOS technologies in heterogeneous integrated systems. These technologies come with new properties and also facilitate the development of radically different computer architectures.
The proposed workshop will bring together researchers from hardware-oriented security and from emerging hardware technology. It will explore the potential of new technologies and architectures to provide new opportunities for achieving security targets, but it will also raise questions about their vulnerabilities to new types of hardware-oriented attacks. The workshop is based on a Priority Program https://spp-nanosecurity.uni-stuttgart.de/ funded since 2019 by the German DFG, and will be open to members and non-members of that Priority Program alike.
W03.1 Keynote
Add this session to my calendar
Date: Tuesday, 18 April 2023
Time: 08:30 CET - 09:15 CET
Location / Room: Nightingale Room 2.6.1/2
Session chair:
Ilia Polian, University of Stuttgart, DE
Securing the Internet of Bodies using Human Body as a ‘Wire’
Shreyas Sen, Purdue University
Abstract: Radiative communication using electromagnetic (EM) fields is the state of the art for connecting wearable and implantable devices, enabling prime applications in the fields of connected healthcare, electroceuticals, neuroscience, augmented and virtual reality (AR/VR) and human-computer interaction (HCI), forming a subset of the Internet of Things called the Internet of Bodies (IoB). However, owing to the radiative nature of traditional wireless communication, EM signals propagate in all directions, inadvertently allowing an eavesdropper to intercept the information. Moreover, since only a fraction of the energy is picked up by the intended device, and a high carrier frequency is needed relative to the information content, wireless communication tends to suffer from poor energy efficiency (>nJ/bit). Noting that all IoB devices share a common medium, i.e., the human body, using the conductivity of the human body allows low-loss transmission, termed human body communication (HBC), and improves energy efficiency. Conventional HBC implementations still suffer from significant radiation, compromising physical security and efficiency. Our recent work has developed Electro-Quasistatic Human Body Communication (EQS-HBC), a method for localizing signals within the body using low-frequency transmission, thereby making it extremely difficult for a nearby eavesdropper to intercept critical personal data, thus producing a covert communication channel, i.e., the human body as a ‘wire’, while also reducing interference.
In this talk, I will explore the fundamentals of radio communication around the human body that led to the evolution of EQS-HBC and show recent advancements in the field, which holds strong promise to become the future of Body Area Networks (BAN). I will show the theoretical development of the first Bio-Physical Model of EQS-HBC and how it was leveraged to develop the world’s lowest-energy (<10pJ/b) and world’s first sub-uW Physically and Mathematically Secure (AES 256) IoB Communication SoC, with >100x improvement in energy efficiency over Bluetooth. I’ll also highlight how recent developments in mixed-signal circuit techniques allow orders-of-magnitude improvement in the side-channel attack resistance of the encryption engines in such SoCs. Finally, I will highlight the possibilities and applications in the fields of HCI, Medical Device Communication, and Neuroscience, including a video demonstration, and show how such low-power secure communication in combination with in-sensor intelligence is paving the way forward for Secure and Efficient IoB Sensor Nodes.
Bio: Shreyas Sen is an Elmore Associate Professor of ECE & BME at Purdue University and the Founder and CTO of Ixana; he received his Ph.D. degree in ECE from Georgia Tech. His current research interests span mixed-signal circuits/systems and electromagnetics for the Internet of Things (IoT), Biomedical, and Security. He has authored/co-authored 3 book chapters, over 175 journal and conference papers and has 25 patents granted/pending. Dr. Sen serves as the founding Director of the Center for Internet of Bodies (C-IoB) at Purdue. Dr. Sen is the inventor of the Electro-Quasistatic Human Body Communication (EQS-HBC), or Body as a Wire technology, for which he is the recipient of the MIT Technology Review top-10 Indian Inventor Worldwide under 35 (MIT TR35 India) Award in 2018 and the Georgia Tech 40 Under 40 Award in 2022. His work has been covered by 250+ news releases worldwide, an IEEE Spectrum feature, an invited appearance on TEDx Indianapolis, the Indian National Television CNBC TV18 Young Turks Program, NPR subsidiary Lakeshore Public Radio and the CyberWire podcast. Dr. Sen is a recipient of the NSF CAREER Award 2020, AFOSR Young Investigator Award 2016, NSF CISE CRII Award 2017, Intel Outstanding Researcher Award 2020, Google Faculty Research Award 2017, Purdue CoE Early Career Research Award 2021, Intel Labs Quality Award 2012 for industry-wide impact on USB-C type, Intel Ph.D. Fellowship 2010, IEEE Microwave Fellowship 2008, GSRC Margarida Jacome Best Research Award 2007, and nine best paper awards including IEEE CICC 2019, 2021 and IEEE HOST 2017-2020, for four consecutive years. Dr. Sen’s work was chosen as one of the top-10 papers in the Hardware Security field (TopPicks 2019). He serves/has served as an Associate Editor for IEEE Solid-State Circuits Letters (SSC-L), Nature Scientific Reports, Frontiers in Electronics, and IEEE Design & Test, as an Executive Committee member of the IEEE Central Indiana Section, and as a Technical Program Committee member of DAC, CICC, IMS, CCS, DATE, ISLPED, ICCAD, ITC, VLSI Design, among others. Dr. Sen is a Senior Member of IEEE.
W03.2 Session 1: PUFs and RNGs
Add this session to my calendar
Date: Tuesday, 18 April 2023
Time: 09:15 CET - 09:45 CET
Location / Room: Nightingale Room 2.6.1/2
Session chair:
Nan Du, Friedrich Schiller University Jena, Germany, DE
Carbon-Nanotube-Based Physical Unclonable Functions and True Random Number Generators
Nikolaos Athanasios Anagnostopoulos1, Tolga Arul1,2, Simon Böttger3, Florian Frank1, Ali Mohamed3, Martin Hartmann3, Sascha Hermann3,4 and Stefan Katzenbeisser1,
1University of Passau, 2TU Darmstadt, 3TU Chemnitz, 4Fraunhofer ENAS, Chemnitz
Towards a PVT-Variation Resistant Resistor-Based PUF
Carl Riehm1, Christoph Frisch1, Florin Burcea1, Matthias Hiller2, Michael Pehl1 and Ralf Brederlow1,
1TU Munich 2Fraunhofer AISEC, Garching
W03.3 Session 2: Side-channel Attacks
Add this session to my calendar
Date: Tuesday, 18 April 2023
Time: 09:45 CET - 10:15 CET
Location / Room: Nightingale Room 2.6.1/2
Session chair:
Ingrid Verbauwhede, KU Leuven, BE
Practical Considerations for Optical Side-Channel Analysis: A Case Study on Reconfigurable FETs
Thilo Krachenfels1, Giulio Galderisi2, Thomas Mikolajick2,3, Jens Trommer2 and Jean-Pierre Seifert1,4,
1TU Berlin, 2NaMLab gGmbH, Dresden, 3TU Dresden, 4Fraunhofer SIT, Darmstadt
Side-Channel Leakage Evaluation of Multi-Chip Cryptographic Modules
Kazuki Monta1, Takumi Matsumaru1, Takaaki Okidono2, Takuji Miki1 and Makoto Nagata1,
1Kobe U 2SCU Co. Ltd, Tokyo
W03.4 Poster session: Projects of Priority Program Nano Security
Add this session to my calendar
Date: Tuesday, 18 April 2023
Time: 10:15 CET - 11:00 CET
Location / Room: Nightingale Room 2.6.1/2
Session chair:
Shahar Kvatinsky, Technion – Israel Institute of Technology, IL
PUFMem: Intrinsic Physical Unclonable Functions from Emerging Non-Volatile Memories
Tolga Arul, Stefan Katzenbeisser, Florian Frank, University of Passau
nanoEBeam: E Beam Probing for backside attacks against nanoscale ICs
Frank Altmann, Jörg Jatzkowski, FhG IMWS Halle, Elham Amini, Jean-Pierre Seifert, Christian Boit, Thilo Krachenfels, TU Berlin
STAMPS: From Strain to Trust: tAMper aware silicon PufS
Ralf Brederlow, TU Munich, Matthias Hiller, FhG AISEC
RAINCOAT: Randomization in Secure Nano-Scale Microarchitectures
Christian Niesler1, Jan Thoma2, Lucas Davi1, Tim Güneysu2
1University of Duisburg-Essen, 2Ruhr University Bochum
OptiSecure: Securing Nano-Circuits against Optical Probing
Sajjad Parvin1, Thilo Krachenfels2, Frank Sill Torres3, Jean-Pierre Seifert2,4, Rolf Drechsler1,5
1University of Bremen, 2TU Berlin, 3DLR, Bremerhaven, 4Fraunhofer SIT, Darmstadt, 5DFKI, Bremen
MemCrypto: Towards Secure Electroforming-free Memristive Cryptographic Implementations
Nan Du (University of Jena and Leibniz IPHT), Ilia Polian (University of Stuttgart)
HaSPro: Verifiable Hardware Security for Out-of-Order Processors
Thomas Eisenbarth, University of Lübeck, Wolfgang Kunz, Tobias Jauch, TU Kaiserslautern
NANOSEC: Tamper-Evident PUFs based on Nanostructures for Secure and Robust Hardware Security Primitives
Sascha Hermann, TU Chemnitz, Stefan Katzenbeisser, Nikolaos Athanasios Anagnostopoulos, University of Passau
SecuReFET: Secure Circuits through inherent Reconfigurable FET
Shubham Rai, Akash Kumar, TU Dresden
Giulio Galderisi, Thomas Mikolajick, Jens Trommer, NaMLab gGmbH, Dresden
BioNanoLock: Bio-Nanoelectronic based Logic Locking for Secure Systems
Farhad Amirali Merchant, Vivek Pachauri, Rainer Leupers, Elmira Moussavi, RWTH Aachen
RRAMPUFTRNG: CMOS-compatible RRAM-based structures for the implementation of Physical Unclonable Functions (PUF) and True Random Number Generators (TRNG)
Sahitya Yarragolla, Torben Hemke, Thomas Mussenbrock, Ruhr University Bochum
W03.5 Session 3: Trustworthy Electronics
Add this session to my calendar
Date: Tuesday, 18 April 2023
Time: 11:00 CET - 11:45 CET
Location / Room: Nightingale Room 2.6.1/2
Session chair:
Jean-Pierre Seifert, TU Berlin, DE
Quantifying Trust in Hardware through Physical Inspection
Bernhard Lippmann1, Matthias Ludwig1 and Horst Gieser2,
1Infineon Technologies AG, Munich 2Fraunhofer EMFT, Munich
(Un)Attractiveness for State Machine Obfuscation
Michaela Brunner1, Hye Hyun Lee1, Alexander Hepp1, Johanna Baehr1 and Georg Sigl1,2,
1TU Munich 2Fraunhofer AISEC, Garching
Thwarting Structural Attacks on Logic Locking with Reconfigurable Nanotechnologies
Armin Darjani, Nima Kavand and Akash Kumar,
TU Dresden
W03.6 Panel
Add this session to my calendar
Date: Tuesday, 18 April 2023
Time: 11:45 CET - 12:30 CET
Location / Room: Nightingale Room 2.6.1/2
Session Chair:
Ilia Polian, University of Stuttgart, DE
Security Issues in Heterogeneous Systems
Panelists:
Farimah Farahmandi, University of Florida
Sandip Kundu, University of Massachusetts, Amherst
Shahar Kvatinsky, Technion
Johanna Sepulveda, Airbus Defence and Space
ASD6 ASD focus session 2: SelPhys: Self-awareness in Cyber-physical Systems
Add this session to my calendar
Date: Tuesday, 18 April 2023
Time: 11:00 CET - 12:30 CET
Location / Room: Gorilla Room 1.5.4/5
Session chair:
Lukas Esterle, Aarhus University, DK
Session co-chair:
Axel Jantsch, TU Wien, AT
Computational self-awareness enables autonomous systems to operate in rapidly unfolding situations and conditions that have not been considered during development. Cyber-physical systems, constantly interacting with the physical world, have to deal with an even wider spectrum of potentially unknown situations introduced in their environment, including other (autonomous) systems and humans. Their ability to respond appropriately is vital for these systems not only to achieve their goals but also to ensure the safety of other machines and humans in the process. In this special session, we will have various invited talks on different aspects of computational self-awareness and its contribution to autonomous systems design. Specifically, we aim to have talks ranging from fundamental theory on computational self-awareness, through signal processing and embedded and high-performance computing, to applications utilising self-aware properties for increased safety and performance. After the short presentations, the presenters will be invited to participate in a panel discussion together with the audience.
Time | Label | Presentation Title Authors |
---|---|---|
11:00 CET | ASD6.1 | SELF-AWARE MACHINE INTELLIGENCE Presenter: Peter Lewis, Ontario University of Technology, CA Author: Peter Lewis, Ontario University of Technology, CA Abstract . |
11:22 CET | ASD6.2 | INCREMENTAL SELF-AWARENESS BASED ON FREE ENERGY MINIMIZATION FOR AUTONOMOUS AGENTS Presenter: Carlo Regazzoni, University of Genova, IT Authors: Carlo Regazzoni and Lucio Marcenaro, University of Genova, IT Abstract . |
11:45 CET | ASD6.3 | ADAPTIVE, RESILIENT COMPUTING PLATFORMS THROUGH SELF-AWARENESS Presenter: Nikil Dutt, UC Irvine, US Author: Nikil Dutt, UC Irvine, US Abstract . |
12:07 CET | ASD6.4 | COGNITIVE ENERGY SYSTEMS Presenter: Christian Gruhl, University of Kassel, DE Author: Christian Gruhl, University of Kassel, DE Abstract . |
SA3 Applications of emerging technologies and computing paradigms
Add this session to my calendar
Date: Tuesday, 18 April 2023
Time: 11:00 CET - 12:30 CET
Location / Room: Gorilla Room 1.5.3
Session chair:
Ioana Vatajelu, TIMA - CNRS / Université Grenoble Alpes, FR
11:00 CET until 11:24 CET: Pitches of regular papers
11:24 CET until 12:30 CET: Interactive technical presentations by the authors of regular papers and extended abstracts
Regular Papers
Time | Label | Presentation Title Authors |
---|---|---|
11:00 CET | SA3.1 | HDGIM: HYPERDIMENSIONAL GENOME SEQUENCE MATCHING ON UNRELIABLE HIGHLY SCALED FEFET Speaker: Hamza Errahmouni Barkam, University of California, Irvine, US Authors: Hamza Errahmouni Barkam1, Sanggeon Yun2, Paul Genssler3, Zhuowen Zou1, Che-Kai Liu4, Hussam Amrouch5 and Mohsen Imani1 1University of California, Irvine, US; 2Kookmin University, KR; 3University of Stuttgart, DE; 4Zhejiang University, CN; 5TU Munich, DE Abstract This is the first work to i) theoretically define the memorization capacity of Hyperdimensional Computing (HDC) hyperparameters and ii) present a reliable application for highly scaled (down to merely 3nm), multi-bit Ferroelectric FET (FeFET) technology. FeFET is one of the up-and-coming emerging technologies that is not only fully compatible with existing CMOS but also holds the promise of realizing ultra-efficient and compact Compute-in-Memory (CiM) architectures. Nevertheless, FeFETs struggle with the 10nm thickness of the Ferroelectric (FE) layer. This makes scaling profoundly challenging if not impossible, because thinner FE significantly shrinks the memory window, leading to large error probabilities that cannot be tolerated. To overcome these challenges, we propose HDGIM, a hyperdimensional computing framework catered to FeFET in the context of genome sequence matching. Genome Sequence Matching is known to have high computational costs, primarily due to huge data movement that substantially overwhelms von Neumann architectures. On the one hand, our cross-layer FeFET reliability modeling (starting from device physics to circuits) accurately captures the impact of FE scaling on errors induced by process variation and inherent stochasticity in multi-bit FeFETs. On the other hand, our HDC learning framework iteratively adapts by using two models, a full-precision, ideal model for training and a quantized, noisy version for validation and inference. Our results demonstrate that highly scaled FeFETs realizing 3-bit and even 4-bit storage can withstand any noise given high dimensionality during inference. If we consider the noise during model adjustment, we can improve the inherent robustness compared to adding noise during the matching process. |
11:03 CET | SA3.2 | QUANTUM MEASUREMENT DISCRIMINATION USING CUMULATIVE DISTRIBUTION FUNCTIONS Speaker: Prabhat Mishra, University of Florida, US Authors: Zachery Utt, Daniel Volya and Prabhat Mishra, University of Florida, US Abstract Quantum measurement is one of the critical steps in quantum computing that determines the probabilities associated with qubit states after conducting several circuit executions and measurements. As a mesoscopic quantum system, real quantum computers are prone to noise. Therefore, a major challenge in quantum measurement is how to correctly interpret the noisy results of a quantum computer. While there are promising classification based solutions, they either produce incorrect results (misclassify) or require many measurements (expensive). In this paper, we present an efficient technique to estimate a qubit's state through analysis of probability distributions of post-measurement data. Specifically, it estimates the state of a qubit using cumulative distribution functions to compare the measured distribution of a sample with the distributions of basis states. Our experimental results demonstrate a drastic reduction (78%) in single qubit readout error. It also provides significant reduction (12%) when used to boost existing multi-qubit discriminator models. |
11:06 CET | SA3.3 | EXTENDING THE DESIGN SPACE OF DYNAMIC QUANTUM CIRCUITS FOR TOFFOLI BASED NETWORK Speaker: Abhoy Kole, DFKI, IN Authors: Abhoy Kole1, Arighna Deb2, Kamalika Datta1 and Rolf Drechsler3 1German Research Centre for Artificial Intelligence (DFKI), DE; 2School of Electronics Engineering KIIT DU, IN; 3University of Bremen | DFKI, DE Abstract Recent advances in fault tolerant quantum systems allow to perform non-unitary operations like mid-circuit measurement, active reset and classically controlled gate operations in addition to the existing unitary gate operations. Real quantum devices that support these non-unitary operations enable us to execute a new class of quantum circuits, known as Dynamic Quantum Circuits (DQC). This helps to enhance the scalability, thereby allowing execution of quantum circuits comprising of many qubits by using at least two qubits. Recently DQC realizations of multi-qubit Quantum Phase Estimation (QPE) and Bernstein–Vazirani (BV) algorithms have been demonstrated in two separate experiments. However the dynamic transformation of complex quantum circuits consisting of Toffoli gate operations have not been explored yet. This motivates us to: (a) explore the dynamic realization of Toffoli gates by extending the design space of DQC for Toffoli networks, and (b) propose a general dynamic transformation algorithm for the first time to the best of our knowledge. More precisely, we introduce two dynamic transformation schemes (dynamic-1 and dynamic-2) for Toffoli gates, that differ with respect to the required number of classically controlled gate operations. For evaluation, we consider the Deutsch–Jozsa (DJ) algorithm composed of one or more Toffoli gates. Experimental results demonstrate that dynamic DJ circuits based on dynamic-2 Toffoli realization scheme provides better computational accuracy over the dynamic-1 scheme. Further, the proposed dynamic transformation scheme is generic and can also be applied to non-Toffoli quantum circuits, e.g. BV algorithm. |
11:09 CET | SA3.4 | AI-BASED DETECTION OF DROPLETS AND BUBBLES IN DIGITAL MICROFLUIDIC BIOCHIPS Speaker: Luca Pezzarossa, TU Denmark, DK Authors: Jianan Xu, Wenjie Fan, Georgi Plamenov Tanev, Jan Madsen and Luca Pezzarossa, TU Denmark, DK Abstract Digital microfluidic biochips exploit the electrowetting-on-dielectric effect to move and manipulate microliter-sized liquid droplets on a planar surface. This technology has the potential to automate and miniaturize biochemical processes, but reliability is often an issue. The droplets may get temporarily stuck or gas bubbles may impede their movement, leading to a disruption of the process being executed. However, if the position and size of the droplets and bubbles are known at run-time, these undesired effects can be easily mitigated by the biochip control system. This paper presents an AI-based computer vision solution for real-time detection of droplets and bubbles in DMF biochips and its implementation that supports cloud-based deployment. The detection is based on the YOLOv5 framework in combination with custom pre- and post-processing techniques. The YOLOv5 neural network is trained using our own data set consisting of 5115 images. The solution is able to detect droplets and bubbles with real-time speed and high accuracy and to differentiate between them even in the extreme case where bubbles coexist with transparent droplets. |
11:12 CET | SA3.5 | SPLIT ADDITIVE MANUFACTURING FOR PRINTED NEUROMORPHIC CIRCUITS Speaker: Haibin Zhao, Karlsruhe Institute of Technology, CN Authors: Haibin Zhao1, Michael Hefenbrock2, Michael Beigl1 and Mehdi Tahoori1 1Karlsruhe Institute of Technology, DE; 2RevoAI, DE Abstract Printed and flexible electronics promises smart devices for application domains such as smart fast-moving consumer goods and medical wearables, which are generally untouchable by conventional rigid silicon technologies. This is due to their remarkable properties, such as flexibility, non-toxic materials, and low cost per area. Combined with neuromorphic computing, printed neuromorphic circuits pose an attractive solution for these application domains. Particularly, additive printing technologies can greatly reduce fabrication complexity and cost. On the one hand, high-throughput additive printing processes, such as roll-to-roll printing, can reduce the per-device fabrication time and cost. On the other hand, jet-printing can provide point-of-use customization at the expense of lower fabrication throughput. In this work, we propose a machine learning-based design framework that respects the objective and physical constraints of split additive manufacturing for printed neuromorphic circuits. With the proposed framework, multiple printed neural networks are trained jointly with the aim of sensibly combining multiple fabrication techniques (e.g., roll-to-roll and jet-printing). This should lead to cost-effective fabrication of multiple different printed neuromorphic circuits and achieve high fabrication throughput, lower cost, and point-of-use customization. |
11:15 CET | SA3.6 | PIMPR: PIM-BASED PERSONALIZED RECOMMENDATION WITH HETEROGENEOUS MEMORY HIERARCHY Speaker: Tao Yang, Shanghai Jiao Tong University, Shanghai, China, CN Authors: Tao Yang1, Hui Ma1, Yilong Zhao1, Fangxin Liu1, Zhezhi He1, Xiaoli Sun2 and Li Jiang1 1Shanghai Jiao Tong University, CN; 2Institute of Scientific and Technical Information of Zhejiang Province, CN Abstract Deep learning-based personalized recommendation models (DLRMs) are dominating AI tasks in data centers. The performance bottleneck of typical DLRMs mainly lies in the memory-bounded embedding layers. Resistive Random Access Memory (ReRAM)-based Processing-in-memory (PIM) architecture is a natural fit for DLRMs thanks to its in-situ computation and high computational density. However, two challenges remain before DLRMs can fully embrace PIM architectures: 1) The size of a DLRM's embedding tables can reach tens of GBs, far beyond the memory capacity of typical ReRAM chips. 2) The irregular sparsity conveyed in the embedding layers is difficult to exploit in a PIM architecture. In this paper, we present the first PIM-based DLRM accelerator named PIMPR. PIMPR has a heterogeneous memory hierarchy—ReRAM crossbar-based PIM modules serve as the computing caches with high computing parallelism, while DIMM modules are able to hold the entire embedding table—leveraging the data locality of DLRM's embedding layers. Moreover, we propose a runtime strategy to skip the useless calculation induced by the sparsity and an offline strategy to balance the workload of each ReRAM crossbar. Compared to the state-of-the-art DLRM accelerators SPACE and TRiM, PIMPR achieves on average 2.02× and 1.79× speedup and 5.6× and 5.1× energy reduction, respectively. |
11:18 CET | SA3.7 | FSL-HD: ACCELERATING FEW-SHOT LEARNING ON RERAM USING HYPERDIMENSIONAL COMPUTING Speaker: Weihong Xu, University of California, San Diego, CN Authors: Weihong Xu, Jaeyoung Kang and Tajana Rosing, University of California, San Diego, US Abstract Few-shot learning (FSL) is a promising meta-learning paradigm that trains classification models on the fly with a few training samples. However, existing FSL classifiers are either computationally expensive, or are not accurate enough. In this work, we propose an efficient in-memory FSL classifier, FSL-HD, based on hyperdimensional computing (HDC) that achieves state-of-the-art FSL accuracy and efficiency. We devise an HDC-based FSL framework with efficient HDC encoding and search to reduce high complexity caused by the large HDC dimensionality. Also, we design a scalable in-memory architecture to accelerate FSL-HD on ReRAM with distributed dataflow and organization that maximizes the data parallelism and hardware utilization. The evaluation shows that FSL-HD achieves 4.2% higher accuracy compared to other FSL classifiers. FSL-HD achieves 100−1000× better energy efficiency and 9−66× speedup over the CPU and GPU baselines. Moreover, FSL-HD is more accurate, scalable and 2.5× faster than the state-of-the-art ReRAM-based FSL design, SAPIENS, while requiring 85% less area. |
11:21 CET | SA3.8 | HD-I-IOT: HYPERDIMENSIONAL COMPUTING FOR RESILIENT INDUSTRIAL INTERNET OF THINGS ANALYTICS Speaker: Baris Aksanli, San Diego State University, TR Authors: Onat Gungor1, Tajana Rosing2 and Baris Aksanli3 1UCSD & SDSU, US; 2University of California, San Diego, US; 3San Diego State University, US Abstract Industrial Internet of Things (I-IoT) enables fully automated production systems by continuously monitoring devices and analyzing collected data. Machine learning (ML) methods are commonly utilized for data analytics in such systems. Cyberattacks are a grave threat to I-IoT as they can manipulate legitimate inputs, corrupting ML predictions and causing disruptions in the production systems. Hyperdimensional (HD) computing is a brain-inspired ML method that has been shown to be sufficiently accurate while being extremely robust, fast, and energy-efficient. In this work, we use non-linear encoding-based HD for intelligent fault diagnosis against different adversarial attacks. Our black-box adversarial attacks first train a substitute model and create perturbed test instances using this trained model. These examples are then transferred to the target models. The change in the classification accuracy is measured as the difference before and after the attacks. This change measures the resiliency of a learning method. Our experiments show that HD leads to a more resilient and lightweight learning solution than the state-of-the-art deep learning methods. HD has up to 67.5% higher resiliency compared to the state-of-the-art methods while being up to 25.1× faster to train. |
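For readers who want a concrete picture of the cumulative-distribution idea described in the SA3.2 entry above, the following minimal Python sketch compares the empirical CDF of a batch of readout samples against reference CDFs recorded for the two basis states and labels the batch with the closer state. The Gaussian readout model, sample sizes, and function names are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def empirical_cdf(samples, grid):
    """Fraction of samples at or below each grid point."""
    return np.searchsorted(np.sort(samples), grid, side="right") / len(samples)

def discriminate(measured, ref0, ref1):
    """Return 0 or 1 depending on which basis-state reference CDF is
    closer (in the max-distance sense) to the measured sample's CDF."""
    grid = np.linspace(min(measured.min(), ref0.min(), ref1.min()),
                       max(measured.max(), ref0.max(), ref1.max()), 200)
    cdf_m = empirical_cdf(measured, grid)
    d0 = np.max(np.abs(cdf_m - empirical_cdf(ref0, grid)))
    d1 = np.max(np.abs(cdf_m - empirical_cdf(ref1, grid)))
    return 0 if d0 <= d1 else 1

# Toy readout signals: Gaussian clouds for |0> and |1> (assumed model).
rng = np.random.default_rng(0)
ref0 = rng.normal(0.0, 1.0, 5000)   # calibration shots for |0>
ref1 = rng.normal(2.0, 1.0, 5000)   # calibration shots for |1>
sample = rng.normal(1.9, 1.0, 200)  # noisy shots from an unknown state
print(discriminate(sample, ref0, ref1))  # expected: 1
```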
Extended Abstracts
Time | Label | Presentation Title Authors |
---|---|---|
11:24 CET | SA3.9 | STAR: AN EFFICIENT SOFTMAX ENGINE FOR ATTENTION MODEL WITH RRAM CROSSBAR Speaker: Yifeng Zhai, Capital Normal University, CN Authors: Yifeng Zhai1, Bing Li1 and Bonan Yan2 1Capital Normal University, CN; 2Peking University, CN Abstract RRAM crossbars have been studied to construct in-memory accelerators for neural network applications due to their in-situ computing capability. However, prior RRAM-based accelerators show efficiency degradation when executing the popular attention models. We observed that the frequent softmax operations arise as the efficiency bottleneck and also are insensitive to computing precision. Thus, we propose STAR, which boosts the computing efficiency with an efficient RRAM-based softmax engine and a fine-grained global pipeline for the attention models. Specifically, STAR exploits the versatility and flexibility of RRAM crossbars to trade off the model accuracy and hardware efficiency. The experimental results evaluated on several datasets show that STAR achieves up to 30.63× and 1.31× computing efficiency improvements over the GPU and the state-of-the-art RRAM-based attention accelerators, respectively. |
11:24 CET | SA3.10 | VALUE-BASED REINFORCEMENT LEARNING USING EFFICIENT HYPERDIMENSIONAL COMPUTING Speaker: Yang Ni, University of California, Irvine, US Authors: Yang Ni1, Danny Abraham1, Mariam Issa1, Yeseong Kim2, Pietro Mercati3 and Mohsen Imani1 1University of California, Irvine, US; 2DGIST, KR; 3Intel Labs, US Abstract Reinforcement Learning (RL) has opened up new opportunities to solve a wide range of complex decision-making tasks. However, modern RL algorithms, e.g., Deep Q-Learning, are based on deep neural networks, resulting in high computational costs. In this paper, we propose QHD, an off-policy value-based Hyperdimensional RL, that mimics brain properties toward robust and real-time learning. QHD relies on a lightweight brain-inspired model to learn an optimal policy in an unknown environment. We first develop a novel mathematical foundation and encoding module that maps state-action space into high-dimensional space. We accordingly develop a hyperdimensional regression model to approximate the Q-value function. QHD-powered agent makes decisions by comparing Q-values of each possible action. QHD provides 34.6× speedup and significantly better quality of learning than deep RL algorithms. |
11:24 CET | SA3.11 | DROPDIM: INCORPORATING EFFICIENT UNCERTAINTY ESTIMATION INTO HYPERDIMENSIONAL COMPUTING Speaker: Yang Ni, University of California, Irvine, US Authors: Yang Ni1, Hanning Chen1, Prathyush Poduval2, Pietro Mercati3 and Mohsen Imani1 1University of California, Irvine, US; 2University of Maryland Baltimore County, US; 3Intel Labs, US Abstract Research in the field of brain-inspired HyperDimensional Computing (HDC) brings orders of magnitude speedup to both Machine Learning (ML) training and inference compared to deep learning counterparts. However, current HDC algorithms generally lack uncertainty estimation. On the other hand, existing solutions such as Bayesian Neural Networks are generally slow and lead to high energy consumption. This paper proposes a hyperdimensional Bayesian framework called DropDim, which enables uncertainty estimation for the HDC-based regression algorithm. The core of our framework is a specially designed HDC encoder that maps input features to the high-dimensional space with an extra layer of randomness, i.e., a small number of dimensions are randomly dropped for each input. Our key insight is that by using this encoder, DropDim implements Bayesian inference while maintaining the efficiency advantage of HDC. (A simplified sketch of the dimension-dropping idea follows this table.) |
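As referenced in the SA3.11 entry above, the dimension-dropping mechanism can be imitated in a few lines: encode an input into a hypervector, zero out a small random subset of dimensions on every pass, and read the spread of the resulting predictions as an uncertainty estimate. The random-projection encoder, the linear readout, and the variance-based uncertainty below are simplified assumptions chosen for illustration, not the DropDim framework itself.

```python
import numpy as np

D = 2000          # hyperdimensional size (assumed)
rng = np.random.default_rng(0)
proj = rng.choice([-1.0, 1.0], size=(D, 8))   # random projection encoder
readout = rng.normal(0.0, 0.05, size=D)       # stand-in regression model

def encode(x, drop=0.02):
    """Map features to HD space, randomly dropping a small fraction of dims."""
    hv = np.sign(proj @ x)
    mask = rng.random(D) >= drop
    return hv * mask

def predict_with_uncertainty(x, passes=30):
    """Repeat encoding with fresh random drops; spread of outputs ~ uncertainty."""
    preds = np.array([readout @ encode(x) for _ in range(passes)])
    return preds.mean(), preds.std()

x = rng.normal(size=8)
mean, std = predict_with_uncertainty(x)
print(f"prediction {mean:.3f} +/- {std:.3f}")
```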
SD3 Hardware accelerators and memory subsystems
Add this session to my calendar
Date: Tuesday, 18 April 2023
Time: 11:00 CET - 12:30 CET
Location / Room: Okapi Room 0.8.2
Session chair:
Valeria Bertacco, University of Michigan, US
11:00 CET until 11:24 CET: Pitches of regular papers
11:24 CET until 12:30 CET: Interactive technical presentations by the authors of regular papers and extended abstracts
Regular Papers
Time | Label | Presentation Title Authors |
---|---|---|
11:00 CET | SD3.1 | UVMMU: HARDWARE-OFFLOADED PAGE MIGRATION FOR HETEROGENEOUS COMPUTING Speaker: Jungrae Kim, Sungkyunkwan University, KR Authors: Jihun Park1, Donghun Jeong2 and Jungrae Kim2 1dept. of Artificial Intelligence, Sungkyunkwan University, KR; 2Sungkyunkwan University, KR Abstract In a heterogeneous computing system with multiple memories, placing data near its current processing unit and migrating data over time can significantly improve performance. GPU vendors have introduced Unified Memory (UM) to automate data migrations between CPU and GPU memories and support memory over-subscription. Although UM improves software programmability, it can incur high costs due to its software-based migration. We propose a novel architecture to offload the migration to hardware and minimize UM overheads. Unified Virtual Memory Management Unit (UVMMU) detects access to remote memories and migrates pages without software intervention. By replacing page faults and software handling with hardware offloading, UVMMU can reduce the page migration latency to a few μs. Our evaluation shows that UVMMU can achieve 1.59× and 2.40× speed-ups over the state-of-the-art UM solutions for no over-subscription and 150% over-subscription, respectively. |
11:03 CET | SD3.2 | ARRAYFLEX: A SYSTOLIC ARRAY ARCHITECTURE WITH CONFIGURABLE TRANSPARENT PIPELINING Speaker: Dionysios Filippas, Democritus University of Thrace, GR Authors: Christodoulos Peltekis1, Dionysios Filippas1, Giorgos Dimitrakopoulos1, Chrysostomos Nicopoulos2 and Dionisios Pnevmatikatos3 1Democritus University of Thrace, GR; 2University of Cyprus, CY; 3National TU Athens & ICCS, GR Abstract Convolutional Neural Networks (CNNs) are the state-of-the-art solution for many deep learning applications. For maximum scalability, their computation should combine high performance and energy efficiency. In practice, the convolutions of each CNN layer are mapped to a matrix multiplication that includes all input features and kernels of each layer and is computed using a systolic array. In this work, we focus on the design of a systolic array with a configurable pipeline, with the goal of selecting an optimal pipeline configuration for each CNN layer. The proposed systolic array, called ArrayFlex, can operate in normal or in shallow pipeline mode, thus balancing the execution time in cycles and the operating clock frequency. By selecting the appropriate pipeline configuration per CNN layer, ArrayFlex reduces the inference latency of state-of-the-art CNNs by 11%, on average, as compared to a traditional fixed-pipeline systolic array. Most importantly, this result is achieved while using 13%-23% less power for the same applications, thus offering a combined energy-delay-product efficiency between 1.4x and 1.8x. |
11:06 CET | SD3.3 | FASTRW: A DATAFLOW-EFFICIENT AND MEMORY-AWARE ACCELERATOR FOR GRAPH RANDOM WALK ON FPGAS Speaker: Fan Wu, KU Leuven, BE Authors: Yingxue Gao, Teng Wang, Lei Gong, Chao Wang, Xi Li and Xuehai Zhou, University of Science and Technology of China, CN Abstract Graph random walk (GRW) sampling is becoming increasingly important with the widespread popularity of graph applications. It involves some walkers that wander through the graph to capture the desirable properties and reduce the size of the original graph. However, previous research suffers from long sampling latency and severe memory access bottlenecks due to intrinsic data dependency and irregular vertex distribution. This paper proposes FastRW, a dedicated accelerator to release GRW acceleration on FPGAs. FastRW first schedules walkers' execution to address data dependency and mask long sampling latency. Then, FastRW leverages pipeline specialization and bit-level optimization to customize a processing engine with five modules and achieve a pipelining dataflow. Finally, to alleviate the differential accesses caused by irregular vertex distribution, FastRW implements a hybrid memory architecture to provide parallel access ports according to the vertex's degree. We evaluate FastRW with two classic GRW algorithms on a wide range of real-world graph datasets. The experimental results show that FastRW achieves a speedup of 14.13x on average over the system running on two 8-core Intel CPUs. FastRW also achieves 3.28x~198.24x energy efficiency over the architecture implemented on a V100 GPU. |
11:09 CET | SD3.4 | TWIN ECC: A DATA DUPLICATION BASED ECC FOR STRONG DRAM ERROR RESILIENCE Speaker: Hyeong Kon Bae, Korea University, KR Authors: Hyeong Kon Bae1, Myung Jae Chung1, Young-Ho Gong2 and Sung Woo Chung1 1Korea University, KR; 2Kwangwoon University, KR Abstract With the continuous scaling of process technology, DRAM reliability has become a critical challenge in modern memory systems. Currently, DRAM memory systems for servers employ ECC DIMMs with a single error correction and double error detection (SECDED) code. However, the SECDED code is insufficient to ensure DRAM reliability since memory systems become more susceptible to errors. Though various studies have proposed multi-bit correctable ECC schemes, such ECC schemes cause performance and/or storage overhead. To minimize performance degradation while providing strong error resilience, in this paper, we propose Twin ECC, a low-cost memory protection scheme based on data duplication. In a 512-bit data block, Twin ECC duplicates meaningful data into meaningless zeros. Since the ‘1'→‘0' error pattern is dominant in DRAM cells, Twin ECC provides strong error resilience by performing bitwise OR operations between the original meaningful data and the duplicated data. After the bitwise OR operations, Twin ECC adopts the SECDED code to further enhance data protection. Our evaluations show that Twin ECC reduces the system failure probability by 64.8%, 56.9%, and 49.5% on average when the portion of ‘1'→‘0' errors is 100%, 90%, and 80%, respectively, while causing only 0.7% performance overhead and no storage overhead compared to the baseline ECC DIMM with SECDED code. (An illustrative sketch of the duplication-and-OR idea follows this table.) |
11:12 CET | SD3.5 | AIDING TO MULTIMEDIA ACCELERATORS: A HARDWARE DESIGN FOR EFFICIENT ROUNDING OF BINARY FLOATING POINT NUMBERS Speaker: Urbi Chatterjee, IIT Kanpur, IN Authors: Mahendra Rathor, Vishesh Mishra and Urbi Chatterjee, IIT Kanpur, IN Abstract Hardware accelerators for multimedia applications such as JPEG image compression and video compression are quite popular due to their capability of enhancing overall performance and system throughput. The core of essentially all lossy compression techniques is the quantization process. In the quantization process, rounding is performed to obtain integer values for the compressed images and video frames. Recent studies in photo forensic research have revealed that direct rounding, e.g., rounding up or rounding down of floating point numbers, results in compression artifacts such as 'JPEG dimples'. Therefore, in the compression process, rounding to the nearest integer value is important, especially for High Dynamic Range (HDR) photography and videography. Since rounding to the nearest integer is a data-intensive process, its realization as dedicated hardware is imperative to enhance overall performance. This paper presents a novel high-performance hardware architecture for rounding binary floating point numbers to the nearest integer. Additionally, an optimized version of the basic hardware design is also proposed. The proposed optimized version provides a 6.7% reduction in area and a 7.4% reduction in power consumption in comparison to the proposed basic architecture. Furthermore, the integration of the proposed floating point rounding hardware with the design flow of the computing kernel of the compression processor is also discussed in the paper. The proposed rounding hardware architecture and the integrated design with the computing kernel of the compression process have been implemented on an Intel FPGA. The average resource overhead due to this integration is reported to be less than 1%. |
11:15 CET | SD3.6 | CRSPU: EXPLOIT COMMONALITY OF REGULAR SPARSITY TO SUPPORT VARIOUS CONVOLUTIONS ON SYSTOLIC ARRAYS Speaker: Jianchao Yang, College of Computer, National University of Defense Technology, CN Authors: Jianchao Yang, Mei Wen, Junzhong Shen, Yasong Cao, Minjin Tang, Renyu Yang, Xin Ju and Chunyuan Zhang, College of Computer, National University of Defense Technology, CN Abstract Dilated convolution (DCONV) and transposed convolution (TCONV) are involved in the training of GANs and CNNs and introduce numerous regular zero-spaces into the feature maps or kernels. Existing accelerators typically pre-reorganize the zero-spaces and then perform sparse computation to accelerate them, resulting in huge hardware resource overhead and control complexity. While the systolic array has proven advantages when it comes to accelerating convolutions, countermeasures for deploying DCONV and TCONV on systolic arrays are rarely proposed. Therefore, we opt to improve the traditional im2col algorithm to make full use of the regular sparsity and avoid data reorganization, thereby facilitating the use of systolic arrays in this context. Public Dimension Compression and Similar Sparsity Merging mechanisms are also designed to implement sparse computing, eliminating unnecessary computation caused by zero-spaces. We propose a systolic array-based processing unit, named CRSPU. Experiments show that CRSPU exhibits more competitive performance than the state-of-the-art baseline accelerator GANPU. Furthermore, CRSPU's ability to avoid zero-space data reorganization represents a huge advantage for bandwidth-unfriendly accelerators. |
11:18 CET | SD3.7 | CLAP: LOCALITY AWARE AND PARALLEL TRIANGLE COUNTING WITH CONTENT ADDRESSABLE MEMORY Speaker: Tianyu Fu, Tsinghua University, CN Authors: Tianyu Fu1, Chiyue Wei1, Zhenhua Zhu1, Shang Yang1, Zhongming Yu2, Guohao Dai3, Huazhong Yang1 and Yu Wang1 1Tsinghua University, CN; 2University of California, San Diego, US; 3Shanghai Jiao Tong University, CN Abstract Triangle counting (TC) is one of the most fundamental graph analysis tools with a wide range of applications. Modern triangle counting algorithms traverse the graph and perform set intersections of neighbor sets to find triangles. However, existing triangle counting approaches suffer from heavy off-chip memory access and set intersection overhead. Thus, we propose CLAP, the first content addressable memory (CAM) based triangle counting architecture with software and hardware co-optimizations. To reduce off-chip memory access and the number of set intersections, we propose the first force-based node index reorder method. It simultaneously optimizes both data locality and the computation amount. Compared with random node indices, the reorder method reduces the off-chip memory access and the set intersections by 61% and 64%, respectively, while providing 2.19× end-to-end speedup. To improve the set intersection parallelism, we propose the first CAM-based triangle counting architecture under chip area constraints. We enable highly parallel set intersection by translating it into a content search on CAM with full parallelism. Thus, the time complexity of the set intersection reduces from O(m+n) or O(n log m) to O(n). Extensive experiments on real-world graphs show that CLAP achieves 39×, 27×, and 78× speedup over state-of-the-art CPU, GPU, and processing-in-memory baselines, respectively. The software code is available at: https://github.com/thu-nics/CLAP-triangle-counting. |
11:21 CET | SD3.8 | ATOMIC BUT LAZY UPDATING WITH MEMORY-MAPPED FILES FOR PERSISTENT MEMORY Speaker: Qisheng Jiang, ShanghaiTech University, CN Authors: Qisheng Jiang, Lei Jia and Chundong Wang, ShanghaiTech University, CN Abstract Applications memory-map file data stored in the persistent memory and expect both high performance and failure atomicity. State-of-the-art NOVA and Libnvmmio guarantee failure atomicity but yield inferior performance. They enforce data staying fresh and intact at the mapped addresses by continually updating the data there, thereby incurring severe write amplifications. They also lack the adaptability to dynamic workloads and entail housekeeping overheads with complex designs. We hence propose Acumen with a group of reflection pages managed for a mapped file. Using a simplistic bitmap to track fine-grained data slices, Acumen makes a reflection page and a mapped file page pair to alternately carry updates to achieve failure atomicity. Only on receiving a read request will it deploy valid data from reflection pages into target mapped file pages. The cost of deployment is amortized over subsequent read requests. Experiments show that Acumen significantly outperforms NOVA and Libnvmmio with consistently higher performance in serving a variety of workloads. |
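As noted in the SD3.4 entry above, the duplication-and-OR idea can be made concrete with a short Python sketch. The word width, the error-injection model, and the helper names are assumptions chosen for illustration; the actual scheme operates on 512-bit DRAM data blocks and is followed by a SECDED code.

```python
def write_twin(word: int) -> tuple[int, int]:
    """Store the meaningful word and a duplicate in the otherwise-zero half."""
    return word, word  # original copy, duplicated copy

def inject_1_to_0_errors(word: int, flip_mask: int) -> int:
    """Model the dominant DRAM failure mode: set bits dropping to zero."""
    return word & ~flip_mask

def read_twin(copy_a: int, copy_b: int) -> int:
    """Bitwise OR recovers any bit that survived in at least one copy,
    because only 1->0 flips are assumed."""
    return copy_a | copy_b

data = 0b1011_0110
a, b = write_twin(data)
a = inject_1_to_0_errors(a, 0b0010_0000)  # different bits fail
b = inject_1_to_0_errors(b, 0b0000_0100)  # in each copy
assert read_twin(a, b) == data
print(bin(read_twin(a, b)))
```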
Extended Abstracts
Time | Label | Presentation Title Authors |
---|---|---|
11:24 CET | SD3.9 | OUT-OF-STEP PIPELINE FOR GATHER/SCATTER INSTRUCTIONS Speaker: Yi Ge, Fujitsu Limited, JP Authors: Yi Ge1, Katsuhiro Yoda1, Makiko Ito1, Toshiyuki Ichiba1, Takahide Yoshikawa1, Ryota Shioya2 and Masahiro Goshima3 1Fujitsu Limited, JP; 2University of Tokyo, JP; 3National Institute of Informatics, JP Abstract Wider SIMD units suffer from low scalability of gather/scatter instructions that appear in sparse matrix calculations. We address this problem with an out-of-step pipeline which tolerates bank conflicts of a multibank L1D by allowing element operations of SIMD instructions to proceed out of step with each other. We evaluated it with a sparse matrix-vector product kernel for matrices from HPCG and SuiteSparse Matrix Collection. The results show that, for the SIMD width of 1024 bit, it achieves 1.91 times improvement over a model of a conventional pipeline. |
11:24 CET | SD3.10 | MEMPOOL MEETS SYSTOLIC: FLEXIBLE SYSTOLIC COMPUTATION IN A LARGE SHARED-MEMORY PROCESSOR CLUSTER Speaker: Samuel Riedel, ETH Zurich, CH Authors: Samuel Riedel1, Gua Hao Khov1, Sergio Mazzola2, Matheus Cavalcante1, Renzo Andri3 and Luca Benini4 1ETH Zurich, CH; 2ETH Zürich, CH; 3Huawei Zurich Research Center, CH; 4ETH Zurich, CH | Università di Bologna, IT Abstract Systolic arrays and shared-memory manycore clusters are two widely used architectural templates that offer vastly different trade-offs. Systolic arrays achieve exceptional performance for workloads with regular dataflow at the cost of a rigid architecture and programming model. Shared-memory manycore systems are more flexible and easy to program, but data must be moved explicitly to/from cores. This work combines the best of both worlds by adding a systolic overlay to a general-purpose shared-memory manycore cluster allowing for efficient systolic execution while maintaining flexibility. We propose and implement two instruction set architecture extensions enabling native and automatic communication between cores through shared memory. Our hybrid approach allows configuring different systolic topologies at execution time and running hybrid systolic-shared-memory computations. The hybrid architecture's convolution kernel outperforms the optimized shared-memory one by 18%. |
11:24 CET | SD3.11 | NOVEL EFFICIENT SYNONYM HANDLING MECHANISM FOR VIRTUAL-REAL CACHE HIERARCHY Speaker: Varun Venkitaraman, IIT Bombay, IN Authors: Varun Venkitaraman, Ashok Sathyan, Shrihari Deshmukh and Virendra Singh, IIT Bombay, IN Abstract Optimizing L1 caches for latency is critical to improving the system's performance. Generally, virtually indexed physically tagged (VIPT) caches are the preferred L1 cache configuration because address translation and set indexing can be performed in parallel, resulting in reduced L1 cache access latency. However, an address translation is essential for every L1 cache access, and address translation contributes significantly to the system's total power consumption. To reduce the power consumed by address translation, virtually indexed virtually tagged (VIVT) caches appear to be an attractive alternative. However, VIVT caches are plagued with the issue of synonyms. Prior works introduce new hardware structures in the cache hierarchy to detect and resolve synonyms. Rather than adding extra hardware structures to the cache hierarchy, we propose a new cache hierarchy design that modifies the last-level cache's tag array to detect and resolve synonyms. Our proposed scheme enhances the system's performance by 22% on average and also reduces the dynamic energy consumption of the cache hierarchy by as much as 89%. |
11:24 CET | SD3.12 | TURBULENCE: COMPLEXITY-EFFECTIVE OUT-OF-ORDER EXECUTION ON GPU WITH DISTANCE-BASED ISA Speaker: Reoma Matsuo, University of Tokyo, JP Authors: Reoma Matsuo, Toru Koizumi, Hidetsugu Irie, Shuichi Sakai and Ryota Shioya, University of Tokyo, JP Abstract A graphic processing unit (GPU) is a processor that achieves high throughput by exploiting data parallelism. We found that many GPU workloads also contain instruction-level parallelism, which can be extracted through out-of-order execution to provide additional performance improvement opportunities. We propose the TURBULENCE architecture for very low-cost out-of-order execution on GPUs. TURBULENCE consists of 1) a novel ISA that introduces the concept of referencing operands by inter-instruction distance instead of register numbers and 2) a novel microarchitecture that executes the novel ISA. Our proposed ISA and microarchitecture enable cost-effective out-of-order execution on GPUs without introducing expensive hardware. |
SD4 Resource-aware computing
Add this session to my calendar
Date: Tuesday, 18 April 2023
Time: 11:00 CET - 12:30 CET
Location / Room: Okapi Room 0.8.3
Session chair:
William Fornaciari, Politecnico di Milano, IT
11:00 CET until 11:24 CET: Pitches of regular papers
11:24 CET until 12:30 CET: Interactive technical presentations by the authors of regular papers and extended abstracts
Regular Papers
Time | Label | Presentation Title Authors |
---|---|---|
11:00 CET | SD4.1 | EFFICIENT HYPERDIMENSIONAL LEARNING WITH TRAINABLE, QUANTIZABLE, AND HOLISTIC DATA REPRESENTATION Speaker: Jiseung Kim, DGIST, KR Authors: Jiseung Kim1, Hyunsei Lee1, Mohsen Imani2 and Yeseong Kim1 1DGIST, KR; 2University of California, Irvine, US Abstract Hyperdimensional computing (HDC) is a computing paradigm that draws inspiration from a human memory model. It represents data in the form of high-dimensional vectors. Recently, many works in the literature have tried to use HDC as a learning model due to its simple arithmetic and high efficiency. However, learning frameworks in HDC use encoders that are randomly generated and static, resulting in many parameters and low accuracy. In this paper, we propose TrainableHD, a framework for HDC that utilizes a dynamic encoder with effective quantization for higher efficiency. Our model considers errors gained from the HD model and dynamically updates the encoder during training. Our evaluations show that TrainableHD improves the accuracy of HDC by up to 22.26% (on average 3.62%) without any extra computation costs, achieving a level comparable to state-of-the-art deep learning. Also, the proposed solution is 56.4× faster and 73× more energy efficient than deep learning on an NVIDIA Jetson Xavier, a low-power GPU platform. |
11:03 CET | SD4.2 | SMART KNOWLEDGE TRANSFER-BASED RUNTIME POWER MANAGEMENT Speaker: Lin Chen, Hong Kong University of Science and Technology, HK Authors: Lin Chen1, Xiao Li1, Fan Jiang1, Chengeng Li1 and Jiang Xu2 1Hong Kong University of Science and Technology, HK; 2Hong Kong University of Science and Technology, CN Abstract As Moore's law slows down, computing systems must pivot towards higher energy efficiency to continue scaling performance. Reinforcement learning (RL) performs more adaptively than conventional methods in runtime power management under varied hardware configurations and varying software workloads. However, prior works on either model-free or model-based RL approaches face a non-negligible challenge: relearning the policies to adapt to the new environment is unacceptably time-consuming, especially when encountering significant variances in workloads or hardware configurations. Moreover, existing research on accelerating learning has focused on the speedup while largely ignoring the efficiency degradation of the results. In this paper, we present a smart transfer-enabled Q-learning (STQL) approach to boost the learning process and guarantee the learning efficiency through a contradiction checking mechanism, which evicts inappropriate transferred knowledge. Experiments on realistic applications show that the proposed method can speed up the learning process up to 2.3x and achieve a 6.2% energy-delay product (EDP) reduction compared to the state-of-the-art design. |
11:06 CET | SD4.4 | REDRAW: FAST AND EFFICIENT HARDWARE ACCELERATOR WITH REDUCED READS AND WRITES FOR 3D UNET Speaker: Tom Glint, IIT Gandhinagar, IN Authors: Tom Glint1, Manu Awasthi2 and Joycee Mekie1 1IIT Gandhinagar, IN; 2Ashoka University, IN Abstract Hardware accelerators (HAs) proposed so far have been designed with a focus on 2D convolution neural networks (CNNs) and 3D CNNs using temporal data. To the best of our knowledge, there is no existing HA for 3D CNNs using spatial data. 3D UNet is a 3D CNN with significant applications in the medical domain. However, the total on-chip buffer size (>20 MB) required for the complete stationary approach of processing 3D UNet is cost prohibitive. In this work, we analyze the 3D UNet workload and propose an HA with an optimized memory hierarchy with a total on-chip buffer of less than 4 MB, while incurring close to the theoretical minimum number of memory accesses required for processing 3D UNet. We demonstrate the efficiency of the proposed HA by comparing it with the SOTA Simba architecture with the same number of MAC units and show a 1.3x increase in TOPS/watt for an iso-area design. Further, we revise the proposed architecture to increase the ratio of compute operations to memory operations and to meet the latency requirement of 3D UNet-based embedded applications. The revised architecture, compared against a dual instance of Simba, has similar latency. Against the dual instance of Simba, the proposed architecture achieves a 1.8x increase in TOPS/watt in a similar area. |
11:09 CET | SD4.5 | TEMPERATURE-AWARE SIZING OF MULTI-CHIP MODULE ACCELERATORS FOR MULTI-DNN WORKLOADS Speaker: Prachi Shukla, Advanced Micro Devices, US Authors: Prachi Shukla1, Derrick Aguren2, Tom Burd3, Ayse Coskun1 and John Kalamatianos3 1Boston University, US; 2Advanced Micro Devices, US; 3AMD, US Abstract This paper demonstrates the need for temperature awareness in sizing accelerators to target multi-DNN workloads. To that end, we build TESA, a TEmperature-aware methodology that Sizes and places Accelerators to balance both the cost and power of a multi-chip module (MCM) including DRAM power for multi-deep neural network workloads. TESA tunes the accelerator chiplet size and inter-chiplet spacing to generate a temperature-aware MCM layout, subject to user-defined latency, area, power, and thermal constraints. Using TESA for both 2D and 3D systolic array-based chiplets, we demonstrate up to 44% MCM cost savings and 63% DRAM power savings, respectively, over a temperature-unaware baseline at iso-frequency and iso-interposer area. We also demonstrate a need for TESA to obtain feasible MCM configurations for multi-DNN workloads such as augmented/virtual reality (AR/VR). |
11:12 CET | SD4.6 | JUMPING SHIFT: A LOGARITHMIC QUANTIZATION METHOD FOR LOW-POWER CNN ACCELERATION Speaker: David Aledo, TU Delft, ES Authors: Longxing Jiang, David Aledo and Rene Leuken, TU Delft, NL Abstract Logarithmic quantization for Convolutional Neural Networks (CNNs): a) fits typical weight and activation distributions well, and b) allows the replacement of the multiplication operation by a shift operation that can be implemented with fewer hardware resources. We propose a new quantization method named Jumping Log Quantization (JLQ). The key idea of JLQ is to extend the quantization range by adding a coefficient parameter "s" in the power-of-two exponents, 2^(sx+i). This quantization strategy skips some values of the standard logarithmic quantization. In addition, we also develop a small hardware-friendly optimization called weight de-zeroing. Zero-valued weights, which cannot be performed by a single shift operation, are all replaced with logarithmic weights to further reduce hardware resources with almost no accuracy loss. To implement the Multiply-And-Accumulate (MAC) operation (needed to compute convolutions) when the weights are JLQ-ed and de-zeroed, a new Processing Element (PE) has been developed. This new PE uses a modified barrel shifter that can efficiently avoid the skipped values. Resource utilization, area, and power consumption of the new PE standing alone are reported. We have found that JLQ performs better than other state-of-the-art logarithmic quantization methods when the bit width of the operands becomes very small. (A small numerical sketch of jumping-log quantization follows this table.) |
11:15 CET | SD4.7 | THERMAL MANAGEMENT FOR S-NUCA MANY-CORES VIA SYNCHRONOUS THREAD ROTATIONS Speaker: Yixian Shen, University of Amsterdam, NL Authors: Yixian Shen, Sobhan Niknam, Anuj Pathania and Andy Pimentel, University of Amsterdam, NL Abstract On-chip thermal management is quintessential to a thermally safe operation of a many-core processor. The presence of a physically-distributed logically-shared Last-Level Cache (LLC) significantly reduces the performance penalty of migrating threads within the cores of an S-NUCA many-core. This cost reduction allows novel thermal management of these many-cores via synchronous thread migration. Synchronous thread migration provides a viable alternative to Dynamic Voltage and Frequency Scaling (DVFS) and asynchronous thread migration used traditionally to manage thermals of S-NUCA many-cores. We present a theoretical method to compute the peak temperature in many-cores with synchronous thread migrations. We use the method to create a thermal management heuristic called HotPotato that maximizes the performance of S-NUCA many-cores under a peak temperature constraint. We implement HotPotato within the state-of-the-art HotSniper simulator. Detailed interval thermal simulations with HotSniper show an average 10.72% improvement in response time of S-NUCA many-cores when scheduling with HotPotato compared to a state-of-the-art thermal-aware S-NUCA scheduler. |
11:18 CET | SD4.8 | PROTEUS: HLS-BASED NOC GENERATOR AND SIMULATOR Speaker: Abhimanyu Rajeshkumar Bambhaniya, Georgia Tech, US Authors: Abhimanyu Rajeshkumar Bambhaniya, Yangyu Chen, FNU Anshuman, Rohan Banerjee and Tushar Krishna, Georgia Tech, US Abstract Networks-on-chip (NoCs) form the backbone fabric for connecting multi-core SoCs containing several processor cores and memories. Design-space exploration (DSE) of NoCs is a crucial part of the SoC design process to ensure that the NoC does not become a bottleneck. DSE today is often hindered by the inherent trade-off between software simulation and hardware emulation/evaluation. Software simulators are easily extendable and allow for evaluation of new ideas but are not able to capture the hardware complexity. Meanwhile, RTL development is known to be time-consuming. This has forced DSE to use simulators followed by RTL development, evaluation and feedback, which slows down the overall design process. In an effort to tackle this problem, we present Proteus, a configurable and modular NoC simulator and RTL generator. Proteus is the first framework of its kind to use an HLS compiler to develop NoCs from a C++ description of the NoC circuit. These generated NoCs can be simulated in software and tested on FPGAs. This allows users to do rapid DSE by providing the opportunity to tweak and test NoC architectures in real time. We also compare Proteus-generated RTL with Chisel-generated and hand-written RTL in terms of area, timing and productivity. The ability to synthesize the NoC design on FPGAs can benefit large designs, as the custom hardware results in faster run-time than cycle-accurate software simulators. Proteus is modeled similarly to existing state-of-the-art simulators and offers users modifiable parameters to generate custom topologies, routing algorithms, and router microarchitectures. |
11:21 CET | SD4.9 | MOELA: A MULTI-OBJECTIVE EVOLUTIONARY/LEARNING DESIGN SPACE EXPLORATION FRAMEWORK FOR 3D HETEROGENEOUS MANYCORE PLATFORMS Speaker: Sudeep Pasricha, CSU, US Authors: Sirui Qi1, Yingheng Li2, Sudeep Pasricha1 and Ryan Kim1 1Colorado State University, US; 2University of Pittsburgh, US Abstract To enable emerging applications such as deep machine learning and graph processing, 3D network-on-chip (NoC) enabled heterogeneous manycore platforms that can integrate many processing elements (PEs) are needed. However, designing such complex systems with multiple objectives can be challenging due to the huge associated design space and long evaluation times. To optimize such systems, we propose a new multi-objective design space exploration framework called MOELA that combines the benefits of evolutionary-based search with a learning-based local search to quickly determine PE and communication link placement to optimize multiple objectives (e.g., latency, throughput, and energy) in 3D NoC enabled heterogeneous manycore systems. Compared to state-of-the-art approaches, MOELA increases the speed of finding solutions by up to 128x, leads to a better Pareto Hypervolume (PHV) by up to 12.14x and improves energy-delay-product (EDP) by up to 7.7% in a 5-objective scenario. |
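As referenced in the SD4.6 entry above, jumping-log quantization can be sketched numerically: with a stride parameter s (and offset i), only magnitudes of the form 2^(sx+i) are representable, so a weight is snapped to the nearest such power of two and each multiplication becomes a single shift. The level range, rounding rule, and parameter values below are illustrative assumptions rather than the paper's exact scheme.

```python
import numpy as np

def jlq_levels(s: int, i: int, num_levels: int):
    """Representable magnitudes 2^(s*x + i) for x = 0..num_levels-1 (assumed form)."""
    return np.array([2.0 ** (s * x + i) for x in range(num_levels)])

def jlq_quantize(w: float, levels: np.ndarray) -> float:
    """Snap a weight to the nearest jumping-log level, keeping its sign."""
    if w == 0.0:
        return 0.0
    mag = abs(w)
    return np.sign(w) * levels[np.argmin(np.abs(levels - mag))]

levels = jlq_levels(s=2, i=-6, num_levels=4)   # 2^-6, 2^-4, 2^-2, 2^0
for w in [0.3, -0.02, 0.9]:
    q = jlq_quantize(w, levels)
    print(f"{w:+.3f} -> {q:+.5f}")  # e.g. +0.300 -> +0.25000 (a single shift)
```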
Extended Abstracts
Time | Label | Presentation Title Authors |
---|---|---|
11:24 CET | SD4.10 | DEVELOPING AN ULTRA-LOW POWER RISC-V PROCESSOR FOR ANOMALY DETECTION Speaker: Jina Park, Chung-Ang University, KR Authors: Jina Park1, Eunjin Choi1, Kyungwon Lee1, Jae-Jin Lee2, Kyuseung Han2 and Woojoo Lee1 1Chung-Ang University, KR; 2ETRI, KR Abstract With a focus on anomaly detection, a representative application in healthcare, this paper develops an ultra-low power processor for wearable devices. First, this paper proposes a processor architecture that divides the design into a part for general applications running on wearable devices (day part) and a part that performs anomaly detection by analyzing sensor data (night part), where each part operates completely independently. This day-night architecture allows the day part, which contains the power-hungry main CPU and system interconnect, to be turned off most of the time except for intermittent work, while the night part, which consists only of the sub-CPU and minimal IPs, can run all the time with low power. Next, this paper designs an ultra-lightweight all-night core based on a subset of RV32I optimized for anomaly detection applications, and completes the development of an ultra-low power processor by introducing it to the sub-CPU of the proposed architecture. Finally, by prototyping the proposed processor and developing an anomaly detection application that runs on the processor prototype, this paper demonstrates the power savings of the proposed processor technology along with its design validation. |
11:24 CET | SD4.11 | EXTENDED ABSTRACT: MONITORING-BASED THERMAL MANAGEMENT FOR MIXED-CRITICALITY SYSTEMS Speaker: Marcel Mettler, TU Munich, DE Authors: Marcel Mettler1, Martin Rapp2, Heba Khdr2, Daniel Mueller-Gritschneder1, Joerg Henkel2 and Ulf Schlichtmann1 1TU Munich, DE; 2Karlsruhe Institute of Technology, DE Abstract With a rapidly growing number of functions in embedded real-time systems, it becomes inevitable to integrate tasks of different safety integrity levels (SILs) into one mixed-criticality system. Here, it is important to not only isolate shared architectural resources, as tasks executing on different cores may also interfere via the processor's thermal manager. In order to prevent a scenario where best-effort tasks cause deadline violations for critical tasks, we propose a thermal management strategy that guarantees a sufficient thermal isolation between tasks of different SILs, and simultaneously reduces the run-time of best-effort tasks by up to 45% compared to the state of the art without incurring any real-time violations for critical tasks. |
11:24 CET | SD4.12 | A LIGHTWEIGHT CONGESTION CONTROL TECHNIQUE FOR NOCS WITH DEFLECTION ROUTING Speaker: Shruti Yadav Narayana, University of Wisconsin-Madison, US Authors: Shruti Yadav Narayana1, Sumit Mandal2, Raid Ayoub3, Michael Kishinevsky4 and Umit Ogras5 1University of Wisconsin-Madison, US; 2Indian Institute of Science, IN; 3Intel Corporation, US; 4Intel Corporation, US; 5University of Wisconsin - Madison, US Abstract Network-on-Chip (NoC) congestion builds up during heavy traffic load and cripples the system performance by stalling the cores. Moreover, congestion leads to wasted link bandwidth due to blocked buffers and bouncing packets. Existing approaches throttle the cores after congestion is detected, leading to a highly congested NoC and stalled cores. In contrast, we propose a lightweight machine learning-based technique that helps predict congestion in the network. Specifically, our proposed technique collects features related to traffic at each sink. Then, it labels the features using a novel time-reversal approach. The labeled data is used to design a low-overhead and explainable decision tree model used for runtime congestion control. Experimental evaluations with synthetic and real traffic on an industrial 6×6 NoC show that the proposed approach increases fairness and memory read bandwidth by up to 59% with respect to a state-of-the-art congestion control technique. |
SE2 Modelling, verification and timing analysis of cyber-physical systems
Add this session to my calendar
Date: Tuesday, 18 April 2023
Time: 11:00 CET - 12:30 CET
Location / Room: Gorilla Room 1.5.1
Session chair:
Martin Horauer, University of Applied Sciences Technikum Wien, AT
11:00 CET until 11:24 CET: Pitches of regular papers
11:24 CET until 12:30 CET: Interactive technical presentations by the authors of regular papers and extended abstracts
Regular Papers
Time | Label | Presentation Title Authors |
---|---|---|
11:00 CET | SE2.1 | IMPACTTRACER: ROOT CAUSE LOCALIZATION IN MICROSERVICES BASED ON FAULT PROPAGATION MODELING Speaker: Jiazhi Jiang, Sun Yat-sen University, CN Authors: Ru Xie1, Jing Yang2, Jingying Li2 and Liming Wang2 1Institute of Information Engineering, CAS; University of Chinese Academy of Sciences, CN; 2Institute of Information Engineering, CAS, CN Abstract Microservice architecture is embraced by a growing number of enterprises due to the benefits of modularity and flexibility. However, being composed of numerous interdependent microservices, it is prone to cascading failures and afflicted by the arising problem of troubleshooting, which entails arduous efforts to identify the root cause node and ensure service availability. Previous works use the call graph to characterize causality relationships of microservices, but not completely or comprehensively, leading to an insufficient search of potential root cause nodes and consequently poor accuracy in culprit localization. In this paper, we propose ImpactTracer to address the above problems. ImpactTracer builds an impact graph to provide a complete view of fault propagation in microservices and uses a novel backward tracing algorithm that exhaustively traverses the impact graph to identify the root cause node accurately. Extensive experiments on a real-world dataset demonstrate that ImpactTracer is effective in identifying the root cause node and outperforms the state-of-the-art methods by at least 72%, significantly facilitating troubleshooting in microservices. |
11:03 CET | SE2.2 | PUMPCHANNEL: AN EFFICIENT AND SECURE COMMUNICATION CHANNEL FOR TRUSTED EXECUTION ENVIRONMENT ON ARM-FPGA EMBEDDED SOC Speaker: Jingquan Ge, Nanyang Technological University, CN Authors: Jingquan Ge, Yuekang Li, Yang Liu, Yaowen Zheng, Yi Liu and Lida Zhao, Nanyang Technological University, SG Abstract ARM TrustZone separates the system into the rich execution environment (REE) and the trusted execution environment (TEE). Data can be exchanged between REE and TEE through the communication channel, which is based on shared memory and can be accessed by both REE and TEE. Therefore, when the REE OS kernel is untrusted, the security of the communication channel cannot be guaranteed. The proposed schemes to protect the communication channel have high performance overhead and are not secure enough. In this paper, we propose PumpChannel, an efficient and secure communication channel implemented on ARM-FPGA embedded SoC. PumpChannel avoids the use of secret keys, but utilizes a hardware and software collaborative pump to enhance the security and performance of the communication channel. Besides, PumpChannel implements a hardware-based hook integrity monitor to ensure the integrity of all hook codes. Security and performance evaluation results show that PumpChannel is more secure than the encrypted channel countermeasures and has better performance than all other schemes. |
11:06 CET | SE2.3 | ON THE DEGREE OF PARALLELISM IN REAL-TIME SCHEDULING OF DAG TASKS Speaker: Qingqiang He, The Hong Kong Polytechnic University, CN Authors: Qingqiang He1, Nan Guan2, Mingsong Lv1 and Zonghua Gu3 1The Hong Kong Polytechnic University, HK; 2City University of Hong Kong, HK; 3Umeå University, SE Abstract Real-time scheduling and analysis of parallel tasks modeled as directed acyclic graphs (DAG) have been intensively studied in recent years. The degree of parallelism of DAG tasks is an important characterization in scheduling. This paper revisits the definition and the computing algorithms for the degree of parallelism of DAG tasks, and clarifies some misunderstandings regarding the degree of parallelism which exist in the real-time literature. Based on the degree of parallelism, we propose a real-time scheduling approach for DAG tasks, which is quite simple but rather effective and outperforms the state of the art by a considerable margin. |
11:09 CET | SE2.4 | TIMING PREDICTABILITY FOR SOME/IP-BASED SERVICE-ORIENTED AUTOMOTIVE IN-VEHICLE NETWORKS Speaker: Enrico Fraccaroli, University of North Carolina at Chapel Hill, US Authors: Enrico Fraccaroli1, Prachi Joshi2, Shengjie Xu1, Khaja Shazzad2, Markus Jochim2 and Samarjit Chakraborty3 1University of North Carolina at Chapel Hill, US; 2General Motors, R&D, US; 3UNC Chapel Hill, US Abstract In-vehicle network architectures are evolving from a typical signal-based client-server paradigm to a service-oriented one, introducing flexibility for software updates and upgrades. While signal-based networks are static by nature, service-oriented ones can more easily evolve during and after the design phase. As a result, service-oriented protocols are spreading through several layers of in-vehicle networks. While applications like infotainment are less sensitive to delays, others like sensing and control have more stringent timing and reliability requirements. Hence, wider adoption of service-oriented protocols requires timing analyzability and predictability problems to be addressed, which are more challenging than in their signal-oriented counterparts. In service-oriented architectures, the discovery phase defines how clients find their required services. The time required to complete the discovery phase is an important parameter since it determines the readiness of a sub-system or even the vehicle. In this paper, we develop a formal timing analysis of the discovery phase of SOME/IP, an emerging service-oriented protocol considered for adoption by automotive OEMs and suppliers. |
11:12 CET | SE2.5 | ANALYSIS AND OPTIMIZATION OF WORST-CASE TIME DISPARITY IN CAUSE-EFFECT CHAINS Speaker: Xiantong Luo, Northeastern University, CN Authors: Xu Jiang1, Xiantong Luo1, Nan Guan2, Zheng Dong3, Shao-Shan Liu4 and Wang Yi5 1Northeastern University, CN; 2City University of Hong Kong, HK; 3Wayne State University, US; 4BeyonCa, CN; 5Uppsala University, SE Abstract In automotive systems, an important timing requirement is that the time disparity (the maximum difference among the timestamps of all raw data produced by sensors that an output originates from) must be bounded in a certain range, so that information from different sensors can be correctly synchronized and fused. In this paper, we study the problem of analyzing the worst-case time disparity in cause-effect chains. In particular, we present two bounds, where the first one assumes all chains are independent from each other and the second one takes the fork-join structures into consideration to perform a more precise analysis. Moreover, we propose a solution to cut down the worst-case time disparity for a task by designing buffers with proper sizes. Experiments are conducted to show the correctness and effectiveness of both our analysis and optimization methods. (A small sketch illustrating the time disparity computation follows this table.) |
11:15 CET | SE2.6 | DATA FRESHNESS OPTIMIZATION ON NETWORKED INTERMITTENT SYSTEMS Speaker: Wen Sheng Lim, Institute of Information Science, Academia Sinica, MY Authors: Hao-Jan Huang1, Wen Sheng Lim2, Chia-Heng Tu1, Chun-Feng Wu3 and Yuan-Hao Chang4 1National Cheng Kung University, TW; 2National Taiwan University (NTU), TW; 3National Yang Ming Chiao Tung University, TW; 4Academia Sinica, TW Abstract A networked intermittent system (NIS) is often deployed in the field for environmental monitoring, where sink nodes are responsible for relaying the data captured by sensors to a central system. To evaluate the quality of the captured monitoring data, Age of Information (AoI) is adopted to quantify the freshness of the data received by the central server. As the sink nodes are powered by ambient energy sources (e.g., solar and wind), an energy-efficient design of the sink nodes is crucial in order to improve the system-wide AoI. This work proposes an energy-efficient sink node design to save energy and extend system uptime. We devise an AoI-aware data forwarding algorithm based on the branch-and-bound (B&B) paradigm for deriving the optimal solution offline. In addition, an AoI-aware data forwarding algorithm is developed to approximate the optimal solution during runtime. The experimental results show that our solution can greatly improve the average data freshness by 148% against existing well-known strategies and achieves 91% of the performance of the optimal solution. Compared with the state-of-the-art algorithm, our energy-efficient design can deliver better A^3oI results by up to 9.6%. |
11:18 CET | SE2.7 | A SAFETY-GUARANTEED FRAMEWORK FOR NEURAL-NETWORK-BASED PLANNERS IN CONNECTED VEHICLES UNDER COMMUNICATION DISTURBANCE Speaker: Kevin Kai-Chun Chang, National Taiwan University, TW Authors: Kevin Kai-Chun Chang1, Xiangguo Liu2, Chung-Wei Lin1, Chao Huang3 and Qi Zhu2 1National Taiwan University, TW; 2Northwestern University, US; 3University of Liverpool, GB Abstract Neural-network-based (NN-based) planners have been increasingly used to enhance the performance of planning for autonomous vehicles. However, it is often difficult for NN-based planners to balance efficiency and safety in complicated scenarios, especially under real-world communication disturbance. To tackle this challenge, we present a safety-guaranteed framework for NN-based planners in connected vehicle environments with communication disturbance. Given any NN-based planner with no safety-guarantee, the framework generates a robust compound planner embedding the NN-based planner to ensure overall system safety. Moreover, with the aid of an information filter for imperfect communication and an aggressive approach for the estimation of the unsafe set, the compound planner could achieve similar or better efficiency than the given NN-based planner. A comprehensive case study of unprotected left turn and extensive simulations demonstrate the effectiveness of our framework. |
11:21 CET | SE2.8 | CO-DESIGN OF TOPOLOGY, SCHEDULING, AND PATH PLANNING IN AUTOMATED WAREHOUSES Speaker: Michele Lora, Università di Verona, IT Authors: Christopher Leet1, Chanwook Oh1, Michele Lora2, Sven Koenig1 and Pierluigi Nuzzo1 1University of Southern California, US; 2Università di Verona, IT Abstract We address the warehouse servicing problem (WSP) in automated warehouses, which use teams of mobile agents to bring products from shelves to packing stations. Given a list of products, the WSP amounts to finding a plan for a team of agents which brings every product on the list to a station within a given timeframe. The WSP consists of four subproblems, concerning what tasks to perform (task formulation), who will perform them (task allocation), and when (scheduling) and how (path planning) to perform them. These subproblems are NP-hard individually and are made more challenging by their interdependence. The difficulty of the WSP is compounded by the scale of automated warehouses, which frequently use teams of hundreds of agents. In this paper, we present a methodology that can solve the WSP at such scales. We introduce a novel, contract-based design framework which decomposes an automated warehouse into traffic system components. By assigning each of these components a contract describing the traffic flows it can support, we can synthesize a traffic flow satisfying a given WSP instance. Componentwise search-based path planning is then used to transform this traffic flow into a plan for discrete agents in a modular way. Evaluation shows that this methodology can solve WSP instances on real automated warehouses. |
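As referenced in the SE2.5 entry above, time disparity has a simple operational reading: for one fused output, it is the spread of the timestamps of the raw sensor samples that the output originates from, and the worst case is the maximum of that spread over all outputs. The toy trace and function names below are assumptions used only to make the definition concrete; the paper's contribution is bounding this quantity analytically, not measuring it.

```python
def time_disparity(origin_timestamps):
    """Max difference among the timestamps of all raw sensor samples
    that a single output originates from."""
    return max(origin_timestamps) - min(origin_timestamps)

def worst_case_over_trace(outputs):
    """Worst observed disparity over a trace of fused outputs."""
    return max(time_disparity(ts) for ts in outputs)

# One entry per fused output: timestamps (ms) of the camera/lidar/radar
# samples it was derived from (illustrative numbers).
trace = [
    [100.0, 96.5, 98.2],
    [120.0, 118.7, 111.4],
    [140.0, 139.1, 138.0],
]
print(worst_case_over_trace(trace))  # 8.6 ms, from the second output
```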
Extended Abstracts
Time | Label | Presentation Title Authors |
---|---|---|
11:24 CET | SE2.9 | POLYGLOT MODAL MODELS THROUGH LINGUA FRANCA Speaker: Alexander Schulz-Rosengarten, Kiel University, DE Authors: Alexander Schulz-Rosengarten1, Reinhard von Hanxleden1, Marten Lohstroh2, Soroush Bateni3 and Edward Lee2 1Dept. of Computer Science, Kiel University, DE; 2University of California, Berkeley, US; 3University of Texas at Dallas, US Abstract Complex software systems often feature distinct modes of operation, each designed to handle a particular scenario that may require the system to respond in a certain way. Breaking down system behavior into mutually exclusive modes and discrete transitions between modes is a commonly used strategy to reduce implementation complexity and promote code readability. The work in this paper aims to bring the advantages of working with modal models to mainstream programming languages, by following the polyglot coordination approach of Lingua Franca (LF), in which verbatim target code (e. g., C, C++, Python, Typescript, or Rust) is encapsulated in composable reactive components called reactors. Reactors can form a dataflow network, are triggered by timed as well as sporadic events, execute concurrently, and can be distributed across nodes on a network. With modal models in LF, we introduce a lean extension to the concept of reactors that enables the coordination of reactive tasks based on modes of operation. |
11:24 CET | SE2.10 | DEL: DYNAMIC SYMBOLIC EXECUTION-BASED LIFTER FOR ENHANCED LOW-LEVEL INTERMEDIATE REPRESENTATION Speaker: Hany Abdelmaksoud, German Aerospace Center (DLR), DE Authors: Hany Abdelmaksoud1, Zain A. H. Hammadeh1, Goerschwin Fey2 and Daniel Luedtke1 1German Aerospace Center (DLR), DE; 2TU Hamburg, DE Abstract This work develops an approach that lifts binaries into an enhanced LLVM Intermediate Representation (IR) including indirect jumps. The proposed lifter combines both static and dynamic methods and strives to fully recover the Control-Flow Graph (CFG) of a program. Using Satisfiability Modulo Theories (SMT) supported by memory and register models, our lifter dynamically symbolically executes IR instructions after translating them into SMT expressions. |
11:24 CET | SE2.11 | WCET ANALYSIS OF SHARED CACHES IN MULTI-CORE ARCHITECTURES USING EVENT-ARRIVAL CURVES Speaker: Thilo L. Fischer, TU Hamburg, DE Authors: Thilo Fischer and Heiko Falk, TU Hamburg, DE Abstract We propose a novel analysis approach for shared LRU caches to classify accesses as definitive cache hits or potential misses. In this approach inter-core cache interference is modelled as an event stream. Thus, by analyzing the timing between subsequent accesses to a particular cache block, it is possible to bound the inter-core interference. This perspective allows us to classify accesses as cache hits or potential misses using a data-flow analysis. We compare the performance of the presented approach to a partitioning of the shared cache. |
11:24 CET | SE2.12 | RESOURCE OPTIMIZATION WITH 5G CONFIGURED GRANT SCHEDULING FOR REAL-TIME APPLICATIONS Speaker: Yungang Pan, Linköping University, SE Authors: Yungang Pan1, Rouhollah Mahfouzi1, Soheil Samii2, Petru Eles2 and Zebo Peng2 1Linköping University, SE; 2Linköping University, SE Abstract 5G is expected to support ultra-reliable low latency communication to enable real-time applications such as industrial automation and control. 5G configured grant (CG) scheduling features a pre-allocated periodicity-based scheduling approach which reduces control signaling time and guarantees service quality. Although this enables 5G to support hard real-time periodic traffic, efficiently synthesizing the schedule and achieving high resource efficiency while serving multiple traffic flows is still an open problem. To address this problem, we first formulate it using satisfiability modulo theories (SMT) so that an SMT-solver can be used to generate optimal solutions. For enhancing scalability, two efficient heuristic approaches are proposed. The experiments demonstrate the effectiveness and scalability of the proposed technique. |
11:24 CET | SE2.13 | MOTIVATING AGENT-BASED LEARNING FOR BOUNDING TIME IN MIXED-CRITICALITY SYSTEMS Speaker: Behnaz Ranjbar, TU Dresden, DE Authors: Behnaz Ranjbar, Ali Hosseinghorban and Akash Kumar, TU Dresden, DE Abstract In Mixed-Criticality (MC) systems, the high Worst-Case Execution Time (WCET) of a task is a pessimistic bound, the maximum execution time of the task under all circumstances, while the low WCET should be close to the actual execution time of most instances of the task to improve utilization and Quality-of-Service (QoS). Most MC systems consider a static low WCET for each task, which cannot adapt to dynamism at run-time. In this regard, we consider the run-time behavior of tasks and propose a learning-based approach that dynamically monitors the tasks' execution times and adapts the low WCETs to determine the ideal trade-off between mode-switches, utilization, and QoS. Based on our observations on running embedded real-time benchmarks on a real platform, the proposed scheme reduces the utilization waste by 47.2%, on average, compared to state-of-the-art works. |
SpD2 Special Day on Human AI-Interaction: AI – potential, limitations and ethical aspects
Add this session to my calendar
Date: Tuesday, 18 April 2023
Time: 11:00 CET - 12:30 CET
Location / Room: Darwin Hall
Time | Label | Presentation Title Authors |
---|---|---|
11:00 CET | SpD2.1 | THE EMERGENCE OF HUMAN-LIKE AI Presenter: Dave Raggett, ERCIM, GB Author: Dave Raggett, ERCIM, GB Abstract Human-like AI mimics human perception, reasoning, learning and action, combining advances in the cognitive sciences and rapid progress with artificial neural networks. Neurosymbolic approaches seek to combine the respective strengths of neural networks and symbolic AI in support of human-machine collaboration, using argumentation in place of formal logic and proof. This talk will explore the challenges for evolving today's large language models into practical cognitive agents, along with complementary opportunities for cognitive databases that embrace the realisation that most knowledge is uncertain, imprecise, incomplete and inconsistent. Recent advances with large language models have been pretty amazing, but have some significant drawbacks including their huge size, their carbon footprint, a tendency to stray from the facts, a lack of provenance and no support for continual learning, as well as issues around bias. What is needed to enable practical cognitive agents that can be trained and executed on modest hardware? What is the future for symbolic approaches given just how far neural approaches have improved? How will this affect the ways we develop software systems? |
11:30 CET | SpD2.2 | AI ETHICS: FROM ENGINEERING TO REGULATION Speaker and Author: Laurynas Adomaitis, CEA-Saclay, FR Abstract We'll review the field of AI ethics from the engineering point of view focusing on five core sources of ethical tension and using recent insights from the Horizon-Europe TechEthos project. We'll then discuss some of the salient points in the forthcoming European regulation of AI systems ("AI Act"), including a recently added article on generative AI and ChatGPT. |
12:00 CET | SpD2.3 | WHAT HUMAN-AI INTERACTION WILL BRING IN THE FIELDS COVERED BY DATE? Speaker: animated by Prof. Marina Zapater and Dr. Marc Duranton, HES-SO - CEA, BE Author: All participants, DATE, BE Abstract Panel session/round table: "What Human-AI interaction will bring in the fields covered by DATE?" |
ST2 Test methods and dependability
Add this session to my calendar
Date: Tuesday, 18 April 2023
Time: 11:00 CET - 12:30 CET
Location / Room: Marble Hall
Session chair:
Görschwin Fey, TU Hamburg, DE
11:00 CET until 11:21 CET: Pitches of regular papers
11:21 CET until 12:30 CET: Interactive technical presentations by the authors of regular papers and extended abstracts
Regular Papers
Time | Label | Presentation Title Authors |
---|---|---|
11:00 CET | ST2.1 | IMPROVING RELIABILITY OF SPIKING NEURAL NETWORKS THROUGH FAULT AWARE THRESHOLD VOLTAGE OPTIMIZATION Speaker: Ayesha Siddique, University of Missouri-Columbia, US Authors: Ayesha Siddique and Khaza Anuarul Hoque, University of Missouri, US Abstract Spiking neural networks have made breakthroughs in computer vision by lending themselves to neuromorphic hardware. However, the neuromorphic hardware lacks parallelism and hence, limits the throughput and hardware acceleration of SNNs on edge devices. To address this problem, many systolic-array SNN accelerators (systolicSNNs) have been proposed recently, but their reliability is still a major concern. In this paper, we first extensively analyze the impact of permanent faults on the SystolicSNNs. Then, we present a novel fault mitigation method, i.e., fault-aware threshold voltage optimization in retraining (FalVolt). FalVolt optimizes the threshold voltage for each layer in retraining to achieve the classification accuracy close to the baseline in the presence of faults. To demonstrate the effectiveness of our proposed mitigation, we classify both static (i.e., MNIST) and neuromorphic datasets (i.e., N-MNIST and DVS Gesture) on a 256x256 systolicSNN with stuck-at faults. We empirically show that the classification accuracy of a systolicSNN drops significantly even at extremely low fault rates (as low as 0.012%). Our proposed FalVolt mitigation method improves the performance of systolicSNNs by enabling them to operate at fault rates of up to 60%, with a negligible drop in classification accuracy (as low as 0.1%). Our results show that FalVolt is 2x faster compared to other state-of-the-art techniques common in artificial neural networks (ANNs), such as fault-aware pruning and retraining without threshold voltage optimization. |
11:03 CET | ST2.2 | AUTOMATED AND AGILE DESIGN OF LAYOUT HOTSPOT DETECTOR VIA NEURAL ARCHITECTURE SEARCH Speaker: Zihao Chen, Fudan University, CN Authors: Zihao Chen1, Fan Yang1, Li Shang2 and Xuan Zeng1 1Fudan University, CN; 2Fudan University, CN Abstract This paper presents a neural architecture search scheme for chip layout hotspot detection. In this work, hotspot detectors, in the form of neural networks, are modeled as weighted directed acyclic graphs. A variational autoencoder maps the discrete graph topological space into a continuous embedding space. Bayesian Optimization performs neural architecture search in this embedding space, where an architecture performance predictor is employed to accelerate the search process. Experimental studies on ICCAD 2012 and ICCAD 2019 Contest benchmarks demonstrate that the proposed scheme significantly improves the agility of previous neural architecture search schemes, and generates hotspot detectors with competitive detection accuracy, false alarm rate, and inference time. |
11:06 CET | ST2.3 | UPHEAVING SELF-HEATING EFFECTS FROM TRANSISTOR TO CIRCUIT LEVEL USING CONVENTIONAL EDA TOOL FLOWS Speaker: Florian Klemme, University of Stuttgart, DE Authors: Florian Klemme1, Sami Salamin2 and Hussam Amrouch3 1University of Stuttgart, DE; 2Hyperstone, DE; 3TU Munich, DE Abstract In this work, we are the first to demonstrate how well-established EDA tool flows can be employed to upheave Self-Heating Effects (SHE) from individual devices at the transistor level all the way up to complete large circuits at the final layout (i.e., GDS-II) level. Transistor SHE imposes an ever-growing reliability challenge due to the continuous shrinking of geometries alongside the non-ideal voltage scaling in advanced technology nodes. The challenge is largely exacerbated when more confined 3D structures are adopted to build transistors such as upcoming Nanosheet FETs and Ribbon FETs. By employing increasingly-confined structures and materials of poorer thermal conductance, heat arising within the transistor's channel is trapped inside and cannot escape. This leads to accelerated defect generation and, if not considered carefully, a profound risk to IC reliability. Due to the lack of EDA tool flows that can consider SHE, circuit designers are forced to take pessimistic worst-case assumptions (obtained at the transistor level) to ensure reliability of the complete chip for the entire projected lifetime - at the cost of sub-optimal circuit designs and considerable efficiency losses. Our work paves the way for designers to estimate less pessimistic (i.e., small yet sufficient) safety margins for their circuits leading to higher efficiency without compromising reliability. Further, it provides new perspectives and opens new doors to estimate and optimize reliability correctly in the presence of emerging SHE challenge through identifying early the weak spots and failure sources across the design. |
11:09 CET | ST2.4 | BUILT-IN SELF-TEST AND BUILT-IN SELF-REPAIR STRATEGIES WITHOUT GOLDEN SIGNATURE FOR COMPUTING IN MEMORY Speaker: Yu-Chih Tsai, Author, TW Authors: Yu-Chih Tsai, Wen-chien Ting, Chia-Chun Wang, Chia-Cheng Chang and Ren-Shuo Liu, National Tsing Hua University, TW Abstract This paper proposes built-in self-test (BIST) and built-in self-repair (BISR) strategies for computing in memory (CIM), including a novel testing method and two repair schemes which are CIM output range adjusting and CIM bitline reordering. They all focus on mitigating the impacts of inherent and inevitable CIM inaccuracy on convolution neural networks (CNNs). Regarding the proposed BIST strategy, it exploits the distributive law to achieve at-speed CIM tests without storing testing vectors or golden results. Besides, it can assess the severity of the inherent inaccuracies among CIM bitlines instead of only offering a pass/fail outcome. In addition to BIST, we propose two BISR strategies. First, we propose to slightly offset the dynamic range of CIM outputs toward the negative side to create a margin for negative noises. By not cutting CIM outputs off at zero, negative noises are preserved to cancel out positive noises statistically, and accuracy impacts are mitigated. Second, we propose to remap the bitlines of CIM according to our BIST outcomes. Briefly speaking, we propose to map the least noisy bitlines to be the MSBs. This remapping can be done in the digital domain without touching the CIM internals. Experiments show that our proposed BIST and BISR strategies can restore CIM to less than 1% Top-1 accuracy loss with slight hardware overhead. |
11:12 CET | ST2.5 | SECURITY-AWARE APPROXIMATE SPIKING NEURAL NETWORK Speaker: Ayesha Siddique, University of Missouri-Columbia, US Authors: Syed Tihaam Ahmad, Ayesha Siddique and Khaza Anuarul Hoque, University of Missouri, US Abstract Deep Neural Networks (DNNs) and Spiking Neural Networks (SNNs) are both known for their susceptibility to adversarial attacks. Therefore, researchers in the recent past have extensively studied the robustness and defense of DNNs and SNNs under adversarial attacks. Compared to accurate SNNs (AccSNN), approximate SNNs (AxSNNs) are known to be up to 4X more energy-efficient for ultra-low power applications. Unfortunately, the robustness of AxSNNs under adversarial attacks is yet unexplored. In this paper, we first extensively analyze the robustness of AxSNNs under different structural parameters and approximation levels against two gradient-based and two neuromorphic attacks. Our study revealed that AxSNNs are more prone to adversarial attacks than AccSNNs. Then we propose a novel design approach for designing robust AxSNNs using two novel defense methods: precision scaling and approximation- and quantization-aware filtering (AQF). The effectiveness of these two defense methods was evaluated using one static and one neuromorphic dataset. Our results demonstrate that precision scaling and AQF can significantly improve the robustness of AxSNNs. For instance, a PGD attack on AxSNN results in 72% accuracy loss, whereas the same attack on the precision-scaled AxSNN leads to only 17% accuracy loss in the static MNIST dataset (4X robustness improvement). Similarly, for the neuromorphic DVS128 Gesture dataset, we observe that Sparse Attack on AxSNN leads to 77% accuracy loss compared to AccSNN without any attack. However, with AQF, the accuracy loss is only 2% (38X robustness improvement). |
11:15 CET | ST2.6 | BAFFI: A BIT-ACCURATE FAULT INJECTOR FOR IMPROVED DEPENDABILITY ASSESSMENT OF FPGA PROTOTYPES Speaker: Ilya Tuzov, Universitat Politecnica de Valencia, ES Authors: Ilya Tuzov1, David de Andres2, Juan-Carlos Ruiz2 and Carles Hernandez2 1Universidad Politécnica de Valencia, ES; 2Universidad Politécnica de Valencia, ES Abstract FPGA-based fault injection (FFI) is an indispensable technique for verification and dependability assessment of FPGA designs and prototypes. Existing FFI tools make use of Xilinx essential bits technology to locate the relevant fault targets in FPGA configuration memory (CM). Most FFI tools treat essential bits as a black box, while few of them are able to filter essential bits on an area basis in order to selectively target design components contained within predefined Pblocks. This approach, however, remains insufficiently precise since the granularity of Pblocks in practice does not reach the smallest design components. This paper proposes an open-source FFI tool that enables much more fine-grained FFI experiments for Xilinx 7-series and Ultrascale+ FPGAs. By mapping the essential bits onto the hierarchical netlist, it allows precise targeting of any component in the design tree, down to an individual LUT or register, without the need for defining Pblocks (floorplanning). With minimal experimental effort it estimates the contribution of each DUT component to the resulting dependability features, and discovers weak points of the DUT. Through case studies we show how the proposed tool can be applied to different kinds of DUTs: from small-footprint microcontrollers up to a multicore RISC-V SoC. The correctness of FFI results is validated by means of RT-level and gate-level simulation-based fault injection. |
11:18 CET | ST2.7 | A NOVEL FAULT-TOLERANT ARCHITECTURE FOR TILED MATRIX MULTIPLICATION Speaker: Sandip Kundu, University of Massachusetts Amherst, US Authors: Sandeep Bal1, Chandra Sekhar Mummidi1, Victor da Cruz Ferreira2, Sudarshan Srinivasan3 and Sandip Kundu4 1University of Massachusetts, Amherst, US; 2Federal University of Rio de Janeiro, BR; 3Intel Labs, IN; 4University of Massachusetts Amherst, US Abstract General matrix multiplication (GEMM) is common to many scientific and machine-learning applications. Convolution, the dominant computation in Convolutional Neural Networks (CNNs), can be formulated as a GEMM problem. Due to its widespread use, a new generation of processors features GEMM acceleration in hardware. Intel recently announced the Advanced Matrix Extensions (AMX®) instruction set for GEMM, which is supported by 1kB AMX registers and a Tile Multiplication unit (TMUL) for multiplying tiles (sub-matrices) in hardware. Silent Data Corruption (SDC) is a well-known problem that occurs when hardware generates corrupt output. Google and Meta recently reported findings of SDC in GEMM in their data centers. Algorithm-Based Fault Tolerance (ABFT) is an efficient mechanism for detecting and correcting errors in GEMM, but classic ABFT solutions are not optimized for hardware acceleration. In this paper, we present a novel ABFT implementation directly on hardware. Though the exact implementation of the Intel TMUL is not known, we propose two different TMUL architectures representing two design points in the area-power-performance spectrum and illustrate how ABFT can be directly incorporated into the TMUL hardware. This approach has two advantages: (i) an error can be concurrently detected at the tile level, which is an improvement over finding such errors only after performing the full matrix multiplication; and (ii) we further demonstrate that performing ABFT at the hardware level has no performance impact and only a small area, latency, and power overhead. |
Extended Abstracts
Time | Label | Presentation Title Authors |
---|---|---|
11:21 CET | ST2.8 | REDUCE: A FRAMEWORK FOR REDUCING THE OVERHEADS OF FAULT-AWARE RETRAINING Speaker: Muhammad Abdullah Hanif, New York University Abu Dhabi, AE Authors: Muhammad Abdullah Hanif and Muhammad Shafique, New York University Abu Dhabi, AE Abstract Fault-aware retraining has emerged as a prominent technique for mitigating permanent faults in Deep Neural Network (DNN) hardware accelerators. However, retraining leads to huge overheads, specifically when used for fine-tuning large DNNs designed for solving complex problems. Moreover, as each fabricated chip can have a distinct fault pattern, fault-aware retraining is required to be performed for each chip individually considering its unique fault map, which further aggravates the problem. To reduce the overall retraining cost, in this work, we introduce the concept of resilience-driven retraining amount selection. To realize this concept, we propose a novel framework, Reduce, that, at first, computes the resilience of the given DNN to faults at different fault rates and with different amounts of retraining. Then, based on the resilience, it computes the amount of retraining required for each chip considering its unique fault map. We demonstrate the effectiveness of our methodology for a systolic array-based DNN accelerator experiencing permanent faults in the computational array. |
11:21 CET | ST2.9 | BITSTREAM-LEVEL INTERCONNECT FAULT CHARACTERIZATION FOR SRAM-BASED FPGAS Speaker: Christian Fibich, University of Applied Sciences Technikum Wien, AT Authors: Christian Fibich1, Martin Horauer1 and Roman Obermaisser2 1University of Applied Sciences Technikum Wien, AT; 2University of Siegen, DE Abstract A significant portion of the configuration memory of modern SRAM-based FPGAs is dedicated to configuring the interconnect. Understanding the effects of interconnect-related Single-Event Upsets (SEUs) on the circuit's behavior is critical for developing accurate reliability prediction and efficient fault mitigation approaches. This work describes an approach to classify the effects of single-bit interconnect faults into well-known fault models, and to characterize the electrical effects of these modeled faults. An experimental fault characterization for two families of Xilinx and Lattice FPGAs shows that different types of single-bit interconnect faults exhibit significantly different criticality. This may serve as a partial explanation for the large discrepancies reported in literature between faults predicted to be critical by state-of-the-art methods ("essential bits") compared to the numbers of actually critical bits determined experimentally and may be used to improve prediction accuracy or reliability-aware routing approaches. |
11:21 CET | ST2.10 | COMPACT TEST PATTERN GENERATION FOR MULTIPLE FAULTS IN DEEP NEURAL NETWORKS Speaker: Dina Moussa, Karlsruhe Institute of Technology (KIT) - CDNC, EG Authors: Dina Moussa1, Michael Hefenbrock2 and Mehdi Tahoori1 1Karlsruhe Institute of Technology, DE; 2RevoAI, DE Abstract Deep neural networks (DNNs) have achieved record-breaking performance in various applications. However, this often comes at significant computational costs. To reduce the energy footprint and increase performance, DNNs are often implemented on specific hardware accelerators, such as Tensor Processing Units (TPU) or emerging Memristive technologies. Unfortunately, the presence of various hardware faults can threaten these accelerators' performance and degrade the inference accuracy. This necessitates the development of efficient testing methodologies to unveil hardware faults in DNN accelerators. In this work, we propose a test pattern generation approach to detect fault patterns in DNNs for a common type of hardware fault, namely, faulty (weight) value representation on the bit level. In contrast to most related works, which reveal faults via output deviations, our test patterns are constructed to reveal faults via misclassification, which is more realistic for black-box testing. The experimental results show that the generated test patterns provide 100% fault coverage for targeted fault patterns. In addition, a high compaction ratio was achieved over different datasets and model architectures (up to 50x), and high fault coverage (up to 99.9%) for unseen fault patterns during the test generation phase. |
11:21 CET | ST2.11 | READ: RELIABILITY-ENHANCED ACCELERATOR DATAFLOW OPTIMIZATION USING CRITICAL INPUT PATTERN REDUCTION Speaker: Zuodong Zhang, Peking University, CN Authors: Zuodong Zhang1, Meng Li2, Yibo Lin3, Runsheng Wang3 and Ru Huang3 1School of Integrated Circuits, Peking University, CN; 2Institute for Artificial Intelligence and School of Integrated Circuits, Peking University, CN; 3Peking University, CN Abstract With the rapid advancements of deep learning in recent years, hardware accelerators are continuously deployed in more and more safety-critical applications such as autonomous driving and robotics. The accelerators are usually fabricated with advanced technology nodes for higher performance and energy efficiency, which makes them more prone to timing errors under process, voltage, temperature, and aging (PVTA) variations. By revisiting the physical sources of timing errors, we show that most of the timing errors in the accelerator are caused by several specific input patterns, defined as critical input patterns. To improve the robustness of the accelerator, in this paper, we propose READ, a reliability-enhanced accelerator dataflow optimization method that can effectively reduce timing errors. READ reduces the critical input patterns by exploring the optimal computing sequence when mapping a trained deep neural network to the accelerator. READ only changes the order of MAC operations in a convolution, and it does not introduce any additional hardware overhead to the computing array. The experimental results on VGG-16 and ResNet-18 demonstrate on average 6.3× timing error reduction and up to 24.25× timing error reduction for certain layers. The results also show that READ enables the accelerator to maintain accuracy over a wide range of PVTA variations, making it a promising approach for robust deep learning design. |
11:21 CET | ST2.12 | ROBUST RESISTIVE OPEN DEFECT IDENTIFICATION USING MACHINE LEARNING WITH EFFICIENT FEATURE SELECTION Speaker: Zahra Paria Najafi-Haghi, University of Stuttgart, DE Authors: Zahra Paria Najafi-Haghi1, Florian Klemme1, Hanieh Jafarzadeh1, Hussam Amrouch2 and Hans-Joachim Wunderlich1 1University of Stuttgart, DE; 2TU Munich, DE Abstract Resistive open defects in FinFET circuits are reliability threats and should be ruled out before deployment. The performance variations due to these defects are similar to the effects of process variations, which are mostly benign. In order not to sacrifice yield for reliability, the effect of defects should be distinguished from that of process variations. It has been shown that machine learning (ML) schemes are able to classify defective circuits with high accuracy based on the maximum frequencies Fmax obtained under multiple supply voltages Vdd ∈ Vop. The paper at hand presents a method to minimize the number of required measurements. Each supply voltage Vdd defines a feature Fmax(Vdd). A feature selection technique is presented, which also uses the already available Fmax measurements. It is shown that ML-based techniques can work efficiently and accurately with this reduced number of Fmax(Vdd) measurements. |
LK2 Special Day Lunchtime Keynote
Add this session to my calendar
Date: Tuesday, 18 April 2023
Time: 13:00 CET - 14:00 CET
Location / Room: Darwin Hall
Session chair:
Marina Zapater, University of Applied Sciences Western Switzerland, CH
Session co-chair:
Marc Duranton, CEA, FR
Time | Label | Presentation Title Authors |
---|---|---|
13:00 CET | LK2.1 | INTERACTING WITH SOCIALLY INTERACTIVE AGENT Presenter: Catherine Pelachaud, CNRS-ISIR, Sorbonne Université, FR Author: Catherine Pelachaud, CNRS-ISIR, Sorbonne Université, FR Abstract Our research work focuses on modeling Socially Interactive Agents (SIAs), i.e. agents capable of interacting socially with human partners, of communicating verbally and non-verbally, of showing emotions, but also of adapting their behaviors to favor the engagement of their partners during the interaction. As a partner in an interaction, a SIA should be able to adapt its multimodal behaviors and conversational strategies to optimize the engagement of its human interlocutors. We have developed models to equip these agents with these communicative and social abilities. In this talk, I will present the work we have been conducting. |
BPA11 Supply chain attacks
Add this session to my calendar
Date: Tuesday, 18 April 2023
Time: 14:00 CET - 16:00 CET
Location / Room: Okapi Room 0.8.3
Session chair:
Francesco Regazzoni, University of Amsterdam and ALaRI - USI, CH
Time | Label | Presentation Title Authors |
---|---|---|
14:00 CET | BPA11.1 | HARDWARE TROJANS IN ENVM NEUROMORPHIC DEVICES Speaker: Mircea R. Stan, University of Virginia, US Authors: Lingxi Wu, Rahul Sreekumar, Rasool Sharifi, Kevin Skadron, Mircea Stan and Ashish Venkat, University of Virginia, US Abstract Fast and energy-efficient execution of a DNN on traditional CPU- and GPU-based architectures is challenging due to excessive data movement and inefficient computation. Emerging non-volatile memory (eNVM)-based accelerators that mimic biological neuron computations in the analog domain have shown significant performance improvements. However, the potential security threats in the supply chain of such systems have been largely understudied. This work describes a hardware supply chain attack against analog eNVM neural accelerators by identifying potential Trojan insertion points and proposes a hardware Trojan design that stealthily leaks model parameters while evading detection. Our evaluation shows that such a hardware Trojan can recover over 90% of the synaptic weights. |
14:25 CET | BPA11.2 | EVOLUTE: EVALUATION OF LOOK-UP-TABLE-BASED FINE-GRAINED IP REDACTION Speaker: Farimah Farahmandi, University of Florida, US Authors: Rui Guo1, Mohammad Rahman1, Hadi Mardani Kamali1, Fahim Rahman1, Farimah Farahmandi1 and Mark Tehranipoor2 1University of Florida, US; 2Intel Charles E. Young Preeminence Endowed Chair Professor in Cybersecurity, Associate Chair for Research and Strategic Initiatives, ECE Department, University of Florida, US Abstract Recent studies on intellectual property (IP) protection techniques demonstrate that engaging embedded reconfigurable components (e.g., eFPGA redaction) would be a promising approach to concealing the functional and structural information of the security-critical design. However, detailed investigation reveals that such techniques suffer from almost prohibitive overhead in terms of area, power, delay, and testability. In this paper, we introduce "EvoLUTe", a distinct and significantly more fine-grained redaction methodology using smaller reconfigurable components (such as look-up-tables (LUTs)). In "EvoLUTe", we examine both eFPGA-based and LUT-based design spaces, demonstrating that a novel cone-based and fine-grained universal function modeling approach using LUTs is capable of providing the same degree of resiliency at much lower area/power/delay and testability costs. |
14:50 CET | BPA11.3 | RTLOCK: IP PROTECTION USING SCAN-AWARE LOGIC LOCKING AT RTL Speaker: Farimah Farahmandi, University of Florida, US Authors: Md Rafid Muttaki1, Shuvagata Saha1, Hadi Mardani Kamali1, Fahim Rahman1, Mark Tehranipoor2 and Farimah Farahmandi1 1University of Florida, US; 2Intel Charles E. Young Preeminence Endowed Chair Professor in Cybersecurity, Associate Chair for Research and Strategic Initiatives, ECE Department, University of Florida, US Abstract Conventional logic locking techniques mainly focus on gate-level netlists to combat IP piracy and IC overproduction. However, this is generally not sufficient for protecting the semantics and behaviors of the design. Further, these techniques are even more questionable when the IC supply chain is at risk of insider threats. This paper proposes RTLock, a robust logic locking framework at the RTL abstraction. RTLock provides a detailed formal analysis of the design specs at the RTL that determines the locking candidate points w.r.t. attack resiliency (SAT/BMC), locking key size, and overhead. RTLock incorporates (partial) DFT infrastructure (scan chain) at the RTL, enabled with a scan locking mechanism. It allows us to push all the necessary security-driven actions to the highest abstraction level, thus making the flow EDA-tool agnostic. Additionally, RTLock demonstrates why RTL-based locking must be coupled with encryption and management protocols (e.g., IEEE P1735) to be effective against insider threats. Our experimental results show that, compared with other techniques, RTLock protects the design against broader threats at low overhead and without compromising testability. |
15:15 CET | BPA11.4 | INTERACTIVE TECHNICAL PRESENTATIONS BY THE AUTHORS Speaker: Authors of the session, DATE, BE Author: Session Chairs, DATE, BE Abstract Participants can freely interact with authors during their interactive technical presentations. |
BPA7 Improving Heterogenous hardware utilization
Add this session to my calendar
Date: Tuesday, 18 April 2023
Time: 14:00 CET - 16:00 CET
Location / Room: Okapi Room 0.8.2
Session chair:
Bastien Deveautour, CPE-INL, FR
Time | Label | Presentation Title Authors |
---|---|---|
14:00 CET | BPA7.1 | DITTY: DIRECTORY-BASED CACHE COHERENCE FOR MULTICORE SAFETY-CRITICAL SYSTEMS Speaker: Zhuanhao Wu, University of Waterloo, CA Authors: Zhuanhao Wu, Marat Bekmyrza, Nachiket Kapre and Hiren Patel, University of Waterloo, CA Abstract Ditty is a predictable directory-based cache coherence mechanism for multicore safety-critical systems that guarantees a worst-case latency (WCL) on data accesses. Prior approaches for predictable cache coherence use a shared snooping bus to interconnect cores. This restricts the number of cores in the multicore to typically four or eight due to scalability concerns. Ditty takes a first step towards a scalable cache coherence mechanism that is predictable and one that can support a larger number of cores. In designing Ditty, we propose a coherence protocol and micro-architecture additions to deliver a WCL bound that is lower than a naive approach. Our WCL analysis reveals that the resulting bounds are comparable to state-of-the-art bus-based predictable coherence approaches. We prototype Ditty in hardware and empirically evaluate it on an FPGA. Our evaluation shows the observed WCL is within the computed WCL bound for both the synthetic and SPLASH-3 benchmarks. We release our implementation to the public domain. |
14:25 CET | BPA7.2 | LIGHT FLASH WRITE FOR EFFICIENT FIRMWARE UPDATE ON ENERGY-HARVESTING IOT DEVICES Speaker: Qingqiang He, The Hong Kong Polytechnic University, HK Authors: Songran Liu1, Mingsong Lv2, Wei Zhang3, Xu Jiang1, Chuancai Gu4, Tao Yang4, Wang Yi5 and Nan Guan6 1Northeastern University, CN; 2The Hong Kong Polytechnic University, HK; 3School of Cyber Science and Technology, Shandong University, CN; 4Huawei Technologies Company, CN; 5Uppsala University, SE; 6City University of Hong Kong, HK Abstract Firmware update is an essential service on Internet-of-Things (IoT) devices to fix vulnerabilities and add new functionalities. Firmware update is energy-consuming since it involves intensive flash erase/write operations. Nowadays, IoT devices are increasingly powered by energy harvesting. As the energy output of the harvesters on IoT devices is typically tiny and unstable, a firmware update will likely experience power failures during its progress and fail to complete. This paper presents an approach to increase the success rate of firmware update on energy-harvesting IoT devices. The main idea is to first conduct a lightweight flash write with reduced erase/write time (and thus less energy consumed) to quickly save the new firmware image to flash memory before a power failure occurs. To ensure a long data retention time, a reinforcement step follows to re-write the new firmware image on the flash with default erase/write configuration when the system is not busy and has free energy. Experiments conducted with different energy scenarios show that our approach can significantly increase the success rate and the efficiency of firmware update on energy-harvesting IoT devices. |
14:50 CET | BPA7.3 | HADAS: HARDWARE-AWARE DYNAMIC NEURAL ARCHITECTURE SEARCH FOR EDGE PERFORMANCE SCALING Speaker: Halima Bouzidi, LAMIH/UMR CNRS, Université Polytechnique Hauts-de-France, FR Authors: Halima Bouzidi1, Mohanad Odema2, Hamza Ouarnoughi3, Mohammad Al Faruque2 and Smail Niar4 1University Polytechnique Hauts-de-France, LAMIH, CNRS, UMR 8201, F-59313 Valenciennes, France, FR; 2University of California, Irvine, US; 3INSA Hauts-de-France, FR; 4INSA Hauts-de-France and CNRS, FR Abstract Dynamic neural networks (DyNNs) have become viable techniques to enable intelligence on resource-constrained edge devices while maintaining computational efficiency. In many cases, the implementation of DyNNs can be sub-optimal due to its underlying backbone architecture being developed at the design stage independent of both: (i) potential support for dynamic computing, e.g. early exiting, and (ii) resource efficiency features of the underlying hardware, e.g., dynamic voltage and frequency scaling (DVFS). Addressing this, we present HADAS, a novel Hardware- Aware Dynamic Neural Architecture Search framework that realizes DyNN architectures whose backbone, early exiting features, and DVFS settings have been jointly optimized to maximize performance and resource efficiency. Our experiments using the CIFAR-100 dataset and a diverse set of edge computing platforms have shown that HADAS can elevate dynamic models' energy efficiency by up to 57% for the same level of accuracy scores. |
15:15 CET | BPA7.4 | INTERACTIVE TECHNICAL PRESENTATIONS BY THE AUTHORS Speaker: Authors of the session, DATE, BE Author: Session Chairs, DATE, BE Abstract Participants can freely interact with authors during their interactive technical presentations. |
FS-Ex Focus Session: Smart Additive Manufacturing: Fabrication and Design (Automation)
Add this session to my calendar
Date: Tuesday, 18 April 2023
Time: 14:00 CET - 15:30 CET
Location / Room: Okapi Room 0.8.1
Flexible electronics is an emerging and fast-growing field with many demanding application domains such as wearables, smart sensors, and the Internet of Things (IoT). Several technologies, processes and paradigms can be used to design and fabricate flexible circuits. Unlike the traditional computing and electronics domain, which is mostly driven by performance characteristics, flexible electronics is mainly associated with low fabrication costs (as such circuits are used even in the consumer market) and low energy consumption (as they may be used in energy-harvested systems). While the main advances in this field have focused on fabrication and process aspects, the design flow, and in particular the design automation flow and the required computing paradigms, have had limited exposure. The purpose of this special session is to bring to the attention of the design automation community some of the key advances in the field of flexible electronics and additive manufacturing, as well as some of the design (automation) aspects, which will hopefully inspire further attention from the design automation community to this fast-growing field.
Time | Label | Presentation Title Authors |
---|---|---|
14:00 CET | FS-Ex.1 | LOWERING THE BARRIER FOR ENTRY FOR FLEXIBLE FOUNDRY TECHNOLOGY Presenter: David Verity, PragmatIC, GB Author: David Verity, PragmatIC, GB Abstract It is important for providers of novel technology to offer compatibility with current EDA tools and flows. We will discuss some of the ways we are lowering the barrier for entry for using our novel technology, allowing partners and customers to take advantage of dedicated tapeouts and rapid prototyping services. We will introduce some of the features of our core technology along with some of the additional services enabling easy adoption of flexible technology. We will also discuss our future plans to enhance our IP offering and more fully integrate with commercial and open-source tool chains. |
14:30 CET | FS-Ex.2 | THE FOUNDRY MODEL FOR THIN-FILM TRANSISTOR TECHNOLOGIES Presenter: Kris Myny, KU Leuven, BE Author: Kris Myny, KU Leuven, BE Abstract The foundry model in the semiconductor industry has proven successful for Si CMOS technologies, whereby fabless design houses focus on design activities, while the manufacturing of the chips is outsourced to foundries. Thin-film transistor (TFT) technologies today do not structurally offer such a foundry model that would allow external design houses or universities to exploit the technology. The main reason is that the current product portfolio of TFTs focuses on displays, whereby the foundry can also support the design needs. However, as the internet-of-things era is growing, with upcoming needs for ubiquitous sensors and actuators, the complexity of the designs increases from single-pixel electronics to full systems comprising analog and digital circuits based on TFTs or even hybrid with Si CMOS. As such, a foundry model for TFT technologies would be beneficial, whereby fabless design houses and universities could focus on the development of novel circuit designs for applications such as microfluidics, the internet-of-things, etc. In this presentation, I will dive into the high potential of the fabless design model for TFTs. I will present our first results based on the fabless model for indium-gallium-zinc-oxide (IGZO) and low-temperature poly-crystalline silicon (LTPS) based technologies, discussing their benefits for various applications. |
15:00 CET | FS-Ex.3 | HIGHLY-BESPOKE ROBUST PRINTED NEUROMORPHIC CIRCUITS Speaker: Mehdi Tahoori, Karlsruhe Institute of Technology, DE Authors: Haibin Zhao1, Brojogopal Sapui1, Michael Hefenbrock2, Zhidong Yang1, Michael Beigl1 and Mehdi Tahoori1 1Karlsruhe Institute of Technology, DE; 2RevoAI GmbH, DE Abstract With the rapid growth of the Internet of Things, smart fast-moving consumer products, and wearable devices, requirements such as flexibility, non-toxicity, and low cost are desperately required. However, these requirements are usually beyond the reach of conventional rigid silicon technologies. In this regard, printed electronics offers a promising alternative. Combined with neuromorphic computing, printed neuromorphic circuits offer not only the aforementioned properties, but also compensate for some of the weaknesses of printed electronics, such as manufacturing variations, low device count, and high latency. Generally, (printed) neuromorphic circuits express their functionality through printed resistor crossbars to emulate matrix multiplication, and nonlinear circuitry to express activation functions. The values of the former are usually learned, while the latter is designed beforehand and considered fixed in training for all tasks. The additive manufacturing feature of printed electronics allows the design of highly-bespoke designs. In the case of printed neuromorphic circuits, the circuit is optimized to a particular dataset. Moreover, we explore an approach to learn not only the values of the crossbar resistances, but also the parameterization of the nonlinear components for a bespoke implementation. While providing additional flexibility of the functionality to be expressed, this will also allow an increased robustness against printing variation. The experiments show that the accuracy and robustness of printed neuromorphic circuits can be improved by 26% and 75% respectively under 10% variation of circuit components. |
LKS3 Later … with the keynote speakers
Add this session to my calendar
Date: Tuesday, 18 April 2023
Time: 14:00 CET - 15:30 CET
Location / Room: Darwin Hall
Session chair:
Marina Zapater, University of Applied Sciences Western Switzerland, CH
Session co-chair:
Marc Duranton, CEA, FR
SA8 Industrial Experiences Brief Papers
Add this session to my calendar
Date: Tuesday, 18 April 2023
Time: 14:00 CET - 16:00 CET
Location / Room: Gorilla Room 1.5.3
Session chair:
Paolo Bernardi, Politecnico di Torino, IT
Time | Label | Presentation Title Authors |
---|---|---|
14:00 CET | SA8.1 | MULTIPHYSICS DESIGN AND SIMULATION METHODOLOGY FOR DENSE WDM SILICON PHOTONICS Speaker: Luca Ramini, Hewlett Packard Enterprise, IT Authors: Jinsung Youn1, Luca Ramini1, Zeqin Lu2, Ahsan Alam2, James Pond2, Marco Fiorentino1 and Raymond Beausoleil1 1Hewlett Packard Enterprise, US; 2Ansys, CA Abstract We present a novel design methodology covering multiphysics simulation workflows for microring-based dense wavelength division multiplexing (DWDM) Silicon Photonics (SiPh) circuits used for high-performance computing systems and data centers. The main workflow is an electronics-photonics co-simulation comprising various optical devices from a SiPh process design kit (PDK), electronic circuits designed with a commercial CMOS foundry's PDK, and channel S-parameter models, such as interposers and packages, generated by using a full-wave electromagnetic (EM) solver. With the co-simulation, electrical and optical as well as electro-optical behaviors can be analyzed at the same time because best-in-class electronics and photonic integrated circuit simulators interact with each other. As a result, not only optical spectrum and eye diagrams but also electrical eye diagrams can be evaluated on the same simulation platform. In addition, the proposed methodology includes a statistical- and thermal-aware photonic circuit simulation workflow to evaluate process and temperature variations as well as estimate the required thermal tuning power as those non-idealities can lead to microring's resonance wavelengths shifting. For this, thermal simulation is conducted with a 3D EM model which is also used for such signal and power integrity analysis as a channel link simulation and IR drop. Also, photonic circuit simulations are performed where a design exploration and optimization of such microring's design parameters as Q-factor, and bias voltages are required to select the most promising designs, for example, to satisfy a specific bit-error rate. With the proposed design methodology having those multiphysics simulation workflows, DWDM SiPh can be fully optimized to have reliable system performance. |
14:25 CET | SA8.2 | TWO-STREAM NEURAL NETWORK FOR POST-LAYOUT WAVEFORM PREDICTION Speaker: Sanghwi Kim, SK Hynix, KR Authors: Sanghwi Kim, Hyejin Shin and Hyunkyu Kim, SK Hynix, KR Abstract The gap between pre- and post-simulation, as well as the considerable layout time, increases the significance of the post-layout waveform prediction in dynamic random access memory (DRAM) design. This study develops a post-layout prediction model using the following two-stream neural network: (1) a multi-layer perceptron neural network to calculate the coupling noise by using the physical properties of global interconnects, and (2) a convolutional neural network to compute the time series trends of the waveforms by referencing adjacent signals. The proposed model trains two types of heterogeneous data such that accuracy of 95.5% is achieved on the 1b DRAM process 16Gb DDR5 composed of hundreds of millions of transistors. The model significantly improves the design completeness by pre-detecting the deterioration in the signal quality via post-layout waveform prediction. Generally, although a few weeks are required to obtain post-layout waveforms after the circuit design process, waveforms can be instantly predicted using our proposed model. |
14:50 CET | SA8.3 | QUANTIZATION-AWARE NEURAL ARCHITECTURE SEARCH WITH HYPERPARAMETER OPTIMIZATION FOR INDUSTRIAL PREDICTIVE MAINTENANCE APPLICATIONS Speaker: Nick van de Waterlaat, NXP Semiconductors, NL Authors: Nick van de Waterlaat, Sebastian Vogel, Hiram Rayo Torres Rodriguez, Willem Sanberg and Gerardo Daalderop, NXP Semiconductors, NL Abstract Optimizing the efficiency of neural networks is crucial for ubiquitous machine learning on the edge. However, it requires specialized expertise to account for the wide variety of applications, edge devices, and deployment scenarios. An attractive approach to mitigate this bottleneck is Neural Architecture Search (NAS), as it allows for optimizing networks for both efficiency and task performance. This work shows that including hyperparameter optimization for training-related parameters alongside NAS enables substantial improvements in efficiency and task performance on a predictive maintenance task. Furthermore, this work extends the combination of NAS and hyperparameter optimization with INT8 quantization since efficiency is of utmost importance for resource-constrained devices in industrial applications. Our combined approach, which we refer to as Quantization-Aware NAS (QA-NAS), allows for further improvements in efficiency on the predictive maintenance task. Consequently, our work shows that QA-NAS is a promising research direction for optimizing neural networks for deployment on resource-constrained edge devices in industrial applications. |
15:16 CET | SA8.4 | INTERACTIVE TECHNICAL PRESENTATIONS BY THE AUTHORS Speaker: Authors of the session, DATE, BE Author: Session Chairs, DATE, BE Abstract Participants can freely interact with authors during their interactive technical presentations. |
STF Student Teams Fair
Add this session to my calendar
Date: Tuesday, 18 April 2023
Time: 14:00 CET - 15:00 CET
Location / Room: Marble Hall
Session chair:
Dirk Stroobandt, Ghent University, BE
This is a Young People Programme event. The Student Teams Fair brings together university student teams participating at international competitions with EDA and microelectronic companies and DATE attendees. Student teams will have the opportunity to present their activities, success stories and challenges, and to get support from companies and DATE researchers for future activities.
Selected Teams:
KITcar e.V. - cognitive autonomous racing, https://www.intl.kit.edu/english/19321.php
Delft Hyperloop, https://www.delfthyperloop.nl/
UGent Racing, https://www.ugentracing.be/
HYPED Hyperloop Edinburgh, https://www.hyp-ed.com/
NeuroTech Leuven, https://www.ntxl.org/
US1 Unplugged session
Add this session to my calendar
Date: Tuesday, 18 April 2023
Time: 14:00 CET - 15:30 CET
Location / Room: Nightingale Room 2.6.1/2
Come join us for stimulating brainstorm discussions in small groups about the future of digital engineering. Our focus will be on the digital twinning paradigm where virtual instances are created of a system as it is operated, maintained, and repaired (e.g., each individual car of a certain model). We investigate how to take advantage of this paradigm in engineering systems and what new system engineering approaches and architectures (hardware/software) and design workflows are needed and become possible.
W06 Can Autonomy be Safe?
Add this session to my calendar
Date: Tuesday, 18 April 2023
Time: 14:00 CET - 18:00 CET
Location / Room: Gorilla Room 1.5.4/5
Organisers:
Selma Saidi, TU Dortmund, DE
Rolf Ernst, TU Braunschweig, DE
Sebastian Steinhorst, TU Munich, DE
ASD Workshop (Tuesday April 18: 14h - 18h00): Can Autonomous Systems Be Safe?
Despite the advancement of machine learning and artificial intelligence, safety still constitutes a main hurdle for supporting high levels of autonomy in domains such as self-driving cars, where thousands of car accidents involving autonomous functionalities are reported every year. There are many more examples where autonomous systems reliability and safety are core requirements, from robotics, trains or UAVs all the way to large systems-of-systems, such as the smart grid. The design of safety-critical and high-reliability systems is governed by strict regulations covering the whole product life cycle, from conception to production to deployment and maintenance. The design process according to safety standards typically assumes a correct and complete system specification. For autonomous systems, it is often impossible to show that the specification is complete, due to the underspecified environment and evolving, and often emerging, behaviour. Verification and test of autonomous systems, as well as monitoring safety goals in operation, are huge system design challenges. The setbacks in ambitious autonomous driving goals raise the question of whether systems autonomy is an appropriate concept for safety-critical systems at all. On the other hand, systems autonomy with advanced capabilities, such as self-protection or self-awareness in decision making, might help to control risk under uncertainty and change, and might become an asset and even an enabler for critical complex systems design. Guaranteeing safety thus emerges as a challenging but central topic in the design of autonomous systems.
This year, the workshop offers a unique opportunity for participants to contribute to the discussion and be part of a community working on the design of autonomous systems.
The workshop will start with introductory talks by experts from academia and industry that will highlight the main challenges for safe systems autonomy and applications. After that, there is an opportunity for a limited number of short pitches (ca. 3 min) in which workshop participants can give a statement on the central question "Can Autonomous Systems Be Safe?"; an abstract of the topic is available below. Statements can be on your research topics, practical issues, limitations, visions, or design ideas and suggestions. The short talks will be arranged in thematic blocks, each followed by a discussion.
The last part will be an open discussion with all presenters of the workshop and the audience. In the end, the results will be summarized in a report that will be made available to the workshop participants.
14h00 - 15h30:
- 14h00: Opening and Welcoming
- 14h15: Collective Reasoning for Safe Autonomous Systems Design, Selma Saidi, Professor of Embedded Systems, TU Dortmund University, Germany
- Abstract: Collaboration in multi-agent autonomous systems (AS) is critical to increase performance while ensuring safety. However, due to differences in, e.g., perception quality, some AS should be considered more trustworthy than others when collaboratively building a common environmental model, especially during disagreement. In this talk, we discuss how to increase the reliability of autonomous systems by relying on collective knowledge. We borrow concepts from social epistemology to exploit individual characteristics of autonomous systems, and define and formalize rules for collective reasoning to collaboratively achieve increased safety, trustworthiness and good decision-making under uncertainty.
- 14h30: Limitation-aware designs – a road towards safer systems in complex environments, Peter Schneider, Safety Expert @Bosch Research, Robert Bosch GmbH
- Abstract: The development of safe autonomous driving systems (ADS) has revealed many interdependent design challenges that often cannot simply be solved one-by-one (or measure-by-measure) but need more holistic solution approaches. Traditionally, automotive safety engineering relies heavily on composing safe systems from well-defined and intrinsically safe components. For systems that operate in constantly changing environments, finding a practical and safe 'one-size-fits-all' solution via static designs and the traditional safety engineering toolbox becomes increasingly hard. Hence, instead of further and further tweaking single components to potentially reach 'safety-grade' reliability at some point (or risking getting lost in the long-tail problem), we propose to set a stronger research focus on safety engineering tools and technologies that support the creation of limitation-aware and adaptive system designs which are able to dynamically handle component limitations without compromising on the system application's safety goals. In order to illustrate some of the aforementioned challenges in a practical example, this talk will discuss a few of the interdisciplinary design challenges in the development of a safe ADS environment sensing system. Furthermore, different possible solution strategies are discussed on how to potentially enhance the system's 'safety-by-design' via limitation modelling, design automation and safety-oriented compensation of limitations through interactions with other systems.
- 14h45: Safety Cases for Autonomous Systems, Richard Hawkins, Senior Research Fellow, Assuring Autonomy International Programme (AAIP), Department of Computer Science, University of York, UK
- Abstract: Demonstrating sufficient safety is challenging for all systems, but is even more so for autonomous systems (AS). Autonomy increases uncertainty in the safe operation of autonomous systems, particularly when operating in complex, dynamic and open environments; the pace of technological change in AS also tends to be greatly increased; in addition there is little established best practice to guide safety assurance activities. In this talk I will discuss how safety cases provide a means to address these uncertainties and provide confidence in the safety of an AS by providing explicit safety arguments supported by evidence. I will discuss guidance we have developed at the University of York on the assurance activities to be undertaken and the evidence required to be generated to create a compelling safety case for an AS.
- 15h00: First Round of Statements
- Opening Statement: Digital Twins Enabling Safe Autonomy, Unmesh Bordoloi, Siemens Mentor
- 15h30: Coffee Break
- 16h00: Second Round of Statements
- 16h30: Panel Discussion
- 17h30: Summary of the Workshop and Closing
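As an editor's illustration of the collective-reasoning idea in the opening talk above (not the speaker's actual formalism), the following Python sketch aggregates conflicting object classifications from several autonomous agents by weighting each vote with a trust score; the agent names, trust values and labels are hypothetical.

```python
from collections import defaultdict

def weighted_consensus(observations):
    """Aggregate per-agent labels into one decision, weighting by trust.

    observations: list of (agent_id, label, trust) tuples, trust in [0, 1].
    Returns the label with the highest accumulated trust.
    """
    scores = defaultdict(float)
    for agent_id, label, trust in observations:
        scores[label] += trust
    return max(scores, key=scores.get)

# Hypothetical example: three vehicles disagree about an object ahead.
obs = [("car_A", "pedestrian", 0.9),   # high-quality perception stack
       ("car_B", "cyclist",    0.4),
       ("car_C", "pedestrian", 0.6)]
print(weighted_consensus(obs))  # -> "pedestrian"
```

The point of the sketch is only that disagreement can be resolved by weighting contributions rather than by simple majority; real formalizations would also model uncertainty and how trust itself is updated.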
UF University Fair presentations
Add this session to my calendar
Date: Tuesday, 18 April 2023
Time: 15:30 CET - 16:30 CET
Location / Room: Marble Hall
Session chair:
Nima TaheriNejad, Heidelberg University, DE
At the University Fair, academics from across Europe showcase their top-notch pre-commercial research results and prototypes. You get to interact with them, ask questions, get inspired, or join them on their journey beyond their current prototypes.
Time | Label | Presentation Title Authors |
---|---|---|
15:30 CET | UF.1 | HLS-ING UP RISC-V: STREAMLINING DESIGN AND OPTIMIZATION Presenter: Deepak Ravibabu, DFKI, DE Authors: Deepak Ravibabu1, Muhammad Hassan2 and Rolf Drechsler3 1DFKI, DE; 2University of Bremen/Cyber Physical Systems, DFKI, DE; 3University of Bremen | DFKI, DE Abstract Traditional design and verification methods using Register Transfer Level (RTL) languages for Field Programmable Gate Array (FPGA) design result in high development costs and long time-to-market. To overcome these disadvantages, High-Level Synthesis (HLS) is used, which employs high-level languages such as SystemC and C++ for FPGA design. In comparison to traditional RTL languages like VHDL and Verilog, the object-oriented nature of C++ substantially enhances code understandability. Additionally, it reduces design and verification effort. In this work, we design a 32-bit synthesizable processor core that implements the RISC-V Instruction Set Architecture (ISA) using an HLS tool. We implement the processor core in SystemC and use Xilinx Vivado as the HLS tool. The main challenge is working with the tool-specific parameters that must be used in the code to synthesize the required hardware components on the FPGA. Our experiments on the Basys 3 Artix-7 FPGA trainer board demonstrate the enormous potential for saving design time without sacrificing performance and cost when employing HLS for RISC-V. |
UF.2 | WAL: A LANGUAGE FOR AUTOMATED AND PROGRAMMABLE ANALYSIS OF WAVEFORMS Presenter: Lucas Klemmer, Johannes Kepler University Linz, AT Authors: Lucas Klemmer and Daniel Grosse, Johannes Kepler University Linz, AT Abstract Waveforms are generated by EDA tools at virtually every step of the design process. However, waveform viewing is still a highly manual and tedious process, and unfortunately, there has been no progress in automating the analysis of waveforms. Therefore, we present the open-source Waveform Analysis Language (WAL). WAL is a Domain Specific Language (DSL) for hardware analysis. With WAL, analysis problems can be written in a natural and generic style, which we demonstrate in several case studies. These case studies include the performance analysis of several open-source and proprietary RISC-V processors. |
UF.3 | MACHINE LEARNING-BASED PERFORMANCE ANALYTICS IN COMPUTER SYSTEMS Presenter: Efe Sencan, Boston University, US Authors: Efe Sencan, Burak Aksar and Ayse Coskun, Boston University, US Abstract As data centers that serve many essential societal applications grow larger and become more complex, they suffer more from performance variations due to software bugs, shared resource contention (memory, network, CPU, etc.), and hardware-related problems. These variations have become more prominent due to limitations on power budget as we move towards exascale systems. Unpredictable performance degrades the energy and power efficiency of computer systems, resulting in lower quality-of-service for users, power waste, and higher operational costs. Machine learning (ML) has been gaining popularity as a promising method to detect and diagnose anomalies in computer systems. However, many of the proposed ML solutions are not publicly available to the research community. This demo aims to demonstrate how our ML-based performance anomaly detection and diagnosis frameworks operate and how they can be integrated into a web application for wider dissemination and easier use by the community. |
UF.4 | POWER CONVERTER LARGE SIGNAL SIMULATION BASED ON MACHINE LEARNING - NEURAL NETWORK MODELS TARGETING ENERGY HARVESTING APPLICATIONS Presenter: Christos Sad, Department of Physics, Aristotle University of Thessaloniki, GR Authors: Christos Sad1, Vasso Gogolou2, Thomas Noulis1, Kostas Siozios2 and Stylianos Siskos2 1Department of Physics, Aristotle University of Thessaloniki, GR; 2Department of Physics, Aristotle University of Thessaloniki, GR Abstract A Machine Learning (ML) approach for simulating the behaviour of DC-DC power converter topologies is proposed. A great variety of industrial applications use DC-DC power converter topologies; uninterruptible power supplies (UPS), electric and hybrid vehicles, and medium-voltage DC (MVDC) and high-voltage DC (HVDC) power systems are some characteristic use cases. To this end, an ML model is proposed to simulate the dynamic nonlinear behaviour of the DC-DC power converter, aiming at design-cycle speed-up by minimizing simulation iterations while providing accurate simulation results. |
UF.5 | SEAMLESS ENERGY-AWARE WORKLOAD OPTIMIZATION FOR THE HETEROGENEOUS EDGE-CLOUD CONTINUUM Presenter: Aggelos Ferikoglou, Aristotle University of Thessaloniki, GR Authors: Aggelos Ferikoglou1, Argyris Kokkinis1, Dimitrios Danopoulos2, Dimosthenis Masouros3 and Kostas Siozios4 1Aristotle University of Thessaloniki, GR; 2National Technical University of Athens, GR; 3National TU Athens, GR; 4Department of Physics, Aristotle University of Thessaloniki, GR Abstract Meeting the performance objectives and requirements of state-of-the-art edge-cloud infrastructures and users is crucial nowadays. Efficient resource management in scenarios with increased computational demand, especially in modern applications, is not a trivial task. Cloud providers often employ hardware accelerators to handle the high computational requests, but the diversity of requirements makes efficient and secure deployment a major challenge to overcome. This work presents a novel framework for deploying highly demanding, dynamic and security-critical applications for a variety of domains. Workloads are processed in a holistic and automated manner, overcoming the existing platform barriers stemming from the heterogeneity of computing units. The application development and deployment within this framework focuses on methodologies for automatic GPU and FPGA acceleration as well as efficient, isolated, and secure deployments in the edge-cloud and HPC computing continuum. |
UF.6 | A LOW-COST IOT SYSTEM FOR INDOOR POSITIONING TARGETING ASSISTIVE ENVIRONMENTS Presenter: Vasileios Serasidis, Aristotle University of Thessaloniki, GR Authors: Vasileios Serasidis1, Ioannis Sofianidis1, Argyris Kokkinis1, Vasileios Konstantakos1 and Kostas Siozios2 1Aristotle University of Thessaloniki, GR; 2Department of Physics, Aristotle University of Thessaloniki, GR Abstract The elderly population is increasing, imposing, among other things, a continuous demand for customized health-care solutions that rely on ambient assisted living (AAL) technologies. The majority of these systems are triggered by people's movement and/or their location within homes; thus, efficient technologies that enable accurate indoor positioning are of utmost importance. Existing solutions for this purpose mainly rely on fingerprinting-based and proximity technologies, such as BLE and WiFi beacons. These solutions support indoor positioning at room-scale granularity, e.g., to activate a device when somebody enters or leaves a room. However, their limited estimation accuracy cannot support more advanced services such as the positioning or navigation of elderly people within a room. To overcome this drawback, algorithms that improve accuracy were also explored. (An illustrative sketch of RSSI-based positioning follows this table.) |
UF.7 | NEW HARDWARE TROJAN THREATS IN ENVM-BASED NEUROMORPHIC COMPUTING SYSTEMS Presenter: Lingxi Wu, University of Virginia, US Authors: Lingxi Wu, Rahul Sreekumar, Rasool Sharifi, Kevin Skadron, Stan Mircea and Ashish Venkat, University of Virginia, US Abstract Fast and energy-efficient execution of a DNN on traditional CPU- and GPU-based architectures is challenging due to excessive data movement and inefficient computation. Emerging non-volatile memory (eNVM)-based accelerators that mimic biological neuron computations in the analog domain have shown significant performance improvements. However, the potential security threats in the supply chain of such systems have been largely understudied. This work describes a hardware supply chain attack against analog eNVM neural accelerators by identifying potential Trojan insertion points and proposes a hardware Trojan design that stealthily leaks model parameters while evading detection. Our evaluation shows that such a hardware Trojan can recover over 90% of the synaptic weights, allowing for the reconstruction of the original model. |
UF.8 | DISCUSSION WITH THE AUTHORS Presenter: University Fair Participants, DATE, BE Author: University Fair Participants, DATE, BE Abstract Discussion with the authors |
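To make the BLE-beacon positioning idea in UF.6 above more concrete (an editor's sketch under assumed values, not the authors' implementation), the snippet below converts beacon RSSI readings to rough distances with a log-distance path-loss model and estimates a position by weighted centroid; the beacon coordinates, reference power and path-loss exponent are hypothetical.

```python
def rssi_to_distance(rssi_dbm, tx_power_dbm=-59.0, path_loss_exp=2.0):
    """Log-distance path-loss model: distance in metres from an RSSI sample."""
    return 10 ** ((tx_power_dbm - rssi_dbm) / (10 * path_loss_exp))

def weighted_centroid(beacons):
    """beacons: list of ((x, y), rssi_dbm). Nearer beacons get larger weights."""
    weights = [1.0 / max(rssi_to_distance(rssi), 0.1) for _, rssi in beacons]
    total = sum(weights)
    x = sum(w * pos[0] for (pos, _), w in zip(beacons, weights)) / total
    y = sum(w * pos[1] for (pos, _), w in zip(beacons, weights)) / total
    return x, y

# Hypothetical room with three fixed BLE beacons at known coordinates (metres).
readings = [((0.0, 0.0), -65), ((4.0, 0.0), -72), ((2.0, 3.0), -60)]
print(weighted_centroid(readings))  # rough (x, y) estimate inside the room
```

Room-scale proximity only needs the strongest beacon; finer in-room positioning, as the abstract notes, needs additional algorithms on top of such raw distance estimates.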
WST Workshop for Student Teams
Add this session to my calendar
Date: Tuesday, 18 April 2023
Time: 16:00 CET - 18:00 CET
Location / Room: Okapi Room 0.8.3
Session chair:
Anton Klotz, Cadence Design Systems, DE
Presenter: Graeme Ritchie, Cadence Design Systems
Abstract: This is a Young People Programme event: an introduction to the Microwave Office (MWO) design platform. Topics covered include a general overview of MWO, how to set up a new process technology for use in MWO, EM extraction and simulation, synthesis capabilities in MWO, and thermal simulation using Celsius from within MWO.
FS5 Focus session: Cross Layer Design for the Predictive Assessment of Technology-Enabled Architectures
Add this session to my calendar
Date: Tuesday, 18 April 2023
Time: 16:30 CET - 18:00 CET
Location / Room: Gorilla Room 1.5.3
Session chair:
Michael Niemier, University of Notre Dame, US
Time | Label | Presentation Title Authors |
---|---|---|
16:30 CET | FS5.1 | "END TO END" FEW SHOT LEARNING WITH MEMRISTIVE CROSSBAR ARRAYS AND MEMORY AUGMENTED NEURAL NETWORKS Presenter: Can Li, University of Hong Kong, HK Author: Can Li, University of Hong Kong, HK Abstract This presentation considers a "case study" where technology-enabled architectures based on RRAM crossbar arrays are evaluated in the context of few-shot learning models/memory augmented neural networks (MANNs). RRAM-based crossbar arrays are employed to universally realize matrix-vector multiplication for the CNNs, hashing, and AM functions that form the computational workload of a MANN. This case study will illustrate: (1) how device models may be employed; (2) the need to assess different aspects of the design stack -- e.g., as degradations from iso-accuracy may stem not just from device variation, but from architectural-level design constraints as well; and (3) the need for comprehensive benchmarking when different algorithmic models (e.g., MLP versus CNN versus HDC) as well as heterogeneous architectural solutions for said models (e.g., TPU-GPU hybrids) that may ultimately represent the ideal baseline/software-based solution are employed. This talk serves not only to illustrate what analysis is needed to determine whether industrial investment in a technology-enabled architecture is justifiable, but also as motivation for both (a) architectural modeling efforts and (b) analytical modeling tools to triage a large design space and identify the most meaningful/plausible points of comparison for a technology-driven architectural solution. (An illustrative sketch of variation-aware matrix-vector multiplication follows this table.) |
16:53 CET | FS5.2 | EXPLORING ACCELERATOR-CENTRIC EDGE ARCHITECTURES: FROM OPEN PLATFORMS TO SYSTEM SIMULATION Presenter: David Atienza, EPFL, CH Author: David Atienza, EPFL, CH Abstract Two complementary approaches have emerged as main avenues for evaluating novel technology-enabled accelerators for edge computing. The first one is based on their integration in systems comprising validated open hardware components (processors, memories, and peripherals) to derive prototype systems-on-chip. The second is to model the accelerator characteristics as a module in an entire system simulator infrastructure. In this talk, I will cover the pros and cons of these two approaches, focusing on recent works at the Embedded Systems Laboratory (ESL) of EPFL on the X-HEEP open hardware architectural template and the gem5-X system simulator. I will describe how we employed these frameworks to explore emerging computation (e.g., coarse-grained reconfiguration) and communication (e.g., in-package wireless) paradigms and report on their respective energy and performance benefits. |
17:15 CET | FS5.3 | ANALYTICAL MODELING TOOLS FOR RAPID AND ACCURATE DESIGN SPACE EXPLORATIONS OF TECHNOLOGY DRIVEN ARCHITECTURES Presenter: X. Sharon Hu, University of Notre Dame, US Author: X. Sharon Hu, University of Notre Dame, US Abstract This talk considers the design, validation, and use of analytical modeling tools to support cross-layer design exploration efforts, i.e., to evaluate the impact of technology-driven architectures at scales offering meaningful benefits on applications that are of relevance to the industry. Work will be framed in the context of tools used to evaluate in-memory computing (IMC) architectures, where the development of modeling and prediction tools that are indispensable for benchmarking IMC solutions in a cross-layer fashion will be discussed. The talk will also highlight unique challenges and needs when considering both the design of IMC circuits and architectures, as well as the infrastructure needed to rapidly and accurately evaluate them. |
17:38 CET | FS5.4 | NEURO-VECTOR-SYMBOLIC ARCHITECTURES: AN EFFICIENT ENGINE FOR PERCEPTION, REASONING, AND COMPUTATIONALLY-HARD PROBLEMS Presenter: Abbas Rahimi, IBM Research, CH Author: Abbas Rahimi, IBM Research, CH Abstract Neither deep neural nets nor symbolic AI alone has approached the kind of intelligence expressed in humans. This is mainly because neural nets are not able to decompose joint representations to obtain distinct objects (the so-called binding problem), while symbolic AI suffers from exhaustive rule searches, among other problems. These two problems are still pronounced in neuro-symbolic AI, which aims to combine the best of the two paradigms. The two problems can be addressed with our proposed neuro-vector-symbolic architecture (NVSA). In this talk, we show how the realization of NVSA can be informed and benefitted by the physical properties of in-memory computing (IMC) hardware. Particularly, we demonstrate how NVSA exploits O(1) MVM, in-situ progressive crystallization, and intrinsic stochasticity of IMC based on phase-change memory devices to enable on-device few-shot continual learning and to solve computationally hard problems such as factorization of holographic perceptual representations and visual abstract reasoning. |
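As an editor's illustration of the device-aware evaluation discussed in FS5.1 above (not the speaker's model), the sketch below performs a matrix-vector multiplication in which each crossbar conductance is perturbed by Gaussian variation, so the deviation from an ideal digital baseline can be measured; the weight matrix, input vector and variation level are hypothetical.

```python
import numpy as np

def crossbar_mvm(weights, x, sigma=0.05, rng=None):
    """Ideal vs. variation-affected matrix-vector product on a crossbar.

    weights : (rows, cols) array mapped to device conductances.
    sigma   : relative std-dev of per-device conductance variation.
    """
    rng = np.random.default_rng(rng)
    noisy = weights * (1.0 + sigma * rng.standard_normal(weights.shape))
    return weights @ x, noisy @ x   # (ideal result, result with device variation)

# Hypothetical layer: 4 outputs, 8 inputs.
rng = np.random.default_rng(0)
W = rng.standard_normal((4, 8))
x = rng.standard_normal(8)
ideal, noisy = crossbar_mvm(W, x, sigma=0.05, rng=1)
print(np.abs(ideal - noisy))  # per-output error caused by device variation
```

Sweeping `sigma` and re-running the end application is the simplest way to see whether accuracy loss comes from the devices themselves or from architectural constraints, which is the kind of cross-layer question the session addresses.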
FS6 Focus session: New perspectives for neuromorphic cameras: algorithms, architectures and circuits for event-based CMOS sensors
Add this session to my calendar
Date: Tuesday, 18 April 2023
Time: 16:30 CET - 18:00 CET
Location / Room: Okapi Room 0.8.1
Session chair:
Pascal VIVET, CEA-List, FR
Session co-chair:
Christoph Posch, PROPHESEE, FR
Time | Label | Presentation Title Authors |
---|---|---|
16:30 CET | FS6.1 | THE CNN VS. SNN EVENT-CAMERA DICHOTOMY AND PERSPECTIVES FOR EVENT-GRAPH NEURAL NETWORKS Speaker: Thomas DALGATY, CEA-LIST, FR Authors: Thomas DALGATY1, Thomas Mesquida2, Damien JOUBERT3, Amos SIRONI3, Pascal Vivet4 and Christoph POSCH3 1CEA-List, FR; 2Université Grenoble Alpes, CEA, LETI, MINATEC Campus, FR; 3Prophesee, FR; 4CEA-Leti, FR Abstract Since neuromorphic event-based pixels and cameras were first proposed, the technology has greatly advanced, such that there now exist several industrial sensors, processors and toolchains. This has also paved the way for a blossoming new branch of AI dedicated to processing the event-based data these sensors generate. However, there is still much debate about which of these approaches can best harness the inherent sparsity, low latency and fine spatiotemporal structure of event data to obtain better performance, and do so using the least time and energy. The latter is of particular importance since these algorithms will typically be employed near or inside the sensor at the edge, where the power supply may be heavily constrained. The two predominant methods to process visual events - convolutional and spiking neural networks - are fundamentally opposed in principle. The former converts events into static 2D frames such that they are compatible with 2D convolutions, while the latter computes in an event-driven fashion naturally compatible with the raw data. We review this dichotomy by studying recent algorithmic and hardware advances of both approaches. We conclude with a perspective on an emerging alternative approach whereby events are transformed into a graph data structure and thereafter processed using techniques from the domain of graph neural networks. Despite promising early results, algorithmic and hardware innovations are required before this approach can be applied close to or within the event-based sensor. |
16:53 CET | FS6.2 | BRAIN-INSPIRED SPATIOTEMPORAL PROCESSING ALGORITHMS FOR EFFICIENT EVENT-BASED PERCEPTION Speaker: Saibal Mukhopadhyay, Georgia Tech, US Authors: Biswadeep Chakraborty, Uday Kamal, Xueyuan She, Saurabh Dash and Saibal Mukhopadhyay, Georgia Tech, US Abstract Neuromorphic event-based cameras can unlock the true potential of bio-plausible sensing systems that mimic our human perception. However, efficient spatiotemporal processing algorithms must enable their low-power, low-latency, real-world application. In this talk, we highlight our recent efforts in this direction. Specifically, we talk about how brain-inspired algorithms such as spiking neural networks (SNNs) can approximate spatiotemporal sequences efficiently without requiring complex recurrent structures. Next, we discuss their event-driven formulation for training and inference that can achieve real-time throughput on existing commercial hardware. We also show how a brain-inspired recurrent SNN can be modeled to perform on event-camera data. Finally, we will talk about the potential application of associative memory structures to efficiently build representation for event-based perception. |
17:15 CET | FS6.3 | LOW-THROUGHPUT EVENT-BASED IMAGE SENSORS AND PROCESSING Speaker: Laurent FESQUET, Grenoble INP / TIMA, FR Authors: Laurent Fesquet1, Rosalie TRAN2, Xavier LESAGE2, Mohamed Akrarai3 and Gilles Sicard4 1TIMA - Grenoble Institute of Technology, FR; 2University Grenoble Alpes, FR; 3University of Grenoble (UGA), FR; 4CEA-Leti, FR Abstract This paper presents new kinds of image sensors based on TFS (Time to First Spike) pixels and DVS (Dynamic Vision Sensor) pixels, which take advantage of non-uniform sampling and redundancy suppression to reduce the data throughput. The DVS pixels only detect a luminance variation, while TFS pixels quantify luminance by measuring the time required to cross a threshold. Such image sensors output requests through an Address Event Representation (AER), which helps to reduce the throughput. The resulting event bitstream is composed of time, position, polarity, and magnitude information. Such a bitstream offers new possibilities for image processing, such as event-by-event object tracking. In particular, we propose processing steps to cluster events, filter noise and extract other useful features, such as velocity estimation. (An illustrative sketch of event-by-event noise filtering follows this table.) |
17:38 CET | FS6.4 | HARDWARE ARCHITECTURES FOR PROCESSING AND LEARNING WITH EVENT-BASED DATA Presenter: Charlotte Frenkel, TU Delft, NL Author: Charlotte Frenkel, TU Delft, NL Abstract By encoding visual information as a temporally and spatially sparse event stream that preserves microsecond-scale dynamics, neuromorphic cameras are a key enabler for low-power low-latency vision applications. However, as event-driven computation implies less structure in memory access patterns, it is still an open challenge to design hardware architectures that can efficiently exploit the event-based nature of neuromorphic cameras. I will survey emerging hardware-algorithm co-design techniques for processing and learning with event-based data, highlighting the current solutions and the next steps toward adaptive neuromorphic smart sensors. |
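As an editor's illustration of event-by-event processing of an AER stream, mentioned in FS6.3 above (not the authors' pipeline), the sketch below represents events as (timestamp, x, y, polarity) tuples and drops events that have no recent spatial neighbour, a common background-activity noise filter; the sensor size, time window and sample events are hypothetical.

```python
def filter_isolated_events(events, window_us=10_000, width=32, height=32):
    """Keep an event only if a neighbouring pixel fired within `window_us`.

    events: iterable of (t_us, x, y, polarity) tuples, assumed sorted by time.
    """
    last_seen = [[-10**12] * width for _ in range(height)]  # last event time per pixel
    kept = []
    for t, x, y, pol in events:
        neighbours = [(x + dx, y + dy) for dx in (-1, 0, 1) for dy in (-1, 0, 1)
                      if (dx, dy) != (0, 0)]
        if any(0 <= nx < width and 0 <= ny < height and
               t - last_seen[ny][nx] <= window_us for nx, ny in neighbours):
            kept.append((t, x, y, pol))
        last_seen[y][x] = t
    return kept

# Hypothetical stream: two spatially correlated events and one isolated event.
stream = [(1000, 5, 5, 1), (1500, 6, 5, 1), (2000, 20, 20, 0)]
print(filter_isolated_events(stream))  # only the event at (6, 5) survives
```

Because each event is touched once and only a small neighbourhood is inspected, such filters fit the low-throughput, event-driven processing that the session advocates.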
SE4 Hardware accelerators serving efficient machine learning software architectures
Add this session to my calendar
Date: Tuesday, 18 April 2023
Time: 16:30 CET - 18:00 CET
Location / Room: Okapi Room 0.8.2
Session chair:
Smail Niar, INSA Hauts-de-France and CNRS, FR
16:30 CET until 16:54 CET: Pitches of regular papers
16:54 CET until 18:00 CET: Interactive technical presentations by the authors of regular papers and extended abstracts
Regular Papers
Time | Label | Presentation Title Authors |
---|---|---|
16:30 CET | SE4.1 | PIPE-BD: PIPELINED PARALLEL BLOCKWISE DISTILLATION Speaker: Hongsun Jang, Seoul National University, KR Authors: Hongsun Jang1, Jaewon Jung2, Jaeyong Song3, Joonsang Yu4, Youngsok Kim3 and Jinho Lee1 1Seoul National University, KR; 2Yonsei University, KR; 3Yonsei University, KR; 4NAVER CLOVA, KR Abstract Training large deep neural network models is highly challenging due to their tremendous computational and memory requirements. Blockwise distillation provides one promising method towards faster convergence by splitting a large model into multiple smaller models. In state-of-the-art blockwise distillation methods, training is performed block-by-block in a data-parallel manner using multiple GPUs. To produce inputs for the student blocks, the teacher model is executed from the beginning until the current block under training. However, this results in a high overhead of redundant teacher execution, low GPU utilization, and extra data loading. To address these problems, we propose Pipe-BD, a novel parallelization method for blockwise distillation. Pipe-BD aggressively utilizes pipeline parallelism for blockwise distillation, eliminating redundant teacher block execution and increasing per-device batch size for better resource utilization. We also extend Pipe-BD to hybrid parallelism for efficient workload balancing. As a result, Pipe-BD achieves significant acceleration without modifying the mathematical formulation of blockwise distillation. We implement Pipe-BD on PyTorch, and experiments reveal that Pipe-BD is effective on multiple scenarios, models, and datasets. |
16:33 CET | SE4.2 | LAYER-PUZZLE: ALLOCATING AND SCHEDULING MULTI-TASK ON MULTI-CORE NPUS BY USING LAYER HETEROGENEITY Speaker: Chengsi Gao, Chinese Academy of Sciences, CN Authors: Chengsi Gao, Ying Wang, Cheng Liu, Mengdi Wang, Weiwei Chen, Yinhe Han and Lei Zhang, Chinese Academy of Sciences, CN Abstract In this work, we propose Layer-Puzzle, a multi-task allocation and scheduling framework for multi-core NPUs. Based on the proposed latency-prediction model and dynamic parallelization scheme, Layer-Puzzle can generate near-optimal results for each layer under given hardware resources and traffic congestion levels. As an online scheduler, Layer-Puzzle performs a QoS-aware and dynamic scheduling method that picks the superior version from the previously compiled results and co-runs the selected tasks to improve system performance. Our experiments on MLPerf show that Layer-Puzzle can achieve up to 1.61X, 1.53X, and 1.95X improvement in ANTT, STP, and PE utilization, respectively. |
16:36 CET | SE4.3 | DYNAMIC TASK REMAPPING FOR RELIABLE CNN TRAINING ON RERAM CROSSBARS Speaker: Chung-Hsuan Tung, Duke University, TW Authors: Chung-Hsuan Tung1, Biresh Kumar Joardar2, Partha Pratim Pande3, Jana Doppa3, Hai (Helen) Li1 and Krishnendu Chakrabarty1 1Duke University, US; 2University of Houston, US; 3Washington State University, US Abstract A ReRAM crossbar-based computing system (RCS) can accelerate CNN training. However, hardware faults due to manufacturing defects and limited endurance impede the widespread adoption of RCS. We propose a dynamic task remapping-based technique for reliable CNN training on faulty RCS. Experimental results demonstrate that the proposed low-overhead method incurs only 0.85% accuracy loss on average while training popular CNNs such as VGGs, ResNets, and SqueezeNet with the CIFAR-10, CIFAR-100, and SVHN datasets in the presence of faults. |
16:39 CET | SE4.4 | MOBILE ACCELERATOR EXPLOITING SPARSITY OF MULTI-HEADS, LINES AND BLOCKS IN TRANSFORMERS IN COMPUTER VISION Speaker: Eunji Kwon, Pohang University of Science and Technology, KR Authors: Eunji Kwon, Haena Song, Jihye Park and Seokhyeong Kang, Pohang University of Science and Technology, KR Abstract It is difficult to employ transformer models for computer vision in mobile devices due to their memory- and computation-intensive properties. Accordingly, there is ongoing research on various methods for compressing transformer models, such as pruning. However, general computing platforms such as central processing units (CPUs) and graphics processing units (GPUs) are not energy-efficient to accelerate the pruned model due to their structured sparsity. This paper proposes a low-power accelerator for transformers in computer vision with various sizes of structured sparsity induced by pruning with different granularity. In this study, we can accelerate a transformer that has been pruned in a head-wise, line-wise, or block-wise manner. We developed a head scheduling algorithm to support head-wise skip operations and resolve the processing engine (PE) load imbalance problem caused by different amounts of computations in one head. Moreover, we implemented a sparse general matrix-to-matrix multiplication (sparse GEMM) that supports line-wise and block-wise skipping. As a result, when compared with a mobile GPU and mobile CPU respectively, our proposed accelerator achieved 6.1x and 13.6x improvements in energy efficiency for the detection transformer (DETR) model and achieved approximately 2.6x and 7.9x improvements in the energy efficiency on average for the vision transformer (ViT) models. |
16:42 CET | SE4.5 | RAWATTEN: RECONFIGURABLE ACCELERATOR FOR WINDOW ATTENTION IN HIERARCHICAL VISION TRANSFORMERS Speaker: Wantong Li, Georgia Tech, US Authors: Wantong Li, Yandong Luo and Shimeng Yu, Georgia Tech, US Abstract After the success of the transformer networks on natural language processing (NLP), the application of transformers to computer vision has followed suit to deliver unprecedented performance gains on vision tasks including image recognition and object detection. The multi-head self-attention (MSA) is the key component in transformers, allowing the models to learn the amount of attention paid to each input position. In particular, hierarchical vision transformers (HVTs) utilize window-based MSA to capture the benefits of the attention mechanism at various scales for further accuracy enhancements. Despite its strong modeling capability, MSA involves complex operations that make transformers prohibitively costly for hardware deployment. Existing hardware accelerators have mainly focused on the MSA workloads in NLP applications, but HVTs involve different parameter dimensions, input sizes, and data reuse opportunities. Therefore, we design the RAWAtten architecture to target the window-based MSA workloads in HVT models. Each w-core in RAWAtten contains near-memory compute engines for linear layers, MAC arrays for intermediate matrix multiplications, and a lightweight reconfigurable softmax. The w-cores can be combined at runtime to perform hierarchical processing to accommodate varying model parameters. Compared to the baseline GPU, RAWAtten at 40nm provides 2.4× average speedup for running the window-MSA workloads in Swin transformer models while consuming only a fraction of GPU power. In addition, RAWAtten achieves 2× area efficiency compared to prior ASIC accelerator for window-MSA. |
16:45 CET | SE4.6 | M5: MULTI-MODAL MULTI-TASK MODEL MAPPING ON MULTI-FPGA WITH ACCELERATOR CONFIGURATION SEARCH Speaker: Akshay Karkal Kamath, Georgia Tech, US Authors: Akshay Kamath, Stefan Abi-Karam, Ashwin Bhat and Cong "Callie" Hao, Georgia Tech, US Abstract Recent machine learning (ML) models have advanced from single-modality single-task to multi-modality multi-task (MMMT). MMMT models typically have multiple backbones of different sizes along with complicated connections, exposing great challenges for hardware deployment. For scalable and energy-efficient implementations, multi-FPGA systems are emerging as the ideal design choices. However, finding the optimal solutions for mapping MMMT models onto multiple FPGAs is non-trivial. Existing mapping algorithms focus on either streamlined linear deep neural network architectures or only the critical path of simple heterogeneous models. Direct extensions of these algorithms for MMMT models lead to sub-optimal solutions. To address these shortcomings, we propose M5, a novel MMMT Model Mapping framework for Multi- FPGA platforms. In addition to handling multiple modalities present in the models, M5 can flexibly explore accelerator configurations and possible resource sharing opportunities to significantly improve the system performance. For various computation-heavy MMMT models, experiment results demonstrate that M5 can remarkably outperform existing mapping methods and lead to an average reduction of 35%, 62%, and 70% in the number of low-end, mid-end, and high-end FPGAs required to achieve the same throughput, respectively. Code is available publicly. |
16:48 CET | SE4.7 | STEPPINGNET: A STEPPING NEURAL NETWORK WITH INCREMENTAL ACCURACY ENHANCEMENT Speaker: Wenhao Sun, TU Munich, DE Authors: Wenhao Sun1, Grace Li Zhang2, Xunzhao Yin3, Cheng Zhuo3, Huaxi Gu4, Bing Li1 and Ulf Schlichtmann1 1TU Munich, DE; 2TU Darmstadt, DE; 3Zhejiang University, CN; 4Xidian University, CN Abstract Deep neural networks (DNNs) have successfully been applied in many fields in the past decades. However, the increasing number of multiply-and-accumulate (MAC) operations in DNNs prevents their application in resource-constrained and resource-varying platforms, e.g., mobile phones and autonomous vehicles. In such platforms, neural networks need to provide acceptable results quickly and the accuracy of the results should be able to be enhanced dynamically according to the computational resources available in the computing system. To address these challenges, we propose a design framework called SteppingNet. SteppingNet constructs a series of subnets whose accuracy is incrementally enhanced with more MAC operations. Therefore, this design allows a trade-off between accuracy and latency. In addition, the larger subnets in SteppingNet are built upon smaller subnets, so that the results of the latter can directly be reused in the former without recomputation. This property allows SteppingNet to decide on-the-fly whether to enhance the inference accuracy by executing further MAC operations. Experimental results demonstrate that SteppingNet provides an effective incremental accuracy improvement and its inference accuracy consistently outperforms the state-of-the-art work under the same limit of computational resources. |
16:51 CET | SE4.8 | AIRCHITECT: AUTOMATING HARDWARE ARCHITECTURE AND MAPPING OPTIMIZATION Speaker: Ananda Samajdar, Georgia Tech / IBM Research, US Authors: Ananda Samajdar1, Jan Moritz Joseph2 and Tushar Krishna1 1Georgia Tech, US; 2RWTH Aachen University, DE Abstract Design space exploration and optimization is an essential but iterative step in custom accelerator design, involving costly search-based methods to extract maximum performance and energy efficiency. State-of-the-art methods employ data-centric approaches to reduce the cost of each iteration but still rely on search algorithms to obtain the optima. This work proposes a learned, constant-time optimizer that uses a custom recommendation network called AIRCHITECT, which is capable of learning the architecture design and mapping space with a 94.3% test accuracy, and of predicting optimal configurations that achieve on average (GeoMean) 99.9% of the best possible performance on a test dataset with 10^5 GEMM (GEneral Matrix-matrix Multiplication) workloads. |
Extended Abstracts
Time | Label | Presentation Title Authors |
---|---|---|
16:54 CET | SE4.9 | ACCELERATING INFERENCE OF 3D-CNN ON ARM MANY-CORE CPU VIA HIERARCHICAL MODEL PARTITION Speaker: Jiazhi Jiang, Sun Yat-sen University, CN Authors: Jiazhi Jiang, ZiJian Huang, Dan Huang, Jiangsu Du and Yutong Lu, Sun Yat-sen University, CN Abstract Many applications such as biomedical analysis and scientific data analysis involve analyzing volumetric data. This spawns huge demand for 3D CNNs. Although accelerators such as GPUs may provide higher throughput on deep learning applications, they may not be available in all scenarios. The CPU, especially the many-core CPU, remains an attractive choice for deep learning in many scenarios. In this paper, we propose an inference solution that targets the emerging ARM many-core CPU platform. A hierarchical partition approach is proposed to accelerate 3D-CNN inference by exploiting the characteristics of memory and cache on ARM many-core CPUs. |
16:54 CET | SE4.10 | CEST: COMPUTATION-EFFICIENT N:M SPARSE TRAINING FOR DEEP NEURAL NETWORKS Speaker: Wei Sun, Eindhoven University of Technology, NL Authors: Chao Fang1, Wei Sun2, Aojun Zhou3 and Zhongfeng Wang1 1Nanjing University, CN; 2Eindhoven University of Technology, NL; 3The Chinese University of Hong Kong, HK Abstract N:M fine-grained structured sparsity has attracted attention due to its practical sparsity ratio and hardware-friendly pattern. However, the potential to accelerate N:M sparse deep neural network (DNN) training has not been fully exploited, and there is a lack of efficient hardware supporting N:M sparse training. To tackle these challenges, this paper presents a computation-efficient scheme for N:M sparse DNN training, called CEST. A bidirectional weight pruning method, dubbed BDWP, is first proposed to significantly reduce the computational cost while maintaining model accuracy. A sparse accelerator, namely SAT, is further developed to neatly support both regular dense operations and N:M sparse operations. Experimental results show that CEST significantly improves the training throughput by 1.89−12.49× and the energy efficiency by 1.86−2.76×. (An illustrative sketch of the N:M sparsity pattern follows this table.) |
16:54 CET | SE4.11 | BOMP-NAS: BAYESIAN OPTIMIZATION MIXED PRECISION NAS Speaker: Floran de Putter, Eindhoven University of Technology, NL Authors: David van Son1, Floran de Putter1, Sebastian Vogel2 and Henk Corporaal1 1Eindhoven University of Technology, NL; 2NXP Semiconductors, NL Abstract Bayesian Optimization Mixed-Precision Neural Architecture Search (BOMP-NAS) is an approach to quantization-aware neural architecture search that leverages both Bayesian optimization and mixed-precision quantization to efficiently search for compact, high-performance deep neural networks. It is able to find neural networks that achieve state-of-the-art accuracy with less search time; compared to the closest related work, BOMP-NAS finds these neural networks in 6x less search time. |
16:54 CET | SE4.12 | A MACHINE-LEARNING-GUIDED FRAMEWORK FOR FAULT-TOLERANT DNNS Speaker: Marcello Traiola, Inria Rennes / IRISA Lab, FR Authors: Marcello Traiola1, Angeliki Kritikakou2 and Olivier Sentieys3 1Inria / IRISA, FR; 2Université de Rennes | Inria | CNRS | IRISA, FR; 3INRIA, FR Abstract Deep Neural Networks (DNNs) show promising performance in several application domains. Nevertheless, DNN results may be incorrect, not only because of the network's intrinsic inaccuracy, but also due to faults affecting the hardware. Ensuring the fault tolerance of DNNs is crucial, but common fault tolerance approaches are not cost-effective due to the prohibitive overheads for large DNNs. This work proposes a comprehensive framework to assess the fault tolerance of DNN parameters and cost-effectively protect them. As a first step, the proposed framework performs a statistical fault injection. The results are used in the second step with classification-based machine learning methods to obtain a bit-accurate prediction of the criticality of all network parameters. Last, Error Correction Codes (ECCs) are selectively inserted to protect only the critical parameters, hence entailing low cost. Thanks to the proposed framework, we explored and protected two Convolutional Neural Networks (CNNs), each with four different data encodings. The results show that it is possible to protect the critical network parameters with selective ECCs while saving up to 79% memory w.r.t. conventional ECC approaches. |
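To illustrate the N:M fine-grained structured sparsity that SE4.10 above builds on (an editor's sketch of the general pattern, not the paper's BDWP method), the snippet below enforces a 2:4 pattern on a weight matrix: in every group of four consecutive weights along a row, only the two largest-magnitude weights are kept; the example matrix is hypothetical.

```python
import numpy as np

def prune_n_of_m(weights, n=2, m=4):
    """Zero all but the n largest-magnitude weights in every group of m.

    weights: 2-D array whose second dimension is a multiple of m.
    """
    rows, cols = weights.shape
    groups = weights.reshape(rows, cols // m, m)
    # indices of the (m - n) smallest-magnitude entries in each group
    drop = np.argsort(np.abs(groups), axis=-1)[..., : m - n]
    pruned = groups.copy()
    np.put_along_axis(pruned, drop, 0.0, axis=-1)
    return pruned.reshape(rows, cols)

W = np.array([[0.9, -0.1, 0.05, -1.2,  0.3, 0.2, -0.7, 0.01]])
print(prune_n_of_m(W))  # each group of 4 keeps only its 2 largest-magnitude weights
```

The regularity of the pattern (a fixed n out of every m) is what makes such sparsity hardware-friendly compared to unstructured pruning.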
SS3 Secure circuits and architectures
Add this session to my calendar
Date: Tuesday, 18 April 2023
Time: 16:30 CET - 18:00 CET
Location / Room: Gorilla Room 1.5.1
Session chair:
Jo Vliegen, KU Leuven, BE
16:30 CET until 16:57 CET: Pitches of regular papers
16:57 CET until 18:00 CET: Interactive technical presentations by the authors of regular papers and extended abstracts
Regular Papers
Time | Label | Presentation Title Authors |
---|---|---|
16:30 CET | SS3.1 | ESTABLISHING DYNAMIC SECURE SESSIONS FOR ECQV IMPLICIT CERTIFICATES IN EMBEDDED SYSTEMS Speaker: Fikret Basic, TU Graz, AT Authors: Fikret Basic1, Christian Steger1 and Robert Kofler2 1TU Graz, AT; 2NXP Semiconductors Austria GmbH Co & KG, AT Abstract Implicit certificates are gaining ever more prominence in constrained embedded devices, in both the internet of things (IoT) and automotive domains. They present a resource-efficient security solution against common threat concerns. The computational requirements are not the main issue anymore, with the focus now shifting to determining a good balance between the provided security level and the derived threat model. A security aspect that often gets overlooked is the establishment of secure communication sessions, as most design solutions are based only on the use of static key derivation, and therefore lack the perfect forward secrecy. This leaves the transmitted data open for potential future exposures as keys are tied to the certificates rather than the communication sessions. We aim to close this gap and present a design that utilizes the Station to Station (STS) protocol with implicit certificates. In addition, we propose potential protocol optimization implementation steps and run a comprehensive study on the performance and security level between the proposed design and the state-of-the-art key derivation protocols. In our comparative study, we show that we are able to mitigate many session-related security vulnerabilities that would otherwise remain open with only a slight computational increase of 20% compared to a static elliptic curve digital signature algorithm (ECDSA) key derivation. |
16:33 CET | SS3.2 | CACHE SIDE-CHANNEL ATTACKS AND DEFENSES OF THE SLIDING WINDOW ALGORITHM IN TEES Speaker: Zili KOU, Hong Kong University of Science and Technology, CN Authors: Zili KOU1, Sharad Sinha2, Wenjian HE1 and Wei ZHANG1 1Hong Kong University of Science and Technology, HK; 2Indian Institute of Technology Goa, IN Abstract Trusted execution environments (TEEs) such as SGX on x86 and TrustZone on ARM are announced to protect trusted programs against even a malicious operating system (OS); however, they are still vulnerable to cache side-channel attacks. In the new threat model of TEEs, kernel-privileged attackers are more capable, thus the effectiveness of previous defenses needs to be carefully reevaluated. Aimed at the sliding window algorithm of RSA, this work analyzes the latest defenses from the TEE attacker's point of view and pinpoints their attack surfaces and vulnerabilities. The mainstream cryptography libraries are scrutinized, within which we attack and evaluate the implementations of Libgcrypt and Mbed TLS on a real-world ARM processor with TrustZone. Our attack successfully recovers the RSA key in the latest Mbed TLS design when it adopts a small window size, despite Mbed TLS taking a significant role in the ecosystem of ARM TrustZone. The possible countermeasures are finally presented together with the corresponding costs. |
16:36 CET | SS3.3 | THE FIRST CONCEPT AND REAL-WORLD DEPLOYMENT OF A GPU-BASED THERMAL COVERT CHANNEL: ATTACK AND COUNTERMEASURES Speaker: Jeferson Gonzalez, Karlsruhe Institute of Technology, CR Authors: Jeferson Gonzalez-Gomez1, Kevin Cordero-Zuniga2, Lars Bauer1 and Joerg Henkel1 1Karlsruhe Institute of Technology, DE; 2ITCR, CR Abstract Thermal covert channel (TCC) attacks have been studied as a threat to CPU-based systems over recent years. In this paper, we propose a new type of TCC attack that for the first time leverages the Graphics Processing Unit (GPU) of a system to create a stealthy communication channel between two malicious applications. We evaluate our new attack on two different real-world platforms: a GPU-dedicated general computing platform and a GPU-integrated embedded platform. Our results are the first to show that a GPU-based thermal covert channel attack is possible. From our experiments, we obtain a transmission rate of up to 8.75 bps with a very low error rate of less than 2% for a 12-bit packet size, which is comparable to CPU-based TCCs in the state of the art. Moreover, we show how existing state-of-the-art countermeasures for TCCs need to be extended to tackle the new GPU-based attack, at the cost of added overhead. To reduce this overhead, we propose our own DVFS-based countermeasure which mitigates the attack, while causing 2x less performance loss than the state-of-the-art countermeasure on a set of compute-intensive GPU benchmark applications. |
16:39 CET | SS3.4 | SIGFUZZ: A FRAMEWORK FOR DISCOVERING MICROARCHITECTURAL TIMING SIDE CHANNELS Speaker: Chathura Rajapaksha, Boston University, US Authors: Chathura Rajapaksha, Leila Delshadtehrani, Manuel Egele and Ajay Joshi, Boston University, US Abstract Timing side channels can be inadvertently introduced into processor microarchitecture during the design process, mainly due to optimizations carried out to improve processor performance. These timing side channels have been used in various attacks, including transient execution attacks on recent commodity processors. Hence, we need a tool to detect timing side channels during the design process. This paper presents SIGFuzz, a fuzzing-based framework for detecting microarchitectural timing side channels. A designer can use SIGFuzz to detect side channels early in the design flow and mitigate potential vulnerabilities associated with them. SIGFuzz generates a cycle-accurate microarchitectural trace for a program that executes on the target processor; it then uses two trace properties to identify side channels that would have been formed by the program. These two trace properties evaluate the effect of each instruction in the program on the timing of its prior and later instructions, respectively. SIGFuzz also uses a statistical distribution of execution delays of instructions with the same mnemonic to flag potential side channels that manifest with different operands of an instruction. Furthermore, SIGFuzz automatically groups the detected side channels based on the microarchitectural activity trace (i.e., signature) of the instruction that triggered them. We evaluated SIGFuzz on two real-world open-source processor designs, Rocket and BOOM, and found three new side channels and two known side channels. We present a novel Spectre-style attack on BOOM based on one of the newly detected side channels. |
16:42 CET | SS3.5 | RUN-TIME INTEGRITY MONITORING OF UNTRUSTWORTHY ANALOG FRONT-ENDS Speaker: Heba Salem, University of Edinburgh, GB Authors: Heba Salem and Nigel Topham, University of Edinburgh, GB Abstract Recent advances in hardware attacks, such as cross-talk and covert-channel based attacks, expose the structural and operational vulnerability of analog and mixed-signal circuit elements to the introduction of malicious and untrustworthy behaviour at run-time, potentially leading to adverse physical, personal, and environmental consequences. One untrustworthy behaviour of concern is the introduction of abnormal/unexpected frequencies to the signals at the analog/digital interface of a SoC, realised through intermittent bit-flipping or stuck-at-faults in the middle and lower bits of these signals. In this paper, we study the impact of these actions and propose integrity monitoring of signals of concern based on analysing the temporal and arithmetic relations between their samples. This paper presents a hybrid software/hardware machine-learning based framework that consists of two phases: a run-time monitoring phase and a trustworthiness assessment phase. The framework is evaluated with three different applications and its effectiveness in detecting the untrustworthy behaviour of concern is verified. This framework is device, application, and architecture agnostic, and relies only on analysing the output of the analog front-end, allowing its implementation in SoCs with on-chip and custom analog front-ends as well as those with outsourced and commercial off-the-shelf (COTS) analog front-ends. |
16:45 CET | SS3.6 | SPOILER-ALERT: DETECTING SPOILER ATTACKS USING A CUCKOO FILTER Speaker: Jinhua Cui, Hunan University, CN Authors: Jinhua Cui, Yiyun Yin, Congcong Chen and Jiliang Zhang, Hunan University, CN Abstract Spoiler attacks leak physical address information, which is exploited to accelerate reverse engineering of virtual-to-physical address mapping, thus greatly boosting Rowhammer and cache attacks. However, existing approaches that detect data-leakage attacks no longer suit the requirements of identifying Spoiler. This paper proposes SPOILER-ALERT, the first hardware-level mechanism to detect the address-leakage Spoiler attacks in real time. It leverages a cuckoo filter module embedded into the Memory Order Buffer component to screen buffer addresses on-the-fly. We further optimise the filtering algorithm to reduce false positives. We assess the effectiveness and performance based on prototype implementations, which achieve a detection rate of 99.99% and negligible performance loss. Finally, we discuss potential reactions of our detection mechanism after a Spoiler attack is discovered. (An illustrative sketch of a cuckoo filter follows this table.) |
16:48 CET | SS3.7 | HUNTER: HARDWARE UNDERNEATH TRIGGER FOR EXPLOITING SOC-LEVEL VULNERABILITIES Speaker: Farimah Farahmandi, University of Florida, US Authors: Sree Ranjani Rajendran1, Shams Tarek1, Benjamin Myers Hicks1, Hadi Mardani Kamali1, Farimah Farahmandi1 and Mark Tehranipoor2 1University of Florida, US; 2Intel Charles E. Young Preeminence Endowed Chair Professor in Cybersecurity, Associate Chair for Research and Strategic Initiatives, ECE Department, University of Florida, US Abstract Systems-on-chip (SoCs) have become increasingly large and complex, resulting in new threats and vulnerabilities, mainly related to system-level flaws. However, the system-level verification process, whose violation may lead to exploiting a hardware vulnerability, is not studied comprehensively due to the lack of decisive (security) requirements and properties from the SoC designer's perspective. To enable a more comprehensive verification for system-level properties, this paper presents HUnTer (Hardware Underneath Trigger), a framework for identifying sets (sequences) of instructions at the processor unit (PU) that unveils the underneath hardware vulnerabilities. The HUnTer framework automates (i) threat modeling, (ii) threat-based formal verification, (iii) generation of counterexamples, and (iv) generation of snippet code for exploiting the vulnerability. The HUnTer framework also defines a security coverage metric (HUnT_Coverage) to measure the performance and efficacy of the proposed approach. Using the HUnTer framework on a RISC-V-based open-source SoC architecture, we conduct a wide variety of case studies of Trust-HUB vulnerabilities to demonstrate the high effectiveness of the proposed framework. |
16:51 CET | SS3.8 | MAXIMIZING THE POTENTIAL OF CUSTOM RISC-V VECTOR EXTENSIONS FOR SPEEDING UP SHA-3 HASH FUNCTIONS Speaker: Huimin Li, TU Delft, NL Authors: Huimin Li1, Nele Mentens2 and Stjepan Picek3 1TU Delft, NL; 2KU Leuven, BE; 3Radboud University, NL Abstract SHA-3 is considered to be one of the most secure standardized hash functions. It relies on the Keccak-f[1,600] permutation, which operates on an internal state of 1,600 bits, mostly represented as a 5×5×64-bit matrix. While existing implementations process the state sequentially in chunks of typically 32 or 64 bits, the Keccak-f[1,600] permutation can benefit a lot from speedup through parallelization. This paper is the first to explore the full potential of parallelization of Keccak-f[1,600] in RISC-V based processors through custom vector extensions on 32-bit and 64-bit architectures. We analyze the Keccak-f[1,600] permutation, composed of five different step mappings, and propose ten custom vector instructions to speed up the computation. We realize these extensions in a SIMD processor described in SystemVerilog. We compare the performance of our designs to existing architectures based on vectorized application-specific instruction set processors (ASIP). We show that our designs outperform all related work in throughput due to our carefully selected custom vector instructions. |
16:54 CET | SS3.9 | PRIVACY-BY-SENSING WITH TIME-DOMAIN DIFFERENTIALLY-PRIVATE COMPRESSED SENSING Speaker: Steven Davis, University of Notre Dame, US Authors: Jianbo Liu, Boyang Cheng, Pengyu Zeng, Steven Davis, Muya Chang and Ningyuan Cao, University of Notre Dame, US Abstract With the ubiquitous IoT sensors and enormous real-time data generation, data privacy is becoming a critical societal concern. State-of-the-art privacy protection methods all demand significant hardware overhead due to computation-insensitive algorithms and a divided sensor/security architecture. In this paper, we propose a generic time-domain circuit architecture that protects raw data by enabling a differentially-private compressed sensing (DP-CS) algorithm secured by physical unclonable functions (PUFs). To address privacy concerns and hardware overhead at the same time, a robust unified PUF and time-domain mixed-signal (TD-MS) module is designed, where the PUF enables private and secure entropy generation. To evaluate the proposed design against a digital baseline, we performed experiments based on synthesized circuits and SPICE simulation, and measured a 2.9x area reduction and 3.2x energy gains. We also measured high-quality PUF generation with the TD-MS circuit, with an inter-die Hamming distance of 52% and a low intra-die Hamming distance of 2.8%. Furthermore, we performed attack and algorithm performance measurements demonstrating that the proposed design preserves data privacy even under attack and that the machine learning performance has minimal degradation (within 2%) compared to the digital baseline. |
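As an editor's illustration of the cuckoo filter data structure mentioned in SS3.6 above (a generic membership filter, not the SPOILER-ALERT hardware module), the sketch below stores small fingerprints in one of two candidate buckets and relocates entries on collisions; bucket sizes, fingerprint width and the example addresses are hypothetical.

```python
import hashlib
import random

class CuckooFilter:
    """Minimal cuckoo filter: approximate set membership with 8-bit fingerprints."""

    def __init__(self, num_buckets=1024, bucket_size=4, max_kicks=500):
        self.buckets = [[] for _ in range(num_buckets)]  # num_buckets must be a power of two
        self.bucket_size = bucket_size
        self.max_kicks = max_kicks

    def _hash(self, data):
        return int.from_bytes(hashlib.sha256(data).digest()[:8], "big")

    def _fingerprint(self, item):
        return self._hash(item) & 0xFF or 1          # never use fingerprint 0

    def _indices(self, item, fp):
        i1 = self._hash(item) % len(self.buckets)
        i2 = (i1 ^ self._hash(bytes([fp]))) % len(self.buckets)
        return i1, i2

    def insert(self, item):
        fp = self._fingerprint(item)
        i1, i2 = self._indices(item, fp)
        for i in (i1, i2):
            if len(self.buckets[i]) < self.bucket_size:
                self.buckets[i].append(fp)
                return True
        i = random.choice((i1, i2))
        for _ in range(self.max_kicks):               # evict and relocate on collision
            j = random.randrange(len(self.buckets[i]))
            fp, self.buckets[i][j] = self.buckets[i][j], fp
            i = (i ^ self._hash(bytes([fp]))) % len(self.buckets)
            if len(self.buckets[i]) < self.bucket_size:
                self.buckets[i].append(fp)
                return True
        return False                                  # filter considered full

    def contains(self, item):
        fp = self._fingerprint(item)
        i1, i2 = self._indices(item, fp)
        return fp in self.buckets[i1] or fp in self.buckets[i2]

f = CuckooFilter()
f.insert(b"addr:0x7ffd1234")                          # hypothetical observed address
print(f.contains(b"addr:0x7ffd1234"), f.contains(b"addr:0xdeadbeef"))  # True, (almost surely) False
```

A filter like this answers membership queries in constant time with a small, bounded false-positive rate, which is why it suits on-the-fly screening of addresses in hardware.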
Extended Abstracts
Time | Label | Presentation Title Authors |
---|---|---|
16:57 CET | SS3.11 | ENERGY-EFFICIENT NTT DESIGN WITH ONE-BANK SRAM AND 2-D PE ARRAY Speaker: Jianan Mu, ICT, CAS, CN Authors: Jianan Mu1, HuaJie Tan2, Jiawen Wu2, Haotian Lu2, Chip-Hong Chang3, Shuai Chen4, Shengwen Liang1, Jing Ye1, Huawei Li1 and Xiaowei Li1 1ICT, CAS, CN; 2School of Microelectronics, Tianjin University, China, CN; 3School of Electrical and Electronic Engineering (EEE) of NTU, SG; 4Rock-solid Security Lab. of Binary Semiconductor Co., Ltd., CN Abstract In the Number Theoretic Transform (NTT) operation, more than half of the active energy consumption stems from memory accesses. Here, we propose a generalized design method to improve the energy efficiency of the NTT operation by considering the effect of processing element (PE) geometry and memory organization on the data flow between PEs and memory. To decrease the number of data bits that must be accessed from memory, a two-dimensional (2-D) PE array architecture is used. A pair of ping-pong buffers is proposed to transpose and swap the coefficients, enabling a single bank of memory to be used with the 2-D PE array and reducing the average memory bit-access energy without compromising the throughput. Our experimental results show that this design method can produce NTT accelerators with up to 69.8% savings in average energy consumption compared with existing designs based on multi-bank SRAM and one-bank SRAM with a one-dimensional PE array with the same number of PEs and total memory size. |
16:57 CET | SS3.12 | COFHEE: A CO-PROCESSOR FOR FULLY HOMOMORPHIC ENCRYPTION EXECUTION Speaker: Homer Gamil, New York University Abu Dhabi, GR Authors: Mohammed Nabeel Thari Moopan1, Deepraj Soni2, Mohammed Ashraf3, Mizan Gebremichael4, Homer Gamil3, Eduardo Chielle3, Ramesh Karri5, Mihai Sanduleanu4 and Michail Maniatakos3 1New York University, AE; 2New York University Tandon School of Engineering, US; 3New York University Abu Dhabi, AE; 4Khalifa University, AE; 5New York University, US Abstract In this paper, we present the blueprint of a specialized co-processor for Fully Homomorphic Encryption, dubbed CoFHEE. With a small design area of 12mm^2, CoFHEE incorporates ASIC implementations of fundamental polynomial operations, such as polynomial addition and subtraction, the Hadamard product, and the Number Theoretic Transform, which underlie all higher-level FHE primitives. CoFHEE natively supports polynomial degrees of up to n = 2^14 with a coefficient size of 128 bits. We evaluate our chip with performance and power experiments and compare it against state-of-the-art software implementations and other ASIC designs. A more elaborate description of the CoFHEE design can be found in [1]. |
16:57 CET | SS3.13 | A RAPID RESET 8-TRANSISTOR PHYSICALLY UNCLONABLE FUNCTION UTILISING POWER GATING Speaker: Yujin Zheng, Newcastle University, GB Authors: Yujin Zheng1, Alex Bystrov2 and Alex Yakovlev2 1Newcastle University, GB; 2Newcastle University, GB Abstract Physically Unclonable Functions (PUFs) need error correction when regenerating secret keys for cryptography. The proposed 8-Transistor (8T) PUF, which coordinates with the power-gating technique, makes a single evaluation cycle 1000 times faster than that of a 6T-SRAM PUF, at the cost of a 12.8% area increase. This design enables multiple evaluations even during in-field key regeneration, hence greatly reducing the number of errors and the hardware penalty for error correction. The 8T PUF derives from the 6T SRAM cell. It is built to eliminate data retention swiftly and maximise physical mismatches. A two-phase power-gating module is designed to provide rapid, controllable power-on/off cycles for the chosen PUF clusters in order to facilitate statistical measurements and curb the in-rush current, thereby enhancing PUF entropy and security. An architecture of the power-gated PUF is developed to accommodate fast multiple evaluations. Post-layout Monte Carlo simulations were performed with Cadence, and the extracted PUF responses were processed with Matlab to evaluate the 8T PUF performance and statistical metrics for subsequent inclusion into PUF responses. (An illustrative sketch of the inter-/intra-die Hamming-distance metrics follows this table.) |
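As an editor's illustration of the inter-die and intra-die Hamming-distance figures quoted in SS3.9 and relevant to the PUF evaluation in SS3.13 above (not the authors' evaluation scripts), the sketch below computes both metrics from PUF response bit strings; the response data are hypothetical.

```python
def hamming_pct(a, b):
    """Fractional Hamming distance between two equal-length bit strings, in %."""
    assert len(a) == len(b)
    return 100.0 * sum(x != y for x, y in zip(a, b)) / len(a)

def intra_die(responses):
    """Average distance between repeated readouts of the same PUF (reliability)."""
    ref = responses[0]
    return sum(hamming_pct(ref, r) for r in responses[1:]) / (len(responses) - 1)

def inter_die(golden_responses):
    """Average pairwise distance between different PUF instances (uniqueness)."""
    pairs = [(a, b) for i, a in enumerate(golden_responses)
             for b in golden_responses[i + 1:]]
    return sum(hamming_pct(a, b) for a, b in pairs) / len(pairs)

# Hypothetical 8-bit responses: three noisy readouts of chip 0, plus chip 1.
chip0_reads = ["10110010", "10110011", "00110010"]
chips = ["10110010", "01010110"]
print(intra_die(chip0_reads), inter_die(chips))  # 12.5 (low is good), 50.0 (ideal uniqueness)
```

Ideally intra-die distance approaches 0% (stable regeneration) while inter-die distance approaches 50% (unique per device), which is the benchmark against which figures such as 2.8% and 52% are judged.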
US2 Unplugged session
Add this session to my calendar
Date: Tuesday, 18 April 2023
Time: 16:30 CET - 18:00 CET
Location / Room: Nightingale Room 2.6.1/2
Come join us for stimulating brainstorm discussions in small groups about the future of digital engineering. Our focus will be on the digital twinning paradigm where virtual instances are created of a system as it is operated, maintained, and repaired (e.g., each individual car of a certain model). We investigate how to take advantage of this paradigm in engineering systems and what new system engineering approaches and architectures (hardware/software) and design workflows are needed and become possible.
PARTY DATE Party
Add this session to my calendar
Date: Tuesday, 18 April 2023
Time: 19:30 CET - 23:00 CET
Location / Room: Horta
Time | Label | Presentation Title Authors |
---|---|---|
19:30 CET | PARTY.1 | ARRIVAL AND WELCOME Presenter: Ian O'Connor, Lyon Institute of Nanotechnology, FR Authors: Ian O'Connor1 and Robert Wille2 1Lyon Institute of Nanotechnology, FR; 2TU Munich, DE Abstract Arrival and Welcome at DATE Party |
20:00 CET | PARTY.2 | PRESENTATION OF AWARDS Speaker: Jürgen Teich, Friedrich-Alexander-Universität Erlangen-Nürnberg, DE Authors: Jan Madsen1, David Atienza2, Ian O'Connor3, Robert Wille4 and Jürgen Teich5 1TU Denmark, DK; 2EPFL, CH; 3Lyon Institute of Nanotechnology, FR; 4TU Munich, DE; 5Friedrich-Alexander-Universität Erlangen-Nürnberg, DE Abstract Presentation of Awards |
20:15 CET | PARTY.3 | BEER AND CHOCOLATE: A MATCH MADE IN BELGIUM! Presenter: Werner Callebaut, Bierolade, beer sommelier and chocolate expert, BE Author: Werner Callebaut, Bierolade, beer sommelier and chocolate expert, BE Abstract It's hard to meet a truer Belgian: Werner Callebaut from Bierolade. He has two passions: chocolate and beer. With his years of experience as a beer sommelier and chocolate expert, he will explain to you this evening how to experience Belgian beer & chocolate. A story full of lovely anecdotes, tips & tricks and of course delicious beers & chocolates. |
20:30 CET | PARTY.4 | DATE PARTY WITH DRINKS & FOOD BARS Presenter: All Participants, DATE, BE Author: All Participants, DATE, BE Abstract Drinks and food bars at the DATE Party |
Wednesday, 19 April 2023
BPA10 Hardware Security
Add this session to my calendar
Date: Wednesday, 19 April 2023
Time: 08:30 CET - 10:30 CET
Location / Room: Okapi Room 0.8.3
Session chair:
Johanna Sepulveda, Airbus, DE
Time | Label | Presentation Title Authors |
---|---|---|
08:30 CET | BPA10.1 | SOCFUZZER: SOC VULNERABILITY DETECTION USING COST FUNCTION ENABLED FUZZ TESTING Speaker: Mark Tehranipoor, University of Florida, US Authors: Muhammad Monir Hossain1, Arash Vafaei1, Kimia Zamiri Azar1, Fahim Rahman1, Farimah Farahmandi1 and Mark Tehranipoor2 1University of Florida, US; 2Intel Charles E. Young Preeminence Endowed Chair Professor in Cybersecurity, Associate Chair for Research and Strategic Initiatives, ECE Department, University of Florida, US Abstract Modern System-on-Chips (SoCs), which integrate numerous complex and heterogeneous intellectual properties (IPs) and contain highly sensitive assets, have become the target of malicious attacks. However, security verification of these SoCs lags behind the advances in functional verification, mostly because it is difficult to formally define accurate threat model(s). A few recent studies have investigated the possibility of using fuzz testing for hardware-oriented vulnerability detection. However, they suffer from several limitations, namely a lack of cross-layer co-verification, the need for expert knowledge, and the inability to capture detailed hardware interactions. In this paper, we propose SoCFuzzer, an automated SoC verification approach assisted by fuzz testing for detecting SoC security vulnerabilities. Unlike previous hardware-oriented fuzz testing studies, which mostly rely on traditional (code) coverage-based metrics, in SoCFuzzer we develop (i) generic evaluation metrics for fuzzing the hardware domain, and (ii) a security-oriented cost function. This relieves designers of making correlations between coverage metrics, test data, and possible vulnerabilities. The SoCFuzzer cost functions are defined at a high level, allowing us to follow the gray-box model, which requires less detailed and interactive information from the design-under-test. Our experiments on an open-source RISC-V based SoC show the efficiency of these metrics and cost functions in generating cornerstone inputs that trigger the vulnerability conditions with faster convergence. |
08:55 CET | BPA10.2 | NON-PROFILED SIDE-CHANNEL ASSISTED FAULT ATTACK: A CASE STUDY ON DOMREP Speaker: Shivam Bhasin, Nanyang Technological University Singapore, SG Authors: Sayandeep Saha, Prasanna Ravi, Dirmanto Jap and Shivam Bhasin, Nanyang Technological University, SG Abstract Recent work has shown that Side-Channel Attacks (SCA) and Fault Attacks (FA) can be combined, forming an extremely powerful adversarial model which can bypass even some of the strongest protections against both FA and SCA. However, this strongest form of combined attack comes with some practical challenges -- 1) a profiled setting with multiple fault locations is needed; 2) fault models are restricted to single-bit set-reset/flips; 3) the input needs to be repeated several times. In this paper, we propose a new combined attack strategy called SCA-NFA that works in a non-profiled setting. Assuming knowledge of plaintexts/ciphertexts and exploiting bitsliced implementations of modern ciphers, we further relax the assumptions on the fault model and the number of fault locations -- a random multi-bit fault at a single fault location is sufficient for recovering several secret bits. Furthermore, the inputs are allowed to vary, which is required in several practical use cases. The attack is validated on a recently proposed countermeasure called DOMREP, which individually provides SCA and FA protection of arbitrary order. Practical validation on an open-source masked implementation of GIMLI with the DOMREP extension on an STM32F407G, using electromagnetic fault injection and electromagnetic SCA, shows that SCA-NFA succeeds in around 10000 measurements. |
09:20 CET | BPA10.3 | EFFICIENT SOFTWARE MASKING OF AES THROUGH INSTRUCTION SET EXTENSIONS Speaker: Songqiao Cui, KU Leuven, BE Authors: Songqiao Cui and Josep Balasch, KU Leuven, BE Abstract Masking is a well-studied countermeasure to protect software implementations against side-channel attacks. For the case of AES, incorporating masking often requires implementing internal transformations using finite field arithmetic. This results in significant performance overheads, mostly due to finite field multiplications, which worsen even further when no lookup tables are used. In this work, we extend a RISC-V core with custom instructions to accelerate AES finite field arithmetic. With a 3.3% area increase, we measure 7.2x and 5.4x speedups over software-only implementations of first-order Boolean Masking and Inner Product Masking, respectively. We also investigate vectorized instructions capable of exploiting the intra-block and inter-block parallelism in the implementation. Our implementations avoid the use of lookup tables, run in constant time, and show no evidence of first-order leakage when evaluated on an FPGA. |
09:45 CET | BPA10.4 | INTERACTIVE TECHNICAL PRESENTATIONS BY THE AUTHORS Speaker: Authors of the session, DATE, BE Author: Session Chairs, DATE, BE Abstract Participants can freely interact with authors during their interactive technical presentations. |
BPA3 Efficient processing for NNs
Add this session to my calendar
Date: Wednesday, 19 April 2023
Time: 08:30 CET - 10:30 CET
Location / Room: Gorilla Room 1.5.1
Session chair:
David Novo, LIRMM, University of Montpellier, CNRS, FR, FR
Time | Label | Presentation Title Authors |
---|---|---|
08:30 CET | BPA3.1 | AUTOMATED ENERGY-EFFICIENT DNN COMPRESSION UNDER FINE-GRAIN ACCURACY CONSTRAINTS Speaker: Ourania Spantidi, Southern Illinois University, US Authors: Ourania Spantidi and Iraklis Anagnostopoulos, Southern Illinois University Carbondale, US Abstract Deep Neural Networks (DNNs) are utilized in a variety of domains, and their computational intensity is stressing embedded devices with limited power budgets. DNN compression has been employed to achieve gains in energy consumption on embedded devices at the cost of accuracy loss. Compression-induced accuracy degradation is addressed through fine-tuning or retraining, which is not always feasible. Additionally, state-of-the-art approaches compress DNNs with respect to the average accuracy achieved during inference, which can be a misleading evaluation metric. In this work, we explore more fine-grain properties of DNN inference accuracy and generate energy-efficient DNNs using signal temporal logic and falsification, applied jointly through pruning and quantization. We offer the ability to control the quality of the DNN inference at run time, and propose an automated framework that can generate compressed DNNs satisfying tight fine-grain accuracy requirements. The conducted evaluation on the ImageNet dataset has shown energy consumption gains of over 30% when compared to baseline DNNs. |
08:55 CET | BPA3.2 | A SPEED- AND ENERGY-DRIVEN HOLISTIC TRAINING FRAMEWORK FOR SPARSE CNN ACCELERATORS Speaker: Yuanchen Qu, Shanghaitech University, CN Authors: Yuanchen Qu, Yu Ma and Pingqiang Zhou, Shanghaitech University, CN Abstract Sparse convolutional neural network (CNN) accelerators have been shown to achieve high processing speed and low energy consumption by leveraging zero weights or activations, which can be further optimized by finely tuning the sparse activation maps during training. In this paper, we propose a CNN training framework aimed at reducing the energy consumption and processing cycles of sparse CNN accelerators. We first model the accelerator's energy consumption and processing cycles as functions of layer-wise activation map sparsity. We then leverage this model and propose a hybrid regularization approximation method to further sparsify activation maps in the training process. The results show that our proposed framework can reduce the energy consumption of Eyeriss by 31.33%, 20.6% and 26.6% on MobileNet-V2, SqueezeNet and Inception-V3, respectively. In addition, the processing speed can be increased by 1.96x, 1.4x and 1.65x, respectively. |
09:20 CET | BPA3.3 | HARDWARE EFFICIENT WEIGHT-BINARIZED SPIKING NEURAL NETWORKS Speaker: Chengcheng Tang, University of Alberta, CA Authors: Chengcheng Tang and Jie Han, University of Alberta, CA Abstract The advancement of spiking neural networks (SNNs) provides a promising alternative approach to conventional artificial neural networks (ANNs) with higher energy efficiency. However, their significant memory requirements present a performance bottleneck on resource-constrained devices. Inspired by the notion of binarized neural networks (BNNs), we incorporate the design principles of BNNs into those of SNNs to reduce the stringent resource requirements. Specifically, the weights are binarized to 1 and −1 to implement the functions of excitatory and inhibitory synapses. Hence, the proposed design is referred to as a weight-binarized spiking neural network (WB-SNN). In the WB-SNN, only one bit is used for a weight or a spike; for the latter, 1 and 0 indicate a spike and no spike, respectively. A priority encoder is used to identify the index of an active neuron as a basic unit to construct the WB-SNN. We further design a fully connected neural network that consists of an input layer, an output layer, and fully connected layers of different sizes. A counter is utilized in each neuron to complete the accumulation of weights. The WB-SNN design is validated by using a multi-layer perceptron on the MNIST dataset. Hardware implementations on FPGAs show that the WB-SNN attains a significant saving of memory with only a limited accuracy loss compared with its SNN and BNN counterparts. |
09:45 CET | BPA3.4 | INTERACTIVE TECHNICAL PRESENTATIONS BY THE AUTHORS Speaker: Authors of the session, DATE, BE Author: Session Chairs, DATE, BE Abstract Participants can freely interact with authors during their interactive technical presentations. |
BPA4 Hardware accelerators
Add this session to my calendar
Date: Wednesday, 19 April 2023
Time: 08:30 CET - 10:30 CET
Location / Room: Okapi Room 0.8.2
Session chair:
Nima TaheriNejad, Heidelberg University, DE
Time | Label | Presentation Title Authors |
---|---|---|
08:30 CET | BPA4.1 | ACCELERATING GUSTAVSON-BASED SPMM ON EMBEDDED FPGAS WITH ELEMENT-WISE PARALLELISM AND ACCESS PATTERN-AWARE CACHES Speaker: Shiqing Li, Nanyang Technological University, SG Authors: Shiqing Li and Weichen Liu, Nanyang Technological University, SG Abstract Gustavson's algorithm (i.e., the row-wise product algorithm) shows its potential as the backbone algorithm for sparse matrix-matrix multiplication (SpMM) on hardware accelerators. However, it still suffers from irregular memory accesses, and thus its performance is bounded by the off-chip memory traffic. Previous works mainly focus on high-bandwidth-memory-based architectures and are not suitable for embedded FPGAs with traditional DDR. In this work, we propose an efficient Gustavson-based SpMM accelerator on embedded FPGAs with element-wise parallelism and access pattern-aware caches. First, we analyze the parallelism of Gustavson's algorithm and propose to perform the algorithm with element-wise parallelism, which reduces the idle time of processing elements caused by synchronization. Further, we show through a counter-intuitive example that a traditional cache can lead to worse performance. We then propose a novel access pattern-aware cache scheme called SpCache, which provides quick responses to reduce bank conflicts caused by irregular memory accesses and combines streaming and caching to handle requests that access ordered elements of unpredictable length. Finally, we conduct experiments on the Xilinx Zynq-UltraScale ZCU106 platform with a set of benchmarks from the SuiteSparse matrix collection. The experimental results show that the proposed design achieves an average 1.62x performance speedup compared to the baseline. |
08:55 CET | BPA4.2 | GRAPHITE: ACCELERATING ITERATIVE GRAPH ALGORITHMS ON RERAM ARCHITECTURES VIA APPROXIMATE COMPUTING Speaker: Dwaipayan Choudhury, Washington State University, US Authors: Dwaipayan Choudhury, Ananth Kalyanaraman and Partha Pratim Pande, Washington State University, US Abstract ReRAM-based Processing-in-Memory (PIM) offers a promising paradigm for computing near data, making it an attractive platform of choice for graph applications that suffer from sparsity and irregular memory access. However, the performance of ReRAM-based graph accelerators is limited by two key challenges – significant storage requirements (particularly due to wasted storage of the zero cells of a graph's adjacency matrix), and a significant amount of on-chip traffic between ReRAM-based processing elements. In this paper we present GraphIte, an approximate computing-based framework for accelerating iterative graph applications on ReRAM-based architectures. GraphIte uses sparsification and approximate updates to achieve significant reductions in ReRAM storage and data movement. Our experiments on PageRank and community detection show that our proposed architecture outperforms a state-of-the-art ReRAM-based graph accelerator with up to 83.4% reduction in execution time while consuming up to 87.9% less energy for a range of graph inputs and workloads. |
09:20 CET | BPA4.3 | PEDAL: A POWER EFFICIENT GCN ACCELERATOR WITH MULTIPLE DATAFLOWS Speaker: Nishil Talati, University of Michigan, US Authors: Yuhan Chen, Alireza Khadem, Xin He, Nishil Talati, Tanvir Ahmed Khan and Trevor Mudge, University of Michigan, US Abstract Graphs are ubiquitous in many application domains due to their ability to describe structural relations. Graph Convolutional Networks (GCNs) have emerged in recent years and are rapidly being adopted due to their capability to perform Machine Learning (ML) tasks on graph-structured data. GCN exhibits irregular memory accesses due to the lack of locality when accessing graph-structured data. This makes it hard for general-purpose architectures like CPUs and GPUs to fully utilize their computing resources. In this paper, we propose PEDAL, a power-efficient accelerator for GCN inference supporting multiple dataflows. PEDAL chooses the best-fit dataflow and phase ordering based on input graph characteristics and GCN algorithm, achieving both efficiency and flexibility. To achieve both high power efficiency and performance, PEDAL features a light-weight processing element design. PEDAL achieves 144.5, 9.4, and 2.6 times speedup compared to CPU, GPU, and HyGCN, respectively, and 8856, 1606, 8.4, and 1.8 times better power efficiency compared to CPU, GPU, HyGCN, and EnGN, respectively. |
09:45 CET | BPA4.4 | INTERACTIVE TECHNICAL PRESENTATIONS BY THE AUTHORS Speaker: Authors of the session, DATE, BE Author: Session Chairs, DATE, BE Abstract Participants can freely interact with authors during their interactive technical presentations. |
FS7 Focus session: Sustainable chip production: can we align its carbon footprint on Paris Agreement 1.5°C pathways?
Add this session to my calendar
Date: Wednesday, 19 April 2023
Time: 08:30 CET - 10:00 CET
Location / Room: Okapi Room 0.8.1
If we are serious about limiting climate change to a global warming of 1.5°C by the end of the century, every economic sector needs to reduce its carbon footprint in order to align global greenhouse gas (GHG) emissions with Paris Agreement (PA) pathways. Awareness of this challenge has grown significantly over the last few years, both in the chip-design and chip-fabrication communities. However, it is so far not clear how to reach a sustained reduction of the carbon footprint of chip manufacturing at the rate of 8%/year that would be aligned with the 1.5°C PA pathway. In this session, we will rely on the IPAT/Kaya decomposition to discuss the feasibility of the various levers we have in our hands: offsetting of the carbon footprint by enabling reductions of GHG emissions in other economic sectors (kgCO2e), decarbonization of the energy used for chip production (kgCO2e/MJ), reduction of the energy intensity of chip production (MJ/cm²) and innovation slowdown with planned degrowth of chip production volumes (cm²). We will see that it is very unlikely that carbon footprint reduction can be fast enough without limiting the global production volume as measured in total wafer area.
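As a rough guide to the decomposition behind this session (an illustrative sketch only; the symbols used here are our own shorthand, not necessarily the speakers' notation), the manufacturing footprint can be written as a product of the levers listed above:

```latex
% Kaya-style decomposition of the chip-manufacturing carbon footprint
% (illustrative; the symbols V, E_i, C_i and O are assumptions for this sketch)
\mathrm{CF}_{\mathrm{fab}}\,[\mathrm{kgCO_2e}]
  \;=\;
  \underbrace{V\,[\mathrm{cm^2}]}_{\text{production volume}}
  \times
  \underbrace{E_i\,[\mathrm{MJ/cm^2}]}_{\text{energy intensity}}
  \times
  \underbrace{C_i\,[\mathrm{kgCO_2e/MJ}]}_{\text{carbon intensity of energy}}
  \;-\;
  \underbrace{O\,[\mathrm{kgCO_2e}]}_{\text{offsets / enabled reductions elsewhere}}
```

At the 8%/year pace mentioned above, the product of these levers would have to shrink by a factor of 0.92 every year, i.e. to roughly 0.92^10 ≈ 0.43 of its current value within a decade, which illustrates why the session questions whether efficiency gains and decarbonization alone can be fast enough without acting on the volume term.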
Time | Label | Presentation Title Authors |
---|---|---|
08:30 CET | FS7.1 | HOW REALISTIC ARE CLAIMS ABOUT THE BENEFITS OF USING DIGITAL TECHNOLOGIES FOR GHG EMISSIONS MITIGATION? Presenter: Sophie Quinton, Inria, FR Author: Sophie Quinton, Inria, FR Abstract While the direct environmental impacts of digital technologies are now well documented, it is often said that they could also help reduce greenhouse gas (GHG) emissions significantly in many domains such as transportation, building, manufacturing, agriculture, and energy. Assessing such claims is essential to avoid delaying alternative action or research. This also applies to related claims about how much GHG emissions existing digital technologies are already avoiding. In this talk, we point out critical issues related to these topics in the state of the art and propose a set of guidelines that all studies on digital solutions for mitigating GHG emissions should satisfy. |
08:55 CET | FS7.2 | FROM SILICON SHIELD TO CARBON LOCK-IN? THE ENVIRONMENTAL FOOTPRINT OF ELECTRONIC COMPONENTS MANUFACTURING IN TAIWAN Presenter: Gauthier Roussilhe, Royal Melbourne Institute of Technology, AU Author: Gauthier Roussilhe, Royal Melbourne Institute of Technology, AU Abstract Taiwan plans to rapidly increase its industrial production capacity of electronic components while concurrently setting policies for its ecological transition. Given that the island is responsible for the manufacturing of a significant part of worldwide electronics components, the sustainability of the Taiwanese electronics industry is therefore of critical interest. Beyond relative efficiency gains this talk will present an assessment of the absolute environmental footprint of electronic components manufacturers, and its trend, at the national scale. Putting this assessment in perspective with geopolitical, energy and economic factors, this talk analyses what it means for Taiwan and for other countries pursuing development of a sub-10nm CMOS industrial landscape. |
09:20 CET | FS7.3 | ASSESSING THE ENVIRONMENTAL IMPACTS OF IC DEVICES THROUGH KAYA DECOMPOSITION: TRENDS AND IMPLICATIONS Presenter: Lieven Eeckhout, UGent, BE Author: Lieven Eeckhout, UGent, BE Abstract This talk will reformulate the well-known Kaya identity to understand the environmental and carbon footprint of manufacturing and using integrated circuits. By making a distinction between embodied and operational carbon emissions, we are able to understand (1) how the global carbon footprint of computing is likely to scale in the future, and (2) what we, as computer engineers, can do to reduce the environmental impact of computing. We conclude that computer engineers should first and foremost design smaller chips; reducing lifetime energy consumption is of secondary importance, yet still significant. |
09:45 CET | FS7.4 | WRAP-UP AND PERSPECTIVES FOR THE EUROPEAN CHIPS ACT Presenter: David Bol, Université catholique de Louvain, BE Author: David Bol, Université catholique de Louvain, BE Abstract Building on the scientific evidence from the three previous talks, we will discuss the context of planetary boundaries and GHG reduction pathways for the European Chips Act, highlighting the tension between efficiency, resiliency and sobriety. We will clarify that if the EU wants to reduce the carbon footprint of its electronic component sector, the additional semiconductor manufacturing capacity fostered by the European Chips Act should be deployed as a substitution (replacement) for some capacity in the rest of the world and not as an addition to it. |
FS8 Focus session: Supporting Design in the EU: the “Chips for Europe” Initiative
Add this session to my calendar
Date: Wednesday, 19 April 2023
Time: 08:30 CET - 10:00 CET
Location / Room: Gorilla Room 1.5.3
Time | Label | Presentation Title Authors |
---|---|---|
08:30 CET | FS8.1 | INTERACTIVE SESSION Presenter: Marco CECCARELLI, European Commission, BE Authors: Marco CECCARELLI1 and Romano Hoofman2 1European Commission, BE; 2IMEC IC-link, BE Abstract The European Chips Act foresees over EUR 11 billion of public support for the "Chips for Europe" Initiative. Chip design is one of its key priorities. The initiative encompasses 5 lines of action: a cloud-based virtual design platform; pilot lines for prototyping and validation; tools and infrastructures for quantum chips; skills development and competence centres; and a Chips Fund offering loans and equity investment solutions. This open, interactive session will focus particularly on the development of the envisaged virtual design platform, which will offer easy cloud-based access to tools, libraries and support services to accelerate development and reduce time-to-market. The new platform will build upon the successful experience of EUROPRACTICE, offering access to IC services, prototyping and fabrication. Further, it aims at enhancing collaboration among stakeholders for the development of European technology, IP and tools, including open-source. Interventions from the audience are encouraged, to exchange views on how the proposed platform can lower entry barriers, stimulate IP creation and exchange, and accelerate innovation. We encourage all interested parties to contribute to the discussion, thereby helping to shape this initiative to the benefit of the European design ecosystem. |
M05 NVMExplorer: A Framework for Cross-Stack Comparisons of Embedded, Non-Volatile Memory Solutions
Add this session to my calendar
Date: Wednesday, 19 April 2023
Time: 08:30 CET - 10:00 CET
Location / Room: Toucan Room 2.7.1/2
Organisers:
Lillian Pentecost, Amherst College, US
Alexander Hankin, Intel Labs, US
Marco Donato, Tufts University, US
Mark Hempstead, Tufts University, US
David Brooks, Harvard University, US
Gu-Yeon Wei, Harvard University, US
The wide adoption of data-intensive algorithms to tackle today’s computational problems introduces new challenges in designing efficient computing systems to support these applications. In critical domains such as machine learning and graph processing, data movement remains a major performance and energy bottleneck. As repeated memory accesses to off-chip DRAM impose an overwhelming energy cost, we need to rethink the way embedded (i.e., on-chip) memory systems are built in order to increase storage density and energy efficiency beyond what is currently possible with SRAM. To address these challenges and empower future memory system design, we developed NVMExplorer: a design space exploration framework that addresses key cross-computing-stack design questions and reveals opportunities and optimizations for embedded NVMs under realistic system-level constraints, while providing a flexible interface and modular evaluation to empower further investigations.
This tutorial will describe and walk through hands-on design studies using our open-source code base (NVMExplorer, http://nvmexplorer.seas.harvard.edu/), highlighting the most up-to-date features of our suite of tools, including integration with additional memory characterization tools and system simulator results. We will also guide attendees to configure their own design studies based on their research interests.
At the end of this tutorial, attendees will be able to use NVMExplorer to evaluate and compare the application-level power and performance impact of a variety of eNVM solutions, including different technology configurations, varying system settings and optimization targets, and a range of application memory traffic patterns.
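To give a flavour of the kind of cross-stack trade-off the tutorial explores, the minimal sketch below sweeps a few hypothetical eNVM technology operating points against application traffic patterns. All numbers, names and the simplified power model are illustrative assumptions; this is not the NVMExplorer API or its characterization data, for which the tutorial and the documentation linked above are the reference.

```python
# Toy cross-stack sweep: candidate eNVM technologies vs. application traffic.
# NOTE: hypothetical values and class names; NOT the actual NVMExplorer interface.
from dataclasses import dataclass

@dataclass
class MemTech:
    name: str
    read_pj_per_bit: float   # read energy (pJ/bit), assumed value
    write_pj_per_bit: float  # write energy (pJ/bit), assumed value
    leakage_mw: float        # standby leakage (mW), assumed value

@dataclass
class Workload:
    name: str
    reads_per_s: float       # bits read per second
    writes_per_s: float      # bits written per second

def power_mw(tech: MemTech, wl: Workload) -> float:
    """Total memory power (mW) = dynamic read/write power + leakage."""
    dyn_mw = (wl.reads_per_s * tech.read_pj_per_bit +
              wl.writes_per_s * tech.write_pj_per_bit) * 1e-9  # pJ/s -> mW
    return dyn_mw + tech.leakage_mw

techs = [
    MemTech("SRAM-like", 0.2, 0.2, 5.0),
    MemTech("STT-RAM-like", 0.3, 1.5, 0.5),
    MemTech("RRAM-like", 0.4, 5.0, 0.1),
]
workloads = [
    Workload("read-heavy DNN inference", 5e9, 1e8),
    Workload("write-heavy graph update", 1e9, 1e9),
]

for wl in workloads:
    best = min(techs, key=lambda t: power_mw(t, wl))
    print(f"{wl.name}: best candidate = {best.name} "
          f"({power_mw(best, wl):.2f} mW)")
```

The hands-on studies in the session replace this toy model with the framework's full characterization data and system simulator results, as described above.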
SpD3 Special day on Personalised Medicine: Biological Computing
Add this session to my calendar
Date: Wednesday, 19 April 2023
Time: 08:30 CET - 10:00 CET
Location / Room: Darwin Hall
Session chair:
Jan Madsen, TU Denmark, DK
Time | Label | Presentation Title Authors |
---|---|---|
08:30 CET | SpD3.1 | DESIGN OF A MICROFLUIDIC-BASED COMPUTING CHIP USING BACTERIA Presenter: Daniel Martins, Walton Institute for Information and Communication Systems Science, IE Author: Daniel Martins, Walton Institute for Information and Communication Systems Science, IE Abstract Biocomputing systems are being designed, using nanoscale and biologically based materials, to execute computational functions required by novel medical and environmental diagnostic tools. The complexity of such applications has required the use of biomolecular systems, e.g. bacteria, to provide more information-processing capabilities to these biocomputing systems. Therefore, we propose a bacteria-based system that can be flexibly integrated into a microfluidic chip to compute with biomolecules. This design combines electrochemical sensors, microfluidics and molecular communications to compute with molecules emitted by bacterial populations. We assess the performance of the proposed biocomputing chip in terms of reliable logic computing, based on the modelling of the communication processes occurring in this system and the detection thresholds of the electrochemical sensors. Our results show the impact of two bottlenecks of the proposed biocomputing system and lay the foundation for future bacteria-based diagnostic devices. |
09:00 CET | SpD3.2 | EFFICIENT AND LOW-COST METHODS TO DESIGN, TEST, AND OPERATE MICROFLUIDIC SYSTEMS Speaker: Tsun-Ming Tseng, TU Munich, DE Author: Ulf Schlichtmann, TU Munich, DE Abstract Microfluidics is widely considered to be a powerful lab-on-a-chip platform, the applications of which nowadays range from genome sequencing to wearable devices. Point-of-care diagnostic approaches can benefit significantly from microfluidics technology. The design of microfluidic systems, nevertheless, is still mainly carried out manually, based on the designer's experience and knowledge. As microfluidic designs become more complex and intricate, design automation of microfluidics clearly has the potential to improve both design productivity and quality of results. This talk features our research at TUM on how to design, test, and operate microfluidic systems efficiently and at low cost. |
09:30 CET | SpD3.3 | DEVELOPING BIOLOGICAL AI THROUGH GENE REGULATORY NEURAL NETWORK MODEL Presenter: Sasitharan Balasubramaniam, University of Nebraska-Lincoln, US Author: Sasitharan Balasubramaniam, University of Nebraska-Lincoln, US Abstract Artificial Intelligence (AI) and Machine Learning (ML) are weaving their way into the fabric of society, where they are playing a crucial role in numerous facets of our lives. As we witness the increased deployment of AI and ML in various types of devices, we benefit from their use in learning and interpreting information and providing key decision making. This widespread deployment has led to the question of whether AI algorithms can be deployed into non-silicon devices and materials. Recent research has seen the emergence of Biological AI, where perceptron and neural network properties are formed from biological cells, ranging from the engineering of genetic circuits for creating single perceptrons to population-based communication of cells that leads to neural network behavior. This talk will start with a brief introduction to the current state of the art in Biological AI based on engineered systems. We will then investigate whether non-engineered bacterial cells can also be exploited through the natural neural network structures found in their gene regulation networks. Through the controlled application of chemical agents that control the operation of the GRN, we may be able to exploit a neural network function. We will also briefly touch on other forms of ion-based molecular communication that could possibly be used to develop perceptron models. The talk will provide preliminary results on this research and discuss possible healthcare applications for the future. |
W05 Hyperdimensional Computing and Vector Symbolic Architectures for Automation and Design in Technology and Systems
Add this session to my calendar
Date: Wednesday, 19 April 2023
Time: 08:30 CET - 12:30 CET
Location / Room: Nightingale Room 2.6.1/2
Organiser:
Antonello Rosato, Sapienza University of Rome, IT
We are pleased to announce the workshop titled “Hyperdimensional Computing and Vector Symbolic Architectures for Automation and Design in Technology and Systems”, to be held on Wednesday, 19 April 2023 at the DATE 2023 conference in Antwerp, Belgium. This workshop will explore the interplay between different technologies and architectures that enable the automation of complex systems and the design of efficient and effective solutions for real-world problems.
The objective of this workshop is to bring together leaders in the fields of hyperdimensional computing, vector symbolic architectures, and automation and design in technology and systems to discuss the latest developments and challenges in these areas, and to identify synergies between them.
We invite contributions from researchers, practitioners, industry, and government representatives who are working in the areas of hyperdimensional computing, vector symbolic architectures, and automation and design in technology and systems.
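For readers unfamiliar with the paradigm, the following minimal sketch illustrates the core HDC/VSA operations (binding, bundling and similarity search) using random bipolar hypervectors; this is one common VSA model among several, chosen purely for illustration, and is not tied to any particular workshop contribution.

```python
# Minimal sketch of core HDC/VSA operations with random bipolar hypervectors.
import numpy as np

rng = np.random.default_rng(0)
D = 10_000  # hypervector dimensionality

def hv():
    """Random bipolar hypervector in {-1, +1}^D."""
    return rng.choice([-1, 1], size=D)

def bind(a, b):
    """Binding (element-wise multiplication): associates two concepts."""
    return a * b

def bundle(*vs):
    """Bundling (element-wise majority of the sum): superposes several vectors."""
    return np.sign(np.sum(vs, axis=0))

def sim(a, b):
    """Normalized similarity (cosine-like measure for bipolar vectors)."""
    return float(a @ b) / D

# Encode the record {colour: red, shape: square} as a single hypervector.
colour, red, shape, square = hv(), hv(), hv(), hv()
record = bundle(bind(colour, red), bind(shape, square))

# Unbinding with the 'colour' key recovers something close to 'red'
# (binding is its own inverse for bipolar vectors).
query = bind(record, colour)
print("sim(query, red)    =", round(sim(query, red), 2))     # high (~0.5)
print("sim(query, square) =", round(sim(query, square), 2))  # near 0
```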
We are issuing a call for poster presentations to be presented at the workshop. The topics of interest include, but are not limited to:
- Automation and Design of technology and systems based on HDC/VSA
- New implementation of HDC and VSA concepts
- Advances in HDC and VSA use in practical applications
- Use of HDC and VSA in the field of ML and AI
Submissions are invited in the form of (extended) abstracts not exceeding two pages and must be submitted via EasyChair following this link:
https://easychair.org/my/conference?conf=w05hdc
Submissions should present innovative research and development results related to the topics listed above. Poster presentations should be self-contained, and include a brief abstract, key references, and contact information. The submission deadline is February 3rd, 2023. All accepted posters will be presented at the workshop.
Key Dates:
Submission deadline: February 10th EXTENDED
Notification of Acceptance: February 13th
Posters ready: March 27th
Workshop: April 19th 8:30-12:30
Technical Program
8:30-10 Invited Speakers
10-10:30 Coffee break
10:30 -12:30 Poster Session and Open Discussion
We look forward to your contributions and can’t wait to meet you for discussion!
If you have any questions, please email antonello[dot]rosato[at]uniroma1[dot]it.
W05.1 Invited Talks
Add this session to my calendar
Date: Wednesday, 19 April 2023
Time: 08:30 CET - 10:00 CET
Location / Room: Nightingale Room 2.6.1/2
Chair:
Antonello Rosato, Sapienza University of Rome, IT
Speakers:
Abbas Rahimi, IBM Research-Zurich, CH
Denis Kleyko, RISE Research Institutes of Sweden, SE
W05.2 Invited Talk
Add this session to my calendar
Date: Wednesday, 19 April 2023
Time: 10:30 CET - 11:00 CET
Location / Room: Nightingale Room 2.6.1/2
Chair:
Antonello Rosato, Sapienza University of Rome, IT
Speaker:
Peer Neubert, University of Koblenz, DE
W05.3 Poster Session
Add this session to my calendar
Date: Wednesday, 19 April 2023
Time: 11:00 CET - 12:30 CET
Location / Room: Nightingale Room 2.6.1/2
Chair:
Antonello Rosato, Sapienza University of Rome, IT
M06 Design, Programming, and Partial Reconfiguration of Heterogeneous SoCs with ESP
Add this session to my calendar
Date: Wednesday, 19 April 2023
Time: 11:00 CET - 12:30 CET
Location / Room: Toucan Room 2.7.1/2
Organisers:
Luca Carloni, Columbia University, US
Joseph H. Zuckerman, Columbia University in the City of New York, US
Energy-efficient, high-performance computing requires the integration of specialized accelerators with general-purpose processors. Designing such systems, however, imposes a difficult set of challenges: integrating many components of different natures into a single SoC; designing new components targeting a particular application domain with a limited team size; dealing with ever-changing software; accelerating multiple applications with a fixed area and power budget. In this tutorial, we present ESP, an open-source platform to support research on the design and programming of heterogeneous SoC architectures. By combining a scalable, modular tile-based architecture with a flexible system-level design methodology, ESP simplifies the design of individual accelerators and automates their hardware/software integration into complete SoCs. In particular, we demonstrate several capabilities of ESP to meet the challenges described above. First, we show how to use the commercial Catapult HLS tool and the open-source Matchlib library to design an accelerator in SystemC; this is a new example of one of the design flows supported by the ESP methodology that simultaneously raise the level of abstraction in the design process and allow designers to conduct a broader design-space exploration. Next, we demonstrate how ESP simplifies the integration of the accelerator into a complete SoC and enables its functional and performance evaluation through rapid FPGA-based prototyping. Finally, we show how recent advances in ESP make it possible to reduce the amount of dark silicon in SOC architectures through fine-grained partial reconfiguration of accelerator tiles.
For more information please see:
- the ESP release on GitHub: https://github.com/sld-columbia/esp
- the ESP documentation: https://www.esp.cs.columbia.edu/docs/
- the ESP publications: https://www.esp.cs.columbia.edu/pubs/
- the ESP tutorials: https://www.esp.cs.columbia.edu/tutorials/
MPP3 Multi-partner projects
Add this session to my calendar
Date: Wednesday, 19 April 2023
Time: 11:00 CET - 12:30 CET
Location / Room: Gorilla Room 1.5.3
Session chair:
Ernesto Sanchez, Politecnico di Torino, IT
Time | Label | Presentation Title Authors |
---|---|---|
11:00 CET | MPP3.1 | THE TEAMPLAY PROJECT: ANALYSING AND OPTIMISING TIME, ENERGY, AND SECURITY FOR CYBER-PHYSICAL SYSTEMS Speaker: Benjamin Rouxel, Unimore, IT Authors: Benjamin Rouxel1, Christopher Brown2, Emad Ebeid3, Kerstin Eder4, Heiko Falk5, Clemens Grelck6, Jesper Holst7, Shashank Jadhav5, Yoann Marquer8, Marcos Martinez De Alejandro9, Kris Nikov4, Ali Sahafi3, Ulrik Schultz3, Adam Seewald10, Vangelis Vassalos11, Simon Wegener12 and Olivier Zendra13 1Unimore, IT; 2University of St.Andrews, GB; 3University of Southern Denmark, DK; 4University of Bristol, GB; 5Hamburg University of Technology (TUHH), DE; 6University of Amsterdam, NL; 7SkyWatch A/S, DK; 8University of Luxembourg, LU; 9Thales Alenia Space, ES; 10Yale University, US; 11Irida Labs AE, GR; 12AbsInt Angewandte Informatik GmbH, DE; 13INRIA, University of Rennes, CNRS, IRISA, FR Abstract Non-functional properties such as energy, time, and security (ETS) are becoming increasingly important in Cyber-Physical Systems (CPS) programming. This article describes TeamPlay, a research project funded under the EU Horizon 2020 programme between January 2018 and June 2021. TeamPlay aimed to provide the system designer with a toolchain for developing embedded applications where ETS properties are first-class citizens, allowing the developer to reflect directly on energy, time and security properties at the source code level. In this paper we give an overview of the TeamPlay methodology, introduce the challenges and solutions of our approach and summarise the results achieved. Overall, applying our TeamPlay methodology led to improvements of up to 18% in performance and 52% in energy usage over traditional approaches. |
11:03 CET | MPP3.2 | HARDWARE AND SOFTWARE SUPPORT FOR MIXED PRECISION COMPUTING: A ROADMAP FOR EMBEDDED AND HPC SYSTEMS Speaker: William Fornaciari, Politecnico di Milano, IT Authors: William Fornaciari, Giovanni Agosta, Davide Zoni, Andrea Galimberti, Gabriele Magnani, Lev Denisov and Daniele Cattaneo, Politecnico di Milano, IT Abstract Mixed precision is an approximate computing technique that can be used to trade off computation accuracy for performance and/or energy. It can be applied to many error-tolerant applications, but manual precision tuning is both tedious and error-prone. Furthermore, the effectiveness of the technique heavily depends on hardware characteristics. Therefore, a hardware/software co-design approach is necessary for an effective exploitation of the precision tuning opportunities offered by the applications. In this paper, we propose, based on the state of the art of precision tuning software and mixed precision hardware, a roadmap for the evolution of hardware designs and compiler-based precision tuning support, which is ongoing in the context of the European projects TEXTAROSSA (EuroHPC) and APROPOS (ITN). |
11:06 CET | MPP3.3 | REAL TIME ACOUSTIC PERCEPTION FOR AUTOMOTIVE APPLICATIONS Speaker: Jun Yin, KU Leuven, BE Authors: Jun Yin1, Stefano Damiano1, Marian Verhelst1, Toon Waterschoot1 and Andre Guntoro2 1KU Leuven, BE; 2Robert Bosch GmbH, DE Abstract In recent years the automotive industry has been strongly promoting the development of smart cars equipped with multi-modal sensors to gather information about the surroundings, in order to aid human drivers or make autonomous decisions. While the focus has mostly been on visual sensors, acoustic events are also crucial to detect situations that require a change in the driving behavior, such as a car honking or the sirens of approaching emergency vehicles. In this paper, we summarize the results achieved so far in the Marie Skłodowska-Curie Actions (MSCA) European Industrial Doctorates (EID) project "Intelligent Ultra Low-Power Signal Processing for Automotive (I-SPOT)". On the algorithmic side, the I-SPOT project aims to enable detecting, localizing and tracking environmental audio signals by jointly developing microphone array processing and deep learning techniques that specifically target automotive applications. Data generation software has been developed to cover the I-SPOT target scenarios and research challenges. This tool is currently being used to develop low-complexity deep learning techniques for emergency sound detection. On the hardware side, the goal is to develop workflows for hardware-algorithm co-design that ease the generation of architectures that are sufficiently flexible towards algorithmic evolutions without giving up efficiency, and that enable rapid feedback on the hardware implications of algorithmic decisions. This is pursued through a hierarchical workflow that breaks the hardware-algorithm design space into reasonable subsets, which has been tested for operator-level optimizations on state-of-the-art robust sound source localization for edge devices. Further, several open challenges towards an end-to-end system are clarified for the next stage of I-SPOT. |
11:09 CET | MPP3.4 | HERMES: QUALIFICATION OF HIGH PERFORMANCE PROGRAMMABLE MICROPROCESSOR AND DEVELOPMENT OF SOFTWARE ECOSYSTEM Speaker: Fabrizio Ferrandi, Politecnico di Milano, IT Authors: Nadia Ibellaatti1, Edouard LEPAPE1, Alp Kilic1, Kaya AKYEL1, Kassem CHOUAYAKH1, Fabrizio Ferrandi2, Claudio Barone2, Serena Curzel2, Michele Fiorito2, Giovanni Gozzi2, Miguel MASMANO3, Ana Risquez Navarro3, Manuel Munoz3, Vicente Nicolau Gallego3, Patricia LOPEZ CUEVA4, Jean-noel LETRILLARD5 and Franck WARTEL6 1NanoXplore, FR; 2Politecnico di Milano, IT; 3FENT INNOVATIVE SOFTWARE SOLUTIONSSL - FENTISS, ES; 4THALES ALENIA SPACE FRANCE SAS, FR; 5STMICROELECTRONICS GRENOBLE 2 SAS - STGNB 2 SAS, FR; 6AIRBUS DEFENCE AND SPACE SAS, FR Abstract European efforts to boost competitiveness in the sector of space services promote the research and development of advanced software and hardware solutions. The EU-funded HERMES project contributes to the effort by qualifying radiation-hardened, high-performance programmable microprocessors, and by developing a software ecosystem that facilitates the deployment of complex applications on such platforms. The main objectives of the project include reaching a technology readiness level of 6 (i.e., validated and demonstrated in relevant environment) for the rad-hard NG-ULTRA FPGA with its ceramic hermetic package CGA 1752, developed within projects of the European Space Agency, French National Centre for Space Studies and the European Union. An equally important share of the project is dedicated to the development and validation of tools that support multicore software programming and FPGA acceleration, including Bambu for High-Level Synthesis and the XtratuM hypervisor with a level one boot loader for virtualization. |
11:12 CET | MPP3.5 | A STEP TOWARD SAFE UNATTENDED TRAIN OPERATIONS: A PIONEER VITAL CONTROL MODULE Speaker: Grazia Mascellaro, Politecnico di Bari, IT Authors: Giovanni Mezzina1, Arturo Amendola2, Mario Barbareschi3, Salvatore De Simone2, Grazia Mascellaro1, Alberto Moriconi2, Cataldo Luciano Saragaglia1, Diana Serra2 and Daniela De Venuto1 1Politecnico di Bari, IT; 2Rete Ferroviaria Italiana S.p.A., IT; 3Università Degli Studi di Napoli Federico II, IT Abstract Although Automatic Train Operation (ATO) is well established in urban railways, its use on mainlines is still unexplored. Currently, the first prototypes of trains with ATO capable of running on mainlines equipped with specific control systems (e.g., ETCS/ERTMS in Europe) have been realized. However, they require the active presence of staff on board. Recent research into innovative solutions for railway efficiency has opened the possibility of extending the ATO concept to Unattended Train Operation (UTO), i.e., the full automation of infrastructures and vehicles. In this context, a project based on a synergistic collaboration between academia and the national railway industry has led to the definition of a new Vital Control module (VC). The VC includes a PCB managed by a reliable and safe hard Real-Time Operating System (RTOS). The hardware consists of a Eurocard-sized PCB that houses an Ultrazed-EG System on Module as its computing core and embeds several communication interfaces to facilitate integration into existing apparatus. The VC RTOS runs an application logic that acts as a real-time control core for assessing the operational status of the on-cabin equipment. The VC is also responsible for detecting UTO-related hazardous situations and intervening with emergency braking. Both the VC hardware and software are developed to be compliant with the related safety standards. The proposed VC has been included in an automatic testbed to recreate real-time hazardous scenarios. In this context, the VC system has proven able to mitigate these scenarios ~2 times faster than the current ATO protection system. |
11:15 CET | MPP3.6 | THE POST-PANDEMIC EFFECTS ON IOT FOR SAFETY: THE SAFE PLACE PROJECT Speaker: Luigi Capogrosso, Università di Verona, IT Authors: Federico Cunico1, Luigi Capogrosso1, Alberto Castellini2, Francesco Setti1, Patrik Pluchino3, Filippo Zordan3, Valeria Santus3, Anna Spagnolli3, Stefano Cordibella4, Giambattista Gennari5, Alberto Sozza6, Stefano Troiano1, Roberto Flor1, Andrea Zanella3, Alessandro Farinelli1, Luciano Gamberini3 and Marco Cristani1 1Università di Verona, IT; 2Verona University, IT; 3University of Padua, IT; 4EDALAB s.r.l., IT; 5Motorola Solutions, IT; 6Rete di Impresa Luce in Veneto, IT Abstract COVID-19 had substantial effects on the part of the IoT community that designs systems for safety: the need to detect face masks worn by everyone, the analysis of crowds to avoid the spread of the disease, and the sanitization of public environments have led to exceptional research acceleration and fast engineering of the related solutions. Now that the pandemic is waning, some applications are becoming less important, while others are proving to be useful regardless of the criticality of COVID-19. The SAFE PLACE project is a prime example of this situation (DATE23 MPP category: final stage). SAFE PLACE is an Italian 3M euro regional industrial/academic project, financed by European funds, created to ensure a concerted multidisciplinary reaction to COVID-19 in critical environments such as rest homes and public places. The SAFE PLACE consortium was able to understand what is no longer useful in this post-pandemic period, and what instead is potentially attractive for the market. For example, the detection of face masks now has little importance, while sanitization remains highly relevant. This paper shares this analysis, which emerged through the co-design process of three public SAFE PLACE project demonstrators, involving heterogeneous stakeholders ranging from scientists to lawyers. |
11:18 CET | MPP3.7 | INTERACTIVE TECHNICAL PRESENTATIONS BY THE AUTHORS Speaker: Authors of the session, DATE, BE Author: Session Chairs, DATE, BE Abstract Participants can freely interact with authors during their interactive technical presentations. |
SA2 Application specific circuits and systems
Add this session to my calendar
Date: Wednesday, 19 April 2023
Time: 11:00 CET - 12:30 CET
Location / Room: Marble Hall
Session chair:
Akash Kumar, TU Dresden, DE
11:00 CET until 11:21 CET: Pitches of regular papers
11:21 CET until 12:30 CET: Interactive technical presentations by the authors of regular papers and extended abstracts
Regular Papers
Time | Label | Presentation Title Authors |
---|---|---|
11:00 CET | SA2.1 | A DECENTRALIZED FRONTIER QUEUE FOR IMPROVING SCALABILITY OF BREADTH-FIRST-SEARCH ON GPUS Speaker: Chou-Ying Hsieh, National Taiwan University, TW Authors: Chou-Ying Hsieh, Po-Hsiu Cheng, Chia-Ming Chang and Sy-Yen Kuo, National Taiwan University, TW Abstract The breadth-first-search (BFS) algorithm is a fundamental building block of a broad range of applications, from the electronic design automation (EDA) field to social network analysis. With target data set sizes growing considerably, researchers have turned to developing parallel BFS (PBFS) algorithms and accelerating them with graphics processing units (GPUs). The frontier queue, the core idea of state-of-the-art PBFS designs, opens the door to neighbor-visiting parallelism. However, the traditional centralized frontier queue in PBFS suffers from dramatic collisions when many threads simultaneously operate on it. Furthermore, the growing size of the graph puts considerable pressure on memory space. Therefore, we first identify the challenges of current frontier queue implementations. To solve these challenges, we propose the decentralized frontier queue (DFQ), which separates a centralized queue into multiple tiny sub-queues to scatter the atomic operation collisions across these queues. We also develop novel overflow-free enqueue and asynchronous sub-queue drain methods to avoid the overflow issue of the naive sub-queue design. With these two optimizations, the memory consumption of the frontier queue can be constant rather than growing exponentially with the number of vertices in the graph. In our experiments, we show that our design achieves better scalability and gains an average 1.04x execution speedup on the selected benchmark suite, with considerable memory space efficiency. |
11:03 CET | SA2.2 | TIMELY FUSION OF SURROUND RADAR/LIDAR FOR OBJECT DETECTION IN AUTONOMOUS DRIVING SYSTEMS Speaker: Wenjing Xie, City University of Hong Kong, CN Authors: Wenjing Xie1, Tao Hu1, Neiwen Ling2, Guoliang Xing2, Shao-Shan Liu3 and Nan Guan1 1City University of Hong Kong, HK; 2The Chinese University of Hong Kong, HK; 3BeyonCa, CN Abstract Fusion of multiple sensor modalities, such as camera, Lidar and Radar, is commonly used in autonomous driving systems to fully utilize the complementary advantages of different sensors. Surround Radar/Lidar can provide 360-degree view sampling at minimal cost, making them promising sensing hardware solutions for autonomous driving systems. However, due to intrinsic physical constraints, the rotating speed (i.e., the frequency at which data frames are generated) of surround Radar is much lower than that of surround Lidar, and existing Radar/Lidar fusion methods have to work at the low frequency of surround Radar, which cannot meet the high responsiveness requirement of autonomous driving systems. This paper develops techniques to fuse surround Radar/Lidar with a working frequency limited only by the faster surround Lidar instead of the slower surround Radar, based on the state-of-the-art Radar/Lidar DNN model MVDNet. The basic idea of our approach is simple: we let MVDNet work with temporally unaligned data from Radar/Lidar, so that fusion can take place whenever a new Lidar data frame arrives, instead of waiting for the slow Radar data frame. However, directly applying MVDNet to temporally unaligned Radar/Lidar data greatly degrades its object detection accuracy. The key insight revealed in this paper is that we can achieve a high output frequency with little accuracy loss by enhancing the training procedure to exploit the temporal redundancy in the fusion procedure of MVDNet so that it can tolerate the temporal misalignment of the input data. We explore several different ways of training enhancement and compare them quantitatively with experiments. |
11:06 CET | SA2.3 | A LIGHTWEIGHT AND ADAPTIVE CACHE ALLOCATION SCHEME FOR CONTENT DELIVERY NETWORKS Speaker: Ke Liu, Wuhan National Laboratory for Optoelectronics, CN | Huazhong University of Science & Technology, CN Authors: Ke Liu1, Hua Wang2, Ke Zhou1 and Cong Li3 1Wuhan National Laboratory for Optoelectronics (WNLO) of Huazhong University of Science and Technology (HUST), CN; 2Huazhong University of Science & Technology, CN; 3Tencent, CN Abstract Content delivery network (CDN) caching systems use multi-tenant shared caching due to its operational simplicity. However, this approach often results in significant space waste and requires timely space allocation. On the one hand, the accuracy and reliability of existing static allocation schemes are not high. On the other hand, due to the large number of tenants in CDNs, dynamic allocation schemes that rely on miss ratio curves (MRCs) for cache space allocation cause high computational overheads and performance fluctuations. As a result, none of these existing solutions can be used directly in a CDN caching system. In this paper, we propose a lightweight and adaptive cache allocation scheme for CDNs (LACA). Rather than computing near-optimal configurations for each tenant, LACA detects in real time whether any tenants are using cache space inefficiently, and then constructs local MRCs for those tenants. Finally, the space to be adjusted is calculated from the local MRCs. We have deployed LACA in Company-T's CDN system, where it reduces the miss ratio by 27.1% and the average user access latency by 28.5 ms. LACA is then compared with several state-of-the-art schemes in terms of the accuracy of the constructed local MRCs. Experimental results demonstrate that LACA constructs higher-accuracy local MRCs with little overhead. In addition, LACA can adjust the space as frequently as once a minute. |
11:09 CET | SA2.4 | TBERT: DYNAMIC BERT INFERENCE WITH TOP-K BASED PREDICTORS Speaker: Zejian Liu, Chinese Academy of Sciences, CN Authors: Zejian Liu1, Kun Zhao2 and Jian Cheng2 1Chinese Academy of Sciences, CN; 2Institute of Automation, CN Abstract Dynamic inference is a compression method that adaptively prunes unimportant components according to the input at the inference stage, which can achieve a better trade-off between computational complexity and model accuracy than static compression methods. However, there are two limitations in previous works. The first is that they usually need to search for a threshold on the evaluation dataset to achieve the target compression ratio, and this search process is non-trivial. The second is that these methods are unstable: their performance degrades significantly on some datasets, especially when the compression ratio is high. In this paper, we propose TBERT, a simple yet stable dynamic inference method. TBERT utilizes a top-k-based pruning strategy, which allows accurate control of the compression ratio. To enable stable end-to-end training of the model, we carefully design the structure of the predictor. Moreover, we propose adding auxiliary classifiers to aid the model's training. Experimental results on the GLUE benchmark demonstrate that our method achieves higher performance than previous state-of-the-art methods. |
11:12 CET | SA2.5 | TOKEN ADAPTIVE VISION TRANSFORMER WITH EFFICIENT DEPLOYMENT FOR FINE-GRAINED IMAGE RECOGNITION Speaker: Chonghan Lee, Pennsylvania State University, US Authors: Chonghan Lee1, Rita Brufau2, Ke Ding2 and Vijaykrishnan Narayanan1 1Pennsylvania State University, US; 2Intel Labs, US Abstract Fine-grained Visual Classification (FGVC) aims to distinguish object classes belonging to the same category, e.g., different bird species or models of vehicles. The task is more challenging than ordinary image classification due to the subtle inter-class differences. Recent works proposed deep learning models based on the vision transformer (ViT) architecture with its self-attention mechanism to locate important regions of the objects and derive global information. However, deploying them on resource-restricted internet of things (IoT) devices is challenging due to their intensive computational cost and memory footprint. Energy and power consumption varies in different IoT devices. To improve their inference efficiency, previous approaches require manually designing the model architecture and training a separate model for each computational budget. In this work, we propose Token Adaptive Vision Transformer (TAVT) that dynamically drops out tokens and can be used for various inference scenarios across many IoT devices after training the model once. Our adaptive model can switch among different token drop configurations at run time, providing instant accuracy-efficiency trade-offs. We train a vision transformer with a progressive token pruning scheme, eliminating a large number of redundant tokens in the later layers. We then conduct a multi-objective evolutionary search with the overall number of floating point operations (FLOPs) as its efficiency constraint that could be translated to energy consumption and power to find the token pruning schemes that maximize accuracy and efficiency under various computational budgets. Empirical results show that our proposed TAVT dramatically speeds up the inference latency by up to 10x and reduces memory requirements and FLOPs by up to 5.5 x and 13x respectively while achieving competitive accuracy compared to prior ViT-based state-of-the-art approaches. |
11:15 CET | SA2.6 | END-TO-END OPTIMIZATION OF HIGH-DENSITY E-SKIN DESIGN: FROM SPIKING TAXEL READOUT TO TEXTURE CLASSIFICATION Speaker: Jiaqi Wang, KU Leuven, CN Authors: Jiaqi Wang, Mark Daniel Alea, Jonah Van Assche and Georges Gielen, KU Leuven, BE Abstract Spiking readout architectures are a promising low-power solution for high-density e-skins. This paper proposes the end-to-end model-based optimization of a high-density neuromorphic e-skin solution, from the taxel readout to the texture classification. Architectural explorations include the spike coding and preprocessing, and the neural network used for classification. Simple rate-coding preprocessing of spiking outputs from a modeled low-resolution on-chip spike encoder is demonstrated to achieve a comparable texture classification accuracy of 90% at lower power consumption compared to the state of the art. The modeling has also been extended from single-channel sensor recording to time-shifted multi-taxel readout. Applying this optimization to an actual tactile sensor array, the classification accuracy is boosted by 63% for a low-cost FFNN using multi-taxel data. The proposed Spike-based SNR (SSNR) and Spike Time Error (STE) metrics for the taxel readout circuitry are shown to be good predictors of the accuracy. |
11:18 CET | SA2.7 | TOWARDS DEEP LEARNING-BASED OCCUPANCY DETECTION VIA WIFI SENSING IN UNCONSTRAINED ENVIRONMENTS Speaker: Cristian Turetta, Università di Verona, IT Authors: Cristian Turetta1, Geri Skenderi1, Luigi Capogrosso1, Florenc Demrozi2, Philipp H. Kindt3, Alejandro Masrur4, Franco Fummi1, Marco Cristani1 and Graziano Pravadelli1 1Università di Verona, IT; 2Department of Electrical Engineering and Computer Science, University of Stavanger, NO; 3Lehrstuhl für Realzeit-Computersysteme (RCS), TU München (TUM), DE; 4TU Chemnitz, DE Abstract In the context of smart buildings and smart cities, the design of low-cost and privacy-aware solutions for recognizing the presence of humans and their activities is becoming of great interest. Existing solutions exploiting wearables and video-based systems have several drawbacks, such as high cost, low usability, poor portability, and privacy-related issues. Consequently, more ubiquitous and accessible solutions became the focus of attention, such as WiFi sensing. However, at the current state of the art, WiFi sensing is subject to low accuracy and poor generalization, primarily affected by environmental factors, such as humidity and temperature variations and furniture position changes. Such issues are partially solved at the cost of complex data preprocessing pipelines. In this paper, we present a highly accurate, resource-efficient occupancy detection solution based on deep learning, which is resilient to variations in humidity and temperature. The approach is tested on an extensive benchmark, where people are free to move and the furniture layout does change. In addition, based on a consolidated algorithm of explainable AI, we quantify the importance of the WiFi signal w.r.t. humidity and temperature for the proposed approach. Notably, humidity and temperature can indeed be predicted based on WiFi signals; this promotes the expressivity of the WiFi signal, and at the same time the need for a non-linear model to properly deal with it. |
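The top-k based pruning described in SA2.4 above boils down to keeping only the highest-scoring token positions, which fixes the compression ratio by construction. A minimal sketch of that selection step is given below; the predictor scores, tensor shapes and keep count are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def topk_prune(hidden, scores, k):
    """Keep only the k highest-scoring token positions (illustrative).

    hidden : (seq_len, dim) token representations
    scores : (seq_len,) importance scores from a (hypothetical) predictor
    k      : number of tokens to keep; directly controls the compression ratio
    """
    keep = np.argsort(scores)[-k:]   # indices of the k highest-scoring tokens
    keep.sort()                      # preserve the original token order
    return hidden[keep], keep

# Toy example: 8 tokens with 4-dimensional features.
rng = np.random.default_rng(0)
hidden = rng.normal(size=(8, 4))
scores = rng.uniform(size=8)                     # stand-in for a learned predictor
pruned, kept = topk_prune(hidden, scores, k=4)   # 50% token compression by construction
print(kept, pruned.shape)
```

Because k is chosen explicitly, no threshold search over the evaluation set is needed, which is the property the abstract emphasises.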
Extended Abstracts
Time | Label | Presentation Title Authors |
---|---|---|
11:21 CET | SA2.8 | CONTENT- AND LIGHTING-AWARE ADAPTIVE BRIGHTNESS SCALING FOR IMPROVED MOBILE USER EXPERIENCE Speaker: Samuel Isuwa, University of Southampton, GB Authors: Samuel Isuwa1, David Amos2, Amit Kumar Singh3, Bashir Al-Hashimi1 and Geoff Merrett1 1University of Southampton, GB; 2University of Maiduguri, NG; 3University of Essex, GB Abstract For an improved user experience, the display subsystem is expected to provide superior resolution and optimal brightness despite its impact on battery life. Existing brightness scaling approaches set the display brightness statically or adaptively in response to predefined events such as low battery or the ambient light of the environment, which are independent of the displayed content. Approaches that consider the displayed content are either limited to video content or do not account for the user's expected battery life, thereby failing to maximise the user experience. This paper proposes content- and ambient lighting-aware adaptive brightness scaling in mobile devices that maximises user experience while meeting battery life expectations. The approach employs a content- and ambient lighting-aware profiler that learns and classifies each sample into predefined clusters at runtime by leveraging insights on user perceptions of content and ambient luminance variations. We maximise user experience through adaptive scaling of the display's brightness using an energy prediction model that determines appropriate brightness levels while meeting expected battery life. Evaluation of the proposed approach on a commercial smartphone shows a Quality of Experience (QoE) improvement of up to 24.5% compared to the state of the art. |
11:21 CET | SA2.9 | TOWARDS SMART CATTLE FARMS: AUTOMATED INSPECTION OF CATTLE HEALTH WITH REAL-LIFE DATA Speaker: Yigit Tuncel, University of Wisconsin-Madison, US Authors: Yigit Tuncel1, Toygun Basaklar1, Mackenzie Smithyman2, Vinicius Nunes de Gouvea3, Joao Dorea1, Younghyun Kim4 and Umit Ogras1 1University of Wisconsin - Madison, US; 2New Mexico State University, US; 3Texas A&M University, US; 4University of Wisconsin-Madison, US Abstract Cattle health problems, such as Bovine Respiratory Disease (BRD), constitute a significant source of economic loss for the agriculture industry. The current management practice for diagnosing and selecting cattle for treatment is a widespread clinical scoring system called DART (Depression, Appetite, Respiration, and Temperature). DART requires significant manual human labor since each animal is evaluated individually. We propose a novel wearable accelerometer-based IoT system that predicts the DART scores to provide real-time animal health monitoring and hence reduce the labor and costs associated with manual animal inspection and intervention. The proposed system first processes the accelerometer data to construct features that encode the cattle's daily behavior. Then, it uses a lightweight decision-tree classifier to predict the DART score. We evaluate our approach on a dataset that consists of accelerometer data and veterinarian-approved DART scores for 54 animals. According to the results, the proposed system can classify healthy and sick animals with 78% accuracy. Furthermore, our approach outperforms 13 commonly used state-of-the-art time-series classifiers in terms of both accuracy and computational complexity. With 1 KB SRAM usage and less than 29 µJ energy consumption per day, it enables an easily deployable IoT solution for smart farms. (An illustrative decision-tree sketch follows this table.) |
11:21 CET | SA2.10 | TIME SERIES-BASED DRIVING EVENT RECOGNITION FOR TWO WHEELERS Speaker: Sai Usha Goparaju, International Institute of Information Technology, IN Authors: Sai Usha Goparaju1, Lakshmanan L2, Abhinav Navnit2, Rahul Biju1, Lovish Bajaj3, Deepak Gangadharan4 and Aftab Hussain4 1International Institute of Information and Technology, IN; 2International Institute of Information Technology, IN; 3Manipal Academy of Higher Education, IN; 4IIIT Hyderabad, IN Abstract Classification of a motorcycle's driving events can provide deep insights to detect issues related to driver safety. Safety of two-wheelers is a less studied problem, and we attempt to address this gap by providing a learning-based solution to classify driving events. Firstly, we developed a hardware system with 3-D accelerometer/gyroscope sensors that can be deployed on a motorcycle. The data obtained from these sensors is used to identify various driving events. We have investigated several machine learning (ML) models to classify driving events. However, in this process, we identified that though the overall accuracy of these traditional ML models is decent enough, their class-wise accuracy is poor. Hence, we have developed time-series-based classification algorithms using LSTM and Bi-LSTM to classify various driving events. We have also embedded an attention mechanism in the architecture of these models for enhanced feature learning, thus improving the accuracy of event recognition. The experiments conducted have demonstrated that the proposed models surpass the state-of-the-art models in the context of driving event recognition with reasonable class-wise accuracies. We have also deployed these models on edge devices such as the Raspberry Pi and ESP32 and successfully reproduced the prediction accuracies on these devices. The experiments demonstrated that the proposed Bi-LSTM model showed a minimum of 87% accuracy and a maximum of 99% accuracy in class-wise prediction on a two-wheeler driving dataset. |
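The SA2.9 pipeline pairs daily-behaviour features derived from accelerometer data with a lightweight decision-tree classifier. The sketch below illustrates that final classification step only; the feature names, synthetic data and tree depth are invented for illustration and are not the paper's feature set or model.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(1)

# Illustrative daily-behaviour features per animal (hypothetical, not the paper's):
# [activity_level, feeding_time_h, rumination_time_h]
healthy = rng.normal(loc=[1.0, 4.0, 7.0], scale=0.3, size=(40, 3))
sick    = rng.normal(loc=[0.6, 2.5, 5.0], scale=0.3, size=(40, 3))
X = np.vstack([healthy, sick])
y = np.array([0] * 40 + [1] * 40)   # 0 = healthy, 1 = sick

# A shallow tree keeps the model cheap enough for a microcontroller-class device.
clf = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)
print("training accuracy:", clf.score(X, y))
print("prediction for one new animal:", clf.predict([[0.7, 2.8, 5.2]]))
```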
SD1 System modelling, simulation, and validation
Add this session to my calendar
Date: Wednesday, 19 April 2023
Time: 11:00 CET - 12:30 CET
Location / Room: Gorilla Room 1.5.1
Session chair:
Christian Pilato, Politecnico di Milano, IT
11:00 CET until 11:24 CET: Pitches of regular papers
11:24 CET until 12:30 CET: Interactive technical presentations by the authors of regular papers and extended abstracts
Regular Papers
Time | Label | Presentation Title Authors |
---|---|---|
11:00 CET | SD1.1 | SPATIO-TEMPORAL MODELING FOR FLASH MEMORY CHANNELS USING CONDITIONAL GENERATIVE NETS Speaker: Paul Siegel, University of California, San Diego, US Authors: Simeng Zheng, Chih-Hui Ho, Wenyu Peng and Paul Siegel, University of California, San Diego, US Abstract Modeling spatio-temporal read voltages with complex distortions arising from the write and read mechanisms in flash memory devices is essential for the design of signal processing and coding algorithms. In this work, we propose a data-driven approach to modeling NAND flash memory read voltages in both space and time using conditional generative networks. This generative flash modeling (GFM) method reconstructs read voltages from an individual memory cell based on the program levels of the cell and its surrounding cells, as well as the time stamp. We evaluate the model over a range of time stamps using the cell read voltage distributions, the cell level error rates, and the relative frequency of errors for patterns most susceptible to inter-cell interference (ICI) effects. Experimental results demonstrate that the model accurately captures the complex spatial and temporal features of the flash memory channel. |
11:03 CET | SD1.2 | EFFICIENT APPROXIMATION OF PERFORMANCE SPACES FOR ANALOG CIRCUITS VIA MULTI-OBJECTIVE OPTIMIZATION Speaker: Benedikt Ohse, Ernst-Abbe-Hochschule Jena, DE Authors: Benedikt Ohse1, David Schreiber2, Juergen Kampe1 and Christopher Schneider1 1Ernst-Abbe-Hochschule Jena, DE; 2University of Applied Sciences Jena, DE Abstract This paper presents an adaptation of the well-known normal boundary intersection (NBI) method for approximating complete feasible performance spaces of analog integrated circuits. Those spaces provide accurate information about all feasible combinations of competing performance parameters in a circuit. While the NBI method was originally designed only for computing the so-called Pareto front of a multi-objective optimization problem, it can be adapted with some modifications to approximate the complete performance space. A scalarization into single-objective optimization problems is performed within our developed tool, which can be connected to any SPICE-based simulator. Besides presenting the algorithm and its adaptations, the focus lies on investigating parallelization techniques and their effect on decreasing the computational time. Numerical experiments show the computed approximations of two- and three-dimensional performance spaces of several OTAs and compare the efficiency of different parallelization schemes. (An illustrative scalarization sketch follows this table.) |
11:06 CET | SD1.3 | MULTIDIMENSIONAL FEATURES HELPING PREDICT FAILURES IN PRODUCTION SSD-BASED CONSUMER STORAGE SYSTEMS Speaker: Xinyan Zhang, Huazhong University of Science & Technology, CN Authors: Xinyan Zhang1, Zhipeng Tan1, Dan Feng1, Qiang He1, Ju Wan1, Hao Jiang2, Ji Zhang2, Lihua Yang1 and Wenjie Qi1 1Wuhan National Laboratory for Optoelectronics, CN | Huazhong University of Science & Technology, CN; 2Huawei Technologies, CN Abstract As SSD failures seriously lead to data loss and service interruption, proactive failure prediction is often used to improve system availability. However, the unidimensional SMART-based prediction models hardly predict all drive failures. Some other features applied in data centers and enterprise storage systems are not readily available in consumer storage systems (CSS). To further analyze related failures in production SSD-based CSS, we study nearly 2.3 million SSDs from 12 drive models based on a dataset of SMART logs, trouble tickets, and error logs. We discover that SMART, FirmwareVersion, WindowsEvent, and BlueScreenofDeath (SFWB) are closely related to SSD failures. We further propose a multidimensional-based failure prediction approach (MFPA), which is portable in algorithms, SSD vendors, and PC manufacturers. Experiments on the datasets show that MFPA achieves a high true positive rate (98.18%) and low false positive rate (0.56%), which is 4% higher and 86% lower than the SMART-based model. It is robust and can continuously predict for 2-3 months without iteration, substantially improving the system availability. |
11:09 CET | SD1.4 | PAR-GEM5: PARALLELIZING GEM5'S ATOMIC MODE Speaker: Niko Zurstraßen, RWTH Aachen University, DE Authors: Niko Zurstraßen1, Jose Cubero-Cascante2, Jan Moritz Joseph2, Rainer Leupers2, Xie Xinghua3 and Li Yichao3 1RWTH Aachen Institute for Communication Technologies and Embedded Systems, DE; 2RWTH Aachen University, DE; 3Huawei Technologies, CN Abstract While the complexity of MPSoCs continues to grow exponentially, their often sequential simulations could only benefit from a linear performance gain since the end of Dennard scaling. As a result, each new generation of MPSoCs requires ever longer simulation times. In this paper, we propose a solution to this problem: par-gem5 - the first universally parallelized version of the Full-System Simulator (FSS) gem5. It exploits the host system's multi-threading capabilities using a modified conservative, quantum-based Parallel Discrete Event Simulation. Compared to other parallel approaches, par-gem5 uses relaxed causality constraints, allowing temporal errors to occur. Yet, we show that the system's functionality is retained, and the inaccuracy of simulation statistics, such as simulation time or cache miss rate, can be kept within a single-digit percentage. Furthermore, we extend par-gem5 by a temporal error estimation that assesses the accuracy of a simulation without a sequential reference simulation. Our experiments reached speedups of 24.7x when simulating a 128-core ARM-based MPSoC on a 128-core host system. |
11:12 CET | SD1.5 | FAST BEHAVIOURAL RTL SIMULATION OF 10B TRANSISTOR SOC DESIGNS WITH METRO-MPI Speaker: Guillem López Paradís, BSC & UPC, ES Authors: Guillem López-Paradís1, Brian Li2, Adrià Armejach3, Stefan Wallentowitz4, Miquel Moreto1 and Jonathan Balkind2 1BSC, ES; 2University of California, Santa Barbara, US; 3BSC & UPC, ES; 4Munich University of Applied Sciences, DE Abstract Chips with tens of billions of transistors have become today's norm. These designs are straining our electronic design automation tools throughout the design process, requiring ever more computational resources. In many tools, parallelisation has improved both latency and throughput for the designer's benefit. However, tools largely remain restricted to a single machine and in the case of RTL simulation, we believe that this leaves much potential performance on the table. We introduce Metro-MPI to improve RTL simulation for modern 10 billion transistor-scale chips. Metro-MPI exploits the natural boundaries present in chip designs to partition RTL simulations and leverage High Performance Computing (HPC) techniques to extract parallelism. For chip designs that scale in size by exploiting latency-insensitive interfaces like networks-on-chip and AXI, Metro-MPI offers a new paradigm for RTL simulation scalability. Our implementation of Metro-MPI in OpenPiton+Ariane delivers 2.7 MIPS of RTL simulation throughput for the first time on a design with more than 10 billion transistors and 1,024 Linux-capable cores, opening new avenues for distributed RTL simulation of emerging system-on-chip designs. Compared to sequential and multithreaded RTL simulations of smaller designs, Metro-MPI achieves up to 135.98× and 9.29× speedups. Similarly, for a representative regression run, Metro-MPI reduces energy consumption by up to 2.53× and 2.91×. |
11:15 CET | SD1.6 | DYNAMIC REFINEMENT OF HARDWARE ASSERTION CHECKERS Speaker: Hasini Dilanka Witharana, University of Florida, US Authors: Hasini Witharana, Sahan Sanjaya and Prabhat Mishra, University of Florida, US Abstract Post-silicon validation is a vital step in System-on-Chip (SoC) design cycle. A major challenge in post-silicon validation is the limited observability of internal signal states using trace buffers. Hardware assertions are promising to improve the observability during post-silicon debug. Unfortunately, we cannot synthesize thousands (or millions) of pre-silicon assertions as hardware checkers (coverage monitors) due to hardware overhead constraints. Prior efforts considered synthesis of a small set of checkers based on design constraints. However, these design constraints can change dynamically during the device lifetime due to changes in use-case scenarios as well as input variations. In this paper, we explore dynamic refinement of hardware checkers based on changing design constraints. Specifically, we propose a cost-based assertion selection framework that utilizes non-linear optimization as well as machine learning. Experimental results demonstrate that our machine learning model can accurately predict the area (less than 5% error) and power consumption (less than 3% error) of hardware checkers at runtime. This accurate prediction enables close-to-optimal dynamic refinement of checkers based on design constraints. |
11:18 CET | SD1.7 | STSEARCH: STATE TRACING-BASED SEARCH HEURISTICS FOR RTL VALIDATION Speaker: Ziyue Zheng, Hong Kong University of Science and Technology, CN Authors: Ziyue Zheng and Yangdi Lyu, Hong Kong University of Science and Technology, CN Abstract Branch coverage is important in the functional validation of Register-Transfer-Level (RTL) models. While random tests can cover the majority of easy-to-reach branches, there are still many hard-to-activate branches in today's industrial designs. These remaining corner branches are typically the source of bugs and hardware Trojans. Directed test generation approaches using formal methods effectively activate a specific branch but are limited by the state explosion problem. Semi-formal methods, such as concolic testing, improve the scalability by exploring one path at a time. This paper presents a novel concolic testing framework to exercise the corner branches through state tracing-based search heuristics (STSearch). The proposed approach heuristically generates and evaluates input sequences based on a novel heuristic indicator that evaluates the distance between the current state and the target branch condition. The heuristic indicator is designed to utilize both the static structural property of the design and the state from dynamic simulation. Compared to the existing concolic testing approaches, where a full new path is generated in each round by solving path constraints, the cycle-based heuristic search in the proposed approach is more effective and efficient. Experimental results show that our approach significantly outperforms the state-of-the-art approaches in both running time and memory usage. |
11:21 CET | SD1.8 | EF2LOWSIM: SYSTEM-LEVEL SIMULATOR OF EFLASH-BASED COMPUTE-IN-MEMORY ACCELERATORS FOR CONVOLUTIONAL NEURAL NETWORKS Speaker: Jooho Wang, Department of Electrical and Electronics Engineering, Konkuk University, Memory Business, Samsung Electronics, Inc., KR Authors: Jooho Wang, Sunwoo Kim, Junsu Heo and Chester Park, Konkuk University, KR Abstract A new system-level simulator, eF2lowSim, is proposed to estimate the bit-accurate and cycle-accurate performance of eFlash compute-in-memory (CIM) accelerators for convolutional neural networks. The eF2lowSim can predict the inference accuracy by considering the impact of circuit nonideality such as program disturbance. Moreover, the eF2lowSim can also evaluate the system-level performance of dataflow strategies that have a significant impact on hardware area and performance of eFlash CIM accelerators. The simulator helps to find the optimal dataflow strategy of an eFlash CIM accelerator for each convolutional layer. It is shown that the improvement of area efficiency amounts to 26.8%, 21.2% and 17.9% in the case of LeNet-5, VGG-9 and ResNet-18, respectively. |
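SD1.2 above adapts the normal boundary intersection method, which repeatedly turns the multi-objective problem into single-objective sub-problems whose solutions trace out the boundary of the feasible performance space. The sketch below illustrates only the general scalarization idea on a closed-form toy "circuit", using a simple weighted-sum sweep rather than true NBI sub-problems and with no SPICE simulator in the loop; the performance models and all numbers are illustrative assumptions.

```python
import numpy as np
from scipy.optimize import minimize

# Toy "circuit": one design variable x[0] (e.g., a bias setting in [0, 1]).
# Two competing performances: power grows with x, noise shrinks with x.
power = lambda x: 1.0 + 4.0 * x[0] ** 2
noise = lambda x: 1.0 / (0.2 + x[0])

front = []
for w in np.linspace(0.05, 0.95, 10):        # sweep of scalarization weights
    # Weighted-sum scalarization into a single-objective problem.
    obj = lambda x, w=w: w * power(x) + (1 - w) * noise(x)
    res = minimize(obj, x0=[0.5], bounds=[(0.0, 1.0)])
    front.append((power(res.x), noise(res.x)))

for p, n in front:
    print(f"power = {p:.3f}  noise = {n:.3f}")
```

Each weight yields one boundary point, and the independent single-objective solves are what make the parallelization studied in the paper natural.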
Extended Abstracts
Time | Label | Presentation Title Authors |
---|---|---|
11:24 CET | SD1.9 | STRUCTURAL GENERATION OF VIRTUAL PROTOTYPES FOR SMART SENSOR DEVELOPMENT IN SYSTEMC-AMS FROM SIMULINK MODELS Speaker: Alexandra Küster, Bosch Sensortec GmbH, DE Authors: Alexandra Kuester1, Rainer Dorsch1 and Christian Haubelt2 1Bosch Sensortec GmbH, DE; 2University of Rostock, DE Abstract We present a flow to reuse system-level analog/mixed-signal (AMS) models developed in MATLAB/Simulink for the extension of virtual prototypes in SystemC. To prevent time-consuming co-simulation, our flow translates the Simulink model into an equivalent SystemC-AMS model. Translation is supported either by wrapping code generated by MATLAB's Embedded Coder or by instantiating previously generated models. Thus, a one-to-one mapping of the model's hierarchy is possible which allows deep insights into the architecture and good traceability. The conducted case study on an accelerometer model shows the applicability of our approach. The generated hierarchical model is half as fast as a monolithic version but allows better observability and traceability of the system. It is tens of times faster than simulation in Simulink, thus especially faster than co-simulation. The extended virtual prototype aims to support software engineers during development and validation of firmware in smart sensors. |
11:24 CET | SD1.10 | A HARDWARE-SOFTWARE COOPERATIVE INTERVAL-REPLAYING FOR FPGA-BASED ARCHITECTURE EVALUATION Speaker: Hongwei Cui, School of Computer Science, Peking University, Beijing, CN Authors: Hongwei Cui, Shuhao Liang, Yujie Cui, Weiqi Zhang, Honglan Zhan, Chun Yang, Xianhua Liu and Xu Cheng, The School of Computer Science, Peking University, CN Abstract Open-source processors and FPGAs provide more realistic and accurate results for new microarchitecture designs, but the long execution time of large benchmarks on FPGA boards still hinders researchers. This paper proposes a hardware-software cooperative interval-replaying approach. It uses simulators to create checkpoints for arbitrary program intervals and provides an extensible and portable checkpoint loader to re-execute selected intervals. In addition, this paper extends the RISC-V ISA and proposes an event-based sampling design to find hot program intervals with more representative microarchitecture characteristics. By using checkpoints in hot regions, researchers can quickly verify the effectiveness of microarchitecture designs on FPGA and alleviate the speed bottleneck of FPGA. The correctness and effectiveness of the checkpoint scheme and the event-based sampling design are evaluated on FPGA. The experimental results show that the solution is effective. (A toy checkpoint-and-replay sketch follows this table.) |
11:24 CET | SD1.11 | FELOPI: A FRAMEWORK FOR SIMULATION AND EVALUATION OF POST-LAYOUT FILE AGAINST OPTICAL PROBING Speaker: Sajjad Parvin, University of Bremen, DE Authors: Sajjad Parvin1, Mehran Goli1, Frank Sill Torres2 and Rolf Drechsler3 1University of Bremen, DE; 2German Aerospace Center, DE; 3University of Bremen | DFKI, DE Abstract Optical Probing (OP) has been shown to be capable of retrieving intellectual property of the chips. However, to design a robust circuit against OP, the chip must be designed, fabricated, and optically probed in an experimental setup to determine the OP robustness of the design which is time consuming. To mitigate the aforementioned problems, we propose a simulation framework, namely FELOPi, which takes the layout file format of a design as an input and then performs OP on it. FELOPi can help designers to design robust circuits toward OP attacks before fabricating the chip. Hence, utilizing FELOPi results in tremendous time and cost reduction. |
11:24 CET | SD1.12 | QUO VADIS SIGNAL? AUTOMATED DIRECTIONALITY EXTRACTION FOR POST-PROGRAMMING VERIFICATION OF A TRANSISTOR-LEVEL PROGRAMMABLE FABRIC Speaker: Apurva Jain, University of Texas at Dallas, US Authors: Apurva Jain, Thomas Broadfoot, Yiorgos Makris and Carl Sechen, University of Texas at Dallas, US Abstract We discuss the challenges related to developing a post-programming verification solution for a TRAnsistor-level Programmable fabric (TRAP). Toward achieving high density, the TRAP architecture employs bidirectionally-operated pass transistors in the implementation of its logic and interconnect networks. While it is possible to model such transistors through appropriate primitives of hardware description languages (HDL) to enable simulation-based validation, Logic Equivalence Checking (LEC) methods and tools do not support such primitives. As a result, formally verifying the functionality programmed by a given bit-stream on TRAP is not innately possible. To address this limitation, we introduce a method for automatically determining the signal flow direction through bidirectional pass transistors for a given bit-stream and subsequently converting the HDL describing the programmed fabric to consist only of unidirectional transistors. Thereby, commercial EDA tools can be used to check logic equivalence between the transistor-level HDL describing the programmed fabric and the post-synthesis gate-level netlist. |
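The interval-replaying idea in SD1.10 above rests on saving architectural state at interval boundaries in a simulator and later re-executing only selected (hot) intervals from those checkpoints. The toy sketch below shows the checkpoint-and-replay mechanics on a trivial accumulator machine; the "ISA", interval length and program are purely illustrative and unrelated to the paper's RISC-V implementation.

```python
import copy

class ToyCore:
    """Minimal accumulator machine used only to illustrate checkpointing."""
    def __init__(self):
        self.state = {"pc": 0, "acc": 0}

    def step(self, program):
        op, arg = program[self.state["pc"]]
        if op == "add":
            self.state["acc"] += arg
        elif op == "mul":
            self.state["acc"] *= arg
        self.state["pc"] += 1

def run(core, program, n_steps, interval, checkpoints):
    for i in range(n_steps):
        if i % interval == 0:                       # interval boundary
            checkpoints[i] = copy.deepcopy(core.state)
        core.step(program)

program = [("add", 1), ("mul", 2), ("add", 3), ("mul", 2)] * 4
core, ckpts = ToyCore(), {}
run(core, program, n_steps=16, interval=4, checkpoints=ckpts)

# Replay only the "hot" interval starting at step 8 from its checkpoint,
# without re-executing the first 8 steps.
replay = ToyCore()
replay.state = copy.deepcopy(ckpts[8])
for _ in range(4):
    replay.step(program)
print("full-run acc:", core.state["acc"], "| end of replayed interval acc:", replay.state["acc"])
```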
SD8 Future memories
Add this session to my calendar
Date: Wednesday, 19 April 2023
Time: 11:00 CET - 12:30 CET
Location / Room: Okapi Room 0.8.3
Session chair:
Li Zhang, TU Darmstadt, DE
11:00 CET until 11:24 CET: Pitches of regular papers
11:24 CET until 12:30 CET: Interactive technical presentations by the authors of regular papers and extended abstracts
Regular Papers
Time | Label | Presentation Title Authors |
---|---|---|
11:00 CET | SD8.1 | OVERLAPIM: OVERLAP OPTIMIZATION FOR PROCESSING IN-MEMORY NEURAL NETWORK ACCELERATION Speaker: Xuan Wang, University of California, San Diego, US Authors: Minxuan Zhou, Xuan Wang and Tajana Rosing, University of California, San Diego, US Abstract Processing in-memory (PIM) can accelerate neural networks (NNs) thanks to its extensive parallelism and data-movement minimization. The performance of NN acceleration on PIM heavily depends on software-to-hardware mapping, which indicates the order and distribution of operations across the hardware resources. Previous works optimize the mapping problem by exploring the design space of per-layer and cross-layer data layout, achieving speedup over manually designed mappings. However, previous works do not consider computation overlapping across consecutive layers. By overlapping computation, we can process a layer before its preceding layer fully completes, decreasing the execution latency of the whole network. Mapping optimization without overlap analysis can result in sub-optimal performance. In this work, we propose OverlaPIM, a new framework that integrates overlap analysis with DNN mapping optimization on PIM architectures. OverlaPIM adopts several techniques to enable efficient overlap analysis and optimization for whole-network mapping on PIM architectures. We test OverlaPIM on popular DNN networks and compare the results to non-overlap optimization. Our experiments show that OverlaPIM can efficiently produce mappings that are 2.10× to 4.11× faster than the state-of-the-art mapping optimization framework. |
11:03 CET | SD8.2 | TAM: A COMPUTING IN MEMORY BASED ON TANDEM ARRAY WITHIN STT-MRAM FOR ENERGY-EFFICIENT ANALOG MAC OPERATION Speaker: Jinkai Wang, Beihang University, CN Authors: Jinkai Wang, Zhengkun Gu, Hongyu Wang, Zuolei Hao, Bojun Zhang, Weisheng Zhao and Yue Zhang, Beihang University, CN Abstract Computing in memory (CIM) has been demonstrated to be promising for energy-efficient computing. However, the dramatic growth of the data scale in neural network processors has created a demand for CIM architectures with higher bit density, for which the spin transfer torque magnetic RAM (STT-MRAM) with high bit density and performance arises as an up-and-coming candidate solution. In this work, we propose an analog CIM scheme based on a tandem array within STT-MRAM (TAM) to further improve energy efficiency while achieving high bit density. First, the resistance-summation-based analog MAC operation minimizes the effect of low tunnel magnetoresistance (TMR) through the serial magnetic tunnel junction (MTJ) structure in the proposed tandem array, with smaller area overhead. Moreover, a resistive-to-binary read scheme is designed to obtain the MAC results accurately and reliably. Besides, the data-dependent error caused by MTJs in series has been eliminated with a proposed dynamic selection circuit. Simulation results of a 2Kb TAM architecture show 113.2 TOPS/W and 63.7 TOPS/W for 4-bit and 8-bit input/weight precision, respectively, and a 39.3% reduction in bit-cell area compared with existing arrays of MTJs in series. |
11:06 CET | SD8.3 | OUT-OF-CHANNEL DATA PLACEMENT FOR BALANCING WEAR-OUT AND IO WORKLOADS IN RAID-ENABLED SSDS Speaker: Zhouxuan Peng, Wuhan National Laboratory for Optoelectronics, Key Laboratory of Information Storage System, MOE, Huazhong University of Science and Technology, CN Authors: Fan Yang, Chenqi Xiao, Jun Li, Zhibing Sha, Zhigang Cai and Jianwei Liao, Southwest University, CN Abstract SSDs with channel-level RAID implementations can tolerate channel failures inside SSDs, but suffer greatly from imbalanced wear-out (i.e., erase counts) and I/O workloads across the SSD channels, due to the in-channel updates of data/parity chunks of data stripes. This paper proposes exchanging the channel locations of data/parity chunks belonging to the same stripe when satisfying update (write) requests, termed out-of-channel data placement. Consequently, it can smooth wear-out and I/O workloads across SSD channels, thus reducing I/O response time. Through a series of emulation experiments on several realistic disk traces, we show that our proposal can greatly improve I/O performance, as well as noticeably balance the wear-out and I/O workloads, in contrast to related methods. (A toy wear-balancing sketch follows this table.) |
11:09 CET | SD8.4 | AGDM:AN ADAPTIVE GRANULARITY DATA MIGRATION STRATEGY FOR HYBRID MEMORY SYSTEMS Speaker: Zhouxuan Peng, Wuhan National Laboratory for Optoelectronics, Key Laboratory of Information Storage System of MoE, Huazhong University of Science and Technology., CN Authors: Zhouxuan Peng, Dan Feng, Jianxi Chen, Jing Hu and Chuang Huang, Wuhan National Laboratory for Optoelectronics, Key Laboratory of Information Storage System, MOE, Huazhong University of Science and Technology, Hubei, China, CN Abstract Hybrid memory systems show strong potential to satisfy the growing memory demands of modern applications by combining different memory technologies. Due to the different performance characteristics of hybrid memories, a data migration strategy that migrates hot data to a faster memory is critical to the overall performance. Prior works have focused on identifying hot data and migration decisions. However, we find that the fixed-size global migration granularity in existing data migration schemes results in suboptimal performance on most workloads. The key observation is that the optimal migration granularity varies with access patterns. This paper proposes AGDM, an access-pattern-aware Adaptive Granularity Data Migration strategy for hybrid memory systems. AGDM tracks memory access patterns at runtime and accordingly adopts the most appropriate migration mode and granularity. The novel remapping-migration decoupled metadata organization enables AGDM to set locally optimal granularities for memory regions with different access patterns. Our evaluation shows that, compared to the state-of-the-art scheme, AGDM gets an average performance improvement of 20.06% with 29.98% energy savings. |
11:12 CET | SD8.5 | P-PIM: A PARALLEL PROCESSING-IN-DRAM FRAMEWORK ENABLING ROWHAMMER PROTECTION Speaker: Shaahin Angizi, New Jersey Institute of Technology, US Authors: Ranyang Zhou1, Sepehr Tabrizchi2, Mehrdad Morsali1, Arman Roohi3 and Shaahin Angizi1 1New Jersey Institute of Technology, US; 2University of Nebraska–Lincoln, US; 3University of Nebraska - Lincoln, US Abstract In this work, we propose a Parallel Processing-In-DRAM architecture named P-PIM leveraging the high density of DRAM to enable fast and flexible computation. P-PIM enables bulk bit-wise in-DRAM logic between operands in the same bit-line by elevating the analog operation of the memory sub-array based on a novel dual-row activation mechanism. With this, P-PIM can opportunistically perform a complete and inexpensive in-DRAM RowHammer (RH) self-tracking and mitigation technique to protect the memory unit against such a challenging security vulnerability. Our results show that P-PIM achieves ~72% higher energy efficiency than the fastest charge-sharing-based designs. As for the RH protection, with a worst-case slowdown of ~0.8%, P-PIM achieves up to 71% energy savings over the SRAM/CAM-based frameworks and about 90% savings over DRAM-based frameworks. |
11:15 CET | SD8.6 | PRIVE: EFFICIENT RRAM PROGRAMMING WITH CHIP VERIFICATION FOR RRAM-BASED IN-MEMORY COMPUTING ACCELERATION Speaker: Jae-sun Seo, Arizona State University, US Authors: Wangxin He1, Jian Meng1, Sujan Gonugondla2, Shimeng Yu3, Naresh Shanbhag4 and Jae-sun Seo1 1Arizona State University, US; 2Amazon, US; 3Georgia Tech, US; 4University of Illinois at Urbana-Champaign, US Abstract As deep neural networks (DNNs) have been successfully developed in many applications with continuously increasing complexity, the number of weights in DNNs surges, leading to consistent demands for denser memories than SRAMs. RRAM-based in-memory computing (IMC) achieves high density and energy efficiency for DNN inference, but RRAM programming remains a bottleneck due to high write latency and energy consumption. In this work, we present the Progressive-wRite In-memory program-VErify (PRIVE) scheme, which we verify with an RRAM testchip for IMC-based hardware acceleration of DNNs. We optimize the progressive write operations on different bit positions of RRAM weights to enable error compensation and reduce programming latency/energy, while achieving high DNN accuracy. For 5-bit precision DNNs, PRIVE reduces the RRAM programming energy by 1.82X, while maintaining high accuracy of 91.91% (VGG-7) and 71.47% (ResNet-18) on the CIFAR-10 and CIFAR-100 datasets, respectively. |
11:18 CET | SD8.7 | END-TO-END DNN INFERENCE ON A MASSIVELY PARALLEL IN-MEMORY COMPUTING ARCHITECTURE Speaker: Nazareno Bruschi, Università di Bologna, IT Authors: Nazareno Bruschi1, Giuseppe Tagliavini1, Angelo Garofalo1, Francesco Conti1, Irem Boybat2, Luca Benini3 and Davide Rossi1 1Università di Bologna, IT; 2IBM Research Europe - Zurich, CH; 3ETH Zurich, CH | Università di Bologna, IT Abstract The demand for computation resources and energy efficiency of Convolutional Neural Networks (CNN) applications requires a new paradigm to overcome the "Memory Wall". Analog In-Memory Computing (AIMC) is a promising paradigm since it performs matrix-vector multiplications, the critical kernel of many ML applications, in-place in the analog domain within memory arrays structured as crossbars of memory cells. However, several factors limit the full exploitation of this technology, including the physical fabrication of the crossbar devices, which constrain the memory capacity of a single array. Multi-AIMC architectures have been proposed to overcome this limitation, but they have been demonstrated only for tiny and custom CNNs or performing some layers off-chip. In this work, we present the full inference of an end-to-end ResNet-18 DNN on a 512-cluster heterogeneous architecture coupling a mix of AIMC cores and digital RISC-V cores, achieving up to 20.2 TOPS. Moreover, we analyze the mapping of the network on the available non-volatile cells, compare it with state-of-the-art models, and derive guidelines for next-generation many-core architectures based on AIMC devices. |
11:21 CET | SD8.8 | UHS: AN ULTRA-FAST HYBRID STORAGE CONSOLIDATING NVM AND SSD IN PARALLEL Speaker: QingSong Zhu, Huazhong University of Science & Technology, CN Authors: Qingsong Zhu, Qiang Cao and Jie Yao, Huazhong University of Science & Technology, CN Abstract Non-Volatile Memory (NVM) with persistent and near-DRAM performance has been commonly used as first-level fast storage atop Solid-State Drives (SSDs) and Hard Disk Drives (HDDs), constituting a classic hierarchical architecture with high cost-performance. However, the NVM/SSD tiered storage overuses primary NVM with limited actual performance and under-utilizes secondary SSD with increasing bandwidth. Besides, NVM and SSD exhibit distinct I/O characteristics but are complementary in I/O pattern, which motivates us to design a superior hybrid storage that fully exploits NVM and SSD simultaneously. In this paper, we propose UHS, an Ultra-fast Hybrid Storage consolidating NVM and SSD to reap their respective merits with several key enabling techniques. First, UHS builds a uniform yet heterogeneous block-level storage view for the upper applications, e.g., file systems or key-value stores. UHS provides static address-mapping to explicitly partition the global block-space into coarse-grain NVM-zones and SSD-zones, which mainly serve the metadata and file data respectively. Second, UHS proposes a fine-grain request-level NVM buffer to dynamically absorb small file-writes at runtime and then migrates them to the SSDs in the background. Third, UHS designs I/O-affinity write allocation and hash-based buffer indexing to trade off the write gain and read cost of the NVM buffer. Finally, UHS designs a multi-thread I/O model to take full advantage of parallelism in both NVM and SSD. We implement UHS and evaluate it under a variety of workloads. The experiments show that UHS outperforms SSD, NVM, Bcache-writeback (a representative hierarchical storage), and Device-Mapper (a state-of-the-art hybrid storage) by up to 8X, 1.5X, 3.5X, and 6X respectively. |
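The out-of-channel placement proposed in SD8.3 above redirects an updated data/parity chunk to a different, less-worn channel instead of rewriting it in its original channel, so that erase wear and write traffic even out across channels. The sketch below is a toy model of that policy only; the stripe geometry, wear counter and workload are illustrative assumptions, not the paper's design.

```python
import random

N_CHANNELS = 4                      # SSD channels
CHUNKS_PER_STRIPE = 3               # e.g., 2 data + 1 parity (illustrative)
erase_count = [0] * N_CHANNELS      # per-channel wear proxy
# stripe id -> channel index holding each of its chunks
stripe_map = {s: list(range(CHUNKS_PER_STRIPE)) for s in range(8)}

def update_chunk(stripe, chunk, out_of_channel=True):
    """Rewrite one chunk of a stripe; optionally relocate it to the least-worn free channel."""
    if out_of_channel:
        used_by_others = set(stripe_map[stripe]) - {stripe_map[stripe][chunk]}
        candidates = [c for c in range(N_CHANNELS) if c not in used_by_others]
        stripe_map[stripe][chunk] = min(candidates, key=lambda c: erase_count[c])
    erase_count[stripe_map[stripe][chunk]] += 1     # the rewrite wears its channel

random.seed(0)
for _ in range(300):                # skewed update workload (parity chunk updated often)
    update_chunk(stripe=random.randrange(8), chunk=random.choice([0, 0, 2]))
print("per-channel erase counts:", erase_count)
```

With `out_of_channel=False` the parity-heavy channels accumulate most of the erases; with the relocation enabled the counts stay close to each other, which is the balancing effect the abstract describes.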
Extended Abstracts
Time | Label | Presentation Title Authors |
---|---|---|
11:24 CET | SD8.10 | OPTIMIZING DATA MIGRATION FOR GARBAGE COLLECTION IN ZNS SSDS Speaker: Zhenhua Tan, College of Computer Science and Technology, Chongqing University of Posts and Telecommunications, CN Authors: Zhenhua Tan1, Linbo Long2, Renping Liu3, Congming Gao4, Yi Jiang3 and Yan Liu3 1College of Computer Science and Technology of Chongqing University of Posts and Telecommunications, CN; 2College of Computer Science and Technology, Chongqing University of Posts and Telecommunications, CN; 3Chongqing University of Posts and Telecommunications, CN; 4Xiamen University, CN Abstract ZNS SSDs shift the responsibility of garbage collection (GC) to the host. However, data migration in GC needs to move data to the host's buffer first and write back to the new location, resulting in an unnecessary end-to-end transfer overhead. Moreover, due to the pre-configured mapping between zones and blocks, GC needs to perform a large number of unnecessary block-to-block data migrations between zones. To address these issues, this paper proposes a simple and efficient data migration method, called IS-AR, with in-storage data migration and address remapping. Based on a full-stack SSD emulator, our evaluation shows that IS-AR reduces GC latency by 6.78× and improves SSD lifetime by 1.17× on average. |
11:24 CET | SD8.11 | ENASA: TOWARDS EDGE NEURAL ARCHITECTURE SEARCH BASED ON CIM ACCELERATION Speaker: Shixin Zhao, Chinese Academy of Sciences, CN Authors: Shixin Zhao, Songyun Qu, Ying Wang and Yinhe Han, Chinese Academy of Sciences, CN Abstract This work proposes a ReRAM-based Computing-in-Memory (CIM) architecture for Neural Architecture Search (NAS) acceleration, ENASA, so that the compute-intensive NAS technology can be applied to various edge devices to customize the most suitable individual solution for their use cases. In the popular one-shot NAS process, the system must repetitively evaluate the sampled sub-network within a large-scale supernet before converging to the best sub-network architecture. Thereby, how these iterative network inference tasks are mapped onto the CIM arrays makes a big difference in system performance. To realize efficient in-memory supernet sampling and evaluation, we design a novel mapping method that tactically executes a group of sub-nets in the CIM arrays, not only to boost the sub-net concurrency but also to eliminate the repetitive operations shared by these subnets. Meanwhile, to further enhance the subnet-level operation concurrency and sharing in the CIM arrays, we also tailor a novel CIM-friendly one-shot NAS algorithm that purposely samples those operation-sharing subnets in each iteration while still maintaining the convergence performance of NAS. According to the experimental results, our CIM NAS accelerator achieves an improvement of 196.6× and 1200× in performance speedup and energy saving respectively compared to the CPU+GPU baseline. |
SpD4 Special day on Personalised Medicine: Intelligent and Autonomous Insideable Devices
Add this session to my calendar
Date: Wednesday, 19 April 2023
Time: 11:00 CET - 12:30 CET
Location / Room: Darwin Hall
Session chair:
Oliver Bringmann, University of Tübingen, DE
Insideable devices aim to improve diagnosis and tailor treatment for individual preventive health decisions and improve quality of life. This special session highlights three different approaches to insideable devices for the human body. The first talk will focus on the design of soft-bodied millirobots with adaptive locomotion in complex environments towards minimally invasive medical applications. The second talk will present AI-equipped ultra-low power diagnostic capsules for autonomous endoscopy. The last talk will discuss techniques for neuro-implants for seizure detection and epilepsy management.
Time | Label | Presentation Title Authors |
---|---|---|
11:00 CET | SpD4.1 | DESIGNING MINIATURE ROBOTS TOWARDS MEDICAL APPLICATIONS Presenter: Alp Karacakol, Max Planck Institute for Intelligent Systems, DE Author: Alp Karacakol, Max Planck Institute for Intelligent Systems, DE Abstract Miniature robots have the unique ability to navigate, manipulate and remain in risky and currently difficult-to-access small confined spaces within the human body, making them ideal candidates for the next generation of biomedical applications for drug delivery, embolization, clot lysis, and hyperthermia. This talk will first present a plethora of intelligently designed micro- to millimeter-scale robots, ranging from micro-rollers to soft robots, with medical application potential, followed by the challenges faced in fabricating, actuating, controlling, and designing these miniature robots to operate in realistic environments. The talk will continue with how these challenges are being addressed to bridge the gap between the presented state-of-the-art miniature robots and their intended real-world applications. The talk will conclude with a vision of how automation of design and control can facilitate the holistic approach to make the inaccessible accessible through miniature robots. |
11:30 CET | SpD4.2 | SMART DIAGNOSTIC CAPSULES FOR AUTONOMOUS ENDOSCOPY Presenter: Sebastian Schostek, Ovesco Endoscopy AG, DE Author: Sebastian Schostek, Ovesco Endoscopy AG, DE Abstract After more than 20 years of availability in the market, capsule endoscopy has become an indispensable part of clinical diagnostics in the gastrointestinal tract. It has also emerged as a very vibrant and innovative field for academic and industrial research. The pipeline of the market competitors and new entrants into this market is full of innovative, more powerful and better-performing solutions that utilize the technology advancements made during the last two decades. Adding the increasing availability and acceptance of AI and IoT solutions in clinical use to the mix, the field of capsule endoscopy is at a point at which technology leaps and disruptive innovations of both devices and clinical workflows are to be expected in the near future. This presentation illustrates the evolution of capsule endoscopy, covering its past, present and future. |
12:00 CET | SpD4.3 | WEARABLES AND IMPLANTABLES FOR INTELLIGENT AMBULATORY EPILEPSY MANAGEMENT Presenter: Christian Meisel, Charité Berlin, Department of Neurology with Experimental Neurology, DE Author: Christian Meisel, Charité Berlin, Department of Neurology with Experimental Neurology, DE Abstract Ambulatory monitoring in diseases like epilepsy is challenging as essential diagnostic tools, including methods for reliable seizure detection and seizure risk evaluation to monitor treatment efficacy, are still missing. This lack of objective diagnostics constitutes a significant barrier to better treatments, raises methodological concerns about the antiseizure medication evaluation process and, to patients, is a main issue contributing to the disease burden. Recent years have seen rapid progress towards novel implantable and wearable sensor systems that meet these needs of epilepsy patients and clinicians. These novel sensors, however, require intelligent and robust analytics applicable to multimodal, long-term data. In this talk I will discuss some of the recent developments in the field of ambulatory epilepsy monitoring with focus on implantable and wearable sensors systems and their related artificial intelligence methods. |
ST1 Design and Test of Mixed-Signal Circuits and Memories
Add this session to my calendar
Date: Wednesday, 19 April 2023
Time: 11:00 CET - 12:30 CET
Location / Room: Okapi Room 0.8.2
Session chair:
Haralampos Stratigopoulos, Sorbonne, FR
11:00 CET until 11:24 CET: Pitches of regular papers
11:24 CET until 12:30 CET: Interactive technical presentations by the authors of regular papers and extended abstracts
Regular Papers
Time | Label | Presentation Title Authors |
---|---|---|
11:00 CET | ST1.1 | POST-SILICON OPTIMIZATION OF A HIGHLY PROGRAMMABLE 64-MHZ PLL ACHIEVING 2.7-5.7µW Speaker: Marco Gonzalez, ICTEAM Institute, UCLouvain, BE Authors: Marco Gonzalez and David Bol, UCLouvain, BE Abstract Hierarchical optimization methods used in the design of complex mixed-signal systems require accurate behavioral models to avoid the long simulation times of transistor-level SPICE simulations of the whole system. However, robust behavioral models that accurately model circuit non-idealities and their complex interactions must be very complex themselves and are hardly achievable. Post-silicon tuning, which is already widely used for the calibration of analog building blocks, is an interesting alternative to speed up the optimization of these complex systems. However, post-silicon tuning usually focuses on single-objective problems in blocks with a limited number of degrees of freedom. In this paper, we propose a post-silicon "hardware-in-the-loop” optimization method to solve multi-objective problems in mixed-signal systems with numerous degrees of freedom. We use this method to optimize the noise-power trade-off of a 64-MHz phase-locked loop (PLL) based on a back-bias-controlled ring oscillator. A genetic algorithm was run based on measurements of the 22-nm fully-depleted silicon-on-insulator prototype to find the Pareto-optimal configurations in terms of power and long-term jitter. The obtained Pareto front gives a range of power consumption between 2.7 and 5.7 μW, corresponding to an RMS long-term jitter between 88 and 45 ns. Whereas the simulation-based optimization would require more than a year using the genetic algorithm based on SPICE simulations, we conducted the post-silicon optimization in only 17 h. |
11:03 CET | ST1.2 | ANALOG COVERAGE-DRIVEN SELECTION OF SIMULATION CORNERS FOR AMS INTEGRATED CIRCUITS Speaker: Pallab Dasgupta, Indian Institute of Technology Kharagpur, IN Authors: Sayandeep Sanyal1, Aritra Hazra1, Pallab Dasgupta1, Scott Morrison2, Sudhakar S3, Lakshmanan Balasubramanian4 and Moshiur Rahman2 1IIT Kharagpur, IN; 2Texas Instruments, US; 3Texas Instruments, IN; 4Texas Instruments (India) Pvt. Ltd., IN Abstract Integrated circuit designs are evaluated at various corners defined by choices of the design and process parameters. Considering the large number of corners and the simulation cost of covering all the corners of a large design, it is desirable to identify a subset of the corners that can potentially expose corner-case bugs. In an integrated analog coverage management framework, this choice may be influenced by those corners that take one or more component analog IPs close to their individual specification boundaries. Since the admissible state space of an analog IP is multi-dimensional, the same corner may not reach the extreme behaviors for each attribute of the specification, and one needs to identify a subset that covers the extremity. This paper shows that the underlying problem is NP-hard and presents an automated methodology for selecting the corners. A formal analog coverage specification is leveraged by our algorithm, which uses a Satisfiability Modulo Theory (SMT) solver to identify the appropriate corners from the output of multiple Monte Carlo (MC) simulations. The efficacy of the proposed approach is demonstrated on industrial test cases. (A greedy-covering sketch of the underlying selection problem follows this table.) |
11:06 CET | ST1.3 | FAST PERFORMANCE EVALUATION METHODOLOGY FOR HIGH-SPEED MEMORY INTERFACES Speaker: Taehoon Kim, Seoul National University, KR Authors: Taehoon Kim, Yoona Lee and Woo-Seok Choi, Seoul National University, KR Abstract An increase in the data rate of memory interfaces causes higher inter-symbol interference (ISI). To mitigate ISI, recent high-speed memory interfaces have started employing complex datapath, utilizing equalization techniques such as continuous-time linear equalizer and decision-feedback equalizer. This incurs huge overhead for design verification with conventional methods using transient simulation. This paper proposes a fast and accurate verification methodology to evaluate the voltage and timing margin of the interface, based on the impulse sensitivity function. To take nonlinear circuit behavior into account, the small- and large-signal responses were separately calculated to improve accuracy, using the data obtained from the periodic AC and periodic steady-state analyses. This approach achieves high accuracy, with shmoo similarity rates of over 95%, while also significantly reducing verification time, up to 23x faster. Moreover, two different methods are proposed for evaluating the multi-stage Rx performance, providing a trade-off between accuracy and efficiency that can be tailored to the specific purpose, e.g., the verification or design process. |
11:09 CET | ST1.4 | EQUIVALENCE CHECKING OF SYSTEM-LEVEL AND SPICE-LEVEL MODELS OF STATIC NONLINEAR CIRCUITS Speaker: Kemal Çağlar Coşkun, Institute of Computer Science, University of Bremen, DE Authors: Kemal Çağlar Coşkun1, Muhammad Hassan2 and Rolf Drechsler1 1Institute of Computer Science, University of Bremen, DE; 2DFKI, DE Abstract Recently, Signal Flow Graphs (SFGs) have been successfully leveraged to show equivalence for linear analog circuits at system-level and SPICE-level. However, this is clearly not sufficient as the true complexity stems from nonlinear analog circuits. In this paper, we go beyond linear analog circuits, i.e., we extend the SFGs and develop the Modified Signal-Flow Graph (MSFG), to show equivalence between system-level and SPICE-level representations of static nonlinear analog circuits. First, we map the nonlinear circuits to MSFGs. Afterwards, graph simplification and functional approximation (in particular Legendre polynomials) techniques are used to create minimal MSFG and canonical MSFG. This enables us to compare the MSFGs even if they have vastly different structures. Finally, we propose a similarity metric that calculates the similarity between SPICE-level and system-level models. By successfully applying the proposed equivalence checking technique to benchmark circuits, we demonstrate its applicability. |
11:12 CET | ST1.5 | ELECTROMIGRATION-AWARE DESIGN TECHNOLOGY CO-OPTIMIZATION FOR SRAM IN ADVANCED TECHNOLOGY NODES Speaker: Mahta Mayahinia, Karlsruhe Institute of Technology (KIT), DE Authors: Mahta Mayahinia1, Hsiao-Hsuan Liu2, Subrat Mishra2, Zsolt Tokei2, Francky Catthoor2 and Mehdi Tahoori1 1Karlsruhe Institute of Technology, DE; 2IMEC, BE Abstract Static RAM (SRAM) is one of the critical components in advanced VLSI systems, whose performance, capacity, and reliability have a decisive impact on the entire system. It offers the fastest memory in the storage hierarchy of modern computer systems. With the move toward smaller CMOS technology nodes, back-end-of-line (BEoL) interconnects are also fabricated at tighter pitches. Hence, besides the power lines, SRAM word- and bit-lines (WL and BL) are also susceptible to electromigration (EM). Therefore, the EM reliability of SRAM's WL and BL needs to be analyzed during the design technology co-optimization (DTCO) cycle. In this work, we investigate the impact of technology scaling on SRAM designs and perform a detailed analysis of the trend of their EM reliability and energy consumption. Our analysis shows that although scaling down the CMOS technology can result in a 2.68x improvement in the energy efficiency of the SRAM module, it increases the EM-induced hydrostatic stress by 2.53x. |
11:15 CET | ST1.6 | SMART HAMMERING: A PRACTICAL METHOD OF PINHOLE DETECTION IN MRAM MEMORIES Speaker: Sina Bakhtavari Mamaghani, Karlsruhe Institute of Technology, DE Authors: Sina Bakhtavari Mamaghani1, Christopher Muench1, Jongsin Yun2, Martin Keim3 and Mehdi Baradaran Tahoori1 1Karlsruhe Institute of Technology, DE; 2Siemens, US; 3Siemens, US Abstract As we move toward the commercialization of Spin-Transfer Torque Magnetic Random Access Memories (STT-MRAM), cost-effective testing and in-field reliability have become more prominent. Among STT-MRAM manufacturing defects, pinholes are among the most important. Pinholes are defects on the surface of the oxide layer which degrade the resistive values and, in some cases, cause an oxide breakdown. Some moderate levels of pinhole defects can remain undetected during standard functional tests and may cause a field failure. A stress test of the whole memory, including multiple cycles of long writes, has been suggested to detect candidate pinhole defects. However, this test not only causes extra costs but also degrades the reliability of the MRAM for the entire array. In this paper, we have statistically studied the behavior of pinholes and propose a cost-effective testing scheme to capture pinhole defects and increase the reliability of the end product. Our method limits the number of test candidate cells that need to be hammered, reducing test time by up to 96.42% in our case studies compared to existing methods, while fully preserving the advantages of standard tests. The proposed approach is compatible with memory built-in self-test (MBIST) schemes. |
11:18 CET | ST1.7 | MA-OPT: REINFORCEMENT LEARNING-BASED ANALOG CIRCUIT OPTIMIZATION USING MULTI-ACTORS Speaker: Youngchang Choi, Pohang University of Science and Technology, KR Authors: Youngchang Choi1, Minjeong Choi2, Kyongsu Lee1 and Seokhyeong Kang1 1Pohang University of Science and Technology, KR; 2Samsung Advanced Institute of Technology (SAIT), KR Abstract Analog circuit design requires significant human effort and expertise; therefore, electronic design automation (EDA) tools for analog design are needed. This study presents MA-Opt, an analog circuit optimizer based on a reinforcement learning (RL)-inspired framework. MA-Opt uses multiple actors to provide various predictions of optimized circuit designs in parallel. Sharing a specific memory that affects the loss function of network training is proposed to exploit multiple actors effectively, accelerating circuit optimization. Moreover, we devise a novel method to tune the best design found in previous simulations into an even more optimized design. To demonstrate the efficiency of the proposed framework, MA-Opt was simulated for three analog circuits and the results were compared with those of other methods. The experimental results indicate the strength of using multiple actors with a shared elite solution set and the near-sampling method. Within the same number of simulations, while satisfying all given constraints, MA-Opt obtained minimum target metrics up to 24% better than DNN-Opt. Furthermore, MA-Opt obtained better Figures of Merit (FoMs) than DNN-Opt at the same runtime. |
11:21 CET | ST1.8 | AUXCELLGEN: A FRAMEWORK FOR AUTONOMOUS GENERATION OF ANALOG AND MEMORY UNIT CELLS Speaker: Sachin Sapatnekar, University of Minnesota, US Authors: Sumanth Kamineni1, Arvind Sharma2, Ramesh Harjani2, Sachin S. Sapatnekar2 and Benton H. Calhoun1 1University of Virginia, US; 2University of Minnesota, US Abstract Recent advances in auto-generating analog and mixed-signal (AMS) circuits use standard digital tool flows to compose AMS circuits from a combination of digital standard cells and a set of auxiliary cells (auxcells). Until now, generating auxcell layouts for each new PDK was the last manual step in the flow for auto-generating AMS components, which limited the available auxcells and reduced the optimality of the auto-generated AMS designs. To solve this, we propose AuxcellGen, a framework to auto-generate auxcell layouts and performance models. AuxcellGen generates a parasitic-aware auxcell performance model using a neural network (NN), auto-sizes and optimizes auxcell schematics for a given design target, and auto-generates auxcell layouts. The framework is demonstrated by auto-generating tri-state buffer auxcells for PLLs and sense-amplifier auxcells for SRAM across a range of user specifications that are compatible with standard cell and memory bitcell pitch. |
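ST1.2 above formulates corner selection as choosing a small subset of corners that together drive every specification attribute of the component analog IPs to an extreme, and notes that this covering problem is NP-hard; the paper solves it with an SMT solver over Monte Carlo results. The sketch below uses a plain greedy set-cover heuristic on a made-up coverage table, purely to illustrate the covering structure; it is not the authors' SMT-based flow.

```python
# Each simulation corner "covers" the specification attributes it drives to an extreme.
# The corner names and coverage sets below are made up purely for illustration.
corner_coverage = {
    "ss_-40C_1.6V": {"gain_min", "slew_min"},
    "ff_125C_2.0V": {"power_max", "offset_max"},
    "sf_25C_1.8V":  {"offset_max"},
    "fs_125C_1.6V": {"gain_min", "power_max"},
    "tt_25C_1.8V":  set(),
}
targets = {"gain_min", "slew_min", "power_max", "offset_max"}

def greedy_corner_selection(coverage, targets):
    """Greedy set cover: repeatedly pick the corner covering the most uncovered attributes."""
    selected, uncovered = [], set(targets)
    while uncovered:
        best = max(coverage, key=lambda c: len(coverage[c] & uncovered))
        if not coverage[best] & uncovered:
            break                      # remaining attributes cannot be covered
        selected.append(best)
        uncovered -= coverage[best]
    return selected

print(greedy_corner_selection(corner_coverage, targets))
```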
Extended Abstracts
Time | Label | Presentation Title Authors |
---|---|---|
11:24 CET | ST1.9 | DEBUGGING LOW POWER ANALOG NEURAL NETWORKS FOR EDGE COMPUTING Speaker: Sascha Schmalhofer, Goethe University Frankfurt, DE Authors: Sascha Schmalhofer, Marwin Moeller, Nikoletta Katsaouni, Marcel Schulz and Lars Hedrich, Goethe University Frankfurt, DE Abstract In this paper we present a method to debug and analyze large synthesized ANNs, enabling a systematic comparison of the transistor netlist, the behavioral model and the implementation. With that, insight into the behavior of the analog netlist is easily gained, and errors during generation or badly designed cells are quickly uncovered. An overall judgement of the accuracy is also presented. We demonstrate the functionality on several examples, from small ANNs to ANNs consisting of more than 10,000 cells implementing a medical application. |
11:24 CET | ST1.10 | HIGH PERFORMANCE AND DNU-RECOVERY SPINTRONIC RETENTION LATCH FOR HYBRID MTJ/CMOS TECHNOLOGY Speaker: Zhen Zhou, Anhui University, CN Authors: Aibin Yan1, Zhen Zhou1, Liang Ding1, Jie Cui1, Zhengfeng Huang2, Xiaoqing Wen3 and Patrick Girard4 1Anhui University, CN; 2Hefei University of Technology, CN; 3Kyushu Institute of Technology, JP; 4LIRMM / CNRS, FR Abstract With the advancement of CMOS technologies, circuits have become more vulnerable to soft errors, such as single-node upsets (SNUs) and double-node upsets (DNUs). To effectively provide nonvolatility as well as tolerance against radiation-induced DNUs, this paper proposes a nonvolatile and DNU-resilient latch that mainly comprises two magnetic tunnel junctions (MTJs), two inverters and eight C-elements. Since two MTJs are used and all internal nodes are interlocked, the latch provides nonvolatility and recovery from all possible DNUs. Simulation results demonstrate the nonvolatility, DNU recovery and high performance of the proposed latch. |
11:24 CET | ST1.11 | MINIMUM UNIT CAPACITANCE CALCULATION FOR BINARY-WEIGHTED CAPACITOR ARRAYS Speaker: Nibedita Karmokar, University of Minnesota, US Authors: Nibedita Karmokar, Ramesh Harjani and Sachin S. Sapatnekar, University of Minnesota, US Abstract The layout area and power consumption of a binary-weighted capacitive digital-to-analog converter (DAC) increase exponentially with the number of bits. To meet linearity targets, unit capacitors should be large enough to limit errors caused by various sources of noise and those due to mismatch. This work proposes a systematic approach for minimizing the unit capacitance value that optimizes the linearity metrics of a DAC, accounting for multiple factors that contribute to mismatch, as well as the impact of flicker and thermal noise. |
LK3 Special Day Lunchtime Keynote
Add this session to my calendar
Date: Wednesday, 19 April 2023
Time: 13:00 CET - 14:00 CET
Location / Room: Darwin Hall
Session chair:
Jan Madsen, TU Denmark, DK
Session co-chair:
Oliver Bringmann, University of Tübingen, DE
Time | Label | Presentation Title Authors |
---|---|---|
13:00 CET | LK3.1 | ANALYZE THE PATIENT, ENGINEER THE THERAPY Presenter: Liesbet Lagae, IMEC, BE Author: Liesbet Lagae, IMEC, BE Abstract The complexity, cycle time and cost of new precision therapy workflows are major challenges to overcome in order to achieve clinical implementation of this revolutionary type of treatment. For example, CAR T-cell therapies use the patient's own immune (T) cells, adapted in a way to better fight cancer. Chip technology can help to make these therapies more efficient, precise, and cost-effective. Over the last few decades, the semiconductor industry has grown exponentially, poised to increase value to the end user while driving down costs by scaling. The result is the world's highest standard in precision and high-volume production of nanoelectronics chip-based sensor solutions. Imec has used its semiconductor process expertise and infrastructure to make significant innovations in single-use silicon biochip and microfluidic technology, creating toolboxes of on-chip functions spanning DNA sequencing, cell sorting, single-cell electroporation and integrated biosensor arrays. These solutions have until now mostly served the diagnostic market. Chip-based microfluidics is a toolbox that brings its own design challenges, especially in relation to not having to reinvent the wheel every time. Hence, we try to make maximal reuse of generic fluidic building blocks developed for the diagnostic market, and we will explain how these building blocks are equally suited to addressing the challenges in immune therapy. These existing on-chip demonstrations could provide smarter solutions for discrete unit operations and quality monitoring, and even complete workflow integration. Solving these challenges would enable more patients to access and benefit from the next, most anticipated class of life-changing therapies. |
ES Executive session
Add this session to my calendar
Date: Wednesday, 19 April 2023
Time: 14:00 CET - 15:30 CET
Location / Room: Okapi Room 0.8.1
Time | Label | Presentation Title Authors |
---|---|---|
14:00 CET | ES.1 | MAKECHIP: AN INNOVATIVE HOSTED DESIGN SERVICE PLATFORM FOR CUTTING-EDGE CHIP DESIGNS Presenter: Florian Bilstein, Racyics, DE Author: Florian Bilstein, Racyics, DE Abstract In the field of chip design, a reliable IT infrastructure is essential for realizing complex Systems on Chip. Setting up and maintaining EDA tools, technology data and design flows while ensuring compatibility and efficiency can be both time-consuming and challenging. With makeChip, an innovative Hosted Design Service Platform (HDSP), Racyics offers a central gateway for designing integrated circuits in advanced semiconductor technologies without upfront investment in a design environment and design methodology, targeted at start-ups, SMEs, research institutes and universities. The platform provides reliable IT infrastructure with a full set of EDA tool installations and technology data setup, i.e., PDKs, foundation IP and complex IP. All tools and design data are linked by Racyics' silicon-proven design flow and project management system. The turnkey environment enables any makeChip customer to realize complex Systems on Chip in the most advanced technology nodes. For rapid silicon prototyping of research circuits in GF® 22FDX®, Rapid Adaption Kits are available. Furthermore, Racyics supports makeChip customers with on-demand design services, such as digital layout generation, tape-out sign-off execution and many more. On top of that, access to Racyics' ABX IPs for GF 22FDX® is provided for reliable and predictable ultra-low-voltage operation down to 0.4V. For non-commercial academic projects, makeChip access includes a complete suite of advanced Cadence EDA tool licenses at no additional cost. In this presentation, the concept and structure of makeChip are outlined in greater detail and the unique benefits for academia, start-ups, and SMEs are explored. Furthermore, already realized projects, such as SpiNNaker 2, are presented, showing how makeChip helped to boost their development and tackle design challenges. Finally, an outlook on makeChip and ongoing discussions with the European Union on establishing an EU Design Platform is given. |
14:30 CET | ES.2 | HETEROGENEOUS INTEGRATION OF CHIPLETS BRINGS A NEW TWIST TO SIP Presenter: Heiko Dudek, Siemens Digital Industries Software - EDA, DE Author: Heiko Dudek, Siemens Digital Industries Software - EDA, DE Abstract The semiconductor industry is facing an inflection point as higher cost, lower yield, and reticle size limitations drive the need for viable alternatives to traditional monolithic solutions, which have hit the limits of physics. This is driving an emerging trend to disaggregate what would typically be implemented as an SoC into solid, fabricated IP blocks, otherwise known as chiplets. These chiplets typically include just a couple of functions implemented at the optimal process node. Combining them with other chiplets, memory and often a custom ASIC results in a multi-die, heterogeneously integrated implementation that typically utilizes a high-performance substrate, ushering in a new generation of system-in-package and, with it, a new set of design challenges that this session will explore. |
15:00 CET | ES.3 | ADDRESSING THE CHALLENGES OF ISO26262 FOR IP, USING LOGICBIST WITH OBSERVATION SCAN TECHNOLOGY Speaker: Nicolas Leblond, Siemens EDA, FR Authors: Lee Harrison1 and Prashant Kulkarni2 1Siemens EDA, GB; 2ARM Inc, GB Abstract With the increased use of complex IP within automotive IC applications, it is vitally important that commercial IP is delivered fit for purpose. This paper steps through the process with an Arm commercial IP to ensure that ISO 26262 certification can be achieved. A full reference architecture is created which implements periodic LogicBIST controlled by an on-chip safety manager. Full details of the solution are given, covering the infrastructure used to monitor and manage the BIST-based safety mechanisms in an in-life periodic configuration, including the complex scheduling required to meet the specified diagnostic test interval (DTI). We also look at how the reference flow can be enhanced to use Tessent LogicBIST with observation scan technology to significantly simplify the overall implementation while dramatically reducing the overall DTI. We also review the process to certify the complete IP to ISO 26262. |
LKS4 Later … with the keynote speakers
Add this session to my calendar
Date: Wednesday, 19 April 2023
Time: 14:00 CET - 15:30 CET
Location / Room: Darwin Hall
Session chair:
Jan Madsen, TU Denmark, DK
Session co-chair:
Oliver Bringmann, University of Tübingen, DE
SA1 Power-efficient and Smart Energy Systems
Add this session to my calendar
Date: Wednesday, 19 April 2023
Time: 14:00 CET - 15:30 CET
Location / Room: Gorilla Room 1.5.3
Session chair:
Dolly Sapra, UVA, NL
14:00 CET until 14:21 CET: Pitches of regular papers
14:21 CET until 15:30 CET: Interactive technical presentations by the authors of regular papers and extended abstracts
Regular Papers
Time | Label | Presentation Title Authors |
---|---|---|
14:00 CET | SA1.1 | SPARSEMEM: ENERGY-EFFICIENT DESIGN FOR IN-MEMORY SPARSE-BASED GRAPH PROCESSING Speaker: Mahdi Zahedi, Delft University of Technology (TU Delft), NL Authors: Mahdi Zahedi1, Geert Custers1, Taha Shahroodi1, Georgi Gaydadjiev2, Stephan Wong1 and Said Hamdioui1 1TU Delft, NL; 2Maxeler / Imperial College, GB Abstract Performing analysis on large graph datasets in an energy-efficient manner has posed a significant challenge, not only due to excessive data movement and poor locality, but also due to the non-optimal use of the high sparsity of such datasets. The latter leads to a waste of resources, as computation is also performed on zero operands that do not contribute to the final result. This paper designs a novel graph processing accelerator, SparseMEM, targeting sparse datasets by leveraging the computing-in-memory (CIM) concept; CIM is a promising solution to alleviate the overhead of data movement and the inherent poor locality of graph processing. The proposed solution stores the graph information in a compressed hierarchical format inside the memory and adjusts the workflow based on this new mapping. This vastly improves resource utilization, leading to higher energy efficiency and performance. The experimental results demonstrate that SparseMEM outperforms a GPU-based platform and two state-of-the-art in-memory accelerators on speedup and energy efficiency by one and three orders of magnitude, respectively. |
14:03 CET | SA1.2 | HULK-V: A HETEROGENEOUS ULTRA-LOW-POWER LINUX CAPABLE RISC-V SOC Speaker: Luca Valente, Università di Bologna, IT Authors: Luca Valente1, Yvan Tortorella1, Mattia Sinigaglia1, Giuseppe Tagliavini1, Alessandro Capotondi2, Luca Benini3 and Davide Rossi1 1Università di Bologna, IT; 2Università di Modena e Reggio Emilia, IT; 3ETH Zurich, CH Abstract IoT applications span a wide range in performance and memory footprint, under tight cost and power constraints. High-end applications rely on power-hungry Systems-on-Chip (SoCs) featuring powerful processors, large LPDDR/DDR3/4/5 memories, and supporting full-fledged Operating Systems (OS). On the contrary, low-end applications typically rely on Ultra-Low-Power microcontrollers with a "close to metal" software environment and simple micro-kernel-based runtimes. Emerging applications and trends of IoT require the "best of both worlds": cheap and low-power SoC systems with a well-known and agile software environment based on full-fledged OS (e.g., Linux), coupled with extreme energy efficiency and parallel digital signal processing capabilities. We present HULK-V: an open-source Heterogeneous Linux-capable RISC-V-based SoC coupling a 64-bit RISC-V processor with an 8-core Programmable Multi-Core Accelerator (PMCA), delivering up to 13.8 GOps, up to 157 GOps/W and accelerating the execution of complex DSP and ML tasks by up to 112x over the host processor. HULK-V leverages a lightweight, fully digital memory hierarchy based on HyperRAM IoT DRAM that exposes up to 512 MB of DRAM memory to the host CPU. Featuring HyperRAMs, HULK-V doubles the energy efficiency without significant performance loss compared to featuring power-hungry LPDDR memories, requiring expensive and large mixed-signal PHYs. HULK-V, implemented in Global Foundries 22nm FDX technology, is a fully digital ultra-low-cost SoC running a 64-bit Linux software stack with OpenMP host-to-PMCA offload within a power envelope of just 250 mW. |
14:06 CET | SA1.3 | HIGH-SPEED AND ENERGY-EFFICIENT SINGLE-PORT CONTENT ADDRESSABLE MEMORY TO ACHIEVE DUAL-PORT OPERATION Speaker: Honglan Zhan, School of Computer Science, Peking University, Beijing, China, CN Authors: Honglan Zhan, Chenxi Wang, Hongwei Cui, Xianhua Liu, Feng Liu and Xu Cheng, Department of Computer Science and Technology, Peking University, Beijing, China, CN Abstract High-speed and energy-efficient multi-port content addressable memory (CAM) is very important to modern superscalar processors. To overcome the disadvantages of multi-port CAM and improve the performance of the search stage, a high-speed and energy-efficient single-port (SP) CAM is introduced to achieve dual-port (DP) operation. For two different bit-cell topologies (the traditional 9T CAM cell and the 6T SRAM cell), two novel peripheral schemes, CShare and VClamp, are proposed. The proposed schemes are verified across all process corners, a wide range of temperatures and detailed Monte Carlo variation analysis. With a 65-nm process and 1.2 V supply, the search delay of CShare and VClamp is 0.55 ns and 0.6 ns, respectively, a reduction of approximately 87% compared to state-of-the-art works. In addition, compared with the recently proposed 10T BCAM, CShare and VClamp provide 84.9% and 85.1% energy reduction in the TT corner, respectively. Experimental results for an 8 Kb CAM at 1.2 V supply and across different corners show that the energy efficiency is improved by 45.56% (CShare) and 45.64% (VClamp) on average in comparison with DP CAM. |
14:09 CET | SA1.4 | ENERGY-EFFICIENT HARDWARE ACCELERATION OF SHALLOW MACHINE LEARNING APPLICATIONS Speaker: Ziqing Zeng, University of Minnesota, US Authors: Ziqing Zeng and Sachin S. Sapatnekar, University of Minnesota, US Abstract ML accelerators have largely focused on building general platforms for deep neural networks (DNNs), but less so on shallow machine learning (SML) algorithms. This paper proposes Axiline, a compact, configurable, template-based generator for SML hardware acceleration. Axiline identifies computational kernels as templates that are common to these algorithms and builds a pipelined accelerator for efficient execution. The dataflow graphs of individual ML instances, with different data dimensions, are mapped to the pipeline stages and then optimized by customized algorithms. The approach generates energy-efficient hardware for training and inference of various ML algorithms, as demonstrated with post-layout FPGA and ASIC results. |
14:12 CET | SA1.5 | STATEFUL ENERGY MANAGEMENT FOR MULTI-SOURCE ENERGY HARVESTING TRANSIENT COMPUTING SYSTEMS Speaker: Domenico Balsamo, Newcastle University, GB Authors: Sergey Mileiko1, Oktay Cetinkaya2, Rishad Shafik1 and Domenico Balsamo1 1Newcastle University, GB; 2Oxford e-Research Centre, GB Abstract The intermittent and varying nature of energy harvesting (EH) entails dedicated energy management with large energy storage, which is a limiting factor for low-power/low-cost systems with small form factors. Transient computing allows system operations to be performed in the presence of power outages by saving the system state into a non-volatile memory (NVM), thereby reducing the size of this storage. These systems are often designed with a task-based strategy, which requires the storage to be sized for the most energy-consuming task. That is, however, not ideal for most systems since their tasks/components have varying energy requirements, i.e., energy storage size and operating voltage. Hence, to overcome this issue, this paper proposes a novel energy management unit (EMU) tailored for multi-source EH transient systems that allows selecting the storage size and operating voltage for the next task at run-time, thereby optimizing task-specific energy needs and startup times based on application requirements. For the first time in the literature, we adopt a hybrid NVM+VM approach allowing our EMU to reliably and efficiently retain its internal state, i.e., a stateful EMU, under even the most severe EH conditions. Extensive empirical evaluations validated the operation of the proposed stateful EMU at a small overhead (0.07 mJ of energy to update the EMU state and ≃4 μA of static current consumption for the EMU). |
14:15 CET | SA1.6 | FULLY ON-BOARD LOW-POWER LOCALIZATION WITH MULTIZONE TIME-OF-FLIGHT SENSORS ON NANO-UAVS Speaker: Hanna Müller, ETH Zürich, CH Authors: Hanna Mueller1, Nicky Zimmerman2, Tommaso Polonelli3, Jens Behley2, Michele Magno1, Cyrill Stachniss2 and Luca Benini4 1ETH Zurich, CH; 2Uni Bonn, DE; 3Center for Project-Based Learning, ETH Zurich, CH; 4ETH Zurich, CH | Università di Bologna, IT Abstract Nano-size unmanned aerial vehicles (UAVs) hold enormous potential to perform autonomous operations in complex environments, such as inspection, monitoring, or data collection. Moreover, their small size allows safe operation close to humans and agile flight. An important part of autonomous flight is localization, which is a computationally intensive task, especially on a nano-UAV that usually has strong constraints in sensing, processing and memory. This work presents a real-time localization approach with low-element-count multizone range sensors for resource-constrained nano-UAVs. The proposed approach is based on a novel miniature 64-zone time-of-flight sensor from ST Microelectronics and a RISC-V-based parallel ultra-low-power processor to enable accurate and low latency Monte Carlo localization on-board. Experimental evaluation using a nano-UAV open platform demonstrated that the proposed solution is capable of localizing on a 31.2m^2 map with 0.15m accuracy and an above 95% success rate. The achieved accuracy is sufficient for localization in common indoor environments. We analyze tradeoffs in using full and half-precision floating point numbers as well as a quantized map and evaluate the accuracy and memory footprint across the design space. Experimental evaluation shows that parallelizing the execution for 8 RISC-V cores brings a 7x speedup and allows us to execute the algorithm on-board in real-time with a latency of 0.2-30ms (depending on the number of particles) while only increasing the overall drone power consumption by 3-7%. Finally, we provide an open-sourced implementation of our approach. |
14:18 CET | SA1.7 | ENERGY-EFFICIENT WEARABLE-TO-MOBILE OFFLOAD OF ML INFERENCE FOR PPG-BASED HEART-RATE ESTIMATION Speaker: Matteo Risso, Politecnico di Torino, IT Authors: Alessio Burrello1, Matteo Risso2, Noemi Tomasello2, Yukai Chen3, Luca Benini4, Enrico Macii2, Massimo Poncino2 and Daniele Jahier Pagliari2 1Politecnico di Torino and Università di Bologna, IT; 2Politecnico di Torino, IT; 3IMEC, BE; 4ETH Zurich, CH | Università di Bologna, IT Abstract Modern smartwatches often include photoplethysmographic (PPG) sensors to sense the contractions within the dense arteriovenous system. This information can be used to measure heartbeats or blood pressure through complex algorithms that fuse PPG data with other signals. However, these approaches are often too complex to be deployed on microcontroller units (MCUs) such as the ones embedded in a smartwatch. In this work, we propose a collaborative inference approach that uses both a smartwatch and a connected smartphone to maximize the performance of heart rate (HR) tracking while also maximizing the smartwatch's battery life. In particular, we first analyze the trade-offs between running on-device HR tracking or offloading the work to the smartphone. Then, thanks to an additional step to evaluate the difficulty of the upcoming HR prediction, we demonstrate that we can smartly dispatch the workload between smartwatch and smartphone, maintaining a low mean absolute error (MAE) while reducing energy consumption. To benchmark our approach, we employed a custom smartwatch prototype which includes the STM32WB55 MCU for processing and Bluetooth Low-Energy (BLE) communication and a Raspberry Pi3 as a proxy for the smartphone. With our Collaborative Heart Rate Inference System (CHRIS), we obtain a set of Pareto-optimal configurations demonstrating the same MAE as State-of-Art (SoA) algorithms while consuming less energy. For instance, we can achieve approximately the same MAE of TimePPG-Small (5.54 BPM MAE vs. 5.60 BPM MAE) while reducing the energy by 2.03x, with a configuration that offloads 80% of the predictions to the phone. Furthermore, accepting a performance degradation to 7.16 BPM of MAE, we can achieve an energy consumption of 179 uJ per prediction, 3.03x less than running TimePPG-Small on the smartwatch, and 1.82x less than streaming all the input data to the phone. |
Extended Abstracts
Time | Label | Presentation Title Authors |
---|---|---|
14:21 CET | SA1.8 | A COUPLED BATTERY STATE OF CHARGE AND VOLTAGE MODEL FOR OPTIMAL CONTROL APPLICATIONS Speaker: Sajad Shahsavari, Department of Computing, University of Turku, Turku, Finland, FI Authors: Masoomeh Karami1, Sajad Shahsavari1, Eero Immonen2, Hashem Haghbayan1 and Juha Plosila1 1University of Turku, FI; 2Turku University of Applied Sciences, FI Abstract Optimal control of electric vehicle (EV) batteries for maximal energy efficiency, safety and lifespan requires that the Battery Management System (BMS) has accurate real-time information on both the battery State-of-Charge (SoC) and its dynamics, i.e., long-term and short-term energy supply capacity, at all times. However, these quantities cannot be measured directly from the battery, and, in practice, only SoC estimation is typically carried out. In this article, we propose a novel parametric algebraic voltage model coupled to the well-known Manwell-McGowan dynamic Kinetic Battery Model (KiBaM), which is able to predict both battery SoC dynamics and its electrical response. Numerical simulations, based on laboratory measurements, are presented for prismatic Lithium-Titanate Oxide (LTO) battery cells. Such cells are prime candidates for modern heavy off-road EV applications. |
14:21 CET | SA1.9 | ADEE-LID: AUTOMATED DESIGN OF ENERGY-EFFICIENT HARDWARE ACCELERATORS FOR LEVODOPA-INDUCED DYSKINESIA CLASSIFIERS Speaker: Martin Hurta, Faculty of Information Technology, Brno University of Technology, CZ Authors: Martin Hurta, Vojtech Mrazek, Michaela Drahosova and Lukas Sekanina, Brno University of Technology, CZ Abstract Levodopa, a drug used to treat symptoms of Parkinson's disease, is connected to side effects known as Levodopa-induced dyskinesia (LID). LID is difficult to classify during a physician's visit. A wearable device allowing long-term and continuous classification would significantly help with dosage adjustments. This paper deals with an automated design of energy-efficient hardware accelerators for such LID classifiers. The proposed accelerator consists of a feature extractor and a classifier co-designed using genetic programming. Improvements are achieved by introducing a variable bit width for arithmetic operators, eliminating redundant registers, and using precise energy consumption estimation for Pareto front creation. Evolved solutions reduce energy consumption while maintaining classification accuracy comparable to the state of the art. |
SD5 Approximate computing
Add this session to my calendar
Date: Wednesday, 19 April 2023
Time: 14:00 CET - 15:30 CET
Location / Room: Okapi Room 0.8.2
Session chair:
Jie Han, University of Alberta, CA
14:00 CET until 14:24 CET: Pitches of regular papers
14:24 CET until 15:30 CET: Interactive technical presentations by the authors of regular papers and extended abstracts
Regular Papers
Time | Label | Presentation Title Authors |
---|---|---|
14:00 CET | SD5.1 | MAXIMIZING COMPUTING ACCURACY ON RESOURCE-CONSTRAINED ARCHITECTURES Speaker: Olivier Sentieys, Inria/Irisa, FR Authors: Van-Phu Ha and Olivier Sentieys, INRIA, FR Abstract With the growing complexity of applications, designers need to fit more and more computing kernels into a limited energy or area budget. Therefore, improving the quality of results of applications in electronic devices under a cost constraint is becoming a critical problem. Word Length Optimization (WLO) is the process of determining bit-widths for variables or operations represented using fixed-point arithmetic to trade off quality against cost. State-of-the-art approaches mainly solve WLO given a quality (accuracy) constraint. In this paper, we first show that existing WLO procedures are not suited to the problem of optimizing accuracy given a cost constraint. It is therefore interesting and challenging to propose new methods to solve this problem. We then propose a Bayesian-optimization-based algorithm to maximize the quality of computations under a cost constraint (i.e., energy in this paper). Experimental results indicate that our approach outperforms conventional WLO approaches, improving the quality of the solutions by more than 170%. |
14:03 CET | SD5.2 | MECALS: A MAXIMUM ERROR CHECKING TECHNIQUE FOR APPROXIMATE LOGIC SYNTHESIS Speaker: Chang Meng, Shanghai Jiao Tong University, CN Authors: Chang Meng, Jiajun Sun, Yuqi Mai and Weikang Qian, Shanghai Jiao Tong University, CN Abstract Approximate computing is an effective computing paradigm to improve energy efficiency for error-tolerant applications. Approximate logic synthesis (ALS) methods are designed to generate approximate circuits under certain error constraints. This paper focuses on ALS methods under the maximum error constraint and proposes MECALS, a maximum error checking technique for ALS. MECALS models maximum error using partial Boolean difference and performs fast error checking with SAT sweeping. Based on MECALS, we design an efficient ALS flow. Our experimental results show that compared to a state-of-the-art ALS method, our flow is 13× faster and improves area and delay reduction by 39.2% and 26.0%, respectively. |
14:06 CET | SD5.3 | COMPACT: CO-PROCESSOR FOR MULTI-MODE PRECISION-ADJUSTABLE NON-LINEAR ACTIVATION FUNCTIONS Speaker: Wenhui Ou, School of Mechanical Science and Engineering, Huazhong University of Science and Technology, CN Authors: Wenhui Ou, Zhuoyu Wu, Zheng Wang, Chao Chen and Yongkui Yang, Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences, CN Abstract Non-linear activation functions imitating neuron behaviors are ubiquitous in machine learning algorithms for time-series signals and also demonstrate significant precision gains for conventional vision-based deep learning networks. State-of-the-art implementations of such functions on GPU-like devices incur a large physical cost, whereas edge devices adopt either linear interpolation or simplified linear functions, leading to degraded precision. In this work, we design COMPACT, a co-processor with adjustable precision for multiple non-linear activation functions including but not limited to exponent, sigmoid, tangent, logarithm, and mish. Benchmarked against the state of the art, COMPACT achieves a 26% reduction in absolute error over a 1.6x wider approximation range by taking advantage of a triple decomposition technique inspired by Hajduk's formula for Padé approximation. A SIMD-ISA-based vector co-processor has been implemented on FPGA, which leads to a 30% reduction in execution latency while the area overhead remains nearly the same as related designs. Furthermore, COMPACT can trade accuracy for a 46% latency improvement when a maximum absolute error on the order of 1E-3 is tolerable. |
14:09 CET | SD5.4 | DEEPCAM: A FULLY CAM-BASED INFERENCE ACCELERATOR WITH VARIABLE HASH LENGTHS FOR ENERGY-EFFICIENT DEEP NEURAL NETWORKS Speaker: Priyadarshini Panda, Yale University, US Authors: Duy-Thanh Nguyen, Abhiroop Bhattacharjee, Abhishek Moitra and Priyadarshini Panda, Yale University, US Abstract With ever increasing depth and width in deep neural networks to achieve state-of-the-art performance, deep learning computation has significantly grown, and dot-products remain dominant in overall computation time. Most prior works are built on conventional dot-product where weighted input summation is used to represent the neuron operation. However, another implementation of dot-product based on the notion of angles and magnitudes in the Euclidean space has attracted limited attention. This paper proposes DeepCAM, an inference accelerator built on two critical innovations to alleviate the computation time bottleneck of convolutional neural networks. The first innovation is an approximate dot-product built on computations in the Euclidean space that can replace addition and multiplication with simple bit-wise operations. The second innovation is a dynamic size content addressable memory-based (CAM-based) accelerator to perform bit-wise operations and accelerate the CNNs with a lower computation time. Our experiments on benchmark image recognition datasets demonstrate that DeepCAM is up to 523x and 3498x faster than Eyeriss and traditional CPUs like Intel Skylake, respectively. Furthermore, the energy consumed by our DeepCAM approach is 2.16x to 109x less compared to Eyeriss. |
14:12 CET | SD5.5 | DESIGN OF LARGE-SCALE STOCHASTIC COMPUTING ADDERS AND THEIR ANOMALOUS BEHAVIOR Speaker: Timothy Baker, University of Michigan, US Authors: Timothy Baker and John Hayes, University of Michigan, US Abstract Stochastic computing (SC) uses streams of pseudo-random bits to perform low-cost and error-tolerant numerical processing for applications like neural networks and digital filtering. A key operation in these domains is the summation of many hundreds of bit-streams, but existing SC adders are inflexible and unpredictable. Basic mux adders have low area but poor accuracy while other adders like accumulative parallel counters (APCs) have good accuracy but high area. This work introduces parallel sampling adders (PSAs), a novel weighted adder family that offers a favorable area-accuracy trade-off and provides great flexibility to large-scale SC adder design. Our experiments show that PSAs can sometimes achieve the same high accuracy as APCs, but at half the area cost. We also examine the behavior of large-scale SC adders in depth and uncover some surprising results. First, APC accuracy is shown to be sensitive to input correlation despite the common belief that APCs are correlation insensitive. Then, we show that mux-based adders are sometimes more accurate than APCs, which contradicts most prior studies. Explanations for these anomalies are given and a decorrelation scheme is proposed to improve APC accuracy by 4x for a digital filtering application. |
14:15 CET | SD5.6 | ACCURATE YET EFFICIENT STOCHASTIC COMPUTING NEURAL ACCELERATION WITH HIGH PRECISION RESIDUAL FUSION Speaker: Yixuan Hu, Institute of Microelectronics, Peking University, CN Authors: Yixuan Hu1, Tengyu Zhang1, Renjie Wei1, Meng Li2, Runsheng Wang1, Yuan Wang1 and Ru Huang1 1School of Integrated Circuits, Peking University, CN; 2Institute for Artificial Intelligence and School of Integrated Circuits, Peking University, CN Abstract Stochastic computing (SC) emerges as a fault-tolerant and area-efficient computing paradigm for neural acceleration. However, existing SC accelerators suffer from an intrinsic trade-off between inference accuracy and efficiency: accurate SC requires high precision computation but suffers from an exponential increase of bit stream length and inference latency. In this paper, we discover the high precision residual as a key remedy and propose to combine a low precision datapath with a high precision residual to improve inference accuracy with minimum efficiency overhead. We also propose to fuse batch normalization with the activation function to further improve the inference efficiency. The effectiveness of our proposed method is verified on a recently proposed SC accelerator. With extensive results, we show that our proposed SC-friendly network achieves 9.43% accuracy improvements compared to the baseline low precision networks with only 1.3% area-delay product (ADP) increase. We further show 3.01x ADP reduction compared to the baseline SC accelerator with almost iso-accuracy. |
14:18 CET | SD5.7 | PECAN: A PRODUCT-QUANTIZED CONTENT ADDRESSABLE MEMORY NETWORK Speaker: Jie Ran, University of Hong Kong, CN Authors: Jie Ran1, Rui Lin2, Jason Li1, JiaJun Zhou1 and Ngai Wong1 1University of Hong Kong, HK; 2University of Hong Kong, HK Abstract A novel deep neural network (DNN) architecture is proposed wherein the filtering and linear transform are realized solely with product quantization (PQ). This results in a natural implementation via content addressable memory (CAM), which transcends regular DNN layer operations and requires only simple table lookup. Two schemes are developed for end-to-end PQ prototype training, namely, through angle- and distance-based similarities, which differ in their multiplicative and additive natures with different complexity-accuracy tradeoffs. Even more, the distance-based scheme constitutes a truly multiplier-free DNN solution. Experiments confirm the feasibility of such a Product-Quantized Content Addressable Memory Network (PECAN), which has strong implications for hardware-efficient deployments, especially for in-memory computing. |
14:21 CET | SD5.8 | XRING: A CROSSTALK-AWARE SYNTHESIS METHOD FOR WAVELENGTH-ROUTED OPTICAL RING ROUTERS Speaker: Zhidan Zheng, TU Munich, DE Authors: Zhidan Zheng, Mengchu Li, Tsun-Ming Tseng and Ulf Schlichtmann, TU Munich, DE Abstract Wavelength-routed optical networks-on-chip (WRONoCs) are well-known for supporting high-bandwidth communications with low power and latency. Among all WRONoC routers, optical ring routers have attracted great research interest thanks to their simple structure, which looks like concentric circles formed by waveguides. Current ring routers are designed manually. When the number of network nodes increases or the position of network nodes changes, it can be difficult to manually determine the optimal design options. Besides, current ring routers face two problems. First, some signal paths in the routers can be very long and suffer high insertion loss; second, to connect the network nodes to off-chip lasers, waveguides in the power distribution network (PDN) have to intersect with the ring waveguides, which causes additional insertion loss and crosstalk noise. In this work, we propose XRing, the first design automation method to automatically synthesize optical ring routers based on the number and position of network nodes. In particular, XRing optimizes the waveguide connections between the network nodes with a mathematical modelling method. To reduce insertion loss and crosstalk noise, XRing constructs efficient shortcuts between network nodes that suffer long signal paths and creates openings on ring waveguides so that the PDN can easily access the network nodes without causing waveguide crossings. The experimental results show that XRing outperforms other WRONoC routers in reducing insertion loss and crosstalk noise. In particular, more than 98% of signals in XRing do not suffer first-order crosstalk noise, which significantly enhances the signal quality. |
Extended Abstracts
Time | Label | Presentation Title Authors |
---|---|---|
14:24 CET | SD5.10 | EXPLOITING ASSERTIONS MINING AND FAULT ANALYSIS TO GUIDE RTL-LEVEL APPROXIMATION Speaker: Samuele Germiniani, Università di Verona, IT Authors: Alberto Bosio1, Samuele Germiniani2, Graziano Pravadelli2 and Marcello Traiola3 1Lyon Institute of Nanotechnology, FR; 2Università di Verona, IT; 3Inria / IRISA, FR Abstract In Approximate Computing (AxC), several design exploration approaches and metrics have been proposed so far to identify the approximation targets at the gate level, but only a few of them work on RTL descriptions. In addition, the possibility of combining the information derived from assertions and fault analysis is still under-explored. To fill in the gap, this paper proposes an automatic methodology to guide the AxC design exploration at the RTL level. Two approximation techniques are considered, bit-width and statement reduction, while fault injection is used to mimic their effect on the design under approximation. Assertions are then dynamically mined from the original RTL description and the variation of their truth values is evaluated with respect to fault injections. These variations are then used to rank and cluster different approximation alternatives, according to their estimated impact on the functionality of the target design. The experiments carried out on a case study, show that the proposed approach represents a promising solution toward the automatization of AxC design exploration at RTL. |
14:24 CET | SD5.11 | AN EFFICIENT FAULT INJECTION ALGORITHM FOR IDENTIFYING UNIMPORTANT FFS IN APPROXIMATE COMPUTING CIRCUITS Speaker: Yutaka Masuda, Nagoya University, JP Authors: Jiaxuan Lu, Yutaka Masuda and Tohru Ishihara, Nagoya University, JP Abstract Approximate computing (AC) has attracted much attention, contributing to energy saving and performance improvement by performing the important computations accurately and approximating the others. To make AC circuits practical, we need to carefully determine how important each computation is, so that the unimportant computations can be approximated appropriately while maintaining the required computational quality. In this paper, we focus on the importance of computations at the flip-flop (FF) level and propose a novel importance evaluation methodology. The key idea of the proposed methodology is a two-step fault injection algorithm that extracts a near-optimal set of unimportant FFs in the circuit. In the first step, the proposed methodology derives the importance of each FF. Then, in the second step, it extracts the set of unimportant FFs in a binary search manner. Thanks to the two-step strategy, the proposed algorithm reduces the complexity of architecture exploration from an exponential order to a linear order without requiring an understanding of the functionality and behavior of the target application program. In a case study of an image processing accelerator, the proposed algorithm identifies candidate unimportant FFs depending on the given constraints. Bit-width scaling for the extracted FFs with the proposed algorithm reduces the circuit area by 29.6% and saves power dissipation by 35.8% under an ASIC implementation. Under an FPGA implementation, dynamic power dissipation is reduced by 37.0% while satisfying the PSNR constraint. |
14:24 CET | SD5.12 | HARDWARE-AWARE AUTOMATED NEURAL MINIMIZATION FOR PRINTED MULTILAYER PERCEPTRONS Speaker: Argyris Kokkinis, Aristotle University of Thessaloniki, GR Authors: Argyris Kokkinis1, Georgios Zervakis2, Kostas Siozios3, Mehdi Tahoori4 and Joerg Henkel4 1Aristotle University of Thessaloniki, GR; 2University of Patras, GR; 3Department of Physics, Aristotle University of Thessaloniki, GR; 4Karlsruhe Institute of Technology, DE Abstract The demand of many application domains for flexibility, stretchability, and porosity typically cannot be met by silicon VLSI technologies. Printed Electronics (PE) has been introduced as a candidate solution that can satisfy those requirements and enable the integration of smart devices on consumer goods at ultra-low cost, also enabling in situ and on-demand fabrication. However, the large feature sizes in PE constrain those efforts and prohibit the design of complex ML circuits due to area and power limitations, even though classification is the core task in most printed applications. In this work, we examine, for the first time, the impact of neural minimization techniques, in conjunction with bespoke circuit implementations, on the area efficiency of printed Multilayer Perceptron classifiers. Results show that for up to 5% accuracy loss, up to 8x area reduction can be achieved. |
SD9 Emerging design technologies for future computing
Add this session to my calendar
Date: Wednesday, 19 April 2023
Time: 14:00 CET - 15:30 CET
Location / Room: Marble Hall
Session chair:
Aida Todri-Sanial, LIRMM, University of Montpellier, CNRS, FR
14:00 CET until 14:24 CET: Pitches of regular papers
14:24 CET until 15:30 CET: Interactive technical presentations by the authors of regular papers and extended abstracts
Regular Papers
Time | Label | Presentation Title Authors |
---|---|---|
14:00 CET | SD9.1 | SCALABLE COHERENT OPTICAL CROSSBAR ARCHITECTURE USING PCM FOR AI ACCELERATION Speaker: Dan Sturm, University of Washington, US Authors: Dan Sturm and Sajjad Moazeni, University of Washington, US Abstract Recent advancements in artificial intelligence (AI) and machine learning (ML) have been challenging our conventional computing paradigms by demanding enormous computing power at a dramatically faster pace than Moore's law. Analog optical computing has recently been proposed as a new approach to achieve large compute power (TOPS) at high energy efficiency (TOPS/W), which makes it suitable for AI acceleration in datacenters and supercomputers. However, implementations proposed so far suffer from a lack of scalability, large footprints and high power consumption, and the lack of practical system-level architectures that could be integrated within existing datacenter architectures for real-world applications. In this work, we present a truly scalable optical AI accelerator based on a crossbar architecture. We have considered all major roadblocks and address them in this design. Weights are stored on chip using phase change material (PCM) that can be monolithically integrated in silicon photonic processes. This coherent crossbar architecture can be extended to large scales without the need for any multi-wavelength laser sources. All electro-optical components and circuit blocks are modeled based on performance metrics measured in a monolithic 45nm silicon photonics process, and the chip can be co-packaged with advanced SoCs and HBM memories. We also present system-level modeling and analysis of our chip's performance for the ResNet-50 V1.5 neural network, considering all critical parameters, including memory size, array size, photonic losses, and the energy consumption of peripheral electronics including ADCs and DACs. Both on-chip SRAM and off-chip DRAM energy overheads have been considered in this modeling. We additionally address how a dual-core crossbar design can eliminate programming time overhead at practical SRAM block sizes and batch sizes. Our results show that a 128 x 128 instance of the proposed architecture can achieve inferences per second (IPS) similar to the Nvidia A100 GPU at 15.4× lower power and 7.24× lower area. |
14:03 CET | SD9.2 | MIXED-SIGNAL MEMRISTOR-BASED ITERATIVE MONTGOMERY MODULAR MULTIPLICATION Speaker: Mehdi Kamal, University of Southern California, US Authors: Mehdi Kamal and Massoud Pedram, University of Southern California, US Abstract In this paper, we present a mixed-signal implementation of the iterative Montgomery multiplication algorithm (called X-IMM) for use in large arithmetic word size (LAWS) computations. LAWS is mainly utilized in security applications such as lattice-based cryptography, where the width of the input operands may be equal to or larger than 1,024 bits. The proposed architecture is based on an iterative implementation of the Montgomery multiplication (MM) algorithm, where some critical parts of the multiplication are computed in the analog domain by mapping them onto a memristor crossbar. Using a memristor crossbar reduces the area usage and latency of the modular multiplication unit compared to its fully digital implementation. The devised mixed-signal MM implementation is scalable: smaller X-IMMs can be cascaded to support dynamically adjustable, larger operand sizes at runtime. The effectiveness of the proposed MM structure is assessed in a 45nm technology, and comparative studies show that the proposed 1,024-bit Radix-4 (Radix-16) Montgomery multiplication architecture provides about 13% (22%) higher GOPS/mm^2 compared to state-of-the-art digital implementations of iterative Montgomery multipliers. |
14:06 CET | SD9.3 | ODLPIM: A WRITE-OPTIMIZED AND LONG-LIFETIME RERAM-BASED ACCELERATOR FOR ONLINE DEEP LEARNING Speaker: Heng Zhou, Huazhong University of Science & Technology, CN Authors: Heng Zhou, Bing Wu, Huan Cheng, Wei Zhao, Xueliang Wei, Jinpeng Liu, Dan Feng and Wei Tong, Huazhong University of Science & Technology, CN Abstract ReRAM-based Processing-In-Memory (PIM) architectures have demonstrated high energy efficiency and performance in deep neural network (DNN) acceleration. Most existing PIM accelerators for DNNs focus on offline batch learning (OBL), which requires the whole dataset to be available before training. However, in the real world, data instances arrive sequentially, and the data pattern may even change, a phenomenon called concept drift. OBL requires expensive retraining to handle concept drift, whereas online deep learning (ODL) has been shown to be a better solution for keeping the model evolving over streaming data. Unfortunately, when ODL optimizes models over a large-scale data stream in a PIM system, unbalanced writes are more severe than in OBL due to the heavier weight updates, resulting in the amplification of unbalanced writes and lifetime deterioration. In this work, we propose ODLPIM, an online deep learning PIM accelerator that extends system lifetime through algorithm-hardware co-optimization. ODLPIM adopts a novel write-optimized parameter update (WARP) scheme that reduces non-critical weight updates in hidden layers. Besides, a table-based inter-crossbar wear-leveling (TIWL) scheme is proposed and applied in the hardware controller to achieve wear-leveling between crossbars for lifetime improvement. Experiments show that WARP reduces weight updates by 15.25% on average and by up to 24% compared to training without WARP, and prolongs system lifetime by 9.65% on average and by up to 26.81%, with a negligible rise in cumulative error rate (up to 0.31%). By combining WARP with TIWL, the lifetime of ODLPIM is improved by an average of 12.59X and up to 17.73X. |
14:09 CET | SD9.4 | SAT-BASED QUANTUM CIRCUIT ADAPTATION Speaker: Sebastian Brandhofer, University of Stuttgart, Institute of Computer Architecture and Computer Engineering and Center for Integrated Quantum Science and Technology, DE Authors: Sebastian Brandhofer1, Jinwoong Kim2, Siyuan Niu3 and Nicholas Bronn4 1University of Stuttgart, DE; 2TU Delft, NL; 3Université de Montpellier, FR; 4IBM Thomas J. Watson Research Center, US Abstract As the nascent field of quantum computing develops, an increasing number of quantum hardware modalities, such as superconducting electronic circuits, semiconducting spins, trapped ions, and neutral atoms, have become available for performing quantum computations. These quantum hardware modalities exhibit varying characteristics and implement different universal quantum gate sets that may, for example, contain several distinct two-qubit quantum gates. Adapting a quantum circuit from a, possibly hardware-agnostic, universal quantum gate set to the quantum gate set of a target hardware modality has a crucial impact on the fidelity and duration of the intended quantum computation. However, current quantum circuit adaptation techniques only apply a specific decomposition or allow only for local improvements to the target quantum circuit, potentially resulting in a quantum computation with less fidelity or more qubit idle time than necessary. These issues are further aggravated by the multiple options for hardware-native quantum gates, which render multiple universal quantum gate sets accessible to a hardware modality. In this work, we developed a satisfiability modulo theories model that determines an optimized quantum circuit adaptation given a set of allowed substitutions and decompositions, a target hardware modality and the quantum circuit to be adapted. We further discuss the physics of the semiconducting spins hardware modality, show possible implementations of distinct two-qubit quantum gates, and evaluate the developed model on the semiconducting spins hardware modality. Using the developed quantum circuit adaptation method on a noisy simulator, we show that the Hellinger fidelity could be improved by up to 40% and the qubit idle time could be decreased by up to 87% compared to alternative quantum circuit adaptation techniques. |
14:12 CET | SD9.5 | ULTRA-DENSE 3D PHYSICAL DESIGN UNLOCKS NEW ARCHITECTURAL DESIGN POINTS WITH LARGE BENEFITS Speaker: Tathagata Srimani, Stanford University, US Authors: Tathagata Srimani1, Robert Radway1, Jinwoo Kim2, Kartik Prabhu1, Dennis Rich1, Carlo Gilardi1, Priyanka Raina1, Max Shulaker3, Sung Kyu Lim2 and Subhasish Mitra1 1Stanford University, US; 2Georgia Tech, US; 3Massachusetts Institute of Technology, US Abstract This paper focuses on iso-on-chip-memory-capacity and iso-footprint Energy-Delay-Product (EDP) benefits of ultra-dense 3D, e.g., monolithic 3D (M3D), computing systems vs. corresponding 2D designs. Simply folding existing 2D designs into corresponding M3D physical designs yields limited EDP benefits (~1.4×). New M3D architectural design points that exploit M3D physical design are crucial for large M3D EDP benefits. We perform comprehensive architectural exploration and detailed M3D physical design using foundry M3D process design kit and standard cell library for front-end-of-line (FEOL) Si CMOS logic, on-chip back-end-of-line (BEOL) memory, and a single layer of on-chip BEOL FETs. We find new M3D AI/ML accelerator architectural design points that have iso-footprint, iso-on-chip-memory-capacity EDP benefits ranging from 5-11.5× vs. corresponding 2D designs (containing only FEOL Si CMOS and on-chip BEOL memory). We also present an analytical framework to derive architectural insights into these benefits, showing that our principles extend to many architectural design points across various device technologies. |
14:15 CET | SD9.6 | MEMRISTOR-SPIKELEARN: A SPIKING NEURAL NETWORK SIMULATOR FOR STUDYING SYNAPTIC PLASTICITY UNDER REALISTIC DEVICE AND CIRCUIT BEHAVIORS Speaker: Yuming Liu, University of Chicago, US Authors: Yuming Liu1, Angel Yanguas-Gil2, Sandeep Madireddy2 and Yanjing Li1 1University of Chicago, US; 2Argonne National Laboratory, US Abstract We present the Memristor-Spikelearn simulator (open-sourced), which is capable of incorporating detailed memristor and circuit models in simulation to enable thorough study of synaptic plasticity in spiking neural networks under realistic device and circuit behaviors. Using this simulator, we demonstrate that: (1) a detailed device model is essential for simulating synaptic plasticity workloads, because results obtained using a simplified model can be misleading (e.g., it can overestimate test accuracy by up to 21.9%); (2) detailed simulation helps to determine the proper range of conductance values to represent weights, which is critical in order to achieve the desired accuracy-energy tradeoff (e.g., increasing the conductance values by 10× can increase accuracy from 70% to 83% at the price of 20× higher energy); and (3) detailed simulation also helps to determine an optimized circuit structure, which is another important design parameter that can yield different accuracy-energy tradeoffs. |
14:18 CET | SD9.7 | EXPLOITING KERNEL COMPRESSION ON BNNS Speaker: Franyell Silfa, UPC, ES Authors: Franyell Silfa, Jose Maria Arnau and Antonio González, UPC, ES Abstract Binary Neural Networks (BNNs) are showing tremendous success on realistic image classification tasks. Notably, their accuracy is similar to the state-of-the-art accuracy obtained by full-precision models tailored to edge devices. In this regard, BNNs are very amenable to edge devices since they employ 1 bit to store the inputs and weights, and thus their storage requirements are low. Moreover, BNN computations are mainly done using xnor and pop-count operations, which are implemented very efficiently using simple hardware structures. Nonetheless, supporting BNNs efficiently on mobile CPUs is far from trivial since their benefits are hindered by frequent memory accesses to load weights and inputs. In BNNs, a weight or an input is stored using one bit, and to increase storage and computation efficiency, several of them are packed together as a sequence of bits. In this work, we observe that the number of unique sequences representing a set of weights or inputs is typically low (e.g., 512). Also, we have seen that during the evaluation of a BNN layer, a small group of unique sequences is employed more frequently than others. Accordingly, we propose exploiting this observation by using Huffman encoding to encode the bit sequences and then using an indirection table to decode them during the BNN evaluation. Also, we propose a clustering-based scheme to identify the most common sequences of bits and replace the less common ones with similar common sequences. As a result, we decrease the storage requirements and memory accesses since the most common sequences are encoded with fewer bits. In this work, we extend a mobile CPU by adding a small hardware structure that can efficiently cache and decode the compressed sequences of bits. We evaluate our scheme using the ReActNet model with the ImageNet dataset on an ARM CPU. Our experimental results show that our technique can reduce the memory requirement by 1.32x and improve performance by 1.35x. |
14:21 CET | SD9.8 | AXI-PACK: NEAR-MEMORY BUS PACKING FOR BANDWIDTH-EFFICIENT IRREGULAR WORKLOADS Speaker: Chi Zhang, ETH Zurich, CH Authors: Chi Zhang1, Paul Scheffler1, Thomas Benz1, Matteo Perotti2 and Luca Benini1 1ETH Zurich, CH; 2ETH Zürich, CH Abstract Data-intensive applications involving irregular memory streams are inefficiently handled by modern processors and memory systems highly optimized for regular, contiguous data. Recent work tackles these inefficiencies in hardware through core-side stream extensions or memory-side prefetchers and accelerators, but fails to provide end-to-end solutions which also achieve high efficiency in on-chip interconnects. We propose AXI-Pack, an extension to ARM's AXI4 protocol introducing bandwidth-efficient strided and indirect bursts to enable end-to-end irregular streams. AXI-Pack adds irregular stream semantics to memory requests and avoids inefficient narrow-bus transfers by packing multiple narrow data elements onto a wide bus. It retains full compatibility with AXI4 and does not require modifications to non-burst-reshaping interconnect IPs. To demonstrate our approach end-to-end, we extend an open-source RISC-V vector processor to leverage AXI-Pack at its memory interface for strided and indexed accesses. On the memory side, we design a banked memory controller efficiently handling AXI-Pack requests. On a system with a 256-bit-wide interconnect running FP32 workloads, AXI-Pack achieves near-ideal peak on-chip bus utilizations of 87% and 39%, speedups of 5.4x and 2.4x, and energy efficiency improvements of 5.3x and 2.1x over a baseline using an AXI4 bus on strided and indirect benchmarks, respectively. |
Extended Abstracts
Time | Label | Presentation Title Authors |
---|---|---|
14:24 CET | SD9.9 | SIMSNN: A WEIGHT-AGNOSTIC RERAM-BASED SEARCH-IN-MEMORY ENGINE FOR SPIKING NEURAL NETWORK ACCELERATION Speaker: Fangxin Liu, Shanghai Jiao Tong University, CN Authors: Fangxin Liu, Xiaokang Yang and Li Jiang, Shanghai Jiao Tong University, CN Abstract Bio-plausible spiking neural networks (SNNs) have gained great momentum due to their inherent efficiency in processing event-driven information. The dominant computation in SNNs, matrix bit-wise AND-add operations, is naturally suited to processing-in-memory (PIM) architectures. The long input spike train of SNNs and the bit-serial processing mechanism of PIM, however, incur considerable latency and frequent analog-to-digital conversion, offsetting the performance gain and energy efficiency. In this paper, we propose a novel Search-in-Memory (SIM) architecture to accelerate SNN inference, named SIMSnn. Rather than processing the input bit-by-bit over multiple time steps, SIMSnn can take in a sequence of spikes and search the result by parallel associative matches in the CAM crossbar. We explore the cascade search mechanism and the temporal pipeline design to enhance the parallelism of the search across time windows. The proposed SIMSnn can leverage the non-structured pruning mechanism, which is unusable for most PIM architectures, to further reduce the CAM overhead. As a weight-agnostic SNN accelerator, SIMSnn can adapt to various evolving SNNs without rewriting the crossbar array. Experimental results show that the proposed SIMSnn achieves 25.3x higher energy efficiency and 13.7x speedup on average compared with the ISAAC-like design. Compared to the state-of-the-art PIM design, NEBULA, SIMSnn can also achieve up to 7.9x energy savings and 5.7x speedup. |
14:24 CET | SD9.10 | BOMIG: A MAJORITY LOGIC SYNTHESIS FRAMEWORK FOR AQFP LOGIC Speaker: Tsung-Yi Ho, The Chinese University of Hong Kong, CN Authors: Rongliang Fu1, Junying Huang2, Mengmeng Wang3, Yoshikawa Nobuyuki3, Bei Yu4, Tsung-Yi Ho4 and Olivia Chen5 1The Chinese University of Hong Kong, CN; 2State Key Lab of Processors, Institute of Computing Technology, CAS, Beijing, China, CN; 3Yokohama National University, JP; 4The Chinese University of Hong Kong, HK; 5Tokyo City University, JP Abstract Adiabatic quantum-flux-parametron (AQFP) logic, an energy-efficient superconductor logic with no static power consumption and ultra-low switching energy, is a promising candidate for energy-efficient computing systems. Due to the native majority function in AQFP logic, which can represent more complex logic with the same cost as the AND/OR function, the design of AQFP circuits differs from AND-OR-inverter-based logic circuits. In addition, AQFP logic imposes path-balancing requirements and fan-out limitations, making traditional majority-based logic optimization methods inapplicable. This paper proposes a global optimization method over the majority-inverter graph (MIG) to minimize the Josephson junction (JJ) number and circuit depth of AQFP circuits. MIG-based transformation methods are first illustrated to construct the feasible domain. The normalized energy-delay-product (EDP), the product of the JJ number and circuit depth of AQFP circuits, is used as the objective function. Then, Bayesian optimization is used to explore the globally optimal transformation sequence applied to AQFP MIG-based logic optimization. Experimental results show that the proposed method achieves a significant improvement in the JJ number and circuit depth compared with the state-of-the-art. |
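The objective in SD9.10 is the product of the JJ count and the circuit depth, minimized over sequences of MIG transformations. The sketch below only illustrates the shape of that search loop with a random search and a made-up cost model; the transformation names and costs are assumptions, and the paper itself drives real MIG rewriting with Bayesian optimization rather than random sampling.

```python
import random

# Assumed rule names standing in for MIG rewriting moves.
TRANSFORMS = ["majority", "associativity", "distributivity", "complement", "relevance"]

def mock_cost(sequence):
    """Pretend cost model returning (jj_count, depth) for a transformation sequence."""
    rng = random.Random(hash(tuple(sequence)))
    jj = 1000 - 20 * len(set(sequence)) + rng.randint(-30, 30)
    depth = 40 - len(sequence) % 7 + rng.randint(-3, 3)
    return jj, depth

def normalized_edp(sequence):
    jj, depth = mock_cost(sequence)
    return jj * depth                     # normalized energy-delay-product proxy

best_seq, best_val = None, float("inf")
for _ in range(200):                      # budget of candidate sequences
    seq = random.choices(TRANSFORMS, k=random.randint(3, 10))
    val = normalized_edp(seq)
    if val < best_val:
        best_seq, best_val = seq, val
print(best_val, best_seq)
```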
SE1 Optimized software architecture towards an improved utilization of hardware features
Add this session to my calendar
Date: Wednesday, 19 April 2023
Time: 14:00 CET - 15:30 CET
Location / Room: Okapi Room 0.8.3
Session chair:
Michele Lora, University of Southern California & University of Verona, IT
14:00 CET until 14:24 CET: Pitches of regular papers
14:24 CET until 15:30 CET: Interactive technical presentations by the authors of regular papers and extended abstracts
Regular Papers
Time | Label | Presentation Title Authors |
---|---|---|
14:00 CET | SE1.1 | MARB: BRIDGE THE SEMANTIC GAP BETWEEN OPERATING SYSTEM AND APPLICATION MEMORY ACCESS BEHAVIOR Speaker: Tianyue Lu, Chinese Academy of Sciences, CN Authors: Haifeng Li1, Ke Liu1, Ting Liang2, Zuojun Li3, Tianyue Lu2, Yisong Chang2, Hui Yuan4, Yinben Xia4, Yungang Bao5, Mingyu Chen5 and Yizhou Shan6 1ICT, CN; 2Chinese Academy of Sciences, CN; 3Institute of Computing Technology, CN; 4Huawei, CN; 5ICT, CAS, CN; 6Huawei Cloud, CN Abstract The virtual memory subsystem (VMS) is a long-standing and integral part of an operating system (OS). It plays a vital role in enabling remote memory systems over fast data center networks and is promising in terms of transparency and generality. Specifically, these systems use three VMS mechanisms: demand paging, page swapping, and page prefetching. However, the VMS inherent data path is costly, which takes a huge toll on performance. Despite prior efforts to propose page swapping and prefetching algorithms to minimize the occurrences of the data path, they still fall short due to the semantic gap between the OS and applications – the VMS has limited knowledge of its running applications' memory access behaviors. In this paper, orthogonal to prior efforts, we take a fundamentally different approach by building an efficient framework to collect full memory access traces at the local bus, and make them available to the OS through CPU cache. Consequently, the page swapping and page prefetching can use this trace to make better decisions, thereby improving the overall performance of systems. We implement a proof-of-concept prototype on commodity x86 servers using a hardware-based memory tracking tool. To showcase our framework's benefits, we integrate it with a state-of-the-art remote memory system and the default kernel page eviction subsystem. Our evaluation shows promising improvements. |
14:03 CET | SE1.2 | SAT-MAPIT: A SAT-BASED MODULO SCHEDULING MAPPER FOR COARSE GRAIN RECONFIGURABLE ARCHITECTURES Speaker: Cristian Tirelli, Università della Svizzera italiana, IT Authors: Cristian Tirelli1, Lorenzo Ferretti2 and Laura Pozzi1 1USI Lugano, CH; 2University of California, Los Angeles, US Abstract Coarse-Grain Reconfigurable Arrays (CGRAs) are emerging low-power architectures aimed at accelerating compute-intensive application loops. The acceleration that a CGRA can ultimately provide, however, heavily depends on the quality of the mapping, i.e., on how effectively the loop is compiled onto the given platform. State-of-the-art compilation techniques achieve mapping through modulo scheduling, a strategy which attempts to minimize the II (Iteration Interval) needed to execute a loop, and they do so usually through well-known graph algorithms, such as Max-Clique Enumeration. We address the mapping problem through a SAT formulation, instead, and thus explore the solution space more effectively than current SoA tools. To formulate the SAT problem, we introduce an ad-hoc schedule called the kernel mobility schedule (KMS), which we use in conjunction with the data-flow graph and the architectural information of the CGRA in order to create a set of boolean statements that describe all constraints to be obeyed by the mapping for a given II. We then let the SAT solver efficiently navigate this complex space. As in other SoA techniques, the process is iterative: if a valid mapping does not exist for the given II, the II is increased and a new KMS and set of constraints are generated and solved. Our experimental results show that SAT-MapIt obtains better results compared to SoA alternatives in 47.72% of the benchmarks explored: sometimes finding a lower II, and in others even finding a valid mapping when none could previously be found. |
14:06 CET | SE1.3 | LIVENESS-AWARE CHECKPOINTING OF ARRAYS FOR EFFICIENT INTERMITTENT COMPUTING Speaker: Youngbin Kim, ETRI, KR Authors: Youngbin Kim, Yoojin Lim and Chaedeok Lim, ETRI, KR Abstract Intermittent computing enables computing in environments that may experience frequent and unpredictable power failures, such as energy harvesting systems. It relies on checkpointing to preserve computing progress between power cycles, which often incurs significant overhead due to energy-expensive writes to Non-Volatile Memory (NVM). In this paper, we present LACT (Liveness-Aware CheckpoinTing), an approach to reducing the size of checkpointed data by exploiting the liveness of memory objects: excluding dead memory objects from checkpointing does not affect the correctness of the program. In particular, LACT can analyze the liveness of arrays, which take up most of the memory space but are not analyzable by existing methods for detecting the liveness of scalar objects. Using the liveness information of arrays, LACT determines the minimized checkpoint range for the arrays at compile time without any runtime additions. Our evaluation shows that LACT achieves an additional reduction of checkpointed data size of 37.8% on average over the existing state-of-the-art technique. Also, our experiments in a real energy harvesting environment show that LACT can reduce the execution time of applications by 27.7% on average. |
14:09 CET | SE1.4 | SERICO: SCHEDULING REAL-TIME I/O REQUESTS IN COMPUTATIONAL STORAGE DRIVES Speaker: Yun HUANG, City University of Hong Kong, CN Authors: Yun HUANG1, Nan Guan2, Shuhan BAI3, Tei-Wei Kuo4 and Jason Xue2 1City University of Hong Kong, CN; 2City University of Hong Kong, HK; 3City University of Hong Kong; Huazhong University of Science and Technology, CN; 4National Taiwan University, TW Abstract The latency and energy consumption caused by I/O accesses are significant in data-centric computing systems. A Computational Storage Drive (CSD) can largely reduce data movement, and thus reduce I/O latency and energy consumption, by performing near-data processing, i.e., offloading some data processing to processors inside the storage device. In this paper, we study the problem of how to efficiently utilize the limited processing and memory resources of a CSD to simultaneously serve multiple I/O requests from different applications with different real-time requirements. We propose SERICO, a novel technique for scheduling computational I/O requests in CSDs. The key idea of SERICO is to perform admission control of real-time computational I/O requests by online schedulability analysis, to avoid wasting the processing capacity of the CSD on meaningless work for requests that are deemed to violate their timing constraints anyway. Each admitted computational I/O request is served in a controlled manner with carefully designed parameters, to meet its timing constraint with minimal memory cost. We evaluate SERICO with both synthetic workloads on simulators and representative applications on a realistic CSD platform. Experiment results show that SERICO significantly outperforms the baseline method currently used by the CSD device and the standard deadline-driven scheduling approach. |
14:12 CET | SE1.5 | REGION-BASED FLASH CACHING WITH JOINT LATENCY AND LIFETIME OPTIMIZATION IN HYBRID SMR STORAGE SYSTEMS Speaker: Zhengang Chen, Capital Normal University, CN Authors: Zhengang Chen1, Guohui Wang1, Zhiping Shi2, Yong Guan3 and Tianyu Wang4 1College of Information Engineering, Capital Normal University, CN; 2Beijing Key Laboratory of Electronic System Reliability Technology, Capital Normal University, CN; 3International Science and Technology Cooperation Base of Electronic System Reliability and Mathematical Interdisciplinary, Capital Normal University, CN; 4The Chinese University of Hong Kong, HK Abstract The frequent Read-Modify-Write operations (RMWs) in Shingled Magnetic Recording (SMR) disks severely degrade the random write performance of the system. Although the adoption of persistent cache (PC) and built-in NAND flash cache alleviates some of the RMWs, when the cache is full, the triggered write-back operations still prolong I/O response time, and the erasure of NAND flash also sacrifices its lifetime. In this paper, we propose a Region-based Co-optimized strategy named Multi-Regional Collaborative Management (MCM) to optimize the average response time by separately managing sequential/random and hot/cold data, and to extend the NAND flash lifetime by a region-aware wear leveling strategy. The experimental results show that our MCM reduces the average response time by 71% and RMWs by 96% on average compared with Skylight (the baseline). Compared with the state-of-the-art flash-based cache (FC) approach, we can still reduce the average response time and flash erase operations by 17.2% and 33.32%, respectively. |
14:15 CET | SE1.6 | GEM-RL: GENERALIZED ENERGY MANAGEMENT OF WEARABLE DEVICES USING REINFORCEMENT LEARNING Speaker: Toygun Basaklar, University of Wisconsin - Madison, US Authors: Toygun Basaklar1, Yigit Tuncel1, Suat Gumussoy2 and Umit Ogras1 1University of Wisconsin - Madison, US; 2Siemens Corporate Technology, US Abstract Energy harvesting (EH) and management (EM) have emerged as enablers of self-sustained wearable devices. Since EH alone is not sufficient for self-sustainability due to uncertainties of ambient sources and user activities, there is a critical need for a user-independent EM approach that does not rely on expected EH predictions. We present a generalized energy management framework (GEM-RL) using multi-objective reinforcement learning. GEM-RL learns the trade-off between utilization and the battery energy level of the target device under dynamic EH patterns and battery conditions. It also uses a lightweight approximate dynamic programming (ADP) technique that utilizes the trained MORL agent to optimize the utilization of the device over a longer period. Thorough experiments show that, on average, GEM-RL achieves Pareto front solutions within 5.4% of the offline Oracle for a given day. For a 7-day horizon, it achieves utility within 4% of the offline Oracle and up to 50% higher utility compared to baseline EM approaches. The hardware implementation of GEM-RL on a wearable device shows negligible execution time (1.98 ms) and energy consumption (23.17 µJ) overhead. |
14:18 CET | SE1.7 | VIX: ANALYSIS-DRIVEN COMPILER FOR EFFICIENT LOW-PRECISION DIFFERENTIABLE INFERENCE Speaker: Ashitabh Misra, University of Illinois at Urbana Champaign, US Authors: Ashitabh Misra, Jacob Laurel and Sasa Misailovic, University of Illinois at Urbana-Champaign, US Abstract As large quantities of stochastic data are processed onboard tiny edge devices, these systems must constantly make decisions under uncertainty. This challenge necessitates principled embedded compiler support for time- and energy-efficient probabilistic inference. However, compiling probabilistic inference to run on the edge is significantly understudied, and the existing research is limited to computationally expensive MCMC algorithms. Hence, these works cannot leverage faster variational inference algorithms which can better scale to larger data sizes that are representative of realistic workloads in the edge setting. However, naively writing code for differentiable inference on resource-constrained edge devices is challenging due to the need for expensive floating point computations. Even when using reduced precision, a developer still faces the challenge of choosing the right quantization scheme, as gradients can be notoriously unstable in the face of low precision. To address these challenges, we propose ViX, which is the first compiler for low-precision probabilistic programming with variational inference. ViX generates optimized variational inference code in reduced precision by automatically exploiting Bayesian domain knowledge and analytical mathematical properties to ensure that low-precision gradients can still be effectively used. ViX can scale inference to much larger data sets than previous compilers for resource-constrained probabilistic programming while attaining both high accuracy and significant speedup. Our evaluation of ViX across 7 benchmarks shows that ViX-generated code is up to 8.15× faster than performing the same variational inference in 32-bit floating point and also up to 22.67× faster than performing the variational inference in 64-bit double precision, all with minimal accuracy loss. Further, on a subset of our benchmarks, ViX can scale inference to data sizes between 16× and 80× larger than the existing state-of-the-art tool Statheros. |
14:21 CET | SE1.8 | CHAMELEON: DUAL MEMORY REPLAY FOR ONLINE CONTINUAL LEARNING ON EDGE DEVICES Speaker: Shivam Aggarwal, National University of Singapore, SG Authors: Shivam Aggarwal, Kuluhan Binici and Tulika Mitra, National University of Singapore, SG Abstract Once deployed on edge devices, a deep neural network model should dynamically adapt to newly discovered environments and personalize its utility for each user. The system must be capable of continual learning, i.e., learning new information from a temporal stream of data in situ without forgetting previously acquired knowledge. However, the prohibitive intricacies of such a personalized continual learning framework stand at odds with limited compute and storage on edge devices. Existing continual learning methods rely on massive memory storage to preserve the past data while learning from the incoming data stream. We propose Chameleon, a hardware-friendly continual learning framework for user-centric training with dual replay buffers. The proposed strategy leverages the hierarchical memory structure available on most edge devices, introducing a short-term replay store in the on-chip memory and a long-term replay store in the off-chip memory to acquire new information while retaining past knowledge. Extensive experiments on two large-scale continual learning benchmarks demonstrate the efficacy of our proposed method, achieving better or comparable accuracy than existing state-of-the-art techniques while reducing the memory footprint by roughly 16x. Our method achieves up to 7x speedup and energy efficiency on edge devices such as ZCU102 FPGA, NVIDIA Jetson Nano and Google's EdgeTPU. Our code is available at https://github.com/ecolab-nus/Chameleon. |
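For readers unfamiliar with modulo scheduling as used in SE1.2, the sketch below shows only the iterative structure described in the abstract: start from a lower-bound iteration interval (II) and retry at II+1 until a mapping exists. The ResMII-style lower bound and the stubbed feasibility check are assumptions standing in for the kernel mobility schedule and the SAT solver, which are not reproduced here.

```python
import math

def res_mii(num_ops, num_pes):
    # Classic resource-constrained lower bound on the II.
    return math.ceil(num_ops / num_pes)

def try_map_at_ii(dfg_ops, cgra_pes, ii):
    """Stub standing in for building the kernel mobility schedule and calling
    a SAT solver on the mapping constraints for this II (assumption)."""
    return ii >= math.ceil(len(dfg_ops) / cgra_pes) + 1   # pretend one extra cycle is needed

def map_loop(dfg_ops, cgra_pes, max_ii=64):
    ii = res_mii(len(dfg_ops), cgra_pes)
    while ii <= max_ii:
        if try_map_at_ii(dfg_ops, cgra_pes, ii):
            return ii
        ii += 1                     # no valid mapping: relax the II and retry
    return None

print(map_loop(dfg_ops=list(range(17)), cgra_pes=4))   # -> 6 with this stub
```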
Extended Abstracts
Time | Label | Presentation Title Authors |
---|---|---|
14:24 CET | SE1.9 | FAGC: FREE SPACE FRAGMENTATION AWARE GC SCHEME BASED ON OBSERVATIONS OF ENERGY CONSUMPTION Speaker: Ying Yuan, Huazhong University of Science & Technology, CN Authors: Lihua Yang1, Zhipeng Tan2, Fang Wang2, Yang Xiao2, Wei Zhang1 and Biao He3 1National University of Defense Technology, CN; 2Huazhong University of Science & Technology, CN; 3Huawei Technologies Co., LTD, CN Abstract Smartphones are everyday necessities with a limited power supply. Charging a smartphone twice a day or more affects the user experience. Flash-Friendly File System (F2FS) is a widely used log-structured file system for smartphones. Free space fragmentation in F2FS mainly consists of invalid blocks, which cause performance degradation. F2FS reclaims invalid blocks by garbage collection (GC). We explore the energy consumption of GC and the effect of GC on reducing free space fragments. We observe that the energy consumption of one background GC is large but its effect on reducing free space fragments is limited. These observations motivate us to improve the energy efficiency of GC. Based on data analysis, we reassess how much free space constitutes a free space fragment and use a free space fragmentation factor to quickly measure the degree of free space fragmentation. We propose the free space fragmentation aware GC scheme (FAGC), which optimizes the selection of victim segments and the migration of valid blocks. Experiments on a real platform show that FAGC reduces the GC count by 82.68% and 74.51% compared with traditional F2FS and the latest GC optimization of F2FS, ATGC, respectively. FAGC reduces energy consumption by 164.37 J and 100.64 J compared to traditional F2FS and ATGC, respectively, for a synthetic benchmark. |
14:24 CET | SE1.10 | TRANSLIB: A LIBRARY TO EXPLORE TRANSPRECISION FLOATING-POINT ARITHMETIC ON MULTI-CORE IOT END-NODES Speaker: Seyed Ahmad Mirsalari, Università di Bologna, IT Authors: Seyed Ahmad Mirsalari1, Giuseppe Tagliavini1, Davide Rossi1 and Luca Benini2 1Università di Bologna, IT; 2ETH Zurich, CH | Università di Bologna, IT Abstract Reduced-precision floating-point (FP) arithmetic is being widely adopted to reduce memory footprint and execution time on battery-powered Internet of Things (IoT) end-nodes. However, reduced-precision computations must meet end-to-end precision constraints to be acceptable at the application level. This work introduces TransLib, an open-source kernel library based on transprecision computing principles, which provides knobs to exploit different FP data types (i.e., float, float16, and bfloat16), also considering the trade-off between homogeneous and mixed-precision solutions. We demonstrate the capabilities of the proposed library on PULP, a 32-bit microcontroller (MCU) coupled with a parallel, programmable accelerator. On average, TransLib kernels achieve an IPC of 0.94 and a speed-up of 1.64× using 16-bit vectorization. The parallel variants achieve a speed-up of 1.97×, 3.91×, and 7.59× on 2, 4, and 8 cores, respectively. The memory footprint reduction is between 25% and 50%. Finally, we show that mixed-precision variants increase the accuracy by 30× at the cost of 2.09× execution time and 1.35× memory footprint compared to the vectorized float16 variant. |
14:24 CET | SE1.11 | CFU PLAYGROUND: WANT A FASTER ML PROCESSOR? DO IT YOURSELF! Speaker: Shvetank Prakash, Harvard, US Authors: Shvetank Prakash1, Timothy Callahan2, Joseph Bushagour3, Colby Banbury4, Alan Green2, Pete Warden5, Tim Ansell2 and Vijay Janapa Reddi1 1Harvard University, US; 2Google, US; 3Purdue University, US; 4Harvard, US; 5Stanford University, US Abstract The rise of machine learning (ML) has necessitated the development of innovative processing engines. However, development of specialized hardware accelerators can incur enormous one-time engineering expenses that should be avoided in low-cost embedded ML systems. In addition, embedded systems have tight resource constraints that prevent them from affording the "full-blown" machine learning (ML) accelerators seen in many cloud environments. In embedded situations, a custom function unit (CFU) that is more lightweight is preferable. We offer CFU Playground, an open-source toolchain for accelerating embedded machine learning (ML) on FPGAs through the use of CFUs. |
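The accuracy side of the transprecision trade-off explored by TransLib (SE1.10) can be reproduced in a few lines of numpy. The sketch below compares a dot product in float32, float16 and an emulated bfloat16; the mantissa-truncation emulation and the toy vectors are assumptions, and TransLib itself targets the PULP platform rather than numpy.

```python
import numpy as np

def to_bfloat16(x):
    """Emulate bfloat16 by zeroing the low 16 bits of the float32 encoding (assumption)."""
    bits = np.asarray(x, dtype=np.float32).view(np.uint32)
    return (bits & np.uint32(0xFFFF0000)).view(np.float32)

rng = np.random.default_rng(0)
a = rng.standard_normal(4096).astype(np.float32)
b = rng.standard_normal(4096).astype(np.float32)

ref = np.dot(a.astype(np.float64), b.astype(np.float64))    # high-precision reference
for name, xa, xb in [("float32", a, b),
                     ("float16", a.astype(np.float16), b.astype(np.float16)),
                     ("bfloat16", to_bfloat16(a), to_bfloat16(b))]:
    err = abs(float(np.dot(xa, xb)) - ref) / abs(ref)
    print(f"{name:9s} relative error: {err:.2e}")
```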
SS2 Physical attacks and countermeasures
Add this session to my calendar
Date: Wednesday, 19 April 2023
Time: 14:00 CET - 15:30 CET
Location / Room: Gorilla Room 1.5.1
Session chair:
Arthur Beckers, NXP, BE
14:00 CET until 14:27 CET: Pitches of regular papers
14:27 CET until 15:30 CET: Interactive technical presentations by the authors of regular papers and extended abstracts
Regular Papers
Time | Label | Presentation Title Authors |
---|---|---|
14:00 CET | SS2.1 | TABLE RE-COMPUTATION BASED LOW ENTROPY INNER PRODUCT MASKING SCHEME Speaker: Wei Cheng, Télécom Paris & Secure-IC S.A.S, FR Authors: Jingdian Ming1, Yongbin Zhou2, Wei Cheng3 and Huizhong Li1 1Chinese Academy of Sciences, CN; 2Nanjing University of Science and Technology, CN; 3LTCI, Telecom Paris, Institut Polytechnique de Paris, 91120, Palaiseau, FR Abstract Masking is a popular countermeasure due to its provable security. Table re-computation based Boolean masking (BM) is efficient for a small number of masking shares, and addition chain based inner product masking (IPM) provides a higher security order than BM. As a result, the natural question is: can we design a masking scheme that costs close to that of re-computation based BM while providing security comparable to that of addition chain based IPM? In this paper, we propose a table re-computation based IPM scheme that provides 3rd-order security while being slightly more expensive than table re-computation based BM. Furthermore, we improve the side-channel security of IPM by randomly selecting the parameter L from an elaborated low entropy set, which we call low entropy inner product masking (LE-IPM). On an Intel Core i7-4790 CPU and an ARM Cortex-M4 based MCU, we implemented four masking schemes for AES, namely the addition chain based IPM and table re-computation based BM, IPM, and LE-IPM. Our proposals perform slightly slower (by about 0.8 times) than table re-computation based BM but significantly faster (at least 30 times) than addition chain based IPM. Furthermore, we assess the security of our proposals using a standard method, the test vector leakage assessment (TVLA) methodology. Our proposals provide the expected security against side-channel attacks according to the evaluation. |
14:03 CET | SS2.2 | SCFI: STATE MACHINE CONTROL-FLOW HARDENING AGAINST FAULT ATTACKS Speaker: Pascal Nasahl, TU Graz, AT Authors: Pascal Nasahl, Martin Unterguggenberger, Rishub Nagpal, Robert Schilling, David Schrammel and Stefan Mangard, TU Graz, AT Abstract Fault injection (FI) is a powerful attack methodology allowing an adversary to entirely break the security of a target device. As finite-state machines (FSMs) are fundamental hardware building blocks responsible for controlling systems, inducing faults into these controllers enables an adversary to hijack the execution of the integrated circuit. A common defense strategy mitigating these attacks is to manually instantiate FSMs multiple times and detect faults using a majority voting logic. However, as each additional FSM instance only provides security against one additional induced fault, this approach scales poorly in a multi-fault attack scenario. In this paper, we present SCFI: a strong, probabilistic FSM protection mechanism ensuring that control-flow deviations from the intended control-flow are detected even in the presence of multiple faults. At its core, SCFI consists of a hardened next-state function absorbing the execution history as well as the FSM's control signals to derive the next state. When either the absorbed inputs, the state registers, or the function itself are affected by faults, SCFI triggers an error with no detection latency. We integrate SCFI into a synthesis tool capable of automatically hardening arbitrary unprotected FSMs without user interaction and open-source the tool. Our evaluation shows that SCFI provides strong protection guarantees with a better area-time product than FSMs protected using classical redundancy-based approaches. Finally, we formally verify the resilience of the protected state machines using a pre-silicon fault analysis tool. |
14:06 CET | SS2.3 | EASIMASK - TOWARDS EFFICIENT, AUTOMATED, AND SECURE IMPLEMENTATION OF MASKING IN HARDWARE Speaker: Fabian Buschkowski, Ruhr-University Bochum, DE Authors: Fabian Buschkowski1, Pascal Sasdrich2 and Tim Güneysu3 1Ruhr-University, DE; 2Ruhr-Universität Bochum, DE; 3Ruhr-Universität Bochum & DFKI, DE Abstract Side-Channel Analysis (SCA) is a major threat to implementations of mathematically secure cryptographic algorithms. Applying masking countermeasures to hardware-based implementations is both time-consuming and error-prone due to side-effects buried deeply in the hardware design process. As a consequence, we propose our novel framework EASIMASK in this work. Our semi-automated framework enables designers who have little experience with hardware implementation, physical security, or the application of countermeasures to create a securely masked hardware implementation from an abstract description of a cryptographic algorithm. Its design flow relieves the developer of many challenges in the masking process of hardware implementations, while the generated implementations match the efficiency of hand-optimized designs from experienced security engineers. The modular approach can be mapped to arbitrary instantiations using different languages and transformations. We have verified the functionality, security, and efficiency of generated designs for several state-of-the-art symmetric cryptographic algorithms, such as the Advanced Encryption Standard (AES), Keccak, and PRESENT. |
14:09 CET | SS2.4 | OBFUSLOCK: AN EFFICIENT OBFUSCATED LOCKING FRAMEWORK FOR CIRCUIT IP PROTECTION Speaker: Hai Zhou, Northwestern University, US Authors: You Li, Guannan Zhao, Yunqi He and Hai Zhou, Northwestern University, US Abstract With the rapid evolution of the IC supply chain, circuit IP protection has become a critical realistic issue for the semiconductor industry. One promising technique to resolve the issue is logic locking. It adds key inputs to the original circuit such that only authorized users can get the correct function, and it modifies the circuit to obfuscate it against structural analysis. However, there is a trilemma among locking, obfuscation, and efficiency in all existing logic locking methods: at most two of the three objectives can be achieved. In this work, we propose ObfusLock, the first logic locking method that simultaneously achieves all three objectives: locking security, obfuscation safety, and locking efficiency. ObfusLock is based on solid mathematical proofs, incurs small overheads (<5% on average), and has passed experimental tests of various existing attacks. |
14:12 CET | SS2.5 | TEMPERATURE IMPACT ON REMOTE POWER SIDE-CHANNEL ATTACKS ON SHARED FPGAS Speaker: Ognjen Glamocanin, EPFL, CH Authors: Ognjen Glamocanin, Hajira Bazaz, Mathias Payer and Mirjana Stojilovic, EPFL, CH Abstract To answer the growing demand for hardware acceleration, Amazon, Microsoft, and many other major cloud service providers have included field-programmable gate arrays (FPGAs) in their datacenters. However, researchers have shown that cloud FPGAs, when shared between multiple tenants, face the threat of remote power side-channel analysis (SCA) attacks. FPGA time-to-digital converter (TDC) sensors enable adversaries to sense voltage fluctuations and, in turn, break cryptographic implementations or extract confidential information with the help of machine learning (ML). The operating temperature of the TDC sensor affects the traces it acquires, but its impact on the success of remote power SCA attacks has largely been ignored in literature. This paper attempts to fill in this gap. We focus on two attack scenarios: correlation power analysis (CPA) and ML-based profiling attacks. We show that the temperature impacts the success of the remote power SCA attacks: with the ambient temperature increasing, the success rate of the CPA attack decreases. In-depth analysis reveals that TDC sensor measurements suffer from temperature-dependent effects, which, if ignored, can lead to misleading and overly optimistic results of ML-based profiling attacks. We evaluate and stress the importance of following power side-channel trace acquisition guidelines for minimizing the temperature effects and, consequently, obtaining a more realistic measure of success for remote ML-based profiling attacks. |
14:15 CET | SS2.6 | APUF PRODUCTION LINE FAULTS: UNIQUENESS AND TESTING Speaker: Yeqi Wei, University of Illinois Chicago, US Authors: Yeqi Wei1, Wenjing Rao2 and Natasha Devroye2 1University of Illinois Chicago, US; 2University of Illinois at Chicago, US Abstract Arbiter Physically Unclonable Functions (APUFs) are low-cost hardware security primitives that may serve as unique digital fingerprints for ICs. To fulfill this role, it is critical for manufacturers to ensure that a batch of PUFs coming off the same design and production line have different truth tables, and uniqueness / inter-PUF-distance metrics have been defined to measure this. This paper points out that a widely-used uniqueness metric fails to capture some special cases, which we remedy by proposing a modified uniqueness metric. We then look at two fundamental APUF-native production line fault models that severely affect uniqueness: the mu (abnormal mean of a delay difference element) and sigma (abnormal variance of a delay difference element) faults. We propose test and diagnosis methods aimed at these two APUF production line faults, and show that these low-cost techniques can efficiently and effectively detect such faults, and pinpoint the element of abnormality, without the (costly) need to directly measure the uniqueness metric of a PUF batch. |
14:18 CET | SS2.7 | FAULT MODEL ANALYSIS OF DRAM UNDER ELECTROMAGNETIC FAULT INJECTION ATTACK Speaker: Longtao Guo, Tianjin University, CN Authors: Qiang Liu, Longtao Guo and Honghui Tang, Tianjin University, CN Abstract Electromagnetic fault injection (EMFI) attacks pose serious threats to the security of integrated circuits. Memory storing sensitive code and data has become the first choice of attack target. This work performs a thorough characterization of the induced faults and the associated fault model of EMFI attacks on DRAM. Specifically, we first carry out a set of experiments to analyse the sensitivity of various types of memory to EMFI. The analysis shows that DRAM is more sensitive to EMFI than EEPROM, Flash, and SRAM in this experiment. Then, we classify the induced faults in DRAM and formulate the fault models. Finally, we find the underlying reasons that explain the observed fault models by circuit-level simulation of DRAM under EMFI. The in-depth understanding of the fault models will guide the design of DRAM against EMFI attacks. |
14:21 CET | SS2.8 | EXPANDING IN-CONE OBFUSCATED TREE FOR ANTI SAT ATTACK Speaker: RuiJie Wang, National TsingHua University, CN Authors: RuiJie Wang1, Li-Nung Hsu1, Yung-Chih Chen2 and TingTing Hwang1 1National Tsing Hua University, TW; 2National Taiwan University of Science and Technology, TW Abstract Logic locking is a hardware security technology to protect circuit designs from overuse, piracy, and reverse engineering. It protects a circuit by inserting key gates to hide the circuit functionality, so that the circuit is functional only when a correct key is applied. In recent years, encrypting the point function, e.g., AND-tree, in a circuit has been shown to be promising to resist SAT attack. However, the encryption technique may suffer from two problems: First, the tree size may not be large enough to achieve desired security. Second, SAT attack could break the encryption in one iteration when it finds a specific input pattern, called remove-all DIP. Thus, in this paper, we present a new method for constructing the obfuscated tree. We first apply the sum-of-product transformation to find the largest AND-tree in a circuit, and then insert extra variables with the proposed split-compensate operation to further enlarge the AND-tree and mitigate the remove-all DIP issue. The experimental results show that the proposed obfuscated tree can effectively resist SAT attack. |
14:24 CET | SS2.9 | SHELL: SHRINKING EFPGA FABRICS FOR LOGIC LOCKING Speaker: Mark Tehranipoor, University of Florida, US Authors: Hadi Mardani Kamali1, Kimia Zamiri Azar1, Farimah Farahmandi1 and Mark Tehranipoor2 1University of Florida, US; 2Intel Charles E. Young Preeminence Endowed Chair Professor in Cybersecurity, Associate Chair for Research and Strategic Initiatives, ECE Department, University of Florida, US Abstract The utilization of fully reconfigurable logic and routing modules may be considered as one potential and even provably resilient technique against intellectual property (IP) piracy and integrated circuit (IC) overproduction. The embedded FPGA (eFPGA) is one instance that could be used for IP redaction, hiding the functionality through the untrusted stages of the IC supply chain. The eFPGA architecture, albeit reliable, unnecessarily inflates the die size even though it is supposed to be applied at fine granularity to small modules/IPs. In this paper, we propose SheLL, which primarily embeds the interconnects (routing channels) of the design and secondarily twists the minimal logic parts of the design into the eFPGA architecture. In SheLL, the eFPGA architecture is customized for this specific logic locking methodology, allowing us to minimize the overhead of the eFPGA fabric as much as possible. Our experimental results demonstrate that SheLL guarantees robustness against notable attacks while its overhead is significantly lower compared to existing eFPGA-based competitors. |
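SS2.1 builds on the classical table re-computation countermeasure. As background, the sketch below shows the plain first-order Boolean-masked variant of that idea on a 4-bit S-box (the PRESENT S-box is used only as an example); the paper's inner-product extension and low-entropy parameter selection are not reproduced here.

```python
import secrets

SBOX = [0xC, 0x5, 0x6, 0xB, 0x9, 0x0, 0xA, 0xD,
        0x3, 0xE, 0xF, 0x8, 0x4, 0x7, 0x1, 0x2]   # PRESENT S-box, 4-bit (example choice)

def recompute_table(sbox, m_in, m_out):
    """Return T' with T'[x ^ m_in] = sbox[x] ^ m_out for all x."""
    masked = [0] * len(sbox)
    for x in range(len(sbox)):
        masked[x ^ m_in] = sbox[x] ^ m_out
    return masked

x = 0x7                                    # sensitive value (never handled in the clear below)
m_in, m_out = secrets.randbelow(16), secrets.randbelow(16)
masked_x = x ^ m_in                        # only the masked value is looked up
table = recompute_table(SBOX, m_in, m_out)
masked_y = table[masked_x]
assert masked_y ^ m_out == SBOX[x]         # unmasking recovers S(x)
```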
Extended Abstracts
Time | Label | Presentation Title Authors |
---|---|---|
14:27 CET | SS2.10 | HIGHLIGHTING TWO EM FAULT MODELS WHILE ANALYZING A DIGITAL SENSOR LIMITATIONS Speaker: Roukoz Nabhan, Mines Saint-Etienne, FR Authors: Roukoz Nabhan1, Jean-Max Dutertre1, Jean-Baptiste Rigaud1, Jean-Luc Danger2 and Laurent Sauvage2 1Mines Saint-Etienne, FR; 2Télécom ParisTech, FR Abstract Fault injection attacks can be carried out against an operating circuit by exposing it to EM perturbations. These attacks can be detected using embedded digital sensors based on the EM fault injection mechanism, such as the one introduced by El-Baze et al., which uses the sampling fault model. We experimentally tested the efficiency of this sensor embedded in the AES accelerator of an FPGA. It proved effective when the target was clocked at moderate frequency (the injected faults were consistent with the sampling fault model). As the clock frequency was progressively increased, faults started to escape detection, which raises warnings about possible limitations of the sampling model. Further tests at frequencies close to the target's maximum frequency revealed faults injected according to a timing fault model. Both series of experimental results ascertain that EM injection can follow at least two different fault models. Undetected faults and the existence of different fault injection mechanisms cast doubt upon the use of sensors based on a single model. |
14:27 CET | SS2.11 | SECURING HETEROGENEOUS 2.5D ICS AGAINST IP THEFT THROUGH DYNAMIC INTERPOSER OBFUSCATION Speaker: Jonti Talukdar, Duke University, US Authors: Jonti Talukdar1, Arjun Chaudhuri1, Jinwoo Kim2, Sung-Kyu Lim2 and Krishnendu Chakrabarty1 1Duke University, US; 2Georgia Tech, US Abstract Recent breakthroughs in heterogeneous integration (HI) technologies using 2.5D and 3D ICs have been key to advances in the semiconductor industry. However, heterogeneous integration has also led to several sources of distrust due to the use of third-party IP, testing, and fabrication facilities in the design and manufacturing process. Recent work on 2.5D IC security has only focused on attacks that can be mounted through rogue chiplets integrated in the design. Thus, existing solutions implement inter-chiplet communication protocols that prevent unauthorized data modification and interruption in a 2.5D system. However, none of the existing solutions offer inherent security against IP theft. We develop a comprehensive threat model for 2.5D systems indicating that such systems remain vulnerable to IP theft. We present a method that prevents IP theft by obfuscating the connectivity of chiplets on the interposer using reconfigurable interconnection networks. We also evaluate the PPA impact and security offered by our proposed scheme. |
14:27 CET | SS2.12 | WARM-BOOT ATTACK ON MODERN DRAMS Speaker: SHUO WANG, University of Florida, CN Authors: Yichen Jiang, Shuo Wang, Renato Jansen Figueiredo and Yier Jin, University of Florida, US Abstract Memory plays a critical role in storing almost all computation data for various applications, including those with sensitive data such as bank transactions and critical business management. As a result, protecting memory security from attackers with physical access is ultimately important. Various memory attacks have been proposed, among which "cold boot" and RowHammer are two leading examples. DRAM manufacturers have deployed a series of protection mechanisms to counter these attacks. Even with the latest protection techniques, DRAM may still be vulnerable to attackers with physical access. In this paper, we proposed a novel "warm boot" attack which utilizes external power supplies to bypass the existing protection mechanisms and steal the data from the modern SODIMM DDR4 memory. The proposed "warm boot" attack is applied to various DRAM chips from different brands. Based on our experiments, the "warm boot" attack can achieve as high as 94% data recovery rate from SODIMM DDR4 memory. |
14:27 CET | SS2.13 | LOW-COST FIRST-ORDER SECURE BOOLEAN MASKING IN GLITCHY HARDWARE Speaker: Dilip Kumar S V, COSIC, KU Leuven, BE Authors: Dilip Kumar S V, Josep Balasch, Benedikt Gierlichs and Ingrid Verbauwhede, KU Leuven, BE Abstract We describe how to securely implement the logical AND of two bits in hardware in the presence of glitches without the need for fresh randomness, and we provide guidelines for the composition of circuits. As a case study, we design, implement and evaluate a DES core. Our goal is an overall practically relevant tradeoff between area, latency, randomness cost, and security. We focus on first-order secure Boolean masking and we do not aim for provable security. The resulting DES engine shows no evidence of first-order leakage in a non-specific leakage assessment with 50M traces. |
14:27 CET | SS2.14 | TIPLOCK: KEY-COMPRESSED LOGIC LOCKING USING THROUGH-INPUT-PROGRAMMABLE LOOKUP-TABLES Speaker: Kaveh Shamsi, University of Texas at Dallas, US Authors: Kaveh Shamsi and Rajesh Datta, University of Texas at Dallas, US Abstract Herein we explore using logic elements that can be programmed through their inputs for logic locking. For this purpose, we design a novel through-input-programmable (TIP) lookup-table (LUT) element and develop algorithms to find cuts in the circuit that can be mapped to such elements while maintaining programmability. Our proposed TIPLock flow achieves area savings of 50-70% compared to the traditional approach of using a key-vector-long scan-chain. |
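As background to SS2.13, the sketch below shows a textbook first-order masked AND gadget (Trichina-style) over two Boolean shares that consumes one fresh random bit per gate. The contribution described in that abstract is precisely to avoid such per-gate fresh randomness in glitchy hardware, so this is only the conventional baseline, not the authors' construction.

```python
import secrets

def share(bit):
    """Split a bit into two Boolean shares whose XOR equals the bit."""
    r = secrets.randbelow(2)
    return r, bit ^ r

def masked_and(a0, a1, b0, b1):
    r = secrets.randbelow(2)               # fresh randomness for this gate (the cost SS2.13 removes)
    c0 = r
    c1 = (((a0 & b0) ^ r) ^ (a0 & b1)) ^ (a1 & b0) ^ (a1 & b1)
    return c0, c1

for a in (0, 1):
    for b in (0, 1):
        a0, a1 = share(a)
        b0, b1 = share(b)
        c0, c1 = masked_and(a0, a1, b0, b1)
        assert c0 ^ c1 == (a & b)          # unmasked result equals the plain AND
```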
W02 3D Integration: Heterogeneous 3D Architectures and Sensors
Add this session to my calendar
Date: Wednesday, 19 April 2023
Time: 14:00 CET - 18:00 CET
Location / Room: Nightingale Room 2.6.1/2
Organisers:
Pascal VIVET, CEA List, FR
Peter Ramm, Fraunhofer EMFT, DE
Mustafa Badaroglu, QUALCOMM, US
Subhasish Mitra, Stanford University, US
Workshop Description
3D technologies are becoming more and more pervasive in digital architectures, as a strong enabler for heterogeneous integration. With current nanometric technologies approaching their scaling limits, 3D integration technology is paving the way to a wide architectural scope, with reduced cost, reduced form factor, and increased energy efficiency, allowing a wide variety of heterogeneous architectures. Due to the high amount of required data and associated memory capacity, ML and AI accelerators could benefit from 3D integration not only for HPC, but also for the edge and embedded HPC. 3D integration and its associated architectures are opening a wide spectrum of system solutions, from chiplet-based partitioning for High Performance Computing to various sensors such as fully integrated image sensors embedding AI features, but also for the next generation of computing architectures: AI accelerators, in-memory computing, quantum computing, etc.
The 3D Integration Workshop was held at the DATE conference from 2009 to 2015 and again in 2022. With the continued evolution of 3D technologies in terms of interconnect density and their evolving manufacturing ecosystem, there is a strong need to pursue research efforts on key aspects of architecture and design, exploiting the potential capabilities offered by 3D integration.
The goal of the 3D Integration Workshop is to bring together experts from both academia and industry, interested in this exciting and rapidly evolving field, in order to update each other on the latest state-of-the-art, exchange ideas, and discuss future challenges.
This half-day event consists of a plenary keynote, invited talks, and regular presentations.
Technical Program
Tentative schedule, under construction
Keynote
Session Chair : Peter Ramm, Fraunhofer, Germany
14:00 – 14:30 Chiplets for AI – AI for chiplets
Paul Franzon, North Carolina State University, USA
Session 1 : Chiplet based systems
14:30 – 14:45 Occamy - A 432-core RISC-V Based 2.5D Chiplet System for Ultra-Efficient (Mini) Floating-Point Computation
Gianna Paulin, ETH-Z, Switzerland.
14:45 – 15:00 Toward industrialization of 2.5D/3D heterogeneous solutions for ASICs
Fady Abouzeid, Philippe Roche, STMicroelectronics, France.
15:00 – 15:15 Energy-Efficient Communication in 2.5D Integrated Systems
Vasilis F. Pavlidis, Aristotle University of Thessaloniki, Greece.
15:15 – 15:30 Why Advanced Packaging & 3D Integration Does Matter to Everybody
Anna Fontanelli, Monozukuri SpA, Rome, Italy
15:30 – 16:15 Coffee Break
Session 2 : Advanced 3D architecture and design methodology
16:15 – 16:30 Temperature-Aware Design of 3D-Stacked Accelerators
Ayse K. Coskun, Boston University, USA.
16:30 – 16:45 Thermally aware 3D sign-off and design enablement of 3-dies stack
Mohamed Naeim and Dragomir Milojevic, IMEC, ULB, Belgium
16:45 – 17:00 3D Integration and Advanced Packaging for Modular Quantum Computer based on Diamond Spin Qubits
Ryoichi Ishihara, TU Delft, Netherlands
17:00 – 17:15 Efficient In Sensor processing based on advanced 3D technologies
Sébastien Thuriès, CEA List, France
17:15 – 17:30 Integrating Fault Tolerance for 2.5D/3D Chiplets Using the Advanced Interface Bus (AIB)
Antoine Rouget, STMicroelectronics / CEA, LIST, Grenoble, France
17:30 – 17:45 Efficient and Reliable Hardware Architectures based on Vertical Nanowire FETs
Bastien Deveautour, Institute of Nanotechnology (INL), France
Key Dates
Abstract Submission deadline | |
---|---|
Notification of Acceptance | |
Presentations and posters ready | 26 March 2023 |
Workshop | Wednesday, 19 April 2023 - 14:00 - 18:00 |
Workshop Committee
- General co-Chairs:
- P. Vivet – CEA-LIST, IRT Nanoelec (FR)
- M. Badaroglu, Qualcomm, (BE)
- Program Chair:
- P. Ramm, Fraunhofer EMFT (DE)
- Special Session Chair
- S. Mitra, Stanford University (USA)
- Industrial Liaison Chair
- Eric Ollier, CEA-Leti, IRT Nanoelec (FR)
Past editions
The 3D Integration workshop took place from 2009 to 2015 and was restarted in 2022.
- DATE 2009: https://past.date-conference.com/date09/conference/workshop-W5
- DATE 2010: https://past.date-conference.com/date10/conference/workshop-W5
- DATE 2011: https://past.date-conference.com/date11/conference/workshop-W5
- DATE 2012: https://past.date-conference.com/date12/conference/workshop-W5
- DATE 2013: https://past.date-conference.com/date13/conference/workshop-W5
- DATE 2014: https://past.date-conference.com/date14/conference/workshop-W5
- DATE 2015: https://past.date-conference.com/date15/conference/workshop-W05
- DATE 2022: https://date22.date-conference.com/workshop/w02
FS9 Focus session: Learning-Oriented Reliability Improvement of Computing Systems From Transistor to Application Level
Add this session to my calendar
Date: Wednesday, 19 April 2023
Time: 16:30 CET - 18:00 CET
Location / Room: Gorilla Room 1.5.3
Session chair:
Christian Pilato, Politecnico di Milano, IT
Session co-chair:
Behnaz Ranjbar, TU Dresden, DE
Time | Label | Presentation Title Authors |
---|---|---|
16:30 CET | FS9.1 | ESTIMATING DEVICE AND CIRCUIT RELIABILITY Speaker: Hussam Amrouch, University of Stuttgart, DE Authors: Florian Klemme1, Paul Genssler1 and Hussam Amrouch2 1University of Stuttgart, DE; 2TU Munich, DE Abstract The pivotal issue of reliability is of colossal concern to circuit designers. Transistor self-heating is an ever-increasing challenge because transistor scaling is reaching atomic levels at which quantum confinement becomes substantially prominent. With more confined 3D structures (e.g., TSMC Nanosheet FETs and Intel Ribbon FETs), heat arising in the transistor's channel cannot be easily dissipated and is hence "trapped" there. This, in turn, largely accelerates the underlying aging mechanisms in transistors. At design time, it is profoundly challenging to estimate close-to-the-edge safety margins that keep aging and self-heating effects during the entire projected lifetime at bay. This is because foundries do not share their calibrated physics-based models, which comprise highly confidential technology and material parameters. In this talk, we will demonstrate how machine learning techniques (both classical and brain-inspired methods) open new doors for foundries to train accurate models that empower circuit designers to estimate the actual impact of aging from the material and transistor level all the way up to the circuit and processor level without sharing any confidential physics-based models. Further, we will demonstrate how well-established EDA tools can be employed to propagate self-heating effects from individual devices at the transistor level all the way up to complete large processors at the final layout level. |
16:53 CET | FS9.2 | IMPROVING ARCHITECTURAL RELIABILITY Speaker: Aviral Shrivastava, Arizona State University, US Authors: Jinhyo Jung1, HwiSoo So1, Kyoungwoo Lee1, Shail Dave2 and Aviral Shrivastava3 1Yonsei University, KR; 2Arizona State University, US; 3School of Computing and Augmented Intelligence, Arizona State University, US Abstract As device scaling continues and fault rates increase, assessing the reliability of safety-critical systems is becoming increasingly important. Exploring the impacts of hardware faults at the architecture level is desirable since it can provide valuable insights into the reliability of the system. However, it is difficult to control the timing and location of the fault in actual hardware. With the help of hardware simulators, it becomes much easier to inject hardware faults, but each experimental trial becomes much slower. Recent works have proposed to incorporate the idea of machine learning in tackling this problem. In this session, we briefly discuss the difficulties in modeling or improving reliability at the architecture level. Then, we present learning-oriented methodologies that can be applied to alleviate those challenges. Specifically, we describe how we can develop a design methodology that makes reliability a first-class metric in design explorations of efficient embedded systems, while integrating application- and circuit-level estimations. |
17:15 CET | FS9.3 | IMPROVING APPLICATION RELIABILITY THROUGH OS Speaker: Akash Kumar, TU Dresden, DE Authors: Behnaz Ranjbar and Akash Kumar, TU Dresden, DE Abstract Due to technology scaling in modern embedded platforms, the safety and reliability issues have increased tremendously, which often accelerate aging, lead to permanent faults, and cause unreliable execution of applications. Failure during an application execution in some embedded systems like avionics may cause catastrophic consequences. Therefore, managing reliability under all circumstances of stress and environmental changes is crucial during run-time. Machine-Learning techniques are recently being employed for dynamic reliability optimization, which adapts to varying workloads and system conditions. These techniques can learn from past events and make better decisions to improve the system's performance. In this talk, we provide a survey of approaches that aim to improve reliability through learning for embedded platforms. Then, we discuss the open challenges and limitations within this domain for future academic and industrial works. |
17:38 CET | FS9.4 | RELIABILITY ANALYSIS ON A FAULT-TOLERANT TIMING-GUARANTEED SYSTEM Speaker: Ji-Yung Lin, KU Leuven, BE Authors: Ji-Yung Lin1, Pieter Weckx2, Subrat Mishra2, Francky Catthoor2 and Dwaipayan Biswas2 1KU Leuven, BE; 2IMEC, BE Abstract Fault-tolerant mechanisms like check-pointing and rollback-recovery play an essential role in ensuring functional correctness in the presence of register-level errors. However, these mechanisms induce execution time overhead. To ensure timing guarantees, the time overhead needs to be mitigated by the real-time scheduling mechanism, which can switch to a higher processor speed to compensate for the overhead. We analyzed the interplay of both mechanisms, which are required to simultaneously reach the guaranteed performance and reliability. We developed a system model which integrates two sub-systems: a check-pointing and rollback-recovery system tackling register-level error occurrences for functional correctness, and on top of it, a real-time scheduling system ensuring the timing guarantees. Analysis by a cycle-accurate simulation flow shows that both reliability and time overhead are highly sensitive to the error probability of registers. Moreover, there is an error rate wall (at around 10^{-6} to 10^{-5} errors per cycle in our demonstrated system) beyond which the time overhead becomes too high for a feasible system to ensure reliability. |
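A rough feel for the checkpointing trade-off discussed in FS9.4 can be obtained from a first-order analytical model. The sketch below is an illustrative back-of-the-envelope calculation with assumed parameters (checkpoint interval, checkpoint and rollback costs, per-cycle error probability); it is not the session's cycle-accurate simulation flow.

```python
def expected_cycles(work, tau, C, R, p):
    """First-order model: a tau-cycle segment plus checkpoint of cost C is retried,
    paying rollback cost R per failure, until it completes error-free."""
    q = (1.0 - p) ** tau                   # probability a segment survives without error
    per_segment = (tau + C) / q + R * (1.0 / q - 1.0)
    return (work / tau) * per_segment

work = 10_000_000                          # cycles of useful work (assumption)
for p in (1e-8, 1e-6, 1e-5, 1e-4):
    t = expected_cycles(work, tau=10_000, C=500, R=2_000, p=p)
    print(f"p={p:.0e}  time overhead = {(t / work - 1.0) * 100:6.1f}%")
```

Even this crude model shows the overhead blowing up sharply as the per-cycle error probability grows, which is qualitatively the "error rate wall" behaviour described in the abstract.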
LBR2 Late Breaking Results: new ideas for low power and reliable computing
Add this session to my calendar
Date: Wednesday, 19 April 2023
Time: 16:30 CET - 18:00 CET
Location / Room: Okapi Room 0.8.3
Session chair:
Jie Han, University of Alberta, CA
Time | Label | Presentation Title Authors |
---|---|---|
16:30 CET | LBR2.1 | AN ULTRA-LOW-POWER SERIAL IMPLEMENTATION FOR SIGMOID AND TANH USING CORDIC ALGORITHM Speaker: Yaoxing CHANG, CSEM & ETH Zurich, CH Authors: Yaoxing Chang1, Petar Jokic2, Stephane Emery2 and Luca Benini3 1Swiss Center for Electronics and Microtechnology (CSEM), Swiss Federal Institute of Technology (ETH Zurich), CH; 2Swiss Center for Electronics and Microtechnology (CSEM), CH; 3Swiss Federal Institute of Technology (ETH Zurich), University of Bologna, CH Abstract Activation functions (AFs) such as sigmoid and tanh play an important role in neural networks (NNs). Their efficient implementation is critical for always-on edge devices. In this work, we propose a serial-arithmetic architecture for AFs in edge audio applications using the CORDIC algorithm. The design enables a dynamic trade-off between throughput/latency and accuracy, and achieves higher area and power efficiency than conventional methods such as look-up table (LUT)- and piece-wise linear (PWL)-based implementations. Considering the throughput difference among the designs, we evaluate average power consumption taking into account active and idle working cycles for the same applications. Synthesis results in a 22 nm process show that our CORDIC-based design has an area of 545.77 μm² and an average power of 0.69 μW for a keyword spotting task, achieving a reduction of 36.92% and 71.72% in average power consumption compared to LUT- and PWL-based implementations, respectively. |
16:33 CET | LBR2.2 | PROCESS VARIATION RESILIENT CURRENT-DOMAIN ANALOG IN MEMORY COMPUTING Speaker: Kailash Prasad, IIT Gandhinagar, IN Authors: Kailash Prasad, Sai Shubham, Aditya Biswas and Joycee Mekie, IIT Gandhinagar, IN Abstract In-Memory Computing (IMC) has emerged as one of the energy-efficient solutions for data- and compute-intensive machine learning applications. Analog IMC architectures have high throughput, but limited bit precision. Process variation further degrades the bit precision. This work proposes an efficient way to track process variation and compensate for it to achieve high bit resolution, which, to the best of our knowledge, is the first such proposal. PV tracking is achieved by using an additional SRAM column, and compensation by a non-conventional word-line driver. The proposed circuit can be added to any analog IMC architecture to make it resilient to process variations. To demonstrate the versatility of the proposal, we have implemented and analyzed 2-bit dot product operations in IMC architectures with six different SRAM cell configurations, and 2-bit, 4-bit, and 8-bit dot products on 6T SRAM IMC. For these, we report a reduction of 4× to 14× in the standard deviation of statistical variations in bit-line voltage for different SRAM cells, and an increase in the bit resolution from 2 bits to 4 or 6 bits. |
16:36 CET | LBR2.3 | ANALYSIS OF QUANTIZATION ACROSS DNN ACCELERATOR ARCHITECTURE PARADIGMS Speaker: Tom Glint, IIT Gandhinagar, IN Authors: Tom Glint1, Chandan Jha2, Manu Awasthi3 and Joycee Mekie1 1IIT Gandhinagar, IN; 2German Research Center for Artificial Intelligence, DE; 3Ashoka University, IN Abstract Quantization techniques promise to significantly reduce the latency, energy, and area associated with multiplier hardware. This work, to the best of our knowledge, for the first time shows the system-level impact of quantization on SOTA DNN accelerators from different digital accelerator paradigms. Based on the placement of data and compute site, we identify SOTA designs from Conventional Hardware Accelerators (CHA), Near Data Processors (NDP), and Processing-in-Memory (PIM) paradigms and show the impact of quantization when inferencing CNN and Fully Connected Layer (FCL) workloads. We show that the 32-bit implementation of the SOTA design from PIM consumes less energy than the 8-bit implementation of the SOTA design from CHA for FCL, while the trend reverses for CNN workloads. Further, PIM has stable latency when scaling the word size, while CHA and NDP suffer a 20% to 2x slowdown when doubling the word size. |
16:39 CET | LBR2.4 | DIVIDE AND VERIFY: USING A DIVIDE-AND-CONQUER STRATEGY FOR POLYNOMIAL FORMAL VERIFICATION OF COMPLEX CIRCUITS Speaker: Alireza Mahzoon, University of Bremen, DE Authors: Rolf Drechsler1 and Alireza Mahzoon2 1University of Bremen | DFKI, DE; 2University of Bremen, DE Abstract With the rapid growth in the size and complexity of digital circuits, the possibility of bug occurrence has significantly increased. In order to avoid the enormous financial loss due to the production of buggy circuits, using scalable formal verification methods is essential. The scalability of a verification method for a specific design is proven by showing that the method has polynomial space and time complexities. Unfortunately, not all verification methods have a polynomial complexity, particularly when it comes to the verification of large and complex designs. In this paper, we propose a divide-and-conquer strategy for Polynomial Formal Verification (PFV) of complex circuits. Instead of using a monolithic proof engine to verify the entire design, we break the verification task down into several problems, which can be solved in polynomial space and time using a hybrid proof engine. As a case study, we investigate the PFV of a RISC-V processor using our divide-and-conquer strategy. |
16:42 CET | LBR2.5 | IMPROVING DESIGN UNDERSTANDING OF PROCESSORS LEVERAGING DATAPATH CLUSTERING Speaker: Katharina Ruep, Johannes Kepler University Linz, AT Authors: Katharina Ruep and Daniel Grosse, Johannes Kepler University Linz, AT Abstract In this paper, we present a novel approach for design understanding of processors. Our approach uses hierarchical clustering to identify datapath similarities based on control signal vectors. The resulting dendrogram captures the closeness of instructions wrt. their datapath and control in visual form. We demonstrate how our approach helps in design understanding for a RISC-V processor without looking into the HDL code. |
16:45 CET | LBR2.6 | ELECTRICAL RULE CHECKING OF INTEGRATED CIRCUITS USING SATISFIABILITY MODULO THEORY Speaker: Oussama Oulkaid, University Grenoble Alpes, CNRS, Grenoble INP, VERIMAG, 38000 Grenoble, France; Aniah, 38000 Grenoble, France, FR Authors: Bruno Ferres1, Oussama Oulkaid2, Ludovic Henrio1, Mehdi Khosravian2, Matthieu Moy1, Gabriel Radanne1 and Pascal Raymond3 1Univ Lyon, EnsL, UCBL, CNRS, Inria, LIP, F-69342, LYON Cedex 07, France., FR; 2Aniah, 38000 Grenoble, France, FR; 3University Grenoble Alpes, CNRS, Grenoble INP, VERIMAG, 38000 Grenoble, France, FR Abstract We consider the verification of electrical properties of circuits to identify potential violations of electrical design rules, also called Electrical Rule Checking (ERC). We present a general approach based on Satisfiability Modulo Theory (SMT) to verify that these errors cannot occur in a given circuit. We claim that our approach is scalable and more precise than existing analyses, like voltage propagation. We applied these techniques to a specific type of errors, the missing level shifters. On an industrial case-study, our technique is able to flag 25% of the warnings raised by the voltage propagation analysis as being false alarms. |
16:48 CET | LBR2.7 | EXPLORATION OF DECISION SUB-NETWORK ARCHITECTURES FOR FPGA-BASED DYNAMIC DNNS Speaker: Anastasios Dimitriou, University of Southampton, GB Authors: Anastasios Dimitriou, Mingyu Hu, Jonathon Hare and Geoff Merrett, University of Southampton, GB Abstract Dynamic Deep Neural Networks (DNNs) can achieve faster execution and less computationally intensive inference by spending fewer resources on easy-to-recognise or less informative parts of an input. They make data-dependent decisions, which strategically deactivate a model's components, e.g. layers, channels or sub-networks. However, dynamic DNNs have only been explored and applied on conventional computing systems (CPU+GPU) and programmed with libraries designed for static networks, limiting their effects. In this paper, we propose and explore two approaches for efficiently realising the sub-networks that make these decisions on FPGAs. A pipeline approach targets the use of the existing hardware to execute the sub-network, while a parallel approach uses dedicated circuitry for it. We explore the performance of each using the BranchyNet early exit approach on LeNet-5, and evaluate on a Xilinx ZCU106. The pipeline approach is 36% faster than a desktop CPU. It consumes 0.51 mJ per inference, 16x lower than a non-dynamic network on the same platform and 8x lower than an Nvidia Jetson Xavier NX. The parallel approach executes 17% faster than the pipeline approach when no early exits are taken during dynamic inference, but incurs a 28% increase in energy consumption. |
16:51 CET | LBR2.8 | INTERACTIVE TECHNICAL PRESENTATIONS BY THE AUTHORS Speaker: Authors of the session, DATE, BE Author: Session Chairs, DATE, BE Abstract Participants can freely interact with authors during their interactive technical presentations. |
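As general background for LBR2.1 above, the following is a plain-Python software model of hyperbolic-mode CORDIC evaluating tanh (and sigmoid via sigmoid(x) = 0.5 * (1 + tanh(x/2))). It is an illustrative floating-point reference only, not the authors' serial fixed-point hardware; the repeated iterations at i = 4 and i = 13 follow the textbook convergence schedule, and inputs are assumed to lie in the basic convergence range (roughly |z| < 1.1 for tanh).

```python
import math

def tanh_cordic(z, n_iter=16):
    """Hyperbolic-mode CORDIC (rotation mode) evaluating tanh(z).
    Illustrative software model; valid without argument reduction for |z| <~ 1.1."""
    # Standard iteration schedule: i = 1, 2, 3, ... with i = 4 and i = 13 repeated.
    indices, i = [], 1
    while len(indices) < n_iter:
        indices.append(i)
        if i in (4, 13) and len(indices) < n_iter:
            indices.append(i)          # repeat iteration to guarantee convergence
        i += 1
    x, y = 1.0, 0.0                    # the CORDIC gain cancels in the ratio y/x
    for i in indices:
        d = 1.0 if z >= 0.0 else -1.0  # drive the residual angle z towards 0
        x, y, z = (x + d * y * 2.0**-i,
                   y + d * x * 2.0**-i,
                   z - d * math.atanh(2.0**-i))
    return y / x                       # sinh(z0)/cosh(z0) = tanh(z0)

def sigmoid_cordic(x, n_iter=16):
    return 0.5 * (1.0 + tanh_cordic(x / 2.0, n_iter))

print(tanh_cordic(0.5), math.tanh(0.5))             # ~0.4621 in both cases
print(sigmoid_cordic(1.0), 1 / (1 + math.exp(-1)))  # ~0.7311 in both cases
```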
M02 Nervous Systems – From Spiking Neural Networks and Reservoir Computing to Neuromorphic Fault-tolerant Hardware
Add this session to my calendar
Date: Wednesday, 19 April 2023
Time: 16:30 CET - 18:00 CET
Location / Room: Okapi Room 0.8.2
Organisers:
Martin A. Trefzer, University of York, GB
Jim Harkin, Ulster University, GB
Speakers:
Martin A. Trefzer, University of York, GB
Jim Harkin, Ulster University, GB
Presenters:
Shimeng Wu, University of York, GB
Andrew Walter, University of York, GB
Technology scaling has enabled the fast advancement of computing architectures through high-density integration of components and cores, and the provision of powerful systems on chip (SoC), e.g. NVIDIA Jetson, AMD/Xilinx UltraScale+ FPGA, ARM big.LITTLE. However, such systems are running hot and becoming more prone to failures and timing violations as clock speed limits are reached. Therefore, parts of SoCs must be turned off to stay within thermal limits ("dark silicon"). This shifts the challenge away from making designs smaller, setting the new focus on systems that are ultra-low power, resilient and autonomous in their adaptation to anomalies, faults, timing violations and performance degradation. There is a significant increase in the number of temporary faults caused by radiation, and of permanent faults due to manufacturing defects and stress. ITRS (https://irds.ieee.org/) estimates significant device failure rates, e.g. due to wear-out, in the short term. Hence, a critical requirement for such systems is to perform detection and analysis effectively at runtime, within a minimal area and power overhead. This is at odds with the current state of the art, including error-correcting codes (ECC), built-in self-test (BIST), localized fault detection, and triple modular redundancy (TMR) strategies, all of which result in prohibitively high system overheads and an inability to adapt to, locate or predict faults. At the same time, technology diversification (More than Moore) is making fast progress, delivering technologies such as memristors and graphene nanowires. The current major issue with these technologies is large device variability, which prevents efficient scaling and usability. Here, not even systematic error-correction or fault-control strategies are available yet.
This Nervous System on Chip tutorial therefore discusses bio-inspired solutions that are becoming viable as neuromorphic hardware design concepts mature. We will briefly introduce the principles of spiking neural networks, biological nervous systems, unconventional computing, and how to translate key concepts into functional hardware systems. We will primarily focus on SNNs for fault tolerance, nervous-system sense/act pathways, and multi-objective novelty search as an artificial nervous system design methodology. Case studies will include an efficient SNN-based approach to detecting timing violations in digital hardware, consider how efficient neuromorphic hardware may be achieved using a reservoir computing model, and highlight the challenges ahead. There will be some opportunity to run, for example, SNN, reservoir computing, or novelty search examples in simulation during a hands-on session.
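The hands-on part of this tutorial (M02.1.6 below) lists Brian2 among its prerequisites. As a flavour of what such a simulation looks like, here is a minimal leaky integrate-and-fire example in Brian2; it is an illustrative sketch only, not the tutorial's actual material, and all parameter values are arbitrary.

```python
from brian2 import NeuronGroup, SpikeMonitor, run, ms

# Four leaky integrate-and-fire neurons, each driven by a different constant input.
eqs = '''
dv/dt = (I - v) / tau : 1
I : 1 (constant)
tau : second (constant)
'''
group = NeuronGroup(4, eqs, threshold='v > 1', reset='v = 0', method='euler')
group.I = [0.0, 1.1, 1.5, 2.0]   # only the first neuron stays below threshold
group.tau = 10 * ms
spikes = SpikeMonitor(group)

run(100 * ms)
print(spikes.count)              # number of spikes emitted by each neuron
```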
M02.1 Nervous Systems - Tutorial Programme
Add this session to my calendar
Date: Wednesday, 19 April 2023
Time: 16:30 CET - 18:00 CET
Location / Room: Okapi Room 0.8.2
Chair:
Martin A. Trefzer, University of York, GB
Time | Label | Presentation Title Authors |
---|---|---|
16:30 CET | M02.1.1 | INTRODUCTION TO SNNS Speaker: Jim Harkin, Ulster University, GB |
16:45 CET | M02.1.2 | NEUROMORPHIC HARDWARE OVERVIEW Speaker: Martin A. Trefzer, University of York, GB |
17:00 CET | M02.1.3 | APPLICATIONS OF SNNS - NEUROMORPHIC EMBEDDED SENSORS AND NETWORKS FOR FAULT-TOLERANCE Speaker: Jim Harkin, Ulster University, GB |
17:15 CET | M02.1.4 | NERVOUS SYSTEMS CONCEPT - MICROCIRCUITS AS BUILDING BLOCKS FOR NEUROMORPHIC ARCHITECTURES Speaker: Martin A. Trefzer, University of York, GB |
17:30 CET | M02.1.5 | HANDS-ON SESSION: SNNS IN VHDL Speaker: Shimeng Wu, University of York, GB Abstract Prerequisites for live participation are an installation of Xilinx Vivado 2022.1 (or a later version). Tutorial resources are available from https://www-users.york.ac.uk/~mt540/nervous-systems/index.html#resources |
17:30 CET | M02.1.6 | HANDS-ON SESSION: SNNS WITH BRIAN2 & PYTHON Speaker: Andrew Walter, University of York, GB Abstract Prerequisites for live participation are an installation of Python 3.10, along with Brian2 2.5.1, numpy 1.23.3, matplotlib 3.6.1 (or later versions). Tutorial resources are available from https://www-users.york.ac.uk/~mt540/nervous-systems/index.html#resources |
M03 Embedded FPGAs (eFPGA) and Applications to IP Protection via eFPGA Redaction
Add this session to my calendar
Date: Wednesday, 19 April 2023
Time: 16:30 CET - 18:00 CET
Location / Room: Toucan Room 2.7.1/2
Organisers:
Christian Pilato, Politecnico di Milano, IT
Pierre-Emmanuel Gaillardon, University of Utah, US
Ramesh Karri, New York University, US
Benjamin Tan, University of Calgary, CA
With the rise of open-source hardware and the never-ending requirements for computational power for modern applications, companies are increasingly interested in investments to create novel chips. However, these investments can be undermined by malicious actors in the semiconductor supply chain who can reverse engineer the chip design, steal hardware intellectual property, and make unauthorized copies of the original design. Protecting hardware intellectual property is therefore becoming a critical concern, given the huge investments behind developing novel architectures.
eFPGA redaction is a novel, promising technique that aims to thwart reverse engineering attacks on integrated circuits (ICs) by exploiting the flexibility of reconfigurable devices. Critical IC parts are mapped onto and replaced by specific reconfigurable blocks (called embedded FPGAs - eFPGAs) with a two-fold goal: (1) during fabrication, the reconfigurable devices can implement arbitrary functions, without revealing the intended functionality; (2) during execution, they can be configured to implement the correct functionality by classic FPGA programming methods. In this context, novel tools like OpenFPGA can automate and significantly accelerate the development cycle of customizable FPGA architectures. Such tools can generate Verilog netlists for these customized FPGA fabrics, which can be directly used to generate production-ready layouts.
This tutorial presents ALICE, a design flow that leverages OpenFPGA to explore a chip design (described at the behavioral register-transfer level), identify the best modules for redaction, and create the corresponding eFPGAs. This framework automates the process of eFPGA redaction, enabling its use in industrial environments.
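For intuition only, the module-selection step described above can be viewed as a constrained ranking problem over candidate RTL modules. The toy Python sketch below illustrates that idea with a simple greedy value-per-cost policy; the module names, scores and the selection heuristic are invented for illustration and do not represent ALICE's actual algorithm or OpenFPGA's interfaces.

```python
# Hypothetical illustration: choose which RTL modules to redact onto an eFPGA
# fabric under an area budget, preferring modules deemed most security-critical.
# All names and numbers are made up for this sketch.

modules = [
    # (name, security_value, eFPGA area cost in LUT-equivalents)
    ("crypto_core",   9.0, 1200),
    ("dsp_filter",    4.0,  800),
    ("bus_arbiter",   2.5,  300),
    ("debug_monitor", 1.0,  150),
]

def select_for_redaction(candidates, area_budget):
    """Greedy value-per-cost selection; a stand-in for a real exploration step."""
    chosen, used = [], 0
    for name, value, cost in sorted(candidates, key=lambda m: m[1] / m[2], reverse=True):
        if used + cost <= area_budget:
            chosen.append(name)
            used += cost
    return chosen, used

picked, area = select_for_redaction(modules, area_budget=1500)
print(picked, area)   # ['bus_arbiter', 'crypto_core'] 1500
```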
SD2 High-level synthesis and verification
Add this session to my calendar
Date: Wednesday, 19 April 2023
Time: 16:30 CET - 18:00 CET
Location / Room: Gorilla Room 1.5.1
Session chair:
Katell Morin-Allory, Université Grenoble Alpes, FR
16:30 CET until 16:54 CET: Pitches of regular papers
16:54 CET until 18:00 CET: Interactive technical presentations by the authors of regular papers and extended abstracts
Regular Papers
Time | Label | Presentation Title Authors |
---|---|---|
16:30 CET | SD2.1 | TOWARDS HIGH-LEVEL SYNTHESIS OF QUANTUM CIRCUITS Speaker: Christian Pilato, Politecnico di Milano, IT Authors: Chao Lu1, Christian Pilato2 and Kanad Basu1 1University of Texas at Dallas, US; 2Politecnico di Milano, IT Abstract In recent years, there has been a proliferation of quantum algorithms, primarily due to their exponential speedup over their classical counterparts. Quantum algorithms find applications in various domains, including machine learning, molecular simulation, and cryptography. However, extensive knowledge of linear algebra and quantum mechanics is required to program a quantum computer, which might not be feasible for traditional software programmers. Moreover, the current quantum programming paradigm makes it difficult to scale and integrate quantum circuits to achieve complex functionality. To this end, in this paper, we introduce QHLS, a quantum high-level synthesis (HLS) framework. To the best of our knowledge, this is the first HLS framework for quantum circuits. The proposed QHLS allows quantum programmers to start with high-level behavioral descriptions (e.g., C, C++) and automatically generate the corresponding quantum circuit; thus reducing the complexity of programming a quantum computer. Our experimental results demonstrate the success of QHLS in translating high-level behavioral software programs containing arithmetic, logical and conditional statements. |
16:33 CET | SD2.2 | MIRROR: MAXIMIZING THE RE-USABILITY OF RTL THROUGH RTL TO C COMPILER Speaker: Benjamin Carrion Schaefer, University of Texas at Dallas, US Authors: Md Imtiaz Rashid and Benjamin Carrion Schaefer, University of Texas at Dallas, US Abstract This work presents an RTL-to-C compiler called MIRROR that maximizes the re-usability of the generated C code for High-Level Synthesis (HLS). The uniqueness of the compiler is that it generates C code by using libraries of pre-characterized RTL micro-structures that are uniquely identifiable through perceptual hashes. This allows C descriptions that include arrays and loops to be generated quickly. These are important because HLS tools extensively use synthesis directives in the form of pragmas to control how to synthesize these constructs. For example, arrays can be synthesized as registers or RAM, and loops can be fully unrolled, partially unrolled, not unrolled, or pipelined. Setting different pragma combinations leads to designs with unique area vs. performance and power trade-offs. Based on this, the main goal of our compiler is to parse synthesizable RTL descriptions specified in Verilog, which have a fixed micro-architecture with a specific area, performance and power profile, and generate C code for HLS that can then be re-synthesized with different pragma combinations, generating a variety of new micro-architectures with different area vs. performance trade-offs. We call this 'maximizing the re-usability of the RTL code' because it enables a path to re-target any legacy RTL description to applications with different constraints. |
16:36 CET | SD2.3 | HIGH-LEVEL SYNTHESIS VERSUS HARDWARE CONSTRUCTION Speaker: Georgi Gaydadjiev, University of Groningen, Plekhanov RUE, NL Authors: Alexander Kamkin1, Mikhail Chupilko1, Mikhail Lebedev1, Sergey Smolov1 and Georgi Gaydadjiev2 1ISP RAS, RU; 2University of Groningen, NL Abstract Application-specific systems with FPGA accelerators are often designed using high-level synthesis or hardware construction tools. Nowadays, there are many frameworks available, both open-source and commercial. In this work, we aim at a fair comparison of several languages (and tools), including Verilog (our baseline), Chisel, Bluespec SystemVerilog (Bluespec Compiler), DSLX (XLS), MaxJ (MaxCompiler), and C (Bambu and Vivado HLS). Our analysis has been carried out using a representative example of 8×8 inverse discrete cosine transform (IDCT), a widely used algorithm in JPEG and MPEG decoders. The metrics under consideration include: (a) the degree of automation (how much less code is required compared to Verilog), (b) the controllability (possibility to achieve given design characteristics, namely a given ratio of the performance and area), and (c) the flexibility (ease of design modifications to achieve certain characteristics). Rather than focusing on computational kernels only, we use AXI-Stream wrappers for the synthesized implementations, which allows the characteristics of the designs to be evaluated adequately when they are used as parts of real systems. Our study shows clear examples of what impact specific optimizations (tool settings and source code modifications) have on the overall system performance and area. It emphasizes how important it is to be able to control the balance between the communication interface utilization and the computational kernel performance, and delivers clear guidelines for next-generation tools for designing FPGA-accelerator-based systems. |
16:39 CET | SD2.4 | TPP: ACCELERATE APPLICATION LAUNCH VIA TWO-PHASE PREFETCHING ON SMARTPHONE Speaker: Ying Yuan, Huazhong University of Science & Technology, CN Authors: Ying Yuan, Zhipeng Tan, Shitong Wei, Lihua Yang, Wenjie Qi, Xuanzhi Wang and Cong Liu, Huazhong University of Science & Technology, CN Abstract Fast app launch is crucial to the user experience and is one of the eternal pursuits of manufacturers. Page faults are a critical factor leading to long app launch latency. Prefetching is the current method of reducing page faults during app launch. Before the app launch, prefetching all demanded pages of the target app can speed up the app launch effectively, but it always uses several hundred MB of memory, leading to memory pressure and slowing the launch of other apps. Prefetching during application launch uses memory effectively; however, current methods are not aware of the order in which pages are accessed, causing noticeable accessing-prefetching order inversions, which limits the acceleration of app launch. In order to accelerate the application launch effectively with little memory usage, we propose a Two-Phase Prefetching schema (TPP), which performs prefetching via two phases: 1) Before the app launch, to increase the efficiency of memory usage in prefetching, TPP prefetches a few critical pages with app prediction, which is based on Long Short-Term Memory (LSTM) with high accuracy. 2) During app launch, TPP prefetches the rest of the critical pages via an order-aware sliding window method, resolving the accessing-prefetching order inversions and significantly reducing the app launch latency. We evaluate TPP on a Google Pixel 3; compared to the state-of-the-art method, TPP reduces the application launch time by up to 52.5% (37% on average), and the data prefetched before the target application starts is only 1.31 MB on average. |
16:42 CET | SD2.5 | USING HIGH-LEVEL SYNTHESIS TO MODEL SYSTEMVERILOG PROCEDURAL TIMING CONTROLS Speaker: Luca Ezio Pozzoni, Politecnico di Milano, IT Authors: Luca Pozzoni1, Fabrizio Ferrandi1, Loris Mendola2, Alfio Palazzo2 and Francesco Pappalardo2 1Politecnico di Milano, IT; 2STMicroelectronics, IT Abstract In modern SoC designs, digital components' development and verification processes often depend on the component's interactions with other digital and analog modules on the same die. While designers can rely on a wide range of tools and practices for validating fully-digital models, porting the same workflow to mixed models' development requires significant efforts from the designers. A common practice is to use Real Number Modeling techniques to generate HDL-based behavioral models of analog components to efficiently simulate mixed models using only event-based simulations rather than Analog Mixed Signals (AMS) simulations. However, some of these models' language features are not synthesizable with existing synthesis tools, requiring additional efforts from the designers to generate post-tapeout prototypes. This paper presents a methodology for transforming some non-synthesizable SystemVerilog language features related to timing controls into functionally-equivalent synthesizable Verilog constructs. The resulting synthesizable models replicate their respective RNMs' behavior while explicitly managing delay controls and event expressions. The RNMs are first transformed using the MLIR framework and then synthesized with open-source HLS tools to obtain FPGA-synthesizable Verilog models. |
16:45 CET | SD2.6 | R-LDPC: REFINING BEHAVIOR DESCRIPTIONS IN HLS TO IMPLEMENT HIGH-THROUGHPUT LDPC DECODER Speaker: Yifan Zhang, Wuhan National Laboratory for Optoelectronics, CN Authors: Yifan Zhang1, Qiang Cao1, Jie Yao2 and Hong Jiang3 1Wuhan National Laboratory for Optoelectronics, CN; 2Huazhong University of Science & Technology, CN; 3UT Arlington, US Abstract High-Level Synthesis (HLS) translates high-level behavior descriptions to Register-Transfer Level (RTL) implementations in modern Field-Programmable Gate Arrays (FPGAs), accelerating domain-specific hardware developments. Low-Density Parity-Check (LDPC), as a powerful error-correction code family, has been widely implemented in hardware for building a reliable data channel over a noisy physical channel in communication and storage applications. Leveraging HLS to rapidly prototype high-performance LDPC decoders is intriguing, with high scalability and low hardware-dependence, but generally is sub-optimal due to the lack of accurate and precise behavior descriptions in HLS to characterize iteration- and circuit-level implementation details. This paper proposes an HLS-based QC-LDPC decoder with scalable throughput by precisely refining the LDPC behavior descriptions, R-LDPC for short. To this end, R-LDPC first adopts an HLS-based LDPC decoder microarchitecture with a module-level pipeline. Second, R-LDPC offers a multi-instance-sharing one (MSO) description to explicitly define shared parts and non-shared parts for an array of check-node updating-units (CNU), eliminating redundant function modules and addressing circuits. Third, R-LDPC designs efficient single-stage and multi-stage shifters to eliminate unnecessary bit-selection circuits. Finally, R-LDPC provides invalid-element aware loop scheduling before the compile phase to avoid some unnecessary stalls at runtime. We implement an R-LDPC decoder; compared to the original HLS-based implementation, R-LDPC reduces the hardware consumption by up to 56% and the latency by up to 67%, and improves the decoding throughput by up to 300%. Furthermore, R-LDPC adapts to different scales, LDPC standards, and code rates, and can achieve 9.9 Gbps decoding throughput on a Xilinx U50. |
16:48 CET | SD2.7 | AN AUTOMATED VERIFICATION FRAMEWORK FOR HALIDEIR-BASED COMPILER TRANSFORMATIONS Speaker: Qingshuang Sun, Northwestern Polytechnical University, CN Authors: Yanzhao Wang1, Fei Xie1, Zhenkun Yang2, Jeremy Casas2, Pasquale Cocchini2 and Jin Yang2 1Portland State University, US; 2Intel Corporation, US Abstract HalideIR is a popular intermediate representation for compilers in domains such as deep learning, image processing, and hardware design. In this paper, we present an automated verification framework for HalideIR-based compiler transformations. The framework conducts verification using symbolic execution in two steps. Given a compiler transformation, our automated verification framework first uses symbolic execution to enumerate the compiler transformation's paths, and then utilizes symbolic execution to verify whether the output program for each transformation path is equivalent to its source. We have successfully applied this framework to verify 46 transformations from the three most-starred HalideIR-based compilers on GitHub and detected 4 transformation bugs undetected by manually crafted unit tests. (A minimal SMT-based equivalence-checking sketch, for illustration only, follows this session's listing.) |
16:51 CET | SD2.8 | CHISELFV: A FORMAL VERIFICATION FRAMEWORK FOR CHISEL Speaker: Mufan Xiang, East China Normal University, CN Authors: Mufan Xiang1, Yongjian Li2 and Yongxin Zhao3 1East China Normal University, CN; 2Chinese Academy of Sciences, Institute of Software, Laboratory of Computer Science, CN; 3East China Normal University, CN Abstract Modern digital hardware is becoming ever more complex, and agile development, an effective practice from software development, has been introduced into hardware design. Furthermore, as a new hardware construction language, Chisel helps to raise the level of hardware design abstraction with the support of object-oriented and functional programming. Chisel plays a crucial role in future hardware design and open-source hardware development. However, formal verification support for Chisel is still limited. In this paper, we propose ChiselFV, a formal verification framework that supports detailed formal hardware property descriptions and integrates mature formal hardware verification flows based on SymbiYosys. It builds on top of Chisel and uses Scala to drive the verification process. Thus the framework can be seen as an extension of Chisel. ChiselFV makes it easy to verify hardware designs formally when implementing them in Chisel. |
Extended Abstracts
Time | Label | Presentation Title Authors |
---|---|---|
16:54 CET | SD2.10 | EMNAPE: EFFICIENT MULTI-DIMENSIONAL NEURAL ARCHITECTURE PRUNING FOR EDGEAI Speaker: Hao Kong, Nanyang Technological University, SG Authors: Hao Kong1, Xiangzhong Luo1, Shuo Huai1, Di Liu2, Ravi Subramaniam3, Christian Makaya3, Qian Lin3 and Weichen Liu1 1Nanyang Technological University, SG; 2Yunnan University, CN; 3HP Inc., US Abstract In this paper, we propose a multi-dimensional pruning framework, EMNAPE, to jointly prune the three dimensions (depth, width, and resolution) of convolutional neural networks (CNNs) for better execution efficiency on embedded hardware. In EMNAPE, we introduce a two-stage evaluation strategy to evaluate the importance of each pruning unit and identify the computational redundancy in the three dimensions. Based on the evaluation strategy, we further present a heuristic pruning algorithm to progressively prune redundant units from the three dimensions for better accuracy and efficiency. Experiments demonstrate the superiority of EMNAPE over existing methods. |
16:54 CET | SD2.13 | METRIC TEMPORAL LOGIC WITH RESETTABLE SKEWED CLOCKS Speaker: Alberto Bombardelli, Fondazione Bruno Kessler, IT Authors: Alberto Bombardelli and Stefano Tonetta, FBK, IT Abstract The formal verification of distributed real-time systems is particularly challenging due to the intertwining of timing constraints and synchronization and communication mechanisms. Real-time properties are usually expressed in Metric Temporal Logic (MTL), an extension of Linear-time Temporal Logic (LTL) with metric constraints over time. One of the issues to apply these methods to distributed systems is that clocks are not perfectly synchronized and the local properties may refer to different, possibly skewed, clocks, which are reset for synchronization. Local components and properties, therefore, may refer to time points that are not guaranteed to be monotonic. In this paper, we investigate the specification of temporal properties of distributed systems with resettable skewed clocks. In order to take into account the synchronization of clocks, the local temporal operators are interpreted over resettable skewed clocks. We extend MTL with metric operators that are more suitable to express bounds over non-monotonic time. |
16:54 CET | SD2.14 | POLYNOMIAL FORMAL VERIFICATION OF FLOATING POINT ADDERS Speaker: Jan Kleinekathöfer, University of Bremen, DE Authors: Jan Kleinekathöfer1, Alireza Mahzoon1 and Rolf Drechsler2 1University of Bremen, DE; 2University of Bremen | DFKI, DE Abstract In this paper, we present our verifier that takes advantage of Binary Decision Diagrams (BDDs) with case splitting to fully verify a floating point adder. We demonstrate that the traditional symbolic simulation using BDDs has an exponential time complexity and fails for large floating point adders. However, polynomial bounds can be ensured if our case splitting technique is applied in the specific points of the circuit. The efficiency of our verifier is demonstrated by experiments on an extensive set of floating point adders with different exponent and significand sizes. |
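The transformation-verification idea in SD2.7 above, checking that a rewritten program is equivalent to its source, can be illustrated in a few lines with an off-the-shelf SMT solver. The sketch below uses the z3-solver Python bindings on two tiny bit-vector rewrites; it is a minimal illustration of equivalence checking in general, not the paper's HalideIR framework or its symbolic-execution flow.

```python
# pip install z3-solver
from z3 import BitVec, Solver, unsat

def equivalent(lhs, rhs):
    """True iff the two bit-vector expressions agree on every input."""
    solver = Solver()
    solver.add(lhs != rhs)            # search for a counterexample
    return solver.check() == unsat    # no counterexample means equivalence

x = BitVec('x', 32)
print(equivalent(x * 8, x << 3))      # True: a valid strength-reduction rewrite
print(equivalent(x * 8, x << 2))      # False: a buggy rewrite is caught
```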
SE3 Efficient utilization of heterogeneous hardware architectures running machine learning-based applications
Add this session to my calendar
Date: Wednesday, 19 April 2023
Time: 16:30 CET - 18:00 CET
Location / Room: Marble Hall
Session chair:
Zdenek Vasicek, BRNO UNIVERSITY OF TECHNOLOGY, CZ
16:30 CET until 16:54 CET: Pitches of regular papers
16:54 CET until 18:00 CET: Interactive technical presentations by the authors of regular papers and extended abstracts
Regular Papers
Time | Label | Presentation Title Authors |
---|---|---|
16:30 CET | SE3.1 | BLOCK GROUP SCHEDULING: A GENERAL PRECISION-SCALABLE NPU SCHEDULING TECHNIQUE WITH PRECISION-AWARE MEMORY ALLOCATION Speaker: Seokho Lee, Hanyang University, KR Authors: Seokho Lee1, Younghyun Lee1, Hyejun Kim2, Taehoon Kim2 and Yongjun Park3 1Department of Artificial Intelligence, Hanyang University, KR; 2Hanyang University, KR; 3Yonsei University, KR Abstract Precision-scalable neural processing units (PSNPUs) efficiently provide native support for quantized neural networks. However, with the recent advancements of deep neural networks, PSNPUs are affected by a severe memory bottleneck owing to the need to perform an extreme number of simple computations simultaneously. In this study, we first analyze whether the memory bottleneck issue can be solved using conventional neural processing unit scheduling techniques. Subsequently, we introduce new capacity-aware memory allocation and block-level scheduling techniques to minimize the memory bottleneck. Compared with the baseline, the new method achieves up to 2.26× performance improvements by substantially relieving the memory pressure of low-precision computations without hardware overhead. |
16:33 CET | SE3.2 | FPGA-BASED ACCELERATOR FOR RANK-ENHANCED AND HIGHLY-PRUNED BLOCK-CIRCULANT NEURAL NETWORKS Speaker: Haena Song, Pohang University of Science and Technology, KR Authors: Haena Song, Jongho Yoon, Dohun Kim, Eunji Kwon, Tae-Hyun Oh and Seokhyeong Kang, Pohang University of Science and Technology, KR Abstract Numerous network compression methods have been proposed to deploy deep neural networks in resource-constrained embedded systems. Among them, block-circulant matrix (BCM) compression is one of the promising hardware-friendly methods for both acceleration and compression. However, it has several limitations: (i) limited representation due to the structural characteristics of the circulant matrix, (ii) restrictions on the compression parameter, and (iii) the need for specialized dataflow in BCM-compressed network accelerators. In this paper, a rank-enhanced and highly-pruned block-circulant matrix compression (RP-BCM) framework is proposed to overcome these limitations. RP-BCM comprises two stages: Hadamard-BCM and BCM-wise pruning. Also, a dedicated skip scheme is introduced into the processing-element design to maintain high parallelism under BCM-wise sparsity. Furthermore, we propose specialized dataflow for a BCM-compressed network, rather than the conventional CNN dataflow on FPGA. As a result, the proposed method reduces parameters and FLOPs for ResNet-50 on ImageNet by 92.4% and 77.3%, respectively. Moreover, compared to GPU, the proposed hardware design achieves a 3.1x improvement in energy efficiency on the Xilinx PYNQ-Z2 FPGA board for ResNet-18 trained on ImageNet. |
16:36 CET | SE3.3 | LOSSLESS SPARSE TEMPORAL CODING FOR SNN-BASED CLASSIFICATION OF TIME-CONTINUOUS SIGNALS Speaker: Johnson Loh, IDS, RWTH Aachen, DE Authors: Johnson Loh and Tobias Gemmeke, RWTH Aachen University, DE Abstract Ultra-low power classification systems using spiking neural networks (SNN) promise efficient processing for mobile devices. Temporal coding represents activations in an artificial neural network (ANN) as binary signaling events in time, thereby minimizing circuit activity. Discrepancies in numeric results are inherent to common conversion schemes, as the atomic computing unit, i.e. the neuron, performs algorithmically different operations, thus potentially degrading the SNN's quality of service (QoS). In this work, a lossless conversion method is derived in a top-down design approach for continuous time signals using electrocardiogram (ECG) classification as an example. As a result, the converted SNN achieves identical results compared to its fixed-point ANN reference. The computations implied by the proposed method result in a novel hybrid neuron model located between the integrate-and-fire (IF) and conventional ANN neurons, whose numerical result is equivalent to the latter. Additionally, a dedicated SNN accelerator is implemented in 22 nm FDSOI CMOS suitable for continuous real-time classification. The direct comparison with an equivalent ANN counterpart shows that power reductions of 2.32x and area reductions of 7.22x are achievable without loss in QoS. |
16:39 CET | SE3.4 | NAF: DEEPER NETWORK/ACCELERATOR CO-EXPLORATION FOR CUSTOMIZING CNNS ON FPGA Speaker: Wenqi Lou, University of Science and Technology of China, CN Authors: Wenqi Lou, Jiaming Qian, Lei Gong, Xuan Wang, Chao Wang and Xuehai Zhou, USTC, CN Abstract Recently, algorithm and hardware co-design for neural networks (NNs) has become the key to obtaining high-quality solutions. However, prior works lack consideration of the underlying hardware and thus suffer from a severely unbalanced neural architecture and hardware architecture search (NA-HAS) space on FPGAs, failing to unleash the performance potential. Nevertheless, a deeper joint search leads to a larger (multiplicative) search space, highly challenging the search. To this end, we propose an efficient differentiable search framework NAF, which jointly searches the networks (e.g., operations and bitwidths) and accelerators (e.g., heterogeneous multicores and mappings) under a balanced NA-HAS space. Concretely, we design a coarse-grained hardware-friendly quantization algorithm and integrate it at a block granularity into the co-search process. Meanwhile, we design a highly optimized block processing unit (BPU) with key dataflow configurable. Afterward, a dynamic hardware generation algorithm based on modeling and heuristic rules is designed to perform the critical HAS and fast generate hardware feedback. Experimental results show that compared with the previous state-of-the-art (SOTA) co-design works, NAF improves the throughput by 1.99×-6.84× on Xilinx ZCU102 and energy efficiency by 17%-88% under similar accuracy on the ImageNet dataset. |
16:42 CET | SE3.5 | ESRU: EXTREMELY LOW-BIT AND HARDWARE-EFFICIENT STOCHASTIC ROUNDING UNIT DESIGN FOR 8-BIT DNN TRAINING Speaker: Sung En Chang, Northeastern University, US Authors: Sung-En Chang1, Geng Yuan1, Alec Lu2, Mengshu Sun1, Yanyu Li1, Xiaolong Ma3, Zhengang Li1, Yanyue Xie1, Minghai Qin4, Xue Lin1, Zhenman Fang2 and Yanzhi Wang1 1Northeastern University, US; 2Simon Fraser University, CA; 3Clemson University, US; 4Self-employed, US Abstract Stochastic rounding is crucial in the low-bit (e.g., 8-bit) training of deep neural networks (DNNs) to achieve high accuracy. One of the drawbacks of prior studies is that they require a large number of high-precision stochastic rounding units (SRUs) to guarantee low-bit DNN accuracy, which involves considerable hardware overhead. In this paper, we use extremely low-bit SRUs (ESRUs) to save a large number of hardware resources during low-bit DNN training. However, a naively designed ESRU introduces a biased distribution of random numbers, causing accuracy degradation. To address this issue, we further propose an ESRU design with a plateau-shape distribution. The plateau-shape distribution in our ESRU design is implemented with the combination of an LFSR and an inverted LFSR, which avoids LFSR packing and turns an inherent LFSR drawback into an advantage in our efficient ESRU design. Experimental results using state-of-the-art DNN models demonstrate that, compared to the prior 24-bit SRU with 24-bit pseudo-random number generators (PRNG), our 8-bit ESRU with 3-bit PRNG reduces the SRU hardware resource usage by 9.75 times while achieving slightly higher accuracy. |
16:45 CET | SE3.6 | CLASS-BASED QUANTIZATION FOR NEURAL NETWORKS Speaker: Wenhao Sun, TU Munich, DE Authors: Wenhao Sun1, Grace Li Zhang2, Huaxi Gu3, Bing Li1 and Ulf Schlichtmann1 1TU Munich, DE; 2TU Darmstadt, DE; 3Xidian University, CN Abstract In deep neural networks (DNNs), there are a huge number of weights and multiply-and-accumulate (MAC) operations. Accordingly, it is challenging to apply DNNs on resource-constrained platforms, e.g., mobile phones. Quantization is a method to reduce the size and the computational complexity of DNNs. Existing quantization methods either require hardware overhead to achieve a non-uniform conversion or focus on model-wise and layer-wise uniform conversions, which are not as fine-grained as filter-wise quantization. In this paper, we propose a class-based quantization method to determine the minimum number of quantization bits for each filter or neuron in DNNs individually. In the proposed method, the importance score of each filter or neuron with respect to the number of classes in the dataset is first evaluated. The larger the score is, the more important the filter or neuron is and thus the larger the number of quantization bits should be. Afterwards, a search algorithm is adopted to exploit the different importance of filters and neurons to determine the number of quantization bits of each filter or neuron. Experimental results demonstrate that the proposed method can maintain the inference accuracy with low bit-width quantization. Given the same number of quantization bits, the proposed method can also achieve a better inference accuracy than the existing methods. |
16:48 CET | SE3.7 | ROAD-RUNNER: COLLABORATIVE DNN PARTITIONING AND OFFLOADING ON HETEROGENEOUS EDGE SYSTEMS Speaker: Manolis Katsaragakis, National TU Athens, GR Authors: Andreas Kakolyris1, Manolis Katsaragakis1, Dimosthenis Masouros1 and Dimitrios Soudris2 1National TU Athens, GR; 2National Technical University of Athens, GR Abstract Deep Neural Networks (DNNs) are becoming extremely popular for many modern applications deployed at the edge of the computing continuum. Despite their effectiveness, DNNs are typically resource intensive, making it prohibitive to be deployed on resource- and/or energy-constrained devices found in such environments. To overcome this limitation, partitioning and offloading part of the DNN execution from edge devices to more powerful servers has been introduced as a prominent solution. While previous works have proposed resource management schemes to tackle this problem, they usually neglect the high dynamicity found in such environments, both regarding the diversity of the deployed DNN models, as well as the heterogeneity of the underlying hardware infrastructure. In this paper, we present RoaD-RuNNer, a framework for DNN partitioning and offloading for edge computing systems. RoaD-RuNNer relies on its prior knowledge and leverages collaborative filtering techniques to quickly estimate performance and energy requirements of individual layers over heterogeneous devices. By aggregating this information, it specifies a set of Pareto optimal DNN partitioning schemes that trade-off between performance and energy consumption. We evaluate our approach using a set of well-known DNN architectures and show that our framework i) outperforms existing state-of-the-art approaches by achieving 9.58× speedup on average and up to 88.73% less energy consumption, ii) achieves high prediction accuracy by limiting the prediction error down to 3.19% and 0.18% for latency and energy, respectively and iii) provides lightweight and dynamic performance characteristics. |
16:51 CET | SE3.8 | PRUNING AND EARLY-EXIT CO-OPTIMIZATION FOR CNN ACCELERATION ON FPGAS Speaker: Guilherme Korol, UFRGS, BR Authors: Guilherme Korol1, Michael Jordan2, Mateus Beck Rutzig3, Jeronimo Castrillon4 and Antonio Carlos Schneider Beck1 1Universidade Federal do Rio Grande do Sul, BR; 2UFRGS, BR; 3UFSM, BR; 4TU Dresden, DE Abstract The challenge of processing heavy-load ML tasks, particularly CNN-based ones on resource-constrained IoT devices, has encouraged the use of edge servers. The edge offers performance levels higher than the end devices and better latency and security levels than the Cloud. On top of that, the rising complexity of ML applications, the ever-increasing number of connected devices, and the current demands for energy efficiency require optimizing such CNN models. Pruning and early-exit are notable optimizations that have been successfully used to alleviate the computational cost of inference. However, these optimizations have not yet been exploited simultaneously: while pruning is usually applied at design time, which involves retraining the CNN before deployment, early-exit is inherently dynamic. In this work, we propose AdaPEx, a framework that exploits the intrinsic reconfigurable FPGA capabilities so both can be cooperatively employed. AdaPEx first explores the trade-off between pruning and early-exit at design time, creating a design space never exploited in the state-of-the-art. Then, AdaPEx applies FPGA reconfiguration as a means to enable the combined use of pruning and early-exit dynamically. At run-time, this allows matching the inference processing to the current edge conditions and a user-configurable accuracy threshold. In a smart IoT application, AdaPEx processes up to 1.32x more inferences and improves EDP by up to 2.55x over the state-of-the-art FPGA-based FINN accelerator. |
Extended Abstracts
Time | Label | Presentation Title Authors |
---|---|---|
16:54 CET | SE3.9 | LATTICE QUANTIZATION Speaker: Clement Metz, CEA, Paris-Saclay University, FR Authors: Clement Metz1, Thibault Allenet1, Johannes Thiele2, Antoine Dupret1 and Olivier Bichler1 1CEA, FR; 2CEA / Axelera.ai, CH Abstract Post-training quantization of neural networks consists of quantizing a model without retraining or hyperparameter search, while being fast and data-frugal. In this paper, we propose LatticeQ, a novel post-training weight quantization method designed for deep convolutional neural networks (DCNNs). Contrary to scalar rounding widely used in state-of-the-art quantization methods, LatticeQ uses a quantizer based on lattices (discrete algebraic structures). LatticeQ exploits the inner correlations between the model parameters to the benefit of minimizing quantization error. We achieve state-of-the-art results in post-training quantization. In particular, we achieve ImageNet classification results close to full precision on Resnet-18/50, with little to no accuracy drop for 4-bit models. (A generic scalar-quantization sketch, for background only, follows this session's listing.) |
16:54 CET | SE3.10 | MITIGATING HETEROGENEITIES IN FEDERATED EDGE LEARNING WITH RESOURCE-INDEPENDENCE AGGREGATION Speaker: Zhao Yang, Northwestern Polytechnical University, CN Authors: Zhao Yang and Qingshuang Sun, Northwestern Polytechnical University, CN Abstract Heterogeneities have emerged as a critical challenge in Federated Learning (FL). In this paper, we identify the cause of FL performance degradation due to heterogeneity issues: the locally communicated parameters exhibit feature mismatches and feature-representation range mismatches, resulting in ineffective global model generalization. To address this, heterogeneity-mitigating FL is proposed to improve the generalization of the global model with resource-independence aggregation. Instead of linking local model contributions to their occupied resources, we look for contributing parameters directly in each node's training results. |
16:54 CET | SE3.11 | MULTISPECTRAL FEATURE FUSION FOR DEEP OBJECT DETECTION ON EMBEDDED NVIDIA PLATFORMS Speaker: Thomas Kotrba, TU Wien, AT Authors: Thomas Kotrba1, Martin Lechner1, Omair Sarwar2 and Axel Jantsch1 1TU Wien, AT; 2Mission Embedded GmbH, AT Abstract Multispectral images can improve object detection systems' performance due to their complementary information, especially in adverse environmental conditions. To use multispectral image data in deep-learning-based object detectors, a fusion of the information from the individual spectra, e.g., inside the neural network, is necessary. This paper compares the impact of general fusion schemes in the backbone of the YOLOv4 object detector. We focus on optimizing these fusion approaches for an NVIDIA Jetson AGX Xavier and elaborating on their impact on the device in physical metrics. We optimize six different fusion architectures in the network's backbone for the TensorRT framework and compare their inference time, power consumption, and object detection performance. Our results show that multispectral fusion approaches with little design effort can benefit resource usage and object detection metrics compared to individual networks. |
16:54 CET | SE3.12 | RANKSEARCH: AN AUTOMATIC RANK SEARCH TOWARDS OPTIMAL TENSOR COMPRESSION FOR VIDEO LSTM NETWORKS ON EDGE Speaker: Chenchen Ding, Southern University of Science and Technology, CN Authors: Changhai Man1, Cheng Chang2, Chenchen Ding3, Ao Shen3, Hongwei Ren3, Ziyi Guan4, Yuan Cheng5, Shaobo Luo3, Rumin Zhang3, Ngai Wong4 and Hao Yu3 1Georgia Tech, US; 2University of California, Los Angeles, US; 3Southern University of Science and Technology, CN; 4University of Hong Kong, HK; 5Shanghai Jiao Tong University, CN Abstract Various industrial and domestic applications call for optimized lightweight video LSTM network models at the edge. The recent tensor-train method can transform space-time features into tensors, which can be further decomposed into low-rank network models for lightweight video analysis at the edge. The rank selection of the tensors is, however, performed manually with no optimization. This paper formulates a rank search algorithm to automatically decide tensor ranks with consideration of the trade-off between network accuracy and complexity. A fast rank search method, called RankSearch, is developed to find optimized low-rank video LSTM network models at the edge. Results from experiments show that RankSearch achieves a 4.84× reduction in model complexity, and a 1.96× speed-up in runtime while delivering a 3.86% accuracy improvement compared with manually ranked models. |
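Several contributions in this session (e.g. SE3.6 and SE3.9) revolve around weight quantization. As shared background, the sketch below shows plain uniform symmetric per-tensor quantization with round-to-nearest in NumPy, the baseline that such methods refine, together with the quantization error they aim to reduce; it is a generic illustration, not LatticeQ or the class-based scheme described above.

```python
import numpy as np

def quantize_symmetric(w, n_bits=4):
    """Uniform symmetric per-tensor quantization with round-to-nearest."""
    qmax = 2 ** (n_bits - 1) - 1
    scale = float(np.max(np.abs(w))) / qmax        # one scale for the whole tensor
    q = np.clip(np.round(w / scale), -qmax - 1, qmax).astype(np.int32)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

w = np.random.randn(64, 3, 3, 3).astype(np.float32)   # a toy conv filter bank
q, s = quantize_symmetric(w, n_bits=4)
w_hat = dequantize(q, s)
print("mean squared quantization error:", float(np.mean((w - w_hat) ** 2)))
```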
CC Closing Ceremony
Add this session to my calendar
Date: Wednesday, 19 April 2023
Time: 18:00 CET - 18:30 CET
Location / Room: Darwin Hall
Session chair:
Ian O’Connor, Ecole Centrale de Lyon, FR
Session co-chair:
Robert Wille, TU Munich, DE
Time | Label | Presentation Title Authors |
---|---|---|
18:00 CET | CC.1 | CLOSING REMARKS Speaker: Ian O'Connor and Robert Wille, DATE, BE Authors: Ian O'Connor1 and Robert Wille2 1Lyon Institute of Nanotechnology, FR; 2TU Munich, DE Abstract Closing Remarks from DATE Chairs |
18:15 CET | CC.2 | SAVE THE DATE 2024 Speaker and Author: Andy Pimentel, University of Amsterdam, NL Abstract SAVE the DATE 2024 |