A slightly more detailed intro to GPUs in the context of ML
Many places

Basic GPU tutorial for GPU newbies, in the context of deep learning.

Enough to understand the basic tradeoffs and main optimization goals.

Taught at various guest lectures in the Technion and beyond.

Accelerator-centric system design
10 July 2022
HiPEAC ACACES Summer School (Fiuggi, Italy)

This is a short version of the Accelerators and Accelerator Computing Systems course that I teach at the Technion.

I taught this course to students at the ACACES summer school, organized by HiPEAC.


Securing Trusted Execution Environments from side-channel and untrusted interface attacks
Feb 14, 2022
Microsoft Research Cambridge

Hardware Trusted Execution Environments (TEEs), available in all modern CPUs, aim to guarantee the confidentiality and integrity of program execution and data. A unique property of TEEs is their strong threat model, which permits an attacker to control the OS or hypervisor, making TEEs a foundation of confidential computing offerings in modern cloud systems.

This threat model, however, dramatically amplifies an attacker’s ability to mount realistic side-channel attacks. While such attacks are commonly excluded from TEE security guarantees, they form the majority of all known attacks on TEEs. Another Achilles heel of TEEs is the software interface between the trusted software running in the TEE and the untrusted privileged software on the host. For example, TEE programs often rely on access to the host file system, inevitably requesting services from the untrusted OS. Yet, if not properly secured, this untrusted interface may be used to cause control-flow violations and data leakage in the TEE.

In this talk I will survey our efforts to protect TEEs from these attacks at minimal performance cost. The key research question we address is: “How to securely use untrusted external services from a trusted environment?” The guiding principle of our solution is to establish a software contract between the TEE and external software and build lightweight trusted mechanisms to enforce it. In Varys we introduce an in-TEE runtime to enforce co-scheduling and interrupt-free execution of TEE threads on the same CPU core to prevent cache-timing attacks. In CosMIX and Autarky we devise compiler, hardware, and runtime mechanisms to enforce memory paging policies that mitigate an architectural paging side-channel in Intel SGX. In ProtheN we protect a TEE against untrusted interface attacks by introducing a software layer auto-generated from a formal model of the untrusted service, which is used to validate the correctness of the service behavior. I will describe these mechanisms in more detail and will discuss future directions toward enhancing TEE security.
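To make the idea of a contract on the untrusted interface concrete, here is a minimal hand-written sketch of a checked wrapper around an untrusted file read. It is only an illustration of validating service behavior before the result reaches trusted code; it is not generated from a formal model and is not the mechanism used in any of the systems above. The names `checked_read`, `ContractViolation`, `honest_read`, and `malicious_read` are hypothetical.

```python
class ContractViolation(Exception):
    """Raised when the untrusted service breaks the agreed contract."""


def checked_read(untrusted_read, fd, nbytes):
    """Call an untrusted read service and validate its reply.

    Assumed contract (illustrative only): the reply is a bytes-like
    object holding at most `nbytes` bytes.
    """
    data = untrusted_read(fd, nbytes)
    if not isinstance(data, (bytes, bytearray)):
        raise ContractViolation("reply is not a byte buffer")
    if len(data) > nbytes:
        # A longer reply could overflow a fixed-size trusted buffer.
        raise ContractViolation("reply is longer than requested")
    return bytes(data)


# Two stand-ins for the host OS: one honest, one misbehaving.
def honest_read(fd, nbytes):
    return b"hello world"[:nbytes]


def malicious_read(fd, nbytes):
    return b"A" * (nbytes + 64)


print(checked_read(honest_read, 0, 5))
try:
    checked_read(malicious_read, 0, 5)
except ContractViolation as err:
    print("rejected:", err)
```

The trusted code never sees a reply that violates the contract, so the check sits at the boundary rather than being scattered through the application.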

A computational cache: a neural-net based algorithm for range matching with application to packet classification
15 March 2021
Hebrew University Colloquium

As computer hardware evolves, processor compute capabilities scale much faster than memory performance and capacity.
In our work we strive to leverage this trend to design more efficient data structures and faster computer systems.
We introduce a new algorithm for range matching, which turns a traditional memory-bound task into a compute-bound one, with the help of neural networks.
Given a set of non-intersecting numerical ranges, the goal is to find the range that matches the input. Variants of this problem serve as building blocks in a variety of software and hardware systems, including network switches, storage controllers, DNA sequencing, and memory management. The key problem with conventional algorithms is that the indexing data structure grows too large and spills out of fast caches, constraining the number of ranges that can be handled efficiently.
Our new data structure, the Range Query Recursive Model Index (RQ-RMI), compresses the index by up to two orders of magnitude by representing it as a neural network trained to predict the correct range, thus replacing the index lookup with model inference. The key enabler is an efficient training algorithm that guarantees lookup correctness.
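The general idea of replacing an index lookup with a model prediction plus a bounded correction can be sketched in a few lines. The toy below is not RQ-RMI: it assumes integer keys, uses a single straight line instead of a trained neural network, and the class name `LearnedRangeIndex` is hypothetical. It does show how a worst-case prediction error, computed once at build time, keeps lookups correct.

```python
from bisect import bisect_right


class LearnedRangeIndex:
    """Toy one-stage learned index over non-overlapping integer ranges [lo, hi)."""

    def __init__(self, ranges):
        self.ranges = sorted(ranges)
        self.los = [lo for lo, _ in self.ranges]
        n = len(self.los)
        first, last = self.los[0], self.los[-1]
        # "Model": a line through the first and last lower bounds
        # (an assumption standing in for a trained neural network).
        self.slope = (n - 1) / (last - first) if last > first else 0.0
        self.offset = first
        # The true position is a step function of the key and the model is
        # monotone, so the worst error occurs at the ends of some step.
        self.err = 0
        for i, lo in enumerate(self.los):
            end = self.los[i + 1] - 1 if i + 1 < n else self.ranges[-1][1] - 1
            for key in (lo, end):
                self.err = max(self.err, abs(self._predict(key) - i))

    def _predict(self, key):
        pos = round(self.slope * (key - self.offset))
        return min(max(pos, 0), len(self.los) - 1)

    def lookup(self, key):
        """Return the (lo, hi) range containing key, or None."""
        if key < self.los[0] or key >= self.ranges[-1][1]:
            return None
        guess = self._predict(key)
        # Search only the window that the error bound allows.
        left = max(guess - self.err, 0)
        right = min(guess + self.err + 1, len(self.los))
        i = bisect_right(self.los, key, left, right) - 1
        lo, hi = self.ranges[i]
        return (lo, hi) if lo <= key < hi else None


index = LearnedRangeIndex([(0, 10), (10, 25), (40, 41), (100, 1000), (1000, 1005)])
print(index.lookup(12))  # (10, 25)
print(index.lookup(30))  # None: 30 falls in a gap between ranges
```

The model stores two numbers regardless of how many ranges there are, while the error bound limits how much of the sorted array a lookup has to touch. A better model gives a smaller bound and a shorter search.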
We apply RQ-RMI to multi-field packet classification, a crucial component of modern data center networks. Packet classification is challenging because it requires efficient support for range insertions and involves matching in multiple dimensions with overlapping ranges at a rate of millions of queries per second.
We show that our approach scales to a large number of classification rules and outperforms all existing algorithms by a wide margin, both in isolation and as part of a production Open vSwitch system, with tens of nanoseconds per lookup on a commodity CPU.
This talk is based on the SIGCOMM’20 paper led by my PhD student Alon Rashelbach, jointly with Prof. Ori Rottenstreich.

Fuzzing Away Speculative Execution Attacks
December 2019
December 2019
Intel Israel
Technion Cyber Retreat

SpecFuzz is the first tool to enable dynamic testing for Spectre V1 vulnerabilities.

This presentation provides a high-level explanation of the concept of speculation exposure.

GPU native I/O 2.0 - leveraging new hardware for efficient GPU I/O abstractions
November 2019
NVIDIA HQ, Santa Clara

This talk surveys the evolution of GPU-native I/O services.

I will first discuss the lessons we learned from our first prototypes of GPUfs (file access from GPUs), GPUnet (streaming network I/O), and GPUrdma (RDMA support), focusing on the main hardware and software hurdles that made their implementation and use harder than expected.

I will then describe our recent work that strives to overcome the limitations of earlier systems by using new hardware capabilities. I will first discuss a system for file system access from GPUs via GPU memory-mapped files. Unlike GPUfs, we remove the file system software layer from the GPU, but build on the GPUfs distributed page cache principles to fully integrate the OS page cache into GPU memory with the help of GPU page faults.

Next I will focus on the new opportunities afforded by SmartNICs to improve the performance and efficiency of GPU-accelerated computing services. We develop an accelerator-centric network server architecture that offloads the server data and control plane to the SmartNIC and enables direct networking from accelerators via a lightweight hardware-friendly I/O mechanism. In addition to freeing the CPU from running network processing and accelerator management, as in GPUnet and GPUrdma, we also eliminate the need to run network logic on the GPU, streamlining the integration of network I/O with existing GPUs. Moreover, this architecture easily scales beyond a single machine, enabling convenient network interfaces for remote GPUs. We show experimentally that the use of SmartNICs for GPU-native I/O is portable across accelerators, provides good scalability, and can be efficient on different types of SmartNICs. For example, our Mellanox BlueField-based LeNet neural network inference server achieves a 300 µs request turnaround time and linear scaling with 12 GPUs located in three different servers, and is projected to scale linearly to 100 GPUs without using the host CPU.

OmniX - an OS architecture for omni-programmable systems
November 2019
October 2019
September 2019
May 2019
March 2020
May 2021
UC Berkeley
TU Dresden
Imperial College London

Future systems will be omni-programmable: alongside CPUs, GPUs, security accelerators, and FPGAs, they will execute user code near storage, near the network, and near memory. Ironically, while hardware specialization and near-data processing break the power and memory walls, an emerging programmability wall will become a key impediment to realizing the promised performance and power-efficiency benefits of omni-programmable systems. I argue that the root cause of the programming complexity lies in today’s CPU-centric operating system (OS) design, which is no longer appropriate for omni-programmable systems.

In this talk I will describe the ongoing efforts in my lab to design an accelerator-centric OS called OmniX, which extends standard OS abstractions into accelerators while maintaining a coherent view of the system among all the processors. In OmniX, near-data computation accelerators may directly invoke tasks and access I/O services among themselves, excluding the CPU from the performance-critical data and control plane operations and turning it into “yet another” accelerator for sequential computations. I will show how the OmniX design principles have been successfully applied to GPUs, programmable NICs, and Intel SGX.


Here is the video of my presentation at the Huawei Systems Software Innovation Summit 2021:

Watch the video starting from 5:50.

Foreshadow attack explained
May 2019
March 2019
Technion Cyber Day

Foreshadow is a speculative execution attack on Intel SGX. This talk explains the basic mechanisms of speculative execution attacks and then delves into the details of Foreshadow.

Accelerating Network Applications on SmartNICs with NICA
October 2018
Jun 2018
IBM Research

NICA is a SmartNIC-based infrastructure for inline acceleration of network applications. This talk explains the main concepts.

Zero-effort adaptable security
March 2018

Seamlessly securing applications by running them in Intel SGX is not quite realistic due to performance overheads and hardware side channels. In this talk we argue that there are intermediate points on the security-performance tradeoff curve, which trade some security for better performance and vice versa. We envision that this adjustable security can be enabled by simply recompiling a program with different flags, and present a few ideas for how it can be achieved in practice using the CosMIX compiler (ATC’19).