Mark Silberstein, Professor

A slightly more detailed intro to GPUs in the context of ML

2015-now

Many places

Basic GPU tutorial for GPU newbies, in the context of deep learning.

Enough to understand the basic tradeoffs and main optimization goals.

Taught at various guest lectures in the Technion and beyond.

presentation

Accelerator-centric system design

10 July 2022

HiPEAC ACACES Summer School (Fiuggi, Italy)

This is a short version of the Accelerators and Accelerator Computing Systems course that I teach at the Technion.

I taught it to students at the ACACES summer school organized by HiPEAC.

presentation

Securing Trusted Execution Environments from side-channel and untrusted interface attacks

Feb 14, 2022

Microsoft Research Cambridge

Hardware Trusted Execution Environments (TEEs) available in all modern CPUs aim to guarantee confidentiality and integrity of program execution and data. A unique property of TEEs is their strong threat model that permits an attacker to control an OS or hypervisor, making TEEs a foundation of confidential computing offerings in modern cloud systems.

This threat model, however, dramatically amplifies an attacker’s ability to mount realistic side-channel attacks. While such attacks are commonly excluded from the TEE security guarantees, they form the majority of all known attacks on TEEs. Another Achilles heel of TEEs is the software interface between the trusted software running in the TEE and the untrusted privileged software on the host. For example, TEE programs often rely on access to the host file system, inevitably requesting services from the untrusted OS. Yet, if not properly secured, this untrusted interface may be used to cause control-flow violations and data leakage in the TEE.

In this talk I will survey our efforts to protect TEEs from these attacks at minimal performance cost. The key research question we address is: “How to securely use untrusted external services from a trusted environment?” The guiding principle of our solution is to establish a software contract between the TEE and external software, and to build lightweight trusted mechanisms to enforce it. In Varys we introduce an in-TEE runtime to enforce co-scheduling and interrupt-free execution of TEE threads on the same CPU core to prevent cache-timing attacks. In CosMIX and Autarky we devise compiler, hardware, and runtime mechanisms to enforce memory paging policies that mitigate an architectural paging side channel in Intel SGX. In ProtheN we protect a TEE against untrusted interface attacks by introducing a software layer auto-generated from a formal model of the untrusted service, which is used to validate the correctness of the service behavior. I will describe these mechanisms in more detail and discuss future directions toward enhancing TEE security.

A computational cache: a neural-net based algorithm for range matching with application to packet classification

15 March 2021

Hebrew University Colloquium

As computer hardware evolves, processor compute capabilities scale much faster than memory performance and capacity.

In our work we strive to leverage this trend to design more efficient data structures and faster computer systems.

We introduce a new algorithm for range matching, which turns a traditional memory-bound task into a compute-bound one, with the help of neural networks.

Given a set of non-intersecting numerical ranges, the goal is to find the range that matches the input. Variants of this problem serve as building blocks in a variety of software and hardware systems, from network switches and storage controllers to DNA sequencing and memory management. The key problem of conventional algorithms is that the indexing data structure grows too large, spilling out of fast caches and constraining the number of ranges that can be handled efficiently.

Our new data structure, Range Query Recursive Model Index (RQ-RMI), allows compressing the index by up to two orders of magnitude by representing it as a neural network trained to predict the correct range, thus replacing the index lookup with model inference. The key enabler is an efficient training algorithm that guarantees the lookup correctness.
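The idea of replacing an index lookup with model inference can be illustrated with a minimal sketch. This is not RQ-RMI: it assumes a single least-squares line in place of a trained neural network, and the names baseline_lookup and LearnedRangeIndex are hypothetical, introduced here only for illustration. What it shares with the description above is the structure: a small model predicts the position of the matching range, and an error bound computed at build time confines the final search to a small window, so the lookup stays correct.

```python
import bisect


def baseline_lookup(ranges, key):
    """Conventional lookup: binary search over all range start points."""
    starts = [lo for lo, _ in ranges]  # rebuilt per call only for brevity
    i = bisect.bisect_right(starts, key) - 1
    if i >= 0 and ranges[i][0] <= key <= ranges[i][1]:
        return ranges[i]
    return None


class LearnedRangeIndex:
    """Illustrative learned index over sorted, non-overlapping inclusive ranges.

    Assumption: a linear model stands in for the neural network of RQ-RMI.
    """

    def __init__(self, ranges):
        if not ranges:
            raise ValueError("at least one range is required")
        self.ranges = ranges
        self.starts = [lo for lo, _ in ranges]
        n = len(ranges)
        # Fit position ~ a * key + b by least squares over the start points.
        mean_x = sum(self.starts) / n
        mean_y = (n - 1) / 2
        var = sum((x - mean_x) ** 2 for x in self.starts)
        cov = sum((x - mean_x) * (i - mean_y) for i, x in enumerate(self.starts))
        self.a = cov / var if var else 0.0
        self.b = mean_y - self.a * mean_x
        # The model is monotone (a >= 0), so the worst prediction error inside
        # a range occurs at one of its two endpoints.
        self.err = max(
            abs(self._predict(k) - i)
            for i, (lo, hi) in enumerate(ranges)
            for k in (lo, hi)
        )

    def _predict(self, key):
        return round(self.a * key + self.b)

    def lookup(self, key):
        p = self._predict(key)
        lo = max(0, p - self.err)
        hi = min(len(self.ranges), p + self.err + 1)
        if lo >= hi:
            return None  # prediction window lies entirely outside the index
        # Search only the small window around the predicted position.
        i = bisect.bisect_right(self.starts, key, lo, hi) - 1
        if i >= 0 and self.ranges[i][0] <= key <= self.ranges[i][1]:
            return self.ranges[i]
        return None


ranges = [(0, 9), (20, 29), (30, 99), (1000, 1999)]
index = LearnedRangeIndex(ranges)
print(index.lookup(25))  # (20, 29)
print(index.lookup(15))  # None: 15 falls in a gap between ranges
```

In this sketch the model is two floating-point numbers plus one error bound, and the quality of the fit determines how wide the search window is. A poor fit only widens the window; it never produces a wrong answer, because the bound is measured over the actual ranges at build time.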

We apply RQ-RMI to multi-field packet classification, which is a crucial component in modern data center networks. Packet classification is a challenging problem: it requires efficient support for range insertions, and it involves matching in multiple dimensions with overlapping ranges at a rate of millions of queries per second.

We show that our approach scales to a large number of classification rules and outperforms all existing algorithms by a wide margin, both in isolation and as part of a production Open vSwitch system, with tens of nanoseconds per lookup on a commodity CPU.

This talk is based on the SIGCOMM’20 paper led by my PhD student Alon Rashelbach, jointly with Prof. Ori Rottenstreich.

YouTube video

presentation

Fuzzing Away Speculative Execution Attacks

December 2019

Intel Israel

Technion Cyber Retreat

SpecFuzz is the first tool to enable dynamic testing for Spectre V1 vulnerabilities.

This presentation provides a high-level explanation of the concept of speculation exposure.

presentation

GPU native I/O 2.0 - leveraging new hardware for efficient GPU I/O abstractions

November 2019

NVIDIA HQ, Santa Clara

This talk surveys the evolution of GPU-native I/O services.

I will first discuss the lessons we learned from our first prototypes for GPUfs (file access from GPUs), GPUnet (streaming network I/O) and GPUrdma (RDMA support), focusing on the main hardware and software hurdles that made their implementation and use harder than expected.

I will then describe our recent work that strives to overcome the limitations of earlier systems by using new hardware capabilities. I will first discuss a system for GPU file system access via GPU memory-mapped files. Unlike in GPUfs, we remove the file system software layer from the GPU, but build on the GPUfs distributed page cache principles to fully integrate the OS page cache into GPU memory with the help of GPU page faults.

Next I will focus on the new opportunities afforded by SmartNICs to improve the performance and efficiency of GPU-accelerated computing services. We develop an accelerator-centric network server architecture which offloads the server data and control plane to the SmartNIC and enables direct networking from accelerators via a lightweight hardware-friendly I/O mechanism. In addition to freeing the CPU from running network processing and accelerator management as in GPUnet and GPUrdma, we also eliminate the need to run network logic on the GPU, streamlining the integration of network I/O with existing GPUs. Moreover, this architecture easily scales beyond a single machine, enabling convenient network interfaces for remote GPUs. We show experimentally that the use of SmartNICs for GPU-native I/O is portable across accelerators, provides good scalability and can be efficient in different types of SmartNICs. For example, our Mellanox BlueField-based LeNet neural network inference server achieves a 300 µs request turnaround time and linear scaling with 12 GPUs located in three different servers, and is projected to scale linearly to 100 GPUs without using the host CPU.

presentation

OmniX - an OS architecture for omni-programmable systems

November 2019

October 2019

September 2019

May 2019

March 2020

May 2021

UC Berkeley

EPFL

TU Dresden

Huawei

Imperial College London

Future systems will be omni-programmable: alongside CPUs, GPUs, security accelerators and FPGAs, they will execute user code near-storage, near-network, and near-memory. Ironically, while hardware specialization and near-data processing break the power and memory walls, an emerging programmability wall will become a key impediment to realizing the promised performance and power efficiency benefits of omni-programmable systems. I argue that the root cause of the programming complexity lies in today’s CPU-centric operating system (OS) design, which is no longer appropriate for omni-programmable systems.

In this talk I will describe the ongoing efforts in my lab to design an accelerator-centric OS called OmniX, which extends standard OS abstractions into accelerators while maintaining a coherent view of the system among all the processors. In OmniX, near-data computation accelerators may directly invoke tasks and access I/O services among themselves, excluding the CPU from the performance-critical data and control plane operations and turning it into “yet another” accelerator for sequential computations. I will show how the OmniX design principles have been successfully applied to GPUs, programmable NICs and Intel SGX.

Here is the video of my presentation at the Huawei Systems Software Innovation Summit 2021:

Watch the video starting from 5:50

presentation

Foreshadow attack explained

May 2019

March 2019

Cornell Tech

Technion Cyber Day

Foreshadow is a speculative execution attack on Intel SGX. This talk explains the basic mechanisms of speculative execution attacks and then delves into the details of Foreshadow.

presentation

Accelerating Network Application on SmartNICs with NICA

October 2018

Jun 2018

Cornell

IBM Research

NICA is a SmartNIC-based infrastructure for inline acceleration of network applications. This talk explains the main concepts.

presentation

Zero-effort adaptable security

March 2018

Paris

Seamlessly securing applications by running them in Intel SGX is not quite realistic due to performance overheads and hardware side channels. In this talk we argue that there are intermediate points on the security-performance tradeoff curve, which trade some security for better performance and vice versa. We envision that this adjustable security can be enabled by simply recompiling a program with different flags, and present a few ideas for how it can be achieved in practice using the CosMIX compiler (ATC’19).

presentation

Talks