Mark Silberstein, Professor

Professor

+972 73 378 7164

Room 517, Zisapel (new ECE building)

Viterbi Faculty of Electrical and Computing Engineering, Technion City, Haifa 3200003, Israel

SHORT BIO & RESEARCH INTERESTS

more about me

I am the head of the Accelerated Computing Systems Lab.

We use bleeding-edge hardware to build new easy-to-program high-performance systems with strong security. Our projects include new operating systems for GPUs, FPGAs and SmartNICs, data-center scale OS for disaggregated architectures, new defenses against hardware side channels and speculative execution attacks, new compilers and runtimes for secure processors, distributed programs and virtual machines on programmable switches, and machine learning-based hardware systems. Our research results were partially adopted by NVIDIA, Mellanox and Intel. Most of our software is open-source and free.

I teach Intro to OS (046209) and Advanced OS (048080) in Winters, and Accelerated Systems (046278/236278) and Computer Security Principles (046280) in Springs.

I am always looking for ambitious students with a strong technical background, creative and original thinking, and a passion for building innovative computer systems using emerging hardware with real long-term impact on industry and academic research. I also have postdoc and summer internship openings.

מידע לסטודנטים לגבי בחירת מסלול לימודים בפקולטה להנדסת חשמל

SELECTED PUBLICATIONS

all publications

[NSDI 2026] In Link We Trust: BFT at the Speed of CFT using Switches

In Link We Trust: BFT at the Speed of CFT using Switches

Abstract

We introduce SwitchBFT, a novel BFT consensus protocol for data centers that matches the performance and fault tolerance guarantees of the fastest Crash Fault Tolerance (CFT) protocols. We take advantage of several unique properties of the trusted network that have emerged in modern data centers. SwitchBFT leverages packet source authentication to eliminate the overheads of cryptographic signatures, thus speeding up the fault-free scenario, and utilizes network switch programmability to enforce agreement decisions and to verify that safety is not violated, thereby offering robust performance even when some replicas are faulty. Designing a practical BFT that makes the most of these properties requires solving several challenges, such as packet losses and switch crash faults, all within the tight switch resource budget. We show that SwitchBFT outperforms state-of-the-art BFTs in scalability and performance, attaining the speed of NOPaxos, an in-switch CFT implementation.

Lior Zeno, Naama Ben-David, Mark Silberstein [accepted]

[HotNets 2025] FlowPulse: Catching Network Failures in ML Clusters

FlowPulse: Catching Network Failures in ML Clusters

Abstract

Network hardware faults are inevitable in massive scale-out ML training clusters. Networks in such systems are inherently designed for resiliency, routing around faulty components as long as a fault is detected. Unfortunately, some silent faults evade detection. Notably, the effects of silent faults are amplified in modern production networks that deploy per-packet load balancing, because packets of a single flow traverse many network paths, making such faults particularly hard to localize. We present FlowPulse, the first system for rapid, low-overhead detection of silent network faults in per-packet spraying networks. Our key insight is that distributed training workloads induce predictable traffic patterns in the switch ports we refer to as a temporal symmetry. This symmetry emerges even in the presence of known faults, and can be modeled analytically or learned from the traffic. FlowPulse detects new network faults of training tasks by identifying subtle deviations from the expected temporal symmetry on each switch during collective communications, all without any inter-switch coordination or probing overheads. Our preliminary results show that FlowPulse is effective in detecting silent faults in a variety of realistic settings, topologies, and fault patterns. For example, FlowPulse identifies a single faulty link with 1.5% corruption rate by checking temporal symmetry in a full two-level fat tree topology with 32 leaf switches while performing Ring-AllReduce on all nodes.

Jakob Kerbs, Dmitry Gavrilenko, Daniel Amir, Shir Landau Feibish, Mark Silberstein [accepted]

[USENIX ATC 2025] Accelerating Nested Virtualization with HyperTurtle

Accelerating Nested Virtualization with HyperTurtle

Abstract

Nested virtualization provides strong isolation but incurs non-trivial performance costs. Prior works alleviate some overheads but suffer from limitations such as intrusive code changes or reduced control over nested virtual environments. We present HyperTurtle, a general approach to accelerate nested virtualization. It reduces the number of costly world switches between the virtualization layers, the primary source of performance overheads. HyperTurtle offloads the execution of certain parts on the critical path of the virtualized hypervisor, encapsulating them as eBPF programs and executing them safely in the context of the bare-metal hypervisor. Thus, HyperTurtle reduces the performance cost of world switches whilst retaining control over nested VMs. We show that HyperTurtle can be used to optimize a variety of OS subsystems and apply it to memory management, networking, and application profiling. HyperTurtle achieves significant performance improvements in micro and macro-benchmarks, for example, 5× faster EPT fault handling, which translates to up to 27% faster boot-time of Kata containers, without requiring intrusive code changes to the virtualization infrastructure.

Ori Ben-Zur, Jakob Kerbs, Shai Bergman, Mark Silberstein

Best Paper Award

[SYSTOR 2024] Space-efficient FTL for Mobile Storage via Tiny Neural Nets

Space-efficient FTL for Mobile Storage via Tiny Neural Nets

BibTeX

@inproceedings{10.1145/3688351.3689157,
author = {Marcus, Ron and Rashelbach, Alon and Ben-Zur, Ori and Lifshits, Pavel and Silberstein, Mark},
title = {Space-efficient FTL for Mobile Storage via Tiny Neural Nets},
year = {2024},
isbn = {9798400711817},
publisher = {Association for Computing Machinery},
address = {New York, NY, USA},
url = {https://doi.org/10.1145/3688351.3689157},
doi = {10.1145/3688351.3689157},
booktitle = {Proceedings of the 17th ACM International Systems and Storage Conference},
pages = {146–161},
numpages = {16},
keywords = {FTL, Mobile Storage, Range Matching, SSD},
location = {Virtual, Israel},
series = {SYSTOR '24}
}

Space-efficient FTL for Mobile Storage via Tiny Neural Nets

Abstract

We present RQFTL, a demand-based FTL for mobile storage controllers that boosts the effective Logical-To-Physical (L2P) address translation cache capacity over state-of-the-art techniques. RQFTL stores a large part of the L2P cache in a compressed form, and employs a learned data structure called RQRMI that leverages tiny neural nets to quickly find the correct translation entry in the cache. RQFTL uses neural network inference for cache lookups, and rapidly retrains the neural nets to efficiently handle L2P cache updates. It is specifically optimized to achieve high coverage for scattered read accesses, making it suitable for popular read-skewed workloads such as mobile gaming. We evaluate RQFTL on hours-long real-world I/O traces of popular modern mobile apps, including games, video editing, and social networking apps collected on Google Pixel 6a phone. We show that RQFTL outperforms all the state-ofthe-art FTLs in these workloads, increasing the effective L2P cache capacity by over an order of magnitude compared toDFTL and up to 5× over the recent LeaFTL. As a result, it achieves 65%, and 25% lower miss rate compared to DFTL and LeaFTL respectively, under the same SRAM capacity, and allows reduction of the total SRAM capacity of a controller by about a third of that of LeaFTL.

BibTeX

Ron Marcus, Ori Ben-Zur, Alon Rashelbach, Pavel Lifshits, Mark Silberstein

[SIGCOMM 2024] In-Network Address Caching for Virtual Networks

In-Network Address Caching for Virtual Networks

BibTeX

@inproceedings {10.1145/3651890.3672213,
author = {Lior Zeno, Ang Chen, and Mark Silberstein},
title = {In-Network Address Caching for Virtual Networks},
booktitle = {Proceedings of the ACM SIGCOMM 2024 Conference},
year = {2024},
address = {Sydney, NSW, Australia},
url = {https://doi.org/10.1145/3651890.3672213},
publisher = {Association for Computing Machinery},
}

In-Network Address Caching for Virtual Networks

Abstract

Packet routing in virtual networks requires virtual-to-physical address translation. Whereas the translation mappings are created by a single party (i.e., the network administrator), they are read across the network when routing tenant packets. Existing approaches face an inherent read-write performance tradeoff — either store these mappings in dedicated gateways for fast updates at the cost of slower forwarding or replicate them at end-hosts and suffer from slow updates.

SwitchV2P aims to escape this tradeoff by leveraging the network switches to transparently cache the mappings while learning them from the traffic. This brings the translations closer to the sender, reducing the first packet latency and lookup-related network overheads, and combines that with efficient mapping updates, without interfering with the existing routing policies and deployed gateways. The topology-aware data-plane caching protocol enables the switches to transparently and dynamically adapt to changing network conditions and handle variations of in-switch memory capacity.

Our evaluation shows the benefits of in-network address mapping, including up-to a 7.8X reduction in Flow CompletionTtime, a 4.3X decrease in first packet latency, and a substantial reduction in translation gateway load. Additionally, SwitchV2P achieves up to a 1.9X reduction in bandwidth overhead and requires order-of-magnitude fewer gateways for equivalent performance.

BibTeX

Lior Zeno, Ang Chen, Mark Silberstein

[NSDI 2024] Multitenant In-Network Acceleration with SwitchVM

Multitenant In-Network Acceleration with SwitchVM

BibTeX

@inproceedings {
Khashab-NSDI24,
author = {Sajy Khashab and Alon Rashelbach and Mark Silberstein},
title = {Multitenant {In-Network} Acceleration with {SwitchVM}},
booktitle = {21st USENIX Symposium on Networked Systems Design and Implementation (NSDI 24)},
year = {2024},
isbn = {978-1-939133-39-7},
address = {Santa Clara, CA},
pages = {691--708},
url = {https://www.usenix.org/conference/nsdi24/presentation/khashab},
publisher = {USENIX Association},
}

Multitenant In-Network Acceleration with SwitchVM

Abstract

We propose a practical approach to implementing multitenancy on programmable network switches to make in-network acceleration accessible to cloud users. We introduce a Switch Virtual Machine (SwitchVM), that is deployed on the switches and offers an expressive instruction set and program state abstractions. Tenant programs, called data-plane filters (DPFs), are executed on top of SwitchVM in a sandbox with memory, network, and state isolation policies controlled by network operators. The packets that trigger DPF execution include the code to execute or a reference to the dpfs deployed in the switch. dpfs are Turing-complete, may maintain state in the packet and in switch virtual memory, may form a dynamic chain, and may steer packets to desired destinations, all while enforcing the operator’s policies.

We demonstrate that this idea is practical by prototyping projname in P4 on Intel Tofino switches. We describe a variety of use cases that SwitchVM supports, and implement three complex applications from prior works — Key-Value Store cache, Load-aware load balancer and Paxos accelerator. We also show that SwitchVM provides strong performance isolation, zero-overhead runtime programmability, may hold two orders of magnitude more in-switch programs than existing techniques, and may support up to thirty thousand concurrent tenants each with its private state.

BibTeX

Sajy Khashab, Alon Rashelbach, Mark Silberstein

[MICRO 2023] NeuroLPM: Scaling Longest Prefix Match Hardware with Neural Networks

NeuroLPM: Scaling Longest Prefix Match Hardware with Neural Networks

Abstract

Longest Prefix Match engines (LPM) are broadly used in computer systems and especially in modern network devices such as Network Interface Cards (NICs), switches and routers. However, existing LPM hardware fails to scale to millions of rules required by modern
systems, is often optimized for specific applications, and thus is performance-sensitive to the structure of LPM rules.

We describe NeuroLPM, a new architecture for multi-purpose LPM hardware that replaces queries in traditional memory-intensive trie- and hash-table data structures with inference in a lightweight Neural Network-based model, called RQRMI. NeuroLPM scales to millions of rules under small on-die SRAM budget and achieves stable, rule-structure-agnostic performance, allowing its use in a variety of applications. We solve several unique challenges when implementing RQRMI inference in hardware, including minimizing the amount of floating point computations while maintaining query correctness, and scaling the rule-set size while ensuring small, deterministic off-chip memory bandwidth.

We prototype NeuroLPM in Verilog and evaluate it on real-world packet forwarding rule-sets and network traces. NeuroLPM offers substantial scalability benefits without any application-specific optimizations. For example, it is the only algorithm that can serve a 950K-large rule-set at an average of 196M queries per second with 4.5MB of SRAM, only within 2% of the best-case throughput of the state-of-the-art Tree Bitmap and SAIL on smaller rule-sets. With 2MB of SRAM, it reduces the DRAM bandwidth per query, the dominant performance factor, by up to 9× and 3× compared to the state-of-the-art.

Alon Rashelbach, Igor De-Paula, Mark Silberstein

[USENIX Security 2023] AEX-Notify: Thwarting Precise Single-Stepping Attacks through Interrupt Awareness for Intel SGX Enclaves

AEX-Notify: Thwarting Precise Single-Stepping Attacks through Interrupt Awareness for Intel SGX Enclaves

BibTeX

@inproceedings {291295,
author = {Scott Constable and Jo Van Bulck and Xiang Cheng and Yuan Xiao and Cedric Xing and Ilya Alexandrovich and Taesoo Kim and Frank Piessens and Mona Vij and Mark Silberstein},
title = {{AEX-Notify}: Thwarting Precise {Single-Stepping} Attacks through Interrupt Awareness for Intel {SGX} Enclaves},
booktitle = {32nd USENIX Security Symposium (USENIX Security 23)},
year = {2023},
isbn = {978-1-939133-37-3},
address = {Anaheim, CA},
pages = {4051--4068},
url = {https://www.usenix.org/conference/usenixsecurity23/presentation/constable},
publisher = {USENIX Association},
month = aug,
}

AEX-Notify: Thwarting Precise Single-Stepping Attacks through Interrupt Awareness for Intel SGX Enclaves

Abstract

Intel® Software Guard Extensions (Intel® SGX) supports the creation of shielded enclaves within unprivileged processes. While enclaves are architecturally protected against malicious system software, Intel SGX’s privileged attacker model could potentially expose enclaves to new powerful side-channel attacks. In this paper, we consider hardware-software co-design countermeasures to an important class of single-stepping attacks that use privileged timer interrupts to precisely step through enclave execution exactly one instruction at a time, as supported, e.g., by the open-source SGX-Step framework. This is a powerful deterministic attack primitive that has been employed in a broad range of high-resolution Intel SGX attacks, but so far remains unmitigated.

We propose AEX-Notify, a flexible hardware ISA extension that makes enclaves interrupt-aware: enclaves can register a trusted handler to be run after an interrupt or exception. AEX-Notify can be used as a building block for implementing countermeasures against different types of interrupt-based attacks in software. With our primary goal to thwart deterministic single-stepping, we first diagnose the underlying hardware behavior to determine the root cause that enables it. We then apply the learned insights to remove this root cause by building an efficient software handler and constant-time disassembler to transparently determine and atomically prefetch the working set of the next enclave application instruction.

The ISA extension we propose in this paper has been incorporated into a revised version of the Intel SGX specification.

BibTeX

Scott Constable, Jo Van Bulck, Xiang Cheng, Yuan Xiao, Cedric Xing, Ilya Alexandrovich, Taesoo Kim, Frank Piessens, Mona Vij, Mark Silberstein

Intel ISA extension

[USENIX ATC 2023] Translation Pass-Through for Near-Native Paging Performance in VMs

Translation Pass-Through for Near-Native Paging Performance in VMs

BibTeX

@inproceedings {288762,
author = {Shai Bergman and Mark Silberstein and Takahiro Shinagawa and Peter Pietzuch and Llu{\'\i}s Vilanova},
title = {Translation {Pass-Through} for {Near-Native} Paging Performance in {VMs}},
booktitle = {2023 USENIX Annual Technical Conference (USENIX ATC 23)},
year = {2023},
isbn = {978-1-939133-35-9},
address = {Boston, MA},
pages = {753--768},
url = {https://www.usenix.org/conference/atc23/presentation/bergman},
publisher = {USENIX Association},
month = jul,
}

Translation Pass-Through for Near-Native Paging Performance in VMs

Abstract

Virtual machines (VMs) are used for consolidation, isolation, and provisioning in the cloud, but applications with large working sets are impacted by the overheads of memory address translation in VMs. Existing translation approaches incur non-trivial overheads: (i) nested paging has a worst-case latency that increases with page table depth; and (ii) paravirtualized and shadow paging suffer from high hypervisor intervention costs when updating guest page tables.

We describe Translation Pass-Through (TPT), a new memory virtualization mechanism that achieves near-native performance. TPT enables VMs to control virtual memory translation from guest-virtual to host-physical addresses using one-dimensional page tables. At the same time, inter-VM isolation is enforced by the host by exploiting new hardware support for physical memory tagging in commodity CPUs.

We prototype TPT by modifying the KVM/QEMU hypervisor and enlightening the Linux guest. We evaluate it by emulating the memory tagging mechanism of AMD CPUs. Our conservative performance estimates show that TPT achieves native performance for real-world data center applications, with speedups of up to 2.4 $\times$ and 1.4 $\times$ over nested and shadow paging, respectively

BibTeX

Shai Bergman, Mark Silberstein, Takahiro Shinagawa, Peter Pietzuch, Lluis Vilanova

[S&P 2023] Hide and Seek with Spectres: Efficient discovery of speculative information leaks with random testing

Hide and Seek with Spectres: Efficient discovery of speculative information leaks with random testing

BibTeX

@inproceedings{oleksenko2023hide,
author = { Oleksii Oleksenko and
Marco Guarnieri and
Boris K"{o}pf and
Mark Silberstein
},
title = {{Hide and Seek with Spectres: Efficient discovery of speculative information leaks with random testing}},
booktitle = {Proceedings of the 44th IEEE Symposium on Security and Privacy},
year = {2023},
series={S&P 2023},
publisher={IEEE}
}

Hide and Seek with Spectres: Efficient discovery of speculative information leaks with random testing

Abstract

Attacks like Spectre abuse speculative execution, one of the key performance optimizations of modern CPUs. Recently, several testing tools have emerged to automatically detect speculative leaks in commercial (black-box) CPUs. However, the testing process is still slow, which has hindered in-depth testing campaigns, and so far prevented the discovery of new classes of leakage. In this paper, we identify the root causes of the performance limitations in existing approaches and propose techniques to overcome these limitations. With these techniques, we improve the testing speed over the state-of-the-art by up to two orders of magnitude. These improvements enable us to run a testing campaign of unprecedented depth on Intel and AMD CPUs. As a highlight, we discover two types of previously unknown speculative leaks (affecting string comparison and division) that have escaped previous manual and automatic analyses.

BibTeX

Oleksii Oleksenko, Marco Guarnieri, Boris Köpf, Mark Silberstein

[ISMM 2022] Reconsidering OS Memory Optimizations in the Presence of Disaggregated Memory

Reconsidering OS Memory Optimizations in the Presence of Disaggregated Memory

BibTeX

@inproceedings{Bergman22,
author = {Bergman, Shai and Faldu, Priyank and Grot, Boris and Vilanova, Llu\'{\i}s and Silberstein, Mark},
title = {Reconsidering OS Memory Optimizations in the Presence of Disaggregated Memory},
year = {2022},
isbn = {9781450392679},
publisher = {Association for Computing Machinery},
address = {New York, NY, USA},
url = {https://doi.org/10.1145/3520263.3534650},
doi = {10.1145/3520263.3534650},
abstract = {Tiered memory systems introduce an additional memory level with higher-than-local-DRAM access latency and require sophisticated memory management mechanisms to achieve cost-efficiency and high performance. Recent works focus on byte-addressable tiered memory architectures which offer better performance than pure swap-based systems. We observe that adding disaggregation to a byte-addressable tiered memory architecture requires important design changes that deviate from the common techniques that target lower-latency non-volatile memory systems. Our comprehensive analysis of real workloads shows that the high access latency to disaggregated memory undermines the utility of well-established memory management optimizations Based on these insights, we develop HotBox – a disaggregated memory management subsystem for Linux that strives to maximize the local memory hit rate with low memory management overhead. HotBox introduces only minor changes to the Linux kernel while outperforming state-of-the-art systems on memory-intensive benchmarks by up to 2.25\texttimes{}.},
booktitle = {Proceedings of the 2022 ACM SIGPLAN International Symposium on Memory Management},
pages = {1–14},
numpages = {14},
keywords = {Disaggregated Memory, Operating Systems},
location = {San Diego, CA, USA},
series = {ISMM 2022}
}

Reconsidering OS Memory Optimizations in the Presence of Disaggregated Memory

Abstract

Tiered memory systems introduce an additional memory level with higher-than-local-DRAM access latency and require sophisticated memory management mechanisms to achieve cost-efficiency and high performance. In particular, recent works focus on byte-addressable tiered memory architectures which offer better performance than pure swap-based systems. We observe that adding disaggregation to a byte-addressable tiered memory architecture requires important design changes that deviate from the common techniques that target lower-latency non-volatile memory systems. Our comprehensive analysis of real workloads on a hardware-emulated tiered memory system shows that the high access latency to disaggregated memory undermines the utility of well-established memory management optimizations. In particular, we find that the use of huge pages and page migration in batches, advocated by recent works, are detrimental to performance in high-latency memory regime due to their negative side effects on the local memory hit rate. Based on these insights, we develop HotBox – a disaggregated memory management subsystem for Linux that strives to maximize the local memory hit rate with low memory management overhead. HotBox introduces only minor changes to the Linux kernel while outperforming state-of-the-art systems on memory-intensive benchmarks by up to 2.25 $\times$ .

BibTeX

Shai Bergman, Priyank Faldu, Lluis Vilanova, Boris Grot, Mark Silberstein

[ATC 2022] ZNSwap: un-Block your Swap

ZNSwap: un-Block your Swap

BibTeX

@inproceedings {znswap-bergman22,
title = {{ZNSwap}: {un-Block} your Swap},
booktitle = {2022 USENIX Annual Technical Conference (USENIX ATC 22)},
year = {2022},
address = {Carlsbad, CA},
publisher = {USENIX Association},
month = jul,
}

ZNSwap: un-Block your Swap

Abstract

We introduce ZNSwap, a novel swap subsystem optimized for the recent Zoned Namespace (ZNS) SSDs. ZNSwap leverages ZNS’s explicit control over data management on the drive and introduces a space-efficient host-side Garbage Collector (GC) for swap storage co-designed with the OS swap logic. ZNSwap enables cross-layer optimizations, such as direct access to the in-kernel swap usage statistics by the GC to enable fine-grain swap storage management, and correct accounting of the GC bandwidth usage in the OS resource isolation mechanisms to improve performance isolation in multi-tenant environments. We evaluate ZNSwap using standard Linux swap benchmarks and two production key-value stores. ZNSwap shows significant performance benefits over the Linux swap on traditional SSDs, such as stable throughput for different memory access patterns, and 10× lower 99th percentile latency and 5× higher throughput for memcached key-value store under realistic usage scenarios.

BibTeX

Shai Bergman, Niklas Cassel, Matias Bjørling, Mark Silberstein

[EuroSys 2022] Slashing the Disaggregation Tax in Heterogeneous Data Centers with FractOS

Slashing the Disaggregation Tax in Heterogeneous Data Centers with FractOS

BibTeX

@inproceedings{vilanova2022Fractos,
author = {Vilanova, Llu\'{\i}s and Maudlej, Lina and Bergman, Shai and Miemietz, Till and Hille, Matthias and Asmussen, Nils and Roitzsch, Michael and H\"{a}rtig, Hermann and Silberstein, Mark},
title = {Slashing the Disaggregation Tax in Heterogeneous Data Centers with FractOS},
year = {2022},
publisher = {Association for Computing Machinery},
address = {New York, NY, USA},
booktitle = {Proceedings of the Seventeenth European Conference on Computer Systems},
pages = {352–367},
numpages = {16},
location = {Rennes, France},
series = {EuroSys '22}
}

Slashing the Disaggregation Tax in Heterogeneous Data Centers with FractOS

Abstract

Disaggregated heterogeneous data centers promise higher efficiency, lower total costs of ownership, and more flexibility for data-center operators. However, the current software stack can levy a high tax on application performance. Applications and OSes are designed for systems where local PCIe-connected devices are centrally managed by CPUs, but this centralization introduces unnecessary messages through the shared data center network in a disaggregated system.

We present FractOS, a distributed OS that is designed to minimize the network overheads of disaggregation in heterogeneous data centers. FractOS elevates disaggregated devices to be first-class citizens in the system enabling direct data transfers and task invocations among them, without centralized application and OS control. FractOS achieves this through:
(1) new mechanisms to express distributed applications across services and disaggregated devices,
(2) new mechanisms that enable devices to securely interact with each other and other data-center services,
(3) a distributed and isolated OS layer that implements these mechanisms and can execute on host CPUs and SmartNICs.

Our prototype shows that FractOS accelerates real-world heterogeneous applications by 47%, while reducing their network traffic by 3x

BibTeX

Lluis Vilanova, Lina Maudlej, Shai Bergman, Matthias Hille, Till Miemietz, Nils Asmussen, Michael Roitzsch, Hermann Härtig, Mark Silberstein

[NSDI 2022] Scaling Open vSwitch with a Computational Cache

Scaling Open vSwitch with a Computational Cache

BibTeX

@inproceedings {rashelbach2022NSDI,
author = {Alon Rashelbach and Ori Rottenstreich and Mark Silberstein},
title = {Scaling Open {vSwitch} with a Computational Cache},
booktitle = {19th USENIX Symposium on Networked Systems Design and Implementation (NSDI 22)},
year = {2022},
address = {Renton, WA},
pages = {1359--1374},
url = {https://www.usenix.org/conference/nsdi22/presentation/rashelbach},
publisher = {USENIX Association},
month = apr,
}

Scaling Open vSwitch with a Computational Cache

Abstract

Open vSwitch (OVS) is a widely used open-source virtual switch implementation. In this work, we seek to scale up OVS to support hundreds of thousands of OpenFlow rules by accelerating the core component of its data-path – the packet classification mechanism. To do so we use NuevoMatch, a recent algorithm that uses neural network inference to match packets, and promises significant scalability and performance benefits. We overcome the primary algorithmic challenge of the slow rule update rate in the vanilla NuevoMatch, speeding it up by over three orders of magnitude. This improvement enables two design options to integrate NuevoMatch with OVS: (1) using it as an extra caching layer in front of OVS’s megaflow cache, and (2) using it to completely replace OVS’s data-path while performing classification directly on OpenFlow rules, and obviating control-path upcalls. Our comprehensive evaluation on real-world packet traces and between 1K to 500K ClassBench rules demonstrates the geometric mean speedups of 1.9times× and 12.3times× for the first and second designs, respectively, with the latter also supporting up to 60K OpenFlow rule updates/second, by far exceeding the original OVS.

BibTeX

Alon Rashelbach, Ori Rottenstreich, Mark Silberstein

[NSDI 2022] An edge-queued datagram service for all data-center traffic

An edge-queued datagram service for all data-center traffic

BibTeX

@inproceedings {Raicu2022NSDI,
author = {Vladimir Olteanu and Haggai Eran and Dragos Dumitrescu and Adrian Popa and Cristi Baciu and Mark Silberstein and Georgios Nikolaidis and Mark Handley and Costin Raiciu},
title = {An edge-queued datagram service for all datacenter traffic},
booktitle = {19th USENIX Symposium on Networked Systems Design and Implementation (NSDI 22)},
year = {2022},
address = {Renton, WA},
pages = {761--777},
url = {https://www.usenix.org/conference/nsdi22/presentation/olteanu},
publisher = {USENIX Association},
month = apr,
}

An edge-queued datagram service for all data-center traffic

Abstract

Modern datacenters support a wide range of protocols and in-network switch enhancements aimed at improving performance. Unfortunately, most new and legacy protocols and enhancements often don’t coexist gracefully because they inevitably interact via queuing in the network.

In this paper we describe EQDS, a new datagram service for datacenters that moves almost all of the queuing out of the core network and into the sending host. This enables it to support multiple (conflicting) higher layer protocols, while only sending packets into the network when decided by a receiver-driven credit scheme. EQDS can speed-up legacy TCP and RDMA stacks and enable transport protocol evolution, while benefiting from future switch enhancements without needing to modify higher layer stacks. We show through simulation and multiple implementations that EQDS can reduce FCT of legacy TCP by 2x, improve the NVMeOF throughput by 30%, and safely run TCP alongside RDMA on the same network.

BibTeX

Costin Raiciu, Vladimir Olteanu, Adrian Popa, Mark Handley, Haggai Eran, Mark Silberstein, Dragos Dumitrescu, Cristi Baciu, Georgios Nikolaidis

[NSDI 2022] SwiSh: Distributed Shared State Abstractions for Programmable Switches

SwiSh: Distributed Shared State Abstractions for Programmable Switches

BibTeX

@inproceedings {zeno2022NSDI,
author = {Lior Zeno and Dan R. K. Ports and Jacob Nelson and Daehyeok Kim and Shir Landau-Feibish and Idit Keidar and Arik Rinberg and Alon Rashelbach and Igor De-Paula and Mark Silberstein},
title = {{SwiSh}: Distributed Shared State Abstractions for Programmable Switches},
booktitle = {19th USENIX Symposium on Networked Systems Design and Implementation (NSDI 22)},
year = {2022},
address = {Renton, WA},
pages = {171--191},
url = {https://www.usenix.org/conference/nsdi22/presentation/zeno},
publisher = {USENIX Association},
month = apr,
}

SwiSh: Distributed Shared State Abstractions for Programmable Switches

Abstract

We design and evaluate SwiSh, a distributed shared state management layer for data-plane P4 programs. SwiSh enables running scalable stateful distributed network functions on programmable switches entirely in the data-plane. We explore several schemes to build a shared variable abstraction, which differ in consistency, performance, and in-switch implementation complexity.

We introduce the novel Strong Delayed-Writes (SDW) protocol which offers consistent snapshots of shared data-plane objects with semantics known as strong r-relaxed linearizability, enabling implementation of distributed concurrent sketches with precise error bounds.

We implement strong, eventual, and SDW consistency protocols in Tofino switches, and compare their performance in microbenchmarks and three realistic network functions, NAT, DDoS detector, and rate limiter. Our results demonstrate that the general distributed state management in the data plane is practical, and outperforms any centralized solution by up to four orders of magnitude in update throughput and replication latency.

BibTeX

Lior Zeno, Dan Ports, Jacob Nelson, Daehyeok Kim, Shir Landau Feibish, Idit Keidar, Arik Rinberg, Alon Rashelbach, Igor De-Paula, Mark Silberstein

[ASPLOS 2022] Revizor: Testing Black-box CPUs against Speculation Contracts

Revizor: Testing Black-box CPUs against Speculation Contracts

BibTeX

@inproceedings{oleksenko2022ASPLOS
author = {Oleksenko, Oleksii and Fetzer, Christof and K\"{o}pf, Boris and Silberstein, Mark},
title = {Revizor: Testing Black-Box CPUs against Speculation Contracts},
year = {2022},
publisher = {Association for Computing Machinery},
address = {New York, NY, USA},
booktitle = {Proceedings of the 27th ACM International Conference on Architectural Support for Programming Languages and Operating Systems},
pages = {226–239},
numpages = {14},
keywords = {mds, spectre, information flow, microarchitecture, testing, contracts},
location = {Lausanne, Switzerland},
series = {ASPLOS 2022}
}

Revizor: Testing Black-box CPUs against Speculation Contracts

Abstract

Speculative vulnerabilities such as Spectre and Meltdown expose speculative execution state that can be exploited to leak information across security domains via side-channels. Such vulnerabilities often stay undetected for a long time as we lack the tools for systematic testing of CPUs to find them.

In this paper, we propose an approach to automatically detect microarchitectural information leakage in commercial black-box CPUs. We build on speculation contracts, which we employ to specify the permitted side effects of program execution on the CPU’s microarchitectural state. We propose a Model-based Relational Testing (MRT) technique to empirically assess the CPU compliance with these specifications.

We implement MRT in a testing framework called Revizor, and showcase its effectiveness on real Intel x86 CPUs. Revizor automatically detects violations of a rich set of contracts or indicates their absence. A highlight of our findings is that Revizor managed to automatically surface Spectre, MDS, and LVI, as well as several previously unknown variants.

BibTeX

Oleksii Oleksenki, Boris Köpf, Christof Fetzer, Mark Silberstein

[ASPLOS 2022] FlexDriver: A Network Driver for Your Accelerator

FlexDriver: A Network Driver for Your Accelerator

BibTeX

@inproceedings{eran2022ASPLOS,
author = {Eran, Haggai and Fudim, Maxim and Malka, Gabi and Shalom, Gal and Cohen, Noam and Hermony, Amit and Levi, Dotan and Liss, Liran and Silberstein, Mark},
title = {FlexDriver: A Network Driver for Your Accelerator},
year = {2022},
publisher = {Association for Computing Machinery},
address = {New York, NY, USA},
booktitle = {Proceedings of the 27th ACM International Conference on Architectural Support for Programming Languages and Operating Systems},
pages = {1115–1129},
numpages = {15},
keywords = {accelerator networking, network function acceleration, accelerator disaggregation},
location = {Lausanne, Switzerland},
series = {ASPLOS 2022}
}

FlexDriver: A Network Driver for Your Accelerator

Abstract

We propose a new system design for connecting hardware and FPGA accelerators to the network that allows the accelerator to directly control commodity Network Interface Cards (NICs) without using the CPU. This allows us to solve the key challenge of leveraging the existing NIC hardware offloads such as virtualization, tunneling, and RDMA for accelerator networking. Our approach supports a diverse set of use cases, from direct network access for disaggregated accelerators to inline-acceleration of the network stack, all without the complex networking logic in the accelerator.

To demonstrate the feasibility of this approach, we build FlexDriver (FLD), an on-accelerator hardware module that implements a NIC data-plane driver. Our main technical contribution is a mechanism that compresses the NIC control structures by two orders of magnitude, allowing FLD to achieve high networking scalability with low die area cost and no bandwidth interference with the accelerator logic.

The prototype for NVIDIA Mellanox Innova-2 FPGA SmartNICs showcases our design’s utility for three different accelerators: a disaggregated LTE cipher, an IP-defragmentation inline accelerator, and an IoT cryptographic-token authentication offload. These accelerators reach 25 Gbps line rate and leverage the NIC for RDMA processing, VXLAN tunneling, and traffic shaping without CPU involvement.

BibTeX

Haggai Eran, Maxim Fudim, Gabi Malka, Gal Shalom, Noam Cohen, Amit Hermony, Dotan Levi, Liran Liss, Mark Silberstein

[NeurIPS 2021] Faster Neural Network Training with Approximate Tensor Operations

Faster Neural Network Training with Approximate Tensor Operations

BibTeX

@inproceedings{
adelman2021faster,
title={Faster Neural Network Training with Approximate Tensor Operations},
author={Menachem Adelman and Kfir Yehuda Levy and Ido Hakimi and Mark Silberstein},
pages={0--33},
booktitle={Thirty-Fifth Conference on Neural Information Processing Systems},
year={2021},
url={https://openreview.net/forum?id=Rz-hPxb6ODl}
}

Faster Neural Network Training with Approximate Tensor Operations

Abstract

We propose a novel technique for faster deep neural network training which systematically applies sample-based approximation to the constituent tensor operations, i.e., matrix multiplications and convolutions. We introduce new sampling techniques, study their theoretical properties, and prove that they provide the same convergence guarantees when applied to SGD training.
We apply approximate tensor operations to single and multi-node training of MLP and CNN networks on MNIST, CIFAR-10 and ImageNet datasets. We demonstrate up to 66% reduction in the amount of computations and communication, and up to 1.37x faster training time while maintaining negligible or no impact on the final test accuracy.

BibTeX

Menachem Edelman, Ido Hakimi, Kfir Yehuda Levy, Mark Silberstein

[ATC 2021] Fine-tuning giant neural networks on commodity hardware with automatic pipeline model parallelism

Fine-tuning giant neural networks on commodity hardware with automatic pipeline model parallelism

BibTeX

@inproceedings {273947,
title = {Fine-tuning giant neural networks on commodity hardware with automatic pipeline model parallelism},
booktitle = {2021 {USENIX} Annual Technical Conference ({USENIX} {ATC} 21)},
year = {2021},
url = {https://www.usenix.org/conference/atc21/presentation/eliad},
publisher = {{USENIX} Association},
month = jul,
}

Fine-tuning giant neural networks on commodity hardware with automatic pipeline model parallelism

Abstract

Fine-tuning is an increasingly common technique that leverages transfer learning to dramatically expedite the training of huge, high-quality models. Critically, fine-tuning holds the potential to make giant state-of-the-art models pre-trained on high-end super-computing-grade systems readily available for users that lack access to such costly resources.
Unfortunately, this potential is still difficult to realize because the models often do not fit in the memory of a single commodity GPU, making fine-tuning a challenging problem.

We present FTPipe, a system that explores a new dimension of pipeline model parallelism, making multi-GPU execution of fine-tuning tasks for giant neural networks readily accessible on commodity hardware. A key idea is a novel approach to model partitioning and task allocation, called Mixed-pipe. Mixed-pipe partitions the model into arbitrary computational blocks rather than layers, and relaxes the model topology constraints when assigning blocks to GPUs, allowing non-adjacent blocks to be executed on the same GPU. More flexible partitioning affords a much better balance of the compute- and memory-load on the GPUs compared to prior works, yet does not increase the communication overheads. Moreover, and perhaps surprisingly, when applied for asynchronous training, Mixed-pipe has negligible or no effect on the end-to-end accuracy of fine-tuning tasks despite the addition of pipeline stages.

Our extensive experiments on giant state-of-the-art NLP models (BERT-340M, GPT2-1.5B, and T5-3B) show that FTPipie achieves up to 3x speedup and state-of-the-art accuracy when fine-tuning giant transformers with billions of parameters. These models require from 12GB to 59GB of GPU memory, and projname executes them on 8 commodity RTX2080-Ti GPUs, each with 11GB memory and standard PCIe.

BibTeX

Saar Eliad, Ido Hakimi, Alon De Jager, Mark Silberstein, Assaf Schuster

[HotNets 2020] SwiShmem: Distributed Shared State Abstractions for Programmable Switches

SwiShmem: Distributed Shared State Abstractions for Programmable Switches

BibTeX

@inproceedings{zeno20hotnets,
author = {Zeno, Lior and Ports, Dan R. K. and Nelson, Jacob and Silberstein, Mark},
title = {SwiShmem: Distributed Shared State Abstractions for Programmable Switches},
year = {2020},
isbn = {9781450381451},
publisher = {Association for Computing Machinery},
address = {New York, NY, USA},
booktitle = {Proceedings of the 19th ACM Workshop on Hot Topics in Networks},
pages = {160–167},
numpages = {8},
keywords = {distributed state management, programmable switches, network function virtualization, programmable networks},
location = {Virtual Event, USA},
series = {HotNets '20}
}

SwiShmem: Distributed Shared State Abstractions for Programmable Switches

Abstract

Programmable switches provide an appealing platform for running network functions (NFs), such as NATs, firewalls, and DDoS detectors, entirely in data plane, at staggering multi-Tbps processing rates. However, to be used in real deployments with a complex multi-switch topology, one NF instance must be deployed on each switch, which together act as a single logical NF. This requirement poses significant challenges in particular for stateful NFs, due to the need to manage distributed shared NF state among the switches. While considered a solved problem in classical distributed systems, data-plane state sharing requires addressing several unique challenges: high data rate, limited switch memory, and packet loss.

We present the design of SwiShmem, the first distributed shared state management layer for data-plane P4 programs, which facilitates the implementation of stateful distributed NFs on programmable switches. We first, analyze the access patterns and consistency requirements of popular NFs that lend themselves for in-switch execution, and then discuss the design and implementation options while highlighting open research questions.

BibTeX

Lior Zeno, Dan Ports, Jacob Nelson, Mark Silberstein

[WACI 2020] Putting Bugs in Your Data Center Might Actually be a Good Idea

Putting Bugs in Your Data Center Might Actually be a Good Idea

Abstract

Data centers of cloud providers hold millions of processor
cores, exabytes of storage, and petabytes of network bandwidth.
Research shows that in 2019, data centers consumed
more than 2% of global electricity production, where 50% of
consumption targeted for cooling infrastructures. While the
most effective solution for thermal distribution is liquid cooling,
technical challenges and complexities make it expensive.
We suggest using living spiders as cooling devices for data
centers. A prior work shows that spider silk has high thermal
conductivity, close to that of copper: the second-best metallic
conductor. Spiders not only generate spider silk but maintain
it. Recruiting spiders for the job requires no more than inserting
bugs to the data center for the spiders to catch. This
solution is effective, self-sustaining, and environment-friendly,
but requires solving a number of non-trivial technical and
zoological challenges on the way to make it practical.

Alon Rashelbach, Mark Silberstein

[EuroSys 2020] Autarky: Closing controlled channels with self-paging enclaves

Autarky: Closing controlled channels with self-paging enclaves

BibTeX

@inproceedings{ autarky20eurosys,
authors={ Orenbach, Meni and Baumann, Andrew and Silberstein, Mark},
title = {{Autarky: Closing controlled channels with self-paging enclaves},
booktitle={ Fifteenth European Conference on Computer Systems, Heraklion, Greece},
series= {EuroSys ’20},
year= {2020},
}

Autarky: Closing controlled channels with self-paging enclaves

Abstract

As the first widely-deployed secure enclave hardware, Intel SGX shows promise as a practical basis for confidential cloud computing. However, side channels remain SGX’s greatest security weakness. In particular, the “controlled-channel attack” on enclave page faults exploits a longstanding architectural side channel and still lacks effective mitigation.

We propose Autarky: a set of minor, backward-compatible modifications to the SGX ISA that hide an enclave’s page access trace from the host, and give the enclave full control over its page faults. A trusted library OS implements enclave self-paging policy.

We prototype Autarky on current SGX hardware and the Graphene library OS, implementing three paging schemes: a fast software oblivious RAM system made practical by leveraging the proposed ISA, a novel page cluster abstraction for application-aware secure self-paging, and a rate-limiting paging mechanism for unmodified binaries. Overall, Autarky provides a comprehensive defense for controlled-channel attacks which supports efficient secure demand paging, and adds no overheads in page-fault free execution.

BibTeX

Meni Orenbach, Andrew Baumann, Mark Silberstein

[SIGCOMM 2020] A computational approach to packet classification

A computational approach to packet classification

BibTeX

@inproceedings{Rashebach2020SIGCOMM,
author = {Rashelbach, Alon and Rottenstreich, Ori and Silberstein, Mark},
title = {A Computational Approach to Packet Classification},
year = {2020},
isbn = {9781450379557},
publisher = {Association for Computing Machinery},
address = {New York, NY, USA},
url = {https://doi.org/10.1145/3387514.3405886},
doi = {10.1145/3387514.3405886},
booktitle = {Proceedings of the Annual Conference of the ACM Special Interest Group on Data Communication on the Applications, Technologies, Architectures, and Protocols for Computer Communication},
pages = {542–556},
numpages = {15},
keywords = {Packet Classification, Virtual Switches, Neural Networks},
location = {Virtual Event, USA},
series = {SIGCOMM ’20}
}

A computational approach to packet classification

Abstract

Multi-field packet classification is a crucial component in modern software-defined data center networks. To achieve high throughput and low latency, state-of-the-art algorithms strive to fit the rule lookup data structures into on-die caches; however, they do not scale well with the number of rules. We present a novel approach, NuevoMatch, which improves the memory scaling of existing methods. A new data structure, Range Query Recursive Model Index (RQ-RMI), is the key component that enables NuevoMatch to replace most of the accesses to main memory with model inference computations. We describe an efficient training algorithm which guarantees the correctness of the RQ-RMI-based classification. The use of RQ-RMI allows the packet rules to be compressed into model weights that fit into the hardware cache and takes advantage of the growing support for fast neural network processing in modern CPUs, such as wide vector processing engines, achieving a rate of tens of nanoseconds per lookup. Our evaluation using 500K multi-field rules from the standard ClassBench benchmark shows a geomean compression factor of 4.9X, 8X, and 82X, and average performance improvement of 2.7X, 4.4X and 2.6X in latency and 1.3X, 2.2X, and 1.2X in throughput compared to CutSplit, NeuroCuts, and TupleMerge, all state-of-the-art algorithms.

BibTeX

Alon Rashelbach, Ori Rottenstreich, Mark Silberstein

[USENIX Security 2020] SpecFuzz: Bringing Spectre-type vulnerabilities to the surface

SpecFuzz: Bringing Spectre-type vulnerabilities to the surface

BibTeX

@Inproceedings{SpeckFuzz20UsenixSec,
author = {Oleksii Oleksenko and Bohdan Trach and Mark Silberstein and Christof Fetzer},
title = {SpecFuzz: Bringing Spectre-type vulnerabilities to the surface},
booktitle = {29th {USENIX} Security Symposium ({USENIX} Security 20)},
year = {2020},
isbn = {978-1-939133-17-5},
pages = {1481--1498},
url = {https://www.usenix.org/conference/usenixsecurity20/presentation/oleksenko},
publisher = {{USENIX} Association},
month = aug,
}

SpecFuzz: Bringing Spectre-type vulnerabilities to the surface

Abstract

SpecFuzz is the first tool that enables dynamic testing for speculative execution vulnerabilities (e.g., Spectre). The key is a novel concept of speculation exposure: The program is instrumented to simulate speculative execution in software by forcefully executing the code paths that could be triggered due to mispredictions, thereby making the speculative memory accesses visible to integrity checkers (e.g., AddressSanitizer). Combined with the conventional fuzzing techniques, speculation exposure enables more precise identification of potential vulnerabilities compared to state-of-the-art static analyzers.

Our prototype for detecting Spectre V1 vulnerabilities successfully identifies all known variations of Spectre V1 and decreases the mitigation overheads across the evaluated applications, reducing the amount of instrumented branches by up to 93% given a sufficient test coverage.

BibTeX

Oleksii Oleksenko, Bohdan Trach, Mark Silberstein, Christof Fetzer

[ASPLOS 2020] Lynx: a SmartNIC-driven accelerator-centric architecture for network servers

Lynx: a SmartNIC-driven accelerator-centric architecture for network servers

BibTeX

@inproceedings{lynx20Tork,
author = {Tork, Maroun and Maudlej, Lina and Silberstein, Mark},
title = {Lynx: A SmartNIC-Driven Accelerator-Centric Architecture for Network Servers},
year = {2020},
isbn = {9781450371025},
publisher = {Association for Computing Machinery},
address = {New York, NY, USA},
url = {https://doi.org/10.1145/3373376.3378528},
doi = {10.1145/3373376.3378528},
booktitle = {Proceedings of the Twenty-Fifth International Conference on Architectural Support for Programming Languages and Operating Systems},
pages = {117–131},
numpages = {15},
keywords = {server architecture, hardware accelerators, smartnics, i/o services for accelerators, operating systems},
location = {Lausanne, Switzerland},
series = {ASPLOS ’20}
}

Lynx: a SmartNIC-driven accelerator-centric architecture for network servers

Abstract

This paper explores new opportunities afforded by the growing deployment of compute and I/O accelerators to improve the performance and efficiency of hardware-accelerated com-
puting services in data centers.

We propose Lynx, an accelerator-centric network server architecture that offloads the server data and control planes to the SmartNIC, and enables direct networking from accelerators via a lightweight hardware-friendly I/O mechanism. Lynx enables the design of hardware-accelerated network servers that run without CPU involvement, freeing CPU cores and improving performance isolation for accelerated services. It is portable across accelerator architectures and allows the management of both local and remote accelerators, seamlessly scaling beyond a single physical machine.

We implement and evaluate Lynx on GPUs and the Intel Visual Compute Accelerator, as well as two SmartNIC architectures – one with an FPGA, and another with an 8-core ARM processor. Compared to a traditional host-centric approach, Lynx achieves over 4× higher throughput for a GPU-centric face verification server, where it is used for GPU communications with an external database, and 25% higher throughput for a GPU-accelerated neural network inference service. For this workload, we show that a single SmartNIC may drive 4 local and 8 remote GPUs while achieving linear performance scaling without using the host CPU.

BibTeX

Maroun Tork, Lina Maudlej, Mark Silberstein

[ACM TOCS 2019] SPIN: Seamless OS integration of Peer-to-Peer DMA between SSDs and GPUs

SPIN: Seamless OS integration of Peer-to-Peer DMA between SSDs and GPUs

BibTeX

@article{spin19TOCS,
author = {Bergman, Shai and Brokhman, Tanya and Cohen, Tzachi and Silberstein, Mark},
title = {SPIN: Seamless Operating System Integration of Peer-to-Peer DMA Between SSDs and GPUs},
year = {2019},
issue_date = {April 2019},
publisher = {Association for Computing Machinery},
address = {New York, NY, USA},
volume = {36},
number = {2},
issn = {0734-2071},
url = {https://doi.org/10.1145/3309987},
doi = {10.1145/3309987},
journal = {ACM Trans. Comput. Syst.},
month = apr,
articleno = {Article 5},
numpages = {26},
keywords = {I/O subsystem, Accelerators, operating systems, file systems, GPU}
}

SPIN: Seamless OS integration of Peer-to-Peer DMA between SSDs and GPUs

Abstract

Recent GPUs enable Peer-to-Peer Direct Memory Access (p2p) from fast peripheral devices like NVMe SSDs to exclude the CPU from the data path between them for efficiency. Unfortunately, using p2p to access files is challenging because of the subtleties of low-level non-standard interfaces, which bypass the OS file I/O layers and may hurt system performance. Developers must possess intimate knowledge of low-level interfaces to manually handle the subtleties of data consistency and misaligned accesses.

We present SPIN, which integrates p2p into the standard OS file I/O stack, dynamically activating p2p where appropriate, transparently to the user. It combines p2p with page cache accesses, re-enables read-ahead for sequential reads, all while maintaining standard POSIX FS consistency, portability across GPUs and SSDs, and compatibility with virtual block devices such as software RAID.

We evaluate SPIN on NVIDIA and AMD GPUs using standard file I/O benchmarks, application traces, and end-to-end experiments. SPIN achieves significant performance speedups across a wide range of workloads, exceeding p2p throughput by up to an order of magnitude. It also boosts the performance of an aerial imagery rendering application by 2.6× by dynamically adapting to its input-dependent file access pattern, enables 3.3× higher throughput for a GPU-accelerated log server, and enables 29% faster execution for the highly optimized GPU-accelerated image collage with only 30 changed lines of code.

BibTeX

Shai Bergman, Tanya Brokhman, Tsahi Cohen, Mark Silberstein

Extended version of the ATC'17 paper

[PACT 2019] Achieving Scalability in a k-NN Multi-GPU Network Service with Centaur

Achieving Scalability in a k-NN Multi-GPU Network Service with Centaur

BibTeX

@INPROCEEDINGS{8891620,
author={A. {Watad} and A. {Libov} and O. {Shacham} and E. {Bortnikov} and M. {Silberstein}},
booktitle={2019 28th International Conference on Parallel Architectures and Compilation Techniques (PACT)},
title={Achieving Scalability in a k-NN Multi-GPU Network Service with Centaur},
year={2019},
volume={},
number={},
pages={245-257},
keywords={concurrency (computers);file servers;graphics processing units;nearest neighbour methods;parallel processing;storage management;multiGPU distributed data flow runtime;GPU management overheads;CPU load;k-NN multiGPU network service;Centaur;GPU-centric architecture;network request processing;CPU-driven server architecture;k-nearest-neighbors network server;scalability;parallel computing;high-concurrency memory-demanding server applications;Graphics processing units;Servers;Kernel;Throughput;Clustering algorithms;Approximation algorithms;Computer architecture;GPU;Parallel Computing},
doi={10.1109/PACT.2019.00027},
ISSN={1089-795X},
month={Sep.},}

Achieving Scalability in a k-NN Multi-GPU Network Service with Centaur

Abstract

Centaur is a GPU-centric architecture for building a low-latency approximate k-Nearest-Neighbors network server. We implement a multi-GPU distributed data flow runtime which enables efficient and scalable network request processing on GPUs. The runtime eliminates GPU management overheads from the CPU, making the server throughput and response time largely agnostic to the CPU load, speed or the number of dedicated CPU cores. Our experiments systems show that our server achieves near-perfect scaling for 16 GPUs, beating the throughput of a highly-optimized CPU-driven server by 35% while maintaining about 2msec average request latency. Furthermore, it requires only a single CPU core to run, achieving over an order of magnitude higher throughput than the standard CPU-driven server architecture in this setting.

BibTeX

Amir Watad, Alexander Libov, Ohad Shacham, Edward Bortnikov, Mark Silberstein

[USENIX ATC 2019] GAIA: An OS Page Cache for Heterogeneous Systems

GAIA: An OS Page Cache for Heterogeneous Systems

BibTeX

@inproceedings{10.5555/3358807.3358864,
author = {Brokhman, Tanya and Lifshits, Pavel and Silberstein, Mark},
title = {GAIA: An OS Page Cache for Heterogeneous Systems},
year = {2019},
isbn = {9781939133038},
publisher = {USENIX Association},
address = {USA},
booktitle = {Proceedings of the 2019 USENIX Conference on Usenix Annual Technical Conference},
pages = {661–674},
numpages = {14},
location = {Renton, WA, USA},
series = {USENIX ATC ’19}
}

GAIA: An OS Page Cache for Heterogeneous Systems

Abstract

We propose a principled approach to integrating GPU memory with an OS page cache. We design GAIA, a weakly-consistent page cache that spans CPU and GPU memories. GAIA enables the standard mmap system call to map files into the GPU address space, thereby enabling data-dependent GPU accesses to large files and efficient write-sharing between the CPU and GPUs. Under the hood, GAIA (1) integrates lazy release consistency protocol into the OS page cache while maintaining backward compatibility with CPU processes and unmodified GPU kernels; (2) improves CPU I/O performance by using data cached in GPU memory, and (3) optimizes the readahead prefetcher to support accesses to files cached in GPUs.

We prototype GAIA in Linux and evaluate it on NVIDIA Pascal GPUs. We show up to 3× speedup in CPU file I/O and up to 8× in unmodified realistic workloads such as Gunrock GPU-accelerated graph processing, image collage, and microscopy image stitching.

BibTeX

Tanya Brokhman, Pavel Lifshits, Mark Silberstein

[USENIX ATC 2019] CoSMIX: A Compiler-based System for Secure Memory Instrumentation and Execution in Enclaves

CoSMIX: A Compiler-based System for Secure Memory Instrumentation and Execution in Enclaves

BibTeX

@inproceedings {234958,
author = {Meni Orenbach and Yan Michalevsky and Christof Fetzer and Mark Silberstein},
title = {CoSMIX: A Compiler-based System for Secure Memory Instrumentation and Execution in Enclaves},
booktitle = {2019 {USENIX} Annual Technical Conference ({USENIX} {ATC} 19)},
year = {2019},
isbn = {978-1-939133-03-8},
address = {Renton, WA},
pages = {555--570},
url = {https://www.usenix.org/conference/atc19/presentation/orenbach},
publisher = {{USENIX} Association},
month = jul,
}

CoSMIX: A Compiler-based System for Secure Memory Instrumentation and Execution in Enclaves

Abstract

Hardware secure enclaves are increasingly used to run complex applications. Unfortunately, existing and emerging enclave architectures do not allow secure and efficient implementation of custom page fault handlers. This limitation impedes in-enclave use of secure memory-mapped files and prevents extensions of the application memory layer commonly used in untrusted systems, such as transparent memory compression or access to remote memory.

CoSMIX is a Compiler-based system for Secure Memory Instrumentation and eXecution of applications in secure enclaves. A novel memory store abstraction allows implementation of application-level secure page fault handlers that are invoked by a lightweight enclave runtime. The CoSMIX compiler instruments the application memory accesses to use one or more memory stores, guided by a global instrumentation policy or code annotations without changing application code.

The CoSMIX prototype runs on Intel SGX and is compatible with popular SGX execution environments, including SCONE and Graphene. Our evaluation of several production applications shows how CoSMIX improves their security and performance by recompiling them with appropriate memory stores. For example, unmodified Redis and Memcached key-value stores achieve about 2× speedup by using a self-paging memory store while working on datasets up to 6× larger than the enclave’s secure memory. Similarly, annotating a single line of code in a biometric verification server changes it to store its sensitive data in Oblivious RAM and makes it resilient against SGX side-channel attacks.

BibTeX

Meni Orenbach, Yan Michalevsky, Christof Fetzer, Mark Silberstein

[USENIX ATC 2019] NICA: An Infrastructure for Inline Acceleration of Network Applications

NICA: An Infrastructure for Inline Acceleration of Network Applications

BibTeX

@inproceedings {234884,
author = {Haggai Eran and Lior Zeno and Maroun Tork and Gabi Malka and Mark Silberstein},
title = {{NICA}: An Infrastructure for Inline Acceleration of Network Applications},
booktitle = {2019 {USENIX} Annual Technical Conference ({USENIX} {ATC} 19)},
year = {2019},
isbn = {978-1-939133-03-8},
address = {Renton, WA},
pages = {345--362},
url = {https://www.usenix.org/conference/atc19/presentation/eran},
publisher = {{USENIX} Association},
month = jul,
}

NICA: An Infrastructure for Inline Acceleration of Network Applications

Abstract

With rising network rates, cloud vendors increasingly deploy FPGA-based SmartNICs (F-NICs), leveraging their inline processing capabilities to offload hypervisor networking infrastructure. However, the use of F-NICs for accelerating general-purpose server applications in clouds has been limited.

NICA is a hardware-software co-designed framework for inline acceleration of the application data plane on F-NICs in multi-tenant systems. A new ikernel programming abstraction, tightly integrated with the network stack, enables application control of F-NIC computations that process application network traffic, with minimal code changes. In addition, NICA’s virtualization architecture supports fine-grain time-sharing of F-NIC logic and provides I/O path virtualization. Together these features enable cost-effective sharing of F-NICs across virtual machines with strict performance guarantees.

We prototype NICA on Mellanox F-NICs and integrate ikernels with the high-performance VMA network stack and the KVM hypervisor. We demonstrate significant acceleration of real-world applications in both bare-metal and virtualized environments, while requiring only minor code modifications to accelerate them on F-NICs. For example, a transparent key-value store cache ikernel added to the stock memcached server reaches 40 Gbps server throughput (99% line-rate) at 6 μs 99th-percentile latency for 16-byte key-value pairs, which is 21× the throughput of a 6-core CPU with a kernel-bypass network stack. The throughput scales linearly for up to 6 VMs running independent instances of memcached.

BibTeX

Haggai Eran, Lior Zeno, Maroun Tork, Gabi Malka, Mark Silberstein

[FCCM 2019] Design Patterns for Code Reuse in HLS Packet Processing Pipelines

Design Patterns for Code Reuse in HLS Packet Processing Pipelines

BibTeX

@INPROCEEDINGS{8735559,
author={H. {Eran} and L. {Zeno} and Z. {István} and M. {Silberstein}},
booktitle={2019 IEEE 27th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM)},
title={Design Patterns for Code Reuse in HLS Packet Processing Pipelines},
year={2019},
volume={},
number={},
pages={208-217},
keywords={field programmable gate arrays;high level synthesis;logic design;software libraries;class library;FPGA-based SmartNICs;code reuse;HLS packet processing pipelines;high-level synthesis;high-speed networking applications;UDP stateless firewall;key-value store cache;FPGA circuits;Optimization;Tools;C++ languages;Logic gates;Hardware;Field programmable gate arrays;Data structures;High level synthesis;Design methodology;Networking;Packet processing},
doi={10.1109/FCCM.2019.00036},
ISSN={2576-2613},
month={April},}

Design Patterns for Code Reuse in HLS Packet Processing Pipelines

Abstract

High-level synthesis (HLS) allows developers to be more productive in designing FPGA circuits thanks to familiar programming languages and high-level abstractions. In order to create high-performance circuits, HLS tools, such as Xilinx Vivado HLS, require following specific design patterns and techniques. Unfortunately, when applied to network packet processing tasks, these techniques limit code reuse and modularity, requiring developers to use deprecated programming conventions. We propose a methodology for developing high-speed networking applications using Vivado HLS for C++, focusing on reusability, code simplicity, and overall performance. Following this methodology, we implement a class library (ntl) with several building blocks that can be used in a wide spectrum of networking applications. We evaluate the methodology by implementing two applications: a UDP stateless firewall and a key-value store cache designed for FPGA-based SmartNICs, both processing packets at 40Gbps line-rate.

BibTeX

Haggai Eran, Lior Zeno, Zsolt István, Mark Silberstein

[MICRO Top Picks 2019] Breaking Virtual Memory Protection and the SGX Ecosystem with Foreshadow

Breaking Virtual Memory Protection and the SGX Ecosystem with Foreshadow

BibTeX

@ARTICLE{8691527,
author={J. {Van Bulck} and M. {Minkin} and O. {Weisse} and D. {Genkin} and B. {Kasikci} and F. {Piessens} and M. {Silberstein} and T. F. {Wenisch} and Y. {Yarom} and R. {Strackx}},
journal={IEEE Micro},
title={Breaking Virtual Memory Protection and the SGX Ecosystem with Foreshadow},
year={2019},
volume={39},
number={3},
pages={66-74},
keywords={security of data;software architecture;trusted computing;virtual machines;virtualisation;virtual memory protection;SGX ecosystem;foreshadow;speculative execution attack;security guarantees;virtual machines;physical memory;Intel Software Guard eXtensions;Program processors;Ecosystems;Microarchitecture;Kernel;Side-channel attacks},
doi={10.1109/MM.2019.2910104},
ISSN={1937-4143},
month={May},}

Breaking Virtual Memory Protection and the SGX Ecosystem with Foreshadow

Abstract

Foreshadow is a speculative execution attack that allows adversaries to subvert the security guarantees of Intel’s Software Guard eXtensions (SGX). Foreshadow allows access to data across process boundaries, and allows virtual machines (VMs) to read the physical memory belonging to other VMs or the hypervisor.

BibTeX

Jo Van Bulck, Marina Minkin, Ofir Weisse, Daniel Genkin, Baris Kasikci, Frank Piessens, Mark Silberstein, Thomas F. Wenisch, Yuval Yarom, Raoul Strackx

Selected for publication in IEEE Micro Top Picks

[SFMA'19 2019] One Interface to Rule them All: A Hardware/Software Co-Design for Disaggregated Computing

One Interface to Rule them All: A Hardware/Software Co-Design for Disaggregated Computing

BibTeX

@misc{ caladan-position,
authors={Lluis Vilanova and Yoav Etsion and Mark Silberstein},
title = {{One Interface to Rule them All: A Hardware/Software
Co-Design for Disaggregated Computing}},
series = {SFMA'19},
}

One Interface to Rule them All: A Hardware/Software Co-Design for Disaggregated Computing

Abstract

Datacenters are moving towards a paradigm of pooling resources (e.g., CPUs, storage and accelerators) into separate nodes to lower costs through easier hardware upgradability and higher resource utilization when running applications with heterogeneous demands.
A single request to an application can trigger a chain of accesses to multiple devices, but each device has wildly different hardware capabilities which expose vastly different data and control interfaces. As a result, applications cannot securely span all these devices in a way that keeps the cost and simplicity benefits of disaggregation while maintaining efficiency.

In this paper, we propose extending NICs to implement a model of continuation-based computations inspired in dataflow, which is used to weave the execution flow of applications across hardware devices without the need for each device to know each other’s communication protocol.

To achieve this, we lean on the observation that modern technology trends like device
self-virtualization, multi-queue designs, RDMA and remote device transports (e.g., NVMe over fabric [14]) can be extended to allow devices to interact with each other without the need for intermediate software layers. Existing NICs can be easily extended to trigger such continuations as a response to device command completions, translating a continuation into a request directed at the next device on the processing pipeline.

BibTeX

Lluis Vilanova, Yoav Etsion, Mark Silberstein Position paper

[USENIX Security 2018] Foreshadow: Extracting the Keys to the Intel SGX Kingdom with Transient Out-of-Order Execution

Foreshadow: Extracting the Keys to the Intel SGX Kingdom with Transient Out-of-Order Execution

BibTeX

@inproceedings{vanbulck2018foreshadow,
author = {Van Bulck, Jo and Minkin, Marina and Weisse, Ofir and Genkin, Daniel and Kasikci, Baris and
Piessens, Frank and Silberstein, Mark and Wenisch, Thomas F. and Yarom, Yuval and Strackx, Raoul},
title = {Foreshadow: Extracting the Keys to the {Intel SGX} Kingdom with Transient Out-of-Order Execution},
booktitle = {Proceedings of the 27th {USENIX} Security Symposium},
year = {2018},
month = {August},
publisher = {{USENIX} Association},
note={See also technical report Foreshadow-NG~cite{weisse2018foreshadowNG}}
}

Foreshadow: Extracting the Keys to the Intel SGX Kingdom with Transient Out-of-Order Execution

Abstract

Trusted execution environments, and particularly the Software Guard eXtensions (SGX) included in recent Intel x86 processors, gained significant traction in recent years. A long track of research papers, and increasingly also real-world industry applications, take advantage of the strong hardware-enforced confidentiality and integrity guarantees provided by Intel SGX. Ultimately, enclaved execution holds the compelling potential of securely offloading sensitive computations to untrusted remote platforms.

We present Foreshadow, a practical software-only microarchitectural attack that decisively dismantles the security objectives of current SGX implementations. Crucially, unlike previous SGX attacks, we do not make any assumptions on the victim enclave’s code and do not necessarily require kernel-level access. At its core, Foreshadow abuses a speculative execution bug in modern Intel processors, on top of which we develop a novel exploitation methodology to reliably leak plaintext enclave secrets from the CPU cache. We demonstrate our attacks by extracting full cryptographic keys from Intel’s vetted architectural enclaves, and validate their correctness by launching rogue production enclaves and forging arbitrary local and remote attestation responses. The extracted remote attestation keys affect millions of devices.

BibTeX

Jo Van Bulck, Marina Minkin, Ofir Weisse, Daniel Genkin, Baris Kasikci, Frank Piessens, Mark Silberstein, Thomas F. Wenisch, Yuval Yarom, Raoul Strackx

First Prize in CSAW Regional Competition

[PETS 2018] Power to peep-all: Inference Attacks by Malicious Batteries on Mobile Devices

Power to peep-all: Inference Attacks by Malicious Batteries on Mobile Devices

BibTeX

@article{lifshits2018power,
title={Power to peep-all: Inference attacks by malicious batteries on mobile devices},
author={Lifshits, Pavel and Forte, Roni and Hoshen, Yedid and Halpern, Matt and Philipose, Manuel and Tiwari, Mohit and Silberstein, Mark},
journal={Proceedings on Privacy Enhancing Technologies},
volume={2018},
number={4},
pages={141--158},
year={2018},
publisher={Sciendo}
}

Power to peep-all: Inference Attacks by Malicious Batteries on Mobile Devices

Abstract

Mobile devices are equipped with increasingly smart batteries designed to provide responsiveness and extended lifetime. However, such smart batteries may present a threat to users’ privacy. We demonstrate that the phone’s power trace sampled from the battery at 1KHz holds enough information to recover a variety of sensitive information.

We show techniques to infer characters typed on a touchscreen; to accurately recover browsing history in an open-world setup; and to reliably detect incoming calls, and the photo shots including their lighting conditions. Combined with a novel exfiltration technique that establishes a covert channel from the battery to a remote server via a web browser, these attacks turn the malicious battery into a stealthy surveillance device. We deconstruct the attack by analyzing its robustness to sampling rate and execution conditions. To find mitigations we identify the sources of the information leakage exploited by the attack. We discover that the GPU or DRAM power traces alone are sufficient to distinguish between different websites. However, the CPU and power-hungry peripherals such as a touchscreen are the primary sources of fine-grain information leakage.

We consider and evaluate possible mitigation mechanisms, highlighting the challenges to defend against the attacks. In summary, our work shows the feasibility of the malicious battery and motivates further research into system and application-level defenses to fully mitigate this emerging threat.

BibTeX

Pavel Lifshits, Roni Forte, Yedid Hoshen, Matt Halpern, Manuel Philipose, Mohit Tiwari, Mark Silberstein

Third Prize in CSAW Regional Competition

[USENIX ATC 2018] Varys: Protecting SGX enclaves from practical side-channel attacks

Varys: Protecting SGX enclaves from practical side-channel attacks

BibTeX

@inproceedings {216033,
author = {Oleksii Oleksenko and Bohdan Trach and Robert Krahn and Mark Silberstein and Christof Fetzer},
title = {Varys: Protecting {SGX} Enclaves from Practical Side-Channel Attacks},
booktitle = {2018 {USENIX} Annual Technical Conference ({USENIX} {ATC} 18)},
year = {2018},
isbn = {ISBN 978-1-939133-01-4},
address = {Boston, MA},
pages = {227--240},
url = {https://www.usenix.org/conference/atc18/presentation/oleksenko},
publisher = {{USENIX} Association},
month = jul,
}

Varys: Protecting SGX enclaves from practical side-channel attacks

Abstract

Numerous recent works have experimentally shown that Intel Software Guard Extensions (SGX) are vulnerable to cache timing and page table side-channel attacks which could be used to circumvent the data confidentiality guarantees provided by SGX. Existing mechanisms that protect against these attacks either incur high execution costs, are ineffective against certain attack variants, or require significant code modifications.

We present Varys, a system that protects unmodified programs running in SGX enclaves from cache timing and page table side-channel attacks. Varys takes a pragmatic approach of strict reservation of physical cores to security-sensitive threads, thereby preventing the attacker from accessing shared CPU resources during enclave execution. The key challenge that we are addressing is that of maintaining the core reservation in the presence of an untrusted OS.

Varys fully protects against all L1/L2 cache timing attacks and significantly raises the bar for page table side-channel attacks – all with only 15% overhead on average for Phoenix and PARSEC benchmarks. Additionally, we propose a set of minor hardware extensions that hold the potential to extend Varys’ security guarantees to L3 cache and further improve its performance.

BibTeX

Oleksii Oleksenki, Bohdan Trach, Robert Krahn, Andre Martin, Mark Silberstein, Christof Fetzer

[ARXIV 2018] You shall not bypass: Employing data dependencies to prevent bounds check bypass

You shall not bypass: Employing data dependencies to prevent bounds check bypass

BibTeX

@article{DBLP:journals/corr/abs-1805-08506,
author = {Oleksii Oleksenko and
Bohdan Trach and
Tobias Reiher and
Mark Silberstein and
Christof Fetzer},
title = {You Shall Not Bypass: Employing data dependencies to prevent Bounds
Check Bypass},
journal = {CoRR},
volume = {abs/1805.08506},
year = {2018},
url = {http://arxiv.org/abs/1805.08506},
archivePrefix = {arXiv},
eprint = {1805.08506},
timestamp = {Mon, 13 Aug 2018 16:48:45 +0200},
biburl = {https://dblp.org/rec/journals/corr/abs-1805-08506.bib},
bibsource = {dblp computer science bibliography, https://dblp.org}
}

You shall not bypass: Employing data dependencies to prevent bounds check bypass

Abstract

A recent discovery of a new class of microarchitectural attacks called Spectre picked up the attention of the security community as these attacks can circumvent many
traditional mechanisms of defence. One of the attacks— Bounds Check Bypass—can neither be efficiently solved on system nor architectural levels and requires changes in the application itself. So far, the proposed mitigations involved serialization, which reduces the usage of CPU resources and causes high overheads. In this report, we explore methods of delaying the vulnerable instructions
without complete serialization. We discuss several ways of achieving it and compare them with Speculative Load Hardening, an existing solution based on a similar idea. The solutions of this type cause 60% overhead across Phoenix benchmark suite, which compares favourably to the full serialization causing 440% slowdown.

BibTeX

Oleksii Oleksenko, Bohdan Trach, Mark Silberstein, Christof Fetzer

[USENIX ATC 2017] SPIN: Seamless OS integration of Peer-to-Peer DMA between SSDs and GPUs

SPIN: Seamless OS integration of Peer-to-Peer DMA between SSDs and GPUs

BibTeX

@inproceedings {203153,
author = {Shai Bergman and Tanya Brokhman and Tzachi Cohen and Mark Silberstein},
title = {{SPIN}: Seamless Operating System Integration of Peer-to-Peer {DMA} Between SSDs and GPUs},
booktitle = {2017 {USENIX} Annual Technical Conference ({USENIX} {ATC} 17)},
year = {2017},
isbn = {978-1-931971-38-6},
address = {Santa Clara, CA},
pages = {167--179},
url = {https://www.usenix.org/conference/atc17/technical-sessions/presentation/bergman},
publisher = {{USENIX} Association},
month = jul,
}

SPIN: Seamless OS integration of Peer-to-Peer DMA between SSDs and GPUs

Abstract

Recent GPUs enable Peer-to-Peer Direct Memory Access (P2P) from fast peripheral devices like NVMe SSDs to exclude the CPU from the data path between them for efficiency. Unfortunately, using P2P to access files is challenging because of the subtleties of low-level nonstandard interfaces, which bypass the OS file I/O layers and may hurt system performance.

SPIN integrates P2P into the standard OS file I/O stack, dynamically activating P2P where appropriate, transparently to the user. It combines P2P with page cache accesses, re-enables read-ahead for sequential reads, all while maintaining standard POSIX FS consistency, portability across GPUs and SSDs, and compatibility with virtual block devices such as software RAID.

We evaluate SPIN on NVIDIA and AMD GPUs using standard file I/O benchmarks, application traces and end-to-end experiments. SPIN achieves significant performance speedups across a wide range of workloads, exceeding P2P throughput by up to an order of magnitude. It also boosts the performance of an aerial imagery rendering application by 2.6× by dynamically adapting to its input-dependent file access pattern, and enables 3.3× higher throughput for a GPU-accelerated log server.

BibTeX

Shai Bergman, Tanya Brokhman, Tsahi Cohen, Mark Silberstein

[HotOS 2017] OmniX: an accelerator-centric OS for omni-programmable systems

OmniX: an accelerator-centric OS for omni-programmable systems

BibTeX

@inproceedings{OmniX,
author = {Silberstein, Mark},
title = {OmniX: An Accelerator-Centric OS for Omni-Programmable Systems},
year = {2017},
isbn = {9781450350686},
publisher = {Association for Computing Machinery},
address = {New York, NY, USA},
url = {https://doi.org/10.1145/3102980.3102992},
doi = {10.1145/3102980.3102992},
booktitle = {Proceedings of the 16th Workshop on Hot Topics in Operating Systems},
pages = {69–75},
numpages = {7},
location = {Whistler, BC, Canada},
series = {HotOS ’17}
}

OmniX: an accelerator-centric OS for omni-programmable systems

Abstract

Future systems will be omni-programmable: alongside CPUs, GPUs and FPGAs,
they will execute user code near-storage, near-network, near-memory, or on other
Near-X accelerator Units, NXUs}.
This paper explores the design space of OS support for omni-programmable systems,
aiming to simplify the development of efficient applications that span multiple
heterogeneous processors and near-data accelerators.
OmniX is an accelerator-centric OS architecture that extends standard OS
abstractions, such as task execution and I/O, into NXUs while maintaining a coherent view of the system among all the processors.OmniX enables NXUs to directly invoke
tasks and access I/O services among themselves, excluding the CPU from the performance-critical
control plane operations. The host CPU serves as a controller — for protection,
device configuration and monitoring. We discuss the hardware trends
that motivate our work, outline OmniX design principles, and sketch the core implementation ideas while highlighting missing hardware features, in the hope of motivating hardware vendors to implement them soon.

BibTeX

Mark Silberstein

[EuroSys 2017] Eleos: Exit-Less OS Services for SGX Enclaves

Eleos: Exit-Less OS Services for SGX Enclaves

BibTeX

@inproceedings{Eleos,
author = {Orenbach, Meni and Lifshits, Pavel and Minkin, Marina and Silberstein, Mark},
title = {Eleos: ExitLess OS Services for SGX Enclaves},
year = {2017},
isbn = {9781450349383},
publisher = {Association for Computing Machinery},
address = {New York, NY, USA},
url = {https://doi.org/10.1145/3064176.3064219},
doi = {10.1145/3064176.3064219},
booktitle = {Proceedings of the Twelfth European Conference on Computer Systems},
pages = {238–253},
numpages = {16},
location = {Belgrade, Serbia},
series = {EuroSys ’17}
}

Eleos: Exit-Less OS Services for SGX Enclaves

Abstract

Intel Software Guard eXtensions (SGX) enable secure and trusted execution of user code in an isolated enclave to protect against a powerful adversary. Unfortunately, running I/O-intensive, memory-demanding server applications in enclaves leads to significant performance degradation. Such applications put a substantial load on the in-enclave system call and secure paging mechanisms, which turn out to be the main reason for the application slowdown. In addition to the
high direct cost of thousands-of-cycles long SGX management instructions, these mechanisms incur the high indirect cost of enclave exits due to associated TLB flushes and processor state pollution.
We tackle these performance issues in Eleos by enabling exit-less system calls and exit-less paging in enclaves. Eleos introduces a novel Secure User-managed Virtual Memory (SUVM) abstraction that implements application-level paging inside the enclave. SUVM eliminates the overheads of
enclave exits due to paging, and enables new optimizations such as sub-page granularity of accesses. We thoroughly evaluate Eleos on a range of microbenchmarks and two real server applications, achieving notable system performance gains. memcached and a face verification server running in-enclave with Eleos, achieves up to 2.2× and 2.3× higher throughput respectively while working on datasets up to 5× larger than the enclave’s secure physical memory.

BibTeX

Meni Orenbach, Marina Minkin, Tsahi Cohen, Mark Silberstein

[GPGPU 2017] Understanding The Security of Discrete GPUs

Understanding The Security of Discrete GPUs

BibTeX

@incollection{zhu2017understanding,
title={Understanding the security of discrete GPUs},
author={Zhu, Zhiting and Kim, Sangman and Rozhanski, Yuri and Hu, Yige and Witchel, Emmett and Silberstein, Mark},
booktitle={Proceedings of the General Purpose GPUs},
pages={1--11},
year={2017},
series = {GPGPU' 17}
}

Understanding The Security of Discrete GPUs

Abstract

GPUs have become an integral part of modern systems, but their implications for system security are not yet clear. This paper demonstrates both that discrete GPUs cannot be used as secure
co-processors and that GPUs provide a stealthy platform for malware. First, we examine a recent proposal to use discrete GPUs as secure co-processors and show that the security guarantees of
the proposed system do not hold on the GPUs we investigate. Second, we demonstrate that (under certain circumstances) it is possible to bypass IOMMU protections and create stealthy, long-lived
GPU-based malware. We demonstrate a novel attack that compromises the in-kernel GPU driver and one that compromises GPU microcode to gain full access to CPU physical memory. In general,
we find that the highly sophisticated, but poorly documented GPU hardware architecture, hidden behind obscure close-source device drivers and vendor-specific APIs, not only make GPUs a poor
choice for applications requiring strong security, but also make GPUs into a security threat.

BibTeX

Zhiting Zhu, Sangman Kim, Yuri Rozhanski, Yige Hu, Emmett Witchel, Mark Silberstein

[EuroCrypt'17 2017] Computational integrity with a public random string from quasi-linear PCPs

Computational integrity with a public random string from quasi-linear PCPs

BibTeX

@inproceedings{ben2017computational,
title={Computational integrity with a public random string from quasi-linear PCPs},
author={Ben-Sasson, Eli and Bentov, Iddo and Chiesa, Alessandro and Gabizon, Ariel and Genkin, Daniel and Hamilis, Matan and Pergament, Evgenya and Riabzev, Michael and Silberstein, Mark and Tromer, Eran and Virza, Madars},
booktitle={Annual International Conference on the Theory and Applications of Cryptographic Techniques},
pages={551--579},
year={2017},
organization={Springer}
}

Computational integrity with a public random string from quasi-linear PCPs

Abstract

A party running a computation remotely may benefit from misreporting its output,
say, to lower its tax. Cryptographic protocols that detect and prevent such falsities hold the promise to enhance the security of decentralized systems with stringent computational integrity requirements, like Bitcoin [Nak09]. To gain public trust it is imperative to use publicly verifiable protocols that have no “backdoors” and which can be set up using only a short public random string. Probabilistically Checkable Proof (PCP) systems [BFL90, BFLS91, AS98, ALM + 98] can be used to construct astonishingly efficient protocols [Kil92, Mic00] of this nature but some of the main
components of such systems — proof composition [AS98] and low-degree testing via PCPs of Proximity (PCPPs) [BGH + 05, DR06] — have been considered efficient only asymptotically, for unrealistically large computations; recent cryptographic alternatives [PGHR13, BCG + 13a] suffer from a non-public setup phase. This work introduces SCI, the first implementation of a scalable PCP system (that uses both PCPPs and proof composition). We used SCI to prove correctness of executions of up to 2 20 cycles of a simple processor and calculated its break-even
point [SVP + 12, SMBW12]. The significance of our findings is two-fold: (i) it marks the transition of core PCP techniques (like proof composition and PCPs of Proximity ) from mathematical theory to practical system engineering, and (ii) the thresholds obtained are nearly achievable and hence show that PCP-supported computational integrity is closer to reality than previously assumed.

BibTeX

Eli Ben-Sasson, Iddo Ben-Tov, Alessandro Chiesa, Ariel Gabizon, Daniel Genkin, Matan Hamilis, Evgenya Pergament, Michael Riabzev, Mark Silberstein, Eran Tromer, Madars Virza Annual International Conference on the Theory and Applications of Cryptographic Techniques

[SysTex 2016] SGX Enclaves as Accelerators

SGX Enclaves as Accelerators

BibTeX

@inproceedings{orenbach16systex,
title={SGX Enclaves as Accelerators},
author={Meni Orenbach and Mark Silberstein},
booktitle={1st Workshop on System Software for Trusted Execution},
year={2016},
}

SGX Enclaves as Accelerators

Abstract

Intel SGX enclaves is a novel technology that holds the promise to revolutionize the way secure and trustworthy applications are built. However, from the perspective of interaction with the rest
of the system, some of the enclave’s characteristics are remarkably similar to the characteristics of traditional hardware accelerators, such as GPUs. For example, enclaves suffer from significant in-
vocation overheads, offer space-constrained private memory, and cannot directly invoke OS services such as network or file I/O. Over the course of GPU computing evolution, there have been developed many techniques to improve system performance and programmability. Our key observation is that the conceptual similarities between enclaves and accelerators may help to build efficient runtime support for enclaves by learning from past experience with GPUs.
We demonstrate this simple idea by implementing SGXIO, a simple yet powerful enhancement to the current SGX runtime which boosts the performance of I/O system calls from enclaves. SGXIO
design is almost identical to the design of GPUfs and GPUnet systems for efficient I/O services for GPU programs. Our preliminary evaluation shows that GXIO improves the performance of
a simple network parameter server for distributed machine learning by up to 3.7×. These promising results suggest new ways to design more efficient runtime and system services for enclaves.

BibTeX

@inproceedings{orenbach16systex,
title={SGX Enclaves as Accelerators},
author={Meni Orenbach and Mark Silberstein},
booktitle={1st Workshop on System Software for Trusted Execution},
year={2016},
}

Meni Orenbach, Mark Silberstein

[ACM TOCS 2016] GPUnet: networking abstractions for GPUs

GPUnet: networking abstractions for GPUs

BibTeX

@article{silberstein16GPUnet
author = {Silberstein, Mark and Kim, Sangman and Huh, Seonggu and Zhang, Xinya and Hu, Yige and Wated, Amir and Witchel, Emmett},
title = {GPUnet: Networking Abstractions for GPU Programs},
year = {2016},
issue_date = {September 2016},
publisher = {ACM},
address = {New York, NY, USA},
volume = {34},
number = {3},
issn = {0734-2071},
doi = {10.1145/2963098},
journal = {ACM Transactions on Computer Systems},
month = sep,
articleno = {Article 9},
numpages = {31},
}

GPUnet: networking abstractions for GPUs

Abstract

Despite the popularity of GPUs in high-performance and scientific computing, and despite increasingly general-purpose hardware capabilities, the use of GPUs in network servers or distributed systems poses significant challenges.

GPUnet is a native GPU networking layer that provides a socket abstraction and high-level networking APIs for GPU programs. We use GPUnet to streamline the development of high-performance, distributed applications like in-GPU-memory MapReduce and a new class of low-latency, high-throughput GPU-native network services such as a face verification server.

BibTeX

Mark Silberstein, Sangman Kim, Amir Watad, Yige Hu, Xinya Zhang, Seonggu Huh, Emmett Witchel

Extended version of the OSDI'14 paper, Fast-track acceptance

[ROSS 2016] GPUrdma: GPU-side library for high performance networking from GPU kernels

GPUrdma: GPU-side library for high performance networking from GPU kernels

BibTeX

@inproceedings{ross16gpurdma,
author = {Daoud, Feras and Watad, Amir and Silberstein, Mark},
title = {GPUrdma: GPU-Side Library for High Performance Networking from GPU Kernels},
year = {2016},
isbn = {9781450343879},
publisher = {ACM},
url = {https://doi.org/10.1145/2931088.2931091},
doi = {10.1145/2931088.2931091},
booktitle = {Proceedings of the 6th International Workshop on Runtime and Operating Systems for Supercomputers},
articleno = {Article 6},
numpages = {8},
keywords = {Networking, accelerators, Operating Systems Design, GPGPUs},
location = {Kyoto, Japan},
series = {ROSS ’16}
}

GPUrdma: GPU-side library for high performance networking from GPU kernels

Abstract

We present GPUrdma, a GPU-side library for performing Remote Direct Memory Accesses (RDMA) across the network directly from GPU kernels. The library executes no code on CPU, directly accessing the Host Channel Adapter (HCA) Infiniband hardware for both control and data. Slow single-thread GPU performance and the intricacies of the GPU-to-network adapter interaction pose a significant challenge. We describe several design options and analyze their performance implications in detail.

We achieve 5usec one-way communication latency and up to 50Gbit/sec transfer bandwidth for messages from 16KB and larger between K40c NVIDIA GPUs across the network. Moreover, GPUrdma outperforms the CPU RDMA for smaller packets ranging from 2 to 1024 bytes by factor of 4.5x thanks to greater parallelism of transfer requests enabled by highly parallel GPU hardware.

We use GPUrdma to implement a subset of the global address space programming interface (GPI) for point-to-point asynchronous RDMA messaging. We demonstrate our preliminary results using two simple applications — ping-pong and a multi-matrix-vector product with constant matrix and multiple vectors — each running on two different machines connected by Infiniband. Our basic ping-pong implementation achieves 5%higher performance than the baseline using GPI-2. The improved ping-pong implementation with per-threadblock communication overlap enables a further 20% improvement. The multi-matrix-vector product is up to 4.5x faster thanks to higher throughput for small messages and the ability to keep the matrix in fast GPU shared memory while receiving new inputs.

GPUrdma prototype is not yet suitable for production systems due to hardware constraints in the current generation of NVIDIA GPUs which we discuss in detail. However, our results highlight the great potential of GPU-side native networking, and encourage further research toward scalable, high-performance, a heterogeneous networking infrastructure.

BibTeX

Feras Daoud, Amir Watad, Mark Silberstein

Best Paper Award

[SYSTOR 2016] Supporting data-driven I/O on GPUs using GPUfs

Supporting data-driven I/O on GPUs using GPUfs

BibTeX

@inproceedings{gpufs16systor,
author = {Shahar, Sagi and Silberstein, Mark},
title = {Supporting Data-Driven I/O on GPUs Using GPUfs},
year = {2016},
isbn = {9781450343817},
publisher = {ACM},
url = {https://doi.org/10.1145/2928275.2928276},
doi = {10.1145/2928275.2928276},
booktitle = {Proceedings of the 9th ACM International on Systems and Storage Conference},
articleno = {Article 12},
numpages = {11},
keywords = {GPGPUs, Operating Systems, File Systems},
location = {Haifa, Israel},
series = {SYSTOR ’16}
}

Supporting data-driven I/O on GPUs using GPUfs

Abstract

Using discrete GPUs for processing very large datasets is challenging, in particular when an algorithm exhibit unpredictable, data-driven access patterns. In this paper, we investigate the utility of GPUfs, a library that provides direct access to files from GPU programs, to implement such algorithms. We analyze the system’s bottlenecks, and suggest several modifications to the GPUfs design, including new concurrent hash table for the buffer cache and a highly parallel memory allocator. We also show that by implementing the workload in a warp-centric manner we can improve the performance even further. We evaluate our changes by implementing a real image processing application which creates collages from a dataset of 10 Million images. The enhanced GPUfs design improves the application performance by 5.6× on average over the original GPUfs, and outperforms both 12-core parallel CPU which uses the AVX instruction set, and a standard CUDA-based GPU implementation by up to 2.5× and 3× respectively, while significantly enhancing system programmability and simplifying the application design and implementation.

BibTeX

Sagi Shachar, Mark Silberstein

[ICS 2016] Fast Multiplication in Binary Fields on GPUs via Register Cache

Fast Multiplication in Binary Fields on GPUs via Register Cache

BibTeX

@inproceedings{gpufft16ics,
author = {Ben-Sasson, Eli and Hamilis, Matan and Silberstein, Mark and Tromer, Eran},
title = {Fast Multiplication in Binary Fields on GPUs via Register Cache},
year = {2016},
isbn = {9781450343619},
publisher = {ACM},
url = {https://doi.org/10.1145/2925426.2926259},
doi = {10.1145/2925426.2926259},
booktitle = {Proceedings of the 2016 International Conference on Supercomputing},
articleno = {Article 35},
numpages = {12},
keywords = {Finite Field Multiplication, GPGPU, SIMD, Parallel Algorithms, GPU Code Optimization},
location = {Istanbul, Turkey},
series = {ICS ’16}
}

Fast Multiplication in Binary Fields on GPUs via Register Cache

Abstract

Finite fields of characteristic 2 — “binary fields” — are used in a variety of applications in cryptography and data storage. Multiplication of two finite field elements is a fundamental operation and a well-known computational bottleneck in many of these applications, as they often require multiplication of a large number of elements. In this work we focus on accelerating multiplication in “large” binary fields of sizes greater than 232. We devise a new parallel algorithm optimized for execution on GPUs. This algorithm makes it possible to multiply large number of finite field elements and achieves high performance via bit-slicing and fine-grained parallelization.

The key to the efficient implementation of the algorithm is a novel performance optimization methodology we call the register cache. This methodology speeds up an algorithm that caches its input in shared memory by transforming the code to use per-thread registers instead. We show how to replace shared memory accesses with the shuffle() intra-warp communication instruction, thereby significantly reducing or even eliminating shared memory accesses. We thoroughly analyze the register cache approach and characterize its benefits and limitations.

We apply the register cache methodology to the implementation of the binary finite field multiplication algorithm on GPUs. We achieve up to 138x speedup for fields of size 232 over the popular, highly optimized Number Theory Library (NTL) [26], which uses the specialized CLMUL CPU instruction, and over 30x for larger fields of size below 2256. Our register cache implementation enables up to 50% higher performance compared to the traditional shared-memory based design.

BibTeX

Matan Hamilis, Eli Ben-Sasson, Eran Tromer, Mark Silberstein

Also published in NVIDIA Developers Blog

[ISCA 2016] ActivePointers: A Case for Software Address Translation on GPUs

ActivePointers: A Case for Software Address Translation on GPUs

BibTeX

@INPROCEEDINGS{activepointers16isca,
author={Shahar, Sagi and Bergman, Shai and Silberstein, Mark},
booktitle={2016 ACM/IEEE 43rd Annual International Symposium on Computer Architecture (ISCA)},
title={{ActivePointers: A Case for Software Address Translation on GPUs}},
year={2016},
pages={596-608}
series = {ISCA'16}
}

ActivePointers: A Case for Software Address Translation on GPUs

Abstract

Modern discrete GPUs have been the processors of choice for accelerating compute-intensive applications, but using them in large-scale data processing is extremely challenging. Unfortunately, they do not provide important I/O abstractions long established in the CPU context, such as memory mapped files, which shield programmers from the complexity of buffer and I/O device management. However, implementing these abstractions on GPUs poses a problem: the limited GPU virtual memory system provides no address space management and page fault handling mechanisms to GPU developers, and does not allow modifications to memory mappings for running GPU programs.

We implement ActivePointers, a software address translation layer and paging system that introduces native support for page faults and virtual address space management to GPU programs, and enables the implementation of fully functional memory mapped files on commodity GPUs. Files mapped into GPU memory are accessed using active pointers, which behave like regular pointers but access the GPU page cache under the hood, and trigger page faults which are handled on the GPU. We design and evaluate a number of novel mechanisms, including a translation cache in hardware registers and translation aggregation for deadlock-free page fault handling of threads in a single warp.

We extensively evaluate ActivePointers on commodity NVIDIA GPUs using microbenchmarks, and also implement a complex image processing application that constructs a photo collage from a subset of 10 million images stored in a 40GB file. The GPU implementation maps the entire file into GPU memory and accesses it via active pointers. The use of active pointers adds only up to 1% to the application’s runtime, while enabling speedups of up to 3.9× over a combined CPU+GPU implementation and 2.6× over a 12-core CPU-only implementation which uses AVX vector instructions.

BibTeX

Sagi Shachar, Shai Bergman, Mark Silberstein

[EuroSys 2016] Optimizing Distributed Actor Systems for Dynamic Interactive Services

Optimizing Distributed Actor Systems for Dynamic Interactive Services

BibTeX

@inproceedings{actop16eurosys,
author = {Newell, Andrew and Kliot, Gabriel and Menache, Ishai and Gopalan, Aditya and Akiyama, Soramichi and Silberstein, Mark},
title = {Optimizing Distributed Actor Systems for Dynamic Interactive Services},
year = {2016},
isbn = {9781450342407},
publisher = {ACM},
url = {https://doi.org/10.1145/2901318.2901343},
doi = {10.1145/2901318.2901343},
booktitle = {Proceedings of the Eleventh European Conference on Computer Systems},
articleno = {Article 38},
numpages = {15},
location = {London, United Kingdom},
series = {EuroSys ’16}
}

Optimizing Distributed Actor Systems for Dynamic Interactive Services

Abstract

Distributed actor systems are widely used for developing interactive scalable cloud services, such as social networks and online games. By modelling an application as a dynamic set of lightweight communicating “actors”, developers can easily build complex distributed applications, while the underlying runtime system deals with low-level complexities of a distributed environment.

We present ActOp—a data-driven, application-independent runtime mechanism for optimizing end-to-end service latency of actor-based distributed applications. ActOp targets the two dominant factors affecting latency: the overhead of remote inter-actor communications across servers, and the intra-server queuing delay. ActOp automatically identifies frequently communicating actors and migrates them to the same server transparently to the running application. The migration decisions are driven by a novel scalable distributed graph partitioning algorithm which does not rely on a single server to store the whole communication graph, thereby enabling efficient actor placement even for applications with rapidly changing graphs (e.g., chat services). Further, each server autonomously reduces the queuing delay by learning an internal queuing model and configuring threads according to instantaneous request rate and application demands.

We prototype ActOp by integrating it with Orleans — a popular open-source actor system [4, 13]. Experiments with realistic workloads show latency improvements of up to 75% for the 99th percentile, up to 63% for the mean, with up to 2x increase in peak system throughput.

BibTeX

Andrew Newell, Gabriel Kliot, Ishai Menashe, Aditya Gopalan, Soramichi Akiyama, Mark Silberstein

[GPGPU 2016] GPUpIO: The Case for I/O-Driven Preemption on GPUs

GPUpIO: The Case for I/O-Driven Preemption on GPUs

BibTeX

@inproceedings{GPUPIO16GPGPU,
author = {Zeno, Lior and Mendelson, Avi and Silberstein, Mark},
title = {GPUpIO: The Case for I/O-Driven Preemption on GPUs},
year = {2016},
publisher = {ACM},
@inproceedings{10.1145/2884045.2884053,
author = {Zeno, Lior and Mendelson, Avi and Silberstein, Mark},
title = {GPUpIO: The Case for I/O-Driven Preemption on GPUs},
year = {2016},
isbn = {9781450341950},
publisher = {ACM},
url = {https://doi.org/10.1145/2884045.2884053},
doi = {10.1145/2884045.2884053},
booktitle = {Proceedings of the 9th Annual Workshop on General Purpose Processing Using Graphics Processing Unit},
pages = {63–71},
numpages = {9},
keywords = {accelerators, GPGPUs, operating systems design, file systems, source-to-source compiliation},
location = {Barcelona, Spain},
series = {GPGPU ’16}
}

GPUpIO: The Case for I/O-Driven Preemption on GPUs

Abstract

As GPUs become general purpose, they are outgrowing the coprocessor model and require convenient I/O abstractions such as files and network sockets. Recent studies have shown the benefits of native GPU I/O layers, in terms of both programmability and performance. However, due to lack of hardware support, the GPU threads performing I/O calls are forced to busy-wait for the completion of I/O operations, resulting in underutilized hardware, higher power consumption, and reduced system throughput.

We argue that I/O-driven preemption improves the performance of existing solutions, despite many challenging system characteristics such as a large kernel state. We analyze the benefits of adding preemption support using a simple system performance model, and, encouraged by the results, explore the design of a software-based preemption mechanism for GPUs. In our prototype, GPUpIO, we implement a source-to-source compiler for state checkpoint and restoration, and a runtime library for scheduling preempted thread-blocks, which together enable I/O-driven preemption for GPUs.

We evaluate our prototype across a variety of system parameters and workloads to determine when preemption is worthwhile. We show that in some workloads the I/O-driven preemption approach may indeed double the effective system throughput by completely hiding the I/O latency behind computations. However, we also observe that the software-only solution is currently limited, not only due to its overheads, but also because it does not have sufficient control of the hardware scheduler queue and therefore may lead to starvation of I/O kernels. We then discuss a new hardware feature that, if added, may render a general I/O-driven preemption mechanism on GPUs practical.

BibTeX

Lior Zeno, Avi Mendelson, Mark Silberstein

[OSDI 2014] GPUnet: Networking Abstractions for GPU Programs

GPUnet: Networking Abstractions for GPU Programs

BibTeX

@inproceedings {gpunet14osdi,
author = {Sangman Kim and Seonggu Huh and Xinya Zhang and Yige Hu and Amir Wated and Emmett Witchel and Mark Silberstein},
title = {GPUnet: Networking Abstractions for {GPU} Programs},
booktitle = {11th {USENIX} Symposium on Operating Systems Design and Implementation ({OSDI} 14)},
year = {2014},
isbn = { 978-1-931971-16-4},
address = {Broomfield, CO},
pages = {201--216},
url = {https://www.usenix.org/conference/osdi14/technical-sessions/presentation/kim},
publisher = {{USENIX} Association},
month = oct,
}

GPUnet: Networking Abstractions for GPU Programs

Abstract

GPUnet is a native GPU networking layer that provides a socket abstraction and high-level networking APIs for GPU programs. We use GPUnet to streamline the development of high-performance, distributed applications like in-GPU-memory MapReduce and a new class of low-latency, high-throughput GPU-native network services such as a face verification server.

BibTeX

Sangman Kim, Seonggu Huh, Yige Hu, Xinya Zhang, Emmett Witchel, Amir Watad, Mark Silberstein

[CACM 2014] GPUfs: the case for operating system services on GPUs

GPUfs: the case for operating system services on GPUs

BibTeX

@article{gpufs14cacm,
author = {Silberstein, Mark and Ford, Bryan and Witchel, Emmett},
title = {GPUfs: The Case for Operating System Services on GPUs},
year = {2014},
issue_date = {November 2014},
publisher = {ACM},
address = {New York, NY, USA},
volume = {57},
number = {12},
issn = {0001-0782},
url = {https://doi.org/10.1145/2656206},
doi = {10.1145/2656206},
journal = {Commun. ACM},
month = nov,
pages = {68–79},
numpages = {12}
}

GPUfs: the case for operating system services on GPUs

Abstract

This is a non-technical article that covers the main aspects of the GPUfs file system layer for GPU software that makes operating system abstractions available to GPU code.

BibTeX

Mark Silberstein, Bryan Ford, Emmett Witchel

Invited to Communication of ACM

[SYSTOR 2014] Lazy Means Smart: Reducing Repair Bandwidth Costs in Erasure-coded Distributed Storage

Lazy Means Smart: Reducing Repair Bandwidth Costs in Erasure-coded Distributed Storage

BibTeX

@inproceedings{erasurecoding14systor,
author = {Silberstein, Mark and Ganesh, Lakshmi and Wang, Yang and Alvisi, Lorenzo and Dahlin, Mike},
title = {Lazy Means Smart: Reducing Repair Bandwidth Costs in Erasure-Coded Distributed Storage},
year = {2014},
publisher = {ACM},
url = {https://doi.org/10.1145/2611354.2611370},
doi = {10.1145/2611354.2611370},
booktitle = {Proceedings of International Conference on Systems and Storage},
pages = {1–7},
numpages = {7},
keywords = {Distributed storage systems, Erasure codes, Repair bandwidth},
location = {Haifa, Israel},
series = {SYSTOR 2014}
}

Lazy Means Smart: Reducing Repair Bandwidth Costs in Erasure-coded Distributed Storage

Abstract

Erasure coding schemes provide higher durability at lower storage cost, and thus constitute an attractive alternative to replication in distributed storage systems, in particular for storing rarely accessed “cold” data. These schemes, however, require an order of magnitude higher recovery bandwidth for maintaining a constant level of durability in the face of node failures. In this paper we propose lazy recovery, a technique to reduce recovery bandwidth demands down to the level of replicated storage. The key insight is that a careful adjustment of recovery rate substantially reduces recovery bandwidth, while keeping the impact on read performance and data durability low. We demonstrate the benefits of lazy recovery via extensive simulation using a realistic distributed storage configuration and published component failure parameters. For example, when applied to the commonly used RS(14, 10) code, lazy recovery reduces repair bandwidth by up to 76% even below replication, while increasing the amount of degraded stripes by 0.1 percentage points. Lazy recovery works well with a variety of erasure coding schemes, including the recently introduced bandwidth efficient codes, achieving up to a factor of 2 additional bandwidth savings.

BibTeX

Lakshmi Ganesh, Yang Wang, Lorenzo Alvisi, Mike Dahlin,

Best Paper Award

[ACM TOCS 2014] GPUfs: Integrating a file system with GPUs

GPUfs: Integrating a file system with GPUs

BibTeX

@article{gpufs14tocs,
author = {Silberstein, Mark and Ford, Bryan and Keidar, Idit and Witchel, Emmett},
title = {GPUfs: Integrating a File System with GPUs},
year = {2014},
issue_date = {February 2014},
publisher = {Association for Computing Machinery},
address = {New York, NY, USA},
volume = {32},
number = {1},
issn = {0734-2071},
url = {https://doi.org/10.1145/2553081},
doi = {10.1145/2553081},
journal = {ACM Trans. Comput. Syst.},
month = feb,
articleno = {Article 1},
numpages = {31},
keywords = {operating systems, GPGPUs, operating systems design, file systems, Accelerators}
}

GPUfs: Integrating a file system with GPUs

Abstract

As GPU hardware becomes increasingly general-purpose, it is quickly outgrowing the traditional, constrained GPU-as-coprocessor programming model. This article advocates for extending standard operating system services and abstractions to GPUs in order to facilitate program development and enable harmonious integration of GPUs in computing systems. As an example, we describe the design and implementation of GPUFs, a software layer which provides operating system support for accessing host files directly from GPU programs. GPUFs provides a POSIX-like API, exploits GPU parallelism for efficiency, and optimizes GPU file access by extending the host CPU’s buffer cache into GPU memory. Our experiments, based on a set of real benchmarks adapted to use our file system, demonstrate the feasibility and benefits of the GPUFs approach. For example, a self-contained GPU program that searches for a set of strings throughout the Linux kernel source tree runs over seven times faster than on an eight-core CPU.

BibTeX

Mark Silberstein, Bryan Ford, Idit Keidar, Emmett Witchel

Extended version of the ASPLOS'13 paper, Fast-track acceptance

[ACM UBIQUITY 2014] GPUs: High-performance Accelerators for Parallel Applications.

GPUs: High-performance Accelerators for Parallel Applications.

BibTeX

@article{uniquity,
author = {Silberstein, Mark},
title = {GPUs: High-Performance Accelerators for Parallel Applications: The Multicore Transformation (Ubiquity Symposium)},
year = {2014},
issue_date = {August 2014},
publisher = {Association for Computing Machinery},
address = {New York, NY, USA},
volume = {2014},
number = {August},
url = {https://doi.org/10.1145/2618401},
doi = {10.1145/2618401},
journal = {Ubiquity},
month = aug,
articleno = {Article 1},
numpages = {13}
}

GPUs: High-performance Accelerators for Parallel Applications.

Abstract

Early graphical processing units (GPUs) were designed as high compute density, fixed-function processors ideally crafted to the needs of computer graphics workloads. Today, GPUs are becoming truly first-class computing elements on par with CPUs. Programming GPUs as self-sufficient general-purpose processors is not only hypothetically desirable, but feasible and efficient in practice, opening new opportunities for integration of GPUs in complex software systems.

BibTeX

Mark Silberstein

Invited to Ubiquity Symposium on Parallel Computing

[Bioinformatics 2013] A System for Exact and Approximate Genetic Linkage Analysis of SNP Data in Large Pedigrees

A System for Exact and Approximate Genetic Linkage Analysis of SNP Data in Large Pedigrees

Abstract

Motivation: The use of dense single nucleotide polymorphism (SNP) data in genetic linkage analysis of large pedigrees is impeded by significant technical, methodological and computational challenges. Here we describe Superlink-Online SNP, a new powerful online system that streamlines the linkage analysis of SNP data. It features a fully integrated flexible processing workflow comprising both well-known and novel data analysis tools, including SNP clustering, erroneous data filtering, exact and approximate LOD calculations and maximum-likelihood haplotyping. The system draws its power from thousands of CPUs, performing data analysis tasks orders of magnitude faster than a single computer. By providing an intuitive interface to sophisticated state-of-the-art analysis tools coupled with high computing capacity, Superlink-Online SNP helps geneticists unleash the potential of SNP data for detecting disease genes.

Results: Computations performed by Superlink-Online SNP are automatically parallelized using novel paradigms, and executed on unlimited number of private or public CPUs. One novel service is large-scale approximate Markov Chain–Monte Carlo (MCMC) analysis. The accuracy of the results is reliably estimated by running the same computation on multiple CPUs and evaluating the Gelman–Rubin Score to set aside unreliable results. Another service within the workflow is a novel parallelized exact algorithm for inferring maximum-likelihood haplotyping. The reported system enables genetic analyses that were previously infeasible. We demonstrate the system capabilities through a study of a large complex pedigree affected with metabolic syndrome.

Availability: Superlink-Online SNP is freely available for researchers at http://cbl-hap.cs.technion.ac.il/superlink-snp. The system source code can also be downloaded from the system website.

Mark Silberstein, Omer Weissbrod, Lars Otten, Anna Tzemach, Andrey Anisenia, Omer Shtark, Dvir Tuberg, Eddie Galfrin, Irena Gannon, Adel Shalata, Zvi U. Borochowitz, Rina Dechter, Elizabeth Thompson, Dan Geiger

Detailed description of Superlink-onlineSNP system

[ASPLOS 2013] GPUfs: integrating a file system with GPUs

GPUfs: integrating a file system with GPUs

BibTeX

@inproceedings{gpufs13ASPLOS,
author = {Silberstein, Mark and Ford, Bryan and Keidar, Idit and Witchel, Emmett},
title = {GPUfs: Integrating a File System with GPUs},
year = {2013},
isbn = {9781450318709},
publisher = {ACM},
address = {New York, NY, USA},
url = {https://doi.org/10.1145/2451116.2451169},
doi = {10.1145/2451116.2451169},
booktitle = {Proceedings of the Eighteenth International Conference on Architectural Support for Programming Languages and Operating Systems},
pages = {485–498},
numpages = {14},
keywords = {accelerators, operating systems design, file systems, gpgpus},
location = {Houston, Texas, USA},
series = {ASPLOS ’13}
}

GPUfs: integrating a file system with GPUs

Abstract

GPU hardware is becoming increasingly general purpose, quickly outgrowing the traditional but constrained GPU-as-coprocessor programming model. To make GPUs easier to program and easier to integrate with existing systems, we propose making the host’s file system directly accessible from GPU code.

GPUfs provides a POSIX-like API for GPU programs, exploits GPU parallelism for efficiency, and optimizes GPU file access by extending the buffer cache into GPU memory. Our experiments, based on a set of real benchmarks adopted to use our file system, demonstrate the feasibility and benefits of our approach. For example, we demonstrate a simple self-contained GPU program which searches for a set of strings in the entire tree of Linux kernel source files over seven times faster than an eight-core CPU run.

BibTeX

Mark Silberstein, Bryan Ford, Idit Keidar, Emmett Witchel

Best Paper Runner-up

[OSDI 2012] Eternal Sunshine of the Spotless Machine: Protecting Privacy with Ephemeral Channels

Eternal Sunshine of the Spotless Machine: Protecting Privacy with Ephemeral Channels

BibTeX

@inproceedings{spotless12osdi,
author = {Dunn, Alan M. and Lee, Michael Z. and Jana, Suman and Kim, Sangman and Silberstein, Mark and Xu, Yuanzhong and Shmatikov, Vitaly and Witchel, Emmett},
title = {Eternal Sunshine of the Spotless Machine: Protecting Privacy with Ephemeral Channels},
year = {2012},
isbn = {9781931971966},
publisher = {USENIX Association},
address = {USA},
booktitle = {Proceedings of the 10th USENIX Conference on Operating Systems Design and Implementation},
pages = {61–75},
numpages = {15},
location = {Hollywood, CA, USA},
series = {OSDI’12}
}

Eternal Sunshine of the Spotless Machine: Protecting Privacy with Ephemeral Channels

Abstract

Modern systems keep long memories. As we show in this paper, an adversary who gains access to a Linux system, even one that implements secure deallocation, can recover the contents of applications’ windows, audio buffers, and data remaining in device drivers–long after the applications have terminated.

We design and implement Lacuna, a system that allows users to run programs in “private sessions.” After the session is over, all memories of its execution are erased. The key abstraction in Lacuna is an ephemeral channel, which allows the protected program to talk to peripheral devices while making it possible to delete the memories of this communication from the host. Lacuna can run unmodified applications that use graphics, sound, USB input devices, and the network, with only 20 percentage points of additional CPU utilization.

BibTeX

Alan Dunn, Mike Lee, Suman Jana, Sangman Kim, Mark Silberstein, Yuanzhong Xu, Vitaly Shmatikov, Emmett Witchel

Privacy Enhancement Technologies Award

[SYSTOR 2012] Scheduling processing of real-time data streams on heterogeneous multi-GPU systems

Scheduling processing of real-time data streams on heterogeneous multi-GPU systems

BibTeX

@inproceedings{verner12systor,
author = {Verner, Uri and Schuster, Assaf and Silberstein, Mark and Mendelson, Avi},
title = {Scheduling Processing of Real-Time Data Streams on Heterogeneous Multi-GPU Systems},
year = {2012},
isbn = {9781450314480},
publisher = {Association for Computing Machinery},
address = {New York, NY, USA},
url = {https://doi.org/10.1145/2367589.2367596},
doi = {10.1145/2367589.2367596},
booktitle = {Proceedings of the 5th Annual International Systems and Storage Conference},
articleno = {Article 8},
numpages = {12},
location = {Haifa, Israel},
series = {SYSTOR ’12}
}

Scheduling processing of real-time data streams on heterogeneous multi-GPU systems

Abstract

Processing vast numbers of data streams is a common problem in modern computer systems and is known as the “online big data problem.” Adding hard real-time constraints to the processing makes the scheduling problem a very challenging task that this paper aims to address. In such an environment, each data stream is manipulated by a (different) application and each datum (data packet) needs to be processed within a known deadline from the time it was generated. This work assumes a central compute engine which consists of a set of CPUs and a set of GPUs. The system receives a configuration of multiple incoming streams and executes a scheduler on the CPU side. The scheduler decides where each data stream will be manipulated (on the CPUs or on one of the GPUs), and the order of execution, in a way that guarantees that no deadlines will be missed. Our scheduler finds such schedules even for workloads that require high utilization of the entire system (CPUs and GPUs).

This paper focuses on an environment where all CPUs share a main memory, and are controlled by a single operating system (and a scheduler). The system uses a set of discrete graphic cards, each with its own private main memory. Different memory regions do not share information, and coherency is maintained by the use of explicit memory-copy operations. The paper presents a new algorithm for distributing data and scheduling applications that achieves high utilization of the entire system (CPUs and GPUs), while producing schedules that meet hard real-time constraints.

We evaluate our new proposed algorithm by using the AES-CBC encryption kernel on thousands of streams with realistic distribution of rates and deadlines. The paper shows that on a system with a CPU and two GPU cards, our current framework allows up to 87% more data to be processed per time unit than a similar single-GPU system.

BibTeX

Uri Verner, Avi Mendelson, Assaf Schuster, Mark Silberstein

[IPDPS 2012] ExPERT: Pareto-Efficient Task Replication on Grids and a Cloud

ExPERT: Pareto-Efficient Task Replication on Grids and a Cloud

BibTeX

@INPROCEEDINGS{baryehuda12ipdps,
author={O. A. {Ben-Yehuda} and A. {Schuster} and A. {Sharov} and M. {Silberstein} and A. {Iosup}},
booktitle={2012 IEEE 26th International Parallel and Distributed Processing Symposium},
title={{ExPERT: Pareto-Efficient Task Replication on Grids and a Cloud}},
year={2012},
pages={167-178},
}

ExPERT: Pareto-Efficient Task Replication on Grids and a Cloud

Abstract

Many scientists perform extensive computations by executing large bags of similar tasks (BoTs) in mixtures of computational environments, such as grids and clouds. Although the reliability and cost may vary considerably across these environments, no tool exists to assist scientists in the selection of environments that can both fulfill deadlines and fit budgets. To address this situation, we introduce the Expert BoT scheduling framework. Our framework systematically selects from a large search space the Pareto-efficient scheduling strategies, that is, the strategies that deliver the best results for both make span and cost. Expert chooses from them the best strategy according to a general, user-specified utility function. Through simulations and experiments in real production environments, we demonstrate that Expert can substantially reduce both make span and cost in comparison to common scheduling strategies. For bioinformatics BoTs executed in a real mixed grid + cloud environment, we show how the scheduling strategy selected by Expert reduces both make span and cost by 30%-70%, in comparison to commonly-used scheduling strategies.

BibTeX

Orna Agmon Ben-Yehuda, Assaf Schuster, Artyom Sharov, Mark Silberstein, Alex Iosup

[SOSP 2011] PTask: Operating System Abstractions To Manage GPUs as Compute Devices

PTask: Operating System Abstractions To Manage GPUs as Compute Devices

BibTeX

@inproceedings{sosp11rosbach,
author = {Rossbach, Christopher J. and Currey, Jon and Silberstein, Mark and Ray, Baishakhi and Witchel, Emmett},
title = {PTask: Operating System Abstractions to Manage GPUs as Compute Devices},
year = {2011},
isbn = {9781450309776},
publisher = {Association for Computing Machinery},
address = {New York, NY, USA},
url = {https://doi.org/10.1145/2043556.2043579},
doi = {10.1145/2043556.2043579},
booktitle = {Proceedings of the Twenty-Third ACM Symposium on Operating Systems Principles},
pages = {233–248},
numpages = {16},
keywords = {GPUs, operating systems, gestural interface, OS design, accelerators, GPGPU, dataflow},
location = {Cascais, Portugal},
series = {SOSP ’11}
}

PTask: Operating System Abstractions To Manage GPUs as Compute Devices

Abstract

We propose a new set of OS abstractions to support GPUs and other accelerator devices as first class computing resources. These new abstractions, collectively called the PTask API, support a dataflow programming model. Because a PTask graph consists of OS-managed objects, the kernel has sufficient visibility and control to provide system-wide guarantees like fairness and performance isolation, and can streamline data movement in ways that are impossible under current GPU programming models.

Our experience developing the PTask API, along with a gestural interface on Windows 7 and a FUSE-based encrypted file system on Linux show that the PTask API can provide important system-wide guarantees where there were previously none, and can enable significant performance improvements, for example gaining a 5× improvement in maximum throughput for the gestural interface.

BibTeX

Chris Rossbach, Jon Currey, Mark Silberstein, Baishakhi Ray, Emmett Witchel

[ICS 2011] Processing data streams with hard real-time constraints on heterogeneous systems

Processing data streams with hard real-time constraints on heterogeneous systems

BibTeX

@inproceedings{verner11ics,
author = {Verner, Uri and Schuster, Assaf and Silberstein, Mark},
title = {Processing Data Streams with Hard Real-Time Constraints on Heterogeneous Systems},
year = {2011},
isbn = {9781450301022},
publisher = {Association for Computing Machinery},
address = {New York, NY, USA},
url = {https://doi.org/10.1145/1995896.1995915},
doi = {10.1145/1995896.1995915},
booktitle = {Proceedings of the International Conference on Supercomputing},
pages = {120–129},
numpages = {10},
keywords = {hard real-time, gpu, data streams, batch processing, scheduling, accelerator},
location = {Tucson, Arizona, USA},
series = {ICS ’11}
}

Processing data streams with hard real-time constraints on heterogeneous systems

Abstract

Data stream processing applications such as stock exchange data analysis, VoIP streaming, and sensor data processing pose two conflicting challenges: short per-stream latency — to satisfy the milliseconds-long, hard real-time constraints of each stream, and high throughput — to enable efficient processing of as many streams as possible. High-throughput programmable accelerators such as modern GPUs hold high potential to speed up the computations. However, their use for hard real-time stream processing is complicated by slow communications with CPUs, variable throughput changing non-linearly with the input size, and weak consistency of their local memory with respect to CPU accesses. Furthermore, their coarse grain hardware scheduler renders them unsuitable for unbalanced multi-stream workloads.

We present a general, efficient and practical algorithm for hard real-time stream scheduling in heterogeneous systems. The algorithm assigns incoming streams of different rates and deadlines to CPUs and accelerators. By employing novel stream schedulability criteria for accelerators, the algorithm finds the assignment which simultaneously satisfies the aggregate throughput requirements of all the streams and the deadline constraint of each stream alone.

Using the AES-CBC encryption kernel, we experimented extensively on thousands of streams with realistic rate and deadline distributions. Our framework outperformed the alternative methods by allowing 50% more streams to be processed with provably deadline-compliant execution even for deadlines as short as tens milliseconds. Overall, the combined GPU-CPU execution allows for up to 4-fold throughput increase over highly-optimized multi-threaded CPU-only implementations.

BibTeX

Uri Verner, Assaf Schuster, Mark Silberstein

[CCGrid 2011] Building an online domain-specific computing service over non-dedicated grid and cloud resources: the Superlink-online experience

Building an online domain-specific computing service over non-dedicated grid and cloud resources: the Superlink-online experience

BibTeX

@inproceedings{silberstein2011building,
title={Building an online domain-specific computing service over non-dedicated grid and cloud resources: The superlink-online experience},
author={Silberstein, Mark},
booktitle={2011 11th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing},
pages={174--183},
year={2011},
organization={IEEE}
}

Building an online domain-specific computing service over non-dedicated grid and cloud resources: the Superlink-online experience

Abstract

Linkage analysis is a statistical method used by geneti cists in everyday practice for mapping disease-susceptibility genes in the study of complex diseases. An essential first step in the study of genetic diseases, linkage computations may require years of CPU time. The recent DNA sampling revolution enabled unprecedented sampling density, but made the analysis even more computationally demanding. In this paper we describe a high performance online service for genetic linkage analysis, called Superlink-online. The system enables anyone with Internet access to submit genetic data and analyze it as easily and quickly as if using a supercomputer. The analyses are automatically parallelized and executed on tens of thousands distributed CPUs in multiple clouds and grids.

The first version of the system, which employed up to 3,000 CPUs in UW Madison and Technion Condor pools, has been successfully used since 2006 by hundreds of geneticists worldwide, with over 40 citations in the genetics literature. Here we describe the second version, which substantially improves the scalability and performance of first: it uses over 45,000 non-dedicated hosts, in 10 different grids and clouds, including EC2 and the Superlink@Technion community grid. Improved system performance is obtained through a virtual grid hierarchy with dynamic load balancing and multi-grid overlay via the GridBot system, parallel pruning of short tasks for overhead minimization, and cost-efficient use of cloud resources in reliability-critical execution periods.

These enhancements enabled execution of many previously infeasible analyses, which can now be completed within a few hours. The new version of the system, in production since 2009, has completed over 6500 different runs of over 10 million tasks, with total consumption of 420 CPU years.

BibTeX

Mark Silberstein

Second Prize, IEEE International Scalable Computing Challenge

[SYSTOR 2011] An exact algorithm for energy-efficient acceleration of task trees on CPU/GPU architectures

An exact algorithm for energy-efficient acceleration of task trees on CPU/GPU architectures

BibTeX

@inproceedings{Maruyama2011systor,
author = {Silberstein, Mark and Maruyama, Naoya},
title = {An Exact Algorithm for Energy-Efficient Acceleration of Task Trees on CPU/GPU Architectures},
year = {2011},
isbn = {9781450307734},
publisher = {Association for Computing Machinery},
address = {New York, NY, USA},
url = {https://doi.org/10.1145/1987816.1987826},
doi = {10.1145/1987816.1987826},
booktitle = {Proceedings of the 4th Annual International Conference on Systems and Storage},
articleno = {Article 7},
numpages = {7},
location = {Haifa, Israel},
series = {SYSTOR ’11}
}

An exact algorithm for energy-efficient acceleration of task trees on CPU/GPU architectures

Abstract

We consider the problem of energy-efficient acceleration of applications comprising multiple interdependent tasks forming a dependency tree, on a hypothetical CPU/GPU system where both a CPU and a GPU can be powered off when idle. Each task in the tree can be invoked on either a GPU or a CPU, but the performance may vary: some run faster on a GPU, while others prefer a CPU, making the choice of the lowest-energy processor input dependent. Furthermore, greedily minimizing the energy consumption for each task is suboptimal because of the additional energy required for the communication between the tasks executed on different processors.

We propose an efficient algorithm that takes into account the energy consumption of a CPU and a GPU for each task, as well as the communication costs of data transfers between them, and constructs an optimal acceleration schedule with provably minimal total consumed energy.

We evaluate the algorithm in the context of a real application having a task dependency tree structure and show up to 2.5-fold improvement in the expected energy consumption over the best single processor schedule, and up to 50% improvement over the communication unaware schedule on real inputs. We also show how this algorithm can be used to speedup computations rather than minimize power consumption. We achieve achieve up to a 2-fold speedup in real CPU/GPU systems.

BibTeX

Naoya Maruyama, Mark Silberstein

Best Paper Award

[GPU Computing Gems 2010] Applying software managed caching and CPU-GPU scheduling for accelerating dynamic computations

Applying software managed caching and CPU-GPU scheduling for accelerating dynamic computations

BibTeX

@inbook{scratchpad2011GEMS,
author = {Silberstein, Mark and Schuster, Assaf and Owens, John},
year = {2011},
month = {10},
pages = {501-517},
title = {Applying Software-Managed Caching and CPU/GPU Task Scheduling for Accelerating Dynamic Workloads},
journal = {GPU Computing Gems Jade Edition},
doi = {10.1016/B978-0-12-385963-1.00036-8}
}

Applying software managed caching and CPU-GPU scheduling for accelerating dynamic computations

Abstract

In this chapter we cover two difficult problems frequently encountered by GPU developers: optimizing memory access for kernels with complex input-dependent access patterns, and mapping the computations to a GPU or a CPU in
composite applications with multiple dependent kernels. Both pose a formidable challenge as they require dynamic adaptation and tuning of execution policies to allow high performance for a wide range of inputs. Not meeting these requirements leads to substantial performance penalty.

We develop our solution using simple examples, and then apply them to a real application for computing the probability of evidence in probabilistic networks. The combined techniques of memory optimization and dynamic assignment
result in up to three-fold runtime reduction over the non-optimized version on real inputs from the genetic analysis domain, and up to five-fold over an optimized parallel version running on Intel’s latest dual quad-core 16-thread Nehalem machine. In the first part of the chapter we describe our methodology for solving the memory optimization problem via
software-managed caching by efficiently exploiting the fast scratchpad memory. This technique outperforms the cache-less and the texture memory-based approaches on pre-Fermi GPU architectures as well as the one that uses the Fermi hardware cache alone.

The focus of the second part of the chapter is the algorithm for minimizing the total running time of a complete application comprising multiple interdependent kernels. Both a GPU and a CPU can be used to execute the kernels, but the performance varies greatly for different inputs, calling for dynamic assignment of the computations to a GPU or a CPU at runtime. However, instead of greedily choosing the best performing device for each kernel, the algorithm minimizes the runtime of the complete application by evaluating the performance of all the assignments jointly, including the overhead of
the data transfer between the devices.

BibTeX

Mark Silberstein, Assaf Schuster, John Owens Book chapter

[SC 2009] GridBot: Execution of Bags of Tasks in Multiple Grids

GridBot: Execution of Bags of Tasks in Multiple Grids

BibTeX

@inproceedings{gridbot2009SC,
author = {Silberstein, Mark and Sharov, Artyom and Geiger, Dan and Schuster, Assaf},
title = {GridBot: Execution of Bags of Tasks in Multiple Grids},
year = {2009},
isbn = {9781605587448},
publisher = {Association for Computing Machinery},
address = {New York, NY, USA},
url = {https://doi.org/10.1145/1654059.1654071},
doi = {10.1145/1654059.1654071},
booktitle = {Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis},
articleno = {Article 11},
numpages = {12},
location = {Portland, Oregon},
series = {SC ’09}
}

GridBot: Execution of Bags of Tasks in Multiple Grids

Abstract

We present a holistic approach for efficient execution of bags-of-tasks (BOTs) on multiple grids, clusters, and volunteer computing grids virtualized as a single computing platform. The challenge is twofold: to assemble this compound environment and to employ it for execution of a mixture of throughput- and performance-oriented BOTs, with a dozen to millions of tasks each. Our generic mechanism allows per BOT specification of dynamic arbitrary scheduling and replication policies as a function of the system state, BOT execution state, and BOT priority. We implement our mechanism in the GridBot system and demonstrate its capabilities in a production setup. GridBot has executed hundreds of BOTs with over 9 million jobs during three months alone; these have been invoked on 25,000 hosts, 15,000 from the Superlink@Technion community grid and the rest from the Technion campus grid, local clusters, the Open Science Grid, EGEE, and the UW Madison pool.

BibTeX

Mark Silberstein, Artyom Sharov, Dan Geiger, Assaf Schuster

George Michael Memorial HPC Fellowship Honorable Mention Award

[ICS 2008] Efficient computation of sum-products on GPUs through software-managed cache

Efficient computation of sum-products on GPUs through software-managed cache

BibTeX

@inproceedings{scratchpad2008ics,
author = {Silberstein, Mark and Schuster, Assaf and Geiger, Dan and Patney, Anjul and Owens, John D.},
title = {Efficient Computation of Sum-Products on GPUs through Software-Managed Cache},
year = {2008},
isbn = {9781605581583},
publisher = {Association for Computing Machinery},
address = {New York, NY, USA},
url = {https://doi.org/10.1145/1375527.1375572},
doi = {10.1145/1375527.1375572},
booktitle = {Proceedings of the 22nd Annual International Conference on Supercomputing},
pages = {309–318},
numpages = {10},
keywords = {sum-product, software-managed cache, GPGPU, CUDA},
location = {Island of Kos, Greece},
series = {ICS ’08}
}

Efficient computation of sum-products on GPUs through software-managed cache

Abstract

We present a technique for designing memory-bound algorithms with high data reuse on Graphics Processing Units (GPUs) equipped with close-to-ALU software-managed memory. The approach is based on the efficient use of this memory through the implementation of a software-managed cache. We also present an analytical model for performance analysis of such algorithms.

We apply this technique to the implementation of the GPU-based solver of the sum-product or marginalize a product of functions (MPF) problem, which arises in a wide variety of real-life applications in artificial intelligence, statistics, image processing, and digital communications. Our motivation to accelerate MPF originated in the context of the analysis of genetic diseases, which in some cases requires years to complete on modern CPUs. Computing MPF is similar to computing the chain matrix product of multi-dimensional matrices, but is more difficult due to a complex data-dependent access pattern, high data reuse, and a low compute-to-memory access ratio. Our GPU-based MPF solver achieves up to 2700-fold speedup on random data and 270-fold on real-life genetic analysis datasets on GeForce 8800GTX GPU from NVIDIA over the optimized CPU version on an Intel 2.4GHz Core 2 with a 4MB L2 cache.

BibTeX

Mark Silberstein, Assaf Schuster, Dan Geiger, Anjul Patney, John Owens

[HPDC 2006] Materializing Highly Available Grids

Materializing Highly Available Grids

BibTeX

@INPROCEEDINGS {had2006HPDC,
author = {A. Sharov and M. Livny and G. Kliot and S. Schuster and M. Silberstein},
booktitle = {Proceedings of the 15th IEEE International Symposium on High Performance Distributed Computing},
title = {Short paper: Materializing Highly Available Grids},
year = {2006},
volume = {},
issn = {},
pages = {321-323},
keywords = {virtual computer;highly available grid service;ha condor system;mission-critical component},
doi = {10.1109/HPDC.2006.1652166},
url = {https://doi.ieeecomputersociety.org/10.1109/HPDC.2006.1652166},
publisher = {IEEE Computer Society},
address = {Los Alamitos, CA, USA},
month = {jun}
}

Materializing Highly Available Grids

Abstract

Grids are becoming a mission-critical component in research and industry. The services they provide are thus required to be highly available, contributing to the vision of the grid as a dependable virtual computer of infinite power. However, building highly available services in grid is particularly difficult due to the unique characteristics of the grid environment. We believe that high availability functionality should itself be provided as a service, which can be used by transparently decorating, but not changing, the original services, thus making them highly available. In this work we highlight the major challenges and describe our initial experience in building such a generic high availability service in the context of the Condor system

BibTeX

Mark Silberstein, Gabriel Kliot, Artyom Sharov, Assaf Schuster, Miron Livny Short paper. ,

Integrated with HT-Condor system

[HPDC 2006] Scheduling mixed workloads in multi-grids: the grid execution hierarchy

Scheduling mixed workloads in multi-grids: the grid execution hierarchy

BibTeX

@inproceedings{silberstein2006scheduling,
title={Scheduling mixed workloads in multi-grids: the grid execution hierarchy},
author={Silberstein, Mark and Geiger, Dan and Schuster, Assaf and Livny, Miron},
booktitle={2006 15th IEEE International Conference on High Performance Distributed Computing},
pages={291--302},
year={2006},
organization={IEEE}
}

Scheduling mixed workloads in multi-grids: the grid execution hierarchy

Abstract

Consider a workload in which massively parallel tasks that require large resource pools are interleaved with short tasks that require fast response but consume fewer resources. We aim at achieving high throughput and short response time when scheduling such a workload over a set of uncoordinated grids of varying sizes and performance characteristics.

We propose the concept of a grid execution hierarchy, where available grids are sorted according to their size, and the execution overheads increase with the size of the grids. We devise a scheduling algorithm for this execution hierarchy of grids by adapting the multilevel feedback queue approach to a multi-grid environment. The algorithm finds a grid of the size, availability, and overhead that best matches a task’s resource requirements and expected turnaround time. Our approach is inspired by the shortest processing time first policy (SPTF), in the sense that the task’s processing demands are constantly reevaluated during its run, so that a task is migrated to a more suitable level of the execution hierarchy when appropriate.

We evaluate our approach in the context of the Superlink-online system for processing genetic linkage analysis tasks – a production system consisting of several grids and utilizing tens of thousands of CPU hours a month. With our approach the system provides nearly interactive response time for shorter tasks, while simultaneously serving throughput-oriented massively parallel tasks in an efficient manner.

BibTeX

Mark Silberstein, Dan Geiger, Assaf Schuster, Miron Livny

[American Journal of Human Genetics 2006] Online System for Faster Multipoint Linkage Analysis via Parallel Execution on Thousands of Personal computers

Online System for Faster Multipoint Linkage Analysis via Parallel Execution on Thousands of Personal computers

BibTeX

@article{SILBERSTEIN2006AJHG,
title = "Online System for Faster Multipoint Linkage Analysis via Parallel Execution on Thousands of Personal Computers",
journal = "The American Journal of Human Genetics",
volume = "78",
number = "6",
pages = "922 - 935",
year = "2006",
issn = "0002-9297",
doi = "https://doi.org/10.1086/504158",
url = "http://www.sciencedirect.com/science/article/pii/S0002929707639159",
author = "M. Silberstein and A. Tzemach and N. Dovgolevsky and M. Fishelson and A. Schuster and D. Geiger",
}

Online System for Faster Multipoint Linkage Analysis via Parallel Execution on Thousands of Personal computers

Abstract

Computation of LOD scores is a valuable tool for mapping disease-susceptibility genes in the study of Mendelian and complex diseases. However, computation of exact multipoint likelihoods of large inbred pedigrees with extensive missing data is often beyond the capabilities of a single computer. We present a distributed system called “SUPERLINK-ONLINE,” for the computation of multipoint LOD scores of large inbred pedigrees. It achieves high performance via the efficient parallelization of the algorithms in SUPERLINK, a state-of-the-art serial program for these tasks, and through the use of the idle cycles of thousands of personal computers. The main algorithmic challenge has been to efficiently split a large task for distributed execution in a highly dynamic, nondedicated running environment. Notably, the system is available online, which allows computationally intensive analyses to be performed with no need for either the installation of software or the maintenance of a complicated distributed environment. As the system was being developed, it was extensively tested by collaborating medical centers worldwide on a variety of real data sets, some of which are presented in this article.

BibTeX

Mark Silberstein, Anna Tzemach, Nikolai Dovgolevsky, Maayan Fishelson, Assaf Schuster, Dan Geiger

[TechRep 2004] Designing a CAM-based coprocessor for boosting performance of antivirus software

Designing a CAM-based coprocessor for boosting performance of antivirus software

BibTeX

@article{silberstein2004designing,
title={Designing a CAM-based coprocessor for boosting performance of antivirus software},
author={Silberstein, M},
journal={Technion technical report},
year={2004}
}

Designing a CAM-based coprocessor for boosting performance of antivirus software

Abstract

In this report we investigate the benefits of using a coprocessor coupled with content addrassible memory (CAM) for off-loading of a computation-intensive kernels of antivirus software. Overview of antivirus technologies is presented, followed by performance analysis of real antivirus software to justify the application of coprocessor. High level architecture of the coprocessor and its interaction with main CPU is described. CAM usage is described and performance analysis is presented. A broader perspective of a using CAM-based coprocessor application for string pattern matching, various string operations, e.g. string comparisons, and regular expression matching is discussed.

BibTeX

Mark Silberstein Technion Technical Report