I am the head of the Accelerated Computing Systems Lab.
We use bleeding-edge hardware to build new easy-to-program high-performance systems with strong security. Our projects include new operating systems for GPUs, FPGAs and SmartNICs, data-center scale OS for disaggregated architectures, new defenses against hardware side channels and speculative execution attacks, new compilers and runtimes for secure processors, distributed programs and virtual machines on programmable switches, and machine learning-based hardware systems. Our research results were partially adopted by NVIDIA, Mellanox and Intel. Most of our software is open-source and free.
I teach Intro to OS (046209) and Advanced OS (048080) in Winters, and Accelerated Systems (046278/236278) and Computer Security Principles (046280) in Springs.
I am always looking for ambitious students with a strong technical background, creative and original thinking, and a passion for building innovative computer systems using emerging hardware with real long-term impact on industry and academic research. I also have postdoc and summer internship openings.
We present RQFTL, a demand-based FTL for mobile storage controllers that boosts the effective Logical-To-Physical (L2P) address translation cache capacity over state-of-the-art techniques. RQFTL stores a large part of the L2P cache in a compressed form, and employs a learned data structure called RQRMI that leverages tiny neural nets to quickly find the correct translation entry in the cache. RQFTL uses neural network inference for cache lookups, and rapidly retrains the neural nets to efficiently handle L2P cache updates. It is specifically optimized to achieve high coverage for scattered read accesses, making it suitable for popular read-skewed workloads such as mobile gaming. We evaluate RQFTL on hours-long real-world I/O traces of popular modern mobile apps, including games, video editing, and social networking apps collected on Google Pixel 6a phone. We show that RQFTL outperforms all the state-ofthe-art FTLs in these workloads, increasing the effective L2P cache capacity by over an order of magnitude compared toDFTL and up to 5× over the recent LeaFTL. As a result, it achieves 65%, and 25% lower miss rate compared to DFTL and LeaFTL respectively, under the same SRAM capacity, and allows reduction of the total SRAM capacity of a controller by about a third of that of LeaFTL.
Packet routing in virtual networks requires virtual-to-physical address translation. Whereas the translation mappings are created by a single party (i.e., the network administrator), they are read across the network when routing tenant packets. Existing approaches face an inherent read-write performance tradeoff — either store these mappings in dedicated gateways for fast updates at the cost of slower forwarding or replicate them at end-hosts and suffer from slow updates.
SwitchV2P aims to escape this tradeoff by leveraging the network switches to transparently cache the mappings while learning them from the traffic. This brings the translations closer to the sender, reducing the first packet latency and lookup-related network overheads, and combines that with efficient mapping updates, without interfering with the existing routing policies and deployed gateways. The topology-aware data-plane caching protocol enables the switches to transparently and dynamically adapt to changing network conditions and handle variations of in-switch memory capacity.
Our evaluation shows the benefits of in-network address mapping, including up-to a 7.8X reduction in Flow CompletionTtime, a 4.3X decrease in first packet latency, and a substantial reduction in translation gateway load. Additionally, SwitchV2P achieves up to a 1.9X reduction in bandwidth overhead and requires order-of-magnitude fewer gateways for equivalent performance.
We propose a practical approach to implementing multitenancy on programmable network switches to make in-network acceleration accessible to cloud users. We introduce a Switch Virtual Machine (SwitchVM), that is deployed on the switches and offers an expressive instruction set and program state abstractions. Tenant programs, called data-plane filters (DPFs), are executed on top of SwitchVM in a sandbox with memory, network, and state isolation policies controlled by network operators. The packets that trigger DPF execution include the code to execute or a reference to the dpfs deployed in the switch. dpfs are Turing-complete, may maintain state in the packet and in switch virtual memory, may form a dynamic chain, and may steer packets to desired destinations, all while enforcing the operator’s policies.
We demonstrate that this idea is practical by prototyping projname in P4 on Intel Tofino switches. We describe a variety of use cases that SwitchVM supports, and implement three complex applications from prior works — Key-Value Store cache, Load-aware load balancer and Paxos accelerator. We also show that SwitchVM provides strong performance isolation, zero-overhead runtime programmability, may hold two orders of magnitude more in-switch programs than existing techniques, and may support up to thirty thousand concurrent tenants each with its private state.
Longest Prefix Match engines (LPM) are broadly used in computer systems and especially in modern network devices such as Network Interface Cards (NICs), switches and routers. However, existing LPM hardware fails to scale to millions of rules required by modern
systems, is often optimized for specific applications, and thus is performance-sensitive to the structure of LPM rules.
We describe NeuroLPM, a new architecture for multi-purpose LPM hardware that replaces queries in traditional memory-intensive trie- and hash-table data structures with inference in a lightweight Neural Network-based model, called RQRMI. NeuroLPM scales to millions of rules under small on-die SRAM budget and achieves stable, rule-structure-agnostic performance, allowing its use in a variety of applications. We solve several unique challenges when implementing RQRMI inference in hardware, including minimizing the amount of floating point computations while maintaining query correctness, and scaling the rule-set size while ensuring small, deterministic off-chip memory bandwidth.
We prototype NeuroLPM in Verilog and evaluate it on real-world packet forwarding rule-sets and network traces. NeuroLPM offers substantial scalability benefits without any application-specific optimizations. For example, it is the only algorithm that can serve a 950K-large rule-set at an average of 196M queries per second with 4.5MB of SRAM, only within 2% of the best-case throughput of the state-of-the-art Tree Bitmap and SAIL on smaller rule-sets. With 2MB of SRAM, it reduces the DRAM bandwidth per query, the dominant performance factor, by up to 9× and 3× compared to the state-of-the-art.
Intel® Software Guard Extensions (Intel® SGX) supports the creation of shielded enclaves within unprivileged processes. While enclaves are architecturally protected against malicious system software, Intel SGX’s privileged attacker model could potentially expose enclaves to new powerful side-channel attacks. In this paper, we consider hardware-software co-design countermeasures to an important class of single-stepping attacks that use privileged timer interrupts to precisely step through enclave execution exactly one instruction at a time, as supported, e.g., by the open-source SGX-Step framework. This is a powerful deterministic attack primitive that has been employed in a broad range of high-resolution Intel SGX attacks, but so far remains unmitigated.
We propose AEX-Notify, a flexible hardware ISA extension that makes enclaves interrupt-aware: enclaves can register a trusted handler to be run after an interrupt or exception. AEX-Notify can be used as a building block for implementing countermeasures against different types of interrupt-based attacks in software. With our primary goal to thwart deterministic single-stepping, we first diagnose the underlying hardware behavior to determine the root cause that enables it. We then apply the learned insights to remove this root cause by building an efficient software handler and constant-time disassembler to transparently determine and atomically prefetch the working set of the next enclave application instruction.
The ISA extension we propose in this paper has been incorporated into a revised version of the Intel SGX specification.
Virtual machines (VMs) are used for consolidation, isolation, and provisioning in the cloud, but applications with large working sets are impacted by the overheads of memory address translation in VMs. Existing translation approaches incur non-trivial overheads: (i) nested paging has a worst-case latency that increases with page table depth; and (ii) paravirtualized and shadow paging suffer from high hypervisor intervention costs when updating guest page tables.
We describe Translation Pass-Through (TPT), a new memory virtualization mechanism that achieves near-native performance. TPT enables VMs to control virtual memory translation from guest-virtual to host-physical addresses using one-dimensional page tables. At the same time, inter-VM isolation is enforced by the host by exploiting new hardware support for physical memory tagging in commodity CPUs.
We prototype TPT by modifying the KVM/QEMU hypervisor and enlightening the Linux guest. We evaluate it by emulating the memory tagging mechanism of AMD CPUs. Our conservative performance estimates show that TPT achieves native performance for real-world data center applications, with speedups of up to 2.4 and 1.4 over nested and shadow paging, respectively
Attacks like Spectre abuse speculative execution, one of the key performance optimizations of modern CPUs. Recently, several testing tools have emerged to automatically detect speculative leaks in commercial (black-box) CPUs. However, the testing process is still slow, which has hindered in-depth testing campaigns, and so far prevented the discovery of new classes of leakage. In this paper, we identify the root causes of the performance limitations in existing approaches and propose techniques to overcome these limitations. With these techniques, we improve the testing speed over the state-of-the-art by up to two orders of magnitude. These improvements enable us to run a testing campaign of unprecedented depth on Intel and AMD CPUs. As a highlight, we discover two types of previously unknown speculative leaks (affecting string comparison and division) that have escaped previous manual and automatic analyses.
Tiered memory systems introduce an additional memory level with higher-than-local-DRAM access latency and require sophisticated memory management mechanisms to achieve cost-efficiency and high performance. In particular, recent works focus on byte-addressable tiered memory architectures which offer better performance than pure swap-based systems. We observe that adding disaggregation to a byte-addressable tiered memory architecture requires important design changes that deviate from the common techniques that target lower-latency non-volatile memory systems. Our comprehensive analysis of real workloads on a hardware-emulated tiered memory system shows that the high access latency to disaggregated memory undermines the utility of well-established memory management optimizations. In particular, we find that the use of huge pages and page migration in batches, advocated by recent works, are detrimental to performance in high-latency memory regime due to their negative side effects on the local memory hit rate. Based on these insights, we develop HotBox – a disaggregated memory management subsystem for Linux that strives to maximize the local memory hit rate with low memory management overhead. HotBox introduces only minor changes to the Linux kernel while outperforming state-of-the-art systems on memory-intensive benchmarks by up to 2.25.
We introduce ZNSwap, a novel swap subsystem optimized for the recent Zoned Namespace (ZNS) SSDs. ZNSwap leverages ZNS’s explicit control over data management on the drive and introduces a space-efficient host-side Garbage Collector (GC) for swap storage co-designed with the OS swap logic. ZNSwap enables cross-layer optimizations, such as direct access to the in-kernel swap usage statistics by the GC to enable fine-grain swap storage management, and correct accounting of the GC bandwidth usage in the OS resource isolation mechanisms to improve performance isolation in multi-tenant environments. We evaluate ZNSwap using standard Linux swap benchmarks and two production key-value stores. ZNSwap shows significant performance benefits over the Linux swap on traditional SSDs, such as stable throughput for different memory access patterns, and 10× lower 99th percentile latency and 5× higher throughput for memcached key-value store under realistic usage scenarios.
Disaggregated heterogeneous data centers promise higher efficiency, lower total costs of ownership, and more flexibility for data-center operators. However, the current software stack can levy a high tax on application performance. Applications and OSes are designed for systems where local PCIe-connected devices are centrally managed by CPUs, but this centralization introduces unnecessary messages through the shared data center network in a disaggregated system.
We present FractOS, a distributed OS that is designed to minimize the network overheads of disaggregation in heterogeneous data centers. FractOS elevates disaggregated devices to be first-class citizens in the system enabling direct data transfers and task invocations among them, without centralized application and OS control. FractOS achieves this through:
(1) new mechanisms to express distributed applications across services and disaggregated devices,
(2) new mechanisms that enable devices to securely interact with each other and other data-center services,
(3) a distributed and isolated OS layer that implements these mechanisms and can execute on host CPUs and SmartNICs.
Our prototype shows that FractOS accelerates real-world heterogeneous applications by 47%, while reducing their network traffic by 3x
Open vSwitch (OVS) is a widely used open-source virtual switch implementation. In this work, we seek to scale up OVS to support hundreds of thousands of OpenFlow rules by accelerating the core component of its data-path – the packet classification mechanism. To do so we use NuevoMatch, a recent algorithm that uses neural network inference to match packets, and promises significant scalability and performance benefits. We overcome the primary algorithmic challenge of the slow rule update rate in the vanilla NuevoMatch, speeding it up by over three orders of magnitude. This improvement enables two design options to integrate NuevoMatch with OVS: (1) using it as an extra caching layer in front of OVS’s megaflow cache, and (2) using it to completely replace OVS’s data-path while performing classification directly on OpenFlow rules, and obviating control-path upcalls. Our comprehensive evaluation on real-world packet traces and between 1K to 500K ClassBench rules demonstrates the geometric mean speedups of 1.9times× and 12.3times× for the first and second designs, respectively, with the latter also supporting up to 60K OpenFlow rule updates/second, by far exceeding the original OVS.
Modern datacenters support a wide range of protocols and in-network switch enhancements aimed at improving performance. Unfortunately, most new and legacy protocols and enhancements often don’t coexist gracefully because they inevitably interact via queuing in the network.
In this paper we describe EQDS, a new datagram service for datacenters that moves almost all of the queuing out of the core network and into the sending host. This enables it to support multiple (conflicting) higher layer protocols, while only sending packets into the network when decided by a receiver-driven credit scheme. EQDS can speed-up legacy TCP and RDMA stacks and enable transport protocol evolution, while benefiting from future switch enhancements without needing to modify higher layer stacks. We show through simulation and multiple implementations that EQDS can reduce FCT of legacy TCP by 2x, improve the NVMeOF throughput by 30%, and safely run TCP alongside RDMA on the same network.
We design and evaluate SwiSh, a distributed shared state management layer for data-plane P4 programs. SwiSh enables running scalable stateful distributed network functions on programmable switches entirely in the data-plane. We explore several schemes to build a shared variable abstraction, which differ in consistency, performance, and in-switch implementation complexity.
We introduce the novel Strong Delayed-Writes (SDW) protocol which offers consistent snapshots of shared data-plane objects with semantics known as strong r-relaxed linearizability, enabling implementation of distributed concurrent sketches with precise error bounds.
We implement strong, eventual, and SDW consistency protocols in Tofino switches, and compare their performance in microbenchmarks and three realistic network functions, NAT, DDoS detector, and rate limiter. Our results demonstrate that the general distributed state management in the data plane is practical, and outperforms any centralized solution by up to four orders of magnitude in update throughput and replication latency.
Speculative vulnerabilities such as Spectre and Meltdown expose speculative execution state that can be exploited to leak information across security domains via side-channels. Such vulnerabilities often stay undetected for a long time as we lack the tools for systematic testing of CPUs to find them.
In this paper, we propose an approach to automatically detect microarchitectural information leakage in commercial black-box CPUs. We build on speculation contracts, which we employ to specify the permitted side effects of program execution on the CPU’s microarchitectural state. We propose a Model-based Relational Testing (MRT) technique to empirically assess the CPU compliance with these specifications.
We implement MRT in a testing framework called Revizor, and showcase its effectiveness on real Intel x86 CPUs. Revizor automatically detects violations of a rich set of contracts or indicates their absence. A highlight of our findings is that Revizor managed to automatically surface Spectre, MDS, and LVI, as well as several previously unknown variants.
We propose a new system design for connecting hardware and FPGA accelerators to the network that allows the accelerator to directly control commodity Network Interface Cards (NICs) without using the CPU. This allows us to solve the key challenge of leveraging the existing NIC hardware offloads such as virtualization, tunneling, and RDMA for accelerator networking. Our approach supports a diverse set of use cases, from direct network access for disaggregated accelerators to inline-acceleration of the network stack, all without the complex networking logic in the accelerator.
To demonstrate the feasibility of this approach, we build FlexDriver (FLD), an on-accelerator hardware module that implements a NIC data-plane driver. Our main technical contribution is a mechanism that compresses the NIC control structures by two orders of magnitude, allowing FLD to achieve high networking scalability with low die area cost and no bandwidth interference with the accelerator logic.
The prototype for NVIDIA Mellanox Innova-2 FPGA SmartNICs showcases our design’s utility for three different accelerators: a disaggregated LTE cipher, an IP-defragmentation inline accelerator, and an IoT cryptographic-token authentication offload. These accelerators reach 25 Gbps line rate and leverage the NIC for RDMA processing, VXLAN tunneling, and traffic shaping without CPU involvement.
We propose a novel technique for faster deep neural network training which systematically applies sample-based approximation to the constituent tensor operations, i.e., matrix multiplications and convolutions. We introduce new sampling techniques, study their theoretical properties, and prove that they provide the same convergence guarantees when applied to SGD training.
We apply approximate tensor operations to single and multi-node training of MLP and CNN networks on MNIST, CIFAR-10 and ImageNet datasets. We demonstrate up to 66% reduction in the amount of computations and communication, and up to 1.37x faster training time while maintaining negligible or no impact on the final test accuracy.
Fine-tuning is an increasingly common technique that leverages transfer learning to dramatically expedite the training of huge, high-quality models. Critically, fine-tuning holds the potential to make giant state-of-the-art models pre-trained on high-end super-computing-grade systems readily available for users that lack access to such costly resources.
Unfortunately, this potential is still difficult to realize because the models often do not fit in the memory of a single commodity GPU, making fine-tuning a challenging problem.
We present FTPipe, a system that explores a new dimension of pipeline model parallelism, making multi-GPU execution of fine-tuning tasks for giant neural networks readily accessible on commodity hardware. A key idea is a novel approach to model partitioning and task allocation, called Mixed-pipe. Mixed-pipe partitions the model into arbitrary computational blocks rather than layers, and relaxes the model topology constraints when assigning blocks to GPUs, allowing non-adjacent blocks to be executed on the same GPU. More flexible partitioning affords a much better balance of the compute- and memory-load on the GPUs compared to prior works, yet does not increase the communication overheads. Moreover, and perhaps surprisingly, when applied for asynchronous training, Mixed-pipe has negligible or no effect on the end-to-end accuracy of fine-tuning tasks despite the addition of pipeline stages.
Our extensive experiments on giant state-of-the-art NLP models (BERT-340M, GPT2-1.5B, and T5-3B) show that FTPipie achieves up to 3x speedup and state-of-the-art accuracy when fine-tuning giant transformers with billions of parameters. These models require from 12GB to 59GB of GPU memory, and projname executes them on 8 commodity RTX2080-Ti GPUs, each with 11GB memory and standard PCIe.
Programmable switches provide an appealing platform for running network functions (NFs), such as NATs, firewalls, and DDoS detectors, entirely in data plane, at staggering multi-Tbps processing rates. However, to be used in real deployments with a complex multi-switch topology, one NF instance must be deployed on each switch, which together act as a single logical NF. This requirement poses significant challenges in particular for stateful NFs, due to the need to manage distributed shared NF state among the switches. While considered a solved problem in classical distributed systems, data-plane state sharing requires addressing several unique challenges: high data rate, limited switch memory, and packet loss.
We present the design of SwiShmem, the first distributed shared state management layer for data-plane P4 programs, which facilitates the implementation of stateful distributed NFs on programmable switches. We first, analyze the access patterns and consistency requirements of popular NFs that lend themselves for in-switch execution, and then discuss the design and implementation options while highlighting open research questions.
Data centers of cloud providers hold millions of processor
cores, exabytes of storage, and petabytes of network bandwidth.
Research shows that in 2019, data centers consumed
more than 2% of global electricity production, where 50% of
consumption targeted for cooling infrastructures. While the
most effective solution for thermal distribution is liquid cooling,
technical challenges and complexities make it expensive.
We suggest using living spiders as cooling devices for data
centers. A prior work shows that spider silk has high thermal
conductivity, close to that of copper: the second-best metallic
conductor. Spiders not only generate spider silk but maintain
it. Recruiting spiders for the job requires no more than inserting
bugs to the data center for the spiders to catch. This
solution is effective, self-sustaining, and environment-friendly,
but requires solving a number of non-trivial technical and
zoological challenges on the way to make it practical.
As the first widely-deployed secure enclave hardware, Intel SGX shows promise as a practical basis for confidential cloud computing. However, side channels remain SGX’s greatest security weakness. In particular, the “controlled-channel attack” on enclave page faults exploits a longstanding architectural side channel and still lacks effective mitigation.
We propose Autarky: a set of minor, backward-compatible modifications to the SGX ISA that hide an enclave’s page access trace from the host, and give the enclave full control over its page faults. A trusted library OS implements enclave self-paging policy.
We prototype Autarky on current SGX hardware and the Graphene library OS, implementing three paging schemes: a fast software oblivious RAM system made practical by leveraging the proposed ISA, a novel page cluster abstraction for application-aware secure self-paging, and a rate-limiting paging mechanism for unmodified binaries. Overall, Autarky provides a comprehensive defense for controlled-channel attacks which supports efficient secure demand paging, and adds no overheads in page-fault free execution.
Multi-field packet classification is a crucial component in modern software-defined data center networks. To achieve high throughput and low latency, state-of-the-art algorithms strive to fit the rule lookup data structures into on-die caches; however, they do not scale well with the number of rules. We present a novel approach, NuevoMatch, which improves the memory scaling of existing methods. A new data structure, Range Query Recursive Model Index (RQ-RMI), is the key component that enables NuevoMatch to replace most of the accesses to main memory with model inference computations. We describe an efficient training algorithm which guarantees the correctness of the RQ-RMI-based classification. The use of RQ-RMI allows the packet rules to be compressed into model weights that fit into the hardware cache and takes advantage of the growing support for fast neural network processing in modern CPUs, such as wide vector processing engines, achieving a rate of tens of nanoseconds per lookup. Our evaluation using 500K multi-field rules from the standard ClassBench benchmark shows a geomean compression factor of 4.9X, 8X, and 82X, and average performance improvement of 2.7X, 4.4X and 2.6X in latency and 1.3X, 2.2X, and 1.2X in throughput compared to CutSplit, NeuroCuts, and TupleMerge, all state-of-the-art algorithms.
SpecFuzz is the first tool that enables dynamic testing for speculative execution vulnerabilities (e.g., Spectre). The key is a novel concept of speculation exposure: The program is instrumented to simulate speculative execution in software by forcefully executing the code paths that could be triggered due to mispredictions, thereby making the speculative memory accesses visible to integrity checkers (e.g., AddressSanitizer). Combined with the conventional fuzzing techniques, speculation exposure enables more precise identification of potential vulnerabilities compared to state-of-the-art static analyzers.
Our prototype for detecting Spectre V1 vulnerabilities successfully identifies all known variations of Spectre V1 and decreases the mitigation overheads across the evaluated applications, reducing the amount of instrumented branches by up to 93% given a sufficient test coverage.
This paper explores new opportunities afforded by the growing deployment of compute and I/O accelerators to improve the performance and efficiency of hardware-accelerated com-
puting services in data centers.
We propose Lynx, an accelerator-centric network server architecture that offloads the server data and control planes to the SmartNIC, and enables direct networking from accelerators via a lightweight hardware-friendly I/O mechanism. Lynx enables the design of hardware-accelerated network servers that run without CPU involvement, freeing CPU cores and improving performance isolation for accelerated services. It is portable across accelerator architectures and allows the management of both local and remote accelerators, seamlessly scaling beyond a single physical machine.
We implement and evaluate Lynx on GPUs and the Intel Visual Compute Accelerator, as well as two SmartNIC architectures – one with an FPGA, and another with an 8-core ARM processor. Compared to a traditional host-centric approach, Lynx achieves over 4× higher throughput for a GPU-centric face verification server, where it is used for GPU communications with an external database, and 25% higher throughput for a GPU-accelerated neural network inference service. For this workload, we show that a single SmartNIC may drive 4 local and 8 remote GPUs while achieving linear performance scaling without using the host CPU.
Recent GPUs enable Peer-to-Peer Direct Memory Access (p2p) from fast peripheral devices like NVMe SSDs to exclude the CPU from the data path between them for efficiency. Unfortunately, using p2p to access files is challenging because of the subtleties of low-level non-standard interfaces, which bypass the OS file I/O layers and may hurt system performance. Developers must possess intimate knowledge of low-level interfaces to manually handle the subtleties of data consistency and misaligned accesses.
We present SPIN, which integrates p2p into the standard OS file I/O stack, dynamically activating p2p where appropriate, transparently to the user. It combines p2p with page cache accesses, re-enables read-ahead for sequential reads, all while maintaining standard POSIX FS consistency, portability across GPUs and SSDs, and compatibility with virtual block devices such as software RAID.
We evaluate SPIN on NVIDIA and AMD GPUs using standard file I/O benchmarks, application traces, and end-to-end experiments. SPIN achieves significant performance speedups across a wide range of workloads, exceeding p2p throughput by up to an order of magnitude. It also boosts the performance of an aerial imagery rendering application by 2.6× by dynamically adapting to its input-dependent file access pattern, enables 3.3× higher throughput for a GPU-accelerated log server, and enables 29% faster execution for the highly optimized GPU-accelerated image collage with only 30 changed lines of code.
Centaur is a GPU-centric architecture for building a low-latency approximate k-Nearest-Neighbors network server. We implement a multi-GPU distributed data flow runtime which enables efficient and scalable network request processing on GPUs. The runtime eliminates GPU management overheads from the CPU, making the server throughput and response time largely agnostic to the CPU load, speed or the number of dedicated CPU cores. Our experiments systems show that our server achieves near-perfect scaling for 16 GPUs, beating the throughput of a highly-optimized CPU-driven server by 35% while maintaining about 2msec average request latency. Furthermore, it requires only a single CPU core to run, achieving over an order of magnitude higher throughput than the standard CPU-driven server architecture in this setting.
We propose a principled approach to integrating GPU memory with an OS page cache. We design GAIA, a weakly-consistent page cache that spans CPU and GPU memories. GAIA enables the standard mmap system call to map files into the GPU address space, thereby enabling data-dependent GPU accesses to large files and efficient write-sharing between the CPU and GPUs. Under the hood, GAIA (1) integrates lazy release consistency protocol into the OS page cache while maintaining backward compatibility with CPU processes and unmodified GPU kernels; (2) improves CPU I/O performance by using data cached in GPU memory, and (3) optimizes the readahead prefetcher to support accesses to files cached in GPUs.
We prototype GAIA in Linux and evaluate it on NVIDIA Pascal GPUs. We show up to 3× speedup in CPU file I/O and up to 8× in unmodified realistic workloads such as Gunrock GPU-accelerated graph processing, image collage, and microscopy image stitching.
Hardware secure enclaves are increasingly used to run complex applications. Unfortunately, existing and emerging enclave architectures do not allow secure and efficient implementation of custom page fault handlers. This limitation impedes in-enclave use of secure memory-mapped files and prevents extensions of the application memory layer commonly used in untrusted systems, such as transparent memory compression or access to remote memory.
CoSMIX is a Compiler-based system for Secure Memory Instrumentation and eXecution of applications in secure enclaves. A novel memory store abstraction allows implementation of application-level secure page fault handlers that are invoked by a lightweight enclave runtime. The CoSMIX compiler instruments the application memory accesses to use one or more memory stores, guided by a global instrumentation policy or code annotations without changing application code.
The CoSMIX prototype runs on Intel SGX and is compatible with popular SGX execution environments, including SCONE and Graphene. Our evaluation of several production applications shows how CoSMIX improves their security and performance by recompiling them with appropriate memory stores. For example, unmodified Redis and Memcached key-value stores achieve about 2× speedup by using a self-paging memory store while working on datasets up to 6× larger than the enclave’s secure memory. Similarly, annotating a single line of code in a biometric verification server changes it to store its sensitive data in Oblivious RAM and makes it resilient against SGX side-channel attacks.
With rising network rates, cloud vendors increasingly deploy FPGA-based SmartNICs (F-NICs), leveraging their inline processing capabilities to offload hypervisor networking infrastructure. However, the use of F-NICs for accelerating general-purpose server applications in clouds has been limited.
NICA is a hardware-software co-designed framework for inline acceleration of the application data plane on F-NICs in multi-tenant systems. A new ikernel programming abstraction, tightly integrated with the network stack, enables application control of F-NIC computations that process application network traffic, with minimal code changes. In addition, NICA’s virtualization architecture supports fine-grain time-sharing of F-NIC logic and provides I/O path virtualization. Together these features enable cost-effective sharing of F-NICs across virtual machines with strict performance guarantees.
We prototype NICA on Mellanox F-NICs and integrate ikernels with the high-performance VMA network stack and the KVM hypervisor. We demonstrate significant acceleration of real-world applications in both bare-metal and virtualized environments, while requiring only minor code modifications to accelerate them on F-NICs. For example, a transparent key-value store cache ikernel added to the stock memcached server reaches 40 Gbps server throughput (99% line-rate) at 6 μs 99th-percentile latency for 16-byte key-value pairs, which is 21× the throughput of a 6-core CPU with a kernel-bypass network stack. The throughput scales linearly for up to 6 VMs running independent instances of memcached.
High-level synthesis (HLS) allows developers to be more productive in designing FPGA circuits thanks to familiar programming languages and high-level abstractions. In order to create high-performance circuits, HLS tools, such as Xilinx Vivado HLS, require following specific design patterns and techniques. Unfortunately, when applied to network packet processing tasks, these techniques limit code reuse and modularity, requiring developers to use deprecated programming conventions. We propose a methodology for developing high-speed networking applications using Vivado HLS for C++, focusing on reusability, code simplicity, and overall performance. Following this methodology, we implement a class library (ntl) with several building blocks that can be used in a wide spectrum of networking applications. We evaluate the methodology by implementing two applications: a UDP stateless firewall and a key-value store cache designed for FPGA-based SmartNICs, both processing packets at 40Gbps line-rate.
Foreshadow is a speculative execution attack that allows adversaries to subvert the security guarantees of Intel’s Software Guard eXtensions (SGX). Foreshadow allows access to data across process boundaries, and allows virtual machines (VMs) to read the physical memory belonging to other VMs or the hypervisor.
Datacenters are moving towards a paradigm of pooling resources (e.g., CPUs, storage and accelerators) into separate nodes to lower costs through easier hardware upgradability and higher resource utilization when running applications with heterogeneous demands.
A single request to an application can trigger a chain of accesses to multiple devices, but each device has wildly different hardware capabilities which expose vastly different data and control interfaces. As a result, applications cannot securely span all these devices in a way that keeps the cost and simplicity benefits of disaggregation while maintaining efficiency.
In this paper, we propose extending NICs to implement a model of continuation-based computations inspired in dataflow, which is used to weave the execution flow of applications across hardware devices without the need for each device to know each other’s communication protocol.
To achieve this, we lean on the observation that modern technology trends like device
self-virtualization, multi-queue designs, RDMA and remote device transports (e.g., NVMe over fabric [14]) can be extended to allow devices to interact with each other without the need for intermediate software layers. Existing NICs can be easily extended to trigger such continuations as a response to device command completions, translating a continuation into a request directed at the next device on the processing pipeline.
Trusted execution environments, and particularly the Software Guard eXtensions (SGX) included in recent Intel x86 processors, gained significant traction in recent years. A long track of research papers, and increasingly also real-world industry applications, take advantage of the strong hardware-enforced confidentiality and integrity guarantees provided by Intel SGX. Ultimately, enclaved execution holds the compelling potential of securely offloading sensitive computations to untrusted remote platforms.
We present Foreshadow, a practical software-only microarchitectural attack that decisively dismantles the security objectives of current SGX implementations. Crucially, unlike previous SGX attacks, we do not make any assumptions on the victim enclave’s code and do not necessarily require kernel-level access. At its core, Foreshadow abuses a speculative execution bug in modern Intel processors, on top of which we develop a novel exploitation methodology to reliably leak plaintext enclave secrets from the CPU cache. We demonstrate our attacks by extracting full cryptographic keys from Intel’s vetted architectural enclaves, and validate their correctness by launching rogue production enclaves and forging arbitrary local and remote attestation responses. The extracted remote attestation keys affect millions of devices.
Mobile devices are equipped with increasingly smart batteries designed to provide responsiveness and extended lifetime. However, such smart batteries may present a threat to users’ privacy. We demonstrate that the phone’s power trace sampled from the battery at 1KHz holds enough information to recover a variety of sensitive information.
We show techniques to infer characters typed on a touchscreen; to accurately recover browsing history in an open-world setup; and to reliably detect incoming calls, and the photo shots including their lighting conditions. Combined with a novel exfiltration technique that establishes a covert channel from the battery to a remote server via a web browser, these attacks turn the malicious battery into a stealthy surveillance device. We deconstruct the attack by analyzing its robustness to sampling rate and execution conditions. To find mitigations we identify the sources of the information leakage exploited by the attack. We discover that the GPU or DRAM power traces alone are sufficient to distinguish between different websites. However, the CPU and power-hungry peripherals such as a touchscreen are the primary sources of fine-grain information leakage.
We consider and evaluate possible mitigation mechanisms, highlighting the challenges to defend against the attacks. In summary, our work shows the feasibility of the malicious battery and motivates further research into system and application-level defenses to fully mitigate this emerging threat.
Numerous recent works have experimentally shown that Intel Software Guard Extensions (SGX) are vulnerable to cache timing and page table side-channel attacks which could be used to circumvent the data confidentiality guarantees provided by SGX. Existing mechanisms that protect against these attacks either incur high execution costs, are ineffective against certain attack variants, or require significant code modifications.
We present Varys, a system that protects unmodified programs running in SGX enclaves from cache timing and page table side-channel attacks. Varys takes a pragmatic approach of strict reservation of physical cores to security-sensitive threads, thereby preventing the attacker from accessing shared CPU resources during enclave execution. The key challenge that we are addressing is that of maintaining the core reservation in the presence of an untrusted OS.
Varys fully protects against all L1/L2 cache timing attacks and significantly raises the bar for page table side-channel attacks – all with only 15% overhead on average for Phoenix and PARSEC benchmarks. Additionally, we propose a set of minor hardware extensions that hold the potential to extend Varys’ security guarantees to L3 cache and further improve its performance.
A recent discovery of a new class of microarchitectural attacks called Spectre picked up the attention of the security community as these attacks can circumvent many
traditional mechanisms of defence. One of the attacks— Bounds Check Bypass—can neither be efficiently solved on system nor architectural levels and requires changes in the application itself. So far, the proposed mitigations involved serialization, which reduces the usage of CPU resources and causes high overheads. In this report, we explore methods of delaying the vulnerable instructions
without complete serialization. We discuss several ways of achieving it and compare them with Speculative Load Hardening, an existing solution based on a similar idea. The solutions of this type cause 60% overhead across Phoenix benchmark suite, which compares favourably to the full serialization causing 440% slowdown.
Recent GPUs enable Peer-to-Peer Direct Memory Access (P2P) from fast peripheral devices like NVMe SSDs to exclude the CPU from the data path between them for efficiency. Unfortunately, using P2P to access files is challenging because of the subtleties of low-level nonstandard interfaces, which bypass the OS file I/O layers and may hurt system performance.
SPIN integrates P2P into the standard OS file I/O stack, dynamically activating P2P where appropriate, transparently to the user. It combines P2P with page cache accesses, re-enables read-ahead for sequential reads, all while maintaining standard POSIX FS consistency, portability across GPUs and SSDs, and compatibility with virtual block devices such as software RAID.
We evaluate SPIN on NVIDIA and AMD GPUs using standard file I/O benchmarks, application traces and end-to-end experiments. SPIN achieves significant performance speedups across a wide range of workloads, exceeding P2P throughput by up to an order of magnitude. It also boosts the performance of an aerial imagery rendering application by 2.6× by dynamically adapting to its input-dependent file access pattern, and enables 3.3× higher throughput for a GPU-accelerated log server.
Future systems will be omni-programmable: alongside CPUs, GPUs and FPGAs,
they will execute user code near-storage, near-network, near-memory, or on other
Near-X accelerator Units, NXUs}.
This paper explores the design space of OS support for omni-programmable systems,
aiming to simplify the development of efficient applications that span multiple
heterogeneous processors and near-data accelerators.
OmniX is an accelerator-centric OS architecture that extends standard OS
abstractions, such as task execution and I/O, into NXUs while maintaining a coherent view of the system among all the processors.OmniX enables NXUs to directly invoke
tasks and access I/O services among themselves, excluding the CPU from the performance-critical
control plane operations. The host CPU serves as a controller — for protection,
device configuration and monitoring. We discuss the hardware trends
that motivate our work, outline OmniX design principles, and sketch the core implementation ideas while highlighting missing hardware features, in the hope of motivating hardware vendors to implement them soon.
Intel Software Guard eXtensions (SGX) enable secure and trusted execution of user code in an isolated enclave to protect against a powerful adversary. Unfortunately, running I/O-intensive, memory-demanding server applications in enclaves leads to significant performance degradation. Such applications put a substantial load on the in-enclave system call and secure paging mechanisms, which turn out to be the main reason for the application slowdown. In addition to the
high direct cost of thousands-of-cycles long SGX management instructions, these mechanisms incur the high indirect cost of enclave exits due to associated TLB flushes and processor state pollution.
We tackle these performance issues in Eleos by enabling exit-less system calls and exit-less paging in enclaves. Eleos introduces a novel Secure User-managed Virtual Memory (SUVM) abstraction that implements application-level paging inside the enclave. SUVM eliminates the overheads of
enclave exits due to paging, and enables new optimizations such as sub-page granularity of accesses. We thoroughly evaluate Eleos on a range of microbenchmarks and two real server applications, achieving notable system performance gains. memcached and a face verification server running in-enclave with Eleos, achieves up to 2.2× and 2.3× higher throughput respectively while working on datasets up to 5× larger than the enclave’s secure physical memory.
GPUs have become an integral part of modern systems, but their implications for system security are not yet clear. This paper demonstrates both that discrete GPUs cannot be used as secure
co-processors and that GPUs provide a stealthy platform for malware. First, we examine a recent proposal to use discrete GPUs as secure co-processors and show that the security guarantees of
the proposed system do not hold on the GPUs we investigate. Second, we demonstrate that (under certain circumstances) it is possible to bypass IOMMU protections and create stealthy, long-lived
GPU-based malware. We demonstrate a novel attack that compromises the in-kernel GPU driver and one that compromises GPU microcode to gain full access to CPU physical memory. In general,
we find that the highly sophisticated, but poorly documented GPU hardware architecture, hidden behind obscure close-source device drivers and vendor-specific APIs, not only make GPUs a poor
choice for applications requiring strong security, but also make GPUs into a security threat.
A party running a computation remotely may benefit from misreporting its output,
say, to lower its tax. Cryptographic protocols that detect and prevent such falsities hold the promise to enhance the security of decentralized systems with stringent computational integrity requirements, like Bitcoin [Nak09]. To gain public trust it is imperative to use publicly verifiable protocols that have no “backdoors” and which can be set up using only a short public random string. Probabilistically Checkable Proof (PCP) systems [BFL90, BFLS91, AS98, ALM + 98] can be used to construct astonishingly efficient protocols [Kil92, Mic00] of this nature but some of the main
components of such systems — proof composition [AS98] and low-degree testing via PCPs of Proximity (PCPPs) [BGH + 05, DR06] — have been considered efficient only asymptotically, for unrealistically large computations; recent cryptographic alternatives [PGHR13, BCG + 13a] suffer from a non-public setup phase. This work introduces SCI, the first implementation of a scalable PCP system (that uses both PCPPs and proof composition). We used SCI to prove correctness of executions of up to 2 20 cycles of a simple processor and calculated its break-even
point [SVP + 12, SMBW12]. The significance of our findings is two-fold: (i) it marks the transition of core PCP techniques (like proof composition and PCPs of Proximity ) from mathematical theory to practical system engineering, and (ii) the thresholds obtained are nearly achievable and hence show that PCP-supported computational integrity is closer to reality than previously assumed.
Intel SGX enclaves is a novel technology that holds the promise to revolutionize the way secure and trustworthy applications are built. However, from the perspective of interaction with the rest
of the system, some of the enclave’s characteristics are remarkably similar to the characteristics of traditional hardware accelerators, such as GPUs. For example, enclaves suffer from significant in-
vocation overheads, offer space-constrained private memory, and cannot directly invoke OS services such as network or file I/O. Over the course of GPU computing evolution, there have been developed many techniques to improve system performance and programmability. Our key observation is that the conceptual similarities between enclaves and accelerators may help to build efficient runtime support for enclaves by learning from past experience with GPUs.
We demonstrate this simple idea by implementing SGXIO, a simple yet powerful enhancement to the current SGX runtime which boosts the performance of I/O system calls from enclaves. SGXIO
design is almost identical to the design of GPUfs and GPUnet systems for efficient I/O services for GPU programs. Our preliminary evaluation shows that GXIO improves the performance of
a simple network parameter server for distributed machine learning by up to 3.7×. These promising results suggest new ways to design more efficient runtime and system services for enclaves.
Despite the popularity of GPUs in high-performance and scientific computing, and despite increasingly general-purpose hardware capabilities, the use of GPUs in network servers or distributed systems poses significant challenges.
GPUnet is a native GPU networking layer that provides a socket abstraction and high-level networking APIs for GPU programs. We use GPUnet to streamline the development of high-performance, distributed applications like in-GPU-memory MapReduce and a new class of low-latency, high-throughput GPU-native network services such as a face verification server.
We present GPUrdma, a GPU-side library for performing Remote Direct Memory Accesses (RDMA) across the network directly from GPU kernels. The library executes no code on CPU, directly accessing the Host Channel Adapter (HCA) Infiniband hardware for both control and data. Slow single-thread GPU performance and the intricacies of the GPU-to-network adapter interaction pose a significant challenge. We describe several design options and analyze their performance implications in detail.
We achieve 5usec one-way communication latency and up to 50Gbit/sec transfer bandwidth for messages from 16KB and larger between K40c NVIDIA GPUs across the network. Moreover, GPUrdma outperforms the CPU RDMA for smaller packets ranging from 2 to 1024 bytes by factor of 4.5x thanks to greater parallelism of transfer requests enabled by highly parallel GPU hardware.
We use GPUrdma to implement a subset of the global address space programming interface (GPI) for point-to-point asynchronous RDMA messaging. We demonstrate our preliminary results using two simple applications — ping-pong and a multi-matrix-vector product with constant matrix and multiple vectors — each running on two different machines connected by Infiniband. Our basic ping-pong implementation achieves 5%higher performance than the baseline using GPI-2. The improved ping-pong implementation with per-threadblock communication overlap enables a further 20% improvement. The multi-matrix-vector product is up to 4.5x faster thanks to higher throughput for small messages and the ability to keep the matrix in fast GPU shared memory while receiving new inputs.
GPUrdma prototype is not yet suitable for production systems due to hardware constraints in the current generation of NVIDIA GPUs which we discuss in detail. However, our results highlight the great potential of GPU-side native networking, and encourage further research toward scalable, high-performance, a heterogeneous networking infrastructure.
Using discrete GPUs for processing very large datasets is challenging, in particular when an algorithm exhibit unpredictable, data-driven access patterns. In this paper, we investigate the utility of GPUfs, a library that provides direct access to files from GPU programs, to implement such algorithms. We analyze the system’s bottlenecks, and suggest several modifications to the GPUfs design, including new concurrent hash table for the buffer cache and a highly parallel memory allocator. We also show that by implementing the workload in a warp-centric manner we can improve the performance even further. We evaluate our changes by implementing a real image processing application which creates collages from a dataset of 10 Million images. The enhanced GPUfs design improves the application performance by 5.6× on average over the original GPUfs, and outperforms both 12-core parallel CPU which uses the AVX instruction set, and a standard CUDA-based GPU implementation by up to 2.5× and 3× respectively, while significantly enhancing system programmability and simplifying the application design and implementation.
Finite fields of characteristic 2 — “binary fields” — are used in a variety of applications in cryptography and data storage. Multiplication of two finite field elements is a fundamental operation and a well-known computational bottleneck in many of these applications, as they often require multiplication of a large number of elements. In this work we focus on accelerating multiplication in “large” binary fields of sizes greater than 232. We devise a new parallel algorithm optimized for execution on GPUs. This algorithm makes it possible to multiply large number of finite field elements and achieves high performance via bit-slicing and fine-grained parallelization.
The key to the efficient implementation of the algorithm is a novel performance optimization methodology we call the register cache. This methodology speeds up an algorithm that caches its input in shared memory by transforming the code to use per-thread registers instead. We show how to replace shared memory accesses with the shuffle() intra-warp communication instruction, thereby significantly reducing or even eliminating shared memory accesses. We thoroughly analyze the register cache approach and characterize its benefits and limitations.
We apply the register cache methodology to the implementation of the binary finite field multiplication algorithm on GPUs. We achieve up to 138x speedup for fields of size 232 over the popular, highly optimized Number Theory Library (NTL) [26], which uses the specialized CLMUL CPU instruction, and over 30x for larger fields of size below 2256. Our register cache implementation enables up to 50% higher performance compared to the traditional shared-memory based design.
Modern discrete GPUs have been the processors of choice for accelerating compute-intensive applications, but using them in large-scale data processing is extremely challenging. Unfortunately, they do not provide important I/O abstractions long established in the CPU context, such as memory mapped files, which shield programmers from the complexity of buffer and I/O device management. However, implementing these abstractions on GPUs poses a problem: the limited GPU virtual memory system provides no address space management and page fault handling mechanisms to GPU developers, and does not allow modifications to memory mappings for running GPU programs.
We implement ActivePointers, a software address translation layer and paging system that introduces native support for page faults and virtual address space management to GPU programs, and enables the implementation of fully functional memory mapped files on commodity GPUs. Files mapped into GPU memory are accessed using active pointers, which behave like regular pointers but access the GPU page cache under the hood, and trigger page faults which are handled on the GPU. We design and evaluate a number of novel mechanisms, including a translation cache in hardware registers and translation aggregation for deadlock-free page fault handling of threads in a single warp.
We extensively evaluate ActivePointers on commodity NVIDIA GPUs using microbenchmarks, and also implement a complex image processing application that constructs a photo collage from a subset of 10 million images stored in a 40GB file. The GPU implementation maps the entire file into GPU memory and accesses it via active pointers. The use of active pointers adds only up to 1% to the application’s runtime, while enabling speedups of up to 3.9× over a combined CPU+GPU implementation and 2.6× over a 12-core CPU-only implementation which uses AVX vector instructions.
Distributed actor systems are widely used for developing interactive scalable cloud services, such as social networks and online games. By modelling an application as a dynamic set of lightweight communicating “actors”, developers can easily build complex distributed applications, while the underlying runtime system deals with low-level complexities of a distributed environment.
We present ActOp—a data-driven, application-independent runtime mechanism for optimizing end-to-end service latency of actor-based distributed applications. ActOp targets the two dominant factors affecting latency: the overhead of remote inter-actor communications across servers, and the intra-server queuing delay. ActOp automatically identifies frequently communicating actors and migrates them to the same server transparently to the running application. The migration decisions are driven by a novel scalable distributed graph partitioning algorithm which does not rely on a single server to store the whole communication graph, thereby enabling efficient actor placement even for applications with rapidly changing graphs (e.g., chat services). Further, each server autonomously reduces the queuing delay by learning an internal queuing model and configuring threads according to instantaneous request rate and application demands.
We prototype ActOp by integrating it with Orleans — a popular open-source actor system [4, 13]. Experiments with realistic workloads show latency improvements of up to 75% for the 99th percentile, up to 63% for the mean, with up to 2x increase in peak system throughput.
As GPUs become general purpose, they are outgrowing the coprocessor model and require convenient I/O abstractions such as files and network sockets. Recent studies have shown the benefits of native GPU I/O layers, in terms of both programmability and performance. However, due to lack of hardware support, the GPU threads performing I/O calls are forced to busy-wait for the completion of I/O operations, resulting in underutilized hardware, higher power consumption, and reduced system throughput.
We argue that I/O-driven preemption improves the performance of existing solutions, despite many challenging system characteristics such as a large kernel state. We analyze the benefits of adding preemption support using a simple system performance model, and, encouraged by the results, explore the design of a software-based preemption mechanism for GPUs. In our prototype, GPUpIO, we implement a source-to-source compiler for state checkpoint and restoration, and a runtime library for scheduling preempted thread-blocks, which together enable I/O-driven preemption for GPUs.
We evaluate our prototype across a variety of system parameters and workloads to determine when preemption is worthwhile. We show that in some workloads the I/O-driven preemption approach may indeed double the effective system throughput by completely hiding the I/O latency behind computations. However, we also observe that the software-only solution is currently limited, not only due to its overheads, but also because it does not have sufficient control of the hardware scheduler queue and therefore may lead to starvation of I/O kernels. We then discuss a new hardware feature that, if added, may render a general I/O-driven preemption mechanism on GPUs practical.
Despite the popularity of GPUs in high-performance and scientific computing, and despite increasingly general-purpose hardware capabilities, the use of GPUs in network servers or distributed systems poses significant challenges.
GPUnet is a native GPU networking layer that provides a socket abstraction and high-level networking APIs for GPU programs. We use GPUnet to streamline the development of high-performance, distributed applications like in-GPU-memory MapReduce and a new class of low-latency, high-throughput GPU-native network services such as a face verification server.
This is a non-technical article that covers the main aspects of the GPUfs file system layer for GPU software that makes operating system abstractions available to GPU code.
Erasure coding schemes provide higher durability at lower storage cost, and thus constitute an attractive alternative to replication in distributed storage systems, in particular for storing rarely accessed “cold” data. These schemes, however, require an order of magnitude higher recovery bandwidth for maintaining a constant level of durability in the face of node failures. In this paper we propose lazy recovery, a technique to reduce recovery bandwidth demands down to the level of replicated storage. The key insight is that a careful adjustment of recovery rate substantially reduces recovery bandwidth, while keeping the impact on read performance and data durability low. We demonstrate the benefits of lazy recovery via extensive simulation using a realistic distributed storage configuration and published component failure parameters. For example, when applied to the commonly used RS(14, 10) code, lazy recovery reduces repair bandwidth by up to 76% even below replication, while increasing the amount of degraded stripes by 0.1 percentage points. Lazy recovery works well with a variety of erasure coding schemes, including the recently introduced bandwidth efficient codes, achieving up to a factor of 2 additional bandwidth savings.
As GPU hardware becomes increasingly general-purpose, it is quickly outgrowing the traditional, constrained GPU-as-coprocessor programming model. This article advocates for extending standard operating system services and abstractions to GPUs in order to facilitate program development and enable harmonious integration of GPUs in computing systems. As an example, we describe the design and implementation of GPUFs, a software layer which provides operating system support for accessing host files directly from GPU programs. GPUFs provides a POSIX-like API, exploits GPU parallelism for efficiency, and optimizes GPU file access by extending the host CPU’s buffer cache into GPU memory. Our experiments, based on a set of real benchmarks adapted to use our file system, demonstrate the feasibility and benefits of the GPUFs approach. For example, a self-contained GPU program that searches for a set of strings throughout the Linux kernel source tree runs over seven times faster than on an eight-core CPU.
Early graphical processing units (GPUs) were designed as high compute density, fixed-function processors ideally crafted to the needs of computer graphics workloads. Today, GPUs are becoming truly first-class computing elements on par with CPUs. Programming GPUs as self-sufficient general-purpose processors is not only hypothetically desirable, but feasible and efficient in practice, opening new opportunities for integration of GPUs in complex software systems.
Motivation: The use of dense single nucleotide polymorphism (SNP) data in genetic linkage analysis of large pedigrees is impeded by significant technical, methodological and computational challenges. Here we describe Superlink-Online SNP, a new powerful online system that streamlines the linkage analysis of SNP data. It features a fully integrated flexible processing workflow comprising both well-known and novel data analysis tools, including SNP clustering, erroneous data filtering, exact and approximate LOD calculations and maximum-likelihood haplotyping. The system draws its power from thousands of CPUs, performing data analysis tasks orders of magnitude faster than a single computer. By providing an intuitive interface to sophisticated state-of-the-art analysis tools coupled with high computing capacity, Superlink-Online SNP helps geneticists unleash the potential of SNP data for detecting disease genes.
Results: Computations performed by Superlink-Online SNP are automatically parallelized using novel paradigms, and executed on unlimited number of private or public CPUs. One novel service is large-scale approximate Markov Chain–Monte Carlo (MCMC) analysis. The accuracy of the results is reliably estimated by running the same computation on multiple CPUs and evaluating the Gelman–Rubin Score to set aside unreliable results. Another service within the workflow is a novel parallelized exact algorithm for inferring maximum-likelihood haplotyping. The reported system enables genetic analyses that were previously infeasible. We demonstrate the system capabilities through a study of a large complex pedigree affected with metabolic syndrome.
Availability: Superlink-Online SNP is freely available for researchers at http://cbl-hap.cs.technion.ac.il/superlink-snp. The system source code can also be downloaded from the system website.
GPU hardware is becoming increasingly general purpose, quickly outgrowing the traditional but constrained GPU-as-coprocessor programming model. To make GPUs easier to program and easier to integrate with existing systems, we propose making the host’s file system directly accessible from GPU code.
GPUfs provides a POSIX-like API for GPU programs, exploits GPU parallelism for efficiency, and optimizes GPU file access by extending the buffer cache into GPU memory. Our experiments, based on a set of real benchmarks adopted to use our file system, demonstrate the feasibility and benefits of our approach. For example, we demonstrate a simple self-contained GPU program which searches for a set of strings in the entire tree of Linux kernel source files over seven times faster than an eight-core CPU run.
Modern systems keep long memories. As we show in this paper, an adversary who gains access to a Linux system, even one that implements secure deallocation, can recover the contents of applications’ windows, audio buffers, and data remaining in device drivers–long after the applications have terminated.
We design and implement Lacuna, a system that allows users to run programs in “private sessions.” After the session is over, all memories of its execution are erased. The key abstraction in Lacuna is an ephemeral channel, which allows the protected program to talk to peripheral devices while making it possible to delete the memories of this communication from the host. Lacuna can run unmodified applications that use graphics, sound, USB input devices, and the network, with only 20 percentage points of additional CPU utilization.
Processing vast numbers of data streams is a common problem in modern computer systems and is known as the “online big data problem.” Adding hard real-time constraints to the processing makes the scheduling problem a very challenging task that this paper aims to address. In such an environment, each data stream is manipulated by a (different) application and each datum (data packet) needs to be processed within a known deadline from the time it was generated. This work assumes a central compute engine which consists of a set of CPUs and a set of GPUs. The system receives a configuration of multiple incoming streams and executes a scheduler on the CPU side. The scheduler decides where each data stream will be manipulated (on the CPUs or on one of the GPUs), and the order of execution, in a way that guarantees that no deadlines will be missed. Our scheduler finds such schedules even for workloads that require high utilization of the entire system (CPUs and GPUs).
This paper focuses on an environment where all CPUs share a main memory, and are controlled by a single operating system (and a scheduler). The system uses a set of discrete graphic cards, each with its own private main memory. Different memory regions do not share information, and coherency is maintained by the use of explicit memory-copy operations. The paper presents a new algorithm for distributing data and scheduling applications that achieves high utilization of the entire system (CPUs and GPUs), while producing schedules that meet hard real-time constraints.
We evaluate our new proposed algorithm by using the AES-CBC encryption kernel on thousands of streams with realistic distribution of rates and deadlines. The paper shows that on a system with a CPU and two GPU cards, our current framework allows up to 87% more data to be processed per time unit than a similar single-GPU system.
Many scientists perform extensive computations by executing large bags of similar tasks (BoTs) in mixtures of computational environments, such as grids and clouds. Although the reliability and cost may vary considerably across these environments, no tool exists to assist scientists in the selection of environments that can both fulfill deadlines and fit budgets. To address this situation, we introduce the Expert BoT scheduling framework. Our framework systematically selects from a large search space the Pareto-efficient scheduling strategies, that is, the strategies that deliver the best results for both make span and cost. Expert chooses from them the best strategy according to a general, user-specified utility function. Through simulations and experiments in real production environments, we demonstrate that Expert can substantially reduce both make span and cost in comparison to common scheduling strategies. For bioinformatics BoTs executed in a real mixed grid + cloud environment, we show how the scheduling strategy selected by Expert reduces both make span and cost by 30%-70%, in comparison to commonly-used scheduling strategies.
We propose a new set of OS abstractions to support GPUs and other accelerator devices as first class computing resources. These new abstractions, collectively called the PTask API, support a dataflow programming model. Because a PTask graph consists of OS-managed objects, the kernel has sufficient visibility and control to provide system-wide guarantees like fairness and performance isolation, and can streamline data movement in ways that are impossible under current GPU programming models.
Our experience developing the PTask API, along with a gestural interface on Windows 7 and a FUSE-based encrypted file system on Linux show that the PTask API can provide important system-wide guarantees where there were previously none, and can enable significant performance improvements, for example gaining a 5× improvement in maximum throughput for the gestural interface.
Data stream processing applications such as stock exchange data analysis, VoIP streaming, and sensor data processing pose two conflicting challenges: short per-stream latency — to satisfy the milliseconds-long, hard real-time constraints of each stream, and high throughput — to enable efficient processing of as many streams as possible. High-throughput programmable accelerators such as modern GPUs hold high potential to speed up the computations. However, their use for hard real-time stream processing is complicated by slow communications with CPUs, variable throughput changing non-linearly with the input size, and weak consistency of their local memory with respect to CPU accesses. Furthermore, their coarse grain hardware scheduler renders them unsuitable for unbalanced multi-stream workloads.
We present a general, efficient and practical algorithm for hard real-time stream scheduling in heterogeneous systems. The algorithm assigns incoming streams of different rates and deadlines to CPUs and accelerators. By employing novel stream schedulability criteria for accelerators, the algorithm finds the assignment which simultaneously satisfies the aggregate throughput requirements of all the streams and the deadline constraint of each stream alone.
Using the AES-CBC encryption kernel, we experimented extensively on thousands of streams with realistic rate and deadline distributions. Our framework outperformed the alternative methods by allowing 50% more streams to be processed with provably deadline-compliant execution even for deadlines as short as tens milliseconds. Overall, the combined GPU-CPU execution allows for up to 4-fold throughput increase over highly-optimized multi-threaded CPU-only implementations.
Linkage analysis is a statistical method used by geneti cists in everyday practice for mapping disease-susceptibility genes in the study of complex diseases. An essential first step in the study of genetic diseases, linkage computations may require years of CPU time. The recent DNA sampling revolution enabled unprecedented sampling density, but made the analysis even more computationally demanding. In this paper we describe a high performance online service for genetic linkage analysis, called Superlink-online. The system enables anyone with Internet access to submit genetic data and analyze it as easily and quickly as if using a supercomputer. The analyses are automatically parallelized and executed on tens of thousands distributed CPUs in multiple clouds and grids.
The first version of the system, which employed up to 3,000 CPUs in UW Madison and Technion Condor pools, has been successfully used since 2006 by hundreds of geneticists worldwide, with over 40 citations in the genetics literature. Here we describe the second version, which substantially improves the scalability and performance of first: it uses over 45,000 non-dedicated hosts, in 10 different grids and clouds, including EC2 and the Superlink@Technion community grid. Improved system performance is obtained through a virtual grid hierarchy with dynamic load balancing and multi-grid overlay via the GridBot system, parallel pruning of short tasks for overhead minimization, and cost-efficient use of cloud resources in reliability-critical execution periods.
These enhancements enabled execution of many previously infeasible analyses, which can now be completed within a few hours. The new version of the system, in production since 2009, has completed over 6500 different runs of over 10 million tasks, with total consumption of 420 CPU years.
We consider the problem of energy-efficient acceleration of applications comprising multiple interdependent tasks forming a dependency tree, on a hypothetical CPU/GPU system where both a CPU and a GPU can be powered off when idle. Each task in the tree can be invoked on either a GPU or a CPU, but the performance may vary: some run faster on a GPU, while others prefer a CPU, making the choice of the lowest-energy processor input dependent. Furthermore, greedily minimizing the energy consumption for each task is suboptimal because of the additional energy required for the communication between the tasks executed on different processors.
We propose an efficient algorithm that takes into account the energy consumption of a CPU and a GPU for each task, as well as the communication costs of data transfers between them, and constructs an optimal acceleration schedule with provably minimal total consumed energy.
We evaluate the algorithm in the context of a real application having a task dependency tree structure and show up to 2.5-fold improvement in the expected energy consumption over the best single processor schedule, and up to 50% improvement over the communication unaware schedule on real inputs. We also show how this algorithm can be used to speedup computations rather than minimize power consumption. We achieve achieve up to a 2-fold speedup in real CPU/GPU systems.
In this chapter we cover two difficult problems frequently encountered by GPU developers: optimizing memory access for kernels with complex input-dependent access patterns, and mapping the computations to a GPU or a CPU in
composite applications with multiple dependent kernels. Both pose a formidable challenge as they require dynamic adaptation and tuning of execution policies to allow high performance for a wide range of inputs. Not meeting these requirements leads to substantial performance penalty.
We develop our solution using simple examples, and then apply them to a real application for computing the probability of evidence in probabilistic networks. The combined techniques of memory optimization and dynamic assignment
result in up to three-fold runtime reduction over the non-optimized version on real inputs from the genetic analysis domain, and up to five-fold over an optimized parallel version running on Intel’s latest dual quad-core 16-thread Nehalem machine. In the first part of the chapter we describe our methodology for solving the memory optimization problem via
software-managed caching by efficiently exploiting the fast scratchpad memory. This technique outperforms the cache-less and the texture memory-based approaches on pre-Fermi GPU architectures as well as the one that uses the Fermi hardware cache alone.
The focus of the second part of the chapter is the algorithm for minimizing the total running time of a complete application comprising multiple interdependent kernels. Both a GPU and a CPU can be used to execute the kernels, but the performance varies greatly for different inputs, calling for dynamic assignment of the computations to a GPU or a CPU at runtime. However, instead of greedily choosing the best performing device for each kernel, the algorithm minimizes the runtime of the complete application by evaluating the performance of all the assignments jointly, including the overhead of
the data transfer between the devices.
We present a holistic approach for efficient execution of bags-of-tasks (BOTs) on multiple grids, clusters, and volunteer computing grids virtualized as a single computing platform. The challenge is twofold: to assemble this compound environment and to employ it for execution of a mixture of throughput- and performance-oriented BOTs, with a dozen to millions of tasks each. Our generic mechanism allows per BOT specification of dynamic arbitrary scheduling and replication policies as a function of the system state, BOT execution state, and BOT priority. We implement our mechanism in the GridBot system and demonstrate its capabilities in a production setup. GridBot has executed hundreds of BOTs with over 9 million jobs during three months alone; these have been invoked on 25,000 hosts, 15,000 from the Superlink@Technion community grid and the rest from the Technion campus grid, local clusters, the Open Science Grid, EGEE, and the UW Madison pool.
We present a technique for designing memory-bound algorithms with high data reuse on Graphics Processing Units (GPUs) equipped with close-to-ALU software-managed memory. The approach is based on the efficient use of this memory through the implementation of a software-managed cache. We also present an analytical model for performance analysis of such algorithms.
We apply this technique to the implementation of the GPU-based solver of the sum-product or marginalize a product of functions (MPF) problem, which arises in a wide variety of real-life applications in artificial intelligence, statistics, image processing, and digital communications. Our motivation to accelerate MPF originated in the context of the analysis of genetic diseases, which in some cases requires years to complete on modern CPUs. Computing MPF is similar to computing the chain matrix product of multi-dimensional matrices, but is more difficult due to a complex data-dependent access pattern, high data reuse, and a low compute-to-memory access ratio. Our GPU-based MPF solver achieves up to 2700-fold speedup on random data and 270-fold on real-life genetic analysis datasets on GeForce 8800GTX GPU from NVIDIA over the optimized CPU version on an Intel 2.4GHz Core 2 with a 4MB L2 cache.
Grids are becoming a mission-critical component in research and industry. The services they provide are thus required to be highly available, contributing to the vision of the grid as a dependable virtual computer of infinite power. However, building highly available services in grid is particularly difficult due to the unique characteristics of the grid environment. We believe that high availability functionality should itself be provided as a service, which can be used by transparently decorating, but not changing, the original services, thus making them highly available. In this work we highlight the major challenges and describe our initial experience in building such a generic high availability service in the context of the Condor system
Consider a workload in which massively parallel tasks that require large resource pools are interleaved with short tasks that require fast response but consume fewer resources. We aim at achieving high throughput and short response time when scheduling such a workload over a set of uncoordinated grids of varying sizes and performance characteristics.
We propose the concept of a grid execution hierarchy, where available grids are sorted according to their size, and the execution overheads increase with the size of the grids. We devise a scheduling algorithm for this execution hierarchy of grids by adapting the multilevel feedback queue approach to a multi-grid environment. The algorithm finds a grid of the size, availability, and overhead that best matches a task’s resource requirements and expected turnaround time. Our approach is inspired by the shortest processing time first policy (SPTF), in the sense that the task’s processing demands are constantly reevaluated during its run, so that a task is migrated to a more suitable level of the execution hierarchy when appropriate.
We evaluate our approach in the context of the Superlink-online system for processing genetic linkage analysis tasks – a production system consisting of several grids and utilizing tens of thousands of CPU hours a month. With our approach the system provides nearly interactive response time for shorter tasks, while simultaneously serving throughput-oriented massively parallel tasks in an efficient manner.
Computation of LOD scores is a valuable tool for mapping disease-susceptibility genes in the study of Mendelian and complex diseases. However, computation of exact multipoint likelihoods of large inbred pedigrees with extensive missing data is often beyond the capabilities of a single computer. We present a distributed system called “SUPERLINK-ONLINE,” for the computation of multipoint LOD scores of large inbred pedigrees. It achieves high performance via the efficient parallelization of the algorithms in SUPERLINK, a state-of-the-art serial program for these tasks, and through the use of the idle cycles of thousands of personal computers. The main algorithmic challenge has been to efficiently split a large task for distributed execution in a highly dynamic, nondedicated running environment. Notably, the system is available online, which allows computationally intensive analyses to be performed with no need for either the installation of software or the maintenance of a complicated distributed environment. As the system was being developed, it was extensively tested by collaborating medical centers worldwide on a variety of real data sets, some of which are presented in this article.
In this report we investigate the benefits of using a coprocessor coupled with content addrassible memory (CAM) for off-loading of a computation-intensive kernels of antivirus software. Overview of antivirus technologies is presented, followed by performance analysis of real antivirus software to justify the application of coprocessor. High level architecture of the coprocessor and its interaction with main CPU is described. CAM usage is described and performance analysis is presented. A broader perspective of a using CAM-based coprocessor application for string pattern matching, various string operations, e.g. string comparisons, and regular expression matching is discussed.