Systems research I'm watching for in 2018

Jan 2018 This post outlines systems technology that I hope to see form the basis for research projects in the upcoming year. Most of the technologies covered can stand alone as a key enabler for individual research directions when paired with an interesting problem. But I’m most excited about the possibilities when these technologies are combined in unique ways. I’ll cover big memory systems, languages and run-times, and tools and frameworks. While I don’t have suggestions for specific research topics, what I hope to do in this post is get everyone excited about the technologies that I haven’t been able to stop thinking about all of last year.

Sharding the LRU node cache in CruzDB

Jun 2017 In our previous post on CruzDB performance we examined initial performance results for the YCSB benchmark. The results showed that a larger cache had a significant benefit for performance (no surprise), but we also observed that even for read-only workloads throughput was not scaling with the number of threads. In this post we address this issue and present new scalability results. CruzDB is structured as a copy-on-write red-black tree where every new version is stored in an underlying shared-log.

Debuting CruzDB performance results

Jun 2017 The CruzDB database system is a log-structured key-value store that provides serialized transactions over a distributed high-performance log. We have been working on this system for a little over a year and we just finished bindings for the YCSB workload generator. Today we’ll be previewing some of our initial performance results. We’ll cover some of the basics of the system, but will save a discussion of many of the technical details for another day.

Slow cold-cache reads from placement group in Ceph

Dec 2016 In Adding a new placement group operation in Ceph I demonstrated how to add a new operation in RADOS that operates at the placement group level, allowing one operation to operate on multiple objects. Recently I’ve been experimenting more with operations at the placement group level, and found interesting performance behavior when reading multiple objects within a single PG operation. The two graphs below show the results of four experiments that each read 1000 small objects from a placement group with eight PGs.

Adding a new placement group operation in Ceph

Dec 2016 In this post we are going to create a librados operation in Ceph that operates at the level of the placement group (most RADOS operations act upon objects). As a demonstration we’ll build an interface that computes the checksum of all object data in a placement group. This probably isn’t useful to anyone, but it exercises a lot of interesting internal machinery. The overall approach is adapted from the code paths used to list objects in a pool.

Measuring userfaultfd page-fault latency

Oct 2016 The userfaultfd feature in the Linux kernel allows userspace to handle page faults and some other memory management tasks. For example a missing page can be handled by paging in from a remote source, or write-protecting pages and handling write events. The initial user of this feature is QEMU post-copy live migration where a live VM running on a destination node is demand paging-in guest memory, and QEMU is handling the network transfer.

Introduction to the ZLog transaction key-value store

Aug 2016 Today I’m going to introduce a new project that we have started intended to provide a general purpose, distributed, transactional key-value store on top ZLog. Readers of this blog should be familiar with ZLog as the high-performance distributed shared-log that runs on top of Ceph and RADOS. For readers unfamiliar with the ZLog project I recommend reading this. In a nutshell the ZLog project provides a high-performance shared-log abstraction with serializability guarantees.

ZLog project update (mid-2016 edition)

Jun 2016 This post provides an update on the ZLog project, including the latest performance enhancements, and where we are heading next. If you aren’t familiar with ZLog it is a high-performance distributed shared-log. It is unique in that it maintains serializability while providing high append throughput, which makes it an ideal candidate in building things like replicated state-machines, distributed block devices, and databases. The design of ZLog is based on the CORFU protocol, but adapted for a software-defined storage environment.

RADOS object class development activity

Jan 2016 The object interface in RADOS can be customized using a feature called object classes. Object classes can be authored in C++, or dynamically injected using Lua scripts attached to each request, loaded from a file system, or managed by Ceph. This post takes a quick look at how much this facility is used. When object interfaces change, the interface to data changes and must be managed carefully to avoid losing access to data or degrading performance.

Toward dynamic RADOS object class management

Dec 2015 Standard object classes in RADOS are managed using a static versioning and distribution scheme, but this may be restrictive for dynamically defined interfaces. In this post we describe a proof-of-concept implementation for dynamically managing object interfaces. The stable version of cls_lua that enables object classes to be written in Lua is located in the cls-lua branch of the upstream Ceph repository ( now located in upstream master. The code described in this post is a work-in-progress and is maintained in a separate branch: https://github.

Load Lua RADOS classes from local file system

Dec 2015 Object classes in RADOS written in Lua have up until now been limited to scripts that are embedded into every client request. This post describes how we have extended RADOS to load Lua scripts from the local file system, supporting a new way to manage object interfaces written in Lua. Introduction The RADOS object store that powers Ceph supports an active storage-like feature called object classes that allow custom object interfaces to be defined.

ZLog asynchronous I/O support

Sep 2015 As previously discussed ZLog is an implementation of the CORFU distributed log protocol on top of Ceph. In the post describing ZLog we didn’t dig too deeply into the client API. In this post we will discuss the basics of using the API, and provide details on the new asynchronous API design. ZLog API Example First create a connection to the sequencer service zlog::SeqrClient seqr("localhost", "5678"); Next create a brand new log.

Hadoop on Ceph: usability survey

Jul 2015 As we saw in the last post on setting up Hadoop on Ceph there were a lot of steps that cause usability to suffer. In this post we’ll check out a variety of storage systems that can function as an alternative to HDFS in Hadoop environments to see what other systems are doing to ease the pain. Many of these alternative systems are listed on the Hadoop File System Compatibility wiki page: https://wiki.

Hadoop on Ceph: diving in

Jul 2015 It has been possible for several years now to run Hadoop on top of the Ceph file system using a shim layer that maps between the HDFS abstraction and the underlying Ceph file interface. Since then bug fixes and performance enhancements have found their way into the shim, but usability has remained a sore area primarily due to the lack of documentation, and low-level setup required in many instances. This post marks the beginning of a series of posts on using Hadoop on top of Ceph.

Distributed search for Beal's Conjecture counterexamples

Jan 2015 In this post we’ll take a stab at finding a counterexample to Beal’s Conjecture, which states that if a^x + b^y = c^z, where a, b, c, x, y, and z are positive integers and x, y and z are all greater than 2, then a, b, and c must have a common prime factor. There is a monetary prize offered by Andrew Beal for a proof or counterexample to the conjecture.

Build Ceph on aarch64 Ubuntu 14.10

Jan 2015 In this post we’ll create an arm64 build of Ceph. The main issue faced is an unmet dependency on Ubuntu 14.10 (arch=arm64) for building the Ceph Debian packages. [email protected]:~$ git clone --recursive Cloning into 'ceph'... remote: Counting objects: 276504, done. ... [email protected]:~$ cd ceph [email protected]:~/ceph$ ./ Reading package lists... Done Building dependency tree Reading state information... Done dpkg-dev is already the newest version. The following packages were automatically installed and are no longer required: liblzo2-2 libpkcs11-helper1 Use 'apt-get autoremove' to remove them.

Setting up iSER-enabled TGT RAM disk

Jan 2015 In this post we’ll create a network-backed tmpfs by constructing a RAID-0 array of remote RAM disks using TGT and iSCSI. We’ll export two 25 GB remote RAM disks from a remote note, and use mdadm to create a local RAID device. Then we’ll format them with ext4 and disable journaling for a fast in-memory file system. On the target (server): # ramdisk 1 mkdir /tmp/rd1 mount -t tmpfs -o size=25G tmpfs /tmp/rd1 dd if=/dev/zero of=/tmp/rd1/lun bs=1M # ramdisk 2 mkdir /tmp/rd2 mount -t tmpfs -o size=25G tmpfs /tmp/rd2 dd if=/dev/zero of=/tmp/rd2/lun bs=1M # target 1 tgtadm --lld iser --op new --mode target --tid 1 -T iqn.

Remote RAM disk with RDMA

Jan 2015 In this post I’ll show you how to use iSER, iSCSI, and LIO to setup a remote RAM disk. This is useful if you need high IOPS but don’t have access to a bunch of SSDs or NVRAM. Note that the performance achieved in this post is quite low compared to what you should be able to achieve with different hardware. Currently the arm64 machines we are using aren’t getting the performance expected, and tuning is on going.

Setting up RoCE on aarch64 Ubuntu

Jan 2015 Notes on setting up RoCE (RDMA over Converged Ethernet) on aarch64 running Ubuntu Server. The instructions below are based off the guide found here However, some package dependencies are not available on arm64. I’ve updated the list to work with the Ubuntu packages available on arm64. I am going to be using the Utah CloudLab cluster, which contains a bunch of HP Moonshot nodes with the following hardware: HP Moonshot m400 Eight 64-bit ARMv8 (Atlas/A57) cores at 2.

ZLog sequencer performance

Nov 2014 In a previous post we discussed the design of zlog, our implementation of the CORFU distributed shared-log protocol on top of Ceph. A key component of the system is the sequencer server that reports the current log tail to clients. In this post we’ll discuss the implementation and performance of the sequencer in zlog. The fast path of the sequencer server is simple. It contains an in-memory counter that is incremented when a client requests the next position in the log.

ZLog: a distributed shared-log on Ceph

Oct 2014 Distributed logs have been receiving a lot of attention lately. And rightfully so—as a building block, they are a basic concept that in many instances can simplify the construction of distributed systems. But building a distributed log is no simple task. In this post I will share the design of zlog, our implementation of a globally consistent distributed log on top of Ceph. The implementation of zlog is based on the novel CORFU protocol for building high-performance distributed shared-logs.

Ceph OSD request processing latency

Jun 2014 How fast can RADOS process a request? The answer depends on a lot of factors such as network and I/O performance, operation type, and all sorts of flavors of contention that limit concurrency. Today we’ll focus on the latency added due to request processing inside an OSD. We are going to do our performance analysis by post-processing execution traces collected using LTTng-UST. Check out Tracing Ceph With LTTng for more information on instrumenting Ceph.

Tracing Ceph with LTTng

Jun 2014 This post demonstrates how to use LTTng-UST to collect execution traces from Ceph. As a driving example we’ll use the traces to identify all instances of lock ownership, and how long each lock is held. This type of analysis could be useful for things like identifying sources of latency. While specific to Ceph, the tracing techniques shown can be applied to any application as a powerful tool for performance analysis and debugging.

Latency impact of RADOS placement group splitting

May 2014 When failure occurs in Ceph, or when more OSDs are added to a cluster, data moves around to re-replicate objects or to re-balance data placement. This movement is minimized by design, but sometimes it is necessary to scale the system in a way that causes a lot of data movement, and will have an impact on performance (though in practice this is a rare event for which scheduled downtime may be reasonable).

OpRequest flow in RADOS OSD server

May 2014 This post is a quick tour of the life cycle of an OpRequest in the Ceph/RADOS storage server. We’ll follow the request from the time the generic message arrives off the network, to the point that the resulting transaction for an object operation hits the low-level object store layer as a transaction. Update June 2017: The request flow described below is out of date, but not severely. Some of the file, class, and method names have changed.

Performance of compound RADOS operations (part 2)

Feb 2014 In Part 1 of this series I looked at the cost of performing a guarded append operation on a single object with varying levels of concurrency. Without parallel journaling mode enabled, the performance of the guarded append doesn’t scale with the number of clients writing because each operation dirties the object, forcing a flush to the data drive to satisfy the read necessary for guard. In contrast, an append-only workload scales well with the number of clients appending.

Performance of compound RADOS operations (part 1)

Feb 2014 A powerful feature of Ceph/RADOS is its ability to atomically execute multiple operations on a single object. For instance, object writes can be combined with updates to a secondary index, and RADOS will guarantee consistency by applying the updates in a transactional context. This functionality is used extensively to construct domain-specific interfaces in projects such as the RADOS-Gateway and RBD. This transactional capability can also make it easier to construct distributed applications through the use of custom interfaces, a simple example being an atomic compare-and-swap primitive.

Fixing a Ceph performance WTF

Jan 2014 Here are initial performance results for a simple write workload on a new Ceph cluster. There are 6 nodes in the cluster with 2 OSDs per node. Each OSD is has a dedicated data drive formatted with XFS, and both OSDs share an SSD for the journal. [email protected]:~$ rados -p iod1 bench 60 write Maintaining 16 concurrent writes of 4194304 bytes for up to 60 seconds or 0 objects Object prefix: benchmark_data_issdm-9_7039 sec Cur ops started finished avg MB/s cur MB/s last lat avg lat 0 0 0 0 0 0 - 0 1 16 32 16 63.

Dynamic RADOS object interfaces with Lua

Jan 2014 In this post I’m going to demonstrate how to dynamically extend the interface of objects in RADOS using the Lua scripting language, and then build an example service for image thumbnail generation and storage that performs remote image processing inside a target object storage device (OSD). We’re gonna have a lot of fun. Note that this is a re-post of the article appearing at which was published on October 29, 2013.


Jan 2014 I recently needed to port the following line of networking code in Ceph to OS X (Ceph is developed almost exclusively on Linux). The MSG_MORE flag is an optimization used to inform the networking layer that more data is going to be sent shortly. The MSG_NOSIGNAL flag is used to block SIGPIPE. Unfortunately both of these macros are not defined on OS X. sendmsg(sd, msg, MSG_NOSIGNAL | (more ? MSG_MORE : 0)); First, since the MSG_MORE flag is an optimization, we can turn it off if it isn’t available (note that while reading about MSG_MORE, it seems that a solution based on TCP_CORK may be possible, but that is for another time).

Tool: SS-DB NetCDF loader

May 2013 A challenge in designing systems for scientific data analysis is a lack of representative data sets and queries. In the world of relational database systems, the TPC benchmarks serve as a common tool for comparing performance. However, there has been little work done in producing benchmarks representative of scientific data analysis workloads. One such solution is the SS-DB benchmark. From the Science Benchmark (SS-DB) website: SS-DB is representative of the processing performed in a number of scientific domains in addition to astronomy, including earth science, oceanography, and medical image analysis.

Digital preservation and economic faults: an excerpt

May 2013 A well articulated description of how economic faults are a threat to digital preservation and access to information. Organizations often stretch their limited budgets simply to get their collections online, leaving little or nothing to ensure continued accessibility. There are ongoing costs for power, cooling, bandwidth, system administration, equipment space, domain registration, renewal of equipment, and so on. Information in digital form is much more vulnerable to interruptions in the money supply than information on paper, and budgets for digital preservation must be expected to vary up and down, possibly even to zero, over time.

Lua GC and linked native heap objects

Feb 2013 I’ve been working on a Lua project that wraps a C++ interface. Included in the interface are two objects that are created with a parent-child relationship. If a reference to the parent disappears and Lua garbage collection reclaims the parent object, using the child object will cause things to blow up. It took me a while to find an example of how to use a weak table to record these relationships that indirectly result in the correct GC policy.

Writing RADOS object class handlers in Lua

Feb 2013 This post describes the API of the Lua object class handler system. In a previous post I provided some motivation for the project, and provided a description of the Lua object class error handling design. Another helpful resource is the Lua script used for internal unit testing that has working examples of the entire API. The previous link is to the C++ unit test suite, but at the top of the file is a long Lua script that is compiled into a string and used in the unit tests.

Just some very good house

Feb 2013 A buddy from grad school has put up another great mix. Thanks Andrew, and enjoy!

Lua bindings for RADOS object class handlers

Feb 2013 The Ceph distributed file system is built on top of a scalable object store called RADOS, which is also used as a basis for several products including RADOS Gateway and RBD. One feature of RADOS is the Object Class system, providing the ability to allow developers to define new object behavior by writing C++ plugins that execute within the context of the storage system nodes, and operate on object data using arbitrary functions.

Simple moving average Awk script

Jan 2013 I often find myself with a lot of multivariate time series data. It’s also usually quite noisy, which makes for hard-to-interpret plots. Taking a simple moving average over the variables is a good way to smooth things out. I use the Awk script below to process my data files, which normally have a format in which the first column is time and the remaining columns contains the value of each variable.

Custom Lua VM panic handler

Jan 2013 Extending applications with Lua is amazingly powerful. The task can be a little mind-bending, but with a bit of practice it all begins to make sense. One challenge with embedding the Lua VM is avoiding the possibility of crashing the host application. This is especially important for high-availability systems such as file system servers. It is good practice to execute everything in a Lua protected environment in which case errors are reported through the normal lua_error path.