Cephalocon 2019 was held in Barcelona right before KubeCon. Barcelona and the convention center there are really great for holding conferences. The talk focused on building RADOS object classes with C++ and Lua, and prompted some really interesting questions and follow-up conversations.
Cephalocon 2018 was held in Beijing. We flew over from California and returned three days later. That was some brutal travel, but the event was great. The talk focused on building a distributed log on top of Ceph which was part of my graduate school research.
Mar 2019 In this post we are going to take a high-level look at the smf rpc framework, and walk through the process of creating a minimal client/server example. This post will show how to set up a build tree, and bootstrap the client and server to exchange messages. We’ll then briefly expand the example to include server-side state management. The goal of the smf project is to provide a high-performance rpc implementation for the Seastar framework.
I gave a talk last fall on running Lua inside of the Ceph distributed storage system at the 2017 Lua Workshop in San Francisco. This was a small close-knit group of people that were really passionate about the Lua language. I can’t say I’m a Lua expert, or even a frequent user of Lua, but I had a great time giving the talk and listening to feedback.
The Ceph OSD contains a state machine that encodes different operational aspects of the system such as peering and recovery. The state machine is built using the Boost Statechart library, and using a plugin for Clang the state machine can be programmatically extracted and transformed into a graphical representation. I’ve extracted the latest version of the state machine, and posted my notes on reproducing the visualization.
Welcome back to our series on the architecture of CruzDB. In the previous post we discussed how afterimages and persistent pointers are managed in order to enable parallel I/O. Today’s post is short, and covers the implementation of the database catalog.
It’s time for the fourth post in our on-going series about the architecture of CruzDB. In the previous post we took a detailed look at transaction processing and what exactly is happening when a transaction intention in the log is replayed. We saw how a commit or abort decision is made, and how a new database state is created when a transaction commits. In this post we take a look at the challenge of increasing transaction throughput by writing database snapshots in parallel.
This is the third post in a series of articles exploring the architecture of CruzDB. In the previous post we saw how the copy-on-write tree structure used in CruzDB is serialized into the log, and today we’ll examine how intentions in the log are replayed to produce new versions of the database. We’ll see how transactions are analyzed to determine if they commit or abort, and also how metadata is stored in the database itself to accelerate the conflict analysis process.
This is the second post in a series of articles examining the architecture of CruzDB. In the previous post we examined the basics of how transactions are stored in an underlying shared-log, and began to discuss the distributed operation of the database. In this post we are going to examine how the database is physically stored in the log as a serialized copy-on-write binary tree. In addition, we’ll cover some complications that appear in a distributed setting such as duplicate snapshots, and how they are handled.
CruzDB is a distributed shared-data key-value store that manages all its data in a single, high-performance distributed shared-log. It’s been one of the most challenging and interesting projects I’ve hacked on, and this post is the first in a series that will explore the current implementation of the system, and critically, where things are headed.
Jan 2018 This post outlines systems technology that I hope to see form the basis for research projects in the upcoming year. Most of the technologies covered can stand alone as a key enabler for individual research directions when paired with an interesting problem. But I’m most excited about the possibilities when these technologies are combined in unique ways. I’ll cover big memory systems, languages and run-times, and tools and frameworks. While I don’t have suggestions for specific research topics, what I hope to do in this post is get everyone excited about the technologies that I haven’t been able to stop thinking about all of last year.
Dec 2017 In the second part of this series we are going to tackle the bug we saw in the first part, which occurred when the sequencer restarted. Background: The basic issue in the previous version was that objects allowed client operations to succeed when those clients had stale knowledge about the system state. CORFU handles this through a mechanism called seal, and retry logic in the client.
Dec 2017 Introduction blurb about this post. Articles in this series: This post is part of the following series: ZLog Links Basic ZLog Is it correct? Modeling Basic idea about modeling Other approaches We are using Spin in this article A ZLog Spin Model We start with a constrained and simplified model of the system. Over the course of several posts we will increase the detail of the model until it accurately reflects the real-world implementation.
Jun 2017 In our previous post on CruzDB performance we examined initial performance results for the YCSB benchmark. The results showed that a larger cache had a significant benefit for performance (no surprise), but we also observed that even for read-only workloads throughput was not scaling with the number of threads. In this post we address this issue and present new scalability results. CruzDB is structured as a copy-on-write red-black tree where every new version is stored in an underlying shared-log.
Jun 2017 The CruzDB database system is a log-structured key-value store that provides serialized transactions over a distributed high-performance log. We have been working on this system for a little over a year and we just finished bindings for the YCSB workload generator. Today we’ll be previewing some of our initial performance results. We’ll cover some of the basics of the system, but will save a discussion of many of the technical details for another day.
Dec 2016 In Adding a new placement group operation in Ceph I demonstrated how to add a new operation in RADOS that operates at the placement group level, allowing one operation to operate on multiple objects. Recently I’ve been experimenting more with operations at the placement group level, and found interesting performance behavior when reading multiple objects within a single PG operation. The two graphs below show the results of four experiments that each read 1000 small objects from a pool with eight PGs.
Dec 2016 In this post we are going to create a librados operation in Ceph that operates at the level of the placement group (most RADOS operations act upon objects). As a demonstration we’ll build an interface that computes the checksum of all object data in a placement group. This probably isn’t useful to anyone, but it exercises a lot of interesting internal machinery. The overall approach is adapted from the code paths used to list objects in a pool.
Oct 2016 The userfaultfd feature in the Linux kernel allows userspace to handle page faults and some other memory management tasks. For example a missing page can be handled by paging in from a remote source, or write-protecting pages and handling write events. The initial user of this feature is QEMU post-copy live migration where a live VM running on a destination node is demand paging-in guest memory, and QEMU is handling the network transfer.
Aug 2016 Today I’m going to introduce a new project that we have started, intended to provide a general purpose, distributed, transactional key-value store on top of ZLog. Readers of this blog should be familiar with ZLog as the high-performance distributed shared-log that runs on top of Ceph and RADOS. For readers unfamiliar with the ZLog project I recommend reading this. In a nutshell the ZLog project provides a high-performance shared-log abstraction with serializability guarantees.
Jun 2016 This post provides an update on the ZLog project, including the latest performance enhancements, and where we are heading next. If you aren’t familiar with ZLog it is a high-performance distributed shared-log. It is unique in that it maintains serializability while providing high append throughput, which makes it an ideal candidate in building things like replicated state-machines, distributed block devices, and databases. The design of ZLog is based on the CORFU protocol, but adapted for a software-defined storage environment.
Jan 2016 The object interface in RADOS can be customized using a feature called object classes. Object classes can be authored in C++, or dynamically injected using Lua scripts attached to each request, loaded from a file system, or managed by Ceph. This post takes a quick look at how much this facility is used. When object interfaces change, the interface to data changes and must be managed carefully to avoid losing access to data or degrading performance.
Standard object classes in RADOS are managed using a static versioning and distribution scheme, but this may be restrictive for dynamically defined interfaces. In this post we describe a proof-of-concept implementation for dynamically managing object interfaces.
Dec 2015 Object classes in RADOS written in Lua have up until now been limited to scripts that are embedded into every client request. This post describes how we have extended RADOS to load Lua scripts from the local file system, supporting a new way to manage object interfaces written in Lua. Introduction The RADOS object store that powers Ceph supports an active storage-like feature called object classes that allow custom object interfaces to be defined.
As previously discussed ZLog is an implementation of the CORFU distributed log protocol on top of Ceph. In the post describing ZLog we didn’t dig too deeply into the client API. In this post we will discuss the basics of using the API, and provide details on the new asynchronous API design.
As we saw in the last post on setting up Hadoop on Ceph there were a lot of steps that cause usability to suffer. In this post we’ll check out a variety of storage systems that can function as an alternative to HDFS in Hadoop environments to see what other systems are doing to ease the pain.
Jul 2015 It has been possible for several years now to run Hadoop on top of the Ceph file system using a shim layer that maps between the HDFS abstraction and the underlying Ceph file interface. Since then bug fixes and performance enhancements have found their way into the shim, but usability has remained a sore area primarily due to the lack of documentation, and low-level setup required in many instances. This post marks the beginning of a series of posts on using Hadoop on top of Ceph.
Jan 2015 In this post we’ll take a stab at finding a counterexample to Beal’s Conjecture, which states that if a^x + b^y = c^z, where a, b, c, x, y, and z are positive integers and x, y and z are all greater than 2, then a, b, and c must have a common prime factor. There is a monetary prize offered by Andrew Beal for a proof or counterexample to the conjecture.
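As a rough illustration of what such a search looks like, here is a minimal brute-force sketch in Python. It is not the code from the post: the function name `beal_search` and its parameters are hypothetical, and the bounds are tiny compared to what a serious search would need. It precomputes every power c^z with z > 2, then checks whether each sum a^x + b^y lands on one of them with a, b, and c sharing no common factor.

```python
from math import gcd


def beal_search(max_base, max_exp, require_coprime=True):
    """Find (a, x, b, y, c, z) with a**x + b**y == c**z and x, y, z > 2.

    Bases range over 1..max_base and exponents over 3..max_exp. With
    require_coprime=True only candidate counterexamples are returned,
    i.e. solutions where a, b, and c have no common factor.
    """
    # Map each value c**z to the (c, z) pairs that produce it.
    powers = {}
    for c in range(1, max_base + 1):
        for z in range(3, max_exp + 1):
            powers.setdefault(c ** z, []).append((c, z))

    hits = []
    for a in range(1, max_base + 1):
        for x in range(3, max_exp + 1):
            for b in range(a, max_base + 1):  # b >= a skips mirrored pairs
                for y in range(3, max_exp + 1):
                    for c, z in powers.get(a ** x + b ** y, []):
                        if not require_coprime or gcd(gcd(a, b), c) == 1:
                            hits.append((a, x, b, y, c, z))
    return hits
```

Calling `beal_search(20, 5, require_coprime=False)` includes solutions such as 2^3 + 2^3 = 2^4, where the common factor 2 is exactly what the conjecture predicts.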
In this post we’ll create an arm64 build of Ceph. The main issue faced is an unmet dependency on Ubuntu 14.10 (arch=arm64) for building the Ceph Debian packages.
In this post we’ll create a network-backed tmpfs by constructing a RAID-0 array of remote RAM disks using TGT and iSCSI. We’ll export two 25 GB remote RAM disks from a remote node, and use mdadm to create a local RAID device. Then we’ll format them with ext4 and disable journaling for a fast in-memory
In this post I’ll show you how to use iSER, iSCSI, and LIO to set up a remote RAM disk. This is useful if you need high IOPS but don’t have access to a bunch of SSDs or NVRAM. Note that the performance achieved in this post is quite low compared to what you should be able to achieve with different hardware. Currently the arm64 machines we are using aren’t getting the performance expected, and tuning is ongoing. However, the description of the steps here is relevant for other installations. Once you create several remote RAM disks, tie them together with RAID-0 or dm-linear.
Notes on setting up RoCE (RDMA over Converged Ethernet) on aarch64 running Ubuntu Server.
Nov 2014 In a previous post we discussed the design of zlog, our implementation of the CORFU distributed shared-log protocol on top of Ceph. A key component of the system is the sequencer server that reports the current log tail to clients. In this post we’ll discuss the implementation and performance of the sequencer in zlog. The fast path of the sequencer server is simple. It contains an in-memory counter that is incremented when a client requests the next position in the log.
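The fast path described above can be sketched in a few lines of Python. This is only a toy model under stated assumptions: the class and method names (`Sequencer`, `next_position`, `current_tail`) are hypothetical and are not the zlog API, and the real sequencer is a network server rather than an in-process object.

```python
import threading


class Sequencer:
    """Toy in-memory sequencer: hands out consecutive log positions."""

    def __init__(self, tail=0):
        self._tail = tail              # next unassigned log position
        self._lock = threading.Lock()  # serializes concurrent requests

    def next_position(self):
        """Return the current tail and advance it by one."""
        with self._lock:
            pos = self._tail
            self._tail += 1
            return pos

    def current_tail(self):
        """Report the tail without advancing it."""
        with self._lock:
            return self._tail
```

Each call to `next_position()` returns a position that no other caller receives, which is the property the rest of the log protocol relies on.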
Oct 2014 Distributed logs have been receiving a lot of attention lately. And rightfully so—as a building block, they are a basic concept that in many instances can simplify the construction of distributed systems. But building a distributed log is no simple task. In this post I will share the design of zlog, our implementation of a globally consistent distributed log on top of Ceph. The implementation of zlog is based on the novel CORFU protocol for building high-performance distributed shared-logs.
Jun 2014 How fast can RADOS process a request? The answer depends on a lot of factors such as network and I/O performance, operation type, and all sorts of flavors of contention that limit concurrency. Today we’ll focus on the latency added due to request processing inside an OSD. We are going to do our performance analysis by post-processing execution traces collected using LTTng-UST. Check out Tracing Ceph With LTTng for more information on instrumenting Ceph.
Jun 2014 This post demonstrates how to use LTTng-UST to collect execution traces from Ceph. As a driving example we’ll use the traces to identify all instances of lock ownership, and how long each lock is held. This type of analysis could be useful for things like identifying sources of latency. While specific to Ceph, the tracing techniques shown can be applied to any application as a powerful tool for performance analysis and debugging.
May 2014 When failure occurs in Ceph, or when more OSDs are added to a cluster, data moves around to re-replicate objects or to re-balance data placement. This movement is minimized by design, but sometimes it is necessary to scale the system in a way that causes a lot of data movement, and will have an impact on performance (though in practice this is a rare event for which scheduled downtime may be reasonable).
This post is a quick tour of the life cycle of an OpRequest in the Ceph/RADOS storage server. We’ll follow the request from the time the generic message arrives off the network, to the point that the resulting transaction for an object operation hits the low-level object store layer as a
Feb 2014 In Part 1 of this series I looked at the cost of performing a guarded append operation on a single object with varying levels of concurrency. Without parallel journaling mode enabled, the performance of the guarded append doesn’t scale with the number of clients writing because each operation dirties the object, forcing a flush to the data drive to satisfy the read necessary for the guard. In contrast, an append-only workload scales well with the number of clients appending.
Feb 2014 A powerful feature of Ceph/RADOS is its ability to atomically execute multiple operations on a single object. For instance, object writes can be combined with updates to a secondary index, and RADOS will guarantee consistency by applying the updates in a transactional context. This functionality is used extensively to construct domain-specific interfaces in projects such as the RADOS-Gateway and RBD. This transactional capability can also make it easier to construct distributed applications through the use of custom interfaces, a simple example being an atomic compare-and-swap primitive.
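To make the compare-and-swap idea concrete, here is a small Python sketch of the semantics only. It does not use librados or any RADOS API. `ToyObject` is a hypothetical in-memory stand-in, with a lock playing the role of the atomicity that RADOS provides when it applies a compound operation to a single object.

```python
import threading


class ToyObject:
    """In-memory stand-in for a single object with atomic updates."""

    def __init__(self, value=b""):
        self._value = value
        self._lock = threading.Lock()

    def read(self):
        with self._lock:
            return self._value

    def compare_and_swap(self, expected, new):
        """Replace the value with `new` only if it equals `expected`.

        The comparison and the write happen under one lock, so no other
        update can slip in between them. Returns True on success.
        """
        with self._lock:
            if self._value != expected:
                return False
            self._value = new
            return True
```

A client that loses the race sees `False`, re-reads the object, and retries with the fresh value.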
Here are initial performance results for a simple write workload on a new Ceph cluster. There are 6 nodes in the cluster with 2 OSDs per node. Each OSD has a dedicated data drive formatted with XFS, and both OSDs share an SSD for the journal.
In this post I’m going to demonstrate how to dynamically extend the interface of objects in RADOS using the Lua scripting language, and then build an example service for image thumbnail generation and storage that performs remote image processing inside a target object storage device (OSD). We’re gonna have a lot of fun.
I recently needed to port the following line of networking code in Ceph to OS X (Ceph is developed almost exclusively on Linux). The MSG_MORE flag is an optimization used to inform the networking layer that more data is going to be sent shortly. The MSG_NOSIGNAL flag is used to suppress SIGPIPE. Unfortunately neither of these macros is defined on OS X.
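The general shape of this kind of portability shim can be shown with Python's `socket` module, which exposes platform flags only where the platform defines them. This is an analogy and not the C++ change made in Ceph: the helper name `portable_send_flags` is hypothetical, and falling back to 0 simply drops the flag, so a platform lacking MSG_NOSIGNAL still needs some other way to avoid SIGPIPE.

```python
import socket


def portable_send_flags(more=False):
    """Build send() flags using only what this platform defines.

    A flag that is missing from the socket module contributes 0, so the
    resulting value is always safe to pass to socket.send().
    """
    flags = getattr(socket, "MSG_NOSIGNAL", 0)
    if more:
        # MSG_MORE is only a batching hint, so omitting it is harmless.
        flags |= getattr(socket, "MSG_MORE", 0)
    return flags
```

A caller would then write `sock.send(data, portable_send_flags(more=True))` without any platform checks at the call site.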
May 2013 A challenge in designing systems for scientific data analysis is a lack of representative data sets and queries. In the world of relational database systems, the TPC benchmarks serve as a common tool for comparing performance. However, there has been little work done in producing benchmarks representative of scientific data analysis workloads. One such solution is the SS-DB benchmark. From the Science Benchmark (SS-DB) website: SS-DB is representative of the processing performed in a number of scientific domains in addition to astronomy, including earth science, oceanography, and medical image analysis.
May 2013 A well articulated description of how economic faults are a threat to digital preservation and access to information. Organizations often stretch their limited budgets simply to get their collections online, leaving little or nothing to ensure continued accessibility. There are ongoing costs for power, cooling, bandwidth, system administration, equipment space, domain registration, renewal of equipment, and so on. Information in digital form is much more vulnerable to interruptions in the money supply than information on paper, and budgets for digital preservation must be expected to vary up and down, possibly even to zero, over time.
Feb 2013 I’ve been working on a Lua project that wraps a C++ interface. Included in the interface are two objects that are created with a parent-child relationship. If a reference to the parent disappears and Lua garbage collection reclaims the parent object, using the child object will cause things to blow up. It took me a while to find an example of how to use a weak table to record these relationships that indirectly result in the correct GC policy.
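The same idea can be sketched outside of Lua. The Python analogue below uses `weakref.WeakKeyDictionary` in place of a weak-keyed Lua table. The names `Parent`, `Child`, `make_child`, and `parent_of` are hypothetical and are not taken from the project. The table maps each child to its parent: the entry holds a strong reference to the parent for as long as the child is alive, and disappears once the child is collected.

```python
import weakref


class Parent:
    def __init__(self, name):
        self.name = name


class Child:
    pass


# Weak keys: an entry lives only as long as its child does.
# Strong values: while the entry exists, the parent cannot be collected.
_parent_of = weakref.WeakKeyDictionary()


def make_child(parent):
    """Create a child and record which parent it depends on."""
    child = Child()
    _parent_of[child] = parent
    return child


def parent_of(child):
    """Look up the parent recorded for a child."""
    return _parent_of[child]
```

Dropping every direct reference to the parent is now safe as long as a child is still reachable.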
Feb 2013 This post describes the API of the Lua object class handler system. In a previous post I provided some motivation for the project, and provided a description of the Lua object class error handling design. Another helpful resource is the Lua script used for internal unit testing that has working examples of the entire API. The previous link is to the C++ unit test suite, but at the top of the file is a long Lua script that is compiled into a string and used in the unit tests.
Feb 2013 A buddy from grad school has put up another great mix. Thanks Andrew, and enjoy!
Feb 2013 The Ceph distributed file system is built on top of a scalable object store called RADOS, which is also used as a basis for several products including RADOS Gateway and RBD. One feature of RADOS is the Object Class system, which allows developers to define new object behavior by writing C++ plugins that execute within the context of the storage system nodes, and operate on object data using arbitrary functions.
Jan 2013 I often find myself with a lot of multivariate time series data. It’s also usually quite noisy, which makes for hard-to-interpret plots. Taking a simple moving average over the variables is a good way to smooth things out. I use the Awk script below to process my data files, which normally have a format in which the first column is time and the remaining columns contain the value of each variable.
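For readers who prefer Python to Awk, the smoothing step might look like the sketch below. It is not a port of the script from the post. The function name `moving_average` and the choice of a trailing window are assumptions, and the first few rows are averaged over however many samples are available so far.

```python
from collections import deque


def moving_average(rows, window):
    """Smooth each variable with a trailing simple moving average.

    rows: sequence of [time, v1, v2, ...] with numeric values.
    Returns rows of the same shape: the time column is passed through
    and each variable is replaced by the mean of its last `window` samples.
    """
    recent = deque(maxlen=window)  # holds the last `window` value rows
    smoothed = []
    for row in rows:
        recent.append(row[1:])
        n = len(recent)
        # zip(*recent) regroups the buffered rows into per-variable columns.
        smoothed.append([row[0]] + [sum(col) / n for col in zip(*recent)])
    return smoothed
```

For example, with `window=2` the input rows `[0, 1.0, 10.0]`, `[1, 3.0, 20.0]`, `[2, 5.0, 30.0]` smooth to `[0, 1.0, 10.0]`, `[1, 2.0, 15.0]`, `[2, 4.0, 25.0]`.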
Jan 2013 Extending applications with Lua is amazingly powerful. The task can be a little mind-bending, but with a bit of practice it all begins to make sense. One challenge with embedding the Lua VM is avoiding the possibility of crashing the host application. This is especially important for high-availability systems such as file system servers. It is good practice to execute everything in a Lua protected environment in which case errors are reported through the normal lua_error path.