
Modeling ZLog in Promela: Part 2

9 December 2017

In this second part we tackle the bug we saw in the first part, which occurred when the sequencer restarted.

This post is part of a series on modeling ZLog in Promela.

Background

The basic issue in the previous version was that objects allowed client operations to succeed even when those clients had stale knowledge about the system state. CORFU handles this through a sealing mechanism combined with retry logic in the client. The basic idea is that any time the system state changes (for instance, when the sequencer restarts), objects are told to reject clients operating with an old view of that state. When a rejection occurs, the clients refresh their state and try again.

Add sealing to the model

We add sealing to the model described in the first part of this series. In the interest of incrementally adding complexity, we’ll keep things simple. The result is a step closer to a model that reflects the real-world implementation, and we’ll point out the missing pieces along the way.

In order for objects in the system to notice an out-of-date client, each message is now tagged with a numeric epoch value. A new message reply type called stale_epoch is added to indicate to a client that it is using an old epoch value. A new operation and message type called seal is added for use by the sequencer when updating the epoch value stored in an object.

diff --git a/qa/spin/zlog.pml b/qa/spin/zlog.pml
index 41248f2..01800c3 100644
--- a/qa/spin/zlog.pml
+++ b/qa/spin/zlog.pml
@@ -3,18 +3,27 @@
 
 mtype {
   ok,
-  write
+  write,
+  stale_epoch,
+
+  maxpos,
+  seal
 };

We now use the seal_req channel for both sealing and requesting the maximum position, and the channel for entry operations is updated to carry the client's epoch value.

 // channel for requesting maximum position from object
-chan maxpos_req[NUM_OBJECTS] = [0] of { chan };
+chan seal_req[NUM_OBJECTS] = [0] of { mtype, chan, byte };
 
 // channel for object entry operations (read, write, etc...)
-chan entry_req[NUM_OBJECTS] = [0] of { mtype, chan, byte };
+chan entry_req[NUM_OBJECTS] = [0] of { mtype, chan, byte, byte };
 
 // channel for sequencer next position requests
 chan nextpos_req = [0] of { chan };
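
For reference, here are the channel declarations with this diff applied (the comment on the first channel is adjusted to reflect its new dual role):

// channel for seal and maxpos requests to an object
chan seal_req[NUM_OBJECTS] = [0] of { mtype, chan, byte };

// channel for object entry operations (read, write, etc...)
chan entry_req[NUM_OBJECTS] = [0] of { mtype, chan, byte, byte };

// channel for sequencer next position requests
chan nextpos_req = [0] of { chan };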

The system state in ZLog is managed by a consensus engine and stores information such as the log striping configuration and the identity of the sequencer. This facility is also used to enforce a global ordering of epoch values. Our model is currently simple enough that we do not need additional global configuration state, and can get by with just a global counter for the epoch.

+// TODO: this epoch value must be replaced by an oracle service such as paxos
+// which also manages the system view state. for now we just want it to
+// increment each time the sequencer restarts.
+byte global_epoch = 0;
+
 #define p2o(pos) (pos % NUM_OBJECTS)

Objects are initialized with no epoch value. Semantically this is equivalent to an epoch value that sorts lower than any epoch a client message is tagged with. The initial state is represented by the has_epoch boolean being false, while obj_epoch holds the currently stored epoch value, if any.

Object

 proctype Object(byte id) {
@@ -27,44 +36,71 @@ proctype Object(byte id) {
   byte i, max;
   bool bare;
 
-  // handling write requests
+  // handling entry requests
   byte pos;

+  // epoch guards and sealing
+  bool has_epoch = false;
+  byte obj_epoch;
+  byte req_epoch;
+

The seal request handler is invoked by the sequencer at the start of its initialization. This is the mechanism used to force clients that race with sequencer initialization to retry their operations.

 end:
   do
+  // handle seal request
+  :: seal_req[id] ? seal(reply, req_epoch) ->
+    if
+    :: has_epoch && (req_epoch <= obj_epoch) -> assert(false)
+    :: else ->
+      d_step {
+        obj_epoch = req_epoch
+        has_epoch = true
+      }
+      reply ! ok
+    fi
+
   // handle maxpos request. if bare, pos is undefined

Now that the object stores an epoch value, each operation it handles needs to check whether the operation's epoch tag is out-of-date. Here we enhance the maxpos handler to verify the tag against the stored epoch. Note that we make a simplification by treating an invalid epoch tag as an outright failure. This is fine in the current model because only one sequencer exists; in later iterations of the model we will want to generalize this.

-  :: maxpos_req[id] ? reply ->
-    d_step {
-      bare = true
-      for (i in entries) {
-        if
-        :: entries[i] > 0 ->
-          bare = false
+  :: seal_req[id] ? maxpos(reply, req_epoch) ->
+    assert(has_epoch)
+    if
+    :: req_epoch != obj_epoch -> assert(false)
+    :: else ->
+      d_step {
+        bare = true
+        for (i in entries) {
           if
-          :: i > max -> max = i
+          :: entries[i] > 0 ->
+            bare = false
+            if
+            :: i > max -> max = i
+            :: else -> skip
+            fi
           :: else -> skip
           fi
-        :: else -> skip
-        fi
+        }
       }
-    }
-    reply ! bare, max
+      reply ! ok, bare, max
+    fi
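
For readability, here is the maxpos handler with the diff applied (reconstructed from the hunk above, whitespace normalized):

  // handle maxpos request. if bare, pos is undefined
  :: seal_req[id] ? maxpos(reply, req_epoch) ->
    assert(has_epoch)
    if
    :: req_epoch != obj_epoch -> assert(false)
    :: else ->
      d_step {
        bare = true
        for (i in entries) {
          if
          :: entries[i] > 0 ->
            bare = false
            if
            :: i > max -> max = i
            :: else -> skip
            fi
          :: else -> skip
          fi
        }
      }
      reply ! ok, bare, max
    fi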

The write handler, however, does return a stale epoch reply when the tagged epoch is out-of-date. This forms the basic mechanism by which a sequencer restart can indirectly force clients to re-initialize from global state and retry their operation. Notice that we still fail if a duplicate entry is written, but this case won't be triggered in the current model once we add retry based on stale epochs. We'll have to add more complexity to the model to trigger it; for now, this solves the problem we had in the first version of the model.

   // handle write request
-  :: entry_req[id] ? write(reply, pos) ->
+  :: entry_req[id] ? write(reply, req_epoch, pos) ->
     if
-    :: entries[pos] > 0 ->
-      assert(false)
+    :: has_epoch && (req_epoch < obj_epoch) -> reply ! stale_epoch
     :: else ->
-      entries[pos] = 1
-      reply ! ok
+      if
+      :: entries[pos] > 0 -> assert(false)
+      :: else ->
+        entries[pos] = 1
+        reply ! ok
+      fi
     fi
   od;
 }
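
Similarly, the write handler with the diff applied:

  // handle write request
  :: entry_req[id] ? write(reply, req_epoch, pos) ->
    if
    :: has_epoch && (req_epoch < obj_epoch) -> reply ! stale_epoch
    :: else ->
      if
      :: entries[pos] > 0 -> assert(false)
      :: else ->
        entries[pos] = 1
        reply ! ok
      fi
    fi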

Sequencer

The primary change to the sequencer is in the initialization process. The basic approach is to seal each object before requesting the maximum position. This fixes the race between initialization and in-flight client operations. The first step is to select a new epoch value, which is chosen to be the next value after the current global epoch value. This epoch is used to seal each object.

 proctype Sequencer() {
@@ -72,25 +108,45 @@ proctype Sequencer() {
   // initialization
   byte i, max;
   bool bare;
+  byte next_epoch;
 
   do
+  // initialization
   :: true ->
-    // initialization. phase 1: send maxpos requests
-    for (i in maxpos_req) {
-      maxpos_req[i] ! maxpos_reply
+    next_epoch = global_epoch + 1
+
+    // phase 1: seal objects
+    for (i in seal_req) {
+      seal_req[i] ! seal(seal_reply, next_epoch)
+    }
+

Note that the sealing here occurs in a pre-defined order; the model could be generalized to explore all orderings if that were useful (see the sketch after the next hunk). Next the sequencer waits for a reply to each seal request, and then dispatches the maximum position request to each object, tagged with the expected epoch value.

+    for (i in seal_req) {
+      seal_reply ? status
+      assert(status == ok)
+    }
+
+    // phase 2: dispatch maxpos request to objects
+    for (i in seal_req) {
+      seal_req[i] ! maxpos(maxpos_reply, next_epoch)
     }
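
As an aside on the ordering point above: if we did want Spin to explore every seal ordering rather than a fixed one, the phase 1 loop could be replaced with a nondeterministic choice. A minimal sketch, assuming NUM_OBJECTS is 2 purely for illustration:

    // hypothetical: let Spin pick which object gets sealed first
    if
    :: seal_req[0] ! seal(seal_reply, next_epoch) ->
       seal_req[1] ! seal(seal_reply, next_epoch)
    :: seal_req[1] ! seal(seal_reply, next_epoch) ->
       seal_req[0] ! seal(seal_reply, next_epoch)
    fi

Since both sends are executable whenever the objects are waiting at their receive loops, Spin will explore both interleavings.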

Back in the model, the maxpos replies are then collected and the initialization is completed by storing the new sequence value, just as in the first version of the model.

-    // initialization. phase 2: reduce replies
+    // phase 3: reduce maxpos requests to init val
     seq = 0;
-    for (i in maxpos_req) {
-      maxpos_reply ? bare, max;
+    for (i in seal_req) {
+      maxpos_reply ? status, bare, max;
+      assert(status == ok)
       if
       :: !bare && (max + 1) > seq
         -> seq = max + 1
       :: else -> skip
       fi;
     }

The final step is to update the global epoch to reflect the new state. Note that in the real-world implementation this is far more complex than simply setting a global value, and includes ensuring that the epoch hasn't changed in the meantime. Since we only have a single sequencer process and the state is restricted to just the epoch value, the following will be sufficient.

+
+    // TODO: far more complex. need a separate process to manage the view state,
+    // but right now the epoch value is the view state.
+    global_epoch = next_epoch;
+
     initok = true
 
   // handle nextpos request
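
Since the sequencer changes span several hunks, it may help to see the result in one piece. The following is reconstructed from the hunks above (reply channels and the status variable carried over from the first part are elided, and whitespace is normalized):

proctype Sequencer() {
  // ... reply channels and status variable declared as in the first part ...

  // initialization
  byte i, max;
  bool bare;
  bool initok = false;
  byte next_epoch;

  do
  // initialization
  :: true ->
    next_epoch = global_epoch + 1

    // phase 1: seal objects
    for (i in seal_req) {
      seal_req[i] ! seal(seal_reply, next_epoch)
    }

    for (i in seal_req) {
      seal_reply ? status
      assert(status == ok)
    }

    // phase 2: dispatch maxpos request to objects
    for (i in seal_req) {
      seal_req[i] ! maxpos(maxpos_reply, next_epoch)
    }

    // phase 3: reduce maxpos requests to init val
    seq = 0;
    for (i in seal_req) {
      maxpos_reply ? status, bare, max;
      assert(status == ok)
      if
      :: !bare && (max + 1) > seq -> seq = max + 1
      :: else -> skip
      fi;
    }

    // update the global epoch (see the TODO note above)
    global_epoch = next_epoch;

    initok = true

  // handle nextpos request
  :: initok ->
end:
    nextpos_req ? nextpos_reply
    nextpos_reply ! seq
  od
}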

Client

The client behaves similarly to before: first it requests a position from the sequencer and then submits the write to the appropriate object. The change comes in the form of a retry. When the object reports a stale epoch, the client grabs the new global epoch value and starts over by contacting the sequencer for a new position. Below we show the diff for one round; in the model this block is duplicated so that we are doing two writes, like before.

@@ -109,19 +165,21 @@ proctype Client() {
   chan entry_reply = [0] of { mtype };
   mtype status;
 
-  nextpos_req ! reply
-  reply ? pos
-  oid = p2o(pos)
-  printf("writing at pos %d\n", pos)
-  entry_req[oid] ! write(entry_reply, pos)
-  entry_reply ? status;
+  byte epoch = global_epoch
 
+write0:
   nextpos_req ! reply
   reply ? pos
   oid = p2o(pos)
-  printf("writing at pos %d\n", pos)
-  entry_req[oid] ! write(entry_reply, pos)
+  entry_req[oid] ! write(entry_reply, epoch, pos)
   entry_reply ? status;
+  if
+  :: status == stale_epoch ->
+    epoch = global_epoch
+    goto write0
+  :: else -> skip
+  fi
+  assert(status == ok)
 }
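
With the diff applied, a single write round in the client reads roughly as follows (declarations carried over from the first part are elided, and the duplicated second round is omitted):

proctype Client() {
  // ... pos, oid, and the sequencer reply channel are declared as in the first part ...
  chan entry_reply = [0] of { mtype };
  mtype status;

  byte epoch = global_epoch

write0:
  nextpos_req ! reply
  reply ? pos
  oid = p2o(pos)
  entry_req[oid] ! write(entry_reply, epoch, pos)
  entry_reply ? status;
  if
  :: status == stale_epoch ->
    epoch = global_epoch
    goto write0
  :: else -> skip
  fi
  assert(status == ok)
}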

Running the model

Now we can have Spin generate a verifier for us and run the model. Right off the bat we get an error reporting that we've exceeded the maximum search depth.
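
(The commands for generating and compiling the verifier aren't shown in this post; presumably they follow the standard Spin workflow, roughly the following.)

$ spin -a zlog.pml     # generate the verifier source, pan.c
$ cc -o pan pan.c      # compile the verifier
$ ./pan                # run the exhaustive search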

$ ./pan
error: max search depth too small
Depth=    9999 States=    1e+06 Transitions= 1.25e+06 Memory=   227.948 t=      0.9 R=   1e+06
Depth=    9999 States=    2e+06 Transitions=  2.5e+06 Memory=   327.069 t=     1.81 R=   1e+06

That can easily be increased by running with the -m option. This fixes the error, but now we observe a new assertion violation.

$ ./pan -m999999
pan:1: assertion violated 0 (at depth 14808)
pan: wrote zlog.pml.trail

Checking the trail file that the verifier produced, we see that Spin is reporting a variable overflow. I'm still not sure how to get more detail about this error, but effectively what is happening is that there is an execution in which the sequencer continually re-initializes, so that the clients never make progress; at some point the epoch value overflows.

$ spin -t zlog.pml
spin: zlog.pml:117, Error: value (256->0 (8)) truncated in assignment
spin: zlog.pml:52, Error: assertion violated
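
As an aside, Spin's standard simulation flags can add detail when replaying a trail. For example, something along these lines (not tried against this model) prints each executed statement along with the values of the global variables, which should make the overflowing variable easier to spot:

$ spin -t -p -g zlog.pml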

There are many ways to deal with this problem in Spin. In a later post we'll show how to prune these uninteresting executions and add scheduling fairness, but at this point we are going to use a simple hack: limit the number of times that the sequencer can re-initialize. This way every execution will eventually give the client a chance to run.

diff --git a/qa/spin/zlog.pml b/qa/spin/zlog.pml
index 01800c3..fe70465 100644
--- a/qa/spin/zlog.pml
+++ b/qa/spin/zlog.pml
@@ -108,12 +108,12 @@ proctype Sequencer() {
   // initialization
   byte i, max;
   bool bare;
-  bool initok = false;
   byte next_epoch;
+  byte init_count = 0;
 
   do
   // initialization
-  :: true ->
+  :: true && init_count < 5 ->
     next_epoch = global_epoch + 1
 
     // phase 1: seal objects
@@ -147,10 +147,10 @@ proctype Sequencer() {
     // but right now the epoch value is the view state.
     global_epoch = next_epoch;
 
-    initok = true
+    init_count++
 
   // handle nextpos request
-  :: initok ->
+  :: init_count > 0 ->
 end:
     nextpos_req ? nextpos_reply
     nextpos_reply ! seq

We’ll also disable the client retry logic to verify that we encounter the stale epoch in the verifier.

@@ -173,12 +173,12 @@ write0:
   oid = p2o(pos)
   entry_req[oid] ! write(entry_reply, epoch, pos)
   entry_reply ? status;
-  if
-  :: status == stale_epoch ->
-    epoch = global_epoch
-    goto write0
-  :: else -> skip
-  fi
+//  if
+//  :: status == stale_epoch ->
+//    epoch = global_epoch
+//    goto write0
+//  :: else -> skip
+//  fi
   assert(status == ok)
 }

And in fact we do, as shown here.

$ ./pan
pan:1: assertion violated (status==5) (at depth 313)
pan: wrote zlog.pml.trail

$ spin -t zlog.pml
spin: zlog.pml:182, Error: assertion violated

When the retry logic in the client is enabled again, things work much better: the verifier runs without encountering any assertion failures.

$ ./pan

(Spin Version 6.4.7 -- 19 August 2017)
        + Partial Order Reduction

Full statespace search for:
        never claim             - (none specified)
        assertion violations    +
        acceptance   cycles     - (not selected)
        invalid end states      +

State-vector 163 byte, depth reached 351, errors: 0
    18865 states, stored
     3983 states, matched
    22848 transitions (= stored+matched)
        9 atomic steps
hash conflicts:        24 (resolved)

This is really great, but there is a lot more to add to the model and to how we run the verifier. Later in this series we are going to explore different ways of dealing with the non-progress behavior we saw in this post, and we'll also introduce additional clients.