Implementing Active/Standby Master Process

Zookeeper has been designed so that we can develop distributed applications easier. For example, we can use it to switch from a failing active process to a standby one, in an active/standby configuration. This is what this blog post is about. It demonstrates the ability to have a process switch from standby to active role, when another active process crashes.

This is the 3rd story on Zookeeper. The previous ones can be found here:

  1. Introduction To Zookeeper
  2. Zookeeper Service Availability

Design and Architecture

Before we delve into actual code let’s see an overview of the architecture of this demo.

Master - Worker Architecture with Active Standby Master Processes

The above is a hypothetical system in which a task producer gives commands to a master. The master distributes the commands to workers. Although we could have had only 1 master process, we want to make master functionality highly available. In order to do that, we bring up multiple processes to play the role of master. However, only 1 master process will be active. The other ones will be standby.

That particular bit of the whole system architecture is the one that we are going to implement here, i.e. the part of the multiple master processes, one being active, whereas, the others being standby.

The rest of the system architecture will be implemented and demonstrated in following Zookeeper stories.

Zookeeper Coordinates Active/Standby Election

In order to implement the active/standby architecture, we will rely on Zookeeper to keep track of the master process. How can we do that? These are the rules of the game:

  1. Assume that the active process owns the active token, whereas the standby does not.
  2. When a process starts, it will request Zookeeper the active token.
  3. If Zookeeper has not given the active token to any other process, it will give the token to the process that is requesting it.
  4. If Zookeeper has given the active token to another process, it will deny the token to the requesting process.
  5. When a process is denied the active token, then it asks Zookeeper to be notified when the active token is free.
  6. When a process is notified that the active token is free, it requests the active token from Zookeeper. It might get it, or it might not get it. This is because in between the notification that active token is free and the request to acquire the token, another process might have grabbed it.
  7. If a process that has the active token crashes, Zookeeper will consider the active token free, and will give it to the next process that will request it.

Zookeeper Properties

The Zookeeper properties that will help us implement this active token concept are:

  1. When Zookeeper is requested to create a node, it will return an error if the node already exists.
  2. Requests to create nodes are serialized and are atomic.
  3. There is a node type which is called ephemeral and which lives as long as the client session lives. It is automatically deleted when the session with the client terminates.
  4. A client can request a watch on an existing node. If the node is deleted, then client is notified.

Implement Active Token

Given the above Zookeeper properties

  1. When a process creates an ephemeral node it will become the active process. We assume that it takes the active token.
  2. Any other process that will try to create the same ephemeral node, it will fail. We assume that this process will be the standby one, without the active token.
  3. The standby process will set a watch on the ephemeral node. Hence, it will be notified when that node will be deleted.
  4. When the active process fails/crashes for any reason, its session with Zookeeper will terminate and Zookeeper will delete the ephemeral node. Hence, any other standby process will be notified.
  5. The standby processes are notified and the first that manages to create the new ephemeral node will become the active node. The others, will ask to be notified again, by installing a new watch on the ephemeral node.

Example Interaction

In the following diagram, you can see how the ephemeral node is being used in order to implement the concept of the active token.

Active token concept implemented with an ephemeral Zookeeper node

The master active process is the one that initially has created the ephemeral Zookeeper node /master. The standby master process fails to create the node, and installs a watch. When the master active process crashes, Zookeeper notifies the standby process that /master ephemeral node does not exist any more. When standby process receives the notification, it creates the /master ephemeral node and becomes the active master process.

Implementation using Ruby

We have had enough of theory. Let’s write some code.

Note: The demo code can be found in this Github repo (tag: active-stand-by-master-nodes)

Master Process

The master process code is given below:

# File: master.rb
#
master_app = MasterApp.new
master_app.connect_to_zk
result = master_app.register_as_active
master_app.watch_for_failing_active unless result
while true
  sleep 3
  puts "I am #{master_app.mode} master"
end

The process does not actually deal with tasks. This will be implemented in future Zookeeper stories. This version above, all that it does is to be in a infinite loop printing its mode in between 3 second sleeps.

The master process mode will be either active or standby.

  1. The implementation relies on the class MasterApp. Read about it further down in the post.
  2. The process initially connects to the Zookeeper server: master_app.connect_to_zk
  3. Then it tries to register as an active master process: result = master_app.register_as_active
  4. If it manages to register as active (result being true), it goes to the main loop.
  5. If it does not manage to register as active (result being false), it registers a watch to be notified if the active process crashes/fails. Then it goes to the main loop.

Note that our client master process has another thread waiting for notifications from Zookeeper. Hence, while it may be in the while loop, this does not prevent the standby process from being notified by the Zookeeper that ephemeral node has been deleted.

MasterApp

The MasterApp class is the one that has the logic to connect to Zookeeper and register as active or standby.

# File: master_app.rb
#
require 'zookeeper'
require 'zookeeper_client_configuration'
require 'zookeeper_client_api_result'
require 'zookeeper_client_watcher_callback'

class MasterApp
  MASTER_NODE = '/master'

  attr_reader :mode

  def initialize
    self.mode = :not_connected
  end

  def connect_to_zk
    @zookeeper_client = Zookeeper.new(zookeeper_client_configuration.servers)
  end

  def register_as_active
    result = create_ephemeral_node(MASTER_NODE, Process.pid)
    if result.no_error?
      self.mode = :active
      true
    elsif result.node_already_exists?
      self.mode = :standby
      false
    else
      raise "ERROR: Cannot start master app: #{result.inspect}"
    end
  end

  def watch_for_failing_active
    watcher_callback = Zookeeper::Callbacks::WatcherCallback.create do |callback_object|
      callback_object = ZookeeperClientWatcherCallback.new(callback_object)

      if callback_object.node_deleted?(MASTER_NODE)
        result = register_as_active
        watch_for_failing_active unless result
      end
    end

    zookeeper_client.stat(path: MASTER_NODE, watcher: watcher_callback)
  end

  private

  attr_reader :zookeeper_client
  attr_writer :mode

  def zookeeper_client_configuration
    ZookeeperClientConfiguration.instance
  end

  def create_ephemeral_node(node, data)
    ZookeeperClientApiResult.new(zookeeper_client.create(path: node, data: data.to_s, ephemeral: true))
  end
end

zookeeper gem

The communication of the master process with the Zookeeper server relies on the zookeeper gem.

Keeping Track of Mode

There is an @mode instance variable that takes the values :not_connected, :active and :standby. The process that manages to create the ephemeral node, it will be in mode :active. Otherwise, :standby. When initializing, it is :not_connected.

MasterApp#register_as_active

This is the method that tries to register the process as active.

# File: master_app.rb
#
...
def register_as_active
  result = create_ephemeral_node(MASTER_NODE, Process.pid)
  if result.no_error?
    self.mode = :active
    true
  elsif result.node_already_exists?
    self.mode = :standby
    false
  else
    raise "ERROR: Cannot start master app: #{result.inspect}"
  end
end
...  
  1. It tries to create the ephemeral node.
  2. If it succeeds, then it changes its mode to :active and returns true
  3. If it fails and the error code is that the node already exists, then it stays to :standby and returns false

Watch For Failing Active

If the master process does not manage to get the active token, i.e. to create the ephemeral node, then it will attach a watch on the existing ephemeral node, in order to be notified by Zookeeper for any change in that node.

# File: master_app.rb
#
...
def watch_for_failing_active
  watcher_callback = Zookeeper::Callbacks::WatcherCallback.create do |callback_object|
    callback_object = ZookeeperClientWatcherCallback.new(callback_object)
    if callback_object.node_deleted?(MASTER_NODE)
      result = register_as_active
      watch_for_failing_active unless result
    end
  end
  zookeeper_client.stat(path: MASTER_NODE, watcher: watcher_callback)
end
...  

In order for a client process to install a watch, it first creates the callback and then calls the #stat method to install it.

Defining Watch Callback

This is how the watch callback is defined:

# File: master_app.rb
#
...
watcher_callback = Zookeeper::Callbacks::WatcherCallback.create do |callback_object|
  callback_object = ZookeeperClientWatcherCallback.new(callback_object)
  if callback_object.node_deleted?(MASTER_NODE)
    result = register_as_active
    watch_for_failing_active unless result
  end
end
...

The code inside the block, is going to be called / executed when Zookeeper notifies our process. It is given a callback_object, that we then wrap into ZookeeperClientWatcherCallback in order to interpret its content. If the callback_object informs us that the ephemeral node has been deleted, then we call register_as_active. If register_as_active succeeds, then we are now the active process, but if it fails, we call the watch_for_failing_active method again.

Registering Watch

After defining the callback logic, we need to register the watch:

# File: master_app.rb
#
...
zookeeper_client.stat(path: MASTER_NODE, watcher: watcher_callback)
...

The method #stat tells Zookeeper to return back to us the status of the node given in the path argument. Also, we pass the watcher argument specifying how we want to get notified for any changes in that node.

Creating Ephemeral Nodes

The method that creates the ephemeral node is:

# File: master_app.rb
#
...
def create_ephemeral_node(node, data)
  ZookeeperClientApiResult.new(zookeeper_client.create(path: node, data: data.to_s, ephemeral: true))
end
...

The #create() method called on the Zookeeper client, is also sent the ephemeral: true argument. This is the way to create an ephemeral node.

Interpreting API Result

The MasterApp class relies on ZookeeperClientApiResult class in order to interpret the result/response from Zookeeper.

# File: zookeeper_client_api_result.rb
#
class ZookeeperClientApiResult
  def initialize(result)
    @result = result
  end

  def request_id
    result[:req_id]
  end

  alias :req_id :request_id

  def node_already_exists?
    result[:rc] == Zookeeper::Constants::ZNODEEXISTS
  end

  def no_error?
    result[:rc] == 0
  end

  attr_reader :result
end
  1. If the result code (:rc) is equal to Zookeeper::Constants::ZNODEEXISTS, then we know that we have tried to create a node that was already there.
  2. If the result code (:rc) is 0, then no error has been returned and everything succeeded at the Zookeeper side.
  3. The :reg_id key returns the request identifier.

Interpreting Watcher Callback

When Zookeeper calls our master process back, which happens when the ephemeral node is deleted, it wraps information inside an object that exposes two important properties:

  1. the type property, which tells us what type of event has taken place, and
  2. the path property, which tells us which is the node the event has taken place against.

We take advantage of these pieces of information above within the class ZookeeperClientWatcherCallback:

# File: zookeeper_client_watcher_callback.rb
#
class ZookeeperClientWatcherCallback
  def initialize(callback_object)
    @callback_object = callback_object
  end

  def node_deleted?(node)
    callback_object.type == Zookeeper::Constants::ZOO_DELETED_EVENT && callback_object.path == node
  end

  private

  attr_reader :callback_object
end

The method #node_deleted?() checks whether the node given is the one that has just been deleted.

Example Run

The README.md file, in the source code of this demo, explains how you can configure and run this demo.

Closing Note

In this blog post, we have learnt

  1. About ephemeral Zookeeper nodes.
  2. About Zookeeper commands being synchronized and atomic.
  3. How to use the zookeeper gem to create ephemeral nodes and how to install watches on them.
  4. How to process callback calls from Zookeeper when an ephemeral node is deleted.
  5. How to create an active/standby process architecture using all the above.

Thank you for reading this blog post. Your comments below are more than welcome. I am willing to answer any questions that you may have and give you feedback on any comments that you may post. Don’t forget that, I learn from you as much as you learn from me.

About the Author

Panayotis Matsinopoulos works as Development Lead at Simply Business and, on his free time, enjoys giving and taking classes about Web development at Tech Career Booster.

Footer