Zookeeper Service Availability

Zookeeper is a service that can be replicated so that one can have multiple servers running on different machines, essentially having a distributed Zookeeper service. This blog post explains, using a Ruby example, when Zookeeper service is available and when not.

This is the 2nd story on our encounter with Zookeeper. If you have not read the first one, you can read it here.

Client Switches To Available Server

Having a cluster of multiple Zookeeper servers, a Zookeeper client can connect to anyone of them and start requesting information and issuing commands. But, what does it happen if the server, that the client is connected to, goes down? The client is designed in such a way so that it can pick up the next available server in order to continue with its interaction with the Zookeeper cluster.

Let’s see how this is done.

Note: Please, read the first part of these Zookeeper stories in which we explain how you can set up 3 Zookeeper servers to run on your local machine.

Start 3 Zookeeper Servers

Let’s start 3 Zookeeper servers on different terminals. We will use the start-foreground option, so that we can watch their logs:

Start 1st Server

In Zookeper folder:

$ bin/zkServer.sh start-foreground ~/Documents/zookeeper-3.4.10/conf/1.cfg
... console output will be displayed here ...

Start 2nd Server

Using another terminal window start 2nd server:

In Zookeeper folder:

$ bin/zkServer.sh start-foreground ~/Documents/zookeeper-3.4.10/conf/2.cfg
... console output will be displayed here ...

Start 3rd Server

Using another terminal window start 3rd server:

In Zookeeper folder:

$ bin/zkServer.sh start-foreground ~/Documents/zookeeper-3.4.10/conf/3.cfg
... console output will be displayed here ...

Write Ruby Client

Now, let’s write the Ruby client program. Save the following as main.rb:

In your working folder:

require 'zookeeper'

zookeeper = Zookeeper.new('127.0.0.1:2181,127.0.0.1:2182,127.0.0.1:2183')

while true
  data = zookeeper.get(path: '/test')
  puts data.inspect
  sleep(5)
end

This is a very simple client that will be fetching data for the znode /test, repeatedly, sleeping for 5 seconds between consecutive fetches. On line 3, we tell this client that we have a cluster of 3 Zookeeper services. The idea here is that the client will pick up one and set up a connection with it, in order to do the fetches. It will be continuously connected to this server, as long as this server is alive. However, if this server fails for some reason, it will automatically switch to the next available server.

Note: We assume that you have a znode created with the name /test. If you don’t have that, read the first part of these Zookeeper stories, in order to learn how you can do it.

Start Ruby Program

Now start the Ruby program and watch the logs of the 3 servers.

Note: I am assuming that you have a Gemfile created in your project folder with the following content:

source :rubygems

gem 'zookeeper'

In your working folder:

$ bundle exec ruby main.rb
{:req_id=>0, :rc=>0, :data=>"my_data", :stat=>#<Zookeeper::Stat:0x007fcfb39afa80 @exists=true, @czxid=8589934594, @mzxid=8589934594, @ctime=1504002919154, @mtime=1504002919154, @version=0, @cversion=1, @aversion=0, @ephemeralOwner=0, @dataLength=7, @numChildren=1, @pzxid=8589934597>}
{:req_id=>1, :rc=>0, :data=>"my_data", :stat=>#<Zookeeper::Stat:0x007fcfb3c77560 @exists=true, @czxid=8589934594, @mzxid=8589934594, @ctime=1504002919154, @mtime=1504002919154, @version=0, @cversion=1, @aversion=0, @ephemeralOwner=0, @dataLength=7, @numChildren=1, @pzxid=8589934597>}
... more output here ... every 5 seconds ...

Watching Logs

On one of your servers you will see something like this:

... (more output here) ...
... (more output here) ... Accepted socket connection from /127.0.0.1:50674
... (more output here) ... Connection request from old client /127.0.0.1:50674; will be dropped if server is in r-o mode
... (more output here) ... Client attempting to establish new session at /127.0.0.1:50674
... (more output here) ... INFO  [CommitProcessor:2:ZooKeeperServer@687] - Established session 0x25e562c9e7a0000 with negotiated timeout 20001 for client /127.0.0.1:50674

This will be the server that would have accepted the connection from your client. Note that this is not necessarily the first server.

Watch the id of the established session (look at line 4 in the log above). In my case this was 0x25e562c9e7a0000. This is the unique session id and you will notice that it will persist even if we switch to another server.

Kill Server that Serves Client

Now, go ahead and use Ctrl + C to stop the server that received the client connection. When you do that, you will see another server picking up, and your Ruby client will continue fetching data.

This is what you will see on another server:

... (more output here) ...
... (more output here) ... Accepted socket connection from /127.0.0.1:50708
... (more output here) ... Connection request from old client /127.0.0.1:50708; will be dropped if server is in r-o mode
... (more output here) ... Client attempting to renew session 0x25e562c9e7a0000 at /127.0.0.1:50708
... (more output here) ... Established session 0x25e562c9e7a0000 with negotiated timeout 20001 for client /127.0.0.1:50708

That is a long output. But the last 4 lines show that the Client attempting to renew session 0x25e562c9e7a0000. This is very clear. Our client does not give up when the original server fails. It automatically tries to connect to another server and renew the session. Look at the session id displayed now: 0x25e562c9e7a0000. It is the same session id that was printed by the server the client was originally connected to.

The above experiment proves that:

  1. Our client will try to reestablish a failed connection, by connecting to another server in the Zookeeper cluster.
  2. It will connect to the new server and will keep on being in the same Zookeeper session.

Kill Zookeeper Again

Now, go ahead and kill the next Zookeeper server, the one that our client is talking to at the moment. What will happen? Out of the 3 initial Zookeeper servers, we are going to have only 1. How will client behave?

Unfortunately, this is a situation in which Zookeeper cluster will not work and client will get a session time out exception, and, stop sending requests to fetch data. It will terminate.

What has happened here is that Zookeeper cluster is not working because there are not enough servers to make sure that we have proper replication and Leader election. Number of live servers is less than the majority/quorum needed in order for the cluster to work.

Stop the Last Server

Now, go ahead and stop the last server too.

Repeat with 5 servers

You can repeat the previous experiment but with 5 Zookeeper servers in the cluster. You will see that Zookeeper fails to serve client requests when you are left with 2 servers.

Although you may have 2 Zookeeper servers running (out of 5 initially in the cluster), the communication between the client and the Zookeeper cluster fails to be reestablished. This is, again, because the number of Zookeeper servers running are less than the majority of the servers being alive. The number of the servers being alive and running, it needs to be greater than or equal to the majority (quorum) of the servers in the cluster. Otherwise, Zookeeper cluster does not serve requests.

In more formal terms, the quorum is calculated by the formula ceil(N/2), where N is a number of servers in a cluster (or ensemble in Zookeeper nomenclature). For a 3 server ensemble, that means 2 servers must be up at any time, for a 5 server ensemble, 3 servers need to be up at any time.

Note: The quorum is necessary in order for the Leader to be selected. We will talk about the Leader in a next story.

Closing Note

In this blog post:

  1. We have demonstrated how the Zookeeper client is clever enough to switch to another server when the originally connected server goes down.
  2. Also, we have demonstrated how Zookeeper client keeps on using the same Zookeeper session (identified by the same id) even if it switches to another server.
  3. We have also seen that the Zookeeper cluster needs the majority of the servers alive in order to work. Otherwise, it is not possible to serve requests, being unable to select a Leader.

Thank you for reading this blog post. Your comments below are more than welcome. I am willing to answer any questions that you may have and give you feedback on any comments that you may post. Don’t forget that, I learn from you as much as you learn from me.

About the Author

Panayotis Matsinopoulos works as Development Lead at Simply Business and, on his free time, enjoys giving and taking classes about Web development at Tech Career Booster.