One of the main purposes of testing is to enable us to trust the performance of our code in an automated way. Unfortunately, several problems often arise with automated tests. One common and particularly annoying problem that we began experiencing in our codebase was flakiness.

A flaky test is one which sometimes passes and sometimes fails: it’s supposed to be green, but fails periodically due to unknown causes.

There are three main reasons that make them especially infuriating:

They are very difficult to reproduce: You may not be able to investigate sporadic failures easily, and if you have a fix, it will likely be hard to test. In some cases, you can estimate the probability of failure. If the test fails often (let’s say 50% of the time), reproducing it is not a big issue. However, if it fails only once every hundred runs, you may have to try different solutions in parallel and hope that one of them works.
You don’t know why they are failing: The test is supposed to be green, so you don’t really know why this is happening. There are many reasons why a test can be flaky, and they may be far-fetched and unexpected. This point is related to the previous one: if you know the reason, you may be able to reproduce the failure, and vice versa.
They are easy to just ignore (but you shouldn’t!): This point is more subtle, but it can be very damaging. Say you run your tests and find that one is unexpectedly failing. You rerun it, and now it’s green. Hooray! It seems natural to forget about the failure and pick up something else to do. It only happens once a week, so why should you waste time fixing something so minor? Time passes, and soon you have multiple failures happening all the time. Essentially, your tests have become completely unreliable. Remember the first sentence of this article: if you can’t trust the tool that is supposed to give you confidence in your code, then how can you trust your code?

Common causes

The first step in attempting to fix these tests is researching the cause. You may never be able to properly identify it, but the major ones we encountered within our code are listed below:

Order of execution of your tests: If you run your tests in a random order, try rerunning them with the seed from the failing run to see whether the failure becomes consistent. State can leak from one test and corrupt a later one: records in a database, an environment variable, delayed jobs, or stubbing. Always make each test independent of whatever ran before it.
Time: Is your test failing only during runs around midnight, at the beginning of the year, at the beginning of the month, or during the full moon? If you find a pattern of failures like this, it might be due to time or date issues.
Environment: Try to work out whether the failure happens only in a particular environment: does it only fail on staging, on a specific OS, and so on? Anything the failures have in common may help you understand whether the problem is specific to where you run your tests.
Race conditions: Last but not least, any race condition can make a test flaky. An example would be a test that depends on an AJAX request triggered by the page, where the test runs before the response has arrived. This cause is very common and may be very difficult to identify. However, because of its frequency, testing tools generally have ways to deal with it.
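
Two of the causes above, leaked state and date-dependent logic, can be illustrated without any test framework. The following is only a minimal sketch, not code from our suite: with_env and last_day_of_month? are hypothetical helpers invented for this example. The first restores an environment variable once its block finishes, so nothing leaks into the next test. The second takes the date as an argument, so a test can pin it instead of depending on the day the suite happens to run.

```ruby
require "date"

# Hypothetical helper: set an environment variable for the duration of a
# block, then restore the previous value (or remove the key if it was unset)
# so the change cannot leak into a later test.
def with_env(key, value)
  had_key = ENV.key?(key)
  previous = ENV[key]
  ENV[key] = value
  yield
ensure
  if had_key
    ENV[key] = previous
  else
    ENV.delete(key)
  end
end

# Hypothetical date-dependent logic: the date is a parameter with a default,
# so production code can call it bare while tests pass a fixed date.
def last_day_of_month?(date = Date.today)
  date.next_day.month != date.month
end

with_env("FEATURE_FLAG", "on") do
  puts ENV["FEATURE_FLAG"]      # the value is visible inside the block
end
puts ENV.key?("FEATURE_FLAG")   # the variable is back to its previous state

# Fixed dates make the result independent of when the suite runs.
puts last_day_of_month?(Date.new(2015, 1, 31))
puts last_day_of_month?(Date.new(2015, 2, 1))
```

The same idea applies to database records, stubs and delayed jobs: whatever a test changes, it should undo, ideally in an ensure block or an after hook so the cleanup runs even when the test fails.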

Capybara, a patient animal

Capybara has native mechanisms to prevent failures caused by race conditions. Every time you try to find an element, it will wait until it can find it (there is an equivalent for the absence of an element, such as has_no_selector?). It will only wait for a limited time, which you can set with Capybara.default_wait_time (renamed Capybara.default_max_wait_time from Capybara 2.5 onwards), and will then time out if it fails. Please note that Capybara should only wait when something is wrong. Increasing default_wait_time should not impact performance on green tests.
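
To make the idea of "waiting until it can find it" concrete, here is a simplified sketch of a wait-and-retry loop. It is not Capybara's implementation: wait_for, TimeoutReached and FakePage are hypothetical names made up for this illustration, and the fake page simply pretends that an element appears after a short delay, as it would after an AJAX response.

```ruby
# Simplified sketch of a wait-and-retry loop. This is NOT Capybara's code;
# wait_for, TimeoutReached and FakePage are hypothetical names.

class TimeoutReached < StandardError; end

# Monotonic clock, so the loop is unaffected by system clock changes.
def now
  Process.clock_gettime(Process::CLOCK_MONOTONIC)
end

# Re-evaluate the block until it returns a truthy value or `timeout`
# seconds have passed. A block that succeeds on the first try never sleeps.
def wait_for(timeout: 2.0, interval: 0.05)
  deadline = now + timeout
  loop do
    result = yield
    return result if result
    raise TimeoutReached, "condition not met within #{timeout}s" if now >= deadline
    sleep(interval)
  end
end

# Stand-in for a page whose element only shows up after `delay` seconds.
class FakePage
  def initialize(delay)
    @ready_at = Process.clock_gettime(Process::CLOCK_MONOTONIC) + delay
  end

  def has_element?
    Process.clock_gettime(Process::CLOCK_MONOTONIC) >= @ready_at
  end
end

page = FakePage.new(0.2)
puts page.has_element?                             # immediate check: element not there yet
puts wait_for(timeout: 1.0) { page.has_element? }  # patient check: retries until it appears
```

Because the loop returns as soon as the condition holds, the timeout is only ever spent when the condition never becomes true, which is why a larger timeout should cost nothing on a passing test.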

Capybara does not do anything fancy like analysing AJAX requests or loading elements (if you are interested in these other approaches, you can read this article). Capybara’s approach is very high-level, which makes it both simple and powerful.

However, everybody should be aware of a trap: some Capybara methods wait for elements to appear and some do not.

There are some very good reasons why; sometimes it just doesn’t make sense to wait (mainly because you don’t know what to wait for). For example, if you try to find a red button on a page, you know you need to wait for this button. But if you try to find “all red buttons”, then you don’t know the scope anymore: should you wait until you find one, two, or 99 red buttons?

Here’s a quick overview of methods that do wait:

find, find_field, find_link and other finders
within
has_selector?, has_no_selector? and similar
click_link, fill_in, check, select, choose and other actions

and the methods that don’t wait:

visit(path)
current_path
all(selector)
first(selector)
execute_script, evaluate_script
simple accessors: text, value, title, etc

For example, the following code snippet could cause random failures:

expect(all('.cats').count).to eq 9

Why? Neither all nor count is a Capybara method that waits for the specified number of instances to be loaded.

Instead, you should always use something like this:

expect(page).to have_css '.cats', count: 9

Our tests and best practices

A few weeks ago, our shakedown tests had these flaky failures. Many of these failures have since been solved by following some best practices, discussed below.

One major cause was the use of this code (and similar snippets):

page.execute_script "$('input[id$=\"_no\"]').attr('checked',true)"

This was used to check multiple buttons on the page. This technique has several drawbacks:

execute_script does not wait
we are not using Capybara’s native methods
this is not how a real user would be interacting with the form

A better approach was to use:

click_radio_button 'question_one_no'
click_radio_button 'question_two_no'
click_radio_button 'question_three_no'
...

Unfortunately, it means we have to select an answer to each question separately, but now we’re using the full functionality of Capybara and also more accurately modeling user interactions! Our takeaway was to never use page.execute_script unless there was a compelling reason to do so.

Another interesting one was:

page.execute_script "$('button:contains(#{button})').click()"

Here we actually needed to use execute_script (to fix some obscure bug…) but how could we have made our tests more robust? Check out this code below:

find("button", text: button)
page.execute_script "$('button:contains(#{button})').click()"

Now, we first use the patient method find to leverage Capybara’s built-in waiting. Once this line passes, we know the button is on the page and can safely run our JavaScript snippet.

Using the same idea, we also added this small helper:

#  page.first does not use Capybara's waiting mechanism to wait for an element to be present,
#  so we need to use `find` (which does) to patiently find our first element.
#  Use this method if you think it would make your step more robust.

def find_first(selector)
  page.first(selector) if find(selector, match: :first)
end

Additional note about performance

We tested increasing the Capybara.default_wait_time and noticed it actually increased the overall run time of our test suite. If you recall what was written above—that increasing the default_wait_time should not impact any performance on green tests—this was quite surprising.

Changing this code:

!actual.has_css?("#lol_cat", visible: true)

to this

actual.has_no_css?("#lol_cat", visible: true)

fixed it!

Since we were actually expecting this element to be absent, has_css? waited for the entire default_wait_time before returning false, whereas has_no_css? returns as soon as the element is confirmed absent. Additionally, since this code was in a step used by many tests, raising the wait time to 10 seconds was enough to increase the overall time to run the test suite by a few minutes!
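
The difference between the two forms can be reproduced with a small model. This is only a sketch under simplifying assumptions, not Capybara's code: FakeNode, has_element?, has_no_element? and elapsed are hypothetical stand-ins, with has_element? playing the role of has_css? and has_no_element? the role of has_no_css?.

```ruby
# Simplified model of waiting predicates. NOT Capybara's code: FakeNode,
# has_element?, has_no_element? and elapsed are hypothetical stand-ins.

class FakeNode
  def initialize(present:, wait_time:)
    @present = present
    @wait_time = wait_time
  end

  # Waits for the element to APPEAR; gives up (false) after wait_time.
  def has_element?
    poll { @present }
  end

  # Waits for the element to be ABSENT; gives up (false) after wait_time.
  def has_no_element?
    poll { !@present }
  end

  private

  # Retry the block until it is truthy or the wait time has elapsed.
  def poll
    deadline = clock + @wait_time
    until yield
      return false if clock >= deadline
      sleep(0.01)
    end
    true
  end

  def clock
    Process.clock_gettime(Process::CLOCK_MONOTONIC)
  end
end

# Return the block's result together with how long it took, in seconds.
def elapsed
  start = Process.clock_gettime(Process::CLOCK_MONOTONIC)
  result = yield
  [result, Process.clock_gettime(Process::CLOCK_MONOTONIC) - start]
end

# The element is absent, which is exactly what the step expects.
node = FakeNode.new(present: false, wait_time: 0.3)

result, seconds = elapsed { !node.has_element? }
puts format("!has_element?   => %s after %.2fs", result, seconds)

result, seconds = elapsed { node.has_no_element? }
puts format("has_no_element? => %s after %.2fs", result, seconds)
```

Both expressions give the same answer, but the negated form only gets there after the whole wait time has been spent, while the dedicated "absence" predicate answers on its first check.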

If you want to detect similar issues in your code, you may want to use this monkey patch (for capybara-2.4.4):

module Capybara
  module Node
    class Base
      def synchronize(seconds=Capybara.default_wait_time, options = {})
        start_time = Time.now

        if session.synchronized
          yield
        else
          session.synchronized = true
          begin
            yield
          rescue => e
            session.raise_server_error!
            raise e unless driver.wait?
            raise e unless catch_error?(e, options[:errors])
            if (Time.now - start_time) >= seconds
              warn "Capybara's timeout limit reached - if your tests are green, something is wrong"
              raise e
            end
            sleep(0.05)
            raise Capybara::FrozenInTime, "time appears to be frozen, Capybara does not work with libraries which freeze time, consider using time travelling instead" if Time.now == start_time
            reload if Capybara.automatic_reload
            retry
          ensure
            session.synchronized = false
          end
        end
      end
    end
  end
end

This adds the line warn "Capybara's timeout limit reached - if your tests are green, something is wrong" into the core timeout mechanism of Capybara. Whenever you see this output during a green test, it means you have waited unnecessarily and there’s an opportunity to optimise the test.

I’ve tried everything but I still get these failures!

Knowledge of these good practices may not be enough. If you don’t know what to do, it may be because you don’t have enough information about your errors. In this case, using heavy logging may be very useful. Try enabling logging:

# driver should be something like an instance of Capybara::Webkit::Driver (for example) - it depends on your tests
driver.enable_logging

Flaky tests are a common and difficult issue. Perseverance, heavy logging and good testing practices are the best way to tackle them. It may be very time consuming to fix them, but ignoring them will eventually be much worse!

