Notes from ElixirConf EU 2018 (part 3)

Continuing our series of posts describing the experience of joining this year’s edition of ElixirConf EU 2018, today we will talk about the tools around processing big amounts of data using GenStage and Flow.

Elixir is a young language

Previously, we’ve mentioned that Elixir is a 7-year-old language, which actually makes it quite young compared to other technologies. This fact, though, puts the Elixir Core team in an interesting position when considering the future of a language - there is a lot in other technologies, that could be considered a good addition to the ecosystem.

For example, Elixir’s pipe operator (|>) was taken from the idea of piping commands in command line; the powerful macro tooling was inspired by Lisp macros; and the concept of tests parsing the documentation of the code, and using examples as tests was an idea taken from Python.

BEAM, the virtual machine that Elixir programs run on, is a powerful tool that enables easier concurrency - this is a solid ground for building applications that would be able to handle big amounts of data. If only there was a good abstraction for doing that…

The problem

You might have experienced this problem before: you have to implement a data pipeline that consists of multiple stages, such as reading through a CSV file, enriching or aggregating the parsed data and - at the end - have it “consumed” by a slower process (e.g. sending emails or publishing to your queuing system).

The problem lies in how the pipeline is structured: it’s not uncommon that the “first stage” of your data processing will be progressing faster than your “last stage”! One approach for that might be to introduce throttling in the “file reader” to read through the file slower - slow enough, so the consumer will be able to process the data as it arrives.

How do you decide on the throttling in this first stage, though? Make the throttle with too short windows, and the problem remains: too much data produced, so the consumer is unable to catch up; make the window too wide, and the program will be progressing through the data for too long, making it inefficient.

GenStage and Flow

What if the data flow was described in reverse order? What if a stage requested more data to process from the previous stage, which would then ask its previous stage, and so on, up to the very first stage - here, the data reader?

It turns out, the problem has already been solved by the Akka community. The backpressure mechanism became an inspiration for GenStage, Flow finds inspiration in Apache Spark and Apache Beam for a high-level description of concurrent programming. Think of Flow as an API built on top of GenStage describing the computation, and the framework will help in adjusting the processes to perform it.

Since the official announcement of GenStage I was looking for some more resources to learn about the tools, and in this year’s ElixirConfEU edition there were a couple of presentations that talked exactly about it!

My picks are:

Jusabe Guedes - Orchestrating Consumers and Producers Like an Octopus

Jusabe, in his presentation, talks about the experience using GenStage in production system that consume events from Amazon’s SQS Service.

László Bácsi - Robust Data Processing Pipeline with Elixir and Flow

László, on the other hand, was playing with the higher-level API using Flow while building the system for UK’s parcel service company - CollectPlus. He was tasked with implementing a data pipeline to process billions of rows from MySQL and CSV documents and to load them into Redshift. You can find more information in the recording of his session.

Summary

Both presentations were focusing on the practical aspect of GenStage and Flow, and, if you want to know more, your best bet is to check the videos: they include a lot of code examples for approaching solving the problem at hand.

Footer