Continuing our series of posts describing the experience of joining this year’s edition of ElixirConf EU 2018, today we will talk about the tools for processing large amounts of data using Elixir.
Elixir is a young language
Previously, we’ve mentioned that Elixir is a 7-year-old language, which makes it quite young compared to other technologies. This fact, though, puts the Elixir Core team in an interesting position when considering the future of the language - there is a lot in other technologies that could make a good addition to the ecosystem.
For example, Elixir’s pipe operator (|>) was taken from the idea of piping commands in the command line; the powerful macro tooling was inspired by Lisp macros; and the concept of parsing the documentation of the code and running the embedded examples as tests (doctests) was an idea taken from Python.
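For instance, here is a toy example of ours (not from the talks) showing both: the pipe operator threads a value through a chain of calls, and ExUnit’s doctest feature runs the iex> session embedded in the documentation as a test.

```elixir
defmodule TextStats do
  @doc """
  Counts the words in a sentence.

      iex> TextStats.word_count("such a young language")
      4
  """
  def word_count(sentence) do
    sentence
    |> String.split()   # the piped value becomes the first argument
    |> length()
  end
end

# In the test suite, a single `doctest TextStats` line inside an
# ExUnit case turns the iex> example above into a real test.
```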
BEAM, the virtual machine that Elixir programs run on, is a powerful tool that makes concurrency easier - solid ground for building applications able to handle large amounts of data. If only there was a good abstraction for doing that…
You might have experienced this problem before: you have to implement a data pipeline that consists of multiple stages, such as reading through a CSV file, enriching or aggregating the parsed data and - at the end - having it “consumed” by a slower process (e.g. sending emails or publishing to your queuing system).
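A naive sketch of such a pipeline might look like this (the file name, the comma-splitting “parser” and the deliver/1 stand-in are all hypothetical): the reader pushes rows to a consumer process as fast as the file can be read.

```elixir
defmodule NaivePipeline do
  # The slow “last stage”: a process handling one row at a time.
  def consumer_loop do
    receive do
      {:row, row} ->
        deliver(row)
        consumer_loop()
    end
  end

  # The fast “first stage”: it pushes rows as quickly as the file can be
  # read, so the consumer’s mailbox grows without bound when it falls behind.
  def run(path) do
    consumer = spawn(&consumer_loop/0)

    path
    |> File.stream!()
    |> Stream.map(&String.split(&1, ","))   # parse each CSV line
    |> Enum.each(&send(consumer, {:row, &1}))
  end

  defp deliver(_row), do: Process.sleep(50)   # stand-in for slow work
end
```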
The problem lies in how the pipeline is structured: it’s not uncommon for the “first stage” of your data processing to progress faster than your “last stage”! One approach might be to introduce throttling in the “file reader” so that it reads through the file more slowly - slowly enough that the consumer is able to process the data as it arrives.
How do you decide on the throttling in this first stage, though? Make the throttle window too short, and the problem remains: too much data is produced and the consumer is unable to catch up; make the window too wide, and the program takes far too long to progress through the data, making it inefficient.
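To make the dilemma concrete, a throttled variant of the hypothetical NaivePipeline.run/1 from above could look like this - the batch size and the sleep window are arbitrary guesses, which is exactly the problem:

```elixir
# Would slot into the NaivePipeline module sketched earlier.
def run_throttled(path) do
  consumer = spawn(&consumer_loop/0)

  path
  |> File.stream!()
  |> Stream.map(&String.split(&1, ","))
  |> Stream.chunk_every(100)                  # guess: 100 rows per batch
  |> Enum.each(fn batch ->
    Enum.each(batch, &send(consumer, {:row, &1}))
    Process.sleep(1_000)                      # guess: wait 1s between batches
  end)
end
```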
What if the data flow was described in reverse order? What if a stage requested more data to process from the previous stage, which would then ask its previous stage, and so on, up to the very first stage - here, the data reader?
It turns out, the problem has already been solved by the Akka community: its back-pressure mechanism became an inspiration for GenStage, where each stage asks the stage before it for only as much data as it can handle.
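A minimal producer/consumer pair, closely following the example from the GenStage documentation (it assumes the gen_stage package is among your dependencies):

```elixir
defmodule Counter do
  use GenStage

  def init(initial), do: {:producer, initial}

  # Invoked only when consumers ask for events - demand flows upstream,
  # so this stage never emits more than the stages below can take.
  def handle_demand(demand, counter) when demand > 0 do
    events = Enum.to_list(counter..(counter + demand - 1))
    {:noreply, events, counter + demand}
  end
end

defmodule Printer do
  use GenStage

  def init(:ok), do: {:consumer, :ok}

  def handle_events(events, _from, state) do
    Process.sleep(500)      # simulate a slow last stage
    IO.inspect(events)
    {:noreply, [], state}   # consumers emit no events themselves
  end
end

{:ok, producer} = GenStage.start_link(Counter, 0)
{:ok, consumer} = GenStage.start_link(Printer, :ok)

# The consumer subscribes and starts requesting data from the producer.
GenStage.sync_subscribe(consumer, to: producer, max_demand: 10)
```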
Flow, in turn, finds inspiration in Apache Spark and Apache Beam for a high-level description of concurrent programming. Think of Flow as an API built on top of GenStage: you describe the computation, and the framework will help in adjusting the processes to the resources available on your machine.
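With Flow, the CSV pipeline from before could be expressed like this (a sketch assuming the flow package as a dependency; the file name and the anonymous stand-ins are, again, hypothetical):

```elixir
enrich = fn row -> row end                   # stand-in enrichment step
deliver = fn _row -> Process.sleep(50) end   # stand-in slow consumer

File.stream!("users.csv")
|> Flow.from_enumerable()
|> Flow.map(&String.split(&1, ","))          # parse each CSV line
|> Flow.map(enrich)
|> Flow.each(deliver)
|> Flow.run()
```

Under the hood, each step runs as a set of GenStage processes (one per scheduler/core by default) with demand-based back-pressure between them - no manual throttling required.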
Since the official announcement of GenStage, I had been looking for more resources to learn about these tools, and this year’s ElixirConf EU edition featured a couple of presentations that talked about exactly that!
My picks are:
Jusabe Guedes - Orchestrating Consumers and Producers Like an Octopus
László Bácsi - Robust Data Processing Pipeline with Elixir and Flow
László, on the other hand, was playing with the higher-level API, Flow, while building a system for CollectPlus, a UK parcel service company. He was tasked with implementing a data pipeline to process billions of rows from MySQL and CSV documents and to load them into Redshift. You can find more information in the recording of his session.
Both presentations focused on the practical aspects of working with GenStage and Flow and, if you want to know more, your best bet is to check the videos: they include a lot of code examples for approaching the problems at hand.