Goroutines for stateless AWS Lambda functions

I love serverless. I’ve written about it before in my previous post on how to implement a third-party logs ingestion workflow, but I now want to tackle the question: ‘Is it possible to scale stateless serverless functions on the runtime as opposed to leveraging function concurrency?’

In short, my answer is yes.

Stateless vs stateful applications

In the context of AWS Lambda, serverless makes it easier to scale up applications in multiple parallel executions. However, this also implies that stateful applications need to rely on some form of external storage to share memory. This is the only asynchronous way by which function invocations can ‘communicate’ with each other about their evolving status.

This adds to the complexity. It is somewhat harder to design a system that does not store its shared status on any form of local resource, be it volatile memory or the filesystem. But is it always necessary to avoid all forms of local sharing in the serverless runtime environment?

I believe the answer is no. Purists may shudder at this but I will try to make my case.

It really depends on whether the application you’re writing is stateful as opposed to stateless.

If you’re writing code that runs periodically and perpetually and needs to communicate its evolving status (this is what I mean by stateful), then pushing the runtime for more parallelism will not significantly simplify your application design. You’ll still need external storage in which to track the status, as the runtime’s existence is ephemeral and bound in time.

If instead your application is stateless, then much can be gained by leveraging parallel executions on the runtime itself.

Let’s consider an example. You want to write a scavenger function that periodically checks whether entries on a DynamoDB table are old enough to be deleted, so that the table is kept minimal and scan operations consume less read capacity. In short, you want to save on your DynamoDB bill.

This is an example of a stateless application. The single application run – or Lambda function invocation – does not need to know anything about previous runs.

By increasing the parallelism on the Lambda runtime environment, we get more speed, and possibly some savings on the Lambda bill.

If we were instead invoking the Lambda function multiple times in parallel, we would possibly be getting multiple cold starts, together with having to share an application status. That would be necessary to avoid having the same segments in our DynamoDB table being read by different Lambda invocations, thus giving up the benefits of parallel execution. In short, we’d be introducing a status for a conceptually stateless invocation. It doesn’t sound right.

Language considerations

Now this is where I get excited. Increasing the parallelism of code is something that can be done in any language supporting multithreading.

Golang has been my language of choice, because I think the Go pattern of sharing memory by communicating fits very well with stateless AWS Lambda functions.

Go’s support for cheap goroutines and channels is outstanding. Together with the context cancellation provided by the context package, this makes Go a great candidate language for programming a parallel function for the Lambda runtime.


The parallel implementation

Let’s revisit the example of the scavenger function. How do we translate our thoughts into code?

Here’s the code sample for our implementation, and some guidelines for interpreting the code snippet:

  • The Lambda handler is conceptually divided in two blocks: One reads items from the DynamoDB table, while the other deals with deleting them.
  • Both reads and deletions are parallelised.
  • A channel is used to send items from read to deletion.
  • The channel is closed when the reading phase is complete, without having to rely on knowing exactly how many items have been read.
  • Context objects are used to expire routines without leaking them on the Lambda execution timeout expiration.

The handler defers the context cancellation, so that every goroutine that can be cancelled through the context is cancelled when the handler returns and does not leak resources.

The scan can be parallelised among multiple readers with a set number of segments. The segment and totalSegments can be set as part of the input fields. The scanned items are made available on the readChannel.

Here we use a type TableItem that is application-defined and matches the structure of items from the DynamoDB table the application is reading from.

Unmarshal the scanned items into a slice of TableItem structures and send them to the readChannel.

Once deleteItem reads the object from the channel, it can extract the relevant properties and add them to the DeleteItemInput structure that it needs to build in order to send the delete request to DynamoDB.

It is not possible to simply have the scanned items passed over to the delete function as the DeleteItemInput structure is not compatible with the custom table-dependent item structure that is returned by the scan operation.

Cast totalSegments to int64 as that is the type expected by SetSegment and SetTotalSegments.

This is necessary to ensure that the old entries are also returned by the scan operation, as originally Completed was of type ‘string’.

The goroutine launched here is responsible for closing the communication channel between producers and consumers, which prevents the Lambda from running until the execution timeout expires. Because the consumer listens on the channel with a ‘range’ clause, it neither has to predict how many items are going to be sent nor needs another mechanism to learn that sending is over: once the channel is closed and all its elements have been retrieved, the receiving goroutine ends too.

I believe that the devised solution of using a waitgroup before closing the communication channel and then allowing for the channel to be emptied is more elegant than using another out-of-band shared variable to check that we are done with sending. This is also more Go-like because we can share memory by communicating, rather than communicating by sharing memory.

Here we keep any utility variables that are used by date operations within the deleteItem function.

Now we have the section of the handler that is responsible for the deletion of the scanned items.

The idea here is that if there is space in the sizeWaitGroup, a new goroutine is launched; if not, we wait until a slot becomes available. This avoids overwhelming the DynamoDB table, which can then retain most of its capacity for the main read and update operations performed by the other Lambdas in the orchestration framework.



For the analysis, a sequential version of the code above was used as a comparison – essentially the same code but with the parallelism variables set to 1.

Then two identical copies of a DynamoDB table were prepared by leveraging AWS Data Pipeline to export the table to S3 and reimport it. The tables contained approximately 130000 items and the functions conditionally deleted all but about 3300 items that did not meet the deletion criteria.

A number of tests were then run.

The parameterised settings used for the analysis are shown in the following table:

Allocated memory (MB) | # deletion routines | # reading routines | Billed duration (ms) | Max memory used (MB) | Billed cost ($)

* The maximum memory used is the average consumed by the 3 consecutive sequential executions.

Let’s comment on the first two execution setups reported in the table; similar considerations can easily be deduced for the others.

The memory for the parallel Lambda function was set to 512MB and 128MB was allocated for the sequential function. This changes the unit billing cost per ms, as per AWS Lambda pricing, which on eu-west-1 translates to:

The parallelism for both the reading and the deleting parts of our handler was set to 10 in the parallel Lambda function. This worked out well for the chosen amount of memory, but the same cannot be said for all the other tested setups (see table above).

The functions were executed with the same timeout, set to the currently allowed AWS maximum of 15 minutes.

The parallel code successfully deleted all the items from the DynamoDB table in just over 700000ms, or about 12 minutes:

Report results 1

while the sequential implementation took three executions. Two of these ran down to the wire until the expiration of the execution timeout, thus a billed duration of 2 x 900000ms. The last one is shown below:

Report results 2

In summary, the total costs for running the parallel and the sequential implementation were:


The sequential implementation turns out to be cheaper by about 29% when it comes to total cost, although its execution takes 180% longer than in the first parallel case reported above.
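The arithmetic behind such a comparison can be sketched as follows. The rate in pricePerGBSecond is an assumption (a commonly published on-demand figure) and should be checked against the current AWS Lambda pricing page for eu-west-1; the per-request fee is ignored, and the third sequential execution is left out because its duration is not reproduced here.

```go
package main

import "fmt"

// pricePerGBSecond is an ASSUMED on-demand rate in dollars; verify it against
// the AWS Lambda pricing page before relying on the results.
const pricePerGBSecond = 0.0000166667

// durationCost returns the duration charge in dollars for one or more
// executions, given the allocated memory and the total billed duration.
// The per-request fee is not included.
func durationCost(memoryMB, billedMs float64) float64 {
	gbSeconds := (memoryMB / 1024) * (billedMs / 1000)
	return gbSeconds * pricePerGBSecond
}

func main() {
	// Parallel run from the article: 512MB, roughly 700000ms billed.
	fmt.Printf("parallel: $%.6f\n", durationCost(512, 700000))

	// Sequential runs: only the two executions that hit the timeout
	// (2 x 900000ms at 128MB) are counted in this sketch.
	fmt.Printf("sequential, first two runs only: $%.6f\n", durationCost(128, 2*900000))
}
```

Quadrupling the memory quadruples the cost per millisecond, so the parallel function only comes out cheaper if it finishes in less than a quarter of the sequential billed time.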

The dataset allows us to draw a few more soft conclusions:

  • Because DynamoDB is accessed over HTTP, request latency has a substantial impact on the billed duration, which scales with the allocated memory and parallelism only on average.
  • The billing incurred when processing sequentially is not matched by any of the parallel tests; however, the sequential execution takes significantly longer than any of the others.
  • The execution duration on average scales down with the allocated memory and parallelism, and seems to be more sensitive to the allocated memory (which for Lambda determines CPU power) than to parallelism. The fact that a second virtual core is assigned to the function at 1.8GB of allocated memory, as informally known from some AWS re:Invent talks, does not seem to bring any additional benefit.

Getting back to the initial question of whether there is a case for stretching the parallelism of a function’s execution on the Lambda runtime, I think the analysis provided suggests that there is.

In particular, if the focus is on reducing the execution duration of a stateless application, a Golang implementation seems like a good idea.

Happy coding!


Vincenzo Zambianchi
