Streaming data on object storage: Thoughts
Object stores are the gold standard for cloud-native data persistence, so it is natural to want to store streaming data on them. But there is an inherent problem with storing a streaming dataset on object storage.
Streaming workloads are, by definition, continuous, almost never-ending flows of data. The object store API (the S3/Blob API), on the other hand, is designed around the concept of a well-defined file (GET, PUT) being uploaded to the object store.
So, to persist a stream to an object store, the stream data has to be converted into a well-defined file and then uploaded. And this process has to be repeated over and over, because the stream, by definition, doesn't end.
Let's get the terminology clear before we dive in further: stream-oriented data here means any data source or paradigm that creates a continuous, ongoing flow of data; application logs and website traffic data are good examples. Object store here means an API-driven, cloud-native storage platform. Object stores are available as native public cloud offerings, or as platforms like MinIO that offer the flexibility to run your object store anywhere.
So let's discuss some of the possible approaches to storing streaming data on object stores.
Append files on object store
A simple solution comes to mind, especially to those who are yet to familiarize themselves with the object storage API: why not create one file on S3 and then keep appending to it?

You can't, because there is no AppendObject API in object stores. In fact, one of the major reasons object storage is much more scalable than typical file systems is that the object storage API dropped the requirement of keeping track of all the clients and their open files (as in POSIX), and with it the option to append to objects.
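To see why this matters, here is a minimal sketch of the only "append" available under pure GET/PUT semantics: read-modify-write. An in-memory dict stands in for the object store, and the function names are hypothetical, not real S3 calls.

```python
# A dict stands in for the object store (key -> object bytes).
store = {}

def put_object(key, data: bytes):
    store[key] = data

def get_object(key) -> bytes:
    return store[key]

def append_object(key, new_data: bytes):
    # With only GET and PUT, every "append" must re-read and
    # re-write the WHOLE object: O(n) per append, O(n^2) over
    # the life of a stream -- which is why this doesn't scale.
    existing = store.get(key, b"")
    put_object(key, existing + new_data)

append_object("logs/app.log", b"event-1\n")
append_object("logs/app.log", b"event-2\n")
# get_object("logs/app.log") == b"event-1\nevent-2\n"
```

On a real object store this is even worse than the sketch suggests: each append pays network round-trips for the full object, and concurrent appenders can silently overwrite each other.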
Batch / Mini-batch before upload
We could stage the data and create files as the stream progresses. This would effectively convert the stream into batch or mini-batch files before it is persisted.
But this approach has shortcomings too:
- This invariably leads to a large number of small files, which is a problem in itself: listing and reading many small objects is far less efficient than reading a few large ones.
- Staging data before pushing it to reliable object storage requires an equally reliable staging store, or the staged data can be lost.
- Stream volume is generally difficult to judge, i.e. it is difficult to decide when to cut off the stream from a staging file and start a new file. If you cut off based on time, i.e. create a new file from the stream every 5 minutes, you'll invariably end up with files of wildly varying sizes as stream traffic ebbs and flows. If you cut off based on the number of stream events (create a new file every 1000 events), you don't know when the 1000th event will arrive, so a file could be left in staging for too long.
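A common mitigation for the cutoff dilemma above is to flush on whichever threshold trips first, size or age, bounding both file size and staging latency. A minimal sketch, with the thresholds and the `flush_fn` callback (which would do the object-store PUT) as assumptions:

```python
import time

class MiniBatcher:
    """Stages stream events and flushes a batch when EITHER the size
    threshold OR the age threshold is hit, whichever comes first."""

    def __init__(self, flush_fn, max_bytes=8 * 1024 * 1024, max_age_s=300):
        self.flush_fn = flush_fn        # called with the batch bytes
        self.max_bytes = max_bytes      # size cutoff
        self.max_age_s = max_age_s      # time cutoff
        self.buffer, self.size, self.started = [], 0, None

    def add(self, event: bytes):
        if self.started is None:
            self.started = time.monotonic()
        self.buffer.append(event)
        self.size += len(event)
        if (self.size >= self.max_bytes
                or time.monotonic() - self.started >= self.max_age_s):
            self.flush()

    def flush(self):
        if self.buffer:
            self.flush_fn(b"".join(self.buffer))
        self.buffer, self.size, self.started = [], 0, None

batches = []
batcher = MiniBatcher(batches.append, max_bytes=20, max_age_s=300)
for e in [b"aaaaaaaaaa", b"bbbbbbbbbb", b"cc"]:
    batcher.add(e)
# The first two events total 20 bytes and trigger a flush;
# b"cc" remains staged until the next threshold or a final flush.
```

Note this only bounds the staging window; it doesn't solve the durability problem, since anything still in the buffer is lost if the process dies.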
Multipart API for Streaming data on Object Storage
S3 already has a multipart upload API, which allows uploading a file in smaller chunks, so you can start uploading a file even before it is fully formed. We could use this approach to upload streams: think of the stream events as smaller chunks of a larger file that will eventually be created.
This is not free of shortcomings either:
The S3 multipart upload API doesn't offer GetObject for the individual parts: you can GET an object only after it is fully created. This means an application using multipart upload for streams can't read the latest set of events (those for which CompleteMultipartUpload has not yet been called).
Additionally, the file format is an important consideration here too. A format with a footer (like Parquet) may not work with this approach: CompleteMultipartUpload only stitches together the parts already uploaded, so the footer would have to be uploaded as the final part, and its contents can't be known while the stream is still flowing.
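One practical wrinkle when mapping a stream onto multipart upload: S3 requires every part except the last to be at least 5 MiB, so individual stream events must be buffered into part-sized chunks. A sketch of that chunking logic, with the actual UploadPart/CompleteMultipartUpload calls elided (each yielded value is what one UploadPart call would send):

```python
MIN_PART_SIZE = 5 * 1024 * 1024  # S3 minimum for all parts except the last

def stream_to_parts(events, min_part_size=MIN_PART_SIZE):
    """Group stream events into multipart-upload-sized parts.

    Each yielded chunk would be one UploadPart call; the final,
    possibly undersized part is yielded only when the stream ends,
    after which CompleteMultipartUpload could be called.
    """
    buf, size = [], 0
    for event in events:
        buf.append(event)
        size += len(event)
        if size >= min_part_size:
            yield b"".join(buf)
            buf, size = [], 0
    if buf:  # stream ended: flush the last, possibly undersized part
        yield b"".join(buf)

# Tiny threshold for illustration only:
parts = list(stream_to_parts([b"abc", b"defg", b"hi"], min_part_size=5))
# parts == [b"abcdefg", b"hi"]
```

This also makes the visibility problem above concrete: until the final part is yielded and the upload completed, none of the buffered data is readable via GET.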
No silver bullet
As we saw, no single approach works perfectly for every scenario. Like most technical decisions, you'll need to make a call based on the tradeoffs you can live with.
However, the most common approach I have seen people take is to use a streaming data platform purpose-built for ingesting streams at high volume. It comes at an extra cost, but offers a great deal of peace of mind versus managing streams internally.
It is also precisely because of these difficulties of fitting streaming data onto object storage that new platforms are emerging on top of object storage, purpose-built for streaming data like logs.