The Curious Case of Small Files
Most files, by virtue of their average size and usage patterns, are clearly cut out for certain storage platforms. For example, unstructured data such as images, archives, and larger log files is almost always best suited for object stores, while structured data such as user information and transaction records is best suited for relational databases. We have a good handle on these well-known structured and unstructured data types.
But as modern data paradigms evolve, the boundary between structured and unstructured data is getting blurry. Take, for example:
- Event / IoT data - Data from sources like mobile apps, IoT sensors in consumer electronics, industrial automation systems, and the many new devices being connected to the Internet. Such data is schemaless and denormalized, and its volume can be unpredictable.
- Application logs - Application logs are unstructured, and a typical microservices-style platform can easily create thousands of small log files per second.
- Click-stream data - Web and mobile analytics data, such as click events and user-behavior records, is another example of unstructured, high-volume data that arrives as many small individual files.
Small files generated by the above sources matter because almost all such data is useful for analytics. Finding the right platform for this data poses a challenge: from a purely storage perspective, these files can fit in an RDBMS, a NoSQL database, or even an object store. So there seem to be many options:
- RDBMS - A MySQL table has a maximum row size limit of 65,535 bytes; the exact rules vary by data type, as explained in the MySQL documentation.
- NoSQL - A MongoDB BSON document can be up to 16 MB, as detailed in the MongoDB documentation.
- Object Storage - Individual Amazon S3 objects can range in size from a minimum of 0 bytes to a maximum of 5 terabytes.
But, practically, these files are too big (or too many) for databases and too small for storage systems like object stores. Additionally, because such files feed analytics workloads, it is important to understand their access patterns. We have seen teams discover only after going to production that a database would have served them better than an object store, and vice versa.
In this post we look at various storage platforms and relevant architectural approaches that can protect you in situations where object sizes are small and unknown, ingestion volumes are variable, and data query patterns are bound to change over time.
Apache Druid
Apache Druid was built from scratch for click-stream and other time-stamped data points that need a high ingestion rate and fast query performance, but it does not reinvent the wheel in terms of storage. Druid takes the approach of using the right tool for each data type (HDFS or object storage for blobs, an RDBMS for metadata), while it does the heavy lifting of ingesting, splitting, storing, and querying the data. To external applications, Druid looks like an OLAP database.
Druid works by splitting data into time-based ranges called chunks. Each chunk is further partitioned into one or more segments. A segment is a single file, typically comprising up to a few million rows of data. While building segments, Druid converts the data to a columnar format, indexes it, and compresses it. Segments are then stored in deep storage, i.e. an object store or HDFS, while information about the segments is written to a metadata store, which is essentially a relational database.
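As a rough illustration of the chunk/segment layout described above (this is not Druid's actual implementation; the hourly granularity and tiny segment size are arbitrary assumptions for the sketch):

```python
from collections import defaultdict
from datetime import datetime

SEGMENT_MAX_ROWS = 3  # toy limit; real Druid segments hold millions of rows

def partition(rows):
    """Group timestamped rows into hourly chunks, then split each
    chunk into fixed-size segments, mimicking Druid's layout."""
    chunks = defaultdict(list)
    for row in rows:
        ts = datetime.fromisoformat(row["timestamp"])
        chunks[ts.strftime("%Y-%m-%dT%H")].append(row)
    return {
        hour: [chunk[i:i + SEGMENT_MAX_ROWS]
               for i in range(0, len(chunk), SEGMENT_MAX_ROWS)]
        for hour, chunk in chunks.items()
    }

events = [{"timestamp": f"2021-05-01T10:00:{s:02d}", "value": s}
          for s in range(5)]
layout = partition(events)  # one hourly chunk, split into two segments
```

The real system also handles the columnar conversion, indexing, and compression mentioned above; this sketch only shows the time-based partitioning idea.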
Druid's architecture thus offers a scalable, performant way to handle small data points at high volume. Take a look at Druid's ingestion and query patterns to understand how it can integrate with your applications.
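To give a flavor of the query side, Druid exposes a SQL endpoint over HTTP (`/druid/v2/sql/`). The host, datasource name, and columns below are hypothetical; only the request shape matches Druid's SQL API:

```python
import json

# Hypothetical router address; adjust to your deployment.
DRUID_SQL_URL = "http://druid-router:8888/druid/v2/sql/"

def sql_request(query):
    """Build the JSON body Druid's SQL API expects."""
    return json.dumps({"query": query, "resultFormat": "object"})

body = sql_request(
    "SELECT channel, COUNT(*) AS events "
    "FROM clickstream "  # hypothetical datasource
    "WHERE __time >= CURRENT_TIMESTAMP - INTERVAL '1' HOUR "
    "GROUP BY channel"
)
# In a real deployment you would POST this body, e.g.:
# requests.post(DRUID_SQL_URL, data=body,
#               headers={"Content-Type": "application/json"})
```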
Apache NiFi
Apache NiFi is not a storage platform per se; it was developed as a system to automate the flow of data between systems. It supports a high ingestion rate and integrates with all the major software systems.
In cases where an opinionated approach like Druid does not fit, NiFi can be used to ingest the data and massage it to specific requirements before pushing it to the relevant storage platform. Take, for example, a flow where you need to build a catalog of all ingested data in a certain format and then store the data compressed on object storage. Given NiFi's pluggable design and easy-to-build integrations, this can be achieved quite easily.
Platforms like Druid and NiFi can come in handy, but they also add the burden of extra infrastructure, management, and maintenance. In some cases it is simply easier to adjust the client to align with the storage platform.
Object Storage
A simple approach to handling small files on object storage systems is to club multiple files into a single, bigger file. Think of small JSON objects or CSV rows combined into one large CSV file. This approach makes sense in cases where individual data points carry less meaning than the aggregate.
Once the aggregation mechanism is in place, you can tune the number of entries clubbed into a single file to find a good value from a performance perspective. For example, the Kafka Connect S3 sink connector lets you configure how many messages are combined (via `flush.size`) before a write to an S3-compatible system.
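A minimal sketch of this client-side aggregation, assuming newline-delimited JSON and a local gzip file standing in for the object-store upload (the batch size and file naming are arbitrary assumptions):

```python
import gzip
import json
import os

BATCH_SIZE = 1000  # the knob to tune, as discussed above

def flush_batch(records, out_dir, batch_id):
    """Club many small JSON records into one compressed NDJSON object."""
    path = os.path.join(out_dir, f"batch-{batch_id:06d}.ndjson.gz")
    with gzip.open(path, "wt", encoding="utf-8") as f:
        for rec in records:
            f.write(json.dumps(rec) + "\n")
    return path

def ingest(stream, out_dir):
    """Buffer incoming records and flush whenever the batch fills up."""
    os.makedirs(out_dir, exist_ok=True)
    batch, batch_id, written = [], 0, []
    for rec in stream:
        batch.append(rec)
        if len(batch) >= BATCH_SIZE:
            written.append(flush_batch(batch, out_dir, batch_id))
            batch, batch_id = [], batch_id + 1
    if batch:  # flush the final partial batch
        written.append(flush_batch(batch, out_dir, batch_id))
    return written
```

In production, the write would target the object store instead of the local filesystem, for example via `boto3`'s `put_object`.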
Relational and Other Databases
In cases where individual files, or the smallest data unit, require strong consistency and other ACID properties, relational databases can serve well. However, it is important to understand the volume constraints: a typical database deployment comfortably handles a few TBs of overall capacity, and scaling far beyond that becomes difficult and expensive.
A simple approach in these cases is to keep live / active data in the RDBMS while regularly moving older data to object storage using native RDBMS tools. This requires a clear understanding of which rows to back up, because the backup is in the native database format and cannot be read directly from object storage. So do this only when the latest data is what matters and the rest can be effectively archived.
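The archive-then-delete pattern above can be sketched as follows, using SQLite as a stand-in for the production RDBMS (the table name and cutoff column are illustrative assumptions; a real setup would use the database's native dump tool and then upload the dump file to object storage):

```python
import sqlite3

def archive_old_rows(conn, cutoff, archive_path):
    """Copy rows older than `cutoff` into an archive database file,
    then delete them from the live table."""
    conn.execute("ATTACH DATABASE ? AS archive", (archive_path,))
    # Create the archive table with the same shape as the live one.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS archive.events AS "
        "SELECT * FROM events WHERE 0")
    with conn:  # copy + delete are committed together
        conn.execute(
            "INSERT INTO archive.events SELECT * FROM events WHERE ts < ?",
            (cutoff,))
        cur = conn.execute("DELETE FROM events WHERE ts < ?", (cutoff,))
    conn.execute("DETACH DATABASE archive")
    return cur.rowcount  # number of rows moved out of the live table
```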
In this post we looked at various approaches for ingesting and storing high-volume, high-throughput, small files or data points: a purpose-built platform like Druid, the highly configurable NiFi, and native storage with some intelligence built into the client itself. Based on your constraints, one of these approaches will generally apply and allow you to scale as your requirements grow.
You may also want to take a look at our post on Data Lake Overview.