Sep 1, 2020 6 min read streaming

Streaming Data Tools & Techniques

Introduction

Streaming data is exactly what it sounds like, a continuously flowing stream of data generated by one or multiple sources. Most of the times, streaming data is generated on an event basis, which means a single unit of such data streams is typically small, in the order of few Kilobytes. However these streams seldom stop, so the overall data volumes grow to huge numbers. Some of the common examples of streaming data sources are:

Industrial and farm & agricultural IoT, home automation technologies: This is one of the biggest sources of pure stream oriented data. If handled and analysed properly, this data has a great impact on overall productivity.
Financial transactions, stock market data: Banking and financial transactions, or any other transaction for that matter is naturally a streaming data candidate, if there are a high number of transactions happening.
Website visits, clicks tracking: Website visit events, click trails and other user interaction data is another native streaming kind of data.
Application logs: Applications, especially micro-services generate high volumes of logs which can be consumed in a streaming fashion to allow debugging, root causing and general observability.

Most data streams also have a time axis. I mean, time when the data is captured plays an important role in the overall scheme of things. In some cases, looking at how data changed over time gives a unique edge. In case of IoT data for example, looking at a sensor's health status over time at scale can give a fairly decent idea about when it may fail.

In some other cases, the importance of streaming events reduce over time. For example, an application log event is very important in case the application failed, but a log event from say 6 months back, may not be very important. The duration for streaming event to be considered stale depends on several factors though - a banking transaction from 6 years back may be considered important.

Streaming data is a unique data paradigm with its own challenges and patterns. In this post, we'll take a deep look at Streaming Data and contrast some of the storage platforms (TimeSeries DBs) relevant to this paradigm. In the next post, we'll take a look at Streaming data analysis tools.

Streaming vs Batch processing

Before streaming data was a thing, real time data was collected as files that have a predefined beginning and end (generally based on a time window). This data was then fed to batch processing tools that were really good at handling such data. It made sense from an engineering perspective, developers could have simple mechanisms to create a new file every lets say 10000 events and then batch processing could process these files as they were created. But, as you can imagine, the moment you put a time window on a real time event data stream, it is no longer real time. Because of its very nature, batch processing will lag by few data points when compared to stream processing.

As businesses started to depend more on real time data, stream processing came into picture, and tools evolved to handle these streams. Stream processing is best for use cases where time elapsed between an event and business's response to the event matters, and batch processing works well when all the data has been collected. Instead of thinking one is better than the other, it is better to think of batch vs stream as using the approach that suites business objective.

Apache Spark

One of the most popular products in data analytics space, Apache Spark, is great for batch processing. Recently Spark also added support for Spark Streaming. While the name indicates a native streaming support, we know from experience that this is micro-batch processing, with its own limitations. For teams already using Spark for batch processing, it makes sense to leverage Spark Streaming as well. However, it is important to understand these points before using Spark for Streaming data processing.

Spark Streaming process data streams in batches. This is fine for some applications such as simple counts and ETL into Hadoop, but the lack of true record-by-record processes makes strict time-series analytics impossible.
Spark Streaming is still relatively new and offers limited libraries of stream functions. This means there may be few functions missing which are other wise available on other platforms.
Micro-batch approach means data from multiple remote sources which may be delayed or missing can't be detected.

Streaming data on TimeSeries DB

Majority of streaming data use cases make a great fit for TimeSeries DBs. These databases can not only ingest data at high rate from a multitude of sources, they also allow querying capability for users to query and understand their data sets. Since, TimeSeries DBs are built to use time as an additional axis for the data points it is a natural fit for cases where a perspective on how data changes wrt time is useful. Lets take a look at some of the most common cloud-native and open source DNA, TimeSeries DBs out there and discuss each of their intended use cases and design decisions.

Prometheus + (Thanos / Cortex)

Probably the most popular of the lot, Prometheus was built primarily as a systems monitoring and alerting toolkit. Accordingly, one of the most common use cases for Prometheus is as monitoring system that can scrape data from a wide (really wide) range of applications. Prometheus's model of pulling data from sources (instead of a source pushing the data to Prometheus) makes it easier to integrate. With self discovery of applications on platforms like Kubernetes and other cloud environments, Prometheus is almost the defacto monitoring platform in the cloud native world.

👍🏽 Self discovery of new services based on simple tag based configuration.
👍🏽 Pull / Scrape model to fetch monitoring data from applications.
😐 PromQL for data querying, adds a learning curve for developers, and may not work with existing SQL based systems.
👎🏽 Prometheus by design is somewhat limited to a single node and it is difficult to build a HA Prometheus cluster.
👎🏽 While Prometheus is great at collecting data, there is no clear approach to store or backup the scraped data.

Community has taken notice of Prometheus shortcomings and there are two new major initiatives already in place that have Prometheus DNA, but also enable building a HA, Large Scale Prometheus cluster. These projects are Cortex and Thanos.

Here is a great comparison of Thanos and Cortex by Grafana Labs team. While Prometheus, Cortex and Thanos are all CNCF projects, there are other independent OSS platforms that are out there solving streaming / time series data challenges.

Influx

InfluxDB is a general purpose TimeSeries DB that caters to wide range of use cases. Apart from the DB, the produce suite includes APIs for storing and querying data, processing it in the background for ETL or monitoring and alerting purposes, user dashboards, and visualizing and exploring the data and more.

👍🏽 Better scalability with cloud native design. Resource efficiency and performance.
👍🏽 Simple to deploy with a single binary providing most of the functionality.
😐 Since Crate is based on.
👎🏽 The clustering feature is not available in OSS, users will have to upgrade to Enterprise or Cloud offerings to use clustering.
👎🏽 Shards and retention periods are closely tied and there is no support for exporting older, lesser important data shards to S3 or similar storage platforms which are more cost effective.

Crate

CrateDB is a distributed SQL database primarily focused on machine data. It supports a variety of complex data structures and allows time series, text search and AI/ML analytics. One of the outstanding features is the support for ANSI SQL via PostgreSQL protocol, JDBC and even REST interfaces. This allows lot of systems to integrate with CrateDB out of the box.

👍🏽 Cloud-native design with shared nothing nodes. Self-healing and scaling by adding nodes as needed.
👍🏽 Out of the box integration with almost all platforms used in IoT & Machine data space. This includes MQTT, Telegraf, Kafka, Flink etc.
😐 While CrateDB supports SQL interface, it doesn't support transactions and strong ACID compliance.
👎🏽 CrateDB uses Elasticsearch code to handle cluster management, node discovery and communication. This makes it prone to dirty reads and lost updates. Refer this SO post for details.

TimeScale

TimeScaleDB is the newest entrant in the TimeSeries DB space. It is a TimeSeries SQL database optimized for fast ingest and complex queries. You'll notice, like Crate, TimeScale also is PostgreSQL compliant, and hence will integrate with SQL tools out of the box. TimeScaleDB is one step further in PostgreSQL compliance though, it is a PostgreSQL extension. So, PostgreSQL tools like pg_basebackupwill be able to backup data stored in TimescaleDB. Similarly, restore and point-in-time recovery will work.

👍🏽 Native PostgreSQL interface support. Postgres backup and restore tools work with TimeScale as well.
👍🏽 High performance and simple scaling model.

TimeScale team wrote a great post on TimeScaleDB vs InfluxDB, refer the blog post here.