In contrast to traditional data analytics systems, which collect and periodically process huge static volumes of data, streaming analytics systems avoid putting data at rest: they process each item as it becomes available, thus minimising the time a single data item spends in the processing pipeline. Systems that routinely achieve latencies of a few seconds, or even below a second, between receiving data and producing output are therefore often described as "real-time".
In an attempt to combine the best of both worlds, an architectural pattern called the Lambda Architecture has gained considerable popularity: it complements slow batch-oriented processing with an additional real-time component and thus addresses both the Volume and the Velocity challenge of Big Data at the same time.
The Lambda Architecture was proposed by Nathan Marz (the creator of Apache Storm) in his commendable book Big Data. As illustrated in Figure 2, the Lambda Architecture describes a system comprising three layers: data is stored in a persistence layer such as HDFS, from which it is ingested and processed periodically (e.g. once a day) by the batch layer; the speed layer handles the portion of the data that has not yet been processed by the batch layer; and the serving layer consolidates both by merging the output of the batch and the speed layer.
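The consolidation performed by the serving layer can be sketched as merging a stale-but-complete batch view with a fresh-but-partial real-time view. The following is a minimal illustration of that idea for per-key counts; the function name and the dictionary-based view representation are assumptions made for this sketch, not part of the architecture description itself.

```python
# Illustrative sketch of the Lambda Architecture's serving layer:
# the batch view covers all data up to the last batch run, while the
# real-time view covers only data that arrived since then.
# (Names and the dict-based views are assumptions for illustration.)

def merge_views(batch_view: dict, realtime_view: dict) -> dict:
    """Consolidate per-key counts from the batch and speed layers."""
    merged = dict(batch_view)
    for key, count in realtime_view.items():
        merged[key] = merged.get(key, 0) + count
    return merged

batch_view = {"page_a": 1000, "page_b": 250}   # computed once a day
realtime_view = {"page_a": 12, "page_c": 3}    # not yet in the batch view
result = merge_views(batch_view, realtime_view)
# result sums page_a across both layers and keeps keys seen in only one
```

Once the next batch run completes, the keys currently served from the real-time view are absorbed into the batch view and the speed layer's state for that period can be discarded.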
The obvious benefit of having a real-time system compensate for the high latency of batch processing comes at the cost of increased complexity in development, deployment, and maintenance. If the batch layer is implemented with a system that supports both batch and stream processing (e.g. Spark), the speed layer can often be implemented with minimal overhead by using the corresponding streaming API (e.g. Spark Streaming), reusing existing business logic and the existing deployment. For Hadoop-based and other systems that do not provide a streaming API, however, the speed layer has to be realised as a separate system. Writing the business logic in an abstraction layer such as Summingbird enables automatic compilation of code for both the batch and the stream processing system (e.g. Hadoop and Storm) and thus eases development in those cases where the batch and speed layer can share (parts of) the same business logic, but the overhead for deployment and maintenance still remains.
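The idea of writing the business logic once and running it in both layers can be sketched as follows: a pure function holds the logic, a batch job applies it to the full data set, and a streaming job folds the same function over records as they arrive. This is a simplified analogy to the Summingbird approach, not its actual API; all names below are illustrative assumptions.

```python
# Sketch of sharing business logic between a batch and a speed layer.
# The pure function count_words is the shared logic; batch_job and
# StreamingJob are hypothetical wrappers, not a real framework's API.

from collections import Counter

def count_words(lines):
    """Pure business logic: word counts for an iterable of text lines."""
    counts = Counter()
    for line in lines:
        counts.update(line.split())
    return counts

def batch_job(all_lines):
    """Batch layer: one pass over the full (static) data set."""
    return count_words(all_lines)

class StreamingJob:
    """Speed layer: fold the same logic over records as they arrive."""
    def __init__(self):
        self.counts = Counter()

    def on_record(self, line):
        self.counts.update(count_words([line]))

# Both layers produce the same result over the same input,
# which is what lets the serving layer merge their outputs.
lines = ["to be or not to be", "to be"]
stream = StreamingJob()
for line in lines:
    stream.on_record(line)
```

Because counting is associative and commutative, the incremental streaming result equals the batch result over the same records; this algebraic property is exactly what frameworks like Summingbird rely on to target both backends from one program.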
A 1,000-foot view of the landscape of scalable stream processing systems and their latency vs. throughput trade-offs: