Apache Arrow is an in-memory data structure specification for use by engineers building data systems. It has several key benefits:
- An in-memory columnar layout permitting O(1) random access. The layout is cache-efficient for analytics workloads and enables SIMD optimizations on modern processors, so developers can write very fast algorithms that process Arrow data structures.
- Efficient and fast data interchange between systems without the serialization costs associated with formats like Thrift, Avro, and Protocol Buffers.
- A flexible structured data model supporting complex types that handles flat tables as well as real-world JSON-like data engineering workloads.
Arrow is neither an engine nor a storage system. It is a set of formats and algorithms for working with hierarchical, in-memory, columnar data, plus an emerging set of programming language bindings for working with those formats and algorithms.
As Arrow is adopted as the internal representation in each system, those systems come to share the same internal representation of data. This means that one system can hand data directly to another for consumption, with no conversion step. And when the systems are collocated on the same node, even that hand-off copy can be avoided through the use of shared memory. In many cases, then, moving data between two systems has zero overhead.
Arrow improves the performance for data movement within a cluster in these ways:
- Two processes utilizing Arrow as their in-memory data representation can “relocate” the data from one process to the other without serialization or deserialization. For example, Spark could send Arrow data to a Python process for evaluating a user-defined function.
- Arrow data can be received from Arrow-enabled database-like systems without costly deserialization on receipt. For example, Kudu could send Arrow data to Impala for analytics purposes.
For the Python and R communities, Arrow is extremely important, as data interoperability has been one of the biggest roadblocks to tighter integration with big data systems (which largely run on the JVM).
SAP HANA should also become "Arrow-enabled", so that it can integrate with the emerging big data and analytics technologies built around the format.