RStream: Simple and Efficient Batch and Stream Processing at Scale
Alessandro Margara; Gianpaolo Cugola
2021-01-01
Abstract
Distributed data processing platforms aim to balance ease of use and performance. The question is: do they succeed? Systems like Apache Spark or Apache Flink offer a high-level programming model that yields simple and concise definitions of processing tasks, abstracting away most of the concerns associated with concurrency and distribution, but at the cost of a large performance gap with respect to custom programs that use low-level primitives to control distribution and resource usage. Can we close this gap? Can alternative design choices yield better performance without sacrificing simplicity?

This paper answers the above questions by introducing RStream, a novel data processing platform written in Rust. RStream provides a high-level programming model similar to that of mainstream data processing systems, supporting batch and stream processing, data transformations, grouping, aggregation, iterative computations, and time-based analytics, while incurring a much lower overhead, closer to that of custom, low-level code. In numerical terms, our evaluation shows that RStream programs are nearly identical in complexity to equivalent programs written in Flink, while delivering from 2x to 20x the throughput of Flink and rivaling custom MPI implementations.