Write ahead logs (also known as journals) are used in databases and file systems to ensure the durability of data operations. The intention of an operation is first written to a durable log, and then the operation is applied to the data. If the system fails in the middle of applying the operation, it can recover by reading the log and reapplying the operations it had intended to perform.
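To make the log-then-apply idea concrete, here is a minimal sketch in plain Python. It is not how Spark Streaming implements its logs; the `WriteAheadLog` class, the `put` and `recover` helpers, and the JSON-lines file format are all hypothetical names and choices made for illustration.

```python
import json
import os
import tempfile


class WriteAheadLog:
    """Append-only log of intended operations (hypothetical, for illustration)."""

    def __init__(self, path):
        self.path = path

    def append(self, op):
        # Record the intention and force it to stable storage before returning.
        with open(self.path, "a", encoding="utf-8") as f:
            f.write(json.dumps(op) + "\n")
            f.flush()
            os.fsync(f.fileno())

    def replay(self):
        # Return every logged operation in the order it was written.
        if not os.path.exists(self.path):
            return []
        with open(self.path, encoding="utf-8") as f:
            return [json.loads(line) for line in f if line.strip()]


def apply_op(store, op):
    # Setting a key is idempotent, so reapplying it during recovery is safe.
    store[op["key"]] = op["value"]


def put(wal, store, key, value):
    op = {"key": key, "value": value}
    wal.append(op)       # 1. write the intention to the durable log
    apply_op(store, op)  # 2. then apply the operation to the data


def recover(wal):
    # Rebuild the data by reapplying every logged operation.
    store = {}
    for op in wal.replay():
        apply_op(store, op)
    return store


log_path = os.path.join(tempfile.mkdtemp(), "wal.log")
store = {}
wal = WriteAheadLog(log_path)
put(wal, store, "a", 1)
put(wal, store, "b", 2)

# Simulate a crash: the in-memory store is lost, only the log file survives.
recovered = recover(WriteAheadLog(log_path))
```

Because each operation reaches the log before it touches `store`, the state rebuilt by `recover` matches what was in memory before the simulated crash.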
When write ahead logs are enabled, all the received data is also saved to log files in a fault-tolerant file system. This allows the received data to be durable across any failure in Spark Streaming. Additionally, if the receiver acknowledges receiving data only after the data has been written to the write ahead logs, the buffered but unsaved data can be resent by the source after the driver is restarted. Together, these two mechanisms can ensure that there is zero data loss: all data is either recovered from the logs or resent by the source.
When the logs are enabled, all receivers gain the benefit of recovering data that was reliably received. It is recommended that in-memory replication be disabled (by setting the appropriate persistence level in the input stream), as the fault-tolerant file system used for the write ahead log is likely to be replicating the data as well.
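As a rough illustration, the setup described above might look like the following in PySpark. This is a sketch under assumptions rather than a reference configuration: it assumes PySpark is installed, that `spark.streaming.receiver.writeAheadLog.enable` is the configuration key in your Spark version, and that the logs are kept under the checkpoint directory. The application name, checkpoint path, host, and port are placeholders. The socket source is used only to show where the persistence level is set; it does not support acknowledgments.

```python
from pyspark import SparkConf, SparkContext, StorageLevel
from pyspark.streaming import StreamingContext

# Placeholder values (assumptions for illustration only).
CHECKPOINT_DIR = "hdfs:///tmp/streaming-checkpoint"
HOST, PORT = "localhost", 9999

conf = (
    SparkConf()
    .setAppName("WalExample")
    # Save all received data to write ahead logs.
    .set("spark.streaming.receiver.writeAheadLog.enable", "true")
)

sc = SparkContext(conf=conf)
ssc = StreamingContext(sc, 1)  # 1-second batches

# Assumption: the write ahead logs are stored under the checkpoint directory,
# which should live on a fault-tolerant file system such as HDFS.
ssc.checkpoint(CHECKPOINT_DIR)

# A non-replicated persistence level: the file system behind the log
# already replicates the data, so in-memory replication is redundant.
lines = ssc.socketTextStream(HOST, PORT, storageLevel=StorageLevel.MEMORY_AND_DISK)
lines.pprint()

ssc.start()
ssc.awaitTermination()
```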
Additionally, if you want to recover even the buffered data, you will have to use a source that supports acknowledgments (like Kafka, Flume and Kinesis), and implement a reliable receiver that acknowledges the source only when the data is reliably stored in the log. The built-in Kafka and Flume Polling receivers are already reliable (for Flume, see http://spark.apache.org/docs/1.2.0/streaming-flume-integration.html).
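The ordering that makes a receiver reliable (store first, acknowledge second) can be sketched with a small simulation in plain Python. This does not use Spark's receiver API: `AckingSource`, `ReliableReceiver`, and the `fail_after` parameter are hypothetical names, and a Python list stands in for the write ahead log.

```python
class AckingSource:
    """Hypothetical source that keeps each record until it is acknowledged."""

    def __init__(self, records):
        self._pending = dict(enumerate(records))

    def unacked(self):
        # Records that still have to be (re)sent, in offset order.
        return sorted(self._pending.items())

    def ack(self, offset):
        self._pending.pop(offset, None)


class ReliableReceiver:
    """Acknowledges a record only after it has been appended to the log."""

    def __init__(self, source, log):
        self.source = source
        self.log = log  # stands in for the write ahead log

    def receive(self, fail_after=None):
        for count, (offset, record) in enumerate(self.source.unacked()):
            if fail_after is not None and count >= fail_after:
                return  # simulated crash: remaining records stay unacknowledged
            self.log.append((offset, record))  # 1. store the record in the log
            self.source.ack(offset)            # 2. only then acknowledge it


source = AckingSource(["a", "b", "c"])
log = []

# The first receiver crashes after two records, leaving "c" unacknowledged.
ReliableReceiver(source, log).receive(fail_after=2)
pending_after_crash = source.unacked()

# A restarted receiver is sent the unacknowledged record again.
ReliableReceiver(source, log).receive()
```

After the restart, the log holds all three records and the source has nothing left to resend, which is the "recovered from the logs or resent by the source" guarantee in miniature.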
Finally, it is worth noting that enabling the write ahead logs may slightly reduce data ingestion throughput. Since all the received data will be written to a fault-tolerant file system, the write throughput of the file system and the network bandwidth used for replication can become bottlenecks. In that case, either create more receivers to increase the parallelism of receiving the data, or use better hardware to increase the throughput of the fault-tolerant file system, or both.