No, this isn’t a nature hike. We’re talking about Cloud Computing, Stream Processing and Big (sea of) Data.
According to IBM estimates, the world creates 2.5 quintillion bytes (that’s 2.5 million terabytes) of new data every day. 90% of the world’s data has been created in the last two years alone. We’re in the era of Big Data. There’s more of it than any single machine can handle. The question is: how do you deal with it?
The answer is that you’ve got to share the load. Give your laptop that well-deserved break and fire up an account with your favorite cloud services provider. They’ll provide you with a handful of instances – virtual machines, complete with operating systems, that run on shared hardware alongside other instances – across which you can distribute your computations. This is Cloud Computing: the practice of using a network of computers to share the storage, computation and processing of data.
Now that you’ve got multiple machines to work with, the next step is figuring out how to make use of them. The most common technique is what’s referred to as batch processing: the processing of a collection of jobs in a single execution. In a cloud computing environment it works the same way, except you can split up the data so that each of your machines takes on a portion of the workload.
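To make that concrete, here’s a minimal sketch in Python, using the standard multiprocessing module to stand in for a fleet of cloud machines. The data and the process_chunk function are made up for illustration; a real deployment would use a cluster framework rather than local processes.

```python
from multiprocessing import Pool

def process_chunk(chunk):
    # Toy workload: square every record in this chunk.
    return [x * x for x in chunk]

if __name__ == "__main__":
    data = list(range(100))                    # the full batch of records
    chunks = [data[i::4] for i in range(4)]    # split across 4 "machines"

    with Pool(processes=4) as pool:
        # Each worker crunches its portion in parallel, but map() only
        # returns once every chunk is done -- classic batch behavior.
        results = pool.map(process_chunk, chunks)

    total = sum(len(r) for r in results)
    print(f"processed {total} records in one batch")
```

Note that pool.map() blocks until every worker has finished – just like all 100 trucks rolling out together at the end of the batch.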
Let’s compare batch processing in computing to automobile assembly. Batch processing would be like having one “station” that built 100 trucks. That single station would start with 100 frames, then add 400 wheels, then 100 engines, and so on. In a cloud environment, it would be like adding three more stations. Each station is then responsible for building 25 trucks, cutting the time it takes to build all 100 by a factor of 4. The important thing to note is that all of those trucks are finished only at the end of the “batch”, since you can’t start the next step until every truck has completed the previous one.
The problem is, maybe you don’t want to wait for all 100 trucks to be finished before getting the first one out to a customer. This is where stream processing comes in, which is much like an assembly line. Now instead, you take your four stations and assign them each their own task – say, frame, engine, body and interior. Each truck moves through all four stations, so that by the time the first truck is all done, the fifth one is just being started. This allows you to finish them in a “stream”, getting them out to customers incrementally rather than waiting for the entire fleet to be completed.
Stream processing provides the same effect in computing. It allows you to process data piece by piece, getting results for the first chunks without having to wait for the rest. While it can be challenging to implement (particularly when retrofitting existing software), it pays off whenever your application needs results as soon as the first data arrives – especially in a world where new data keeps crashing in, wave after wave.
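To keep the truck analogy going, here’s a hedged sketch of the same idea with Python generators playing the assembly-line stations. The station names and the pipeline itself are illustrative, not a real streaming framework:

```python
def frames():
    # Source stage: frames roll in one at a time.
    for i in range(1, 101):
        yield f"truck {i}: frame"

def add_engine(trucks):
    for truck in trucks:
        yield truck + " + engine"

def add_body(trucks):
    for truck in trucks:
        yield truck + " + body"

def add_interior(trucks):
    for truck in trucks:
        yield truck + " + interior"

# Chain the four stations into one pipeline. Each stage pulls a single
# item from the stage before it, so finished trucks stream out one by
# one instead of arriving all at once.
line = add_interior(add_body(add_engine(frames())))

print(next(line))  # the first truck is done immediately,
print(next(line))  # long before the 100th frame enters the line
```

Each call to next() carries one truck through all four stations, so the first result is available right away – no waiting on the entire fleet.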