Realtime distributed computing is tough, especially at scale: managing a large data pipeline is tough, and it’s even tougher to keep latency low and availability high when processing tens of thousands of items per second. Many people turn in despair to Java or Scala when it comes time to scale up, but we can do it in Python: Apache Storm is a distributed realtime computation system that can let you scale up- and no need to reach for a new language!
This talk will walk the audience through the basics of Apache Storm and how it’s an elegant, useful solution to realtime distributed computing, as well as how streamparse can let you write your storm components in Python by writing some code and a basic storm topology in Python. We’ll also look at how Parsely uses Storm in production to handle billions of realtime events a month. If we have time, we’ll go a bit into how Storm has several advantages over other common Python computing data streaming solutions, like Spark’s microbatching.
Goals:
At the end of the talk, ideally you should be able to understand: What Apache Storm is, how it works generally, and what scenarios it’s useful for How streamparse can be used to write your Storm topologies How Storm + streamparse is used in an actual high-availability, low-latency production environment