Building a stream processing pipeline with Kafka, Storm and Cassandra – Part 1: Introducing the components
When done right, computer clusters are very powerful tools. They can bring great advantages in speed, scalability and availability. But the extra power comes at the cost of additional complexity. If you don’t stay on top of that complexity, you’ll soon become bogged down by it all and risk losing all the benefits that clustering brings.
In this three-part series, we’re going to explain how you can simplify the setup and operation of a computing cluster. Using a real-world, non-trivial example, you’ll see how you can keep the complexity under control when building a cluster by using containerized applications. After that, you’ll also get to see how everything can be simplified even further by deploying your apps on CoreOS.
To illustrate our explanations, we’re going to build a high-performance, real-time data processing pipeline. The apps we will use are: Kafka, Storm and Cassandra (all provided by the Apache project).
Apache Kafka is a highly-available, high-throughput, distributed message broker that handles real-time data feeds. Kafka was originally developed by LinkedIn and open sourced in January 2011. Since then, it’s found use at Yahoo, Twitter, Spotify and many others.
Apache Storm is a distributed real-time computation system. It is often compared with Apache Hadoop, a similar system albeit one that is batch-oriented, unlike Storm which processes a stream of data. Storm was initially developed by BackType and then acquired and open sourced by Twitter in September 2011. Like Kafka, Storm is also used by Yahoo, Twitter and Spotify and many others.
Apache Cassandra is a distributed, decentralized, high-throughput and highly-available database with no single point of failure. It was initially developed by Facebook and open sourced in July 2008. Along with Thrift, Cassandra has its own CQL (Cassandra Query Language) which is similar to SQL. With an SQL-like interface, Cassandra is a superior solution for storing and managing huge amounts of data. According to their own benchmarks, the more nodes your cluster has, the higher its performance.
Let’s briefly examine how each of these products function independently.
A critical dependency of Apache Kafka is Apache Zookeeper, which is a distributed configuration and synchronization service. Kafka stores basic metadata in Zookeeper such as information about topics (message groups), brokers (list of Kafka cluster instances), messages’ consumers (queue readers) and so on.
To start a single-node Kafka broker, you should have a single-node Zookeeper instance and a Kafka broker ID.
Like Kafka, Apache Storm also requires a Zookeeper cluster. Unlike Kafka, a Storm cluster consists of at least two distinct components: Storm Nimbus and Storm Supervisor.
- Storm Nimbus is a master node similar to Hadoop JobTracker. It’s responsible for code distribution around the cluster, assigning tasks to Supervisors and monitoring for failures.
- Storm Supervisor is a worker node. It executes the code assigned by Storm Nimbus.
A Storm cluster is fail-safe and stateless. Because Nimbus distributes the code to Supervisors, a Storm cluster can work even without a Nimbus node, so long as Zookeeper and at least one Supervisor node are available.
Additional Storm components include Storm DRPC, Storm UI and Storm Logviewer.
When processing data, similar systems (like Hadoop) use the concept of a “job”, a batch process that eventually returns a results and ends. However, Storm instead uses “topologies” which are graphs of execution; each node contains processing logic and each edge controls how data is passed between nodes. Furthermore, unlike jobs, the lifespans of topologies are not limited and only end when you kill them.
A Storm topology consists of streams handled by two components: “spouts” and “bolts”. A spout is source stream. A bolt is a stream data processing entity which possibly emits new streams. For example, a spout may connect to the Twitter API and output a stream of tweets – a bolt could then consume this stream and output a stream of trending topics.
For more information, see the Storm tutorial.
Cassandra can, at a minimum, run as a single-node cluster. Running a multi-node Cassandra cluster requires only one additional configuration: at least one Cassandra Seed node IP address (for test purposes most often one initial Cassandra node). Once you have configured your Cassandra cluster, each Cassandra node will exchange information about another node using a Gossip, a peer-to-peer communication protocol. See Cassandra’s Getting Started guide for more information.
Before we end this first part, let’s have a quick preview of CoreOS, which we’ll cover in more detail later.
CoreOS is a lightweight Linux distribution based on ChromiumOS. It provides infrastructure for easy clustered application management and deployment. CoreOS can be run on dedicated hardware, VMs or on most cloud-based hosting providers including Amazon AWS, DigitalOcean, Google Compute Engine, Microsoft Azure and others.
The main features of CoreOS are:
- atomic updates with read-only dual-partition scheme
- each application is run inside its own container
- a distributed, consistent key value shared configuration store and service discovery provided by etcd (similar to Apache Zookeeper)
- a distributed init system (called fleet)
You can install CoreOS on a single machine, on a VM using a livecd ISO file, or simply run the CoreOS Vagrant image. It’s also possible to start a CoreOS instance on one of the cloud providers. Refer to CoreOS’s website to learn more.
When we eventually explore how to deploy these apps on CoreOS, we’ll end up with a setup like this:
But that’s for later part. In the next part, we’ll see how to deploy these apps directly into Docker containers.
Published on 08. April 2015
Original Blogpost: https://endocode.com/blog/2015/04/08/building-a-stream-processing-pipeline-with-kafka-storm-and-cassandra-part-1-introducing-the-components/