Howto: Setting up a multi-node Storm cluster under Ubuntu
This tutorial explains how to set up a Storm cluster running on several Ubuntu machines. The Storm framework processes unbounded data streams in a distributed manner and in real time.
The basic idea behind Storm's distributed real-time processing is to split the overall task into several smaller tasks which can be executed quickly. Each of these tasks is executed by a so-called bolt, which is one type of vertex in a directed processing graph called a topology. Bolts consume one or more input data streams and produce an output data stream. Besides bolts, a topology contains a second type of vertex, the spout, which turns an external input into a data stream that bolts can process. The directed edges in the topology represent the data flow from spouts to bolts or from bolts to bolts.[1]
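The data flow described above can be pictured with an ordinary shell pipeline. The following is only an analogy and does not use the Storm API: the function names and the sample sentences are made up for illustration, and unlike Storm the pipeline runs on one machine over a finite input.

```shell
# Analogy only: plain shell functions, not the Storm API.

# "Spout": turns an external source into a stream of tuples (here: lines).
sentence_spout() {
    printf '%s\n' "the cow jumped" "the moon"
}

# "Bolt" 1: consumes sentences and emits one word per line.
split_bolt() {
    tr ' ' '\n'
}

# "Bolt" 2: consumes words and emits "word count" pairs.
count_bolt() {
    sort | uniq -c | awk '{ print $2, $1 }'
}

# The "topology": the pipes are the directed edges from the spout
# through the first bolt to the second bolt.
run_topology() {
    sentence_spout | split_bolt | count_bolt
}
```

Calling run_topology prints one line per distinct word together with its count. In a real topology each vertex would run as many parallel tasks spread across the cluster.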
In order to execute a topology, it is sent to the master node of the Storm cluster, which runs a daemon called "Nimbus". This daemon distributes the individual tasks across the cluster. Every machine of the cluster runs a daemon called "Supervisor", which receives the tasks assigned to it and executes them in one or more worker processes. The coordination between the Nimbus and the Supervisors is done through a ZooKeeper cluster.[1]
Figure 1 gives an overview of the cluster which will be set up during this tutorial. It consists of four Ubuntu machines. Each of them is running a Supervisor daemon. The Nimbus daemon is running on the first machine and the ZooKeeper server on the second.
This tutorial was tested with the following software versions:
- Ubuntu Linux 12.04.3
- Java version 1.8.0-ea
- ZooKeeper 3.4.5
- ZeroMQ 2.1.7
- Storm 0.8.2
Prerequisites
The following required software has to be installed on all machines.
Java 8
Before Java can be installed, two further packages have to be installed. To do so, execute:
sudo apt-get install software-properties-common
sudo apt-get install python-software-properties
Then, the repository containing a build of Oracle Java 8 has to be registered and the local package list has to be updated:
sudo add-apt-repository ppa:webupd8team/java
sudo apt-get update
Finally, Java 8 can be installed:
sudo apt-get install oracle-java8-installer
SSH Server
SSH is not required for running Storm but it simplifies the installation and running of a Storm cluster because you get remote access. Therefore, an SSH server should run on all machines.
First check if SSH is already running by executing
ssh address.of.machine
for each machine that will be part of the cluster, where address.of.machine is the address of that machine. If the login fails on one of the machines, an SSH server has to be installed on that machine by following the instructions on
http://www.cyberciti.biz/faq/ubuntu-linux-openssh-server-installation-and-configuration/
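Once SSH access works, steps that must be repeated on every machine can be scripted. The following is a minimal sketch, not part of the original setup: the function name run_on_all, the variable SSH_CMD and the addresses of machines 03 and 04 are assumptions made for illustration. It also assumes that the login works without further interaction, for example with SSH keys.

```shell
# Hypothetical host list; replace with the real addresses of the four machines.
HOSTS="address.of.machine01 address.of.machine02 address.of.machine03 address.of.machine04"

# Run the given command on every host in HOSTS, one after another.
# SSH_CMD can be overridden (e.g. with "echo") for a dry run.
run_on_all() {
    for host in $HOSTS; do
        ${SSH_CMD:-ssh} "$host" "$@"
    done
}

# Example (not executed here): run_on_all java -version
```

Setting SSH_CMD=echo before calling the function prints the host and command pairs instead of connecting, which is a cheap way to check the host list first.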
Install ZooKeeper
In the current setting the ZooKeeper server should run on Machine02. Therefore, we first connect to this machine by executing:
ssh userName@address.of.machine02
userName is the name of the user account used to log in on Machine02. If the user name is the same as on the local machine, the following shorter command is sufficient:
ssh address.of.machine02
First, a folder has to be created which ZooKeeper will use to store its data and log files:
mkdir -p /path/to/zookeeper/data
In this tutorial, this path is referred to by the placeholder /path/to/zookeeper/data.
Now, ZooKeeper can be downloaded
wget http://mirror.synyx.de/apache/zookeeper/stable/zookeeper-3.4.5.tar.gz
and extracted
tar -xzf zookeeper-3.4.5.tar.gz
In order to configure the ZooKeeper server we first create a configuration file by copying a sample configuration.
cp zookeeper-3.4.5/conf/zoo_sample.cfg zookeeper-3.4.5/conf/zoo.cfg
This newly created configuration file zoo.cfg is edited by changing the value of dataDir
to the new path for the ZooKeeper data.
dataDir=/path/to/zookeeper/data
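Instead of opening an editor, the dataDir line can also be rewritten from the command line. The following is a small sketch using sed; the function name set_zk_data_dir is made up for illustration, and it assumes that the configuration file already contains a line starting with dataDir=, as the sample configuration does.

```shell
# Replace the dataDir line of a ZooKeeper configuration file.
# Usage: set_zk_data_dir <config file> <data directory>
set_zk_data_dir() {
    cfg="$1"
    data_dir="$2"
    # "|" is used as the sed delimiter because the path contains "/".
    sed "s|^dataDir=.*|dataDir=${data_dir}|" "$cfg" > "${cfg}.tmp" &&
        mv "${cfg}.tmp" "$cfg"
}

# Example (not executed here):
# set_zk_data_dir zookeeper-3.4.5/conf/zoo.cfg /path/to/zookeeper/data
```

All other lines of the file are left unchanged.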
Now, the installation of ZooKeeper is complete. The ZooKeeper server can be started by executing
./zookeeper-3.4.5/bin/zkServer.sh start
In order to check whether the server is running correctly, the following commands can be executed:
echo ruok | nc addressOfServer 2181
echo stat | nc addressOfServer 2181
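A healthy ZooKeeper server answers the ruok command with imok. The check can be wrapped in a function so that scripts can react to the result. This is a sketch only: the function name zk_is_ok and the variable NC_CMD are assumptions for illustration, and nc is used exactly as in the commands above.

```shell
# Succeeds if the ZooKeeper server at host:port answers "imok" to "ruok".
# Usage: zk_is_ok <host> [port]   (the port defaults to 2181)
# NC_CMD can be overridden for testing; by default nc is used.
zk_is_ok() {
    host="$1"
    port="${2:-2181}"
    reply=$(echo ruok | ${NC_CMD:-nc} "$host" "$port")
    [ "$reply" = "imok" ]
}

# Example (not executed here):
# zk_is_ok addressOfServer && echo "ZooKeeper is up"
```

The function returns a non-zero exit status both when the server cannot be reached and when it gives any other answer.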
To shut down the server again simply type
./zookeeper-3.4.5/bin/zkServer.sh stop
Install ZeroMQ
The communication between the spouts and bolts of Storm is done with ZeroMQ. ZeroMQ is a socket library that transports messages between processes on the same or on different machines. First, the native C library of ZeroMQ has to be built and installed. Thereafter, the Java library JZMQ, which calls the native ZeroMQ library, has to be built and installed. These steps have to be performed on all machines.
Build and install native ZeroMQ
In order to build ZeroMQ, the uuid-dev development package has to be installed.
sudo apt-get install uuid-dev
Now, the source code of ZeroMQ can be downloaded
wget http://download.zeromq.org/zeromq-2.1.7.tar.gz
and extracted
tar -xzf zeromq-2.1.7.tar.gz
After extracting ZeroMQ, it can be built and installed. To do so, the previously extracted folder has to be entered
cd zeromq-2.1.7
the build environment has to be configured
./configure
the library has to be built
make
and finally installed
sudo make install
Now, the current folder can be left.
cd ..
Build and install native JZMQ
JZMQ requires several further programs:
- a program to download the source code from a remote repository
sudo apt-get install git
- a program which provides a common interface to query installed libraries
sudo apt-get install pkg-config
- a program for creating portable compiled libraries
sudo apt-get install libtool
- and a program which creates portable makefiles: automake
The current version of automake cannot be used because it produces several errors when building JZMQ. In the setting described in this tutorial, version 1.11.1 works well. This version has to be downloaded
wget http://launchpadlibrarian.net/60345882/automake_1.11.1-1ubuntu1_all.deb
and installed
sudo dpkg --install automake_1.11.1-1ubuntu1_all.deb
The latter command produces an error caused by missing dependencies. These missing libraries can be installed by executing
sudo apt-get -f install
In order to prevent automake from being updated automatically, the package is put on hold:
echo "automake hold" | sudo dpkg --set-selections
Now, JZMQ can be built and installed. To do so, a remote repository has to be cloned which contains a version of JZMQ that is compatible with Storm:
git clone https://github.com/nathanmarz/jzmq.git
The newly created folder is entered
cd jzmq
Thereafter, JZMQ can be built
./autogen.sh
./configure
make
and installed
sudo make install
Even if there is no error, the installation is incomplete. In order to make JZMQ work properly with Storm, a missing jar has to be created
cd src
jar -cvf zmq.jar ./org/
sudo cp zmq.jar /usr/share/java/zmq.jar
cd ../..
Furthermore, some libraries have been stored in a location where Ubuntu does not find them. These libraries now have to be copied to the correct location.
sudo cp jzmq/perf/zmq-perf.jar /usr/share/java/zmq-perf.jar
sudo cp /usr/local/lib/. /usr/lib -R
Finally, Ubuntu has to reload all libraries.
sudo ldconfig
Install Storm
After finishing the installation of ZeroMQ, Storm can be installed on all machines.
Storm is distributed in zip archives. In order to extract them, unzip has to be installed.
sudo apt-get install unzip
Similar to the installation of ZooKeeper, a folder has to be created which Storm will use to store its data:
mkdir -p /path/to/storm/data
In this tutorial, this path is referred to by the placeholder /path/to/storm/data.
Now, Storm can be downloaded
wget https://dl.dropboxusercontent.com/s/fl4kr7w0oc8ihdw/storm-0.8.2.zip
and extracted
unzip -q storm-0.8.2.zip
The extracted folder storm-0.8.2
contains the folder logs in which the log files will be written, bin which contains the scripts to run Storm, and conf which contains configuration files for Storm. In order to set up Storm correctly, the file conf/storm.yaml
has to be modified:
storm.zookeeper.servers:
- "address.of.machine02"
nimbus.host: "address.of.machine01"
storm.local.dir: "/path/to/storm/data"
In the first two lines the address of the used ZooKeeper server is defined, which is run on Machine02 in this tutorial. In the third line the address of the Nimbus is set. Finally, the position where Storm should store its data is defined in the last line. Now, the Storm cluster is ready to start.
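When the same configuration has to be placed on all machines, the file can also be generated by a small script. The following sketch writes exactly the three settings shown above and nothing else; the function name write_storm_yaml is made up for illustration. Note that it overwrites the target file, so any other settings in an existing conf/storm.yaml would be lost.

```shell
# Write a minimal storm.yaml with the three settings used in this tutorial.
# Usage: write_storm_yaml <output file> <zookeeper host> <nimbus host> <local dir>
write_storm_yaml() {
    cat > "$1" <<EOF
storm.zookeeper.servers:
    - "$2"
nimbus.host: "$3"
storm.local.dir: "$4"
EOF
}

# Example (not executed here):
# write_storm_yaml storm-0.8.2/conf/storm.yaml \
#     address.of.machine02 address.of.machine01 /path/to/storm/data
```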
Run Storm
Before the Storm cluster can be started, the ZooKeeper server has to be running. To start it, the following command has to be executed on Machine02:
./zookeeper-3.4.5/bin/zkServer.sh start
The following commands, which will start the parts of the Storm cluster, are not executed in the background. This means that for each of them a new terminal is required. The programs can be stopped by pressing Ctrl + C.
The Storm cluster is started by first running the Nimbus on Machine01:
./storm-0.8.2/bin/storm nimbus
Thereafter, a web server can be started on the same machine as the Nimbus. It provides a user interface for the Storm cluster, which can be accessed with a web browser under the URL http://address.of.machine01:8080.
./storm-0.8.2/bin/storm ui
Finally, the Supervisors have to be started on all machines.
./storm-0.8.2/bin/storm supervisor
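Since each of the commands above occupies a terminal, it can be convenient to start them detached from the terminal instead. The following is a minimal sketch using nohup; the function name start_in_background and the log file names are assumptions for illustration and are not part of Storm.

```shell
# Start a long-running command detached from the terminal.
# Usage: start_in_background <log file> <command> [arguments...]
# Prints the process ID so that the process can be stopped later with kill.
start_in_background() {
    log_file="$1"
    shift
    # Redirect all output to the log file and ignore hangup signals.
    nohup "$@" > "$log_file" 2>&1 &
    echo $!
}

# Example (not executed here):
# nimbus_pid=$(start_in_background nimbus.out ./storm-0.8.2/bin/storm nimbus)
# kill "$nimbus_pid"    # stops the Nimbus again
```

A process started this way no longer reacts to Ctrl + C, so the printed process ID should be kept in order to stop it with kill.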
Execute a topology
When a Storm topology has been created (see [1]), it has to be packed into a jar. All Java classes the topology depends on have to be included in this jar. This means that other jars have to be unpacked and the extracted classes integrated into the jar. Classes which are already part of Storm must not be included in the jar.[2]
The previously created topology can be executed by running the following command on the same machine as the Nimbus:
./storm-0.8.2/bin/storm jar path/to/topology.jar qualified.name.of.MainClass aUniqueName
The topology can be stopped by executing
./storm-0.8.2/bin/storm kill aUniqueName
or using the user interface.
Further links
https://github.com/nathanmarz/storm/wiki
http://www.michael-noll.com/tutorials/running-multi-node-storm-cluster/
I. Bedini and S. Sakr and B. Theeten and A. Sala and P. Cogan "Modeling Performance of a Parallel Streaming Engine: Bridging Theory and Costs" Proceedings of the 4th ACM/SPEC International Conference on Performance Engineering (ICPE '13), 2013, Pages 173-184, ACM New York, NY, USA
References
- [1] Marz N. (2011). Tutorial. Retrieved Nov 22, 2013, from https://github.com/nathanmarz/storm/wiki/Tutorial
- [2] Marz N. (2013). Running topologies on a production cluster. Retrieved Nov 22, 2013, from https://github.com/nathanmarz/storm/wiki/Running-topologies-on-a-production-cluster