A new framework to simplify interaction with YARN: Apache Twill
DAPLAB – Data Analysis and Processing Lab, http://daplab.ch (category: Twitter)
Published Fri, 04 Sep 2015 09:34:31 +0000 at http://daplab.ch/2015/09/04/fault-tolerant-twitter-firehose-ingestion-on-yarn/

The post A new framework to simplify interaction with YARN: Apache Twill appeared first on DAPLAB - Data Analysis and Processing Lab.

YARN, aka NextGen MapReduce, is awesome for building fault-tolerant distributed applications. But writing a plain YARN application is far from trivial and might even be a show-stopper for many engineers.
The good news is that a framework to simplify interaction with YARN has emerged and joined the Apache Software Foundation: Apache Twill. While still in the incubation phase, the project looks really promising: it lets you write Runnable applications, which are easier to test, and run them on YARN.

As part of the DAPLAB Hacky Thursday, we jumped head first into Twill, RxJava and Twitter4j, bundling all three together to build a fault-tolerant Twitter firehose ingestion application that stores the tweets in HDFS.

We used Twill version 0.5.0-incubating. Read more on Twill here, here and here.

Twitter4j has been wrapped as an RxJava Observable and attached to an HDFS sink, which partitions the data by year/month/day/hour/minute. This will be useful for creating Hive tables with proper partitions later on.
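
To make the partitioning scheme concrete, here is a minimal sketch of how a tweet timestamp could be mapped to a year/month/day/hour/minute file path. It is not taken from the yarn-starter sources: the class name PartitionPath and the method pathFor are hypothetical, and the use of UTC is an assumption (the directory listing further down shows files under hour 07 being written around 10:00 local time, which suggests, but does not prove, UTC partitions).

```java
import java.time.Instant;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;

public class PartitionPath {

    // Default output directory mentioned in the post.
    static final String ROOT = "/tmp/twitter/firehose";

    // One directory level per year/month/day/hour, one file per minute.
    // Assumption: partitions are computed in UTC.
    private static final DateTimeFormatter FORMAT =
            DateTimeFormatter.ofPattern("yyyy/MM/dd/HH/mm").withZone(ZoneOffset.UTC);

    // Hypothetical helper: returns the file a tweet with this timestamp lands in.
    public static String pathFor(Instant timestamp) {
        return ROOT + "/" + FORMAT.format(timestamp) + ".json";
    }

    public static void main(String[] args) {
        // prints /tmp/twitter/firehose/2015/04/24/07/58.json
        System.out.println(pathFor(Instant.parse("2015-04-24T07:58:30Z")));
    }
}
```

With a layout like this, each directory level maps naturally onto one Hive partition column.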

Check it out

The sources of the project are available on github: https://github.com/daplab/yarn-starter
git clone https://github.com/daplab/yarn-starter.git

Configure it

The Twitter keys and secrets are currently hardcoded in TwitterObservable.java (yeah, it’s on the TODO list :)). Please set them there before building.
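
One possible way to tackle that TODO would be to read the four OAuth values from the environment instead of the source file. The sketch below is only an illustration under stated assumptions: the class TwitterCredentials and the TWITTER_* variable names are hypothetical and do not exist in the project.

```java
import java.util.Map;

public class TwitterCredentials {

    public final String consumerKey;
    public final String consumerSecret;
    public final String accessToken;
    public final String accessTokenSecret;

    private TwitterCredentials(String consumerKey, String consumerSecret,
                               String accessToken, String accessTokenSecret) {
        this.consumerKey = consumerKey;
        this.consumerSecret = consumerSecret;
        this.accessToken = accessToken;
        this.accessTokenSecret = accessTokenSecret;
    }

    // Builds the credentials from a map of environment variables.
    // The variable names are hypothetical; pick whatever suits your deployment.
    public static TwitterCredentials fromEnv(Map<String, String> env) {
        return new TwitterCredentials(
                require(env, "TWITTER_CONSUMER_KEY"),
                require(env, "TWITTER_CONSUMER_SECRET"),
                require(env, "TWITTER_ACCESS_TOKEN"),
                require(env, "TWITTER_ACCESS_TOKEN_SECRET"));
    }

    // Fails fast with a clear message when a value is missing or empty.
    private static String require(Map<String, String> env, String name) {
        String value = env.get(name);
        if (value == null || value.isEmpty()) {
            throw new IllegalStateException("Missing environment variable: " + name);
        }
        return value;
    }

    public static void main(String[] args) {
        TwitterCredentials credentials = fromEnv(System.getenv());
        System.out.println("Loaded Twitter credentials from the environment");
    }
}
```

Passing the map in explicitly, rather than calling System.getenv() inside the class, keeps the loader easy to unit test.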

Build it

mvn clean install

Run it

Run it on the DAPLAB infrastructure like this:
./src/main/scripts/start-twitter-ingestion-app.sh daplab-wn-22.fri.lan:2181
By default, data is stored under /tmp/twitter/firehose. Monitor the ingestion process with:
hdfs dfs -ls -R /tmp/twitter/firehose
...
-rw-r--r--   3 yarn hdfs    7469136 2015-04-24 09:59 /tmp/twitter/firehose/2015/04/24/07/58.json
-rw-r--r--   3 yarn hdfs    6958213 2015-04-24 10:00 /tmp/twitter/firehose/2015/04/24/07/59.json
drwxrwxrwx   - yarn hdfs          0 2015-04-24 10:01 /tmp/twitter/firehose/2015/04/24/08
-rw-r--r--   3 yarn hdfs    9444337 2015-04-24 10:01 /tmp/twitter/firehose/2015/04/24/08/00.json
...
That’s it! Now you can kill the application and watch YARN restart it.
