As part of the DAPLAB Hacky Thursday, we jumped head first into Twill, RxJava and Twitter4j, all bundled together to build a fault-tolerant Twitter firehose ingestion application that stores the tweets in HDFS.
Twitter4j has been wrapped as an RxJava Observable object and is attached to an HDFS sink, which partitions the data by year/month/day/hour/minute. This will be useful for creating Hive tables later on, with proper partitions.
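To make the partitioning scheme concrete, here is a minimal sketch of how a timestamp could be mapped to a file under the sink's root directory. It is not the code from the repository: the PartitionPath class and its forInstant method are hypothetical names, and the use of UTC is an assumption. The layout itself (one directory per year, month, day and hour, one JSON file per minute) follows the listing shown further down.

```java
import java.time.Instant;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;

/** Hypothetical helper, not taken from the yarn-starter repository. */
final class PartitionPath {

    // One directory level per year/month/day/hour; the minute becomes the file name.
    // Formatting in UTC is an assumption made for this sketch.
    private static final DateTimeFormatter LAYOUT =
            DateTimeFormatter.ofPattern("yyyy/MM/dd/HH/mm").withZone(ZoneOffset.UTC);

    private PartitionPath() {}

    /** Returns the file that a tweet received at the given instant would be written to. */
    static String forInstant(String root, Instant receivedAt) {
        return root + "/" + LAYOUT.format(receivedAt) + ".json";
    }
}
```

Calling PartitionPath.forInstant("/tmp/twitter/firehose", Instant.parse("2015-04-24T07:58:30Z")) yields /tmp/twitter/firehose/2015/04/24/07/58.json. Because each path component is a single time unit, the directories should map onto Hive partition columns without further processing.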
Check it out
git clone https://github.com/daplab/yarn-starter.git
TwitterObservable.java (yeah, it’s in the TODO list :)). Please set them there before building.
mvn clean install
/tmp/twitter/firehose, monitor the ingestion process:
hdfs dfs -ls -R /tmp/twitter/firehose
...
-rw-r--r-- 3 yarn hdfs 7469136 2015-04-24 09:59 /tmp/twitter/firehose/2015/04/24/07/58.json
-rw-r--r-- 3 yarn hdfs 6958213 2015-04-24 10:00 /tmp/twitter/firehose/2015/04/24/07/59.json
drwxrwxrwx - yarn hdfs 0 2015-04-24 10:01 /tmp/twitter/firehose/2015/04/24/08
-rw-r--r-- 3 yarn hdfs 9444337 2015-04-24 10:01 /tmp/twitter/firehose/2015/04/24/08/00.json
...