Flume is a great tool by apache that can move large amounts of data, from multiple sources, to a single data store like Hadoop. In BigSQL, we are using Flume to move the log4j files from the benchmark program to HDFS. This process also uses the Hive SerDe (or serialization and deserialization) properties, as shown in the BigSQL Tutuorial, which allow you to access the date from a Hive table in a readable form. For this tutorial, I will show you how to create a Flume agent that adds Twitter JSON data to HDFS. To start, install BigSQL and complete all of the pre-requisites.
The next step is to define the Flume agent and its three parts: sink, channel, and source. Usually, you would create a custom source to retrieve the twitter data; this tutorial shows a much simpler example in which you use a spool directory. Any file with a unique file name that is added to the spool directory will be picked up by Flume .
The next part of the Flume agent is the channel, which connects the Flume source and sink. For this example, we will use a file channel, which makes a temporary file in the named directory that holds the data once it has been picked up by the source.
Finally, we will need to tell Flume what the sink is. We are using a HDFS sink so creating the Flume agent is as simple as providing the HDFS path. The following flume-conf.properties file shows how to configure the source, channel and sink, along with other options:
# Initialize agent's source, channel and sink agent.sources = TwitterExampleDir agent.channels = memoryChannel agent.sinks = flumeHDFS # Setting the source to spool directory where the file exists agent.sources.TwitterExampleDir.type = spooldir agent.sources.TwitterExampleDir.spoolDir = TwitterExample/ # Setting the channel to memory agent.channels.memoryChannel.type = memory # Max number of events stored in the memory channel agent.channels.memoryChannel.capacity = 100000 agent.channels.memoryChannel.transactioncapacity = 10000 # Setting the sink to HDFS agent.sinks.flumeHDFS.type = hdfs agent.sinks.flumeHDFS.hdfs.path = hdfs://localhost:9000/user/data/twitter agent.sinks.flumeHDFS.hdfs.fileType = DataStream # Write format can be text or writable agent.sinks.flumeHDFS.hdfs.writeFormat = Text # use a single csv file at a time agent.sinks.flumeHDFS.hdfs.maxOpenFiles = 1 # rollover file based on maximum size of 10 MB agent.sinks.flumeHDFS.hdfs.rollSize = 10485760 # never rollover based on the number of events agent.sinks.flumeHDFS.hdfs.rollCount = 0 # rollover file based on max time of 1 min agent.sinks.flumeHDFS.hdfs.rollInterval = 60 # Connect source and sink with channel agent.sources.TwitterExampleDir.channels = memoryChannel agent.sinks.flumeHDFS.channel = memoryChannel
Next, you will need to create the spool directory, make a directory in BigSQL/Flume/ called TwitterExample. For the data that Flume will be moving around, I used simplified sample data that one could get from a Twitter streaming API. The JSON data in Twitter_data.txt is a collection of name and value pairs written in the following format:
Next, you will need to create the Hive table. The hive table definition below uses the JsonSerde SerDe2 properties and corresponds to the twitter data. In order to create this table, you will need to download hive-json-serde.jar and follow the instructions on how to add the jar to your class path. Once you have done this, you will be able to create the table:
hive > CREATE EXTERNAL TABLE IF NOT EXISTS twitter ( user_mentions_screen_name string, mentions_name string, mentions_id int, mentions_url string, hashtags string, in_reply_to_screen_name string, text string, id_str string, place string, in_reply_to_status_id int, contributors string, retweet_count int, favorite string, truncated string, source string, in_reply_to_status_id_str string, created_at string, in_reply_to_user_id_str string, in_reply_to_user_id int, lang string, profile_background_image_url string, id_str2 string, default_profile_image string, statuses_count int, profile_link_color string, favourites_count int, profile_image_url_https string, follow_ing string, profile_background_color string, description string, notifications string, profile_background_tile string, time_zone string, profile_sidebar_fill_color string, listed_count int, contributors_enabled string, geo_enabled string, created_at2 string, screen_name string, follow_request_sent string, profile_sidebar_border_color string, protected string, url string, default_profile string, name string, is_translator string, show_all_inline_media string, verified string, profile_use_background_image string, followers_count int, profile_image_url string, id int, profile_background_image_url_https string, utc_offset string, friends_count int, profile_text_color string, location string, retweeted string, id2 int, coordinates string, geo string)
ROW FORMAT SERDE ‘org.apache.hadoop.hive.contrib.serde2.JsonSerde’ LOCATION ‘/user/data/twitter’;
Once your hive table is defined, you can start the Flume agent. Make sure that you have replaced the flume-conf.properties file in bigSQL/flum/conf with the file at the begining of this tutuorial. Then go to the BigSQL/examples/test/ directory and execute the following command:
This will start the Flume agent. Just add the Twitter_data.text file to the TwitterExample directory and its file name will change to “Twitter_data.txt.COMPLETED”. In the twitter directory on HDFS, you will see a file that starts with the term “flume data” after 60 seconds that file will be changed from a .temp to a permanent file in the HDFS, and you will be able to access the data from hive.
hive > SELECT name, text FROM twitter; name text Ćhríš Super hot weather i swear im migrating to austrailia in 3 more years Mika Labrague RT @mikalabrags: Bipolar weather ☁ ☀ JXSXLXN Loving the weather for tomorrow! Chris Highton Surely June is a summer month?! So why is the weather so bad! EmmaNorthcott Woken up to rain and strong winds. What weather are you waking up too? http://t.co/ONuNC8nP Pozisa Koyana Noooooo,Cape Town weather! nxa 100% chance of rain?? Iryt ? I dnt think so ☂☹ lyds☺ Competing in this weather will be horrendous Jamie-Leigh But seriously tho, why did this arctic weather pick today to make an appearance?! Time taken: 9.747 seconds, Fetched: 9 row(s) hive > SELECT SUM(followers_count) FROM twitter; _c0 4515 Time taken: 10.066 seconds, Fetched: 1 row(s)