Flume: Import Apache logs into Hadoop HDFS
Flume is an Apache Software Foundation project used to import streams of data into a centralized data store. In Hadoop environments, Flume is used to import data into Hadoop clusters from different data sources.
In this post I show how to use Flume to import Apache logs (access_log and error_log) into the Hadoop HDFS filesystem.
A Flume agent is composed of a set of sources, channels and sinks:
- sources are used to collect data/events from different data sources
- channels are the communication media used to temporarily store the events collected by the sources
- sinks asynchronously read the events from the channels and send them to a destination
Flume supports different types of sources, channels and sinks. The complete list of the sources, channels and sinks already implemented can be found in the documentation (https://flume.apache.org/FlumeUserGuide.html).
In my example, in order to import the Apache web server logs, I use the following Flume components:
- Exec source: It runs a Unix command and expects the process to produce data on standard output
- Memory channel: This channel implements an in-memory queue for the events
- HDFS sink: It writes the events to text files on HDFS
The following image shows the architecture of the Flume agent used in this example.
The Flume agent requires a configuration file defining the sources, channels and sinks used by the agent, along with their properties. The following text box shows the Flume configuration file that I used to import the Apache web server logs into Hadoop.
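A minimal sketch of such a configuration could look like the following; the agent name (agent1), the log paths under /var/log/httpd and the HDFS namenode URI are assumptions to adapt to your environment:

# Hypothetical agent "agent1": two exec sources, two memory channels, two HDFS sinks
agent1.sources  = access-source error-source
agent1.channels = access-channel error-channel
agent1.sinks    = access-sink error-sink

# Exec sources: tail the Apache log files (paths are placeholders)
agent1.sources.access-source.type = exec
agent1.sources.access-source.command = tail -F /var/log/httpd/access_log
agent1.sources.access-source.channels = access-channel

agent1.sources.error-source.type = exec
agent1.sources.error-source.command = tail -F /var/log/httpd/error_log
agent1.sources.error-source.channels = error-channel

# Memory channels: in-memory event queues
agent1.channels.access-channel.type = memory
agent1.channels.access-channel.capacity = 10000
agent1.channels.error-channel.type = memory
agent1.channels.error-channel.capacity = 10000

# HDFS sinks: write the events as plain text files on HDFS
agent1.sinks.access-sink.type = hdfs
agent1.sinks.access-sink.channel = access-channel
agent1.sinks.access-sink.hdfs.path = hdfs://namenode:8020/flume/apache/access
agent1.sinks.access-sink.hdfs.fileType = DataStream
agent1.sinks.access-sink.hdfs.writeFormat = Text

agent1.sinks.error-sink.type = hdfs
agent1.sinks.error-sink.channel = error-channel
agent1.sinks.error-sink.hdfs.path = hdfs://namenode:8020/flume/apache/error
agent1.sinks.error-sink.hdfs.fileType = DataStream
agent1.sinks.error-sink.hdfs.writeFormat = Text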
The import of the Apache logs can be started by running Flume with the configuration file as an argument and waiting for the Apache web server to start producing logs:
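Assuming the configuration sketched above is saved as conf/apache-logs.conf, the agent can be launched with the standard flume-ng command (the agent name and paths are the same assumptions as above):

flume-ng agent --conf ./conf --conf-file ./conf/apache-logs.conf --name agent1 -Dflume.root.logger=INFO,console

Once the agent is running, the exec sources keep tailing the log files and the HDFS sinks periodically roll new text files in the configured HDFS directories.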