BD – Hortonworks Data Flow

HDF Masterclass (NiFi)

I attended the Hortonworks Data Flow Masterclass on Thursday this week (18th). It is a very comprehensive, yet simple to configure, piece of software that facilitates the rapid management of data in flight. HDF simplifies the handling of many types of incoming data flows in different protocols and formats: transforming the data, redirecting it, and sending it out to another type of data flow. You basically set up “processor” blocks and apply configurations to those blocks to handle the data in the way you want. You then add further building blocks to build up the overall data flow that you require. It has many other features, far too detailed for this blog, that make it an excellent tool to investigate if you are having any issues getting your data into a “Data Lake”.

There is a very simple GUI with drag-and-drop features that makes it easy to build complex data flows quickly. For example, suppose you have an FTP repository location that files are written to and you want those files written to HDFS; how do you do that? Now obviously, you can do the FTP to HDFS translation without any effort using Isilon due to its multi-protocol capability, but what if you wanted to do some transformation of the data? What if you wanted to apply some metadata, maybe a timestamp in the file name, then compress the files and write them to HDFS? You can of course script such a simple process, but then you have to support it going forward and fix the issues that occur should the data format change, locations change, and so on.

With HDF, this data flow process would be created as follows:

  • You start with an input processor that watches the remote FTP repository and pulls files to a local location via FTP or, more likely, SFTP.
  • A second, attribute-update processor block could be used to rename the file to include the timestamp and to add the required compression method as a metadata attribute.
  • The output would be directed to a compression processor, which compresses the data using the compression algorithm specified in the previous step.
  • The output of that processor would then be directed to an HDFS output processor, writing the file into a Hadoop file system.
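For comparison, the middle of that flow (the timestamp rename and the compression) can be sketched as plain Python; this is a minimal illustration of what those two processors replace, not HDF itself, and the SFTP fetch and HDFS write at either end are deliberately left out:

```python
import gzip
import shutil
from datetime import datetime
from pathlib import Path

def transform(in_path: Path, out_dir: Path) -> Path:
    """Rename a fetched file to include a timestamp, then gzip it.

    Mirrors the attribute-update and compression steps in the flow
    described above; the SFTP fetch and HDFS write are assumed to
    happen before and after this step.
    """
    stamp = datetime.now().strftime("%Y%m%d-%H%M%S")
    out_path = out_dir / f"{in_path.stem}-{stamp}{in_path.suffix}.gz"
    with in_path.open("rb") as src, gzip.open(out_path, "wb") as dst:
        shutil.copyfileobj(src, dst)  # stream, no full read into memory
    return out_path
```

Even this small script hints at the support burden the post mentions: error handling, retries and monitoring are all yours to write, whereas in HDF they come with the processors.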

These processor blocks are dragged and dropped onto the page in the GUI and configured with a few mouse clicks. They are connected together by clicking on the source processor and dragging an arrow to the next processor. The above configuration would take just a few minutes to build. It is then started using stop/go buttons and can be tested. Each block shows ingress/egress statistics, and you also have very detailed data lineage visibility at every stage of the process. You can look at the actual data anywhere along the path and even replay it back through the data flow should you find any errors or want to add additional processing into the path.

Some use cases are apparent straight away.

  • Sure, you can write scripts to do any kind of data ingest, transform and output, but those are the boring tasks. Why not offload them to a very simple-to-use tool that can carry out the process without any scripting and with great visibility of what is going on? It has many inbuilt functions that make it simple to handle the outlying error cases that will crop up over time. That allows your staff to get back to the real work of generating business benefit from the data you have captured, rather than spending all their time just managing the data flow processes.
  • What if you use a BI tool that charges per TB ingested, and although you want to keep all of the data eventually, a lot of the incoming data is noise? Why not filter the noise off into a data lake and only pass the relevant data into your BI tool, thereby reducing the per-TB ingest costs dramatically?
  • …. many others
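The filtering idea in the second bullet is essentially a routing decision per record, which is what NiFi's routing processors do for you. A minimal Python sketch of the same split (the record contents and the "noise" condition here are made up purely for illustration):

```python
def route(records, is_noise):
    """Split an incoming record stream into two flows.

    Records judged to be noise go to the (cheap) data lake;
    everything else goes on to the (per-TB billed) BI tool.
    The noise test is supplied by the caller.
    """
    lake, bi = [], []
    for rec in records:
        (lake if is_noise(rec) else bi).append(rec)
    return lake, bi

# Example: treat DEBUG-level log lines as noise.
lake, bi = route(
    ["DEBUG heartbeat", "ERROR disk full", "DEBUG heartbeat"],
    lambda rec: rec.startswith("DEBUG"),
)
# lake keeps the two DEBUG lines; only the ERROR line goes to the BI tool.
```

In HDF the same split is a single routing processor dropped into the flow, with both outputs visible and replayable in the GUI.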

I could go on and on about this. A number of my colleagues have seen the product and have all been excited about the possibilities it raises. If you are involved with any kind of data management requiring the collation of data (hopefully you are if you have found these pages) then please take a look at HDF and see what you think. Of course, please feel free to feed back any comments below.
