Big Data & Storage

Big data discussion area

BD – Isilon Multi-protocol access & Hadoop

Isilon Multi-protocol access & Hadoop

Surely a scale-out NAS array has little to do with Hadoop?

Why not just use commodity servers stuffed full of disks for your Hadoop needs?

Well, as it happens, Isilon has a great offering in Enterprise-level Hadoop solutions. By “Enterprise-level Hadoop” I mean that once a business decides it needs to put Hadoop into production, it requires all of the usual business processes surrounding a mission-critical application. Typical Enterprise requirements are data protection, backup, snapshots, replication, high availability and security. With Isilon, not only do you get those Enterprise-level features, you also get native HDFS support, just as Isilon supports the SMB, NFS, FTP, NDMP and Swift protocols.

Well what does that actually mean when put into production use?

Most high-end NAS solutions offer simultaneous access to the same pool of data via, say, an NFS share and an SMB share.

[Screenshot: the same directory viewed as an SMB share in Windows Explorer (left) and listed with ‘ls’ over NFS (right)]

You can look at the share in Windows Explorer (left-hand image) or type ‘ls’ on the NFS mount (right-hand image) and you have access to the same data. You can read or write the data via any of the supported protocols, assuming of course you have the appropriate file permissions.

Isilon, being a mature, Enterprise level NAS device, also provides all of the typical file sharing protocols and features but adds more!

Isilon supports the native HDFS protocol in the same manner. Whether you look at the share in Windows Explorer or run ‘hdfs dfs -ls /’, it will look the same!

The screen grab below shows the output of ‘hdfs dfs -ls’ for the same directory as shown above in Explorer and over NFS.

[Screenshot: output of ‘hdfs dfs -ls’ for the same directory]

This means that you can simply upload your data into your Isilon-based Hadoop cluster using traditional IP-based protocols and then run Hadoop queries against it straight away. There is no need to move the data, no translation and no post-processing; it is immediately available via the other protocols. Most importantly, you don’t need the default triple replication of the data, thereby saving a huge amount of disk space. Obviously all the usual Hadoop ingest tools such as Sqoop, Flume and Hortonworks DataFlow (HDF) can also be used, but the traditional IP-based protocols such as SMB and NFS are well understood and have been used to share data for years.
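
As a minimal sketch of that round trip (hostnames, mount points and paths below are assumptions, and it assumes the access zone’s HDFS root is /ifs so that /ifs/data appears as /data over HDFS):

    # Copy a file onto the Isilon over NFS
    mkdir -p /mnt/isilon
    mount -t nfs isilon-sc.example.com:/ifs/data /mnt/isilon
    cp sales-2016.csv /mnt/isilon/landing/

    # The same file is immediately visible over HDFS; no copy into the cluster needed
    hdfs dfs -ls /data/landing/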

How can this make your life easier?

In this example workflow, you have a web server writing its logs to an NFS share. You want to run some Hadoop jobs against the data and then view the results from a Windows client. You could do this in the traditional manner, copying the data into HDFS and out again, but with Isilon it is far simpler.

[Diagram: web server logs written over NFS, Hadoop jobs run over HDFS, results viewed over SMB]

The diagram above logically shows the following workflow:

  • You write your Web server log files into an NFS share
  • You then run Hadoop queries directly against them over HDFS
  • The results of the Hadoop job are written directly into a directory via HDFS; that directory is also available as an SMB share, so a Windows user can view the results straight away.

There is no extra moving of data in or out of HDFS and no transferring of the results to another location; the data is available via any of the protocols as soon as it is written to the underlying OneFS file system. (OneFS is the operating system that runs on each node in the Isilon cluster, providing a single shared file system namespace across all nodes.)
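
A hedged sketch of that workflow from the command line (the example jar, share name and paths are illustrative, not part of the Isilon setup):

    # 1. The web server already writes its logs to the NFS-mounted directory,
    #    e.g. /mnt/isilon/weblogs, backed by /ifs/data/weblogs on the cluster.

    # 2. Run a Hadoop job directly against those files over HDFS.
    yarn jar hadoop-mapreduce-examples.jar wordcount /data/weblogs /data/weblogs-results

    # 3. The results are immediately browsable from a Windows client over SMB, e.g.
    #    \\isilon-sc.example.com\data\weblogs-results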

How does Isilon achieve this very useful functionality?

Each node in the Isilon cluster runs an HDFS daemon process that responds to HDFS protocol requests as both a NameNode and a DataNode. Those requests are “translated at wire speed”, just like any other supported IP protocol, into the associated actions and results on the Isilon OneFS POSIX file system. The diagram below shows a high-level view of what goes on for the well-known, standard IP-based protocols.

[Diagram: standard IP protocols arriving at protocol services on the Isilon, which translate requests onto the OneFS file system]

Each IP protocol talks to the Isilon via a service listening on its standard IP port; that protocol-specific service translates the commands and data into the appropriate actions on the underlying file system. Isilon has created an HDFS protocol translator service that responds to NameNode and DataNode requests on the default port 8020. Other HDFS access methods, such as WebHDFS over HTTP/HTTPS, use different port numbers.
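
A quick way to check the translator is answering (the SmartConnect name below is an assumption; 8020 is the usual HDFS RPC port):

    # Is anything listening on the HDFS port of the SmartConnect name?
    nc -zv isilon-sc.example.com 8020

    # Point an HDFS client directly at it with a fully qualified URI
    hdfs dfs -ls hdfs://isilon-sc.example.com:8020/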

The diagram below logically shows a 4-node Isilon cluster running the HDFS daemon on every node, with each node acting as both a NameNode and a DataNode for all of the data in the pool.

[Diagram: a 4-node Isilon cluster, with every node answering both NameNode and DataNode requests]

  1. The Hadoop worker node, running the standard HDFS Client code, connects to the NameNode to request the location of a block/file.
    • NOTE: There is no specific code or plugin required on any client as Isilon runs a fully HDFS compliant service. You just use the default HDFS software provided by an Apache Hadoop distribution.
  2. The Isilon NameNode service provides a compliant API response with the IP addresses of three DataNodes that have access to the requested data (on Isilon, all of the nodes in the pool or zone have access to all of the data).
    • Isilon also supports “rack awareness” to return the most appropriate nodes’ IP addresses.
  3. The compute worker node then connects to the Isilon DataNode service running on one of the nodes specified by the NameNode to request the data.
    • The selected Isilon node collates the data from the OneFS file system and returns it to the worker node.

In the above example, the NameNode listed in the core-site.xml file is the fully qualified domain name (FQDN) of the Isilon’s SmartConnect zone, which resolves to an IP address on the cluster.

SmartConnect is an Isilon software feature that load-balances incoming IP connections, spreading the client connections from the Hadoop worker nodes across the nodes in the Isilon cluster.
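
A minimal sketch of the compute-side configuration (the hostname is an assumption; the port should match your Isilon HDFS settings):

    <!-- core-site.xml on the Hadoop compute nodes: fs.defaultFS points at the
         SmartConnect zone name rather than at a dedicated NameNode host -->
    <property>
      <name>fs.defaultFS</name>
      <value>hdfs://isilon-sc.example.com:8020</value>
    </property>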

Using Isilon as the underlying HDFS file system for your Hadoop compute cluster means that in a 10-node Isilon cluster there are 10 NameNodes and 10 DataNodes supporting the Hadoop compute requirements. There is no need for Secondary NameNodes or HA NameNodes, as the primary service runs on every Isilon node. Isilon also does not require any tuning of the NameNode’s metadata memory allocation, because that function is built into the design of the Isilon node and the OneFS file system. Obviously this solution provides an extremely high level of NameNode resilience!

Isilon adheres to the HDFS protocol standards and is thoroughly tested against each Hadoop release. For example, EMC Isilon’s HDFS protocol is tested using the same 10,000 tests that Hortonworks runs for each of its new releases. It is backwards compatible, so you can run production on a stable version, then spin up a new version, read the same data and test it out before committing production to the new version of code.

After I explained the Isilon HDFS solution to a customer a little while ago, they suggested that a good way to describe it was that “Isilon provides a first class file system for HDFS”.

In summary, some of the benefits of using multi-protocol access on Isilon as your HDFS storage layer are as follows:

  • Multiple Protocol access to your data without any moving/copying of data
  • Multiple versions of Apache based Hadoop distributions supported
    • Different Hadoop distributions can have access to the same data (read only for simultaneous access). You can try out a distribution and then go back to your original supplier if it does not work out.
    • Different Versions of Hadoop can have access to the same data without copying it.
      • Note: There are a few distribution- and version-specific issues to be aware of, such as adding the different service users each distribution expects (ambari_qa, or the users Cloudera Manager creates), but fundamentally you can provide access to the data for different distributions/versions.
  • You can scale compute and storage independently. Need more capacity? Add another Isilon node. Need more compute? Add another worker node.
  • You don’t need to replicate the data three times to provide data protection. Isilon is far more efficient, using forward error correction (FEC) to protect the data, typically giving up to around 80% usable capacity from raw disk and saving rack space, power and cooling. For example, 100 TB of data needs roughly 300 TB of raw disk with triple replication, but only around 125 TB on Isilon at 80% efficiency.

There are a number of other major benefits from using Isilon as the HDFS data store. I will describe some of them in future posts.

For more immediate queries please see the EMC Isilon Big Data Community page.

 

BD – Building a test Hadoop Cluster

Building a test Hadoop Cluster

I have been working with Hortonworks recently and they wanted to see the installation process for a Hadoop cluster using Isilon as the storage layer, and any differences from a standard DAS-based install. I ran into numerous issues just getting the Linux hosts into a sensible state to install Hadoop, so I thought I would summarise some of the simple issues that you should try to resolve before starting to install Hadoop.

Initial starting point (Hortonworks instructions and Isilon-specific setup instructions)

  • I built a few VMs using CentOS 6.5 DVD 1
    • Selected a Basic Database Server as the install flavour
    • Choose a reasonably sized OS partition, as you might want to host a local Hadoop repository and that is a 10GB tar.gz download. You have to extract it as well, so over 20GB is needed to complete that process. I ended up resizing the VM a couple of times, so I would suggest at least 60GB for the Ambari server VM including the local repository area.
    • You might want to set up a simple script for copying files to all nodes, or running a command on all nodes, to save you logging into each node one at a time. Something as simple as ( for node in 1 2 3 4 5 6; do scp "$1" yourvmname$node:"$1"; done ) will save a lot of time.
    • Set up Networking and Name resolution for all the nodes and Isilon (use SmartConnect)
    • Enable NTP
    • Turn off or edit the iptables rules so the nodes have access to the various ports used by Hadoop
    • I needed to update the OpenSSL package, as the Hadoop install process otherwise fails quite a few steps along the way, and you may run into other issues if you restart the process. (# yum update openssl)
    • Disable transparent huge pages (edit the /boot/grub/grub.conf file and reboot)
    • Set up passwordless root SSH access from the Ambari server to the other compute nodes in the cluster (these preparation steps are sketched as commands after this list)
  • The only real changes during the Ambari-based install process occur during the initial setup, as below:
    • Add all the compute and master nodes into the install process and use the SSH key.
    • Go to the next page so that they all register and install the Ambari agent.
    • Then press the back button, add the Isilon FQDN to the list using manual registration (not SSH login) and then continue.
    • Later, during the service/node assignment process, place the NameNode and DataNode services on the Isilon only.
    • Then just follow the install process (change the repository to a local one if you set one up). I did, as my link to the remote repositories was limited to around 500k, so it took ages to install multiple nodes without the local option.
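
As a hedged summary, the node preparation above boils down to something like this on each CentOS 6.5 host (paths and package names match that release; hostnames are illustrative):

    # NTP
    chkconfig ntpd on && service ntpd start

    # Open up (or simply disable) the firewall while installing
    service iptables stop && chkconfig iptables off

    # Update OpenSSL so the Hadoop install does not fail part-way through
    yum -y update openssl

    # Disable transparent huge pages: add 'transparent_hugepage=never' to the
    # kernel line in /boot/grub/grub.conf, then reboot

    # From the Ambari server, set up passwordless root SSH to every node
    ssh-keygen -t rsa
    for node in 1 2 3 4 5 6; do ssh-copy-id root@yourvmname$node; done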

I now have two Hadoop clusters up and running, using Isilon as the HDFS store, so there is more to play with 🙂

BD-Hadoop Summit Dublin

Hadoop Summit Dublin

I am looking forward to attending the Hadoop Summit in Dublin on the 12th to 14th April. I attended the Summit in Brussels last year and really enjoyed the event. The evening visit to the auto museum matched my Jaguar and Big Data interests perfectly!

I will be there with a few of my fellow colleagues also focused on Big Data.

See you there!

BD – Hortonworks DataFlow

HDF Masterclass (NiFi)

I attended the Hortonworks DataFlow Masterclass on Thursday this week (18th). It is a very comprehensive, simple-to-configure piece of software that facilitates the rapid management of data in flight. HDF simplifies the handling of many types of incoming data flows in different protocols and formats: transforming the data, redirecting it, and sending it out as another type of data flow. You basically set up “processor” blocks and apply configurations to those blocks to handle the data in the way you want, then add further building blocks to build up the overall data flow that you require. It has many, many other features, far too detailed for this blog, that make it an excellent tool to investigate if you are having any issues getting your data into a “Data Lake”.

There is a very simple GUI with drag-and-drop features that makes it easy to build complex data flows quickly. For example, suppose you have an FTP repository that files are written to and you want those files written into HDFS. How do you do that? Now obviously, you can do the FTP-to-HDFS translation without any effort using Isilon, due to its multi-protocol capability, but what if you wanted to do some transformation of the data? What if you wanted to apply some metadata, maybe a time-stamp in the file name, then compress the files and write them to HDFS? You can of course script such a simple process, but you then have to support it going forward and fix the issues that occur should the data format change, locations change and so on.

With HDF, this data flow process would be created as follows:

  • You start with an input processor that watches the remote FTP repository and pulls files across via FTP or, most likely, SFTP.
  • A second, attribute-updating processor block could be used to rename the file to include the time-stamp and to add the required compression method as a metadata attribute.
  • The output is directed to a compression processor, which compresses the data using the algorithm specified as an attribute in the previous step.
  • The output of that processor is then directed to an HDFS output processor, which writes the file into the Hadoop file system.
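
A rough outline of that flow (the processor names are NiFi’s standard ones; the renaming expression is illustrative, not taken from the masterclass):

    GetSFTP                # poll the remote (S)FTP location for new files
      -> UpdateAttribute   # e.g. filename = ${filename}-${now():format('yyyyMMddHHmmss')}
      -> CompressContent   # compress the content with the chosen algorithm (e.g. gzip)
      -> PutHDFS           # write the compressed file into the target HDFS directory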

These processor blocks are dragged and dropped onto the page in the GUI and configured with a few mouse clicks. They are connected together by clicking on the source processor and dragging an arrow to the next processor. The above configuration would take just a few minutes to build. It can then be started with the stop/go buttons and tested. Each block shows ingress/egress statistics, and you also have very detailed data lineage visibility at every stage of the process. You can look at the actual data anywhere along the path and even replay data back through the flow should you find any errors or want to add additional processing into the path.

Some use cases are apparent straight away.

  • Sure, you can write scripts to do any kind of data ingest, transformation and output, but those are the boring tasks. Why not offload them to a very simple-to-use tool that can carry out the process without any scripting and with great visibility of what is going on? It has many built-in functions to handle the outlying error cases that will crop up over time. That lets your staff get back to the real work of generating business benefit from the data you have captured, rather than spending all their time just managing the data flow processes.
  • What if you use a BI tool that charges per TB ingested and, although you want to keep all of the data eventually, a lot of the incoming data is noise? Why not filter the noise off into a data lake and only pass the relevant data into your BI tool, thereby reducing the per-TB ingest costs dramatically.
  • …. many others

I could go on and on about this. A number of my colleagues have seen the product and have all been excited about the possibilities it raises. If you are involved with any kind of data management requiring the collation of data (hopefully you are if you have found these pages) then please take a look at HDF and see what you think. Of course, please feel free to feed back any comments below.

BD – The Haystack Analogy

Find the Needle in the Haystack

Finding the “needle in the haystack” is an often-used analogy in discussions of Big Data and analytics. The problem is that having more data (a bigger haystack) does not always mean you will find more needles; in fact, it may make it harder to find the data nuggets you are searching for. I came across some great discussions in the comments section of an article on The Register the other day. The story is “GCHQ mass spying will ‘cost lives in Britain,’ warns ex-NSA tech chief”.

Some of the analogies and references used in the comments included:

  • Setting fire to the haystack and using a magnet to pick up the needles
    • One problem was that the heat may damage the needles (corrupt your data)
    • The terrorists could switch to carbon fibre needles, so your magnet will not find them!
  • A far more distasteful analogy was that of a septic tank, where you are looking for a chunk of chocolate but, to find it, you have to sample everything. Yuck!

In summary, I guess the idea is to add as much relevant data as you can into your Data Lake, but try not to pollute it with pointless additional data that makes it harder to find what you are searching for. The challenge then is deciding what counts as relevant data!