Hadoop is just starting to come into mainstream consciousness. As a result, a lot of people are trying to understand the relationship between Hadoop and traditional data warehouses, and how ETL (Extract, Transform, Load) fits into the picture. On one of the “Big Data” forums on LinkedIn, someone asked the question below. Scroll down for my answer.
Enterprises are looking at Hadoop as an ETL processing engine that will feed unstructured data into an Enterprise Data Warehouse to do traditional BI (Business Intelligence). Are companies looking beyond this for more value-added uses of Hadoop?
I work with a few vendors of Hadoop-related technology; their end customers are indeed looking to use Hadoop as more than a mere ETL engine that feeds data into a data warehouse.
Instead, these customers are looking for ways to run some analytics directly on the raw data stored in Hadoop, so they can explore avenues of analysis that were not anticipated at all when the data warehouse was designed.
A (highly imperfect) analogy I sometimes use is that Hadoop is kind of like having a garage (or maybe even a garbage dump) of infinite size. You can throw everything in there, just in case you might need it one day. No need to clean up anything before you throw it in your infinite garage – just toss it in! This means, of course, that your garage will get messy and dirty quickly, and it will eventually get difficult to find what you need. You might even lose track of what’s in there. But at least you will have everything in case you need it later.
A data warehouse, in contrast, is like, well, a warehouse: finite size (due to the expense), with nicely ordered shelves, everything labeled, and the way things are stored is optimized for the most common tasks, so warehouse workers don’t waste a lot of time.
In my analogy, your ETL scenario would be “cleaning the garage”: going through your garage, finding all the things you need, dusting them off, making them orderly, and filing them neatly on shelves in your data warehouse.
All well and good, provided you actually know exactly what is in your very messy garage, which you probably don’t. What’s needed are some tools to help you take inventory of what is in your garage and conduct some basic analysis against it. That way, you’ll know what is actually worth moving into the data warehouse.
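To make “taking inventory” a bit more concrete, here is a minimal Python sketch. It is only an illustration under my own assumptions: the raw data is a handful of made-up text lines, some of which happen to be JSON, and `take_inventory` is a hypothetical helper, not part of Hadoop or any vendor’s product. It simply counts which fields show up in the mess and how many records can’t be read at all.

```python
import json
from collections import Counter


def take_inventory(raw_lines):
    """Count which fields appear across raw, uncleaned records.

    Returns (field_counts, unparseable), where unparseable is the number
    of lines that are not JSON objects.
    """
    field_counts = Counter()
    unparseable = 0
    for line in raw_lines:
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            unparseable += 1
            continue
        if isinstance(record, dict):
            field_counts.update(record.keys())
        else:
            # Valid JSON, but not an object with named fields.
            unparseable += 1
    return field_counts, unparseable


# Hypothetical sample of what might be sitting in the "garage".
raw_lines = [
    '{"user": "a", "page": "/home"}',
    '{"user": "b", "page": "/cart", "sku": "w-1"}',
    'not json at all',
    '{"sensor": 7, "temp": 21.5}',
]

field_counts, unparseable = take_inventory(raw_lines)
for field, count in field_counts.most_common():
    print(field, count)
print("unparseable lines:", unparseable)
```

A summary like this is roughly what you would want before deciding which fields deserve a shelf in the warehouse. On real data volumes the same counting would have to run as a distributed job rather than a single loop.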
Anyway, that’s the way I think about it, with my thinking formed by what some of my clients’ customers want to do. I’d appreciate some feedback on the analogy, btw…
1 comment
ETL is to big data as CRUD is to a database.
Definitely important procedures, and probably a very convenient short-hand for talking about the technology.
For me, I’d want to contrast the impacts of choosing an SQL versus a NoSQL technology, in terms of how that choice affects how useful the tool is in solving the customer’s problems.
You can abstractly compare scalability, in terms of the cost to store data, the speed of storing and retrieving data, and the processing of reports (or “batch” processes), to compare performance characteristics.
Ultimately, I think the ideal analogy is one that gives insight into how those characteristics would affect the usage of the tool.
For example, it is much faster to dump a truckload of stuff haphazardly into the “dump” than to put the things, one at a time, where they “belong” in the warehouse.
It is also much faster to find something in the warehouse, or to do an inventory audit of “how many widgets do you have?”, because you don’t have to exhaustively canvass the entire dump for each query. But with sufficiently large data, you may have to check multiple warehouses spread across town, versus one-stop shopping at the dump. Combine the dump with an old guy with a perfect memory and a one-eyed dog, who watches everything as it is placed there, and he can tell you off the top of his head how many widgets are there.
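Here is a toy Python sketch of that trade-off. The class names (`Dump`, `Warehouse`, `WatchedDump`) are made up for the analogy, and this is not how Hadoop or any real data warehouse is implemented. It only shows where the work happens: at query time for the plain dump, at write time for the warehouse, and at write time again for the dump with the old guy keeping a tally.

```python
from collections import Counter


class Dump:
    """Toss things in with no organization; counting means a full scan."""

    def __init__(self):
        self.items = []

    def add(self, item):
        self.items.append(item)

    def count(self, name):
        # Walk the entire pile for every query.
        return sum(1 for item in self.items if item == name)


class Warehouse:
    """Shelve each item by label on arrival; counting is a direct lookup."""

    def __init__(self):
        self.shelves = Counter()

    def add(self, item):
        # The organizing work is paid here, at write time.
        self.shelves[item] += 1

    def count(self, name):
        return self.shelves[name]


class WatchedDump(Dump):
    """A dump plus the old guy: he tallies items as they are thrown in."""

    def __init__(self):
        super().__init__()
        self.memory = Counter()

    def add(self, item):
        super().add(item)
        self.memory[item] += 1

    def count(self, name):
        # Answered from memory, without searching the pile.
        return self.memory[name]


truckload = ["widget", "gadget", "widget", "gizmo"]
for store in (Dump(), Warehouse(), WatchedDump()):
    for item in truckload:
        store.add(item)
    print(type(store).__name__, store.count("widget"))
```

All three give the same answer. In a toy like this the cost of shelving is invisible, but the structure is the point: the old guy is effectively a small index kept up to date as things arrive, while the pile itself stays unsorted.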
Does this help at all, or am I going off-topic?