(On request from a reader, this is a reprise of a blog post I did years ago that was lost in the coming and going of corporate blog content. It is a bit dated, but I believe it is mostly still relevant, so I am reposting it here.)
Lots of discussion online about MapReduce these days. Stephen Swoyer's piece is quite a good one.
The confusion about what MapReduce is and isn't about actually goes back a ways.
The arguments within computerdom tend to be fairly religious in tone. That's because people really like computing and tend to be passionate about it, and when people think they've built something new, their ego interest is in showing how new and innovative it is, while the ego interest of the creators of prior-generation innovations is in casting it as old news.
I am biased, but since I am not working in scalable computing at the moment, I've had a bit of time to develop some perspective here that I'd like to share.
Scalable computing via parallel processing is big business these days. There's heavy data lifting at the transaction processing level, at the data analysis level, in web search indexing, enterprise search indexing, scientific computing, data mining for target marketing, and affinity advertising. There are many places where processing really large amounts of data has commercial benefits, and that means parallel processing adds lots of value.
At the smaller end of the parallel processing scale there's actually a crisis of sorts - we need parallelism to make use of all these multi-core CPUs that the chip makers can now create easily, yet there is no mainstream, accepted programming idiom that produces multi-threaded programs and is also easy to use.
MapReduce was designed initially for the other end of the spectrum - huge data, Internet-scale data, much bigger than the enterprise data of even the largest enterprise. This is where you clearly want to use lots of computers all at once. Advocates will tell you it's also great for that downward scale to multi-core CPUs - and I concur. But that wasn't its first intention.
The other big scalable computing systems of commercial import are the relational database engines and the ETL tools that transform data, and sometimes also support data mining. Common to both: commercial data, commercial data types, lots of structured data in the mix, with of course gradually improving capabilities to handle unstructured data. Databases are of course data storage tools, whereas ETL systems are algorithmic systems. For example, in a database, when you talk about how to break up data into partitions you are talking about how it is stored. In the ETL world, partitioning data is about how it is moving, and incidentally it can be stored that way also. That latter view is very close to the way MapReduce folks talk about data, even though I don't think ETL folks and MapReduce folks drink in the same bars.
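To make that "partitioning as movement" view concrete, here's a minimal Python sketch of routing records to parallel workers by hashing a key. The field names and routing scheme are purely illustrative, not any particular product's behavior:

```python
from collections import defaultdict

def hash_partition(records, key, num_partitions):
    """Route each record to a partition by hashing its key field."""
    partitions = defaultdict(list)
    for record in records:
        partitions[hash(record[key]) % num_partitions].append(record)
    return partitions

# Each partition could be handed to a separate worker; the data is partitioned
# as it moves, not as it sits in storage.
records = [
    {"customer_id": 101, "amount": 25.0},
    {"customer_id": 202, "amount": 40.0},
    {"customer_id": 101, "amount": 12.5},
]
for pid, part in hash_partition(records, "customer_id", 2).items():
    print(pid, part)
```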
What all three of these kinds of systems, MapReduce included, have in common is that they are all, to some degree, about scalable flow-based programming (FBP for short). Not everyone agrees on the term flow-based programming, preferring dataflow, streaming, flow-graph, or some other term. RDBMS systems have query graphs at their query processing cores - they transform SQL into these. A query graph is an FBP. ETL tools let you draw these graphs (and sometimes script them too). They allow the graphs to have multiple outputs, unlike a query graph, but otherwise they are fairly similar. If you use Apache Pig (so NOT a commercial software system!) to create MapReduce programs, it cascades the map-reduce operations together into a flow graph. Pig was created because so many map-reduce applications found themselves wanting to connect one map-reduce back-to-back into another that someone wisely observed that this flow composition was FBP, and that it was really part of the idiom.
The primary source of parallelism in all of the above is called partitioning. Basically it is the principle that if you have a large piece of data, or a large collection of data items, you can divide-and-conquer it, and then put the pieces back together (if necessary). The divided processing is the "map" part, and the putting back together is the "reduce" part. You might map and map and map and then reduce, or alternate. Do this kind of thing over and over as you process the data, flowing the output of one such operation to the input of another.
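Here's a minimal, single-process Python sketch of that idiom, just to fix the shape of it. The word-count example and helper names are illustrative, not any framework's API; in a real system the per-partition map calls would run on different machines:

```python
from functools import reduce
from collections import Counter

def partition(items, n):
    """Divide the input into n roughly equal chunks."""
    return [items[i::n] for i in range(n)]

def map_count(lines):
    """Map step: count words within one partition, independently of the others."""
    counts = Counter()
    for line in lines:
        counts.update(line.split())
    return counts

def reduce_counts(a, b):
    """Reduce step: merge two partial results back together."""
    return a + b

lines = ["the quick brown fox", "the lazy dog", "the quick dog"]
partial = [map_count(chunk) for chunk in partition(lines, 3)]  # these could run in parallel
total = reduce(reduce_counts, partial, Counter())
print(total)
```

The output of a stage like this can itself flow into another map or reduce, which is exactly the back-to-back composition that Pig makes explicit.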
So here's one major point of total agreement: across many different areas of computerdom, people agree that flow-based programming is a very good way to go if you want parallelism. I think people would also say the idiom is pretty easy to learn, adequately expressive for many problems, and so forth.
That's very important common ground. If you are a software developer and can say you've done object-oriented programming and you've used threads or actor libraries or some such, you may feel relatively accomplished at your trade. If you are not directly familiar with this FBP idiom it is something you should learn about, because you will likely find it very valuable, and it will change the way you think about solving problems.
FBP principles are where the similarities end among these kinds of data processing systems, though. There are huge differences in what these systems are designed to be used for.
Now, advocates of each seem to have some "hammer vs. nail" mentality - by that I mean if you have a big enough hammer, *everything* starts to look like a nail. Example: extending an RDBMS to do what it does, but also handle spatial data, or blobs of image data, or real-time streaming input data, or transactions from a queue, or sensor data feeds - take your pick. Extending the RDBMS hammer to hit these kinds of nails is a very attractive option to those familiar with RDBMS technology, particularly if they have access to the source code for one. Same for ETL tools. I was chief architect for a major commercial ETL processing engine, and we tried to extend it in all kinds of stretch directions. MapReduce advocates would look at these same problems and recognize the maps and reduces that are being cascaded to form the solutions to them.
So, what separates these technologies?
I claim the first thing is maximum scale. But what is a big enough scale to start telling these apart?
OK, here's a problem suitable only for MapReduce today: let's play Where's Waldo and find every video frame containing my face, in every CCTV surveillance video captured anywhere on earth over the last year. You have 60 seconds to find the answer.
That is the kind of problem map-reduce can scale up to. The data runs to petabytes or exabytes. Now, the actual Where's Waldo search might be very cleverly indexed, so I probably do not have to stream all those exabytes of video frame by frame to look for my face. But at some point every frame was processed to create whatever index answers this problem. That processing scale is truly vast.
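To give a feel for the shape of that index-building job, here's a hedged sketch of it in map/reduce terms. Everything in it is hypothetical - the face detector, the frame source, how the index is stored - the point is only that the map runs over every frame ever captured and the reduce groups the results into something a 60-second query can hit directly:

```python
from collections import defaultdict

def detect_faces(frame_pixels):
    """Hypothetical face detector, stubbed so the sketch stays self-contained.
    A real one would return face signatures/embeddings found in the frame."""
    return []

def map_frame(video_id, frame_no, frame_pixels):
    """Map step: run detection on one frame, emit (face_signature, location) pairs."""
    for signature in detect_faces(frame_pixels):
        yield signature, (video_id, frame_no)

def reduce_index(pairs):
    """Reduce step: group every location by face signature to build the index."""
    index = defaultdict(list)
    for signature, location in pairs:
        index[signature].append(location)
    return index

# In a real deployment the map would run over exabytes of frames on thousands of
# machines; here it is only the shape of the computation.
```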
None of the other technologies were built with anything like this scale in mind. Yes, the FBP principles of operation are held in common. The distinction is in what the frameworks have actually been built to do.
What about ETL or RDBMS technology? What's suited to them that crisply distinguishes them from MapReduce frameworks? Answer: processing structured data records repeatedly. I'll characterize what I mean by this, but first let me talk about some of the design limitations of RDBMS and ETL technologies.
First, the problems are ones at a more traditional enterprise scale. These days that's multiple terabytes of information, maybe even 100 terabytes nowadays. One assumption common to these systems is that the whole computing cluster across which the work is spread is up and working. Failure of any part of the cluster is expected to stop the processing. There are various checkpoint/recovery schemes, which are pretty much the cutting edge of these systems these days, but they always assume a failure is a relatively rare event: most executions will complete without failure.
There's a thing I call a time/complexity trade-off at work here. Which would you rather have? A system that is more complex and slower, taking twice as long to compute something and thereby doubling the likelihood of encountering a failure during a run? Or would you prefer a faster system without the complexity? Computers are fairly reliable these days, so at the scale of a cluster small enough that I can reasonably expect all of its machines to be running properly for hours at a time... well, I know my own preference here, and that of many "enterprise" customers matches my own - simple wins.
MapReduce makes no such assumptions about failures. The implemented scale is assumed to include computers that will fail during the execution of a flow, and the computation must continue and mask the failure.
So here's a real and concrete question: does your processing problem have a vastness to it that means the compute cluster will have failures in it most of the time, or is the processing scale such that the cluster is very unlikely to have a failure that interrupts processing? If the former, MapReduce is the only viable technology today. If the latter, there are still reasons you might want to use a MapReduce-like approach.
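Here's a back-of-the-envelope way to answer that question for your own situation. The numbers below are illustrative assumptions, not measurements, and the model (independent failures at a constant rate) is deliberately crude:

```python
def p_run_completes(nodes, hours, mtbf_hours):
    """Probability that no node fails during the run, assuming independent failures
    at a constant rate (failure probability per node-hour = 1 / mtbf_hours)."""
    p_node_ok = (1 - 1.0 / mtbf_hours) ** hours
    return p_node_ok ** nodes

# A modest "enterprise" cluster: failure during a run is rare, so simple wins.
print(p_run_completes(nodes=20, hours=4, mtbf_hours=50_000))       # ~0.998
# An Internet-scale cluster on a long job: some failure is close to certain,
# so the framework has to mask failures and keep going.
print(p_run_completes(nodes=10_000, hours=24, mtbf_hours=50_000))  # ~0.008
```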
A big difference between MapReduce systems I've looked at and the enterprise RDBMS and ETL world is efficiency of data access vs. complexity.
RDBMS and ETL are optimized for handling business-oriented data. Quite literally, I worked on engineering an ETL engine so that when a processing component accessed an integer from a data record, the C++-based implementation would generate code that compiled into a base+offset indirection - really our goal, and we got very close, was for this to be a single CPU instruction. The assumption here: you are accessing billions of rows of well-behaved small data that has been imported into a well-behaved representation, bringing it in once from whatever it was before you started processing it. By the way, getting this super-low-level efficiency introduces lots of complexity. Many would rightfully say this was "wicked over-engineered" - mea culpa. Databases go even further, applying cost-based dynamic-programming optimizers and moving things around in the flows to reduce the work done. Very cool stuff. Actually, one of the first applications of ETL tools was a backlash against the unpredictability of SQL database optimizers. People wanted to draw a flow and have it run that way, dammit! That was 1995. The optimizers have improved, but ETL is important for other, more fundamental capabilities like superior expressive power. ETL tools can produce multiple outputs from a single input in one pass. SQL is a query language and is incapable of expressing this directly, even though the underlying engines could easily implement it.
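To be concrete about that last point, here's a minimal sketch of "multiple outputs from a single input in one pass" - one scan of the input feeds several result sets at once, which a single SQL query cannot express directly. The field names and routing predicates are made up for illustration:

```python
def split_stream(records, routes):
    """One pass over the input; each record is appended to every output whose
    predicate matches, so several result sets come from a single scan."""
    outputs = {name: [] for name in routes}
    for record in records:
        for name, predicate in routes.items():
            if predicate(record):
                outputs[name].append(record)
    return outputs

orders = [
    {"id": 1, "amount": 950, "country": "US"},
    {"id": 2, "amount": 15, "country": "DE"},
    {"id": 3, "amount": 220, "country": "US"},
]
outs = split_stream(orders, {
    "large_orders": lambda r: r["amount"] > 500,
    "us_orders": lambda r: r["country"] == "US",
})
print(outs["large_orders"], outs["us_orders"])
```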
Anyway, all this low-level data access optimization and high-level flow-graph optimization is just not considered particularly important by the MapReduce advocates. You process data in the form it was found in, which is assumed to be an HTML document, an XML data file, or perhaps JSON - perhaps other forms that you happen to have some code to handle. Really it's a text-intensive world. These are all very verbose textual formats full of tag syntax - verbose by comparison with carefully laid-out binary data records, I mean. It takes thousands of machine instructions to extract a piece of data like an integer from such representations, because you have to find it first (parse and/or search for the tag), then convert the text to an integer.
But there is absolutely nothing wrong with that overhead. Because accessing small numeric fields is just not what the MapReduce world is about. Consider this: in RDBMS and ETL worlds, if you access a string, there's low overhead to get to the start of the string, and then you're processing the string data using the usual libraries available in the programming language of your implementation.
In MapReduce, there's high overhead to get to the start of the string you care about, but then you're processing the string data using the usual libraries available in the programming language of your implementation.
Get my point? The only place where the RDBMS/ETL world has an advantage is if the strings are really small and must be repeatedly accessed. In many, many very interesting problems the strings are not at all small, and many problems are "one-pass" over the data by each algorithm.
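To put a little code behind that comparison, here's a small illustrative sketch of the two access paths. The record layout and field names are made up; the contrast is the point - a fixed binary layout lets you jump straight to a field, while a textual format makes you parse the whole record first:

```python
import json
import struct

# Fixed binary layout: a 4-byte int at offset 0, an 8-byte float at offset 4.
RECORD = struct.Struct("<id")
binary_record = RECORD.pack(42, 3.14)

# Cheap: jump straight to the field (base + offset indirection).
(customer_id,) = struct.unpack_from("<i", binary_record, 0)

# The "same" record as text.
text_record = '{"customer_id": 42, "amount": 3.14}'

# Expensive by comparison: parse the whole record, then convert text to int.
customer_id_from_text = json.loads(text_record)["customer_id"]

print(customer_id, customer_id_from_text)
```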
So the conclusion should not be surprising: it's "horses for courses" - none of these technologies is superior to the others. They all share flow-based programming principles - the jury is back, flow-based programming is a GOOD IDEA whose time has come, and it is being used all over the place. Beyond that, what kind of processing are you doing? What kind of data is at its core, and at what scale?
Tags: BI, Business Intelligence, data integration, data warehouse, MapReduce, Scalable Computing, Parallel Processing, Parallelism, multi-core