How Map Reduce Implementations work
The input files are located on a distributed file system. In case of Hadoop and many others this is the HDFS, Google calls it the Google File System (GFS).
Worker processes can take over either mapper tasks or reducer tasks.
The mapper processed access pieces of the data located on the HDFS, apply the mapping function and write the result to their local disks.
Each time a mapper process is ready with a particular reduce key (as you remember, output of the map phase are key/value pairs), it informs the master process that this key is ready.
As you can see from below illustration, the master process plays an important role. It assigns the keys which are finished to different reducer processes.
Please notice, that each process handling a task can take over either a mapper or a reduce task. They are not specialized.
When a reducer process is informed by the master process that a particular reduce key is available, it is at the same time receiving the information from which mapper (and its local disk) the data can be picked up.
Be aware that a reducer task needs to know when it can start to work on a particular reduce key i.e. all mapper processes have delivered data for this particular key.
Again, the master process plays an important role as it keeps record and knows if all mapper tasks delivered data for the particular key (which is prerequisite for the reduce task to start its work on this key).
One very important advantage of the Map Reduce paradigm and its real-world implementations against traditional database systems is fault tolerance.
Consider a traditional database system like Teradata.
Although many features are built into a Teradata System to increase fault tolerance, it is hard to imagine a Teradata System scaling to hundred of thousands of processes.
Statistical probability ensures that failure is almost guaranteed in the long run.
Let's do a simple example:
If the probability for each process to fail is only 1%, the probability for one process out of hundreds of thousands to fail is around 100% . Basically this means, failure is guaranteed with such a huge system we may see more and more in the Big Data arena.
Fault tolerance is a huge topic for Map Reduce implementations.
In principle, the same statistical rules like mentioned above, apply for Map Reduce. Imagine a Map Reduce task spawning hundreds of thousands of mapper and reducer processes. Such a task can run for hours or even days on really huge amounts of data applying complex calculations like pattern recognition.
The always exists a master process which keeps communicating regulary with all other processes (mapper or reducer) and records the key ranges which have been sucessfully processed by each mapper or reducer task.
If a mapper fails, the master task just reassigns the complete key range to another mapper which accesses the HDFS and repeats work on this key range.
In case a reducer fails, the same logic from above applies and a new reducer is assigned to this task. Despite, already written keys to the output do not have to be reprocessed.
As you can see, the master process keeps track about all activities in the Map Reduce task and handles failures gracefully even for systems with hundreds of thousands of processes.
Further, even not real failures, the master process detects slow mappers or reducers and takes away the key range they are working upon, assigning it to a new process.
One might assume that the master process is a single point of failure. This is indeed the case, and if the master process fails, the complete Map Reduce tasks fails.
But again, statistical probabilities have to be considered:
Even if the chance that any of the thousands of processes fails is almost 100%, the risk that exactly this particular master process fails is very small. Exactly this property makes Map Reduce implementations very fault tolerant and more suitable for very large scale tasks than traditional database impementations.