One concept that is quite popular today is Map Reduce. There is a perception that every software problem can be solved with Map Reduce; people have practically built a cult around it. Ok..ok..jokes apart, Map Reduce is an important concept to understand when you are dealing with large-scale data processing. In this post we will look at Map Reduce in simple terms and try to unravel the mystery behind it.
Map Reduce is made up of two words: Map and Reduce. Let's take each term and see what it means. A map is a collection of key-value pairs, and in the Map Reduce world the Map step basically means building such a map. This is the first step. The second word is Reduce, which means to shrink: we merge the many maps produced in the Map step so that for each key only one entry remains. With that in mind, let's look at an example.
Let's say we have 3 files, and each file contains a set of words:
1.txt: At, Aroma, Bow, Cat
2.txt: Bye, Cap, Be, Am
3.txt: Cow, Big, Cup, Came
Let's say we want to find out how many words have 1 letter, how many have 2, how many have 3, and so on.
The three files are sitting on different servers, and assume that they are huge (if that is hard to visualise, think of each file containing 10 million+ words). Because of their large size (did I say Big Data?), we cannot bring them to a common place for processing. This is another paradigm you need to get comfortable with: in Map Reduce, we send the processing logic to the data. Most of the time in enterprise/web applications we load data from a database or file system into memory and process it there. In Map Reduce we do it the other way round: since the data is huge, we send the processing logic to the data storage, in this case to the servers where the files reside.
Now let's build our Map, the first step in Map Reduce.
So when 1.txt is processed, the result might look like this:
2 Letter words - 1
5 Letter words - 1
3 Letter words - 2
So we now have key-value pairs; this is our Map. Pretty straightforward. Let's do this for each file:
1.txt: 2 -> 1, 3 -> 2, 5 -> 1
2.txt: 2 -> 2, 3 -> 2
3.txt: 3 -> 3, 4 -> 1
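To make this concrete, here is a rough sketch of the Map step in plain Python. This is just an illustration of the idea, not the API of Hadoop or any particular framework, and the helper name map_word_lengths is made up for this example:

from collections import defaultdict

def map_word_lengths(path):
    # Map step: read one file and build {word_length: count} for that file.
    counts = defaultdict(int)
    with open(path) as f:
        for line in f:
            for word in line.split():
                counts[len(word)] += 1
    return dict(counts)

# Each of these calls would run on the server that holds the file,
# since we send the processing logic to the data.
per_file_maps = [map_word_lengths(p) for p in ["1.txt", "2.txt", "3.txt"]]
# The map for 1.txt above would come out as {2: 1, 5: 1, 3: 2}.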
Now let's do the Reduce step. In the Reduce step we take the maps produced by the Map step and merge them:
2 Letter Words -> (1 from 1.txt) + (2 from 2.txt) = 3
3 Letter Words -> (2 from 1.txt) + (2 from 2.txt) + (3 from 3.txt) = 7
4 Letter Words -> (1 from 3.txt) = 1
5 Letter Words -> (1 from 1.txt) = 1
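Continuing the same sketch, the Reduce step simply merges the per-file maps so that each key ends up with a single total (again an illustration, not a specific framework's API):

from collections import defaultdict

def reduce_counts(per_file_maps):
    # Reduce step: merge the per-file maps so that each key appears only once.
    merged = defaultdict(int)
    for file_map in per_file_maps:
        for key, count in file_map.items():
            merged[key] += count
    return dict(merged)

# per_file_maps is the list built in the Map sketch above.
totals = reduce_counts(per_file_maps)
# totals: 2 -> 3, 3 -> 7, 4 -> 1, 5 -> 1, matching the sums worked out above.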
Let's look at one more example to cement the concept: let's find out how many words start with A, how many with B, and how many with C.
Our Map step gives us:
1.txt: A -> 2, B -> 1, C -> 1
2.txt: A -> 1, B -> 2, C -> 1
3.txt: B -> 1, C -> 3
and Reduce will give us
A -> 2 + 1 = 3
B -> 1 + 2 + 1 = 4
C -> 1 + 1 + 3 = 5
So it's the same set of data, but based on what result we are looking for we build a different kind of Map, and then reduce those maps to get the final answer.
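In code terms, only the map function changes for the second example; the reduce_counts function from the sketch above can be reused as-is (still just an illustrative sketch):

from collections import defaultdict

def map_first_letters(path):
    # Map step for the second example: {first_letter: count} for one file.
    counts = defaultdict(int)
    with open(path) as f:
        for line in f:
            for word in line.split():
                counts[word[0].upper()] += 1
    return dict(counts)

# Same files, same reduce logic -- only the map function is different.
letter_maps = [map_first_letters(p) for p in ["1.txt", "2.txt", "3.txt"]]
letter_totals = reduce_counts(letter_maps)  # reduce_counts from the earlier sketch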
How Map Reduce helps in dealing with server failures
In real-life scenarios there are terabytes of data and thousands of servers. Even if we assume an average server lifetime of 1000 days, that means roughly 1 server failing every day for every 1000 servers.
Map Reduce is good at dealing with such failures because each Map task generates an intermediate result that is independent of the other Map tasks. So even if one Map task fails, the same task can be rerun on another server that holds a copy of the failed server's data. And if a Reduce task fails, it can simply be rerun on the Map results.
Hopefully this gives you a better understanding of the Map Reduce concept.
You might also want to see the Hadoop Introduction post.