|
Discussion Forums
|
Thread: Multiple output files
|
|
|
Replies:
3
-
Pages:
1
-
Last Post:
Aug 7, 2009 1:45 AM
by: Andrew@AWS
|
|
|
Posts:
3
Registered:
8/6/09
|
|
|
|
Multiple output files
Posted:
Aug 6, 2009 10:56 AM PDT
|
|
|
I've been playing with the example 'wordcount' MapReduce application provided by AWS. It outputs data in multiple 'parts' - with the example data it produces 6 separate output files instead of 1.
Can anyone explain why this is happening?
I'm new to MapReduce so forgive me if this is a stupid question!
Thanks
Dan
|
|
Posts:
74
Registered:
1/12/09
|
|
|
|
Re: Multiple output files
Posted:
Aug 6, 2009 3:06 PM PDT
in response to: mdjc20
|
|
|
Hi Dan,
In Hadoop, each reducer writes to its own file. This allows the system to scale because having reducers write to separate files increases aggregate disk write throughput. Also, you'll notice that each file is sorted, but they are not sorted in relation to each other. In other words, part-00000 might have keys [0, 2, 4] and part-00001 might have keys [1, 3, 5].
If this is a problem for you, there are a few workarounds. First, you could configure your job to only use one reducer. To do this for a streaming job (which the word count sample is) using the console you would specify the following command in "Extra Args" during the "Specify Parameters" step of creating a job flow.
-jobconf mapred.reduce.tasks=1
However, this has the drawback of serializing processing to one node and serializing writing to one disk. If you have lots of data to reduce, this will go really slow and won't scale with cluster size.
Another option is to add a second step to your job which doesn't modify the data, but just sorts it and saves it to one file. This can be done using the IdentityMapper and IdentityReducer. If your first job can greatly reduce the amount of output data, then this is better because the first pass runs in parallel while the second pass—which sorts the data—will have to operate on much less data. You can do this with the Ruby client using the following command:
./elastic-mapreduce --create \
--stream --input s3n://elasticmapreduce/samples/wordcount/input --output hdfs:///temp/intermediate/ --mapper s3n://elasticmapreduce/samples/wordcount/wordSplitter.py --reducer aggregate \
--stream --input hdfs:///temp/intermediate/ --output s3n://<your_bucket>/output/ --mapper /bin/cat --reducer org.apache.hadoop.mapred.lib.IdentityReducer --args -jobconf,mapred.reduce.tasks=1
Two things to notice about the above command:
1. I put the intermediate data into HDFS. You could instead put it in S3, but HDFS is faster for intermediate data.
2. I used /bin/cat instead of the IdentityMapper because IdentityMapper fails complaining about type, while /bin/cat accepts anything and returns it is as text.
Hope this answers your question.
Andrew
|
|
Posts:
3
Registered:
8/6/09
|
|
|
|
Re: Multiple output files
Posted:
Aug 7, 2009 1:04 AM PDT
in response to: mdjc20
|
|
|
Thank you - a perfect explanation! Just one more question.
If number of reducers is not specified, how many are used by default? I note the official Hadoop recommendation is between 0.95 and 1.75 per node - what is used in AWS's implementation?
Thanks
Message was edited by: mdjc20
|
|
Posts:
74
Registered:
1/12/09
|
|
|
|
Re: Multiple output files
Posted:
Aug 7, 2009 1:45 AM PDT
in response to: mdjc20
|
|
|
Let me first give a disclaimer. The default number of reducers is an implementation detail and may change at a later date. If you rely on a specific number of reducers, you should set it yourself and not depend on the current default setting. That said, here is the algorithm we currently use.
Reducers per node is generally one per core, except on c1.xlarge. Here are the exact values for reducers per node.
m1.small 1
m1.large 2
m1.xlarge 4
c1.medium 2
c1.xlarge 4
Andrew
|
|
|
|