Thread: Input Directory Recursive Copy
Replies: 5 - Pages: 1 - Last Post: Apr 8, 2009 1:50 PM by: Peter N. Skomor...

Posts: 34
Registered: 5/6/08
Input Directory Recursive Copy
Posted: Apr 7, 2009 6:11 PM PDT

Hey folks,
quick question regarding the input to an EMR job. I understand the input is an S3 bucket location, the contents of which are copied to the temporary cluster set up by EMR to execute the job. My question is: is this copy recursive? Will subdirectories underneath the root S3 bucket be copied as well?
Thanks!

Posts: 128
Registered: 11/16/06
Re: Input Directory Recursive Copy
Posted: Apr 8, 2009 9:31 AM PDT
in response to: pws0

Hadoop input directories are expected to be flat: for example, 100 input files directly below the parent directory, with no subdirectories.
For example:
s3bucket/path/file1.txt
             /file2.txt
             /...
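
To make "flat" concrete, here is a minimal Python sketch that flags any object keys sitting inside a subdirectory of the input path. It assumes you already have the list of keys from somewhere; `nested_keys` and the sample keys are hypothetical names for illustration, not part of EMR or Hadoop.

```python
def nested_keys(keys, prefix):
    """Return the keys under `prefix` that are not direct children of it."""
    if not prefix.endswith("/"):
        prefix += "/"
    nested = []
    for key in keys:
        if not key.startswith(prefix):
            continue  # key is outside the input directory
        remainder = key[len(prefix):]
        # Any further "/" means the key lives in (or is) a subdirectory.
        if "/" in remainder:
            nested.append(key)
    return nested


# Hypothetical listing: two flat files and one nested file.
sample_keys = [
    "s3bucket/path/file1.txt",
    "s3bucket/path/file2.txt",
    "s3bucket/path/sub/file3.txt",
]
print(nested_keys(sample_keys, "s3bucket/path"))
```

An empty result would mean the directory is flat in the sense described above.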

Posts: 34
Registered: 5/6/08
Re: Input Directory Recursive Copy
Posted: Apr 8, 2009 10:45 AM PDT
in response to: Peter N. Skomor...

> Hadoop input directories are expected to be flat, for example 100 input files below the parent directory with no subdirectories.
Indeed. However, I was thinking more of the use case where I would like to work with two separate data sets in the same job, for example to do a join and then aggregate on the joined set. Suppose I had the following directory setup in S3:
s3_bucket/overall_jobinput/data_set1/
and
s3_bucket/overall_jobinput/data_set2/
I would provide my EMR job with the following input: s3_bucket/overall_jobinput/
But for my job to work, I would need to ensure that EMR copies the directories underneath. Hopefully this clarifies my use case.
Interestingly enough, as this exchange:
http://mail-archives.apache.org/mod_mbox/hadoop-core-user/200902.mbox/%3C4988B6FC.7020607@apache.org%3E
indicates, there are folks who would like more flexibility with this in Hadoop as well, but that's a Hadoop-specific issue.
Thanks!

Posts: 128
Registered: 11/16/06

Posts: 34
Registered: 5/6/08
Re: Input Directory Recursive Copy
Posted: Apr 8, 2009 11:56 AM PDT
in response to: Peter N. Skomor...

> I'm planning on posting a join example soon; you can just specify each of those directories as separate inputs in the console args or JSON job description:
> -input s3_bucket/overall_jobinput/data_set1/
> -input s3_bucket/overall_jobinput/data_set2/
Thanks - that's great to know!
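
For anyone scripting this, a rough Python sketch of assembling one `-input` flag per data set directory. `build_input_args` is a hypothetical helper, and the output path is made up for the example; only the two input paths come from this thread. `-output` is the usual Hadoop streaming flag for the result location.

```python
def build_input_args(input_dirs, output_dir):
    """Build an argument list with one -input flag per directory."""
    args = []
    for directory in input_dirs:
        args.extend(["-input", directory])
    args.extend(["-output", output_dir])
    return args


job_args = build_input_args(
    [
        "s3_bucket/overall_jobinput/data_set1/",
        "s3_bucket/overall_jobinput/data_set2/",
    ],
    "s3_bucket/overall_joboutput/",  # hypothetical output location
)
print(" ".join(job_args))
```

The resulting list can be joined into a single string or passed along one argument at a time, depending on what the submitting tool expects.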
One last question: is the -arg option in the Ruby CLI a way of passing command-line parameters to the Hadoop job jar? For example, could I then use it to specify the locations of the two different data directories?
Thanks again!

Posts: 128
Registered: 11/16/06
Re: Input Directory Recursive Copy
Posted: Apr 8, 2009 1:50 PM PDT
in response to: pws0

That is correct. You can put any of the usual command-line arguments, such as extra "-input" lines, into the "args" box, and they will be passed to the job.
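
On the receiving end, here is a hedged sketch of how a job's entry point might collect repeated "-input" arguments. It is written in Python for brevity (a job jar would do the equivalent in its Java main method), and `parse_job_args` is a hypothetical name, not something EMR or Hadoop provides.

```python
import argparse


def parse_job_args(argv):
    """Collect every -input value into a list, in the order given."""
    parser = argparse.ArgumentParser(description="Sketch of job argument parsing")
    # action="append" gathers each occurrence of -input into one list.
    parser.add_argument("-input", action="append", dest="inputs")
    parsed = parser.parse_args(argv)
    return parsed.inputs or []


inputs = parse_job_args([
    "-input", "s3_bucket/overall_jobinput/data_set1/",
    "-input", "s3_bucket/overall_jobinput/data_set2/",
])
print(inputs)
```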