Discussion Forums



Thread: Input Directory Recursive Copy




Permlink Replies: 5 - Pages: 1 - Last Post: Apr 8, 2009 1:50 PM by: Peter N. Skomor...
pws0

Posts: 34
Registered: 5/6/08
Input Directory Recursive Copy
Posted: Apr 7, 2009 6:11 PM PDT
 

Hey folks,

Quick question regarding the input to an EMR job. I understand the input is an S3 bucket location, the contents of which are copied to the temporary cluster set up by EMR to execute the job. My question is: is this copy recursive? Will subdirectories underneath the root S3 bucket location be copied as well?

Thanks!


Peter N. Skomoroch
RealName(TM)


Posts: 128
Registered: 11/16/06
Re: Input Directory Recursive Copy
Posted: Apr 8, 2009 9:31 AM PDT   in response to: pws0
 

Hadoop input directories are expected to be flat: for example, 100 input files directly below the parent directory, with no subdirectories.

For example:

s3bucket/path/file1.txt
                      /file2.txt
                      /...
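As a rough illustration of what "flat" means here, the sketch below takes a list of object keys and reports any that sit in a subdirectory below the input prefix. The function name nested_keys and the sample keys are made up for this example, and it works on a plain list of strings rather than calling S3, so treat it as a sanity check under those assumptions and not as anything EMR itself does.

```python
def nested_keys(keys, prefix):
    """Return the keys that live in a subdirectory below `prefix`.

    `keys` is a plain list of object key strings (hypothetical sample data);
    no S3 call is made here.
    """
    if not prefix.endswith("/"):
        prefix += "/"
    nested = []
    for key in keys:
        if not key.startswith(prefix):
            continue  # key is outside the input directory entirely
        remainder = key[len(prefix):]
        # Any further "/" means the key is at least one level deeper.
        if "/" in remainder:
            nested.append(key)
    return nested


sample = [
    "s3bucket/path/file1.txt",
    "s3bucket/path/file2.txt",
    "s3bucket/path/subdir/file3.txt",
]
print(nested_keys(sample, "s3bucket/path"))
# prints ['s3bucket/path/subdir/file3.txt']
```

An empty result means the directory is flat in the sense described above.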




pws0

Posts: 34
Registered: 5/6/08
Re: Input Directory Recursive Copy
Posted: Apr 8, 2009 10:45 AM PDT   in response to: Peter N. Skomor...
 


Hadoop input directories are expected to be flat, for example 100 input files below the parent directory with no subdirectories.
Indeed. However, I was thinking more of the use case where I would like to work with two separate data sets in the same job, for example to do a join and then aggregate on the joined set. Suppose I have the following directory setup in S3:

s3_bucket/overall_jobinput/data_set1/

and

s3_bucket/overall_jobinput/data_set2/

I would provide my EMR job with the following input: s3_bucket/overall_jobinput/

But, for my job to work, I would need to ensure EMR copied the directories underneath. Hopefully, this clarifies my use case.


Interestingly enough, as this exchange:

http://mail-archives.apache.org/mod_mbox/hadoop-core-user/200902.mbox/%3C4988B6FC.7020607@apache.org%3E

indicates, there are folks who would like more flexibility with this in Hadoop as well, but that's a Hadoop-specific issue.

Thanks!



Peter N. Skomoroch
RealName(TM)


Posts: 128
Registered: 11/16/06
Re: Input Directory Recursive Copy
Posted: Apr 8, 2009 10:56 AM PDT   in response to: pws0
 

I'm planning on posting a join example soon. In the meantime, you can just specify each of those directories as a separate input in the console args or JSON job description:

-input s3_bucket/overall_jobinput/data_set1/
-input s3_bucket/overall_jobinput/data_set2/

http://hadoop.apache.org/core/docs/r0.18.3/streaming.html#How+do+I+specify+multiple+input+directories%3F
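To make the multiple-input idea concrete, here is a minimal sketch that assembles a streaming argument list with one -input flag per directory. The helper name build_streaming_args, the output location, and the script names are placeholders invented for this example; only the -input, -output, -mapper and -reducer flags come from Hadoop streaming itself.

```python
def build_streaming_args(input_dirs, output_dir, mapper, reducer):
    """Build a Hadoop streaming argument list with one -input per directory."""
    args = []
    for directory in input_dirs:
        args += ["-input", directory]
    args += ["-output", output_dir, "-mapper", mapper, "-reducer", reducer]
    return args


args = build_streaming_args(
    ["s3_bucket/overall_jobinput/data_set1/",
     "s3_bucket/overall_jobinput/data_set2/"],
    "s3_bucket/overall_joboutput/",  # hypothetical output location
    "join_mapper.py",                # hypothetical script names
    "join_reducer.py",
)
print(" ".join(args))
```

The resulting list has the same shape as the two -input lines above, just generated from a list of directories.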


pws0

Posts: 34
Registered: 5/6/08
Re: Input Directory Recursive Copy
Posted: Apr 8, 2009 11:56 AM PDT   in response to: Peter N. Skomor...
 

I'm planning on posting a join example soon. In the meantime, you can just specify each of those directories as a separate input in the console args or JSON job description:

-input s3_bucket/overall_jobinput/data_set1/
-input s3_bucket/overall_jobinput/data_set2/

Thanks - that's great to know!

One last question: is the -arg option in the Ruby CLI a way of passing command-line parameters to the Hadoop job JAR? If so, could I use it to specify the locations of the two different data directories?

Thanks again!


Peter N. Skomoroch
RealName(TM)


Posts: 128
Registered: 11/16/06
Re: Input Directory Recursive Copy
Posted: Apr 8, 2009 1:50 PM PDT   in response to: pws0
 

That is correct. You can put any of the usual command-line arguments, like extra "-input" lines, into the "args" box and they will get passed to the job.
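The join example mentioned earlier is not part of this thread, so the following is only a sketch of one common pattern once both directories are passed as inputs: the mapper tags each record with the data set it came from, so a reducer can pair records by key. It assumes tab-separated key/value lines and relies on Hadoop streaming exposing the map.input.file job setting to the mapper as the environment variable map_input_file (worth verifying on your Hadoop version). The function name tag_record and the "A"/"B" tags are made up for illustration.

```python
import os


def tag_record(line, input_path):
    """Prefix the value with a source tag based on the input file path.

    Assumes tab-separated "key<TAB>value" lines; the tags "A" and "B"
    are arbitrary labels chosen for this sketch.
    """
    key, _, value = line.rstrip("\n").partition("\t")
    tag = "A" if "data_set1" in input_path else "B"
    return f"{key}\t{tag}\t{value}"


# Assumption: streaming exports map.input.file as map_input_file.
# Outside a Hadoop task the variable is unset, so fall back to "".
current_input = os.environ.get("map_input_file", "")

# Small demonstration with hypothetical records and paths.
print(tag_record("user42\tAlice\n", "s3_bucket/overall_jobinput/data_set1/part-00000"))
print(tag_record("user42\t3 orders\n", "s3_bucket/overall_jobinput/data_set2/part-00000"))
```

In a real mapper, each line read from standard input would be passed through tag_record together with current_input, and the reducer would then see both tagged records for a key next to each other.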


