I've been starting to write some Hadoop streaming jobs in Python, and there isn't much documentation out there on the practical details: how do you pass environment variables, how do you ship along the modules your scripts need, and so on.

Here are a couple of quick tips. To pass environment variables to your task nodes, use the -cmdenv command line param when launching a Hadoop job:


/Users/Hadoop/hadoop/bin/hadoop jar /Users/Hadoop/hadoop/contrib/streaming/hadoop-0.20.2-streaming.jar \
-mapper /Users/hadoop/code/traffic/mapper.py \
-reducer /Users/hadoop/code/traffic/reducer.py \
-input insights-input-small/* \
-output insights-output-traffic \
-cmdenv PYTHONPATH=$PYTHONPATH:/Users/jim/Code \
-cmdenv MYAPP_PATH=/Users/jim/Code \
-cmdenv MYAPP_ENVIRONMENT=development
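
Inside your mapper or reducer those values show up as ordinary environment variables on the task node, so you can read them with os.environ. Here's a minimal sketch of what the top of mapper.py might look like, using the variable names from the command above (the mapper logic itself is just a stand-in):

 #!/usr/bin/env python
 import os
 import sys

 # anything passed via -cmdenv is a plain environment variable on the task node
 environment = os.environ.get('MYAPP_ENVIRONMENT', 'production')
 code_path = os.environ.get('MYAPP_PATH')

 for line in sys.stdin:
     # stand-in mapper logic: tag each input line with the current environment
     print('%s\t%s' % (environment, line.strip()))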
If you want to distribute your modules to the task nodes instead of having them pre-installed there, you can zip up your module file, rename the archive to mymodule.mod, and add the command line param -file /Users/jim/Code/mymodule.mod. Hadoop ships that file alongside your job, so in your script you can unzip it and import it as usual:
 import zipimport

 # load the module straight out of the zipped archive shipped with -file
 importer = zipimport.zipimporter('mymodule.mod')
 mymodule = importer.load_module('mymodule')
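
To build the archive in the first place, any zip tool will do; here's a quick sketch using Python's zipfile module, assuming your module is a single file called mymodule.py sitting next to the script (adjust the names to taste):

 import zipfile

 # bundle mymodule.py into mymodule.mod so -file can ship it with the job
 archive = zipfile.ZipFile('mymodule.mod', 'w')
 archive.write('mymodule.py')  # stored at the archive root so zipimport can find it
 archive.close()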
Hope that helps someone :)
