How to accept command-line parameters and pass them to an imported macro using Apache Pig 0.9

The docs don't specifically show an example of how to pass a command-line parameter to a macro, but they do say you can't use parameter substitution inside a macro. Below is an example of how to pass a command-line argument through to your Pig macro definition.





File 1 - A sample load location (schema.pig)




Typically I pass the location of my data on the command line, and my Python wrapper generates the date ranges required for the script I want to run. For example, if I want to run my Pig script over 3 days of data, the wrapper creates a Hadoop glob pattern such as '/user/beacons/{2011_222,2011_223,2011_224}/*beacons.csv', where 2011_222 is the YEAR_DAYOFYEAR. My schema changes every once in a while, and having to update it in every Pig file becomes annoying, so macros and imports come to the rescue. Here is a trimmed-down sample of a schema defined in a separate Pig file:
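A minimal sketch of what schema.pig might contain. The macro name LOAD_BEACONS, the argument name input_location, the comma delimiter, and every field name are assumptions made for illustration; swap in your real schema.

```pig
-- schema.pig (sketch; names and fields are hypothetical)
-- The macro takes the load location as an argument, so the schema
-- is written once and reused by every script that imports this file.
DEFINE LOAD_BEACONS(input_location) RETURNS beacons {
    $beacons = LOAD '$input_location' USING PigStorage(',') AS (
        user_id:chararray,
        event_type:chararray,
        page_url:chararray,
        event_time:long
    );
};
```

Inside the macro body, $input_location and $beacons are macro arguments rather than command-line parameters, which is why the location has to be passed in explicitly by the calling script.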








File 2 - Calculating unique users (uniques.pig)




In this file we import our schema and pass in the input location from the command line, so the full schema definition is reused.
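A minimal sketch of what uniques.pig might look like, assuming schema.pig defines a macro (here given the hypothetical name LOAD_BEACONS) that takes the load path as its only argument and returns a relation with a user_id field (also hypothetical). The parameter names input and output match the -param flags used on the command line.

```pig
-- uniques.pig (sketch; macro and field names are hypothetical)
IMPORT 'schema.pig';

-- $input is replaced by parameter substitution here, at the top level,
-- and the resulting string is handed to the macro as a normal argument.
beacons = LOAD_BEACONS('$input');

-- Keep only the user IDs, de-duplicate them, then count what is left.
users        = FOREACH beacons GENERATE user_id;
unique_users = DISTINCT users;
grouped      = GROUP unique_users ALL;
counts       = FOREACH grouped GENERATE COUNT(unique_users) AS uniques;

STORE counts INTO '$output';
```

The important line is the macro call: substitution of $input happens in the top-level script, so the macro itself never needs to know about command-line parameters.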





Now we just need to call our script, but first we'll do a dry run to make sure everything gets loaded properly:



pig -logfile /tmp -f uniques.pig -param input='/user/beacons/{2011_222,2011_223,2011_224}/*beacons.csv' -param output=uniques --dryrun



If that all looks good, we'll go ahead and run the full script, and away we go with macro imports:



pig -logfile /tmp -f uniques.pig -param input='/user/beacons/{2011_222,2011_223,2011_224}/*beacons.csv' -param output=uniques

Ready for More?

Follow Me @jimplush