Problem Description
Restricting loading of log files in Pig Latin based on a date range of interest passed as parameter input
I'm having a problem loading log files based on parameter input and was wondering whether someone would be able to provide some guidance. The logs in question are Omniture logs, stored in subdirectories based on year, month, and day (e.g. /year=2013/month=02/day=14), and with the date stamp in the filename. For any given day, multiple logs could exist, each hundreds of MB.
I have a Pig script which currently processes logs for an entire month, with the month and the year specified as script parameters (e.g. /year=$year/month=$month/day=*). It works fine and we're quite happy with it. That said, we want to switch to weekly processing of logs, which means the previous LOAD path glob won't work (weeks can wrap months as well as years). To solve this, I have a Python UDF which takes a start date and spits out the necessary glob for a week's worth of logs, e.g.:
>>> log_path_regex(2013, 1, 28)
'{year=2013/month=01/day=28,year=2013/month=01/day=29,year=2013/month=01/day=30,year=2013/month=01/day=31,year=2013/month=02/day=01,year=2013/month=02/day=02,year=2013/month=02/day=03}'
This glob will then be inserted in the appropriate path:
> %declare omniture_log_path 's3://foo/bar/$week_path/*.tsv.gz';
> data = LOAD '$omniture_log_path' USING OmnitureTextLoader(); -- See http://github.com/msukmanowsky/OmnitureTextLoader
Unfortunately, I can't for the life of me figure out how to populate $week_path based on the $year, $month and $day script parameters. I tried using %declare, but Grunt complains and claims to be logging error messages to a file that never appears:
> %declare week_path util.log_path_regex(year, month, day);
2013-02-14 16:54:02,648 [main] INFO org.apache.pig.Main - Apache Pig version 0.10.1 (r1426677) compiled Dec 28 2012, 16:46:13
2013-02-14 16:54:02,648 [main] INFO org.apache.pig.Main - Logging error messages to: /tmp/pig_1360878842643.log
% ls /tmp/pig_1360878842643.log
ls: cannot access /tmp/pig_1360878842643.log: No such file or directory
The same error results if I prefix the parameters with dollar signs or surround prefixed parameters with quotes.
If I try to use define (which I believe only works for static Java functions), I get the following:
> define week_path util.log_path_regex(year, month, day);
2013-02-14 17:00:42,392 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1200: <file script.pig, line 11, column 37> mismatched input 'year' expecting RIGHT_PAREN
As with %declare, I get the same error if I prefix the parameters with dollar signs or surround prefixed parameters with quotes.
I've searched around and haven't come up with a solution. I'm possibly searching for the wrong thing. Invoking a shell command may work, but would be difficult as it would complicate our script deploy and may not be feasible given we're retrieving logs from S3 and not a mounted directory. Similarly, passing the generated glob as a single parameter may complicate an automated job on an instantiated MapReduce cluster.
It's also likely there's a nice Pig-friendly way to restrict LOAD other than using globs. That said, I'd still have to use my UDF, which seems to be the root of the issue.
This really boils down to me wanting to include a dynamic path glob built inside Pig in my LOAD statement. Pig doesn't seem to be making that easy.
Do I need to convert my UDF to a static Java method? Or will I run into the same issue? (I hesitate to do this on the off-chance it will work. It's an 8-line Python function, readily deployable and much more maintainable by others than the equivalent Java code would be.)
Is a custom LoadFunc the answer? With that, I'd presumably have to specify /year=*/month=*/day=* and force Pig to test every file name for a date stamp which falls between two dates. That seems like a huge hack and a waste of resources.
Any ideas?
-----
Reference Solution
Method 1:
I posted this question to the Pig user list. My understanding is that Pig first pre-processes its scripts to substitute parameters, imports and macros before building the DAG. This makes it effectively impossible to build new variables from existing ones, and explains why my attempt to build a path glob with a UDF failed.
If you are a Pig developer who needs new variables built from existing parameters, you can either use another script to construct those variables and pass them to your Pig script as parameters, or work out where those new variables are needed and build them in a separate construct suited to that use.
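(For example, a thin wrapper could compute the glob up front and hand it to Pig via the standard -param flag, along the lines of pig -param week_path='{year=2013/month=01/day=28,...}' script.pig.)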
In my case, I reluctantly opted to create a custom LoadFunc as described by Cheolsoo Park. This LoadFunc accepts the day, month and year for the beginning of the report period in its constructor, and builds a pathGlob attribute to match paths for that period. That pathGlob is then inserted into the location in setLocation(), e.g.:
/**
 * Limit data to a week starting at the given day. If day is 0, the whole
 * month is assumed.
 */
public WeeklyOrMonthlyTextLoader(String year, String month, String day) {
    super();
    pathGlob = getPathGlob(
        Integer.parseInt(year),
        Integer.parseInt(month),
        Integer.parseInt(day)
    );
}
/**
 * Replace DATE_PATH in the location with the glob required for reading in
 * this month or week of data. This assumes the following directory
 * structure:
 *
 * <code>/year=<year>/month=<month>/day=<day>/*</code>
 */
@Override
public void setLocation(String location, Job job) throws IOException {
    location = location.replace(GLOB_PLACEHOLDER, pathGlob);
    super.setLocation(location, job);
}
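Neither getPathGlob nor GLOB_PLACEHOLDER is shown in the excerpt above. As a rough sketch of what they might look like (my reconstruction, not Ian's actual code; it uses java.time for brevity, which postdates the original 2013 answer, and assumes the placeholder is the DATE_PATH string named in the Javadoc):

import java.time.LocalDate;
import java.util.StringJoiner;

// Assumed placeholder replaced in setLocation(); the Javadoc above calls it DATE_PATH.
private static final String GLOB_PLACEHOLDER = "DATE_PATH";

/**
 * Sketch: build a glob matching either a whole month (day == 0) or the
 * seven days starting at the given date, in the same format the Python
 * UDF above produces, e.g. {year=2013/month=01/day=28,...}.
 */
private static String getPathGlob(int year, int month, int day) {
    if (day == 0) {
        // Monthly mode: match every day in the given month.
        return String.format("year=%04d/month=%02d/day=*", year, month);
    }
    // Weekly mode: enumerate seven consecutive days, crossing month and
    // year boundaries as needed.
    StringJoiner glob = new StringJoiner(",", "{", "}");
    LocalDate start = LocalDate.of(year, month, day);
    for (int i = 0; i < 7; i++) {
        LocalDate d = start.plusDays(i);
        glob.add(String.format("year=%04d/month=%02d/day=%02d",
                d.getYear(), d.getMonthValue(), d.getDayOfMonth()));
    }
    return glob.toString();
}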
The loader is then called from a Pig script like so:
DEFINE TextLoader com.foo.WeeklyOrMonthlyTextLoader('$year', '$month', '$day');
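The LOAD statement then points at a location containing the placeholder, which setLocation() expands at runtime; following the S3 path from the question, presumably something like:

data = LOAD 's3://foo/bar/DATE_PATH/*.tsv.gz' USING TextLoader();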
Note that the constructor accepts String, not int. This is because parameters in Pig are strings and cannot be cast or converted to other types within the Pig script (except when used in MR tasks).
While creating a custom LoadFunc may seem overkill compared to a wrapper script, I wanted the solution to be pure Pig to avoid forcing analysts to perform a setup task before working with their scripts. I also wanted to readily use a stock Pig script for different periods when creating an Amazon MapReduce cluster for a scheduled job.
(by Ian Stevens)