Problem Description
Choosing a MapReduce Design Pattern
For my honours project, I'm researching ways of processing the very large quantities of data produced by a mass spectrometer (MS) within the University's Life Sciences department.
I've been reading MapReduce Design Patterns & Hadoop: The Definitive Guide, and I'm trying to connect the data I have to a particular map & reduce pattern.
The first step is to find the weighted peak for each curve: performing a calculation that reduces the 6 or 7 points (rows) of a curve to a single point, with info like max/min and average.
The data is in ~8GB CSV files that look like this:
100312_EXP229_GFPIP_5,1,0.00056783449000000002,0,402.0745297608135,1135.039794921875,1,0,0,0,0,0
Each row is essentially a single point on a graph, with X & Y coords.
The points are grouped into curves by one of the columns, so for each record I know which curve it belongs to.
The number of points in each curve varies; it's usually around 4-8 points.
My problem is that most of the examples seem to deal with problems line by line, e.g. Word Count. I need to process N rows at a time, although I'm not sure if this is the right way to go about solving this.
Thanks!
-----
Reference Solution
Method 1:
In your case it should be sufficient to process the file line by line and, for each line, emit <curve, point> from your mapper. The reducer then receives all points that belong to a single curve, <curve, {point1, point2, ...}>, as input. Thus you can calculate the max/min/avg over the points of a curve in the reducer and output <curve, max/min/avg> as the final result.
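The question doesn't say which CSV columns hold the curve identifier and the X/Y coordinates, so the sketch below uses placeholder indices (CURVE_COL, X_COL, Y_COL are assumptions to adjust), and the class/job names are hypothetical. It just illustrates the <curve, point> mapper and a reducer that computes max/min/average plus the X at the peak per curve:

    import java.io.IOException;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class CurvePeaks {

        // Hypothetical column positions -- adjust to match the real CSV layout.
        private static final int CURVE_COL = 1; // column identifying the curve
        private static final int X_COL = 4;     // X coordinate
        private static final int Y_COL = 5;     // Y coordinate

        public static class PointMapper
                extends Mapper<LongWritable, Text, Text, Text> {
            @Override
            protected void map(LongWritable key, Text value, Context context)
                    throws IOException, InterruptedException {
                String[] cols = value.toString().split(",");
                // Emit <curve, "x,y"> so all points of one curve meet in one reducer call.
                context.write(new Text(cols[CURVE_COL]),
                              new Text(cols[X_COL] + "," + cols[Y_COL]));
            }
        }

        public static class PeakReducer
                extends Reducer<Text, Text, Text, Text> {
            @Override
            protected void reduce(Text curve, Iterable<Text> points, Context context)
                    throws IOException, InterruptedException {
                double maxY = Double.NEGATIVE_INFINITY;
                double minY = Double.POSITIVE_INFINITY;
                double sumY = 0.0;
                double xAtMax = 0.0;
                int n = 0;
                for (Text p : points) {
                    String[] xy = p.toString().split(",");
                    double x = Double.parseDouble(xy[0]);
                    double y = Double.parseDouble(xy[1]);
                    if (y > maxY) { maxY = y; xAtMax = x; }
                    if (y < minY) { minY = y; }
                    sumY += y;
                    n++;
                }
                // One output record per curve: max, min, average and the X at the peak.
                context.write(curve,
                        new Text(maxY + "\t" + minY + "\t" + (sumY / n) + "\t" + xAtMax));
            }
        }

        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "curve peaks");
            job.setJarByClass(CurvePeaks.class);
            job.setMapperClass(PointMapper.class);
            job.setReducerClass(PeakReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(Text.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }

With only 4-8 points per curve, the value list for each key easily fits in the reducer's memory, so no combiner or secondary sort is needed.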
PS. For processing exactly N rows of the input file in each mapper you could use the NLineInputFormat.
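If you do want each map task to handle exactly N lines (which the approach above does not require), the job could be configured roughly like this; note that NLineInputFormat only controls how many lines go into each split, and the mapper is still invoked once per line:

    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.NLineInputFormat;

    public class NLineSetup {
        // Configure an existing job so that every map task's split contains at most n lines.
        // NLineInputFormat uses the same LongWritable/Text key-value pair as TextInputFormat,
        // so the PointMapper above would still work unchanged.
        static void useNLines(Job job, int n) {
            job.setInputFormatClass(NLineInputFormat.class);
            NLineInputFormat.setNumLinesPerSplit(job, n);
        }
    }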