Problem Description
Choosing a MapReduce Design Pattern
For my honours project, I'm researching ways of processing the very large quantities of data produced by a mass spectrometer (MS) within the University's Life Sciences department.
I've been reading MapReduce Design Patterns & Hadoop: The Definitive Guide, and I'm trying to connect the data I have to a particular map & reduce pattern.
The first step is to find the weighted peak for each curve: performing a calculation that reduces the 6 or 7 points (rows) of a curve to a single point, with info like max/min and average.
The data is in ~8GB CSV files that look like this:
100312_EXP229_GFPIP_5,1,0.00056783449000000002,0,402.0745297608135,1135.039794921875,1,0,0,0,0,0
Each row is essentially a single point on a graph, with X & Y coords.
The points are grouped into curves by one of the columns, so for each record I know which curve it belongs to.
The number of points in each curve varies; it's usually around 4-8 points.
My problem is that most of the examples seem to deal with problems line by line, e.g. Word Count. I need to process N rows at a time, although I'm not sure if this is the right way to go about solving this.
Thanks!
-----
Reference Solution
Method 1:
In your case it should be sufficient to process the file line by line and, for each line, emit <curve, point> from your mapper. The reducer then receives all points that belong to a single curve, <curve, {point1, point2, ...}>, as input. Thus you can calculate the max/min/avg over the points of a curve in the reducer and output <curve, max/min/avg> as the final result.
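The question doesn't say which CSV columns hold the curve identifier and the X/Y coordinates, so the sketch below uses placeholder indices (CURVE_COL, X_COL, Y_COL are assumptions to adjust), and the class/job names are hypothetical. It just illustrates the <curve, point> mapper and a reducer that computes max/min/average plus the X at the peak per curve:

    import java.io.IOException;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class CurvePeaks {

        // Hypothetical column positions -- adjust to match the real CSV layout.
        private static final int CURVE_COL = 1; // column identifying the curve
        private static final int X_COL = 4;     // X coordinate
        private static final int Y_COL = 5;     // Y coordinate

        public static class PointMapper
                extends Mapper<LongWritable, Text, Text, Text> {
            @Override
            protected void map(LongWritable key, Text value, Context context)
                    throws IOException, InterruptedException {
                String[] cols = value.toString().split(",");
                // Emit <curve, "x,y"> so all points of one curve meet in one reducer call.
                context.write(new Text(cols[CURVE_COL]),
                              new Text(cols[X_COL] + "," + cols[Y_COL]));
            }
        }

        public static class PeakReducer
                extends Reducer<Text, Text, Text, Text> {
            @Override
            protected void reduce(Text curve, Iterable<Text> points, Context context)
                    throws IOException, InterruptedException {
                double maxY = Double.NEGATIVE_INFINITY;
                double minY = Double.POSITIVE_INFINITY;
                double sumY = 0.0;
                double xAtMax = 0.0;
                int n = 0;
                for (Text p : points) {
                    String[] xy = p.toString().split(",");
                    double x = Double.parseDouble(xy[0]);
                    double y = Double.parseDouble(xy[1]);
                    if (y > maxY) { maxY = y; xAtMax = x; }
                    if (y < minY) { minY = y; }
                    sumY += y;
                    n++;
                }
                // One output record per curve: max, min, average and the X at the peak.
                context.write(curve,
                        new Text(maxY + "\t" + minY + "\t" + (sumY / n) + "\t" + xAtMax));
            }
        }

        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "curve peaks");
            job.setJarByClass(CurvePeaks.class);
            job.setMapperClass(PointMapper.class);
            job.setReducerClass(PeakReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(Text.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }

With only 4-8 points per curve, the value list for each key easily fits in the reducer's memory, so no combiner or secondary sort is needed.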
PS. For processing exactly N rows of the input file in each mapper you could use the NLineInputFormat.
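If you do want each map task to handle exactly N lines (which the approach above does not require), the job could be configured roughly like this; note that NLineInputFormat only controls how many lines go into each split, and the mapper is still invoked once per line:

    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.NLineInputFormat;

    public class NLineSetup {
        // Configure an existing job so that every map task's split contains at most n lines.
        // NLineInputFormat uses the same LongWritable/Text key-value pair as TextInputFormat,
        // so the PointMapper above would still work unchanged.
        static void useNLines(Job job, int n) {
            job.setInputFormatClass(NLineInputFormat.class);
            NLineInputFormat.setNumLinesPerSplit(job, n);
        }
    }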