Problem Description
Execute query on Spark vs Redshift
Our data warehouse lives in Redshift (about 50 TB). Business users sometimes run large queries (with excessive joins and inline queries generated by BI tools such as Tableau), and these large queries degrade database performance.
Is it wise to use Spark on top of Redshift to offload some of the computation away from Redshift?
Or would it be easier and more cost-effective to simply increase Redshift's compute capacity by adding more nodes?
If I run
select a.col1, b.col2 from table1 a, table2 b where a.key = b.key
in Spark, with the tables connected via JDBC and residing on Redshift, where does the actual processing happen (in Spark or in Redshift)?
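To make the distinction concrete, here is a minimal sketch using Python's built-in sqlite3 as a stand-in for a database reached over JDBC (the table and column names come from the query above; Redshift itself is not involved). It contrasts pushing the whole join down to the database with fetching each table and joining on the client, which is what happens when Spark loads two JDBC tables as separate DataFrames and joins them itself:

```python
import sqlite3

# sqlite3 stands in for Redshift-over-JDBC in this illustration.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE table1 (key INTEGER, col1 TEXT);
    CREATE TABLE table2 (key INTEGER, col2 TEXT);
    INSERT INTO table1 VALUES (1, 'a1'), (2, 'a2');
    INSERT INTO table2 VALUES (1, 'b1'), (3, 'b3');
""")

# Case 1: push the whole query down -- the join executes inside the
# database, and only the joined result crosses the wire.
pushed = conn.execute(
    "SELECT a.col1, b.col2 FROM table1 a JOIN table2 b ON a.key = b.key"
).fetchall()

# Case 2: fetch each table separately and join on the client -- both
# full tables cross the wire, and the join work happens client-side.
t1 = conn.execute("SELECT key, col1 FROM table1").fetchall()
t2 = dict(conn.execute("SELECT key, col2 FROM table2").fetchall())
client_side = [(col1, t2[key]) for key, col1 in t1 if key in t2]

print(pushed)       # [('a1', 'b1')]
print(client_side)  # [('a1', 'b1')]
```

Both cases return the same rows; the difference is where the join work and the data transfer happen, which is exactly the trade-off the question is asking about.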
Reference Solution
Approach 1:
Any queries on data stored in Amazon Redshift are performed by the Amazon Redshift nodes. While Spark can make an external JDBC call, the SQL itself will still be executed by Redshift.
There are many techniques to optimize Redshift query execution:
- Tuning Query Performance
- Top 10 Performance Tuning Techniques for Amazon Redshift
- Tuning Workload Management parameters to control parallel queries and memory allocation
Start by looking at queries that consume too many resources and determine whether they can be optimized by changing the Sort Key, Distribution Key, and Compression Encodings used by each table. Correct use of these parameters can greatly improve Redshift performance.
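As a sketch of what such table tuning can look like (the table and column names below are hypothetical; the right keys and encodings depend entirely on your own query patterns), a fact table might be distributed on its most common join key so joins are co-located, sorted on the column most often used in range filters, and compressed per column:

```sql
-- Hypothetical fact table: DISTKEY co-locates rows that join on
-- customer_id, SORTKEY speeds range filters on sale_date, and the
-- ENCODE clauses reduce I/O through column compression.
CREATE TABLE sales (
    sale_id     BIGINT        ENCODE az64,
    customer_id BIGINT        ENCODE az64,
    sale_date   DATE          ENCODE az64,
    amount      DECIMAL(12,2) ENCODE az64,
    region      VARCHAR(32)   ENCODE lzo
)
DISTKEY (customer_id)
SORTKEY (sale_date);
```

Specifying DISTKEY implies DISTSTYLE KEY; tables joined on the same distribution key can be joined without redistributing rows across nodes.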
Then, if many users are running simultaneous queries, check whether it is worth improving Workload Management settings to create separate queues with different memory settings.
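As an illustration of such a setup (the query group name, concurrency levels, and memory percentages here are hypothetical), a WLM configuration of this shape creates a small, memory-rich queue for BI-tool queries and a default queue for everything else; it would be applied through the cluster's `wlm_json_configuration` parameter:

```json
[
  {
    "query_group": ["tableau"],
    "query_concurrency": 3,
    "memory_percent_to_use": 60
  },
  {
    "query_concurrency": 5,
    "memory_percent_to_use": 40
  }
]
```

The last queue, with no user or query group, acts as the default queue for unmatched queries.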
Finally, if performance is still a problem, add more Redshift nodes. Dense Compute nodes offer better performance because they use SSD storage, but at a higher cost per TB of storage.
(by Dipankar, John Rotenstein)