Problem Description
Execute query on Spark vs Redshift
Our data warehouse lives in Redshift (about 50 TB). Business users sometimes run large queries (with excessive joins and inline queries generated by BI tools such as Tableau), and these large queries degrade database performance.
Is it wise to use Spark on top of Redshift to offload some of the computation away from Redshift?
Or would it be easier and more cost-effective to simply increase Redshift's compute capacity by adding more nodes?
If I run
select a.col1, b.col2 from table1 a, table2 b where a.key = b.key
in Spark, with the tables connected via JDBC and residing on Redshift, where does the actual processing happen (in Spark or in Redshift)?
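To make the distinction concrete, here is a minimal sketch using Python's built-in sqlite3 as a stand-in for a database reached over JDBC (the table and column names come from the query above; Redshift itself is not involved). It contrasts pushing the whole join down to the database with fetching each table and joining on the client, which is what happens when Spark loads two JDBC tables as separate DataFrames and joins them itself:

```python
import sqlite3

# sqlite3 stands in for Redshift-over-JDBC in this illustration.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE table1 (key INTEGER, col1 TEXT);
    CREATE TABLE table2 (key INTEGER, col2 TEXT);
    INSERT INTO table1 VALUES (1, 'a1'), (2, 'a2');
    INSERT INTO table2 VALUES (1, 'b1'), (3, 'b3');
""")

# Case 1: push the whole query down -- the join executes inside the
# database, and only the joined result crosses the wire.
pushed = conn.execute(
    "SELECT a.col1, b.col2 FROM table1 a JOIN table2 b ON a.key = b.key"
).fetchall()

# Case 2: fetch each table separately and join on the client -- both
# full tables cross the wire, and the join work happens client-side.
t1 = conn.execute("SELECT key, col1 FROM table1").fetchall()
t2 = dict(conn.execute("SELECT key, col2 FROM table2").fetchall())
client_side = [(col1, t2[key]) for key, col1 in t1 if key in t2]

print(pushed)       # [('a1', 'b1')]
print(client_side)  # [('a1', 'b1')]
```

Both cases return the same rows; the difference is where the join work and the data transfer happen, which is exactly the trade-off the question is asking about.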
Reference Solution
Approach 1:
Any queries on data stored in Amazon Redshift are performed by the Amazon Redshift nodes. While Spark can make an external JDBC call, the SQL itself will still be executed by Redshift.
There are many techniques to optimize Redshift query execution:
- Tuning Query Performance
- Top 10 Performance Tuning Techniques for Amazon Redshift
- Tuning Workload Management parameters to control parallel queries and memory allocation
Start by looking at queries that consume too many resources and determine whether they can be optimized by changing the Sort Key, Distribution Key, and Compression Encodings used by each table. Correct use of these parameters can greatly improve Redshift performance.
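As a sketch of what such table tuning can look like (the table and column names below are hypothetical; the right keys and encodings depend entirely on your own query patterns), a fact table might be distributed on its most common join key so joins are co-located, sorted on the column most often used in range filters, and compressed per column:

```sql
-- Hypothetical fact table: DISTKEY co-locates rows that join on
-- customer_id, SORTKEY speeds range filters on sale_date, and the
-- ENCODE clauses reduce I/O through column compression.
CREATE TABLE sales (
    sale_id     BIGINT        ENCODE az64,
    customer_id BIGINT        ENCODE az64,
    sale_date   DATE          ENCODE az64,
    amount      DECIMAL(12,2) ENCODE az64,
    region      VARCHAR(32)   ENCODE lzo
)
DISTKEY (customer_id)
SORTKEY (sale_date);
```

Specifying DISTKEY implies DISTSTYLE KEY; tables joined on the same distribution key can be joined without redistributing rows across nodes.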
Then, if many users are running simultaneous queries, check whether it is worth improving Workload Management settings to create separate queues with different memory settings.
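As an illustration of such a setup (the query group name, concurrency levels, and memory percentages here are hypothetical), a WLM configuration of this shape creates a small, memory-rich queue for BI-tool queries and a default queue for everything else; it would be applied through the cluster's `wlm_json_configuration` parameter:

```json
[
  {
    "query_group": ["tableau"],
    "query_concurrency": 3,
    "memory_percent_to_use": 60
  },
  {
    "query_concurrency": 5,
    "memory_percent_to_use": 40
  }
]
```

The last queue, with no user or query group, acts as the default queue for unmatched queries.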
Finally, if performance is still a problem, add more Redshift nodes. Dense Compute nodes offer better performance because they use SSD storage, but at a higher cost per TB of storage.
(by Dipankar, John Rotenstein)