Problem Description
Long-running Cloud Task on GAE flexible terminates early without error. How to debug? What am I missing?
I'm running an application on the GAE flexible environment using Python and Flask. I periodically schedule Cloud Tasks via a cron job. These tasks basically loop over all users and perform some clustering analysis. The task terminates without raising any kind of error, but does not finish all the work (meaning not all users are looped through). It does not seem to happen at a consistent time (somewhere between 276.5 s and 323.3 s), nor does it stop at the same user. Has anyone experienced something similar?
My guess is that I'm hitting some kind of resource limit or timeout somewhere. Things I've considered or tried:
Cloud Tasks should be allowed to run for up to an hour (according to this: https://cloud.google.com/tasks/docs/creating-appengine-handlers).
I increased the gunicorn worker timeout to 3600 to reflect this.
I have several workers running.
I tried to find out whether there were memory spikes or CPU overload, but didn't notice anything suspicious.
Sorry if I'm being too vague or missing the point entirely; I'm quite confused by this problem. Thanks for any pointers.
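For reference, the gunicorn timeout bump mentioned above is typically set on the GAE flexible entrypoint in `app.yaml`. A minimal sketch (the module path `main:app` and worker count are assumptions about the asker's project):

```yaml
runtime: python
env: flex
# --timeout 3600 matches the one-hour Cloud Tasks deadline, so gunicorn
# does not kill a worker mid-task before the task deadline is reached.
entrypoint: gunicorn -b :$PORT --workers 2 --timeout 3600 main:app
```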
Reference Solutions
Method 1:
Thank you for all the suggestions. I played around with them and found the root cause, although by accident, while reading the Firestore documentation. I had no indication that this had anything to do with Firestore.
From here: https://googleapis.dev/python/firestore/latest/collection.html I found out that Query.stream() (or Query.get()) has a timeout on the individual documents, like so:
Note: The underlying stream of responses will time out after the max_rpc_timeout_millis value set in the GAPIC client configuration for the RunQuery API. Snapshots not consumed from the iterator before that point will be lost.
So what eventually timed out was the query over all users. I came across this by chance; none of the errors I caught pointed me back towards the query. Hope this helps someone in the future!
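One common way around this stream timeout is to page through the collection with cursors, so each page is a fresh RunQuery call and no single response stream stays open long enough to hit max_rpc_timeout_millis. A minimal sketch (the `users` collection, `fetch_users` adapter, and `cluster` call are assumptions about the asker's code):

```python
def paginate(fetch_page, page_size=500):
    """Iterate over all documents by issuing a fresh query per page.

    Each page is a separate query, so no single response stream stays
    open for the whole (long) job -- avoiding the stream timeout
    described above. fetch_page(cursor, limit) must return a list and
    order results deterministically so the cursor is stable.
    """
    cursor = None
    while True:
        page = fetch_page(cursor, page_size)
        yield from page
        if len(page) < page_size:
            return  # short page means we reached the end
        cursor = page[-1]

# Hypothetical Firestore adapter (collection and client names are
# assumptions, requires google-cloud-firestore):
#
# from google.cloud import firestore
# db = firestore.Client()
#
# def fetch_users(cursor, limit):
#     q = db.collection("users").order_by("__name__").limit(limit)
#     if cursor is not None:
#         q = q.start_after(cursor)
#     return list(q.stream())  # small, bounded stream per call
#
# for user in paginate(fetch_users, page_size=200):
#     cluster(user)
```

Materializing each page with `list(...)` also matters on its own: snapshots are consumed from the iterator immediately instead of being lost when the underlying stream times out mid-analysis.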
Method 2:
Besides using Cloud Scheduler, you can inspect the logs to make sure the tasks ran properly and that there are no deadline issues. Application logs are grouped and sent to Stackdriver only after the task itself has finished, so when a task is forcibly terminated, no log may be output at all. Try catching the deadline exception so that some log is emitted; you may then see helpful information to start troubleshooting.
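The catch-and-log advice can be sketched as a small wrapper around the task body. This is a generic pattern, not the asker's code: in a real GAE handler you would likely also catch `google.api_core.exceptions.DeadlineExceeded` (an assumption about the client libraries in use), and it only helps when the runtime raises an exception rather than killing the process outright:

```python
import logging

logger = logging.getLogger("tasks")

def run_logged(task, exc_types=(Exception,)):
    """Run a task body and emit a log line even when it dies early.

    Returns True on success, False when one of exc_types was raised,
    so a Flask handler can map the result to a 200 or 500 response.
    """
    try:
        task()
    except exc_types as exc:
        # This log line is the breadcrumb you otherwise never see
        # when the task is cut short.
        logger.error("task terminated early: %r", exc)
        return False
    logger.info("task finished")
    return True
```

Note that Cloud Tasks retries handlers that return a non-2xx status, so deciding what to return after catching the exception is part of the design, not just the logging.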
(by Lennart Paar, Michael T)