gridengine clustering server的錯誤解法

※派出的job還在跑(或是跑很久沒回應),用qstat看主機(z840)狀態(states)是”E”

  • 解決方法:重新開機,然後下指令”qmod -c all.q@ubuntu”,手動關掉”E”的狀態

原文:“ The “E” state is more of a concern than the ‘au’ state. It means that there was a major problem on the compute node (with the system or the job itself). E states do not go away automatically, even if you reboot the cluster. Once you think the cluster is fine you can use the “qmod” command to clear the E state.”

Reference

已失效

比較新的: 06 SGE administration.md

※重啟gridengine服務後,派出的JOB仍卡在queue,用qstat看主機(dgx)狀態(states)是”au”

  • 通常發生在dgx還在跑job,但整台機器被重啟

  • 解決方法:重啟gridengine service

	sudo /etc/init.d/gridengine-master restart 
	sudo /etc/init.d/gridengine-exec restart

※派出的job跑很久但是沒回應,到正在run的那台機器上用htop看卻沒有job在跑 (但是qstat顯示該job是”r”),於是qdel該job又出現dr,然後卡住

  • 解決方法:下指令sudo qdel -f 該job_id

原文:“Is there a way that my users can kill their own jobs that are stuck in the dr state?

*qstat -f * *as the user, returns job is already in deletion* *yet when run as root it does get deleted*

Reference

Kill an SGE job “already in deletion”, as user

※queue status出現E,然後job都排不進去

  • 解決方法:用admin賬號,下指令qmod -c all.q

  • 接著修改可排的job數:

qconf -mq all.q

slots                 1,[dgx1-3gpu=80] 
找到這行
把80改成2
離開並儲存
  • 避免機器load_avg數值過高, queue都不派job了:

load_thresholds np_load_avg=1.75

  • 目前遇到loading_avg 144卡住不派job, 改這樣就恢復了

load_thresholds np_load_avg=10

原文:load_threshold

When threshold is exceeded, no new jobs are placed on host. Essentially it is the signal that the host is busy as determined by uptime Can use iether built-in values or values reported by custom load sensors (example: ‘logged-in-users=5’).

Default value

np_load_avg=1.75 

lead to oversubscription of computational tasks. In many cases lower threshold such as 0.75 might be better.  suspend_threshold, nsuspend, suspend_interval Similar to load_threshold but running jobs will actually be suspended/stopped. The ‘nsuspend’ param determines how many jobs per interval get suspend signals. ‘suspend_interval’ defaults to 00:05:00.

Reference

http://www.softpanorama.org/HPC/Grid_engine/Queues/some_interesting_queue_attributes.shtml

※DGX container被重開,queue的服務沒打開,輸入qstat -f會出現下面這個錯誤訊息:

error: commlib error: got select error (Connection refused)
error: unable to send message to qmaster using port 6444 on host "940ead339e09": got send error
  • 要手動打開gridengine服務,輸入指令(每輸入一次可以打qstat -f確認一下):
sudo /etc/init.d/gridengine-master restart 
sudo /etc/init.d/gridengine-master start 
sudo /etc/init.d/gridengine-exec restart 
sudo /etc/init.d/gridengine-exec start

※gridengine排程時出現錯誤 :

queue.pl: Error submitting jobs to queue (return status was 256)
Output of qsub was: Unable to run job: denied: host "f95869d0b506" is no submit host.

因為這個container的名字沒加到gridengine的host:

qconf -as f95869d0b506

用這個可以看詳細的訊息:

qstat -j <job-id>

找error reason的欄位