Hi, my job was killed and I do not know why. The log shows `17 Killed`. My log file is 15K and my model state is less than 1G. Thanks, Wentao

Created by wentaoz1
The unix server running your submission had this content in `/var/log/messages` at the time the container was stopped:

```
Jan 22 23:26:25 bm01 kernel: python invoked oom-killer: gfp_mask=0xd0, order=0, oom_score_adj=0
Jan 22 23:26:25 bm01 kernel: python cpuset=189dcfa13dec957e92ad201af74fd712fe06c4e4bb2d177ae0f6485c693c6d25 mems_allowed=0-1
Jan 22 23:26:25 bm01 kernel: CPU: 14 PID: 6136 Comm: python Tainted: P B OE ------------ 3.10.0-514.2.2.el7.x86_64 #1
Jan 22 23:26:25 bm01 kernel: Hardware name: Supermicro PIO-628U-TR4T+-ST031/X10DRU-i+, BIOS 2.0 12/17/2015
Jan 22 23:26:25 bm01 kernel: ffff8840b0b00fb0 0000000075008236 ffff8840be9dfcc0 ffffffff816860cc
Jan 22 23:26:25 bm01 kernel: ffff8840be9dfd50 ffffffff81681077 ffff887f6d7bb180 0000000000000001
Jan 22 23:26:25 bm01 kernel: 0000000000000000 0000000000000000 0000000000000046 ffffffff811842b6
Jan 22 23:26:25 bm01 kernel: Call Trace:
Jan 22 23:26:25 bm01 kernel: [] dump_stack+0x19/0x1b
Jan 22 23:26:25 bm01 kernel: [] dump_header+0x8e/0x225
Jan 22 23:26:25 bm01 kernel: [] ? find_lock_task_mm+0x56/0xc0
Jan 22 23:26:25 bm01 kernel: [] oom_kill_process+0x24e/0x3c0
Jan 22 23:26:25 bm01 kernel: [] ? has_capability_noaudit+0x1e/0x30
Jan 22 23:26:25 bm01 kernel: [] mem_cgroup_oom_synchronize+0x551/0x580
Jan 22 23:26:25 bm01 kernel: [] ? mem_cgroup_charge_common+0xc0/0xc0
Jan 22 23:26:25 bm01 kernel: [] pagefault_out_of_memory+0x14/0x90
Jan 22 23:26:25 bm01 kernel: [] mm_fault_error+0x68/0x12b
Jan 22 23:26:25 bm01 kernel: [] __do_page_fault+0x395/0x450
Jan 22 23:26:25 bm01 kernel: [] do_page_fault+0x35/0x90
Jan 22 23:26:25 bm01 kernel: [] page_fault+0x28/0x30
Jan 22 23:26:25 bm01 kernel: Task in /docker/189dcfa13dec957e92ad201af74fd712fe06c4e4bb2d177ae0f6485c693c6d25 killed as a result of limit of /docker/189dcfa13dec957e92ad201af74fd712fe06c4e4bb2d177ae0f6485c693c6d25
Jan 22 23:26:25 bm01 kernel: memory: usage 209715200kB, limit 209715200kB, failcnt 9318616
Jan 22 23:26:25 bm01 kernel: memory+swap: usage 210369004kB, limit 419430400kB, failcnt 0
Jan 22 23:26:25 bm01 kernel: kmem: usage 43788kB, limit 9007199254740988kB, failcnt 0
Jan 22 23:26:25 bm01 kernel: Memory cgroup stats for /docker/189dcfa13dec957e92ad201af74fd712fe06c4e4bb2d177ae0f6485c693c6d25: cache:40KB rss:209671372KB rss_huge:117321728KB mapped_file:0KB swap:653804KB inactive_anon:3343464KB active_anon:206327908KB inactive_file:0KB active_file:40KB unevictable:0KB
Jan 22 23:26:25 bm01 kernel: [ pid ] uid tgid total_vm rss nr_ptes swapents oom_score_adj name
Jan 22 23:26:25 bm01 kernel: [ 6112] 0 6112 2910 5 12 50 0 train.sh
Jan 22 23:26:25 bm01 kernel: [ 6136] 0 6136 195079258 52338849 103215 262026 0 python
Jan 22 23:26:25 bm01 kernel: Memory cgroup out of memory: Kill process 12797 (python) score 1000 or sacrifice child
Jan 22 23:26:25 bm01 kernel: Killed process 6136 (python) total-vm:780317032kB, anon-rss:209276640kB, file-rss:78756kB, shmem-rss:0kB
Jan 22 23:26:34 bm01 kernel: XFS (dm-21): Unmounting Filesystem
Jan 22 23:26:34 bm01 nvidia-docker-plugin: /usr/bin/nvidia-docker-plugin | 2017/01/22 23:26:34 Received unmount request for volume 'nvidia_driver_367.48'
```

I don't understand everything in this log file but it seems that you ran out of memory.
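To make the numbers in the kernel log above easier to read, here is a minimal sketch that converts the two `memory` accounting lines into GiB. The function name `parse_cgroup_line` and the regular expression are my own, not part of any challenge tooling, and the conversion assumes the kernel's `kB` means 1024 bytes.

```python
import re

# Two accounting lines copied from the kernel log above.
LOG_LINES = [
    "memory: usage 209715200kB, limit 209715200kB, failcnt 9318616",
    "memory+swap: usage 210369004kB, limit 419430400kB, failcnt 0",
]

# Matches "<counter>: usage <n>kB, limit <n>kB, failcnt <n>".
LINE_RE = re.compile(
    r"(?P<counter>[\w+]+): usage (?P<usage>\d+)kB, "
    r"limit (?P<limit>\d+)kB, failcnt (?P<failcnt>\d+)"
)

# Assumption: "kB" in the kernel log means 1024 bytes.
KB_PER_GIB = 1024 * 1024


def parse_cgroup_line(line):
    """Return the counter name, usage and limit in GiB, and the failure count."""
    match = LINE_RE.search(line)
    if match is None:
        raise ValueError(f"not a cgroup accounting line: {line!r}")
    usage_kb = int(match["usage"])
    limit_kb = int(match["limit"])
    return {
        "counter": match["counter"],
        "usage_gib": usage_kb / KB_PER_GIB,
        "limit_gib": limit_kb / KB_PER_GIB,
        "at_limit": usage_kb >= limit_kb,
        "failcnt": int(match["failcnt"]),
    }


reports = [parse_cgroup_line(line) for line in LOG_LINES]
for report in reports:
    print(
        f"{report['counter']}: {report['usage_gib']:.1f} GiB used "
        f"of {report['limit_gib']:.1f} GiB, at limit: {report['at_limit']}"
    )
```

Read this way, the `memory` counter had reached its limit when the kernel killed the `python` process, while the `memory+swap` counter was still below its own limit.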
Thanks very much! I will analyze my code and resource usage. Thanks!
Line 57 of your `/train.sh` file says:

```
python run_cnn_k_mil_new.py
```

So I'm guessing there's something triggered by your Python code. I found this discussion:

> If the user or sysadmin did not kill the program the kernel may have. The kernel would only kill a process under exceptional circumstances such as extreme resource starvation (think mem+swap exhaustion).

http://stackoverflow.com/questions/726690/who-killed-my-process-and-why#726762
@wentaoz1 Our records showed that the Docker container for submission 8057770 completed with no error.

> The killed information is in the line 1569 of the log file.

Yes, I see that line in the logs:

```
STDERR: /train.sh: line 57: 17 Killed
```

At this point I'm not sure what it means.
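For what it's worth, a shell prints a message of the form `line N: PID Killed` when a command it started is terminated by SIGKILL, so the `17` is the process ID the shell saw for the command on line 57. The sketch below is not the challenge code, just a small POSIX-only illustration of how a parent process can tell that its child was killed by a signal rather than exiting on its own. The helper name `describe_exit` is hypothetical.

```python
import signal
import subprocess
import sys


def describe_exit(returncode):
    """Turn a subprocess return code into a short description."""
    # On POSIX, subprocess reports death by signal N as return code -N.
    if returncode < 0:
        return f"killed by signal {signal.Signals(-returncode).name}"
    return f"exited with status {returncode}"


# Stand-in for a training script: the child sends SIGKILL to itself,
# much as the kernel does to a process that exceeds its memory limit.
child_code = "import os, signal; os.kill(os.getpid(), signal.SIGKILL)"
result = subprocess.run([sys.executable, "-c", child_code])

message = describe_exit(result.returncode)
print(message)
```

A wrapper script that keeps running after such a failure can still finish with a success status of its own, which may be one reason a container is recorded as completed without error. That part is a guess and is not confirmed anywhere in this thread.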
Dear Wentao, Apologies for the delay in response. Unfortunately, I am unable to answer your question, but I have redirected your question to the other challenge organizers. Thanks in advance for your patience. Best, Thomas
Could anyone answer my questions? Thanks a lot! @thomas.yu
Dear deepmammo:

Your Submission to the Digital Mammography challenge (submission ID 8057770) has completed its training phase. Your logs are available here: https://www.synapse.org/#!Synapse:syn8057771. The state of your model has been uploaded to syn8058111 which you may use in the challenge scoring phase. You may also download the archive here: https://www.synapse.org/#!Synapse:syn8058111

Please direct any questions to the challenge forum, https://www.synapse.org/#!Synapse:syn4224222/discussion.

Sincerely,
Challenge Administration

The killed information is in the line 1569 of the log file. Thanks very much!
Dear Wentao, Please provide me with the email that you were sent about your submission being killed, the one that contains your submission ID, your log file Synapse ID, etc. Best, Thomas
