Recently I deployed my node app on one of the overloaded Linux servers I was using. The app uses the node cluster library, and it sends me a notification whenever a child process is killed and then respawns the child.
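
For a bit of context, here is a minimal sketch of what such a cluster master can look like, assuming Node's built-in cluster module; notifyMe() is a hypothetical stand-in for whatever sends the alert email, and ./app stands for the actual server code:

    // cluster-master.js -- a minimal sketch, not the exact production setup
    var cluster = require('cluster');
    var os = require('os');

    // notifyMe() is a hypothetical stand-in for whatever sends the alert email
    function notifyMe(message) {
      console.error('[alert] ' + message);
    }

    if (cluster.isMaster) {
      // fork one worker per CPU core
      os.cpus().forEach(function () {
        cluster.fork();
      });

      cluster.on('exit', function (worker, code, signal) {
        // signal is 'SIGKILL' when the OOM Killer takes a worker down
        notifyMe('worker ' + worker.process.pid + ' died (signal: ' + signal + ')');
        cluster.fork(); // respawn the child
      });
    } else {
      require('./app'); // each worker runs the actual server
    }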

So ever since the deployment, every day around lunch, I would get a hella-lot of emails about child processes being killed. The weird thing was that every one of the child processes was receiving SIGKILL, so something was actively killing them. During the carnage, which would last around 2-3 hours, even processes such as npm install would be killed just as mercilessly.

Shutting down a few processes did help: it slowed down the killing, but did not stop it completely. Only the passing of time would stop it, so from late evening until the next day's lunch, all was quiet again.

The next day, I figured out what else was happening at the same time. It turns out another team was uploading a huge dataset, which also involved a lot of processing, so the system was going OOM (Out of Memory) and the kernel's OOM Killer was killing my processes. Looking at the memory stats:

free -m
             total       used       free     shared    buffers     cached
Mem:          2008       1995         13          0          0         23
-/+ buffers/cache:       1971         36
Swap:         1535       1535          0

This server was super-constrained for memory, and the kernel OOM Killer kicked in to reclaim it. Obviously not an ideal amount of RAM, and even worse swap space, but this was a sandbox server, so a bit of instability didn't matter so much. Looking at more logs for the OOM Killer:

grep -i 'kill' /var/log/messages | awk -F 'kernel:' '{print $2}'
 [1776235.313736] PassengerHel/ei invoked oom-killer: gfp_mask=0x201da, order=0, oom_adj=0
 [1776235.313791]  [<ffffffff810b8dec>] oom_kill_process+0xcc/0x2f0
 [1776235.322903] Out of memory: kill process 12050 (ruby) score 113591 or a child
 [1776235.322906] Killed process 12050 (ruby)
 [1776766.287970] python invoked oom-killer: gfp_mask=0x201da, order=0, oom_adj=0
 [1776766.288033]  [<ffffffff810b8dec>] oom_kill_process+0xcc/0x2f0
 [1776766.298348] Out of memory: kill process 12121 (ruby) score 112262 or a child
 [1776766.298350] Killed process 12121 (ruby)
 [1776766.304348] python invoked oom-killer: gfp_mask=0x201da, order=0, oom_adj=0
 [1776766.304389]  [<ffffffff810b8dec>] oom_kill_process+0xcc/0x2f0
 [1776766.313545] Out of memory: kill process 8259 (python) score 85423 or a child
 [1776766.313548] Killed process 8259 (python)
 ...truncated

As you can see, the OOM Killer killed a number of processes. The best way to resolve this OOM killing is to add more RAM and swap space to the server.
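
If adding memory isn't an option right away, one stop-gap I'm aware of (not something I did on this server) is to lower the OOM score of the processes you care about, so the kernel prefers other victims; that score and the oom_adj value are exactly what shows up in the log above. Here is a rough sketch, assuming a Linux /proc filesystem and sufficient privileges; the file paths are real kernel knobs, but the values are only illustrative:

    // oom-protect.js -- a sketch only; assumes Linux /proc and enough privileges
    var fs = require('fs');

    function protectFromOomKiller() {
      // Newer kernels expose /proc/<pid>/oom_score_adj (-1000..1000); the log
      // above shows the older oom_adj knob (-17..15), so fall back to that.
      // Negative values normally require root / CAP_SYS_RESOURCE.
      var knobs = [
        ['/proc/self/oom_score_adj', '-500'],
        ['/proc/self/oom_adj', '-10']
      ];
      for (var i = 0; i < knobs.length; i++) {
        try {
          fs.writeFileSync(knobs[i][0], knobs[i][1]);
          return true;
        } catch (err) {
          // knob missing or permission denied, try the next one
        }
      }
      return false;
    }

    protectFromOomKiller();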

You can find more helpful references on the OOM Killer here:

oom-killer

will-linux-start-killing-my-processes-without-asking-me-if-memory-gets-short

where-can-i-see-a-list-of-kernel-killed-processes


