A tale of zombie processes in a Docker container
While testing our product locally, my machine's CPU usage would spike above 95% after interacting with it for a while, sometimes leaving the machine unusable for long periods. Once I figured out which usage pattern led to these spikes, I went looking for the cause.
System description
While I won't give too much information about the product, I will say that the component in question is a worker that processes a task queue with 4 threads. It runs locally inside a Docker container and uses a very simplified version of the production code to reduce waiting times (among other reasons).
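For illustration only, the worker is started with something along these lines (the application name, pool and log level here are made up for this sketch, not the real configuration):
# Hypothetical Celery worker processing the task queue with 4 concurrent threads.
celery -A tasks worker --pool=threads --concurrency=4 --loglevel=info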
Symptoms
This problem was never reported in production and only became visible when running a specific task multiple times simultaneously. Once the tasks finished, the CPU usage would come down again. By repeating this cycle of triggering the tasks and waiting for them to finish, the CPU usage would climb higher with each iteration and take longer to recover, to the point where it would not recover at all. Restarting the container brought the system back to its initial state.
Debugging process
The first step was to check that it was actually this testing container that was consuming the CPU. For that I monitored the container with the command below:
while true; do docker stats --no-stream | grep testing_container | tee --append stats.txt; sleep 1; done
That let me watch the stats in the terminal while logging them to a file at the same time, which looked something like this:
CONTAINER ID NAME CPU % MEM USAGE / LIMIT MEM % NET I/O BLOCK I/O PIDS
07085618e6f9 testing_container 0.08% 1.089GiB / 15.29GiB 7.12% 0B / 0B 537MB / 4.43MB 37
07085618e6f9 testing_container 0.40% 1.089GiB / 15.29GiB 7.12% 0B / 0B 537MB / 4.43MB 37
07085618e6f9 testing_container 174.94% 1.185GiB / 15.29GiB 7.75% 0B / 0B 541MB / 4.46MB 53
07085618e6f9 testing_container 459.66% 1.942GiB / 15.29GiB 12.70% 0B / 0B 541MB / 4.46MB 213
07085618e6f9 testing_container 458.73% 2.198GiB / 15.29GiB 14.37% 0B / 0B 541MB / 4.46MB 213
07085618e6f9 testing_container 542.71% 2.099GiB / 15.29GiB 13.73% 0B / 0B 541MB / 4.46MB 213
07085618e6f9 testing_container 538.65% 2.139GiB / 15.29GiB 13.99% 0B / 0B 541MB / 4.46MB 213
07085618e6f9 testing_container 495.53% 2.073GiB / 15.29GiB 13.55% 0B / 0B 541MB / 4.46MB 213
07085618e6f9 testing_container 489.07% 2.126GiB / 15.29GiB 13.90% 0B / 0B 541MB / 4.46MB 213
...
So, yes, as suspected, the Docker container was hogging the CPU.
Checking container processes
Once the testing container was found guilty, the next step was to discover what inside it was at fault. So, in a similar manner, I logged the processes inside the container while reproducing the error. This command had to run inside the container, and you will note the filter for the task queue processor:
while true; do top -b | grep celery | tee --append top.txt; sleep 1; done
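To run it I first opened a shell inside the container with the standard docker exec invocation (the container name is the one assumed above):
docker exec -it testing_container bash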
The top loop gave the following output (truncated):
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
2248 root 20 0 1267016 232920 109484 R 48.3 1.5 0:01.45 <faulty-process-name-a>
2251 root 20 0 1267016 234828 111320 R 47.7 1.5 0:01.43 <faulty-process-name-a>
2243 root 20 0 1267016 232912 109480 R 46.3 1.5 0:01.39 <faulty-process-name-a>
2253 root 20 0 1264968 233488 111400 R 46.0 1.5 0:01.38 <faulty-process-name-a>
2179 root 20 0 4360 3124 2872 S 0.0 0.0 0:00.00 <faulty-process-name-b>
2181 root 20 0 4360 3220 2968 S 0.0 0.0 0:00.00 <faulty-process-name-b>
2183 root 20 0 4360 3216 2968 S 0.0 0.0 0:00.00 <faulty-process-name-b>
2209 root 20 0 4360 3292 3044 S 0.0 0.0 0:00.00 <faulty-process-name-b>
2251 root 20 0 1751036 422712 165728 R 132.2 2.6 0:05.41 <faulty-process-name-a>
...
From this output it looked like there was an increasing number of processes with the faulty process name: the list grew each time the task queue was triggered.
These processes should terminate after the task finishes, but for some reason they stayed in an interruptible sleep state.
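Those states can be double-checked with a generic ps invocation along these lines (not the exact command I ran, and using the same placeholder process name as above):
# Show PID, parent PID, state, elapsed time and command name;
# in the STAT column S means interruptible sleep, R running and Z zombie.
ps -eo pid,ppid,stat,etime,comm | grep '<faulty-process-name-a>'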
The process in question runs an external software dependency that does some image rendering, a CPU-intensive task.
Talking to other teams that heavily use this dependency locally, none of them had seen this issue before. I also ran some tests locally, outside of the container, and to my surprise all processes terminated as expected.
Research findings
Here I took a small guess. Looking at the differences between my system and the Docker container, one of the things I could tell was that there is no systemd, or any other init system, running inside the container.
Searching online for zombie processes in Docker containers, you quickly arrive at tini. Tini runs as PID 1 inside the container and spawns your command as a child process; it acts as a process reaper and forwards signals to that child. Its author wrote a very nice post with more information about its internals.
According to that post, my Docker container had no way to reap zombie processes. Could this be related?
Solving the problem
I modified the docker command to run with tini. This can be done by passing the --init flag to docker run.
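For reference, the invocation looks roughly like this (the image and container names are placeholders):
# --init makes Docker run tini as PID 1 inside the container,
# so it can reap orphaned children and forward signals.
docker run --init --name testing_container testing_image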
But wait: there was not a single zombie process in the logs. Was this really solving our issue?
Going back to the logs
I went back to analyze the output of the top command further, this time focusing on the last entries:
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
...
2243 root 20 0 1797372 402416 163856 R 130.3 2.5 0:16.44 <faulty-process-name-a>
2253 root 20 0 1790204 422508 165808 R 127.7 2.6 0:16.46 <faulty-process-name-a>
2248 root 20 0 1804540 438272 163596 R 126.3 2.7 0:16.59 <faulty-process-name-a>
2251 root 20 0 1784060 426068 165812 R 124.7 2.7 0:16.73 <faulty-process-name-a>
2179 root 20 0 4360 3124 2872 S 0.0 0.0 0:00.00 <faulty-process-name-b>
2181 root 20 0 4360 3220 2968 S 0.0 0.0 0:00.00 <faulty-process-name-b>
2183 root 20 0 4360 3216 2968 S 0.0 0.0 0:00.00 <faulty-process-name-b>
2209 root 20 0 4360 3292 3044 S 0.0 0.0 0:00.00 <faulty-process-name-b>
2243 root 20 0 1797372 406316 163856 R 122.9 2.5 0:20.14 <faulty-process-name-a>
2248 root 20 0 1804540 439880 163596 S 122.9 2.7 0:20.29 <faulty-process-name-a>
2253 root 20 0 1790204 424976 165808 R 122.6 2.7 0:20.15 <faulty-process-name-a>
2251 root 20 0 1784060 401904 165812 R 119.9 2.5 0:20.34 <faulty-process-name-a>
2179 root 20 0 4360 3124 2872 S 0.0 0.0 0:00.00 <faulty-process-name-b>
2181 root 20 0 4360 3220 2968 S 0.0 0.0 0:00.00 <faulty-process-name-b>
2183 root 20 0 4360 3216 2968 S 0.0 0.0 0:00.00 <faulty-process-name-b>
2209 root 20 0 4360 3292 3044 S 0.0 0.0 0:00.00 <faulty-process-name-b>
2243 root 20 0 1797372 398724 163856 R 131.0 2.5 0:24.07 <faulty-process-name-a>
Here I could discern a pattern. The number of processes was increasing by 4 each time I ran the task that caused the issue. That particular task asks the worker to render four different images at the same time, and since the worker has 4 threads, 4 processes were expected to be working simultaneously.
Looking closely, the accumulated running time of the processes increases by about 4 minutes each time, which was exactly the time between my runs.
How could tini have fixed our problem?
Well, let's look at the PIDs. There are only 8 distinct PIDs, repeating themselves over and over. So we did not really have a growing number of processes; I had misinterpreted the logs. My mistake was to think that the number of processes was increasing, when what was actually happening was that the command I used to store the logs kept appending new snapshots to the same file.
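A quick sanity check on the log file makes this obvious (again with the placeholder process name); for the excerpt above it prints 4 rendering PIDs, not a growing number:
# Count how many distinct PIDs ever appear for the rendering processes.
grep '<faulty-process-name-a>' top.txt | awk '{print $1}' | sort -u | wc -l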
Lessons learnt
- Image rendering is a very CPU-intensive process.
- Docker containers lack a process reaper by default. Adding one is rather simple, for instance with the --init flag shown above.