System Load

A few weeks ago we had some performance troubles with certain services. While looking into our monitoring tools I came across the CPU load metric, a metric I had not paid much attention to before. It seemed like the right time to find out why the values of this metric looked so odd.

Usually a metric takes a range of values between 0 and 100, or 0 and 10. The CPU load, however, did not seem to have a hard upper bound.

My first question was: what does the system load measure?

As with many other topics, there is plenty of information online. I saw some contradictory claims, so I needed a reliable source, and what is better than the official docs?

How to see the system load?

There are several commands on a Linux system that display the system load: top, w and uptime.

Since uptime is the one that displays the least data, I started with its documentation.

Running uptime will display something like:

17:15:35 up 18 days,  7:23,  9 users,  load average: 1.07, 1.77, 2.07
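These same numbers can also be read directly from the proc filesystem, which is what these tools query under the hood. A quick check on most Linux systems (the values below are only an illustration, they will differ on your machine):

cat /proc/loadavg
1.07 1.77 2.07 2/1234 56789

The first three fields are the 1, 5 and 15 minute load averages; the fourth is the number of currently runnable scheduling entities over the total number of entities, and the last one is the PID of the most recently created process.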

What does the system load measure?

From the documentation link above:

System load averages is the average number of processes that are either in a runnable or uninterruptable state. A process in a runnable state is either using the CPU or waiting to use the CPU. A process in uninterruptable state is waiting for some I/O access, eg waiting for disk. The averages are taken over the three time intervals (ed. 1, 5 and 15 mins). Load averages are not normalized for the number of CPUs in a system, so a load average of 1 means a single CPU system is loaded all the time while on a 4 CPU system it means it was idle 75% of the time.

So the system load is the average number of processes that are either running or in uninterruptible sleep. Potentially we could have dozens of such processes.

This explains why the value is not a percentage as I had expected.

The next question that came to my mind was: what is an acceptable value? For the CPU to keep up with the processes, the value should be less than the number of processes the machine can run in parallel. Which raised another question: how many threads can the machine run at the same time?

Calculate max number of parallel threads

To find this out I made use of lscpu.

From the output of the command I got the following interesting lines:

CPU(s):                          8
On-line CPU(s) list:             0-7
Thread(s) per core:              2
Core(s) per socket:              4
Socket(s):                       1

The total number of threads can be deduced as:

Total number of threads = cores/socket * sockets * threads/core
8 = 4 * 1 * 2
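If you do not want to do the multiplication yourself, the same total can be obtained directly, for example with nproc, which prints the number of processing units available to the process:

nproc
8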

Giving meaning to system load values

At this point we are in a better position to evaluate the value of the system load.

In the example above, the system load averages were between 1.07 and 2.07. With a maximum of 8 parallel threads, these are small values: even the 15-minute average of 2.07 only amounts to 2.07 / 8 ≈ 26% of the available capacity. These numbers mean that the CPU is not a bottleneck at the moment.

If your servers always display a small value compared to their number of threads, you can think about scaling down to cut some costs.

On the other hand, if the number is consistently above the number of threads, you might want to consider a larger CPU.
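As a minimal sketch of how this comparison could be automated (assuming /proc/loadavg, nproc and bc are available, which is the case on most Linux distributions):

# Warn when the 1-minute load average exceeds the number of CPUs
load=$(cut -d ' ' -f1 /proc/loadavg)
cpus=$(nproc)
if [ "$(echo "$load > $cpus" | bc)" -eq 1 ]; then
  echo "Load $load exceeds $cpus CPUs, the CPU may be a bottleneck"
else
  echo "Load $load is below $cpus CPUs, the CPU is keeping up"
fi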