Notes on PostgreSQL engine metrics
Recently a colleague raised the alarm because memory usage on our main PostgreSQL instance was around 80%. We had some nice conversations about how Linux manages RAM and how to interpret our observability dashboards. This is an interesting topic that comes up regularly, and I hope this post helps you understand why such high usage is not an issue.
First, 80% RAM usage is normal on any long-running Linux machine. It does not mean the system is about to run out of memory and collapse: most of that memory is cache, and the OS will release it when processes request more.
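On Linux the figure to trust is MemAvailable, which counts reclaimable cache on top of truly free pages. A minimal sketch reading both from /proc/meminfo:

```shell
# Compare "free" memory with "available" memory on Linux.
# MemFree counts only completely unused pages; MemAvailable also
# includes cache the kernel can reclaim on demand, which is why a
# machine at "80% used" usually still has plenty of headroom.
awk '/^MemFree:/ {free=$2} /^MemAvailable:/ {avail=$2}
     END {printf "free: %d kB, available: %d kB\n", free, avail}' /proc/meminfo
```

On a busy server you will typically see `available` far above `free`: the difference is cache the OS can hand back at any time.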
How PostgreSQL uses RAM
PostgreSQL mainly uses RAM for two purposes:
- Caching data. RAM used for this purpose can be reclaimed by the OS at any time, so the OS frees space here whenever other tasks need memory. It comes in three flavours:
  - Active cache: data accessed recently, such as frequently used tables and index pages.
  - Inactive cache: data that has not been used in a while.
  - Memory mapped: RAM mapped directly to files on disk.
- Resident Set Size (RSS). This is memory used by running processes and cannot be reclaimed by the OS. High RSS usage means a higher risk of out-of-memory (OOM) events.
High RSS usage is the real risk
RSS memory cannot be reclaimed by the OS; the server has to wait until the processes using it finish and release it.
Each PostgreSQL process uses RSS memory to store its own data: heap, stack, hash tables and so on. The amount varies per process, and while it is usually a small portion, every process needs its own space. Since each PostgreSQL connection spawns a process, the number of connections, and therefore processes, can grow very high if no limit is set, creating memory pressure.
Very high RSS usage can trigger an out-of-memory (OOM) event, which degrades server performance and can even force the OS to kill individual PostgreSQL processes, or the engine itself.
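To see how much RSS the backend processes are actually holding, you can sum the RSS column reported by ps. A sketch, assuming the backends show up under the default command name `postgres`:

```shell
# Sum the resident set size (RSS, in kB) of every process whose
# command name matches $1. "postgres" is an assumption for a stock
# installation; the function prints 0 if no such process is running.
total_rss_kb() {
  ps -C "$1" -o rss= | awk '{sum += $1} END {print sum + 0}'
}
total_rss_kb postgres
```

Watching this number grow with the connection count makes the pressure problem above very concrete.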
The number of active connections can be limited with a connection pooler such as PgBouncer.
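As an illustration of the idea, a pooler lets many clients share a small, fixed number of backend processes. A minimal pgbouncer.ini sketch; all names and limits below are examples, not recommendations:

```ini
; Illustrative pgbouncer.ini fragment: database name, host and limits
; are placeholders to show the shape of the config.
[databases]
mydb = host=127.0.0.1 port=5432 dbname=mydb

[pgbouncer]
listen_port = 6432
pool_mode = transaction
max_client_conn = 500    ; clients PgBouncer will accept
default_pool_size = 20   ; backend processes per database/user pair
```

Here up to 500 clients would be multiplexed over at most 20 PostgreSQL processes, keeping the RSS footprint bounded.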
What to monitor?
- RAM usage. Ideally broken down by type (cache, RSS, free).
- CPU usage. It should stay below roughly 30%. Sustained higher values mean the server is too busy, whether from expensive queries or from a high number of connections, and everything will slow down.
- IOPS. Depending on the database use case, spikes often correlate with queries that read a lot of data and may indicate poor query performance.
- Disk. Keep about 20% of the disk free.
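The disk rule of thumb above is easy to turn into a check. A sketch, where the mount point is an assumption to adjust for your layout:

```shell
# Flag a filesystem when less than 20% of its space is free,
# i.e. when usage climbs above 80%. Uses GNU df's --output option.
check_disk() {
  used=$(df --output=pcent "$1" | tail -n 1 | tr -dc '0-9')
  if [ "$used" -gt 80 ]; then
    echo "WARNING: $1 is ${used}% full"
  else
    echo "OK: $1 is ${used}% full"
  fi
}
check_disk /
```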
Database engine metrics:
- Active connections. A very high number of active connections puts pressure on memory and CPU.
- Idle transactions. This number should be low; a high count may indicate application logic or locking issues.
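Both counts can be pulled from the standard pg_stat_activity view; a sketch of the query, run from any client with read access (needs a live server, so not shown with output):

```sql
-- Connections grouped by state; 'idle in transaction' is the one to watch.
SELECT state, count(*)
FROM pg_stat_activity
WHERE state IS NOT NULL
GROUP BY state
ORDER BY count(*) DESC;
```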