zalando / postgres-operator

Postgres operator creates and manages PostgreSQL clusters running in Kubernetes
https://postgres-operator.readthedocs.io/
MIT License

Is there a way to reduce WAL file size under pg_wal directory for clusters? #1743


m0sh1x2 commented 2 years ago

Hello,

I am using the base minimal cluster with one master and one worker node, with version v1.7.1 of the operator and default settings.

Currently I am noticing that the pg_wal directory grows quite a lot in some cases and eats up space very quickly, by several GB per day, without any cleanup.

Is there a way to reduce the WAL files on the minimal cluster, or is this expected behaviour?

As I understand it, backing up the cluster should automatically clear the WAL files, but sometimes they stay around for days, and in other cases they fill up the storage, for example a 1-2 GB increase per day for a database of about 400 MB.

Please let me know if I might be missing something, or whether the WAL files are expected to grow over time.

Thanks
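
For anyone debugging this, a quick way to check the WAL directory size and whether archiving is keeping up (a sketch only; the pod name acid-minimal-cluster-0 is a placeholder, and psql access as the postgres user inside the Spilo container is assumed):

    # Size of the WAL directory inside the database pod (path as used by this setup)
    kubectl exec -it acid-minimal-cluster-0 -- du -sh /home/postgres/pgdata/pgroot/data/pg_wal

    # Is WAL archiving succeeding, and when did it last fail?
    kubectl exec -it acid-minimal-cluster-0 -- psql -U postgres -c \
      "SELECT archived_count, failed_count, last_archived_wal, last_failed_wal FROM pg_stat_archiver;"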



thangamani-arun commented 2 years ago

We are also facing similar issues. When I dug into the pods, I found that the /home/postgres/pgdata/pgroot/data/pg_wal directory contains a lot of WAL files.

The WAL files are not being cleaned up, so they fill the pod's disk space and PostgreSQL stops responding. Is there any way to clean up the files, given that they are being archived to remote S3?

nitindamle commented 2 years ago

We are using a Kubernetes cluster on top of a private cloud and are also facing similar issues. When I dug into the pods, I found that the /home/postgres/pgdata/pgroot/data/pg_wal directory contains many GBs of WAL files, and the pod often crashes because there is no space left on disk.

The WAL files are not being cleaned up, so they fill the pod's disk space and PostgreSQL stops responding. Is there any way to clean up the files, given that they are being archived to remote S3?
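
When archive_command keeps failing, completed WAL segments are retained in pg_wal until they are archived successfully, and the backlog shows up as .ready marker files. A quick check from inside the pod (path as in the comments above):

    # WAL segments that are finished but not yet archived are marked with .ready files
    ls /home/postgres/pgdata/pgroot/data/pg_wal/archive_status/ | grep -c '\.ready$'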

tydra-wang commented 2 years ago

+1

CyberDem0n commented 2 years ago

There are more or less three reasons why pg_wal is growing:

  1. checkpoints not happening (very unlikely)
  2. unused replication slot
  3. failing archive_command

You need to investigate, find the reason, and eliminate the problem. The starting point would be Postgres logs located in $PGDATA/../pg_log, SELECT * FROM pg_replication_slots, and ps auxwf output.
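
A sketch of those starting points, run from inside the database pod (paths and the postgres role as in the earlier comments):

    # Postgres logs: look for archive_command failures and checkpoint warnings
    ls -lt /home/postgres/pgdata/pgroot/pg_log/ | head

    # Replication slots: an inactive slot with an old restart_lsn keeps WAL from being removed
    psql -U postgres -c "SELECT slot_name, slot_type, active, restart_lsn FROM pg_replication_slots;"

    # Process tree: shows wal senders, the archiver, and anything stuck
    ps auxwf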

tydra-wang commented 2 years ago

How can I figure out the maximum disk space pg_wal could consume? @CyberDem0n
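
The thread does not answer this directly, but as a rough rule pg_wal stays around max_wal_size plus whatever is retained for replication slots and wal_keep_size, and it can grow past that whenever archiving or a slot consumer falls behind. The relevant settings can be inspected like this (a sketch, assuming psql access; wal_keep_size and max_slot_wal_keep_size exist only on PostgreSQL 13+):

    # Settings that bound how much WAL is normally kept in pg_wal
    psql -U postgres -c "SELECT name, setting, unit FROM pg_settings
      WHERE name IN ('max_wal_size', 'min_wal_size', 'wal_keep_size', 'max_slot_wal_keep_size');"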

spreeker commented 9 months ago

Hi, for anyone landing here trying to figure out why their database cluster is running out of space:

Executing du -h -d 4 showed me that the WAL folder ./pgdata/pgroot/data/pg_wal had become really large. The reason was that the replica nodes were no longer healthy and catching up.

I solved it by executing patronictl reinit on a node and selecting the unhealthy replicas. You can see the status of the replicas with patronictl list. When the nodes were healthy again, my WAL folder size went from 144 GB to near zero on the database nodes.
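
The steps above as commands (a sketch; the cluster and member names are placeholders, run from inside one of the database pods):

    # Find what is eating the space
    du -h -d 4 /home/postgres/pgdata

    # Check member health and replication lag
    patronictl list

    # Re-initialize an unhealthy replica so it is rebuilt from the leader
    patronictl reinit acid-minimal-cluster acid-minimal-cluster-1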

thangamani-arun commented 8 months ago

@spreeker: I agree that procedure works. But since these are containers, unless you monitor the disk usage of the pods/containers and the cluster status, you will not be able to take such action. It may lead to data loss as well.
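
For basic monitoring from outside the pods, checks along these lines can be scripted into alerting (a sketch; the namespace and pod names are placeholders):

    # Disk usage of the data volume in each cluster pod
    kubectl -n default exec acid-minimal-cluster-0 -- df -h /home/postgres/pgdata

    # Member / replication status as reported by Patroni
    kubectl -n default exec acid-minimal-cluster-0 -- patronictl list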

spreeker commented 8 months ago

Yep very inconvenient.