zubkov-andrei / pg_profile

Postgres historic workload reports
Other
236 stars 33 forks source link

ERROR: Duplicate key value violation when taking sample #55

Closed annlund closed 1 year ago

annlund commented 1 year ago

Hi, and thanks for a great project!

In my setup pg_profile v4.1 is running in a docker container with postgres 14.3, and collecting samples from 2 dbs on remote servers. For one of the dbs it's working perfectly fine. For the other db, I get errors when taking a sample. This specific db is a development db meant for testing and trying out stuff, and is completely deleted each night and then restored from a backup of the production db. My guess is that this error has something to do with that? If I drop the server in pg_profile and create it again, samples can be taken for the rest of the day (until the db is deleted in the night again).

What is the cause of the error? Is there any workaround I can use?

Thanks!


SQL Error [23505]: ERROR: duplicate key value violates unique constraint "uk_sample_settings_sysid"
  Detail: Key (server_id, name)=(4, system_identifier) already exists.
  Where: SQL statement "INSERT INTO sample_settings(
      server_id,
      first_seen,
      setting_scope,
      name,
      setting,
      reset_val,
      boot_val,
      unit,
      sourcefile,
      sourceline,
      pending_restart
    )
    SELECT
      s.server_id as server_id,
      s.sample_time as first_seen,
      cur.setting_scope,
      cur.name,
      cur.setting,
      cur.reset_val,
      cur.boot_val,
      cur.unit,
      cur.sourcefile,
      cur.sourceline,
      cur.pending_restart
    FROM
      sample_settings lst JOIN (
        -- Getting last versions of settings
        SELECT server_id, name, max(first_seen) as first_seen
        FROM sample_settings
        WHERE server_id = sserver_id AND (
          NOT settings_refresh
          -- system identifier shouldn't have a duplicate in case of version change
          -- this breaks export/import procedures, as those are related to this ID
          OR name = 'system_identifier'
        )
        GROUP BY server_id, name
      ) lst_times
      USING (server_id, name, first_seen)
      -- Getting current settings values
      RIGHT OUTER JOIN dblink('server_connection',server_query
          ) AS cur (
            setting_scope smallint,
            name text,
            setting text,
            reset_val text,
            boot_val text,
            unit text,
            sourcefile text,
            sourceline integer,
            pending_restart boolean
          )
        USING (setting_scope, name)
      JOIN samples s ON (s.server_id = sserver_id AND s.sample_id = s_id)
    WHERE
      cur.reset_val IS NOT NULL AND (
        lst.name IS NULL
        OR cur.reset_val != lst.reset_val
        OR cur.pending_restart != lst.pending_restart
        OR lst.sourcefile != cur.sourcefile
        OR lst.sourceline != cur.sourceline
        OR lst.unit != cur.unit
      )"
PL/pgSQL function take_sample(integer,boolean) line 176 at SQL statement
PL/pgSQL function take_sample(name,boolean) line 9 at RETURN
zubkov-andrei commented 1 year ago

Hi Annika!

Thank you for your report. It should never insert a new value for system_identifier because the identifier doesn't change for a cluster. Maybe system identifier can change during your restore operation.. However pg_profile must check this and the error message should say that identifier is changed. Anyway right now it seems like a bug. Can you check if the system identifier of your development database gets changed during restore? Use the this query to check it:

select system_identifier from pg_control_system()

Please run it before restore procedure and right after it. Check if the values are the same.

annlund commented 1 year ago

Hi!

Yes, you are right, the system identifier changes when I delete and restore the db.

Do you think there are any workarounds, other than changing the restoration procedure so that the system identifier doesn't change?

Thanks for the support!

zubkov-andrei commented 1 year ago

Hi Annika!

This bug will be fixed in the next release - the error message will point directly to system_identifier change.

This behavior of pg_profile is necessary because pg_profile is able to collect samples from remote servers. Due to changes in configuration of DHCP or DNS services (or something else) it is possible that old connection string is pointing to the different cluster now. In such cases we can't continue statistics collection and we should throw an exception. This is checked during take_sample execution (and here is a bug in 4.1) and there is a constraint on pg_profile tables for sure - you've bumped in it.

You can add recreation of pg_profile remote server to your restore procedure. If you want to save previous samples, you can disable old remote server and create a new one. take_sample() won't collect samples for disabled server, so you'll avoid this error. But you'll need to take care of remote servers naming and retention - pg_profile won't delete obsolete samples (and servers) in such case. Anyway, it seems there is no convenient solution in your case.

annlund commented 1 year ago

Hi,

Thank you for the suggestion. I'll consider changing the way I restore the db to preserve the system identifier, so that I can use pg_profile the way it was intended.

Thank you for the support!

zubkov-andrei commented 1 year ago

In 4.2 the error message in such case should be much more clear, but system identifier is still strictly controlled. You should create a new server if the system identifier was changed.