yugabyte / yugabyte-db

YugabyteDB - the cloud native distributed SQL database for mission-critical applications.
https://www.yugabyte.com

[YSQL] Transactional COPY FROM consumes excessive amount of memory #21370

Open treibholzhq opened 6 months ago

treibholzhq commented 6 months ago

Jira Link: DB-10267

Description

Hello everyone.

I am facing memory issues when trying to ingest "large" files via a transactional copy.

My setup is as follows:

YugabyteDB OpenSource 2.21 deployed on Red Hat OpenShift 4.12 via the official Helm Chart (https://charts.yugabyte.com/).

Resource and storage configuration:

resource:
  master:
    requests:
      cpu: "1"
      memory: 2Gi
    limits:
      cpu: "2"
      memory: 4Gi
  tserver:
    requests:
      cpu: "2"
      memory: 4Gi
    limits:
      cpu: "12"
      memory: 24Gi

storage:
  master:
    count: 1
    size: 5Gi
    storageClass: ""
  tserver:
    count: 1
    size: 30Gi
    storageClass: ""

Here is the DDL of my DB schema:

CREATE TABLE IF NOT EXISTS measurements (
    measurement_id UUID, 
    PRIMARY KEY(measurement_id)
);

CREATE TABLE IF NOT EXISTS signals (
    measurement_id UUID NOT NULL,
    signal_name VARCHAR NOT NULL CHECK (signal_name <> ''),
    timestamp TIMESTAMPTZ NOT NULL,
    signal_value FLOAT(8) NOT NULL,
    PRIMARY KEY((signal_name, measurement_id) HASH, timestamp ASC),
    CONSTRAINT fk_measurements
        FOREIGN KEY(measurement_id) 
            REFERENCES measurements(measurement_id)
);

My use-case is as follows:

I am extracting data out of a measurement file, which contains time series data for 0..n sensors (signals). Each file is identified by a measurement id and is registered in the measurements table. The corresponding signal data is ingested into the signals table, which is hashed on the combination of signal name and measurement id. The scope of the transaction is the measurement file as a whole, meaning I either want the entire file to be committed or nothing at all.

This will lead to very large transactions as you can imagine.

I am batching the ingest via COPY FROM for each sensor using a Python script with psycopg2.copy_from() and StringIO, based on the example by @FranckPachot (https://dev.to/yugabyte/bulk-load-into-postgresql-yugabytedb-psycopg2-fep). The difference from his approach is that I cannot split the ingest into multiple transactions.

from io import StringIO

import psycopg2
from numpy import nditer

# parsing the file, producing signal_list
...

signal_list = []

with psycopg2.connect(connection_str) as connection:
  connection.autocommit = False

  with connection.cursor() as cursor:
    cursor.execute(
        """
        INSERT INTO measurements (measurement_id) VALUES (%s);
        """,
        (measurement_id,),  # note the trailing comma: params must be a sequence
    )

    for signal in signal_list:
      signal_name = signal.name  # assuming each signal carries its name
      timestamps = signal.timestamps
      signal_values = signal.values

      rows = StringIO()
      # one tab-separated line per (timestamp, value) pair
      rows.writelines(f"{measurement_id}\t{signal_name}\t{timestamp}\t{float(signal_value)}\n"
                      for timestamp, signal_value in nditer([timestamps, signal_values]))
      rows.seek(0)

      cursor.copy_from(file=rows, table="signals", sep="\t")

      rows.seek(0)
      rows.truncate(0)

  connection.commit()

I am running this through oc port-forward yb-tserver-0 5433.

While this works fine for smaller files (e.g. 3.5 MB with around 2 million rows), ingesting larger files (e.g. 300 MB with around 310 million rows) consumes an excessive amount of memory: I effectively hit the limits of my OpenShift resource configuration, and the connection gets terminated.

[Graph: TServer memory consumption]

It is clear that, due to the scope of the transaction, at some point Yugabyte needs to have a full copy of the data in memory, but it seems dubious to me that 300 MB of data leads to 24 GB+ of memory consumption on the tserver.

Any suggestions?

Thank you very much for your help!

Jo

Issue Type

kind/bug


FranckPachot commented 6 months ago

Hi

due to the scope of the transaction, at some point Yugabyte needs to have a full copy of the data in memory

It should not. In the SQL layer the memory used is per batch of rows. In the storage layer, it is per RocksDB memtable.

consumes an excessive amount of memory,

What is the metric on the graph you show? Can you check mem_tracker_server_Tablets_overhead_PerTablet (for the DocDB layer) and allocated_mem_bytes from pg_stat_activity (for the SQL layer)?
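
For example, something like this minimal sketch could be run from a second session while the load is going (the connection string is a placeholder; allocated_mem_bytes is the YugabyteDB-specific pg_stat_activity column named above):

import time
import psycopg2

# Placeholder connection string; adjust host, port and credentials
monitor = psycopg2.connect("host=localhost port=5433 dbname=yugabyte user=yugabyte")
monitor.autocommit = True

with monitor.cursor() as cur:
  for _ in range(60):  # sample every 10 s for roughly 10 minutes
    # show the backends with the largest SQL-layer allocations
    cur.execute("""
        SELECT pid, allocated_mem_bytes, left(query, 40)
        FROM pg_stat_activity
        ORDER BY allocated_mem_bytes DESC NULLS LAST
        LIMIT 5
    """)
    for pid, mem, query in cur.fetchall():
      print(pid, mem, query)
    time.sleep(10)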

(e.g. 300mb with around 310 million rows)

What is the size of one row? Can you share a small sample of the measurement files so we can test with it?

treibholzhq commented 6 months ago

Hi @FranckPachot, thank you very much for your answer and your clarification regarding memory usage.

The graph shows the memory consumption metrics on OpenShift for the yb-tservers stateful set: the yb-tserver-0 pod consumes 25 GB of memory when ingesting a 300 MB file (310 million rows).

allocated_mem_bytes is just rising and rising: [screenshot]

I tried to get the mem_tracker_server_Tablets_overhead_PerTablet from the Web UI (Metrics); unfortunately, I cannot see yb-tserver-0 reporting it (the other 2 tservers do report it).

The size of one row is quite small: just a UUID, a signal name (mostly < 256 chars), a timestamp, and a 64-bit floating-point number.

Example:

8884dc5b-1f08-47e1-8d2c-8e8b8f716f71 m_torque 2023-07-01 12:01:56.874 +0200 917.0

That's 65 bytes.

Oddly enough, I don't get any error in the YugabyteDB TServer logs; the connection to the pod just terminates:

psycopg2.OperationalError: server closed the connection unexpectedly
    This probably means the server terminated abnormally
    before or while processing the request.

It's very much like the issue described here: https://github.com/yugabyte/yugabyte-db/issues/5453

Thank you very much for your help.

FranckPachot commented 6 months ago

I tried some variations to get something similar to what you get (a high pg_stat_activity.allocated_mem_bytes). I got there when explicitly disabling intermediate commits with set yb_default_copy_from_rows_per_transaction=0:

[screenshot]

and it finally failed with ERROR: temporary file size exceeds temp_file_limit. This doesn't happen when the foreign key is removed.

I'll open an issue. For the moment, the workarounds are intermediate commits (but from what you say, you want the whole load to be transactional) or dropping the foreign key (there is a DISABLE_FK_CHECK option in COPY, but I guess you have to use copy_expert rather than copy_from; see the sketch below).
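
A rough sketch of that second workaround with psycopg2, reusing the rows buffer from the ingest script above:

# copy_from cannot pass COPY options, so use copy_expert with explicit SQL;
# the default text format with tab delimiter matches the buffer built above
rows.seek(0)
cursor.copy_expert("COPY signals FROM STDIN WITH (DISABLE_FK_CHECK)", rows)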

treibholzhq commented 6 months ago

@FranckPachot thanks for the investigation.

I have rewritten my script to use psycopg3, which is much simpler to use, so I tried COPY signals FROM STDIN WITH (DISABLE_FK_CHECK), but this fails with the same behaviour.

I then removed the foreign key entirely and created the signals table without it. No luck. Still consuming way too much memory.

Btw: select * from pg_settings where name = 'yb_default_copy_from_rows_per_transaction'; shows 20000, which I guess is the default.

FranckPachot commented 6 months ago

OK, so it seems to be a different problem. I'll check with @droberts-yb whether we can reproduce it.

treibholzhq commented 6 months ago

Side note: I ran the load this morning on a connection set to autocommit and can confirm that memory consumption stays flat in that case.

So I guess it really is an issue with transaction size?

FranckPachot commented 6 months ago

OK, my first thought was that yb_default_copy_from_rows_per_transaction gets disabled with a warning when not in autocommit mode, but I tested that quickly and couldn't confirm it. My test was probably wrong. We need to reproduce the problem to understand what is being allocated (the foreign key was one possibility, but maybe it is something else). Usually, bulk loads are done with intermediate commits, as in the sketch below.
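
For completeness, a sketch of such an intermediate-commit load (connection_str is a placeholder; 20000 matches the default seen in pg_settings above):

# With autocommit on, COPY commits every
# yb_default_copy_from_rows_per_transaction rows, so memory stays bounded,
# at the cost of partially visible data if the load aborts midway.
with psycopg2.connect(connection_str) as connection:
  connection.autocommit = True
  with connection.cursor() as cursor:
    cursor.execute("SET yb_default_copy_from_rows_per_transaction = 20000")
    rows.seek(0)
    cursor.copy_from(file=rows, table="signals", sep="\t")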

treibholzhq commented 6 months ago

Yeah, the challenge here is that our bulk load belongs to an entity, and that entity can hold a lot of data (100 GB in the worst case), so the desired behaviour is "all of the entity" or "nothing".

treibholzhq commented 5 months ago

@FranckPachot @droberts-yb in the meantime I tried a different approach: I use COPY to ingest into a temporary table and then do an INSERT INTO ... SELECT from this temp table into the real table, which gives me transactional behaviour when the connection fails (and also helps with parallel ingests); a condensed sketch follows below. First, I had to increase the maximum temp file limit (it was set to 1 GB; I set it to 100 GB, just to be safe). Then I tried to ingest 2 files (same as above) in parallel, and they failed with a Disk Full error (I have 50 GB disks now). I created the temp table without foreign keys using CREATE TEMP TABLE signals_temp (LIKE signals INCLUDING DEFAULTS), so I am not sure whether this is the same issue you mentioned above, @FranckPachot.
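
A condensed sketch of this staging approach (table names from the posts above; the per-signal buffer building is elided, as in the original script):

# Non-transactional COPY into a temp table, then one INSERT ... SELECT;
# the single INSERT statement is atomic, so the real table sees all rows
# or none of them even though the COPY itself commits incrementally.
with psycopg2.connect(connection_str) as connection:
  connection.autocommit = True
  with connection.cursor() as cursor:
    cursor.execute("CREATE TEMP TABLE signals_temp (LIKE signals INCLUDING DEFAULTS)")
    for signal in signal_list:
      # build the "rows" buffer for this signal as in the original script, then:
      cursor.copy_from(file=rows, table="signals_temp", sep="\t")
    cursor.execute("INSERT INTO signals SELECT * FROM signals_temp")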

FranckPachot commented 5 months ago

Thanks for testing. I'm not surprised that it is the same from a temporary table, but it helps to understand what happens. What is the size of the file (you said 300 MB, right?), and the size of the temporary table? (select pg_table_size('signals_temp');)

treibholzhq commented 5 months ago

Yes, it's the same file (300 MB, 313 million rows). The size of the signals_temp table is:

select pg_size_pretty(pg_table_size('signals_temp'));
31 GB

FranckPachot commented 5 months ago

313 million rows of 65 bytes each are about 19 GB, so 31 GB seems large, but it's okay. What format is the file, to hold all that in 300 MB?

So, yes, that seems too much to load in one transaction. Maybe you can load it non-transactionally and use a view to make it visible at the end, like this: https://dev.to/yugabyte/yugabytedb-transactional-load-with-non-transactional-copy-10ac
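
The pattern from that post, roughly (the table and view names here are illustrative, not taken from the article):

# Load into a fresh table with non-transactional COPY, then publish it
# through a view; replacing the view is a single DDL statement, so readers
# switch to the complete data set atomically.
with psycopg2.connect(connection_str) as connection:
  connection.autocommit = True
  with connection.cursor() as cursor:
    cursor.execute("CREATE TABLE signals_load (LIKE signals INCLUDING DEFAULTS)")
    rows.seek(0)
    cursor.copy_from(file=rows, table="signals_load", sep="\t")
    cursor.execute("CREATE OR REPLACE VIEW signals_visible AS SELECT * FROM signals_load")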

treibholzhq commented 5 months ago

The file is .mf4 (https://www.asam.net/standards/detail/mdf/). A compact binary format for measurement data.

I believe I am already loading non-transactionally: I am setting autocommit=True on the psycopg connection, so the COPY into the temp table is already non-transactional, with rows being committed along the way. So yes, I realize that the temp table should be around 19 GB, but I do not understand where the 12 GB of overhead comes from.

For reference, a 3.5 MB file with 2 million rows yields a temp table of 150 MB, which is okay given that the raw data should be around 130 MB (65 bytes * 2 million rows = 130 MB).

But this overhead grows much faster than linearly: as stated, the overhead for 300 MB / 313 million rows is 12 GB.

Typically, our measurement files are 2.5 GB in size, with over 1 billion rows each, and can in the worst case reach 100 GB.

The overhead adds up pretty quickly, then.

We could reduce the row size a bit, though, by not storing a UUID in the signals table but a bigserial or even serial.
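
For illustration, a hypothetical slimmed-down schema along those lines (the measurement_seq column name is made up):

# 8-byte bigint surrogate key in every signals row instead of a 16-byte UUID;
# the UUID then lives only once per file, in measurements.
cursor.execute("""
    CREATE TABLE IF NOT EXISTS measurements (
        measurement_seq BIGSERIAL PRIMARY KEY,
        measurement_id UUID UNIQUE NOT NULL
    );
    CREATE TABLE IF NOT EXISTS signals (
        measurement_seq BIGINT NOT NULL,
        signal_name VARCHAR NOT NULL CHECK (signal_name <> ''),
        timestamp TIMESTAMPTZ NOT NULL,
        signal_value FLOAT(8) NOT NULL,
        PRIMARY KEY ((signal_name, measurement_seq) HASH, timestamp ASC)
    );
""")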