yugabyte / yugabyte-db

YugabyteDB - the cloud native distributed SQL database for mission-critical applications.
https://www.yugabyte.com
Other
9.02k stars 1.08k forks source link

tserver OOM while migrating using pgloader #8137

Open epratt-yb opened 3 years ago

epratt-yb commented 3 years ago

Jira Link: DB-2576 We have been testing migrating mysql databases. We were using the employees' sample DB to do some testing and when we get to the foreign key creation we end up OOM the system. The data is not very big so I'm surprised we OOM. this seems to be the query we hit the issue with:

2021-04-22T22:02:04.992716Z NOTICE ALTER TABLE employees.salaries ADD CONSTRAINT salaries_ibfk_1 FOREIGN KEY(emp_no) REFERENCES employees.employees(emp_no) ON UPDATE RESTRICT ON DELETE CASCADE

We can see from dmesg -T that we OOM:

Thu Apr 22 22:10:54 2021] Out of memory: Kill process 11737 (postgres) score 759 or sacrifice child
[Thu Apr 22 22:10:54 2021] Killed process 11737 (postgres) total-vm:13004288kB, anon-rss:11976548kB, file-rss:0kB, shmem-rss:132kB

Attached is the core dump and screen grab.

epratt-yb commented 3 years ago
Screen Shot 2021-04-22 at 4 32 37 PM
ajcaldera1 commented 3 years ago

This is not necessarily restricted to pgloader. ALTER TABLE ADD CONSTRAINT ... FOREIGN KEY ends up doing a LEFT JOIN from the parent table to the dependent table. This query must be able to resolve on a single nodes heap memory. If the heap is insufficient, then the backend will likely be OOM-killed and all sessions tied to that tserver (postmaster) process will also be killed. See #9541.

jaki commented 3 years ago

Relevant code:

static void
validateForeignKeyConstraint(char *conname,
...
    /*
     * See if we can do it with a single LEFT JOIN query.  A false result
     * indicates we must proceed with the fire-the-trigger method.
     */
    if (RI_Initial_Check(&trig, rel, pkrel))
        return;
...
    while ((tuple = heap_getnext(scan, ForwardScanDirection)) != NULL)
    {
...

First, notice that there is an alternative to the LEFT JOIN. This alternative path looks very similar to the COPY OOM issue. If that's the case, we can solve this by

  1. Always avoid the LEFT JOIN code path.
  2. If the alternative path also has memory issues, fix it the same way as in COPY by using additional memory contexts that get reset per iteration. The memory is probably allocated in pggate for storage layer purposes. I don't see why this memory would need to be retained across rows.
tedyu commented 3 years ago

10199 has been merged to master and 2.6 branches.

It should address #2 in comment above.