zonemaster / zonemaster-backend

The Zonemaster Backend - part of the Zonemaster project
Other
14 stars 23 forks source link

Clickhouse: new database engine (experimental) #1094

Closed ghost closed 9 months ago

ghost commented 1 year ago

Purpose

Clickhouse is a vertical database engine that appears to be quite efficient when it comes to run Zonemaster with millions of domains

Context

Changes

How to test this PR

Setup a Clickhouse server

Configure and use Zonemaster-Backend

Adapt the share/backend_config.ini file and use zmb or zmtest. You could check any results by running some SQL commands with clickhouse-client.

$ clickhouse-client
:) USE zonemaster;
:) SELECT * FROM test_results;

Run the unit tests

Unit test should work when using Clickhouse.

TARGET=Clickhouse prove -l t/
matsduf commented 1 year ago

Some installation instructions are needed.

ghost commented 11 months ago

This PR is still marked as "draft". Is it really ready for installation?

Yes. I marked it as "ready for review" and updated the description.

marc-vanderwal commented 11 months ago

Release testing report – Success, no issues

Rocky Linux 9.3/Clickhouse

Merged pnax/clickhouse in a detached head. Followed the installation instructions as updated by zonemaster/zonemaster#1228 in order to get a Clickhouse server running on the same host as zonemaster-backend. After running the “smoke test”, the test_results table contains data, as expected.

For the unit tests, the procedure was incomplete. Unit tests needed a zonemaster_testing database, a zonemaster_testing user and enough privileges for the aforementioned user to touch the database. With those prerequisites out of the way, unit tests pass.

tgreenx commented 11 months ago

@pnax Before this is merged it should be rebased on latest develop (#1132 has now been merged), to be sure that CI passes

marc-vanderwal commented 11 months ago

I’m currently testing #1121 with the experimental Clickhouse database engine still enabled and it seems that I’m hitting issues related to the way Clickhouse implements ALTER TABLE … UPDATE that don’t make me feel comfortable at all with using a Clickhouse database backend in production.

For example, I get warnings like the following:

Argument "860b459647d9be55" isn't numeric in numeric eq (==) at /usr/local/share/perl5/5.32/Zonemaster/Backend/DB.pm line 362.

That line 362 is the equality comparison of $progress with 0 in the following subroutine:

sub test_state {
    my ( $self, $test_id ) = @_;

    my ( $progress ) = $self->dbh->selectrow_array(
        q[
            SELECT progress
            FROM test_results
            WHERE hash_id = ?
        ],
        undef,
        $test_id,
    );
    if ( !defined $progress ) {
        die Zonemaster::Backend::Error::Internal->new( reason => 'job not found' );
    }

    if ( $progress == 0 ) {
        return $TEST_WAITING;
    }
    elsif ( 0 < $progress && $progress < 100 ) {
        return $TEST_RUNNING;
    }
    elsif ( $progress == 100 ) {
        return $TEST_COMPLETED;
    }
    else {
        die Zonemaster::Backend::Error::Internal->new( reason => 'state could not be determined' );
    }
}

But these “argument isn’t numeric” errors should in theory never happen, because the code fetches exactly one row and column and the datum is, according to the database schema, an integer. How do we end up with a string?

Grepping 860b459647d9be55 across both log files shows something even more concerning: the thing that’s being compared to a number is actually the ID of a test being run by the other of the two test agents on the machine. In this example, if I order the matches by timestamp, this happens:

zm-testagent-2.log:2023-12-13T08:59:51Z [2296] [WARNING] [main] Argument "860b459647d9be55" isn't numeric in numeric eq (==) at /usr/local/share/perl5/5.32/Zonemaster/Backend/DB.pm line 362.
zm-testagent.log:2023-12-13T08:59:53Z [2152] [INFO] [main] Test found: 860b459647d9be55
zm-testagent-2.log:2023-12-13T08:59:54Z [2296] [WARNING] [main] Argument "860b459647d9be55" isn't numeric in numeric eq (==) at /usr/local/share/perl5/5.32/Zonemaster/Backend/DB.pm line 362.
zm-testagent.log:2023-12-13T09:00:21Z [2812] [INFO] [main] Test starting: 860b459647d9be55
zm-testagent.log:2023-12-13T09:05:53Z [2812] [INFO] [main] Test completed: 860b459647d9be55
zm-testagent.log:2023-12-13T09:05:54Z [2152] [NOTICE] [Zonemaster::Backend::Config] Worker process (pid 2812, testid 860b459647d9be55): Terminated with exit code 0

Due to this, many tests either crash with “state could not be determined” or “illegal transition to COMPLETED”.

It seems that Clickhouse doesn’t implement proper transaction isolation when updating a table. Yet, currently, the code does many updates to test_results that, without proper isolation, cause Zonemaster::Backend to exhibit very surprising behavior.