Open acaloiaro opened 5 months ago
Hi @acaloiaro 👋 , thanks for opening the issue.
> - Given a directory full of migrations, run all migrations within. I believe what's required here is to simply run `pgroll start -c <dir>/*`, sorted. The command could perhaps be `pgroll bootstrap <dir>`.
This is absolutely a missing feature of `pgroll` and something that has to be addressed before a v1 release. Given the two-phase (start, complete) nature of `pgroll` migrations, the command would have to start and complete all migrations in `<dir>`. The final migration in `<dir>` could in theory be started and left incomplete, but this probably wouldn't be useful in a database bootstrap scenario.

Using `pgroll start` for this could be confusing, given that it has a `--complete` flag to control whether the migration should be immediately completed or not. As we'd have to complete migrations when running a `<dir>` full of migrations, the `--complete` flag would become redundant/confusing. Perhaps adding `pgroll bootstrap` as you suggest would be the way to go.
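A hypothetical `pgroll bootstrap <dir>` could amount to little more than visiting the directory in lexical order and starting-and-completing each file. A dry-run sketch (the `pgroll` invocation is printed rather than executed; the `migrations/` directory and sample filenames are made up for illustration):

```shell
# Dry-run sketch of a hypothetical `pgroll bootstrap <dir>`:
# print the pgroll invocation for each migration, in order.
# Shell globs expand in lexical order, so zero-padded numeric
# (or timestamped) prefixes are visited oldest-first.
dir="migrations"
mkdir -p "$dir"
touch "$dir/01_create_table.json" "$dir/02_add_field.json"  # sample files

for f in "$dir"/*.json; do
  # swap `echo` for the real command to actually apply the migrations
  echo pgroll start --complete "$f"
done
```

Because each migration is completed before the next starts, the loop matches the "start and complete all migrations" behaviour described above rather than leaving the last one pending.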
> - That it govern migration creation. This could be as simple as `pgroll new <name> <dir>` to add new migration files with the correct timestamp format to `<dir>`. It may be nice to open up `$EDITOR` after file creation.
I'm not entirely convinced of the need for a `pgroll new` command. Having `pgroll new` create timestamped migration files would be nice, but it's also possible for users to create numbered migration files by hand (`01_create_table.json`, `02_add_field.json`, etc). This seems to be a common approach for other migration tools.

A nice feature of `pgroll new` that might make the command worthwhile would be to stub out the migration JSON using the JSON schema for the operation, so that `pgroll new create_table` opens `$EDITOR` with a `create_table` migration pre-populated with all required and optional fields.
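As a sketch of what such a `pgroll new` might do — the command doesn't exist yet, and the stub below is hand-written to resemble pgroll's migration format rather than generated from the real JSON schema:

```shell
# Sketch of a hypothetical `pgroll new <name> <dir>`:
# create a timestamped migration file containing a minimal stub,
# then hand it to $EDITOR.
name="create_table"
dir="migrations"
mkdir -p "$dir"
file="$dir/$(date -u +%Y%m%d%H%M%S)_${name}.json"

# Minimal illustrative stub; a real implementation would fill in
# required/optional fields from the operation's JSON schema.
cat > "$file" <<'EOF'
{
  "name": "",
  "operations": [
    { "create_table": { "name": "", "columns": [] } }
  ]
}
EOF

echo "created $file"
# ${EDITOR:-vi} "$file"   # uncomment to edit interactively
```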
We'd gladly accept and review PRs that implement either of these, especially for point 1. `pgroll` has extensive tests for the migrations engine itself in `pkg/migrations` and `pkg/state` but none around the CLI itself. We're generally happy with this approach for now, so the requirement for extra tests would be minimal.
> Hi @acaloiaro 👋 , thanks for opening the issue.
>
> - Given a directory full of migrations, run all migrations within. I believe what's required here is to simply run `pgroll start -c <dir>/*`, sorted. The command could perhaps be `pgroll bootstrap <dir>`.
>
> This is absolutely a missing feature of `pgroll` and something that has to be addressed before a v1 release. Given the two-phase (start, complete) nature of `pgroll` migrations, the command would have to start and complete all migrations in `<dir>`. The final migration in `<dir>` could in theory be started and left incomplete, but this probably wouldn't be useful in a database bootstrap scenario.
I agree. In the DB bootstrap scenario I'm generally standing up a new system with nothing to roll back to, so marking a bootstrapped db as "complete" seems sensible by default.
> Using `pgroll start` for this could be confusing, given that it has a `--complete` flag to control whether the migration should be immediately completed or not. As we'd have to complete migrations when running a `<dir>` full of migrations, the `--complete` flag would become redundant/confusing. Perhaps adding `pgroll bootstrap` as you suggest would be the way to go.
Yeah I agree, `--complete` is probably silly for this use case.
> - That it govern migration creation. This could be as simple as `pgroll new <name> <dir>` to add new migration files with the correct timestamp format to `<dir>`. It may be nice to open up `$EDITOR` after file creation.
>
> I'm not entirely convinced of the need for a `pgroll new` command. Having `pgroll new` create timestamped migration files would be nice, but it's also possible for users to create numbered migration files by hand (`01_create_table.json`, `02_add_field.json`, etc). This seems to be a common approach for other migration tools.
Instead of timestamping, perhaps auto-incrementing from the most recent migration number would be sensible for `new`, e.g. `pgroll new <name> <dir?".">` -> `<dir?>/<new_migration_number>_<migration_name>.json`
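The auto-increment variant could be sketched like this — assuming zero-padded numeric prefixes such as `01_…`, `02_…`; the directory and file names are made up for illustration, and none of this is existing pgroll behaviour:

```shell
# Sketch of the auto-increment idea: find the highest numeric
# prefix among existing migration files and zero-pad the next one.
dir="mig_demo"
mkdir -p "$dir"
touch "$dir/01_create_table.json" "$dir/02_add_field.json"  # samples

# extract numeric prefixes, take the largest, strip leading zeros
last=$(ls "$dir" | sed -n 's/^\([0-9][0-9]*\)_.*/\1/p' | sort -n | tail -n1 | sed 's/^0*//')
next=$(printf '%02d' $(( ${last:-0} + 1 )))
echo "next prefix: $next"   # 03 for the sample files above
```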
> A nice feature of `pgroll new` that might make the command worthwhile would be to stub out the migration JSON using the JSON schema for the operation, so that `pgroll new create_table` opens `$EDITOR` with a `create_table` migration pre-populated with all required and optional fields.
That sounds quite handy.
> We'd gladly accept and review PRs that implement either of these, especially for point 1. `pgroll` has extensive tests for the migrations engine itself in `pkg/migrations` and `pkg/state` but none around the CLI itself. We're generally happy with this approach for now, so the requirement for extra tests would be minimal.
Sounds good. I'll get something in the works for 1. I'll leave how 2. gets done up to your discretion. Either way, I'll be implementing a solution for it on my end to solve my immediate needs. I'm happy to share what I come up with as a proof of concept if you want. It's unlikely to be super robust and ready for the masses, but it will be a starting point.
After more hands-on time with pgroll, I think this issue has the potential to be a whole can of worms. The "bootstrap" problem scope is a subset of a more general one, which is running multiple, unapplied migrations in order.
It's possible that bootstrapping is a problem worth solving on its own, and the more general problem of running unapplied migrations can be solved separately. But, I do think it's worth mentioning here that as a user, I pretty quickly ran into the issue of needing to run multiple unapplied migrations sequentially. I'll refer to this as the "MAM" problem below.
The most obvious way that the MAM problem surfaces is when developers work on feature branches in which multiple migrations are required. While pgroll migrations do indeed support multiple operations, not all operations may be applied within the same migration. This exacerbates the MAM problem.

At least two types of operations cannot be applied in the same migration with other operations: `sql` and `add_index`. I imagine there are others.

This results in the MAM problem surfacing more often than one would expect, e.g. when developers add new tables that necessitate one or more indices, or when Postgres extensions need to be enabled, e.g. using `sql` operations to `CREATE EXTENSION IF NOT EXISTS "uuid-ossp"` to enable `UUID` types.
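For concreteness, such a standalone migration would be a lone `sql` operation, along these lines (the `up` field follows pgroll's `sql` operation; the file name and exact shape are illustrative):

```json
{
  "name": "0003_enable_uuid_ossp",
  "operations": [
    {
      "sql": {
        "up": "CREATE EXTENSION IF NOT EXISTS \"uuid-ossp\";"
      }
    }
  ]
}
```

Because `sql` can't share a migration with other operations, this one-liner costs an entire migration file of its own.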
This means that many features involving migrations result in multiple migrations, rather than a single migration. With multiple developers building multiple features that easily result in the MAM problem, managing migration releases becomes difficult.
So is the MAM problem can of worms worth solving now, or should this issue focus solely on bootstrapping? Having fewer or no exclusive operations like `sql` and `add_index` would alleviate some of the MAM problem pain, but the bootstrap scenario remains.
@andrew-farries could you possibly share the xata team's experience managing this in production?
I think this issue should focus solely on the issue of running `pgroll` on a directory full of migrations.

The restrictions around multiple operations in a migration operating on the same resources are a separate issue IMO, already tracked in https://github.com/xataio/pgroll/issues/239. Any solution for lifting the multiple-operations restrictions is likely going to be orthogonal to this 'bootstrap' issue.
> So is the MAM problem can of worms worth solving now, or should this issue focus solely on bootstrapping? Having fewer or no exclusive operations like `sql` and `add_index` would alleviate some of the MAM problem pain, but the bootstrap scenario remains.
FWIW I'm working on relaxing the restrictions in the `sql` migration so it can run together with other ops: https://github.com/xataio/pgroll/pull/280

I wonder if having indices as part of the `create_table` op would make sense too.
> I wonder if having indices as part of the create_table op would make sense too
I think this would make sense. It's been raised before as https://github.com/xataio/pgroll/issues/203
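Until something like that lands, the table and its index have to be two separate migration files, e.g. (shapes modeled loosely on pgroll's `create_table` and `add_index` operations — treat the field names as a sketch, not gospel):

`01_create_users.json`:

```json
{
  "name": "01_create_users",
  "operations": [
    {
      "create_table": {
        "name": "users",
        "columns": [
          { "name": "id", "type": "serial", "pk": true },
          { "name": "email", "type": "text" }
        ]
      }
    }
  ]
}
```

`02_index_users_email.json`:

```json
{
  "name": "02_index_users_email",
  "operations": [
    {
      "add_index": {
        "table": "users",
        "name": "idx_users_email",
        "columns": ["email"]
      }
    }
  ]
}
```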
> Sounds good. I'll get something in the works for 1.
I just wanted to respond because I actually decided not to put any work into this for the time being, and I don't want anyone under the impression that it's a work in progress.
This is not me pooh-poohing pgroll. I still think pgroll's way of doing migrations is superior. However, between the current shortcomings of no easy bootstrap, no way to define complex migrations (#203, #239), and the JSON migration file format that prevents me from using `sqlc` without some sort of automated `pg_dump` step to get my schema as plain SQL, migrations were not feeling very ergonomic.

In my personal time I'm considering forking pgroll, removing the data migration `up` portion of migrations, and replacing the JSON schema definition format with plain SQL. I believe these changes will make it feel more ergonomic for me.

I won't be shocked if I return to mainline pgroll, however. But with the number of migrations I'm currently authoring, the ergonomics just aren't working for me and the toil factor is too high. I think pgroll will feel a lot more comfortable for established applications with relatively stable data schemas.
FYI: @andrew-farries This is where I ended up on this issue.
Thanks for the feedback on what's pushing you away from using `pgroll`. The points you raise are valid and we're in agreement that they need to be addressed before a v1 release.

I hope you can adapt `pgroll` to your needs in the meantime and maybe return once these issues are addressed.
I'm confident I'll be back; pgroll has outstanding promise :smile:
Hello! I had both the "bootstrap" and the `sqlc` integration issues and, after a bit of tinkering, I came up with the following solution:
```shell
echo "Generating DB schema for ${service}"
psql -U postgres -c "DROP DATABASE IF EXISTS ${service}_schemas;"
psql -U postgres -c "CREATE DATABASE ${service}_schemas;"
pgroll --postgres-url "postgres://postgres:postgres@0.0.0.0:5432/${service}_schemas?sslmode=disable" init
find . -path "./${service}/model/migrations/*.json" | xargs -I% pgroll --postgres-url "postgres://postgres:postgres@0.0.0.0:5432/${service}_schemas?sslmode=disable" start --complete %
pg_dump -s -x -n public -U postgres ${service}_schemas > ${service}/model/schema.sql
sqlc generate
```
This is far from ideal, but it works as one would expect. Leaving it here so maybe I can save some time to the next one! :)
Not bad at all really. Especially if you watch `migrations/*.json` with `air` or a similar tool, triggering model generation on migration changes.
Coming from the traditional style of migration tools that you're probably familiar with, I've grown accustomed to tooling that provides a clear path from an empty database to one that fully reflects the current schema.

I believe this story starts with timestamped migration names, as schemas must be realized serially. I.e. fields can't be added to `users` until `users` exists. This way, timestamped migrations in a `migrations` directory declare the serialized path to the most current schema. Of course, some migrations can be parallelized, since a `create table` cannot interfere with another `create table` so long as its migration doesn't involve foreign keys, but I don't think it's useful to get mired in implementation detail right now.

As a user, what I'd like from pgroll is:

1. `pgroll new <name> <dir>` to add new migration files with the correct timestamp format to `<dir>`. It may be nice to open up `$EDITOR` after file creation.
2. `pgroll start -c <dir>/*`, sorted. The command could perhaps be `pgroll bootstrap <dir>`.

I don't believe either of these things presents a significant amount of work, but they could greatly improve adoption. I'll likely implement (1.) and (2.) in my own fork for personal use. I'd be happy to get feedback on this proposal and open a PR if it's deemed acceptable. I likely won't have time to write tests of acceptable quality, however.