widdowquinn opened this issue 2 years ago
Can you share any of these discussions with users? That would give me a more concrete idea of the problem this solves. I imagine it's the same issue I had between the fastANI branch and others, though.
You're the user with the most experience of these incompatibilities, for sure ;)
We have talked about using alembic for this before, and it's good practice in any case.
I don't have much concrete, but here's a related screenshot as I helped Angelika understand her issue with the update:

[screenshot]
I've been playing with this, and there are a few things I don't seem to have figured out, yet. (e.g. How do I get alembic upgrade head working with the multidb template?) BUT, I think I have a basic migration script written for the old/new versions of the database (pre-/post- fastANI addition).
I have set up the basic requirements for a new versiondb subcommand, and also managed to get alembic working correctly locally. I'm now working on tying the functionality into pyani in a sensible way.
I was going to add a tag in the repo, prior to the merge of fastANI, to capture the database changes, but it looks like tags are directly related to releases, and doing so might be confusing, since the database is only a factor in version 3, and all of the existing releases are version 2.
Actually, I have found an issue with the downgrade() function I am testing, which should remove the columns that are added to support fastANI.

The function I am running is written in accordance with examples I've seen:
```python
def downgrade_old_db():
    with op.batch_alter_table("comparisons") as batch_op:
        batch_op.drop_column('kmersize')
        batch_op.drop_column('minmatch')
```
but nothing happens—note the lack of a 'Running downgrade ...' message here:
```
(pyani_dev) ! alembic downgrade base
INFO [alembic.env] Migrating database old_db
INFO [alembic.runtime.migration] Context impl SQLiteImpl.
INFO [alembic.runtime.migration] Will assume non-transactional DDL.
```
Many sources on the internet claim that this is because sqlite does not support the DROP COLUMN functionality of ALTER TABLE, but the documentation for this indicates otherwise. It appears this change was made sometime prior to July 2021, though I have not been able to pinpoint exactly when.
I have, however, confirmed that I can drop columns within my sqlite installation, so I'm puzzled as to why alembic is not managing to do so.
```
sqlite> .schema comparisons
CREATE TABLE comparisons (
comparison_id INTEGER NOT NULL,
query_id INTEGER NOT NULL,
subject_id INTEGER NOT NULL,
aln_length INTEGER,
sim_errs INTEGER,
identity FLOAT,
cov_query FLOAT,
cov_subject FLOAT,
program VARCHAR,
version VARCHAR,
fragsize INTEGER,
maxmatch BOOLEAN, kmersize INTEGER, minmatch FLOAT,
PRIMARY KEY (comparison_id),
UNIQUE (query_id, subject_id, program, version, fragsize, maxmatch),
FOREIGN KEY(query_id) REFERENCES genomes (genome_id),
FOREIGN KEY(subject_id) REFERENCES genomes (genome_id)
);
sqlite> alter table comparisons drop kmersize;
sqlite> .schema comparisons
CREATE TABLE comparisons (
comparison_id INTEGER NOT NULL,
query_id INTEGER NOT NULL,
subject_id INTEGER NOT NULL,
aln_length INTEGER,
sim_errs INTEGER,
identity FLOAT,
cov_query FLOAT,
cov_subject FLOAT,
program VARCHAR,
version VARCHAR,
fragsize INTEGER,
maxmatch BOOLEAN, minmatch FLOAT,
PRIMARY KEY (comparison_id),
UNIQUE (query_id, subject_id, program, version, fragsize, maxmatch),
FOREIGN KEY(query_id) REFERENCES genomes (genome_id),
FOREIGN KEY(subject_id) REFERENCES genomes (genome_id)
);
sqlite>
```
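One thing that might be worth ruling out (this is an assumption on my part): ALTER TABLE ... DROP COLUMN support only arrived in SQLite 3.35.0 (March 2021), and the SQLite library linked into Python's sqlite3 module, which is what sqlalchemy/alembic use, can be older than the one behind the command-line shell. A quick check of the library version Python sees:

```python
import sqlite3

# Version of the SQLite library the Python driver is linked against; this is
# what matters for alembic, and it can differ from the sqlite3 CLI's version.
print(sqlite3.sqlite_version)
```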
it looks like tags are directly related to releases, and doing so might be confusing, since the database is only a factor in version 3, and all of the existing releases are version 2.
IIRC, git tags are just labels (though really, only lightweight tags are just labels - annotated tags are proper git objects; both point to specific commits) - but GitHub chooses to interpret tags in a specific way. Each tag is understood to be a release point, but the release point only becomes an actual release if Release Notes are added. GitHub's approach to this is not especially general, IMO.
Actually, I have found an issue with the downgrade() function I am testing, which should remove the columns that are added to support fastANI. The function I am running is written in accordance with examples I've seen:

```python
def downgrade_old_db():
    with op.batch_alter_table("comparisons") as batch_op:
        batch_op.drop_column('kmersize')
        batch_op.drop_column('minmatch')
```

but nothing happens—note the lack of a 'Running downgrade ...' message here:

```
(pyani_dev) ! alembic downgrade base
INFO [alembic.env] Migrating database old_db
INFO [alembic.runtime.migration] Context impl SQLiteImpl.
INFO [alembic.runtime.migration] Will assume non-transactional DDL.
```
Does alembic provide no error message or other useful information (and appear to have "worked" without DROPping the column)?
A quick search turned up at least one workaround someone implemented, but we shouldn't have to do that.
I've seen two mentions of a render_as_batch option which, I think, allows alembic to batch copy the table, modify it accordingly, DROP the old one, and rename the new one:
Miguel Grinberg has a blog post outlining a possible solution.
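If my understanding is right (an assumption; I've not tested this here), the option is enabled where the migration context is configured in env.py:

```python
# Sketch: the relevant call inside run_migrations_online() in alembic's env.py.
# With render_as_batch=True, autogenerated migration scripts wrap their
# operations in batch_alter_table() blocks, which recreate-and-copy tables
# rather than relying on SQLite's limited ALTER TABLE.
context.configure(
    connection=connection,
    target_metadata=target_metadata,
    render_as_batch=True,
)
```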
I expect you'll have seen and tried all of these, mind.
Does alembic provide no error message or other useful information (and appear to have "worked" without DROPping the column)?
By default it does not produce any error message, and I have tried including the --raiseerr optional flag to get a full traceback, but it still doesn't seem to output anything.
I was attempting to verify that the issue was with dropping columns, and not just downgrading specifically, by creating a second revision so I had more things to step through. However, now I am stuck in a loop where alembic claims the database is not up-to-date, but the fastani columns exist. The issue seems to be that it's not populating the alembic_version table now, so the upgrade fails partway through, but I don't know why. I think I saw something about this the other day, but haven't managed to find it again.

I think this issue started when I manually deleted columns, to prove I could, though I've since recreated the database. Perhaps something is just not hooked up quite right....
I seem to have hacked my way through a series of arcane errors I could find little information about to something that works.
I now have two migration scripts that I am able to use to both upgrade and downgrade a database reliably. The differences from what I had before are that I did not use the multidb template this time, as it did not seem necessary after further reflection, and that I have had to switch from performing the batch operations within a context:
```python
with op.batch_alter_table("comparisons") as batch_op:
    batch_op.add_column(sa.Column('kmersize', sa.Integer))
    batch_op.add_column(sa.Column('minmatch', sa.Float))
```
to doing so outside of one:
```python
op.add_column('comparisons', sa.Column('kmersize', sa.Integer))
op.add_column('comparisons', sa.Column('minmatch', sa.Float))
```
This change was necessary because I got the following error, and nothing else I tried both prevents the error and still performs the changes. I have not been able to figure out why this caused an error when, for instance, I successfully used the former format yesterday.
```
Traceback (most recent call last):
File "/Users/baileythegreen/Software/miniconda3/bin/alembic", line 8, in <module>
sys.exit(main())
File "/Users/baileythegreen/Software/miniconda3/lib/python3.8/site-packages/alembic/config.py", line 588, in main
CommandLine(prog=prog).main(argv=argv)
File "/Users/baileythegreen/Software/miniconda3/lib/python3.8/site-packages/alembic/config.py", line 582, in main
self.run_cmd(cfg, options)
File "/Users/baileythegreen/Software/miniconda3/lib/python3.8/site-packages/alembic/config.py", line 559, in run_cmd
fn(
File "/Users/baileythegreen/Software/miniconda3/lib/python3.8/site-packages/alembic/command.py", line 320, in upgrade
script.run_env()
File "/Users/baileythegreen/Software/miniconda3/lib/python3.8/site-packages/alembic/script/base.py", line 563, in run_env
util.load_python_file(self.dir, "env.py")
File "/Users/baileythegreen/Software/miniconda3/lib/python3.8/site-packages/alembic/util/pyfiles.py", line 92, in load_python_file
module = load_module_py(module_id, path)
File "/Users/baileythegreen/Software/miniconda3/lib/python3.8/site-packages/alembic/util/pyfiles.py", line 108, in load_module_py
spec.loader.exec_module(module) # type: ignore
File "<frozen importlib._bootstrap_external>", line 848, in exec_module
File "<frozen importlib._bootstrap>", line 219, in _call_with_frames_removed
File "/Users/baileythegreen/Software/pyani/scratch/alembic/env.py", line 72, in <module>
run_migrations_online()
File "/Users/baileythegreen/Software/pyani/scratch/alembic/env.py", line 66, in run_migrations_online
context.run_migrations()
File "<string>", line 8, in run_migrations
File "/Users/baileythegreen/Software/miniconda3/lib/python3.8/site-packages/alembic/runtime/environment.py", line 851, in run_migrations
self.get_context().run_migrations(**kw)
File "/Users/baileythegreen/Software/miniconda3/lib/python3.8/site-packages/alembic/runtime/migration.py", line 620, in run_migrations
step.migration_fn(**kw)
File "/Users/baileythegreen/Software/pyani/scratch/alembic/versions/65538af5a5e1_add_nonsense_column.py", line 21, in upgrade
batch_op.add_column(sa.Column('nonsense', sa.Integer))
File "/Users/baileythegreen/Software/miniconda3/lib/python3.8/contextlib.py", line 120, in __exit__
next(self.gen)
File "/Users/baileythegreen/Software/miniconda3/lib/python3.8/site-packages/alembic/operations/base.py", line 374, in batch_alter_table
impl.flush()
File "/Users/baileythegreen/Software/miniconda3/lib/python3.8/site-packages/alembic/operations/batch.py", line 101, in flush
should_recreate = self._should_recreate()
File "/Users/baileythegreen/Software/miniconda3/lib/python3.8/site-packages/alembic/operations/batch.py", line 94, in _should_recreate
return self.operations.impl.requires_recreate_in_batch(self)
File "/Users/baileythegreen/Software/miniconda3/lib/python3.8/site-packages/alembic/ddl/sqlite.py", line 56, in requires_recreate_in_batch
and col.server_default.persisted
AttributeError: 'NoneType' object has no attribute 'persisted'
```
I have no idea how this manages to get around the issue I saw before when nothing seemed to happen on downgrade.
Does this seem like the right way to incorporate alembic with pyani? I know less about the nuances of packaging things in Python.
https://github.com/sqlalchemy/alembic/issues/463#issuecomment-441887412
I'm not familiar enough with alembic to know whether it's the "correct" way to proceed. It looks to have been part of alembic since 1.2.0, but isn't explicitly documented or in the tutorial AFAICT. So I presume it's stable, but maybe a niche operation that doesn't get stress-tested a lot.
What I get from the thread linked above seems sensible. If it does what we need, doesn't break anything else, and doesn't impose any outrageous requirements, that's good enough for me.
I meant more from a Python packaging standpoint. All this seems to do is create two empty __init__.py files—one in the alembic (or other) directory specified as the location for alembic's stuff; the other in the versions subdirectory of the former. This seems unlikely to break anything, but may or may not be necessary.
I think the files that need to be included/given to users, and the correct structure for them, are:

- alembic.ini
- alembic/
  - __init__.py
  - env.py
  - README           # contains a single line stating what alembic template was used; may not be useful
  - script.py.mako   # this may not be necessary; it is a template for creating new migration files
  - versions/
    - __init__.py
    - <revision_file>
    - <revision_file>
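If these files end up shipping inside the pyani package itself, one conventional route would be to declare them as package data; a sketch only (the paths are illustrative, and this is not something we've settled on):

```python
# Sketch: bundling the alembic machinery with a setuptools-based install.
from setuptools import find_packages, setup

setup(
    name="pyani",
    packages=find_packages(),
    package_data={
        "pyani": ["alembic.ini", "alembic/*.py", "alembic/versions/*.py"],
    },
    include_package_data=True,
)
```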
I have a subcommand working that will run upgrades and downgrades, and capture the output/errors from those. The main questions now are:

1. How should these files be included/given to the user? Including them at the top level seems sensible to me, but for users who install via conda, I don't know if something else would be better.
2. Is there a way to set the sqlalchemy.url variable programmatically? This currently has to be set in the alembic.ini file in order for things to work, but it needs to match whatever database the user is using; for distribution we could use the default database name, but I would like to find a way to pull it from the command line arguments, or something. Ideally, without using sed, or a vimscript, to edit it.

All this seems to do is create two empty __init__.py files—one in the alembic (or other) directory specified as the location for alembic's stuff; the other in the versions subdirectory of the former.
That doesn't seem unreasonable but, as I understand it, the __init__.py may no longer be necessary with namespace packages. As pyani isn't supported for Python versions that don't support namespace packaging, we'd likely be OK without it.
How should these files be included/given to the user? Including them at the top level seems sensible to me, but for users who install via conda, I don't know if something else would be better.
I imagined that there would be a way to present only the pyani upgrade/pyani downgrade command to the user and have it work. Is that not possible?
Is there a way to set the sqlalchemy.url variable programmatically?
I don't really know what this refers to, or should point to. I would have assumed the up/downgrade could run with a default --dbpath argument like the other commands, or take a passed path.
Is this issue relevant? https://github.com/sqlalchemy/alembic/issues/606
For alembic to work, it needs a config file: alembic.ini, which contains the database name/driver info in the form of an sqlalchemy.url variable, and the location of the migration scripts as another variable. The sqlalchemy.url variable is pulled from the config file when alembic's env.py is run; I am not yet sure where the script location variable is retrieved. There does not seem to be a general way to pass these values inside an alembic command so that I can just pull them from our own command line options. It is possible to give alembic a database name this way, but I believe this only works in cases where the config file uses the multidb template, and allows you to selectively upgrade/downgrade a single database.
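For what it's worth, alembic can also be driven from Python rather than its CLI, which would let us set both values in code; a sketch, with illustrative paths, since I haven't confirmed this fits our setup:

```python
# Sketch: invoking alembic programmatically, supplying the two values that
# would otherwise have to live in alembic.ini.
from alembic import command
from alembic.config import Config

cfg = Config()  # no .ini file on disk required when options are set directly
cfg.set_main_option("script_location", "alembic")  # migration scripts directory
cfg.set_main_option("sqlalchemy.url", "sqlite:////absolute/path/to/pyanidb")
command.upgrade(cfg, "head")
```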
I imagined that there would be a way to present only the pyani upgrade/pyani downgrade command to the user and have it work. Is that not possible?
Right now, pyani versiondb [--upgrade <revision>] and pyani versiondb --downgrade <revision> work, but the location of the database, and the driver used, as well as the location of the migration scripts, still have to be specified somewhere.
This is what I mean by specifying the driver & database location:
Valid SQLite URL forms are:
sqlite:///:memory: (or, sqlite://)
sqlite:///relative/path/to/file.db
sqlite:////absolute/path/to/file.db
The sqlite part should be fine, but the path is obviously something we can't predict, and the only way I currently know to get the value of something like --dbpath to the right bit of code is through an environment variable, which I would have to set, because I don't think alembic's env.py will have access to our namespace variables.
Is this issue relevant? sqlalchemy/alembic#606
This looks like they're using an environment variable, so maybe, if you think that's an acceptable way forward. I still need to do some digging to see where the script location stuff is used.
Examples of my current alembic.ini and env.py. The only modifications I have made are to the sqlalchemy.url line(s) in the first one.

alembic.ini:
```ini
# A generic, single database configuration.
[alembic]
# path to migration scripts
script_location = alembic
# template used to generate migration files
# file_template = %%(rev)s_%%(slug)s
# sys.path path, will be prepended to sys.path if present.
# defaults to the current working directory.
prepend_sys_path = .
# timezone to use when rendering the date within the migration file
# as well as the filename.
# If specified, requires the python-dateutil library that can be
# installed by adding `alembic[tz]` to the pip requirements
# string value is passed to dateutil.tz.gettz()
# leave blank for localtime
# timezone =
# max length of characters to apply to the
# "slug" field
# truncate_slug_length = 40
# set to 'true' to run the environment during
# the 'revision' command, regardless of autogenerate
# revision_environment = false
# set to 'true' to allow .pyc and .pyo files without
# a source .py file to be detected as revisions in the
# versions/ directory
# sourceless = false
# version location specification; This defaults
# to alembic/versions. When using multiple version
# directories, initial revisions must be specified with --version-path.
# The path separator used here should be the separator specified by "version_path_separator"
# version_locations = %(here)s/bar:%(here)s/bat:alembic/versions
# version path separator; As mentioned above, this is the character used to split
# version_locations. Valid values are:
#
# version_path_separator = :
# version_path_separator = ;
# version_path_separator = space
version_path_separator = os # default: use os.pathsep
# the output encoding used when revision files
# are written from script.py.mako
# output_encoding = utf-8
# sqlalchemy.url = sqlite:////Users/baileythegreen/Software/pyani/scratch/alembic/old_db
sqlalchemy.url = sqlite:///pyani/scratch/alembic/old_db
[post_write_hooks]
# post_write_hooks defines scripts or Python functions that are run
# on newly generated revision scripts. See the documentation for further
# detail and examples
# format using "black" - use the console_scripts runner, against the "black" entrypoint
# hooks = black
# black.type = console_scripts
# black.entrypoint = black
# black.options = -l 79 REVISION_SCRIPT_FILENAME
# Logging configuration
[loggers]
keys = root,sqlalchemy,alembic
[handlers]
keys = console
[formatters]
keys = generic
[logger_root]
level = WARN
handlers = console
qualname =
[logger_sqlalchemy]
level = WARN
handlers =
qualname = sqlalchemy.engine
[logger_alembic]
level = INFO
handlers =
qualname = alembic
[handler_console]
class = StreamHandler
args = (sys.stderr,)
level = NOTSET
formatter = generic
[formatter_generic]
format = %(levelname)-5.5s [%(name)s] %(message)s
datefmt = %H:%M:%S
```
and env.py:
```python
from logging.config import fileConfig

from sqlalchemy import engine_from_config
from sqlalchemy import pool

from alembic import context

# this is the Alembic Config object, which provides
# access to the values within the .ini file in use.
config = context.config

# Interpret the config file for Python logging.
# This line sets up loggers basically.
fileConfig(config.config_file_name)

# add your model's MetaData object here
# for 'autogenerate' support
# from myapp import mymodel
# target_metadata = mymodel.Base.metadata
target_metadata = None

# other values from the config, defined by the needs of env.py,
# can be acquired:
# my_important_option = config.get_main_option("my_important_option")
# ... etc.


def run_migrations_offline():
    """Run migrations in 'offline' mode.

    This configures the context with just a URL
    and not an Engine, though an Engine is acceptable
    here as well. By skipping the Engine creation
    we don't even need a DBAPI to be available.

    Calls to context.execute() here emit the given string to the
    script output.
    """
    url = config.get_main_option("sqlalchemy.url")
    context.configure(
        url=url,
        target_metadata=target_metadata,
        literal_binds=True,
        dialect_opts={"paramstyle": "named"},
    )

    with context.begin_transaction():
        context.run_migrations()


def run_migrations_online():
    """Run migrations in 'online' mode.

    In this scenario we need to create an Engine
    and associate a connection with the context.
    """
    connectable = engine_from_config(
        config.get_section(config.config_ini_section),
        prefix="sqlalchemy.",
        poolclass=pool.NullPool,
    )

    with connectable.connect() as connection:
        context.configure(
            connection=connection, target_metadata=target_metadata
        )

        with context.begin_transaction():
            context.run_migrations()


if context.is_offline_mode():
    run_migrations_offline()
else:
    run_migrations_online()
```
This looks like they're using an environment variable, so maybe, if you think that's an acceptable way forward. I still need to do some digging to see where the script location stuff is used.
Environment variables are fair game. os.environ["PYANI_DB_PATH"] could be set pretty straightforwardly post-CLI parsing, I think.

When populating it, pathlib's Path should allow identification of the absolute path of whatever the CLI picks up for --dbpath on the current system, so that you don't need to worry about relative paths in the sqlalchemy URL.
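A minimal sketch of both halves (the variable name PYANI_DBPATH and the function names are illustrative, not settled):

```python
import os
from pathlib import Path


def export_dbpath(dbpath: str) -> None:
    """pyani side, post-CLI parsing: resolve --dbpath to an absolute path."""
    os.environ["PYANI_DBPATH"] = str(Path(dbpath).resolve())


def sqlalchemy_url_from_env() -> str:
    """env.py side: 'sqlite:///' plus the leading '/' of the absolute path
    yields the required sqlite:////absolute/path/to/file.db form."""
    return f"sqlite:///{os.environ['PYANI_DBPATH']}"
```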
Okay, good to know. I'll see if I can get that working. I have been waiting to push stuff because technically right now it won't run; without those values, it's going to throw an error.
I'm looking forward to seeing how you manage the tests for this, too.
That's what I'll be looking at after I try this stuff with environmental variables. I have ideas for what I'll do with tests.
I have thus far failed to get the environment variable plan working. I have managed to set them, and I have the syntax for retrieving them correct. However, I am unable to get their values assigned to the alembic config values, and I don't know why.

The env.py file seems to be a black box; I can't get print() statements to work, which makes it difficult to debug what's happening. If I don't have an alembic.ini file present in the directory, I get an error related to that; if I have one, but comment out the lines setting the two variables I need to modify, I get a different error; and whatever I supply there is not overwritten when I reset it in env.py.
An example of how I'm trying to reset them, where $DATABASE is the value of the --dbpath variable:
```python
dbpath = os.environ.get('DATABASE')
url = "sqlite:////Users/baileythegreen/Software/pyani/" + dbpath
config.set_main_option("sqlalchemy.url", url)
```
From what I understand, this is the correct API method for what I want to do (eliminate the .ini file), but I have yet to get it to work.
It's hard to comment on the first part without seeing the error message or practically taking a look.
In the second part, this line:
url = "sqlite:////Users/baileythegreen/Software/pyani/" + dbpath
looks suspect, to me. However, it's not clear what's in the DATABASE environment variable that gets read into dbpath.
The required format, described in an earlier post, is essentially:
sqlite:////absolute/path/to/file.db
It's not clear to me that your string, concatenated with dbpath, satisfies that. It looks like the hard-coded path to your pyani directory might also cause issues.
Have you tried:

1. taking args.dbpath from the CLI,
2. getting the absolute path with args.dbpath.resolve(),
3. setting this in an environment variable (and I'd suggest having a PYANI_ prefix so that related environment variables are effectively namespaced), then
4. url = f"sqlite:///{os.environ.get('PYANI_DBPATH')}"

and what was/is the result?
FEATURE REQUEST

alembic does upgrades/downgrades in-place. It would be useful, I think, to avoid borking an existing database with these updates, so we could back up the database before we start. For instance: pyanidb could be copied to pyanidb.YYYY-MM-DD_HH-MM-SS.bak before the in-place modification starts.
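Something like this would do it (a sketch; the timestamped name mirrors the example above):

```python
import shutil
from datetime import datetime
from pathlib import Path


def backup_database(dbpath: Path) -> Path:
    """Copy e.g. pyanidb to pyanidb.YYYY-MM-DD_HH-MM-SS.bak and return the
    backup path, so the in-place migration never touches the only copy."""
    stamp = datetime.now().strftime("%Y-%m-%d_%H-%M-%S")
    backup = dbpath.with_name(f"{dbpath.name}.{stamp}.bak")
    shutil.copy2(dbpath, backup)
    return backup
```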
To resummarise where this is (will potentially be expanded as I think of other things to add):
(Branch with code found here.)
- The issues with env.py not working have been resolved; the problem was that an incorrect env.py was edited.
- The pyani versiondb command creates an environment variable that is used to set the value for the database name, overriding the value specified in the alembic.ini file. [^1]
- The location of the migration scripts directory is fixed for alembic, rather than being user-determined. [^2]
- alembic's --package option creates two __init__.py files; one in the directory specified to alembic (default alembic/), one in the subdirectory alembic/versions/. [^3]
- New options (-n/--name, -c/--config) have been created to allow specifying an individual database name (in cases of a multidb alembic setup) or a config file not in the default location (./alembic.ini).

[^1]: This prevents the need to predict/dictate the database name.
[^2]: This allows us to provide such a directory, containing the necessary migration scripts, as part of a pyani install.
[^3]: I have decided not to use this option, as it creates an issue for devs: running import alembic in the parent directory of a project that contains an alembic/__init__.py file attempts to import what it sees as a local module named alembic, rather than the actual database version control package.
pyani versiondb
```
! pyani versiondb -h
usage: pyani versiondb [-h] [-l LOGFILE] [-v] [--debug] [--disable_tqdm] [--version] [--citation]
[--dbpath DBPATH]
(--upgrade [VERSION] | --downgrade [VERSION] | --dry-run [START:END])
[--alembic_exe ALEMBIC_EXE] [-n NAME] [-c FILE]
One of --upgrade, --downgrade, or --dry-run must be specified.
optional arguments:
-h, --help show this help message and exit
-l LOGFILE, --logfile LOGFILE
logfile location (default: None)
-v, --verbose report verbose progress to log (default: False)
--debug report debug messages to log (default: False)
--disable_tqdm Turn off tqdm progress bar (default: False)
--version
--citation
--dbpath DBPATH path to pyani database (default: .pyani/pyanidb)
--upgrade [VERSION] update an existing database to a newer schema; if no argument is given,
'head' will be used (default: None)
--downgrade [VERSION]
revert an existing database to a older schema; if no argument is given,
'base' will be used (default: None)
--dry-run [START:END]
produce the SQL that would be run in migrations, without altering the
database; start and end versions must be specified (default: None)
--alembic_exe ALEMBIC_EXE
path to alembic executable (default: alembic)
-n NAME, --name NAME used to specify an individual database in a multidb setup (default: None)
-c FILE, --config FILE
used to specify a config file for alembic (default: None)
```
I have chased down the issue underlying something weird I was seeing, which somewhat derailed my writing tests for pyani versiondb.

If I took a copy of an old database (pre-fastANI changes to the ORM), my code was able to upgrade/downgrade this without issue; but if I took a new one (post-fastANI changes to the ORM), I get the following traceback:
```
Traceback (most recent call last):
File "/Users/baileythegreen/Software/miniconda3/envs/pyani_dev/lib/python3.8/site-packages/sqlalchemy/engine/base.py", line 1802, in _execute_context
self.dialect.do_execute(
File "/Users/baileythegreen/Software/miniconda3/envs/pyani_dev/lib/python3.8/site-packages/sqlalchemy/engine/default.py", line 732, in do_execute
cursor.execute(statement, parameters)
sqlite3.OperationalError: error in table comparisons after drop column: no such column: kmersize
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/Users/baileythegreen/Software/miniconda3/envs/pyani_dev/bin/alembic", line 8, in <module>
sys.exit(main())
File "/Users/baileythegreen/Software/miniconda3/envs/pyani_dev/lib/python3.8/site-packages/alembic/config.py", line 588, in main
CommandLine(prog=prog).main(argv=argv)
File "/Users/baileythegreen/Software/miniconda3/envs/pyani_dev/lib/python3.8/site-packages/alembic/config.py", line 582, in main
self.run_cmd(cfg, options)
File "/Users/baileythegreen/Software/miniconda3/envs/pyani_dev/lib/python3.8/site-packages/alembic/config.py", line 559, in run_cmd
fn(
File "/Users/baileythegreen/Software/miniconda3/envs/pyani_dev/lib/python3.8/site-packages/alembic/command.py", line 366, in downgrade
script.run_env()
File "/Users/baileythegreen/Software/miniconda3/envs/pyani_dev/lib/python3.8/site-packages/alembic/script/base.py", line 563, in run_env
util.load_python_file(self.dir, "env.py")
File "/Users/baileythegreen/Software/miniconda3/envs/pyani_dev/lib/python3.8/site-packages/alembic/util/pyfiles.py", line 92, in load_python_file
module = load_module_py(module_id, path)
File "/Users/baileythegreen/Software/miniconda3/envs/pyani_dev/lib/python3.8/site-packages/alembic/util/pyfiles.py", line 108, in load_module_py
spec.loader.exec_module(module) # type: ignore
File "<frozen importlib._bootstrap_external>", line 843, in exec_module
File "<frozen importlib._bootstrap>", line 219, in _call_with_frames_removed
File "alembic/env.py", line 104, in <module>
run_migrations_online()
File "alembic/env.py", line 79, in run_migrations_online
context.run_migrations()
File "<string>", line 8, in run_migrations
File "/Users/baileythegreen/Software/miniconda3/envs/pyani_dev/lib/python3.8/site-packages/alembic/runtime/environment.py", line 851, in run_migrations
self.get_context().run_migrations(**kw)
File "/Users/baileythegreen/Software/miniconda3/envs/pyani_dev/lib/python3.8/site-packages/alembic/runtime/migration.py", line 620, in run_migrations
step.migration_fn(**kw)
File "/Users/baileythegreen/Software/pyani/alembic/versions/92f7f6b1626e_add_fastani_columns.py", line 28, in downgrade
op.drop_column("comparisons", "kmersize")
File "<string>", line 8, in drop_column
File "<string>", line 3, in drop_column
File "/Users/baileythegreen/Software/miniconda3/envs/pyani_dev/lib/python3.8/site-packages/alembic/operations/ops.py", line 2189, in drop_column
return operations.invoke(op)
File "/Users/baileythegreen/Software/miniconda3/envs/pyani_dev/lib/python3.8/site-packages/alembic/operations/base.py", line 392, in invoke
return fn(self, operation)
File "/Users/baileythegreen/Software/miniconda3/envs/pyani_dev/lib/python3.8/site-packages/alembic/operations/toimpl.py", line 89, in drop_column
operations.impl.drop_column(
File "/Users/baileythegreen/Software/miniconda3/envs/pyani_dev/lib/python3.8/site-packages/alembic/ddl/impl.py", line 333, in drop_column
self._exec(base.DropColumn(table_name, column, schema=schema))
File "/Users/baileythegreen/Software/miniconda3/envs/pyani_dev/lib/python3.8/site-packages/alembic/ddl/impl.py", line 197, in _exec
return conn.execute(construct, multiparams)
File "/Users/baileythegreen/Software/miniconda3/envs/pyani_dev/lib/python3.8/site-packages/sqlalchemy/engine/base.py", line 1289, in execute
return meth(self, multiparams, params, _EMPTY_EXECUTION_OPTS)
File "/Users/baileythegreen/Software/miniconda3/envs/pyani_dev/lib/python3.8/site-packages/sqlalchemy/sql/ddl.py", line 80, in _execute_on_connection
return connection._execute_ddl(
File "/Users/baileythegreen/Software/miniconda3/envs/pyani_dev/lib/python3.8/site-packages/sqlalchemy/engine/base.py", line 1381, in _execute_ddl
ret = self._execute_context(
File "/Users/baileythegreen/Software/miniconda3/envs/pyani_dev/lib/python3.8/site-packages/sqlalchemy/engine/base.py", line 1845, in _execute_context
self._handle_dbapi_exception(
File "/Users/baileythegreen/Software/miniconda3/envs/pyani_dev/lib/python3.8/site-packages/sqlalchemy/engine/base.py", line 2026, in _handle_dbapi_exception
util.raise_(
File "/Users/baileythegreen/Software/miniconda3/envs/pyani_dev/lib/python3.8/site-packages/sqlalchemy/util/compat.py", line 207, in raise_
raise exception
File "/Users/baileythegreen/Software/miniconda3/envs/pyani_dev/lib/python3.8/site-packages/sqlalchemy/engine/base.py", line 1802, in _execute_context
self.dialect.do_execute(
File "/Users/baileythegreen/Software/miniconda3/envs/pyani_dev/lib/python3.8/site-packages/sqlalchemy/engine/default.py", line 732, in do_execute
cursor.execute(statement, parameters)
sqlalchemy.exc.OperationalError: (sqlite3.OperationalError) error in table comparisons after drop column: no such column: kmersize
[SQL: ALTER TABLE comparisons DROP COLUMN kmersize]
(Background on this error at: https://sqlalche.me/e/14/e3q8)
```
This puzzled me, because the column is clearly present in the database.
Upon further inspection, it seems to be a difference in the unique constraints specified in the updated ORM, versus those achieved by my migration script to update the old database.
comparisons table in old database:

```
sqlite> .schema comparisons
CREATE TABLE comparisons (
comparison_id INTEGER NOT NULL,
query_id INTEGER NOT NULL,
subject_id INTEGER NOT NULL,
aln_length INTEGER,
sim_errs INTEGER,
identity FLOAT,
cov_query FLOAT,
cov_subject FLOAT,
program VARCHAR,
version VARCHAR,
fragsize INTEGER,
maxmatch BOOLEAN,
PRIMARY KEY (comparison_id),
UNIQUE (query_id, subject_id, program, version, fragsize, maxmatch),
FOREIGN KEY(query_id) REFERENCES genomes (genome_id),
FOREIGN KEY(subject_id) REFERENCES genomes (genome_id)
);
```
comparisons table in old database following upgrade:

```
sqlite> .schema comparisons
CREATE TABLE comparisons (
comparison_id INTEGER NOT NULL,
query_id INTEGER NOT NULL,
subject_id INTEGER NOT NULL,
aln_length INTEGER,
sim_errs INTEGER,
identity FLOAT,
cov_query FLOAT,
cov_subject FLOAT,
program VARCHAR,
version VARCHAR,
fragsize INTEGER,
maxmatch BOOLEAN, kmersize INTEGER, minmatch FLOAT, nonsense INTEGER,
PRIMARY KEY (comparison_id),
UNIQUE (query_id, subject_id, program, version, fragsize, maxmatch),
FOREIGN KEY(query_id) REFERENCES genomes (genome_id),
FOREIGN KEY(subject_id) REFERENCES genomes (genome_id)
);
```
comparisons table in new database:

```
CREATE TABLE comparisons (
comparison_id INTEGER NOT NULL,
query_id INTEGER NOT NULL,
subject_id INTEGER NOT NULL,
aln_length INTEGER,
sim_errs INTEGER,
identity FLOAT,
cov_query FLOAT,
cov_subject FLOAT,
program VARCHAR,
version VARCHAR,
fragsize INTEGER,
maxmatch BOOLEAN,
kmersize INTEGER,
minmatch FLOAT,
PRIMARY KEY (comparison_id),
UNIQUE (query_id, subject_id, program, version, fragsize, maxmatch, kmersize, minmatch),
FOREIGN KEY(query_id) REFERENCES genomes (genome_id),
FOREIGN KEY(subject_id) REFERENCES genomes (genome_id)
);
```
*sigh* Apparently dropping unique constraints is one of the things SQLite doesn't like to do, so this migration just got a lot harder.
I have also been thinking about the utility of the downgrade option, and have been led to question it. If someone has a new database that contains any fastANI runs, removal of the kmersize and minmatch columns might mean some entries are non-unique, which is problematic. This could be handled by just not supporting downgrading; or we could throw a warning about this; or remove any fastANI entries (and notify the user).
We're creating a backup, so none of these would be catastrophic, but one may be preferable.
The above issue (re: constraints) seems to be solved using the contextual version of the batch_alter_table() code mentioned earlier in this thread (https://github.com/widdowquinn/pyani/issues/378#issuecomment-1032081759). And the issue mentioned in that comment seems to have disappeared? I still don't know what that was about.
tl;dr: I have the database successfully upgrading and downgrading, new and old databases alike.
Now I just need to decide whether warnings are needed for downgrading, et cetera, and to finish my tests.
I have also been thinking about the utility of the downgrade option, and have been led to question it. If someone has a new database that contains any fastANI runs, removal of the kmersize and minmatch columns might mean some entries are non-unique, which is problematic. This could be handled by just not supporting downgrading; or we could throw a warning about this; or remove any fastANI entries (and notify the user). We're creating a backup, so none of these would be catastrophic, but one may be preferable.
My first thought is that we should aim to preserve the necessary metadata for a run, which includes - for fastANI - the kmersize/minmatch columns. If we cannot do so, we should loudly (with a warning) flag the problem to the user, and remove those runs from the database, with the argument that the backtransfer of fastANI runs is not compatible with the old schema.

I think we are justified as we back up, and do not destroy, the original, late-format database.
I would see downgrading the database from fastANI to non-fastANI support as being an unusual operation, but we might expect more up/downgrading with future use.
I agree it's likely to be an unusual case, but if I'm supplying code that will perform such a downgrade (and I currently am), then I need to think about such consequences for the user.
Will write code to remove the rows, and do so loudly, with specific messages pointing to the backup.
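Inside the downgrade migration, this could be as simple as the following sketch (assuming fastANI rows are identifiable by their program value, which I'd need to confirm):

```python
# Sketch: inside the migration's downgrade(), purge fastANI comparisons before
# dropping the kmersize/minmatch columns, since without those values the rows
# may violate the old schema's unique constraint.
from alembic import op


def downgrade():
    op.execute("DELETE FROM comparisons WHERE program = 'fastani'")
    # ... then drop the fastANI columns and restore the old constraint ...
```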
Ran into a new issue when working on the --dry-run tests:
It seems the modification of constraints requires reflection, which is fine when there is a connection to the database, but requires some extra steps to do offline. The error I got related to this (which is sufficiently arcane) was:
AttributeError: 'MockConnection' object has no attribute 'close'
The useful solution I have found is from the docs here (https://hellowac.github.io/alembic_doc/en/batch.html#working-in-offline-mode).
```python
# (imports added for context; the original example assumes them)
from sqlalchemy import Column, Integer, MetaData, String, Table
from alembic import op

meta = MetaData()
some_table = Table(
    'some_table', meta,
    Column('id', Integer, primary_key=True),
    Column('bar', String(50))
)

with op.batch_alter_table("some_table", copy_from=some_table) as batch_op:
    batch_op.add_column(Column('foo', Integer))
    batch_op.drop_column('bar')
```
This basically requires the table schema to be included in the migration file (or somewhere, anyway). A bit annoying, but easy enough to do, once you figure out you need to do it.
And I got a second error when dealing with the first:
sqlalchemy.exc.CircularDependencyError: Circular dependency detected. ('kmersize', 'minmatch')
The deal with this one seems to be that I hadn't yet created two dummy tables in the migration script—one that contained the columns being added/the updated constraints, and one that modeled the base settings—for the different manipulations to be performed on, respectively.
Saw the weird error mentioned in https://github.com/widdowquinn/pyani/issues/378#issuecomment-1032081759 again. This time I seem to have managed to resolve it by swapping the order of some lines, e.g.:
```python
batch_op.drop_constraint("base_reqs")
batch_op.add_column(sa.Column("kmersize", sa.Integer, default=None))
batch_op.add_column(sa.Column("minmatch", sa.Float, default=None))
batch_op.create_unique_constraint(
    "fastani_reqs",
    [
        "query_id",
        "subject_id",
        "program",
        "version",
        "fragsize",
        "maxmatch",
        "kmersize",
        "minmatch",
    ],
)
```
instead of:
```python
batch_op.add_column(sa.Column("kmersize", sa.Integer, default=None))
batch_op.add_column(sa.Column("minmatch", sa.Float, default=None))
batch_op.drop_constraint("base_reqs")
batch_op.create_unique_constraint(
    "fastani_reqs",
    [
        "query_id",
        "subject_id",
        "program",
        "version",
        "fragsize",
        "maxmatch",
        "kmersize",
        "minmatch",
    ],
)
```
I still do not know why this makes sense.
Tests are passing locally for me—exclusive of dry-run stuff, which I have decided is not an immediate priority (because I expect that will take some time yet). However, tests fail in CircleCI because I have not managed to get a working copy of sqlite3 installed there. I am rather puzzled by this, because the file is there, it seems to have all of the right permissions, yet it can't be called. I was trying to fix this in circleci on branch circleci-editor/74/alembic_378.
I completely scrapped the tests from before, rewriting them based on the meeting last week. They now work by first creating a dump from an input database, using that dump to initialise a new database, which is then upgraded or downgraded, and then comparing a dump of the up/downgraded database against a dump of a target. This is done with subprocess calls because I have yet to find something that looks like the right way to do that in sqlalchemy, and it really seems like that's not what the package is intended for. But this is why sqlite3 needs to be installed.
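The core of that approach, as a sketch (assuming sqlite3 is on $PATH; the function name is illustrative):

```python
import subprocess
from pathlib import Path


def dump_sql(dbpath: Path) -> str:
    """Return a text dump of an SQLite database via the sqlite3 CLI; dumps of
    a migrated database and a target database can then be compared directly."""
    result = subprocess.run(
        ["sqlite3", str(dbpath), ".dump"],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout
```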
When I've had similar CircleCI issues in the past, I've had to pay particular attention to $PATH on the CircleCI VM.

It can be useful for diagnosis to insert a step in the CircleCI config that checks for the presence of the executable you want in the $PATH and writes it to an artifact file (which you can collect after the run). See e.g. https://circleci.com/docs/2.0/artifacts/ if you've not tried this before.
On the whole your new approach to the tests sounds good.
My less sophisticated approach thus far has been to put extra shell commands in the Makefile action:
```
wget https://www.sqlite.org/2022/sqlite-tools-linux-x86-3380100.zip
unzip sqlite-tools-linux-x86-3380100.zip
ls -l sqlite-tools-linux-x86-3380100/sqlite3
echo $PWD
echo 'export PATH=$PWD/sqlite-tools-linux-x86-3380100/sqlite3:$PATH' >> $BASH_ENV
source $BASH_ENV
echo $PATH
/home/circleci/repo/sqlite-tools-linux-x86-3380100/sqlite3 -help
sqlite3 -help
```
which gives:
```
--2022-03-14 22:15:44-- https://www.sqlite.org/2022/sqlite-tools-linux-x86-3380100.zip
Resolving www.sqlite.org (www.sqlite.org)... 45.33.6.223, 2600:3c00::f03c:91ff:fe96:b959
Connecting to www.sqlite.org (www.sqlite.org)|45.33.6.223|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 2232504 (2.1M) [application/zip]
Saving to: ‘sqlite-tools-linux-x86-3380100.zip’
sqlite-tools-linux- 100%[===================>] 2.13M 9.51MB/s in 0.2s
2022-03-14 22:15:44 (9.51 MB/s) - ‘sqlite-tools-linux-x86-3380100.zip’ saved [2232504/2232504]
Archive: sqlite-tools-linux-x86-3380100.zip
creating: sqlite-tools-linux-x86-3380100/
inflating: sqlite-tools-linux-x86-3380100/sqlite3
inflating: sqlite-tools-linux-x86-3380100/sqlite3_analyzer
inflating: sqlite-tools-linux-x86-3380100/sqldiff
-rwxrwxr-x 1 circleci circleci 1202884 Mar 12 14:05 sqlite-tools-linux-x86-3380100/sqlite3
/home/circleci/repo
/home/circleci/repo/sqlite-tools-linux-x86-3380100/sqlite3:/home/circleci/repo:/home/circleci/repo/blast-2.2.26/bin:/home/circleci/repo:/home/circleci/repo/blast-2.2.26/bin:/home/circleci/.local/bin:/home/circleci/bin:/usr/local/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
/bin/bash: line 7: /home/circleci/repo/sqlite-tools-linux-x86-3380100/sqlite3: No such file or directory
Exited with code exit status 127
CircleCI received exit code 127
```
The failure has occurred whenever I try to invoke sqlite3, whether using the full path, or not. I have played with what I add to $PATH, and even downloaded and unzipped the Linux compiled version locally, so I could make sure I understood the structure. I have a feeling the issue is less the existence of the file, and more to do with the permissions, except I can't see any issue with those.

I don't know if any of this obviates your suggestion with the artefacts, but I'll start looking into that, as I haven't used them before.
My less sophisticated approach thus far has been to put extra shell commands in the Makefile action:
If you're modifying the Makefile in the repo, then I think that's not a good idea.

Changes specific to CircleCI integration should be made in the .circleci/config.yaml file. This keeps them within CircleCI scope - especially important if the issue is CircleCI specific. It's easy to forget/fail to clean up correctly after changing more global files to fix a local problem.

Examples of how to handle a specific installation for CircleCI can be seen in that config.yaml file, e.g.
```yaml
- run:
    name: install legacy BLAST
    command: |
      curl -o legacy_blast.tar.gz ftp://ftp.ncbi.nlm.nih.gov/blast/executables/legacy.NOTSUPPORTED/2.2.26/blast-2.2.26-x64-linux.tar.gz
      tar -zxf legacy_blast.tar.gz
      echo 'export PATH=$PWD/blast-2.2.26/bin:$PATH' >> $BASH_ENV
      source $BASH_ENV
```
Note that the name field is descriptive, and we can install binaries prior to the pip setup calls.

In the same section you should be able to echo environment variables/redirect STDOUT/STDERR as appropriate to named files, which will show up in the CircleCI artifacts for the build.
This kind of workflow keeps everything involved in debugging/fixing the CircleCI runs neatly in one file, and easily removable when it is no longer required.
Since pyani v0.3 should not pass any tests if sqlite3 is not present, a first step might be to check if the executable is in the $PATH with which (I assume you've done this) and, if not, try a find to see if the executable can be located. If it can, then modifying the $PATH might be a way forward.
If you need to install the executable, it may be simpler to use the built-in package management, which should place the executable somewhere visible, e.g.:
```yaml
- run:
    name: install third-party tools
    command: |
      sudo apt-get install csh mummer ncbi-blast+ sqlite3
```
My less sophisticated approach thus far has been to put extra shell commands in the Makefile action:
If you're modifying the Makefile in the repo, then I think that's not a good idea.
I described what I did poorly/inaccurately. I have been editing the .circleci/config.yml file. What I posted is what I added there as a new workflow step.
Have you tried the apt-get installation route? It's usually the least painful.
No, I don't tend to think about that because I've never really been able to use it. You might be right about that being the way forward, though.
I did look for a conda option, but couldn't figure out the right way to specify the necessary channel.

Would it be apt-get or apt install? Or apt-get install? I am unsure of the differences between them.
Try appending sqlite3 to the existing third party tool installation, as noted above.
Summary:

Use alembic or similar to manage database migration.

Description:

In discussions with users, modifications to the SQLite database schema have caused issues where updating their v0.3 installation has led to incompatibility with an existing database, and resulting errors.

We could potentially avoid this by carefully versioning the SQLite/SQLAlchemy schema and using alembic or similar to aid migration between versions in most cases.

If users maintain large analyses in their databases, we should avoid discouraging them from trusting the package by breaking backwards-compatibility.
Having used this before with Flask applications, I'm confident we could usually achieve seamless in-place upgrades of users' existing local databases.