zulip / docker-zulip

Container configurations, images, and examples for Zulip.
https://zulip.com/
Apache License 2.0
579 stars 241 forks source link

Install flow requires manual user creation #227

Open t3hmrman opened 4 years ago

t3hmrman commented 4 years ago

I'm not sure if this belongs better on the zulip repository, but I just installed 2.1.1-0 and went through the installation process and found a rough patch: creating a realm did not successfully create the associated user account.

A UserProfile does not exist error occurred (500), which has happened in the past, but unlike the past I was unable to log in afterwards. I had to exec into the container and run manage.py create_user which gave the same UserProfile error, but I could log in after.

Zulip would be a lot easier to deploy and setup if:

timabbott commented 4 years ago

Can you post the exceptions from this?

The way this is supposed to work is:

  1. On zulip container first startup, it runs a few commands to initialize the database, including creating a few "system users" like "Notification Bot" that are used to send system messages.
  2. manage.py generate_realm_creation_link gets run at the end of that process.
  3. You use a realm creation link to create your first organization (including an administrator user for that realm) through a web flow.

My guess as to what might have caused what you describe is if you destroyed the storage of the postgres container (or replaced the postgres container) sometime between step 1 and step 2, you'd have an improperly initialized database without those system users, and the code that expects them to exist will 500. This sort of failure mode is very unlikely with the traditional installer, but Docker makes it easy to do this sort of thing.

I think we need to write some code to provide a better error message in the situation that Zulip's database has not been initialized properly (probably reproducible without Docker by rerunning postgres-init-db after the installer runs) so that users can figure out what the problem is and resolve it themselves.

The other thing we could do is change the startup logic for the zulip container to use a more authoritative test for whether the database is already initialized; I think it uses a stamp file to avoid rerunning initialization, whereas it probably could be checking e.g. Realm.objects.exists().

@andersk thoughts?

t3hmrman commented 4 years ago

It looks like what you described is certainly what's happening, here's the exception that I got:

Logger django.request, from module django.core.handlers.exception line 135:
Error generated by Anonymous user (not logged in) on subdomain.domain.tld deployment

Traceback (most recent call last):
  File "/srv/zulip-venv-cache/546c5f60162d0b5be09d52adebaa995f9c609da3/zulip-py3-venv/lib/python3.6/site-packages/django/core/handlers/exception.py", line 41, in inner
    response = get_response(request)
  File "/srv/zulip-venv-cache/546c5f60162d0b5be09d52adebaa995f9c609da3/zulip-py3-venv/lib/python3.6/site-packages/django/core/handlers/base.py", line 187, in _get_response
    response = self.process_exception_by_middleware(e, request)
  File "/srv/zulip-venv-cache/546c5f60162d0b5be09d52adebaa995f9c609da3/zulip-py3-venv/lib/python3.6/site-packages/django/core/handlers/base.py", line 185, in _get_response
    response = wrapped_callback(request, *callback_args, **callback_kwargs)
  File "./zerver/decorator.py", line 133, in wrapper
    return func(request, *args, **kwargs)
  File "./zerver/views/registration.py", line 232, in accounts_register
    realm = do_create_realm(string_id, realm_name)
  File "./zerver/lib/actions.py", line 3779, in do_create_realm
    admin_realm = get_system_bot(settings.NOTIFICATION_BOT).realm
  File "./zerver/lib/cache.py", line 168, in func_with_caching
    val = func(*args, **kwargs)
  File "./zerver/models.py", line 2130, in get_system_bot
    return UserProfile.objects.select_related().get(email__iexact=email.strip())
  File "/srv/zulip-venv-cache/546c5f60162d0b5be09d52adebaa995f9c609da3/zulip-py3-venv/lib/python3.6/site-packages/django/db/models/query.py", line 380, in get
    self.model._meta.object_name
zerver.models.DoesNotExist: UserProfile matching query does not exist.

Deployed code:
- ZULIP_VERSION: 2.1.1
- version: docker

Request info:
- path: /accounts/register/
- POST: {'csrfmiddlewaretoken': ['**********'], 'realm_name': ['Realm Name'], 'realm_in_root_domain': ['on'], 'realm_subdomain': [''], 'key': ['**********'], 'timezone': ['Asia/Tokyo'], 'full_name': ['User'], 'password': ['**********'], 'next': ['']}
- REMOTE_ADDR: "['xxx.xxx.xxx.xxx']"
- QUERY_STRING: "['']"
- SERVER_NAME: "['']"

it is looking for the NOTIFICATION_BOT user.

It's certainly possible that the DB container went down during steps 1 & 2 -- is there any way to check this initialization or run it manually?

timabbott commented 4 years ago

I think you can just delete .initiated in the zulip container's data directory.

Read zulipFirstStartInit in entrypoint.sh for details.

But I think the correct fix would be to replace the check in zulipFirstStartInit with if $(su zulip -c "/home/zulip/deployments/current/manage.py list_realms") | grep -q zulipinternal so that we're not trying to hand-synchronize state between the zulip container's filesystem and the database; can you try testing that solution?

t3hmrman commented 4 years ago

I definitely can -- I'll try both and report back here -- I'll re-make the deployment from scratch as well.

I personally think this functionality should be inside zulip itself -- is that a likely possibility? I can understand if this is expected to stay in the domain of zulip-related setup rather than zulip itself but it seems like this particular issue could go either way.

[EDIT] - Still haven't run the test, but another note on this -- could zulip store all it's sort of startup status data in the DB? That would go a long way to making it easier to deploy Zulip to various environments and reduce the number of things to keep track of.

timabbott commented 4 years ago

could zulip store all it's sort of startup status data in the DB?

That's what the change I proposed above is intended to fix -- use some existing state in the database rather than the .initiated stamp file to determine if the database is already initialized.

On the front of "inside Zulip", I guess I'm not sure what you had in mind. For first-user creation, one has to think about security -- if anyone who visits the URL of an unconfigured Zulip server can create an initial administrator user, that could be problematic if an administrator sets up a server on the public Internet and then takes a while to set it up.

That's why we have this "realm-creation link" flow where one generates a secure link, and then opens that in their browser to beign a secure flow.

t3hmrman commented 4 years ago

That's what the change I proposed above is intended to fix -- use some existing state in the database rather than the .initiated stamp file to determine if the database is already initialized.

Ahh but I mean moving the check inside the zulip codebase @ startup time instead. I was imagining moving the logic in zulipFirstStartInit into the django code itself

On the front of "inside Zulip", I guess I'm not sure what you had in mind. For first-user creation, one has to think about security -- if anyone who visits the URL of an unconfigured Zulip server can create an initial administrator user, that could be problematic if an administrator sets up a server on the public Internet and then takes a while to set it up.

I agree -- what I meant to suggest was making the pseudo random id that gets put on the realm creation link suppliable through enf -- so if I supply REALM_CREATION_SLUG=aSvZlFxKzP then having the realm creation link be generated automatically to have that slug. Assuming realm creation is only accessible once from a given slug, and admins generate this value randomly, and the option isn't turned on by default, it would be pretty useful for starting zulip instances.

If even that is too much risk (IMO the risk is identical to someone guessing your realm creation slug and getting to the URL before you do, there are a few other options):

timabbott commented 4 years ago

Ahh but I mean moving the check inside the zulip codebase @ startup time instead. I was imagining moving the logic in zulipFirstStartInit into the django code itself

I don't think that makes sense, because this code is normally run by the Zulip installer script in the non-Docker setting. And we don't want to run it on Django process startup, because there are more than one of those (so we'd have to worry about races) and it'd be needlessly expensive.

This belongs in the Docker setup logic, because only in the Docker case are we having a system initialize its database on boot.

t3hmrman commented 4 years ago

Ahhh I see -- thanks for the explanation, so it looks like moving it into zulip itself won't happen, so the fix would definitely be in this repo.

Will try and get to trying the newer solution today

timabbott commented 4 years ago

Following https://github.com/zulip/zulip/pull/13888, with Zulip master it should be possible to just delete the block of code in entrypoint.sh around /.initiated. Probably worth backporting the relevant patches into the Zulip 2.1.3 branch. @mateuszmandera can you put together a backported branch that I can pull into the 2.1.x branch series for this purpose? I'm not sure exactly how many commits on that path are since 2.1.x.