wmde / wikibase-release-pipeline

Lexemes not appearing in sparql queries despite clearly existing in Wikibase #774

Closed megamattc closed 1 month ago

megamattc commented 2 months ago

Hello,

I have set up a WBS instance (current version) on a VM running Ubuntu 24.04, with the WikibaseLexeme extension installed. The extension works to the extent that I can create Lexemes manually and see them appear both in the main search box and on pages like 'Recent Changes'. In addition, my SPARQL query service finds the regular P- and Q-items I have created. However, it does not find any Lexemes with, for instance, the following standard query:

SELECT DISTINCT
  ?lexeme ?lemma ?lexcat
WHERE 
{
  ?lexeme a ontolex:LexicalEntry .
  ?lexeme wikibase:lemma ?lemma .
  ?lexeme wikibase:lexicalCategory ?lexcat .
}

The query executes but does not return any of the lexemes I created.
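
One way to narrow this down is to ask the endpoint how many lexeme entries it holds at all, rather than listing them. The following is only a sketch: the helper names `lexeme_count` and `run_query` are made up for illustration, and the endpoint URL you pass in is an assumption that depends on how your query service is exposed.

```python
import json
import urllib.parse
import urllib.request

# The prefix is spelled out so the query does not rely on endpoint defaults.
LEXEME_COUNT_QUERY = """
PREFIX ontolex: <http://www.w3.org/ns/lemon/ontolex#>
SELECT (COUNT(?lexeme) AS ?n) WHERE { ?lexeme a ontolex:LexicalEntry . }
"""


def lexeme_count(result: dict) -> int:
    """Read ?n from a SPARQL JSON result; return 0 if the binding is absent."""
    bindings = result.get("results", {}).get("bindings", [])
    if not bindings or "n" not in bindings[0]:
        return 0
    return int(bindings[0]["n"]["value"])


def run_query(endpoint: str, query: str) -> dict:
    """POST a query to a SPARQL endpoint and return the decoded JSON result."""
    data = urllib.parse.urlencode({"query": query}).encode("utf-8")
    request = urllib.request.Request(
        endpoint,
        data=data,
        headers={"Accept": "application/sparql-results+json"},
    )
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.load(response)
```

Calling `lexeme_count(run_query(endpoint, LEXEME_COUNT_QUERY))` with your own endpoint URL should return 0 if no lexeme data has reached the triple store, which would point at the updater rather than at the query itself.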

I do not understand why this is. I have checked the container logs and found no errors, although I do not understand Wikibase well enough to tell from those logs where the problem lies. For context, I describe below the modifications I made to the docker-compose.yml file and the other notable steps I took when building the Wikibase.

The full docker-compose.yml file is at the end of this post. In particular:

In the wikibase service I mount a copy of WikibaseLexeme:

volumes:
      - ./config:/config
      - wikibase-image-data:/var/www/html/images
      - quickstatements-data:/quickstatements/data
      - ./WikibaseLexeme:/var/www/html/extensions/WikibaseLexeme

In the wdqs-updater service I define three environment variables:

environment:
      - WIKIBASE_MAX_DAYS_BACK=${WIKIBASE_MAX_DAYS_BACK}
      - WIKIBASE_HOST=${WIKIBASE_PUBLIC_HOST}
      - WIKIBASE_SCHEME=https

(I forget why I did this, but it was not for Lexemes.)

I first launched the Docker containers so that the default ./config/LocalSettings.php would be generated. Then I edited ./config/LocalSettings.php so that it contained the required lines:

wfLoadExtension('WikibaseLexeme');
define('Lexeme', 146);
define('Lexeme_talk', 147);

I then ran docker compose down and docker compose up --wait to reinitialize the containers with the modified ./config/LocalSettings.php file.

Finally, because the SPARQL query service initially does not update as it should (see https://www.mediawiki.org/wiki/Wikibase/FAQ/en#Why_doesn't_the_query_service_update?), I followed the recommendation on that page to reset the updater's starting point:

sudo docker compose stop wdqs-updater
sudo docker compose run --rm wdqs-updater bash
./runUpdate.sh -h http://"$WDQS_HOST":"$WDQS_PORT" -- --wikibaseUrl "$WIKIBASE_SCHEME"://"$WIKIBASE_HOST" --conceptUri "$WIKIBASE_SCHEME"://"$WIKIBASE_HOST" --entityNamespaces "$WDQS_ENTITY_NAMESPACES" --init --start 20240928120000

CTRL+C

sudo docker compose start wdqs-updater
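
The --start value in the command above is a 14-digit timestamp of the form YYYYMMDDhhmmss. As a minimal sketch, such a value can be computed for a given number of days back instead of being typed by hand. The function name `start_timestamp` is hypothetical, and treating the timestamp as UTC is an assumption based on how MediaWiki timestamps generally work.

```python
from datetime import datetime, timedelta, timezone


def start_timestamp(days_back: int, now: datetime) -> str:
    """Format the moment `days_back` days before `now` as YYYYMMDDhhmmss."""
    if days_back < 0:
        raise ValueError("days_back must not be negative")
    return (now - timedelta(days=days_back)).strftime("%Y%m%d%H%M%S")


# A fixed "now" keeps the example reproducible (assumed to be UTC).
now = datetime(2024, 10, 5, 12, 0, 0, tzinfo=timezone.utc)
print(start_timestamp(7, now))  # 20240928120000
```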

Then, as I said, the Wikibase appears to work normally, except that SPARQL queries do not return Lexemes, even with the most general of queries (e.g. SELECT * WHERE {?x ?y ?z .}).

Does anyone know what the problem is?

For reference, the full docker-compose.yml file is below:

name: wbs-deploy

services:
  # --------------------------------------------------
  # A. CORE WIKIBASE SUITE SERVICES
  # --------------------------------------------------

  wikibase:
    image: wikibase/wikibase:3
    depends_on:
      mysql:
        condition: service_healthy
    restart: unless-stopped
    ports:
      - 8880:80
    labels:
      - "traefik.enable=true"
      - "traefik.http.routers.wikibase.rule=Host(`${WIKIBASE_PUBLIC_HOST}`)"
      - "traefik.http.routers.wikibase.entrypoints=websecure"
      - "traefik.http.routers.wikibase.tls.certresolver=letsencrypt"
    volumes:
      - ./config:/config
      - wikibase-image-data:/var/www/html/images
      - quickstatements-data:/quickstatements/data
      - ./WikibaseLexeme:/var/www/html/extensions/WikibaseLexeme
    environment:
      MW_ADMIN_NAME: ${MW_ADMIN_NAME}
      MW_ADMIN_PASS: ${MW_ADMIN_PASS}
      MW_ADMIN_EMAIL: ${MW_ADMIN_EMAIL}
      MW_WG_SERVER: https://${WIKIBASE_PUBLIC_HOST}
      DB_SERVER: mysql:3306
      DB_USER: ${DB_USER}
      DB_PASS: ${DB_PASS}
      DB_NAME: ${DB_NAME}
      ELASTICSEARCH_HOST: elasticsearch
      QUICKSTATEMENTS_PUBLIC_URL: https://${QUICKSTATEMENTS_PUBLIC_HOST}
    healthcheck:
      test: curl --silent --fail localhost/wiki/Main_Page
      interval: 500s
      start_period: 5m

  wikibase-jobrunner:
    image: wikibase/wikibase:3
    command: /jobrunner-entrypoint.sh
    depends_on:
      wikibase:
        condition: service_healthy
    restart: unless-stopped
    volumes_from:
      - wikibase

  mysql:
    image: mariadb:10.11
    restart: unless-stopped
    volumes:
      - mysql-data:/var/lib/mysql
    environment:
      MYSQL_DATABASE: ${DB_NAME}
      MYSQL_USER: ${DB_USER}
      MYSQL_PASSWORD: ${DB_PASS}
      MYSQL_RANDOM_ROOT_PASSWORD: yes
    healthcheck:
      test: healthcheck.sh --connect --innodb_initialized
      start_period: 1m
      interval: 20s
      timeout: 5s

  # --------------------------------------------------
  # B. EXTRA WIKIBASE SUITE SERVICES
  # --------------------------------------------------

  # To disable Elasticsearch and use default MediaWiki search functionality remove
  # the elasticsearch service, and the MW_ELASTIC_* vars from wikibase_variables
  # at the top of this file.
  elasticsearch:
    image: wikibase/elasticsearch:1
    restart: unless-stopped
    volumes:
      - elasticsearch-data:/usr/share/elasticsearch/data
    environment:
      discovery.type: single-node
      ES_JAVA_OPTS: -Xms512m -Xmx512m -Dlog4j2.formatMsgNoLookups=true
    healthcheck:
      test: curl --silent --fail localhost:9200
      interval: 10s
      start_period: 2m

  wdqs:
    image: wikibase/wdqs:1
    command: /runBlazegraph.sh
    depends_on:
      wikibase:
        condition: service_healthy
    restart: unless-stopped
    # Set number of files ulimit high enough, otherwise blazegraph will abort with:
    # library initialization failed - unable to allocate file descriptor table - out of memory
    # Appeared on Docker 24.0.5, containerd 1.7.9, Linux 6.6.6, NixOS 23.11
    ulimits:
      nofile:
        soft: 32768
        hard: 32768
    volumes:
      - wdqs-data:/wdqs/data
    healthcheck:
      test: curl --silent --fail localhost:9999/bigdata/namespace/wdq/sparql
      interval: 10s
      start_period: 2m

  wdqs-updater:
    image: wikibase/wdqs:1
    command: /runUpdate.sh
    depends_on:
      wdqs:
        condition: service_healthy
    restart: unless-stopped
    # Set number of files ulimit high enough, otherwise blazegraph will abort with:
    # library initialization failed - unable to allocate file descriptor table - out of memory
    # Appeared on Docker 24.0.5, containerd 1.7.9, Linux 6.6.6, NixOS 23.11
    ulimits:
      nofile:
        soft: 32768
        hard: 32768
    environment:
      - WIKIBASE_MAX_DAYS_BACK=${WIKIBASE_MAX_DAYS_BACK}
      - WIKIBASE_HOST=${WIKIBASE_PUBLIC_HOST}
      - WIKIBASE_SCHEME=https

  wdqs-proxy:
    image: wikibase/wdqs-proxy:1
    depends_on:
      wdqs:
        condition: service_healthy
    restart: unless-stopped

  wdqs-frontend:
    image: wikibase/wdqs-frontend:1
    depends_on:
      - wdqs-proxy
    restart: unless-stopped
    ports:
      - 8834:80
    labels:
      - "traefik.enable=true"
      - "traefik.http.routers.wdqs-frontend.rule=Host(`${WDQS_FRONTEND_PUBLIC_HOST}`)"
      - "traefik.http.routers.wdqs-frontend.entrypoints=websecure"
      - "traefik.http.routers.wdqs-frontend.tls.certresolver=letsencrypt"
    environment:
      WDQS_HOST: wdqs-proxy
    healthcheck:
      test: curl --silent --fail localhost
      interval: 10s
      start_period: 2m

  quickstatements:
    image: wikibase/quickstatements:1
    depends_on:
      wikibase:
        condition: service_healthy
    restart: unless-stopped
    ports:
      - 8840:80
    volumes:
      - quickstatements-data:/quickstatements/data
    labels:
      - "traefik.enable=true"
      - "traefik.http.routers.quickstatements.rule=Host(`${QUICKSTATEMENTS_PUBLIC_HOST}`)"
      - "traefik.http.routers.quickstatements.entrypoints=websecure"
      - "traefik.http.routers.quickstatements.tls.certresolver=letsencrypt"
    environment:
      QUICKSTATEMENTS_PUBLIC_URL: https://${QUICKSTATEMENTS_PUBLIC_HOST}
      WIKIBASE_PUBLIC_URL: https://${WIKIBASE_PUBLIC_HOST}
    healthcheck:
      test: curl --silent --fail localhost
      interval: 10s
      start_period: 2m

  # --------------------------------------------------
  # C. REVERSE PROXY AND SSL SERVICES
  # --------------------------------------------------

  traefik:
    image: traefik:3.1
    command:
      # Basic setup
      - "--providers.docker=true"
      - "--providers.docker.exposedbydefault=false"
      - "--entrypoints.web.address=:80"
      - "--entrypoints.websecure.address=:443"
      # Redirects all http request to https
      - "--entrypoints.web.http.redirections.entryPoint.to=websecure"
      - "--entrypoints.web.http.redirections.entryPoint.scheme=https"
      - "--entrypoints.web.http.redirections.entrypoint.permanent=true"
      # ACME SSL certificate generation
      - "--certificatesresolvers.letsencrypt.acme.httpchallenge=true"
      - "--certificatesresolvers.letsencrypt.acme.httpchallenge.entrypoint=web"
      - "--certificatesresolvers.letsencrypt.acme.email=${MW_ADMIN_EMAIL}"
      - "--certificatesresolvers.letsencrypt.acme.storage=/letsencrypt/acme.json"
      # Uncomment this line to only test ssl generation first, makes sure you don't run into letsencrypt rate limits
      #- "--certificatesresolvers.letsencrypt.acme.caserver=https://acme-staging-v02.api.letsencrypt.org/directory"
    ports:
      - 80:80
      - 443:443
    volumes:
      - /var/run/docker.sock:/var/run/docker.sock:ro
      - traefik-letsencrypt-data:/letsencrypt

volumes:
  # A. CORE WIKIBASE SUITE SERVICES DATA
  wikibase-image-data:
  mysql-data:
  # B. EXTRA WIKIBASE SUITE SERVICES DATA
  wdqs-data:
  elasticsearch-data:
  quickstatements-data:
  # C. REVERSE PROXY AND SSL SERVICES DATA
  traefik-letsencrypt-data:

rti commented 1 month ago

Hi @megamattc,

Thanks for sharing that.

My first guess would be that you need to configure WDQS_ENTITY_NAMESPACES on WDQS so that your lexeme namespace gets synced as well.

So your wdqs-updater service should look like this:

  wdqs-updater:
    image: wikibase/wdqs:1
    command: /runUpdate.sh
    depends_on:
      wdqs:
        condition: service_healthy
    restart: unless-stopped
    # Set number of files ulimit high enough, otherwise blazegraph will abort with:
    # library initialization failed - unable to allocate file descriptor table - out of memory
    # Appeared on Docker 24.0.5, containerd 1.7.9, Linux 6.6.6, NixOS 23.11
    ulimits:
      nofile:
        soft: 32768
        hard: 32768
    environment:
      - WDQS_ENTITY_NAMESPACES=120,122,146

Does this make any difference?

Best, Robert
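
To make the role of that list concrete, here is a small sketch that checks whether a comma-separated WDQS_ENTITY_NAMESPACES value covers a set of required namespace IDs. The function names are hypothetical, and the IDs are taken from this thread (120, 122, and the Lexeme namespace 146), not from any authoritative default.

```python
def parse_namespaces(value: str) -> set[int]:
    """Turn a comma-separated list such as "120,122,146" into a set of IDs."""
    return {int(part) for part in value.split(",") if part.strip()}


def missing_namespaces(configured: str, required: set[int]) -> set[int]:
    """Return the required namespace IDs that the configured list lacks."""
    return required - parse_namespaces(configured)


# IDs as used in this thread; adjust them to your own installation.
print(missing_namespaces("120,122", {120, 122, 146}))      # {146}
print(missing_namespaces("120,122,146", {120, 122, 146}))  # set()
```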

megamattc commented 1 month ago

Woohoo!!

Yes! This long-standing irritation has been solved. Thank you! Perhaps I should have read the Readme.md that you link to more carefully.

On the other hand, I wish this information were included in the basic installation instructions for the WikibaseLexeme extension on the MediaWiki site. They mention, somewhat elliptically, that one must 'define the namespace' for Lexemes, but when I searched for what this meant, I only found references to adding the lines I mentioned above to LocalSettings.php, i.e.

wfLoadExtension('WikibaseLexeme');
define('Lexeme', 146);
define('Lexeme_talk', 147);

However, setting WDQS_ENTITY_NAMESPACES in the docker-compose.yml file is also necessary.
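
Since the two settings have to agree, a rough sketch like the following could derive the WDQS_ENTITY_NAMESPACES value from the define() lines in LocalSettings.php. The regular expression and function names are illustrative assumptions, and the base list [120, 122] is simply taken from the value suggested in this thread.

```python
import re

# Matches lines such as: define('Lexeme', 146);
DEFINE_RE = re.compile(r"define\(\s*['\"](\w+)['\"]\s*,\s*(\d+)\s*\)")


def defined_constants(php_source: str) -> dict[str, int]:
    """Collect integer constants declared with define() in PHP source text."""
    return {name: int(value) for name, value in DEFINE_RE.findall(php_source)}


def entity_namespaces_value(base: list[int], php_source: str,
                            constant: str = "Lexeme") -> str:
    """Append the namespace ID of `constant` to `base` and join with commas."""
    ids = list(base)
    namespace_id = defined_constants(php_source).get(constant)
    if namespace_id is not None and namespace_id not in ids:
        ids.append(namespace_id)
    return ",".join(str(i) for i in ids)


local_settings = """
wfLoadExtension('WikibaseLexeme');
define('Lexeme', 146);
define('Lexeme_talk', 147);
"""
print(entity_namespaces_value([120, 122], local_settings))  # 120,122,146
```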