wellcomecollection / platform

Wellcome Collection Digital Platform
https://developers.wellcomecollection.org/
MIT License

Investigate vector indexing in catalogue images index #5822

Closed by StepanBrychta 1 week ago

StepanBrychta commented 2 weeks ago

We should investigate whether we still need to index the features1 and features2 vectors in the catalogue pipeline's images_indexed index. The relevant portion of the mapping file currently looks like this, both locally and in the production ES cluster:

    "vectorValues": {
      "properties": {
        "features1": {
          "type": "dense_vector",
          "dims": 2048,
          "index": true,
          "similarity": "cosine"
        },
        "features2": {
          "type": "dense_vector",
          "dims": 2048,
          "index": true,
          "similarity": "cosine"
        }
      }
    }

However, the same portion looked different in the original mapping file we defined (note the absence of the "index": true and "similarity" properties):

    "vectorValues": {
      "properties": {
        "features1": {
          "type": "dense_vector",
          "dims": 2048
        },
        "features2": {
          "type": "dense_vector",
          "dims": 2048
        }
      }
    }

The current version of Elasticsearch adds "index": true by default, but this might not have been the case in older versions. We suspect that we specified these mappings while working with an ES version that did not index dense vectors by default, and later upgraded to a version that does.
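To make the difference between the two mapping snippets concrete, here is a minimal Python sketch that walks a mapping's "properties" and reports, for each dense_vector field, whether "index" is set explicitly. The helper name dense_vector_index_flags is hypothetical, and the two dictionaries are copied by hand from the snippets above rather than fetched from a cluster. A value of None means the field relies on whatever default the running Elasticsearch version applies.

```python
from typing import Any, Dict, Iterator, Optional, Tuple


def dense_vector_index_flags(
    properties: Dict[str, Any], prefix: str = ""
) -> Iterator[Tuple[str, Optional[bool]]]:
    """Yield (dotted field path, explicit "index" value or None) for every
    dense_vector field found under a mapping's "properties" dictionary.

    Hypothetical helper for illustration; not part of the platform code.
    """
    for name, spec in properties.items():
        path = f"{prefix}{name}"
        if spec.get("type") == "dense_vector":
            # None means "index" is not set explicitly in the mapping.
            yield path, spec.get("index")
        if "properties" in spec:
            # Recurse into object fields such as vectorValues.
            yield from dense_vector_index_flags(spec["properties"], f"{path}.")


# Mapping as it currently appears (copied from the first snippet).
current_mapping = {
    "vectorValues": {
        "properties": {
            "features1": {"type": "dense_vector", "dims": 2048,
                          "index": True, "similarity": "cosine"},
            "features2": {"type": "dense_vector", "dims": 2048,
                          "index": True, "similarity": "cosine"},
        }
    }
}

# Mapping as originally defined (copied from the second snippet).
original_mapping = {
    "vectorValues": {
        "properties": {
            "features1": {"type": "dense_vector", "dims": 2048},
            "features2": {"type": "dense_vector", "dims": 2048},
        }
    }
}

current_flags = dict(dense_vector_index_flags(current_mapping))
original_flags = dict(dense_vector_index_flags(original_mapping))

for path in current_flags:
    print(path, "current:", current_flags[path], "original:", original_flags[path])
```

Running the same helper against the mapping returned by a live cluster would show what Elasticsearch actually stored, which is the value that matters for indexing cost.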

The suspicion arose because, until recently, our automated tests ran against a version of Elasticsearch (8.5) that did not support indexing vectors this large (2048 dimensions) in the first place.

The question we should answer is: "Is there a good reason for indexing these two vectors?" If the answer is no, we should stop indexing them because doing so is expensive.
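If the answer turns out to be no, one possible shape for the change is sketched below in Python. It is only an illustration under stated assumptions: the helper name disable_vector_indexing is hypothetical, the input dictionary is copied from the current mapping shown earlier, and the sketch assumes that "similarity" should be removed once "index" is false, since that option applies to indexed vectors. Because the mapping of an existing field generally cannot be changed in place, a mapping like this would most likely apply to a newly created index.

```python
import copy
import json
from typing import Any, Dict, Iterable


def disable_vector_indexing(
    properties: Dict[str, Any], field_names: Iterable[str]
) -> Dict[str, Any]:
    """Return a copy of the mapping "properties" in which the named fields
    under vectorValues are stored but not indexed.

    Hypothetical helper for illustration; not part of the platform code.
    """
    updated = copy.deepcopy(properties)
    vectors = updated["vectorValues"]["properties"]
    for name in field_names:
        field = vectors[name]
        field["index"] = False
        # Assumption: options that only make sense for indexed vectors
        # are dropped together with indexing.
        field.pop("similarity", None)
        field.pop("index_options", None)
    return updated


# Current mapping, copied from the snippet earlier in this issue.
current_mapping = {
    "vectorValues": {
        "properties": {
            "features1": {"type": "dense_vector", "dims": 2048,
                          "index": True, "similarity": "cosine"},
            "features2": {"type": "dense_vector", "dims": 2048,
                          "index": True, "similarity": "cosine"},
        }
    }
}

proposed_mapping = disable_vector_indexing(
    current_mapping, ["features1", "features2"]
)
print(json.dumps(proposed_mapping, indent=2))
```

The helper works on a deep copy, so the current mapping stays available for comparison while the proposed one is reviewed.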

agnesgaroux commented 2 weeks ago

I think you're onto something here: the API only uses reducedFeatures (see src/main/scala/weco/api/search/elasticsearch/ImageSimilarity.scala).

StepanBrychta commented 1 week ago

See https://github.com/wellcomecollection/platform/issues/5823