zentity-io / zentity

Entity resolution for Elasticsearch.
https://zentity.io
Apache License 2.0

Graph Resolution - zentity 2.0 #123

Open davemoore- opened 5 months ago

davemoore- commented 5 months ago

Graph Resolution - zentity 2.0

One of the most popular feature requests I've heard from the community has been to support the resolution of multiple entities and their relations. This issue documents my thoughts on how to implement that in zentity. I call this feature graph resolution because it introduces concepts of graph theory and entails resolving entities and relations in a graph. The feature would be significant enough to warrant the promotion of zentity to version 2.0. The actual implementation may differ from my initial outline below.

Foundational requirements

These are the minimum required capabilities of graph resolution in zentity:

  1. Generating IDs for entities and relations - zentity must generate unique identifiers (_zid) for entities and relations.
  2. Modeling relations - zentity must provide a way for users to model relations between entities.
  3. Resolving multiple entities in one request - zentity must be able to return multiple entities in the response of a single resolution job.
  4. Extracting entities from documents - zentity must be able to extract multiple entities from a given document.
  5. Resolving relations - zentity must be able to apply relation models to track relations between entities, and return those in the response of a resolution job.
  6. Performing transitive closure - zentity must track the associations between _doc and _zid. Whenever a _doc appears for multiple _zid, zentity must merge the entities of those _zid and their relations.

Optimizations

These are optimizations that can be moved to a subsequent minor version release if needed:

  1. Scoping graph resolution - zentity should be able to scope resolution jobs by entity type and relation type, in addition to the current accepted scope of attributes, resolvers, and indices.
  2. Limiting graph traversal - zentity will need a parameter in the resolution job to limit its searches on linked entities by some number of degrees of separation from the entities in the request.

1. Generating IDs for entities and relations

zentity must generate unique identifiers for entities and relations. I will call this identifier a _zid (short for "zentity ID"). The _zid should be a composite value of existing data that together would uniquely identify an entity or relation.

1.1 Entity _zid

Proposed syntax of _zid for entities:

ENTITY_TYPE|ENTITY_INSTANCE|INDEX_NAME|base64(DOC_ID)

Defined as the following:

Example (using the cross-cluster search syntax for the index name to show why a colon : shouldn't be used as a delimiter for the _zid):

person|0|us:my_index|Mg==
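
A minimal sketch of composing and parsing an entity _zid with the proposed syntax. The function names are hypothetical and not part of zentity. The parser assumes that entity type names never contain the "|" delimiter. Base64-encoding the document ID presumably keeps the delimiter out of that portion, since an Elasticsearch _id can be an arbitrary string.

```python
import base64


def entity_zid(entity_type, instance, index_name, doc_id):
    """Compose ENTITY_TYPE|ENTITY_INSTANCE|INDEX_NAME|base64(DOC_ID)."""
    # The standard base64 alphabet has no "|", so the encoded doc ID
    # can never collide with the delimiter.
    encoded = base64.b64encode(doc_id.encode("utf-8")).decode("ascii")
    return "|".join([entity_type, str(instance), index_name, encoded])


def parse_entity_zid(zid):
    """Split an entity _zid back into its four components."""
    # Assumption: entity type names do not contain "|".
    entity_type, instance, index_name, encoded = zid.split("|")
    doc_id = base64.b64decode(encoded).decode("utf-8")
    return entity_type, int(instance), index_name, doc_id


zid = entity_zid("person", 0, "us:my_index", "2")
print(zid)  # person|0|us:my_index|Mg==
print(parse_entity_zid(zid))
```

The colon in the cross-cluster index name passes through untouched because the delimiter is "|" rather than ":".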

Benefits of the proposed syntax of _zid for entities:

*Note - The _zid will NOT be guaranteed to be the same across multiple responses from the Resolution API. _zid values are ephemeral, and should be used only to uniquely identify the entities and relations of a single resolution request. Persisting them would be in scope of a future enhancement to persist and manage the outputs of entity resolution.

1.2 Relation _zid

Proposed syntax of _zid for relations:

RELATION_TYPE#RELATION_DIRECTION#_ZID_A#_ZID_B

Defined as the following:

Examples:

Benefits of the proposed syntax of _zid for relations:

2. Modeling relations

zentity must provide a way for users to model the relations between entities as they appear in documents. These relations could be either typed or untyped, and either directional, bidirectional, or undirected. A default relation could be untyped and undirected, representing the co-occurrence of two entities in a document.

Index name for relation models:

.zentity-models-relations

Relation model:

{
  "index": INDEX_NAME,
  "type": RELATION_TYPE,
  "direction": RELATION_DIRECTION,
  "a": ENTITY_TYPE,
  "b": ENTITY_TYPE
}

3. Resolving multiple entities in one request

Currently, zentity performs entity resolution for a single entity. The request accepts inputs for a single entity, and the response provides data for a single entity.

Graph resolution MUST have the response provide data for one or many entities, and SHOULD have the request allow inputs for multiple entities.

3.1 Resolution API Request

Expected changes that preserve backwards compatibility:

Expected breaking changes:

Current syntax for requests:

POST _zentity/resolution/ENTITY_TYPE
{
  "attributes": { ... },
  "terms": [ ... ],
  "ids": { ... },
  "scope": { ... }
}

Current alternative syntax for requests using an embedded entity model:

POST _zentity/resolution
{
  "attributes": { ... },
  "terms": [ ... ],
  "ids": { ... },
  "scope": { ... },
  "model": { ... }
}

Proposed syntax for requests using the "entities" syntax, which supports separately resolving one or many entities in a single resolution job:

POST _zentity/resolution
{
  "entities": [
    {
      "type": ...,
      "attributes": { ... },
      "terms": [ ... ],
      "ids": { ... }
    }
  ],
  "scope": { ... }
}

Proposed alternative syntax for requests using embedded entity models:

POST _zentity/resolution
{
  "entities": [
    {
      "attributes": { ... },
      "terms": [ ... ],
      "ids": { ... },
      "model": { ... }
    }
  ],
  "scope": { ... }
}
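
To illustrate how a current single-entity request could map onto the proposed "entities" syntax, here is a minimal sketch. The helper name and the sample attribute are hypothetical. The "scope" object is copied as-is, so any prefixing of scope values is not handled here.

```python
def to_entities_request(entity_type, body):
    """Wrap a current-syntax request body in the proposed "entities" syntax."""
    entity = {"type": entity_type}
    # Per-entity inputs move inside the entity object.
    for key in ("attributes", "terms", "ids"):
        if key in body:
            entity[key] = body[key]
    request = {"entities": [entity]}
    # "scope" stays at the top level of the request (copied verbatim).
    if "scope" in body:
        request["scope"] = body["scope"]
    return request


legacy = {
    "attributes": {"first_name": ["Alice"]},  # hypothetical attribute
    "scope": {"include": {"indices": ["my_index"]}},
}
print(to_entities_request("person", legacy))
```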

When using the "entities" syntax, the values of "scope.*.attributes" and "scope.*.resolvers" must be prefixed with ENTITY_TYPE: to

3.2 Resolution API Response

Proposed syntax for responses:

POST _zentity/resolution
{
  "took" : INTEGER,
  "entities": [
    {
      "_zid": _ZID,
      "_type": ENTITY_TYPE,
      "_hits": [ ... ]
    },
    ...
  ],
  "relations": [
    {
      "_zid": _ZID,
      "_type": RELATION_TYPE,
      "_direction": RELATION_DIRECTION,
      "_a": _ZID,
      "_b": _ZID,
      "_hits": [
        {
          "_index": INDEX_NAME,
          "_id": DOC_ID
        },
        ...
      ]
    },
    ...
  ]
}

The response is a node-link graph structure, where the nodes are listed in the "entities" field and the links are listed in the "relations" field:
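
As a sketch of how a client might consume this node-link structure, the following builds an adjacency map from a response. The function name and the short _zid values are placeholders for illustration, and relation direction is ignored for simplicity.

```python
def adjacency_from_response(response):
    """Map each entity _zid to the set of _zid values it is related to."""
    # Start with every node, so entities without relations still appear.
    adjacency = {e["_zid"]: set() for e in response.get("entities", [])}
    for relation in response.get("relations", []):
        a, b = relation["_a"], relation["_b"]
        # Direction is ignored here: every link is treated as undirected.
        adjacency.setdefault(a, set()).add(b)
        adjacency.setdefault(b, set()).add(a)
    return adjacency


response = {
    "entities": [{"_zid": "A"}, {"_zid": "B"}, {"_zid": "C"}],  # placeholder IDs
    "relations": [{"_zid": "R1", "_a": "A", "_b": "B"}],
}
print(adjacency_from_response(response))
```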

4. Extracting entities from documents

Currently, zentity assumes that everything in a resolution job belongs to a single entity: the attributes for every query submitted to Elasticsearch, and the attributes from every document received from Elasticsearch.

zentity must be able to find all possible entities in the scope of the resolution job. The way it can do this is to check if the document contains non-empty values for every attribute of any resolver for every entity type.

Proposed implementation:

for each doc returned by a query:
    for each entity type in the scope of the job:
        for each resolver in the model of that entity type:
            if the doc contains non-empty values for each attribute in that resolver:
                consider the doc as a hit for that entity type, and use it as input for subsequent queries
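
The pseudocode above can be sketched in Python as follows. This assumes the document has already been mapped from index fields to attribute names, and that each model lists its resolvers as {"resolvers": {NAME: {"attributes": [...]}}}. The function names and sample models are hypothetical.

```python
def has_value(doc, attribute):
    """True if the doc holds a non-empty value for the attribute."""
    value = doc.get(attribute)
    return value is not None and value != "" and value != []


def extract_entity_types(doc, models, scope_types=None):
    """Return the entity types for which the doc satisfies at least one resolver."""
    hits = []
    for entity_type, model in models.items():
        if scope_types is not None and entity_type not in scope_types:
            continue  # entity type is outside the scope of the job
        for resolver in model["resolvers"].values():
            attributes = resolver["attributes"]
            if attributes and all(has_value(doc, a) for a in attributes):
                hits.append(entity_type)
                break  # one satisfied resolver is enough for this type
    return hits


models = {
    "person": {"resolvers": {"name_phone": {"attributes": ["name", "phone"]}}},
    "company": {"resolvers": {"name_only": {"attributes": ["company_name"]}}},
}
doc = {"name": "Alice", "phone": "555-0100", "company_name": "Acme"}
print(extract_entity_types(doc, models))  # ['person', 'company']
```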

5. Resolving relations

Relations will be defined by the co-occurrence of two entities in a document. By default, any co-occurrence of multiple entities in a document will create an untyped, undirected relation between each pair of those entities. Sometimes this might not be desired, so there should be a parameter to disable the creation of relations that aren't described by a user-created relation model.
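
A minimal sketch of the default behavior, creating one untyped, undirected relation for each pair of entities found in a document. The function name and the `enabled` flag are hypothetical stand-ins for the parameter mentioned above, and the short _zid values are placeholders.

```python
from itertools import combinations


def cooccurrence_relations(index_name, doc_id, entity_zids, enabled=True):
    """Create an untyped, undirected relation for every pair of entities in a doc."""
    if not enabled:
        return []  # default relations switched off by the (hypothetical) flag
    relations = []
    # Sorting gives each undirected pair a single, stable (a, b) ordering.
    for a, b in combinations(sorted(set(entity_zids)), 2):
        relations.append({
            "_a": a,
            "_b": b,
            "_hits": [{"_index": index_name, "_id": doc_id}],
        })
    return relations


for relation in cooccurrence_relations("my_index", "1", ["C", "A", "B"]):
    print(relation["_a"], relation["_b"])
```

Three entities in one document yield three pairs, so the number of default relations grows quadratically with the number of entities per document.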

6. Performing transitive closure

During the life of the resolution job, it's possible that two or more entities could be discovered to be the same entity (see example below). zentity must merge any entities (and their relations) that share transitive connections.

Example:

  1. User provides inputs for the attributes of entity A.
  2. zentity submits queries using the input and receives a document that matches entity A and also contains entity B.
  3. zentity submits queries using the attributes of entity A and receives a document that matches entity A and contains entity C.
  4. zentity submits queries using the attributes of entity B and receives a document that matches entity B and contains entity D.
  5. zentity submits queries using the attributes of entity C and receives a document whose _id was one of entity A and a document whose _id was one of entity B.

In this example, zentity should merge entities A, B, and C, because it was shown that C = B and C = A, therefore A = B = C.

How to check for transitivity

zentity will only know to merge entities if they share an _id. Currently, however, zentity prevents an _id from ever appearing twice, because it has an optimization that excludes every _id it discovers from subsequent queries in the job (source). This optimization must instead be applied per entity rather than globally for the job. Each entity can have its own _id set to prevent duplicate hits to the same document, while still allowing other entities to overlap on that _id.

Current structure of job.docIDs:

{
  INDEX_NAME: set(DOC_ID, ...),
  ...
}

Proposed new structure of job.docIDs:

{
  _ZID: {
    INDEX_NAME: set(DOC_ID, ...),
    ...
  },
  ...
}

Proposed additional structure to quickly determine if an _id belongs to two or more entities:

{
  DOC_ID: set(_ZID, ...),
  ...
}
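
The two structures could be kept in sync by a small tracker like the sketch below. The class and method names are hypothetical, and the _zid values are shortened for readability. One deviation from the outline: the reverse lookup is keyed by (INDEX_NAME, DOC_ID) rather than DOC_ID alone, on the assumption that document IDs are only unique within an index.

```python
from collections import defaultdict


class DocIdTracker:
    """Tracks which documents each entity has already retrieved."""

    def __init__(self):
        # _ZID -> INDEX_NAME -> set(DOC_ID)
        self.by_entity = defaultdict(lambda: defaultdict(set))
        # (INDEX_NAME, DOC_ID) -> set(_ZID)
        self.by_doc = defaultdict(set)

    def add(self, zid, index_name, doc_id):
        """Record a hit. Returns False if this entity already had the doc."""
        if doc_id in self.by_entity[zid][index_name]:
            return False
        self.by_entity[zid][index_name].add(doc_id)
        self.by_doc[(index_name, doc_id)].add(zid)
        return True

    def excluded_ids(self, zid, index_name):
        """Doc IDs to exclude from this entity's next query on an index."""
        return set(self.by_entity[zid][index_name])

    def shared_docs(self):
        """Docs retrieved by two or more entities, i.e. merge candidates."""
        return {doc: zids for doc, zids in self.by_doc.items() if len(zids) > 1}


tracker = DocIdTracker()
tracker.add("person|0", "my_index", "1")
tracker.add("person|1", "my_index", "1")  # allowed: a different entity
print(tracker.shared_docs())
```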

When to perform transitive closure

Transitive closure should run just before the job is believed to have ended. This will limit the number of times that this expensive operation has to run. After transitive closure is complete, if any entities were merged, the job should run another hop of queries with the newly merged entities. Otherwise, if no entities were merged, the job is complete.

How to perform transitive closure

At the end of the job, transitive closure should be applied to the _id sets of all entities. Whenever two _id sets share an element, those sets need to be merged, and the attributes of those entities need to be merged. The _zid that is the lexicographically lowest of the merged entities will become the _zid for the newly merged entity.
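
One way to sketch this merge is with a union-find (disjoint-set) structure. This is my choice for illustration, not necessarily how zentity would implement it. The function name is hypothetical, documents are represented as (INDEX_NAME, DOC_ID) tuples, and only the _id sets are merged here (attributes and relations are left out).

```python
def merge_entities(doc_ids_by_zid):
    """Merge entities whose doc ID sets overlap, directly or transitively.

    Takes {_ZID: set of (INDEX_NAME, DOC_ID)} and returns the same shape,
    keyed by the lexicographically lowest _zid of each merged group.
    """
    parent = {zid: zid for zid in doc_ids_by_zid}

    def find(zid):
        # Follow parent links to the group's root, compressing the path.
        while parent[zid] != zid:
            parent[zid] = parent[parent[zid]]
            zid = parent[zid]
        return zid

    def union(a, b):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            low, high = sorted((root_a, root_b))
            parent[high] = low  # the lowest _zid always stays the root

    first_owner = {}
    for zid, docs in doc_ids_by_zid.items():
        for doc in docs:
            if doc in first_owner:
                union(first_owner[doc], zid)  # shared doc: same entity
            else:
                first_owner[doc] = zid

    merged = {}
    for zid, docs in doc_ids_by_zid.items():
        merged.setdefault(find(zid), set()).update(docs)
    return merged


# Mirrors the example above: C shares one doc with A and one with B.
job = {
    "A": {("idx", "1")},
    "B": {("idx", "2")},
    "C": {("idx", "1"), ("idx", "2")},
    "D": {("idx", "3")},
}
print(sorted(merge_entities(job)))  # ['A', 'D']
```

If the result has fewer entities than the input, something was merged, and the job would run another hop of queries with the merged entities.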

7. Scoping graph resolution

zentity should be able to scope resolution jobs by entity type and relation type, in addition to the current accepted scope of attributes, resolvers, and indices.

Current syntax for scoping resolution jobs:

{
  "scope": {
    "exclude": {
      "attributes": { ... },
      "resolvers": [ ... ],
      "indices": [ ... ]
    },
    "include": {
      "attributes": { ... },
      "resolvers": [ ... ],
      "indices": [ ... ]
    }
  }
}

Proposed new syntax for scoping resolution jobs:

{
  "scope": {
    "exclude": {
      "entities": {
        "attributes": { ... },
        "resolvers": [ ... ],
        "types": [ ... ]
      },
      "relations": {
        "types": [ ... ]
      },
      "indices": [ ... ]
    },
    "include": {
      "entities": {
        "attributes": { ... },
        "resolvers": [ ... ],
        "types": [ ... ]
      },
      "relations": {
        "types": [ ... ]
      },
      "indices": [ ... ]
    }
  }
}
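
To show how the two scope syntaxes relate, here is a sketch that moves a current-syntax scope object into the proposed layout. The helper name and the sample values are hypothetical. It does not add the new "types" filters or any per-entity-type prefixing of values.

```python
def migrate_scope(scope):
    """Move a current-syntax "scope" object into the proposed nested layout."""
    migrated = {}
    for mode in ("exclude", "include"):
        if mode not in scope:
            continue
        old = scope[mode]
        new = {}
        # "attributes" and "resolvers" move under the new "entities" object.
        entities = {k: old[k] for k in ("attributes", "resolvers") if k in old}
        if entities:
            new["entities"] = entities
        # "indices" stays at the same level as before.
        if "indices" in old:
            new["indices"] = old["indices"]
        migrated[mode] = new
    return migrated


current = {
    "include": {"attributes": {"name": ["Alice"]}, "indices": ["my_index"]},
    "exclude": {"resolvers": ["name_only"]},
}
print(migrate_scope(current))
```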

8. Limiting graph traversal

zentity will need a parameter in the resolution job to limit its searches on linked entities by some number of degrees of separation from the entities in the request.
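
The effect of such a limit can be sketched as a breadth-first traversal that stops expanding once an entity is the maximum number of degrees away from the request's entities. The parameter name `max_degrees` is hypothetical, and the adjacency map stands in for whatever links the job has discovered so far.

```python
from collections import deque


def entities_within(adjacency, seeds, max_degrees):
    """Return {_zid: degrees of separation} for entities within the limit."""
    distance = {seed: 0 for seed in seeds}
    queue = deque(seeds)
    while queue:
        current = queue.popleft()
        if distance[current] >= max_degrees:
            continue  # at the limit: do not search this entity's links
        for neighbor in adjacency.get(current, ()):
            if neighbor not in distance:
                distance[neighbor] = distance[current] + 1
                queue.append(neighbor)
    return distance


# A - B - C - D, with A as the entity in the request.
links = {"A": ["B"], "B": ["A", "C"], "C": ["B", "D"], "D": ["C"]}
print(entities_within(links, ["A"], max_degrees=2))  # {'A': 0, 'B': 1, 'C': 2}
```

With a limit of 0, only the entities in the request are resolved and no linked entities are searched.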

Current circuit breaker parameters include:

Proposed changes:

sapna3588 commented 1 month ago

Can we use this feature in the future release of the plugin as it seems quite helpful for graph resolution?