Graph Resolution - zentity 2.0

One of the most popular feature requests I've heard from the community has been to support the resolution of multiple entities and their relations. This issue documents my thoughts on how to implement that in zentity. I call this feature graph resolution because it introduces concepts of graph theory and entails resolving entities and relations in a graph. The feature would be significant enough to warrant the promotion of zentity to version 2.0. The actual implementation may differ from my initial outline below.

Foundational requirements

These are the minimum required capabilities of graph resolution in zentity:

Generating IDs for entities and relations - zentity must generate unique identifiers (_zid) for entities and relations.
Modeling relations - zentity must provide a way for users to model relations between entities.
Resolving relations - zentity must be able to apply relationship models to track relations between entities, and return those in the response of a resolution job.
Resolving multiple entities in one request - zentity must be able to return multiple entities in the response of a single resolution job.
Extracting entities from documents - zentity must be able to extract multiple entities from a given document.
Performing transitive closure - zentity must track the associations between _doc and _zid. Whenever a _doc appears for multiple _zid, zentity must merge the entities of those _zid and their relations.

Optimizations

These are optimizations that can be moved to a subsequent minor version release if needed:

Scoping graph resolution - zentity should be able to scope resolution jobs by entity type and relation type, in addition to the current accepted scope of attributes, resolvers, and indices.
Limiting graph traversal - zentity will need a parameter in the resolution job to limit its searches on linked entities by some number of degrees of separation from the entities in the request.

1. Generating IDs for entities and relations

zentity must generate unique identifiers for entities and relations. I will call this identifier a _zid (short for "zentity ID"). The _zid should be a composite value of existing data that together would uniquely identify an entity or relation.

1.1 Entity `_zid`

Proposed syntax of _zid for entities:

ENTITY_TYPE|ENTITY_INSTANCE|INDEX_NAME|base64(DOC_ID)

Defined as the following:

ENTITY_TYPE is the name of the entity model.
ENTITY_INSTANCE is an incrementing counter that differentiates multiple instances of the entity type within a document. I expect this always to be 0 for now, until zentity supports treating nested objects as individual entities.
INDEX_NAME is the name of the first index in which a document for the entity was found.
base64(DOC_ID) is the base64-encoded value of the _id of the first document in which the entity was found.
The values are concatenated in the order listed above with a pipe (|).

Example (using the cross-cluster search syntax for the index name to show why a colon : shouldn't be used as a delimiter for the _zid):

person|0|us:my_index|Mg==
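
As a minimal sketch of the proposed syntax, the snippet below builds and parses an entity _zid in Python. The function names are hypothetical and are not part of zentity; the example assumes the doc _id is a UTF-8 string.

```python
import base64

DELIMITER = "|"  # pipe, as proposed for entity _zid values


def entity_zid(entity_type, instance, index_name, doc_id):
    """Build ENTITY_TYPE|ENTITY_INSTANCE|INDEX_NAME|base64(DOC_ID)."""
    encoded = base64.b64encode(doc_id.encode("utf-8")).decode("ascii")
    return DELIMITER.join([entity_type, str(instance), index_name, encoded])


def parse_entity_zid(zid):
    """Split an entity _zid into its four parts and decode the doc _id."""
    entity_type, instance, index_name, encoded = zid.split(DELIMITER)
    doc_id = base64.b64decode(encoded).decode("utf-8")
    return entity_type, int(instance), index_name, doc_id
```

Because the standard base64 alphabet contains no pipe, splitting on the pipe is safe even when the original doc _id contains one.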

Benefits of the proposed syntax of _zid for entities:

Fast - Concatenating the values is much faster than computing an encoding or a hash digest.
Intuitive - Key information about the entity is readily apparent in the identifier. This will be useful when viewing the raw data of the relations between entities, as the relationship objects should only display the _zid for each entity for the sake of brevity.
Safe (de)serialization - The pipe symbol (|) is not allowed in entity names, attribute names, or index names. This means we can safely use it to concatenate the proposed values. A doc _id could contain this symbol, hence the requirement to use base64 encoding of the doc _id to allow for safe usage of the pipe delimiter.
Deterministic* - zentity performs entity resolution deterministically. If you submit the same request to the Resolution API twice, zentity will query the same indices in the same order. Thus, the proposed method of using the name of the first queried index and the _id of the first returned hit will yield the same _zid, as long as the state of the indices and their documents hasn't changed between those requests (see note below).

*Note - The _zid will NOT always be guaranteed to be the same across multiple responses from the Resolution API. _zid values are ephemeral and should be used only to uniquely identify the entities and relations of a single resolution request. Persisting them would be in scope of a future enhancement to persist and manage the outputs of entity resolution.

1.2 Relation `_zid`

Proposed syntax of _zid for relations:

RELATION_TYPE#RELATION_DIRECTION#_ZID_A#_ZID_B

Defined as the following:

RELATION_TYPE is the name of the relation model (or an empty value).
RELATION_DIRECTION is the direction of the relation (a>b, a<b, a<>b, or an empty value).
_ZID_A is the _zid of entity a in the relation.
_ZID_B is the _zid of entity b in the relation.
The values are concatenated in the order listed above with a hash (#). A hash is used instead of a pipe (|) because pipes will already appear in _ZID_A and _ZID_B.

Examples:

residence#a>b#person|0|us:my_index|Mg==#address|0|us:my_index|Mg== - A relation where the type is residence and the direction is a>b.
residence##person|0|us:my_index|Mg==#address|0|us:my_index|Mg== - A relation where the type is residence and there is no direction.
#a>b#person|0|us:my_index|Mg==#address|0|us:my_index|Mg== - A relation that has no relation type and the direction is a>b.
##person|0|us:my_index|Mg==#address|0|us:my_index|Mg== - A relation that has no relation type and no direction.
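
The four examples above can be reproduced with a short sketch. The function names are hypothetical; an empty string stands in for a missing type or direction, as in the proposed syntax.

```python
def relation_zid(zid_a, zid_b, relation_type=None, direction=None):
    """Build RELATION_TYPE#RELATION_DIRECTION#_ZID_A#_ZID_B."""
    return "#".join([relation_type or "", direction or "", zid_a, zid_b])


def parse_relation_zid(zid):
    """Split a relation _zid; empty type or direction becomes None."""
    relation_type, direction, zid_a, zid_b = zid.split("#")
    return relation_type or None, direction or None, zid_a, zid_b
```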

Benefits of the proposed syntax of _zid for relations:

Fast - For the same reasons as entity _zid.
Intuitive - For the same reasons as entity _zid.
Safe (de)serialization - For the same reasons as entity _zid. The hash symbol (#) is not allowed in entity names, attribute names, or index names, and it will not appear in the _zid of entities a and b. This means we can safely use it to concatenate the proposed values.
Deterministic* - For the same reasons as the entity _zid.

2. Modeling relations

zentity must provide a way for users to model the relations between entities as they appear in documents. These relations could be either typed or untyped, and either directional, bidirectional, or undirected. A default relation could be untyped and undirected, representing the co-occurrence of two entities in a document.

Index name for relation models:

.zentity-models-relations

Relation model:

{
  "index": INDEX_NAME,
  "type": RELATION_TYPE,
  "direction": RELATION_DIRECTION,
  "a": ENTITY_TYPE,
  "b": ENTITY_TYPE
}

"index" (Required) - The name of the index in which the relation appears.
"type" (Optional) - An arbitrary string that describes the relation between the two entities (e.g. "lives at", "parent of", "child of", "owner of"). Can be null or omitted to represent an untyped relation.
"a" (Required) - The entity type of one entity in the relation.
"b" (Required) - The entity type of the other entity in the relation.
"direction" (Optional) - A string that specifies the direction (or lack thereof) between entities "a" and "b".
- Directional values: "a>b", "a<b"
- Bidirectional value: "a<>b"
- Undirected values: null or omitted
- Uppercase and lowercase should be accepted for these values, but the API handler should lowercase everything before saving the document to the .zentity-models-relations index.
- Whitespace should be accepted for these values, but the API handler should strip the whitespace before saving the document to the .zentity-models-relations index.
- The order of "a" and "b" should be accepted either way for these values, but the API handler should sort them before saving the document to the .zentity-models-relations index.
- Regular expression for accepted values (prior to normalization): ^\s*[abAB]\s*(<+\s*-*|-*\s*>+|<+\s*-*\s*>+)\s*[abAB]\s*$
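
The normalization rules for "direction" could be sketched as follows. This is an illustration rather than the actual API handler: the function name is hypothetical, capture groups were added to the regular expression above, and rejecting a value such as "a>a" (which the regular expression accepts) is my own assumption.

```python
import re

# The accepted-values pattern from the text, with capture groups added.
DIRECTION_PATTERN = re.compile(
    r"^\s*([abAB])\s*(<+\s*-*|-*\s*>+|<+\s*-*\s*>+)\s*([abAB])\s*$"
)


def normalize_direction(value):
    """Return "a>b", "a<b", "a<>b", or None for an undirected relation."""
    if value is None:
        return None
    match = DIRECTION_PATTERN.match(value)
    if match is None:
        raise ValueError(f"invalid direction: {value!r}")
    left = match.group(1).lower()
    arrow = match.group(2)
    right = match.group(3).lower()
    if left == right:
        # Assumption: both sides must name different entities.
        raise ValueError(f"direction must reference both a and b: {value!r}")
    if "<" in arrow and ">" in arrow:
        symbol = "<>"
    elif "<" in arrow:
        symbol = "<"
    else:
        symbol = ">"
    if left == "b":
        # Sort so that "a" comes first, which flips a one-way arrow.
        symbol = {"<": ">", ">": "<", "<>": "<>"}[symbol]
    return f"a{symbol}b"
```

For example, " B <-- A " normalizes to "a>b", because an arrow pointing from a to b is the same relation no matter which side is written first.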

3. Resolving multiple entities in one request

Currently, zentity performs entity resolution for a single entity. The request accepts inputs for a single entity, and the response provides data for a single entity.

Graph resolution MUST have the response provide data for one or many entities, and SHOULD have the request allow inputs for multiple entities.

3.1 Resolution API Request

Expected changes that preserve backwards compatibility:

Requests can express multiple entities as inputs to the resolution job in an "entities" field. If the user doesn't supply an "entities" field, zentity will fall back onto the current syntax for resolution requests.

Expected breaking changes:

Responses should always contain "entities" and "relations" as top-level fields.

Current syntax for requests:

POST _zentity/resolution/ENTITY_TYPE
{
  "attributes": { ... },
  "terms": [ ... ],
  "ids": { ... },
  "scope": { ... }
}

Current alternative syntax for requests using an embedded entity model:

POST _zentity/resolution
{
  "attributes": { ... },
  "terms": [ ... ],
  "ids": { ... },
  "scope": { ... },
  "model": { ... }
}

Proposed syntax for requests using the "entities" syntax, which supports separately resolving one or many entities in a single resolution job:

POST _zentity/resolution
{
  "entities": [
    {
      "type": ...,
      "attributes": { ... },
      "terms": [ ... ],
      "ids": { ... }
    }
  ],
  "scope": { ... }
}

Proposed alternative syntax for requests using embedded entity models:

POST _zentity/resolution
{
  "entities": [
    {
      "attributes": { ... },
      "terms": [ ... ],
      "ids": { ... },
      "model": { ... }
    }
  ],
  "scope": { ... }
}
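
One way to support both request syntaxes is to convert a request in the current syntax into the "entities" syntax before the job starts. The sketch below assumes that approach; the function name and the attribute values in the usage note are hypothetical.

```python
# Fields that describe a single entity in the current request syntax.
LEGACY_FIELDS = ("attributes", "terms", "ids", "model")


def normalize_request(body, entity_type=None):
    """Return a request body that always uses the "entities" syntax.

    entity_type is the ENTITY_TYPE from the URL path, if one was given.
    """
    if "entities" in body:
        return body
    entity = {key: body[key] for key in LEGACY_FIELDS if key in body}
    if entity_type is not None:
        entity["type"] = entity_type
    normalized = {"entities": [entity]}
    if "scope" in body:
        normalized["scope"] = body["scope"]
    return normalized
```

With this, a request such as `normalize_request({"attributes": {"name": ["Alice"]}}, "person")` becomes a single-item "entities" list, and the rest of the job only has to handle one syntax.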

When using the "entities" syntax, the values of "scope.*.attributes" and "scope.*.resolvers" must be prefixed with ENTITY_TYPE: to

3.2 Resolution API Response

Proposed syntax for responses:

POST _zentity/resolution
{
  "took" : INTEGER,
  "entities": [
    {
      "_zid": _ZID,
      "_type": ENTITY_TYPE,
      "_hits": [ ... ]
    },
    ...
  ],
  "relations": [
    {
      "_zid": _ZID,
      "_type": RELATION_TYPE,
      "_direction": RELATION_DIRECTION,
      "_a": _ZID,
      "_b": _ZID,
      "_hits": [
        {
          "_index": INDEX_NAME,
          "_id": DOC_ID
        },
        ...
      ]
    },
    ...
  ]
}
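
A client could turn a response in this shape into an adjacency map for traversal. The sketch below is one possible approach, not part of zentity; it assumes that undirected relations (no "_direction") can be traversed both ways.

```python
def build_adjacency(response):
    """Map each entity _zid to the set of _zid values it links to."""
    adjacency = {}
    for entity in response.get("entities", []):
        # Register every entity so that isolated nodes still appear.
        adjacency.setdefault(entity["_zid"], set())
    for relation in response.get("relations", []):
        a, b = relation["_a"], relation["_b"]
        direction = relation.get("_direction")
        if direction in (None, "a>b", "a<>b"):
            adjacency.setdefault(a, set()).add(b)
        if direction in (None, "a<b", "a<>b"):
            adjacency.setdefault(b, set()).add(a)
    return adjacency
```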

The response is a node-link graph structure, where the nodes are listed in the "entities" field and the links are listed in the "relations" field:

"entities" is a list of objects, where each object is an entity with a unique _zid, a _type, and a list of _hits that retains its current syntax.
"relations" is a list of objects, where each object is a relation between two entities "a" and "b".

4. Extracting entities from documents

Currently, zentity assumes that everything in a resolution job belongs to a single entity: the attributes for every query submitted to Elasticsearch, and the attributes from every document received from Elasticsearch.

zentity must be able to find all possible entities in the scope of the resolution job. The way it can do this is to check if the document contains non-empty values for every attribute of any resolver for every entity type.

Proposed implementation:

for each doc returned by a query:
    for each entity type in the scope of the job:
        for each resolver in the model of that entity type:
            if the doc contains non-empty values for each attribute in that resolver:
                consider the doc as a hit for that entity type, and use it as input for subsequent queries
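
The same check, as a runnable sketch for a single document. It assumes the document's fields have already been mapped to attribute values per entity type (the hypothetical `doc_attributes` argument), and that each model lists its resolvers as `{"resolvers": {NAME: {"attributes": [...]}}}`.

```python
def has_value(value):
    """Treat None, empty strings, and empty collections as empty."""
    if value is None:
        return False
    if isinstance(value, (str, list, tuple, set, dict)):
        return len(value) > 0
    return True


def extract_entity_types(doc_attributes, models):
    """Return the entity types for which the doc satisfies any resolver.

    doc_attributes: {ENTITY_TYPE: {ATTRIBUTE_NAME: value}}
    models: {ENTITY_TYPE: entity model} for the types in scope.
    """
    found = []
    for entity_type, model in models.items():
        values = doc_attributes.get(entity_type, {})
        for resolver in model["resolvers"].values():
            attributes = resolver["attributes"]
            if attributes and all(has_value(values.get(a)) for a in attributes):
                found.append(entity_type)
                break  # one satisfied resolver is enough for this type
    return found
```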

5. Resolving relations

Relations will be defined by the co-occurrence of two entities in a document. By default, any co-occurrence of multiple entities in a document will create an untyped, undirected relation between each pair of those entities. Sometimes this might not be desired, so there should be a parameter to disable the creation of relations that aren't described by a user-created relation model.
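
A rough sketch of this behavior for one document follows. The function name and the `allow_unmodeled` parameter are hypothetical, and the sketch makes one interpretive assumption: the default untyped relation is created only for pairs that no relation model describes.

```python
from itertools import combinations


def relations_for_doc(index_name, entities, relation_models, allow_unmodeled=True):
    """Create relations for every pair of entities found in one document.

    entities: list of (ENTITY_TYPE, _ZID) tuples extracted from the doc.
    relation_models: list of relation model dicts as defined in section 2.
    """
    relations = []
    for (type_1, zid_1), (type_2, zid_2) in combinations(entities, 2):
        matched = False
        for model in relation_models:
            if model["index"] != index_name:
                continue
            # Accept the pair in either order, keeping "a" and "b" per the model.
            if (model["a"], model["b"]) == (type_1, type_2):
                a, b = zid_1, zid_2
            elif (model["a"], model["b"]) == (type_2, type_1):
                a, b = zid_2, zid_1
            else:
                continue
            relations.append({"_type": model.get("type"),
                              "_direction": model.get("direction"),
                              "_a": a, "_b": b})
            matched = True
        if not matched and allow_unmodeled:
            relations.append({"_type": None, "_direction": None,
                              "_a": zid_1, "_b": zid_2})
    return relations
```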

6. Performing transitive closure

During the life of the resolution job, it's possible that two or more entities could be discovered to be the same entity (see example below). zentity must merge any entities (and their relations) that share transitive connections.

Example:

User provides inputs for the attributes of entity A.
zentity submits queries using the input and receives a document that matches entity A and also contains entity B.
zentity submits queries using the attributes of entity A and receives a document that matches entity A and contains entity C.
zentity submits queries using the attributes of entity B and receives a document that matches entity B and contains entity D.
zentity submits queries using the attributes of entity C and receives a document whose _id was one of entity A and a document whose _id was one of entity B.

In this example, zentity should merge entities A, B, and C, because it was shown that C = B and C = A, therefore A = B = C.

How to check for transitivity

zentity will only know to merge the entities if they share an _id. However, zentity currently prevents an _id from ever appearing twice, because it has an optimization that excludes every _id it discovers from subsequent queries in the job (source). This optimization must be applied for each entity rather than globally for the job. Each entity can have its own _id set to prevent duplicate hits to the same document, while still allowing other entities to overlap with that _id.

Current structure of job.docIDs:

{
  INDEX_NAME: set(DOC_ID, ...),
  ...
}

Proposed new structure of job.docIDs:

{
  _ZID: {
    INDEX_NAME: set(DOC_ID, ...),
    ...
  },
  ...
}

Proposed additional structure to quickly determine if an _id belongs to two or more entities:

{
  DOC_ID: set(_ZID, ...),
  ...
}
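
The lookup structure can be derived from the per-entity structure, as in the sketch below. The function names are hypothetical. One deviation from the outline above: the sketch keys the lookup by (INDEX_NAME, DOC_ID) rather than DOC_ID alone, on the assumption that an _id is only unique within its index.

```python
def index_docs_by_entity(doc_ids_by_zid):
    """Invert {_ZID: {INDEX_NAME: set(DOC_ID)}} into {(INDEX_NAME, DOC_ID): set(_ZID)}."""
    owners = {}
    for zid, indices in doc_ids_by_zid.items():
        for index_name, doc_ids in indices.items():
            for doc_id in doc_ids:
                owners.setdefault((index_name, doc_id), set()).add(zid)
    return owners


def shared_docs(owners):
    """Keep only the documents that belong to two or more entities."""
    return {doc: zids for doc, zids in owners.items() if len(zids) > 1}
```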

When to perform transitive closure

Transitive closure should run just before the job is believed to have ended. This will limit the number of times that this expensive operation has to run. After transitive closure is complete, if any entities were merged, the job should run another hop of queries with the newly merged entities. Otherwise, if no entities were merged, the job is complete.

How to perform transitive closure

At the end of the job, transitive closure should be applied to the _id sets of all entities. Whenever two _id sets share an element, those sets need to be merged, and the attributes of those entities need to be merged. The _zid that is the lexicographically lowest of the merged entities will become the _zid for the newly merged entity.
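
One common way to implement this kind of merge is a union-find (disjoint-set) structure. The sketch below uses that technique as an assumption, not as the planned implementation. It merges only the _id sets; merging attributes and relations is left out for brevity.

```python
def merge_entities(doc_ids_by_zid):
    """Merge entities whose doc sets overlap, directly or transitively.

    doc_ids_by_zid: {_ZID: {INDEX_NAME: set(DOC_ID)}}
    Returns the same structure, keyed by the lowest _zid of each merged group.
    """
    parent = {zid: zid for zid in doc_ids_by_zid}

    def find(zid):
        # Follow parent links to the root, shortening the path on the way.
        while parent[zid] != zid:
            parent[zid] = parent[parent[zid]]
            zid = parent[zid]
        return zid

    def union(x, y):
        root_x, root_y = find(x), find(y)
        if root_x != root_y:
            low, high = sorted((root_x, root_y))
            parent[high] = low  # the lexicographically lowest _zid stays the root

    first_owner = {}
    for zid, indices in doc_ids_by_zid.items():
        for index_name, doc_ids in indices.items():
            for doc_id in doc_ids:
                key = (index_name, doc_id)
                if key in first_owner:
                    union(first_owner[key], zid)
                else:
                    first_owner[key] = zid

    merged = {}
    for zid, indices in doc_ids_by_zid.items():
        target = merged.setdefault(find(zid), {})
        for index_name, doc_ids in indices.items():
            target.setdefault(index_name, set()).update(doc_ids)
    return merged
```

If the returned structure has fewer keys than the input, at least one merge happened, which is the signal to run another round of queries.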

7. Scoping graph resolution

zentity should be able to scope resolution jobs by entity type and relation type, in addition to the current accepted scope of attributes, resolvers, and indices.

Current syntax for scoping resolution jobs:

{
  "scope": {
    "exclude": {
      "attributes": { ... },
      "resolvers": [ ... ],
      "indices": [ ... ]
    },
    "include": {
      "attributes": { ... },
      "resolvers": [ ... ],
      "indices": [ ... ]
    }
  }
}

Proposed new syntax for scoping resolution jobs:

{
  "scope": {
    "exclude": {
      "entities": {
        "attributes": { ... },
        "resolvers": [ ... ],
        "types": [ ... ]
      },
      "relations": {
        "types": [ ... ]
      },
      "indices": [ ... ]
    },
    "include": {
      "entities": {
        "attributes": { ... },
        "resolvers": [ ... ],
        "types": [ ... ]
      },
      "relations": {
        "types": [ ... ]
      },
      "indices": [ ... ]
    }
  }
}
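
Applying the new "types" scopes could look like the sketch below. The function names are hypothetical, and the precedence of "exclude" over "include" is an assumption carried over from how I understand the current scope behavior.

```python
def in_scope(value, include=None, exclude=None):
    """Return True if a type name passes the include and exclude lists."""
    if exclude and value in exclude:
        return False  # assumption: exclude takes precedence over include
    if include:
        return value in include
    return True  # no include list means everything is included


def types_in_scope(types, scope, kind):
    """Filter type names using scope.include.KIND.types and scope.exclude.KIND.types.

    kind is "entities" or "relations".
    """
    include = scope.get("include", {}).get(kind, {}).get("types")
    exclude = scope.get("exclude", {}).get(kind, {}).get("types")
    return [t for t in types if in_scope(t, include, exclude)]
```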

8. Limiting graph traversal

zentity will need a parameter in the resolution job to limit its searches on linked entities by some number of degrees of separation from the entities in the request.

Current circuit breaker parameters include:

max_docs_per_query - Maximum number of docs per query result.
max_hops - Maximum level of recursion.
max_time_per_query - Timeout per query.

Proposed changes:

Add max_degrees and default it to 1.
Rename max_hops to max_rounds (and rename any other instance of "hop" to "round," because most people will envision a "hop" to mean a link from one entity to another, which isn't the purpose of this parameter).
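
To illustrate what max_degrees would limit, here is a breadth-first sketch over an adjacency map of linked entities. The function name and the adjacency structure are hypothetical; only max_degrees and its default of 1 come from the proposal.

```python
from collections import deque


def entities_within_degrees(adjacency, start_zids, max_degrees=1):
    """Return {_ZID: degrees of separation} for entities within max_degrees.

    adjacency: {_ZID: set(_ZID)} of linked entities.
    start_zids: the entities given in the request (degree 0).
    """
    degrees = {zid: 0 for zid in start_zids}
    queue = deque(start_zids)
    while queue:
        current = queue.popleft()
        if degrees[current] >= max_degrees:
            continue  # do not search beyond the limit
        for neighbor in adjacency.get(current, ()):
            if neighbor not in degrees:
                degrees[neighbor] = degrees[current] + 1
                queue.append(neighbor)
    return degrees
```

With the default of 1, the job would resolve the requested entities and the entities directly linked to them, but would not continue on to entities linked to those.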
