zentity-io / zentity

Entity resolution for Elasticsearch.
https://zentity.io
Apache License 2.0

How can I define a matcher that matches only the year part of a date field with a window of 2? #41

Closed usama-azakaw closed 3 years ago

usama-azakaw commented 4 years ago

I am looking to define a matcher on a date field that has a format of 'yyyy-MM-dd'. I want this matcher to pick the records where the year part matches within a window of 2, meaning year ±2 is allowed.

davemoore- commented 3 years ago

@usama-azakaw I'm sure my response comes too late for your needs, but I'll answer for the community.

I see two approaches, depending on what exactly your desired outcome is. One approach is to require all documents in the entire job to be within 2 years of a given date. The other approach is to require all documents in each hop to be within 2 years of a date matched in the previous hop, which can chain over multiple hops and return results that go well beyond 2 years from the original date.

Here's an example of the two approaches.

Create the index

PUT date-example
{
  "mappings": {
    "properties": {
      "@timestamp": {
        "type": "date",
        "format": "yyyy-MM-dd"
      },
      "id": {
        "type": "keyword"
      }
    }
  }
}

Index some documents - Each document has a different year. The first and last documents are more than 10 years apart from the rest, so they won't match in either of the resolution jobs below.

POST date-example/_bulk?refresh
{"index": {"_id": "1" }}
{"@timestamp": "2000-01-01", "id": "foo"}
{"index": {"_id": "2" }}
{"@timestamp": "2013-01-01", "id": "foo"}
{"index": {"_id": "3" }}
{"@timestamp": "2014-01-01", "id": "foo"}
{"index": {"_id": "4" }}
{"@timestamp": "2015-01-01", "id": "foo"}
{"index": {"_id": "5" }}
{"@timestamp": "2016-01-01", "id": "foo"}
{"index": {"_id": "6" }}
{"@timestamp": "2017-01-01", "id": "foo"}
{"index": {"_id": "7" }}
{"@timestamp": "2018-01-01", "id": "foo"}
{"index": {"_id": "8" }}
{"@timestamp": "2019-01-01", "id": "foo"}
{"index": {"_id": "9" }}
{"@timestamp": "2030-01-01", "id": "foo"}

Create the entity model - This example borrows from the tutorial on date attributes.

PUT _zentity/models/date-example
{
  "attributes": {
    "timestamp": {
      "type": "date"
    },
    "id": {
      "type": "string"
    }
  },
  "resolvers": {
    "timestamp_id": {
      "attributes": [ "timestamp", "id" ]
    }
  },
  "matchers": {
    "exact": {
      "clause": {
        "term": {
          "{{ field }}": "{{ value }}"
        }
      }
    },
    "time_range": {
      "clause": {
        "range": {
          "{{ field }}": {
            "gte": "{{ value }}||-{{ params.window }}",
            "lte": "{{ value }}||+{{ params.window }}",
            "format": "{{ params.format }}"
          }
        }
      },
      "params": {
        "format": "yyyy-MM-dd",
        "window": "2y"
      }
    }
  },
  "indices": {
    "date-example": {
      "fields": {
        "@timestamp": {
          "attribute": "timestamp",
          "matcher": "time_range"
        },
        "id": {
          "attribute": "id",
          "matcher": "exact"
        }
      }
    }
  }
}
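
A note on the original question: the "time_range" matcher above measures its window from the full date, not from the year alone. For the sample documents, which all fall on January 1, the two readings give the same matches, but for mid-year dates they differ. The following is a minimal Python sketch of that difference, not zentity code; the helper names (shift_years, date_window, year_window) are hypothetical, and clamping February 29 to February 28 is an assumption made for the sketch.

```python
from datetime import date


def shift_years(d, years):
    """Shift a date by whole years (assumption: clamp Feb 29 to Feb 28)."""
    try:
        return d.replace(year=d.year + years)
    except ValueError:
        return d.replace(year=d.year + years, day=28)


def date_window(value, years=2):
    """Bounds measured from the full date, as in the "time_range" matcher."""
    return shift_years(value, -years), shift_years(value, years)


def year_window(value, years=2):
    """Bounds measured from the year part only, covering whole calendar years."""
    return date(value.year - years, 1, 1), date(value.year + years, 12, 31)


value = date(2018, 6, 15)
candidate = date(2016, 3, 1)

low, high = date_window(value)
print(low <= candidate <= high)  # False: 2016-03-01 is before 2016-06-15

low, high = year_window(value)
print(low <= candidate <= high)  # True: 2016 is within 2018 +/- 2
```

If the year-only reading is what you need, Elasticsearch date math also supports rounding (for example a "/y" suffix), which may be one way to express it in the matcher clause. That variant is not part of the answer above and has not been tested with zentity here.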

Resolve an entity - This example uses the "scope" field to require all documents in the job to be within two years of the given date.

Request:

POST _zentity/resolution/date-example?_source=false&queries
{
  "attributes": {
    "id": [ "foo" ],
    "timestamp": [ "2018-01-01" ]
  },
  "scope": {
    "include": {
      "attributes": {
        "timestamp": [ "2018-01-01" ]
      }
    }
  }
}

Response:

{
  "took" : 3,
  "hits" : {
    "total" : 4,
    "hits" : [ {
      "_index" : "date-example",
      "_type" : "_doc",
      "_id" : "5",
      "_hop" : 0,
      "_query" : 0,
      "_attributes" : {
        "id" : [ "foo" ],
        "timestamp" : [ "2016-01-01" ]
      }
    }, {
      "_index" : "date-example",
      "_type" : "_doc",
      "_id" : "6",
      "_hop" : 0,
      "_query" : 0,
      "_attributes" : {
        "id" : [ "foo" ],
        "timestamp" : [ "2017-01-01" ]
      }
    }, {
      "_index" : "date-example",
      "_type" : "_doc",
      "_id" : "7",
      "_hop" : 0,
      "_query" : 0,
      "_attributes" : {
        "id" : [ "foo" ],
        "timestamp" : [ "2018-01-01" ]
      }
    }, {
      "_index" : "date-example",
      "_type" : "_doc",
      "_id" : "8",
      "_hop" : 0,
      "_query" : 0,
      "_attributes" : {
        "id" : [ "foo" ],
        "timestamp" : [ "2019-01-01" ]
      }
    } ]
  }
}
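
To see why exactly these four documents come back, here is a rough Python sketch that applies one fixed two-year window around 2018-01-01 to the sample documents. It only mimics the effect of the scope and is not how zentity evaluates it; the names DOCS, shift_years, and within_window are hypothetical, and the "id" attribute is ignored because every sample document shares the value "foo".

```python
from datetime import date

# Sample documents from the bulk request above: _id -> @timestamp.
DOCS = {
    "1": date(2000, 1, 1), "2": date(2013, 1, 1), "3": date(2014, 1, 1),
    "4": date(2015, 1, 1), "5": date(2016, 1, 1), "6": date(2017, 1, 1),
    "7": date(2018, 1, 1), "8": date(2019, 1, 1), "9": date(2030, 1, 1),
}


def shift_years(d, years):
    """Shift a date by whole years (assumption: clamp Feb 29 to Feb 28)."""
    try:
        return d.replace(year=d.year + years)
    except ValueError:
        return d.replace(year=d.year + years, day=28)


def within_window(candidate, anchor, years=2):
    """True if candidate lies within +/- `years` of anchor, bounds inclusive."""
    return shift_years(anchor, -years) <= candidate <= shift_years(anchor, years)


# The scope pins every match to the same anchor date, so the window never moves.
anchor = date(2018, 1, 1)
scoped_ids = sorted(doc_id for doc_id, ts in DOCS.items() if within_window(ts, anchor))
print(scoped_ids)  # ['5', '6', '7', '8']
```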

Resolve an entity - This example only requires the documents in each hop to be within two years of a date that is already known to the job. Because every hop extends the window from the newly matched dates, this returns all documents except the two whose years are more than 10 years apart from the rest.

Request:

POST _zentity/resolution/date-example?_source=false&queries
{
  "attributes": {
    "id": [ "foo" ],
    "timestamp": [ "2018-01-01" ]
  }
}

Response:

{
  "took" : 6,
  "hits" : {
    "total" : 7,
    "hits" : [ {
      "_index" : "date-example",
      "_type" : "_doc",
      "_id" : "5",
      "_hop" : 0,
      "_query" : 0,
      "_attributes" : {
        "id" : [ "foo" ],
        "timestamp" : [ "2016-01-01" ]
      }
    }, {
      "_index" : "date-example",
      "_type" : "_doc",
      "_id" : "6",
      "_hop" : 0,
      "_query" : 0,
      "_attributes" : {
        "id" : [ "foo" ],
        "timestamp" : [ "2017-01-01" ]
      }
    }, {
      "_index" : "date-example",
      "_type" : "_doc",
      "_id" : "7",
      "_hop" : 0,
      "_query" : 0,
      "_attributes" : {
        "id" : [ "foo" ],
        "timestamp" : [ "2018-01-01" ]
      }
    }, {
      "_index" : "date-example",
      "_type" : "_doc",
      "_id" : "8",
      "_hop" : 0,
      "_query" : 0,
      "_attributes" : {
        "id" : [ "foo" ],
        "timestamp" : [ "2019-01-01" ]
      }
    }, {
      "_index" : "date-example",
      "_type" : "_doc",
      "_id" : "3",
      "_hop" : 1,
      "_query" : 0,
      "_attributes" : {
        "id" : [ "foo" ],
        "timestamp" : [ "2014-01-01" ]
      }
    }, {
      "_index" : "date-example",
      "_type" : "_doc",
      "_id" : "4",
      "_hop" : 1,
      "_query" : 0,
      "_attributes" : {
        "id" : [ "foo" ],
        "timestamp" : [ "2015-01-01" ]
      }
    }, {
      "_index" : "date-example",
      "_type" : "_doc",
      "_id" : "2",
      "_hop" : 2,
      "_query" : 0,
      "_attributes" : {
        "id" : [ "foo" ],
        "timestamp" : [ "2013-01-01" ]
      }
    } ]
  }
}
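
The hop numbers in this response can be reproduced with a small simulation. The sketch below is an approximation written for illustration, not zentity's algorithm: each hop collects the documents within two years of any date known so far, then adds their dates to the known set. The names DOCS, shift_years, and resolve_by_hops are hypothetical, and the "id" attribute is ignored because all sample documents share "foo".

```python
from datetime import date

# Sample documents from the bulk request above: _id -> @timestamp.
DOCS = {
    "1": date(2000, 1, 1), "2": date(2013, 1, 1), "3": date(2014, 1, 1),
    "4": date(2015, 1, 1), "5": date(2016, 1, 1), "6": date(2017, 1, 1),
    "7": date(2018, 1, 1), "8": date(2019, 1, 1), "9": date(2030, 1, 1),
}


def shift_years(d, years):
    """Shift a date by whole years (assumption: clamp Feb 29 to Feb 28)."""
    try:
        return d.replace(year=d.year + years)
    except ValueError:
        return d.replace(year=d.year + years, day=28)


def resolve_by_hops(docs, start, years=2):
    """Return {doc_id: hop}, widening the set of known dates after each hop."""
    known_dates = {start}
    hops = {}
    hop = 0
    while True:
        # Unmatched documents within the window of any date known so far.
        new_ids = [
            doc_id for doc_id, ts in docs.items()
            if doc_id not in hops
            and any(shift_years(k, -years) <= ts <= shift_years(k, years)
                    for k in known_dates)
        ]
        if not new_ids:
            return hops
        for doc_id in new_ids:
            hops[doc_id] = hop
            known_dates.add(docs[doc_id])
        hop += 1


hops = resolve_by_hops(DOCS, date(2018, 1, 1))
by_hop = {}
for doc_id, hop in hops.items():
    by_hop.setdefault(hop, []).append(doc_id)
print(by_hop)  # {0: ['5', '6', '7', '8'], 1: ['3', '4'], 2: ['2']}
```

Documents 1 and 9 never appear because no matched date ever comes within two years of 2000 or 2030.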