wetneb / openrefine-wikibase

This repository has migrated to:
https://gitlab.com/nfdi4culture/ta1-data-enrichment/openrefine-wikibase

Wikidata reconciliation query/scores when multiple variable on same property and date precision #141

Closed wetneb closed 1 year ago

wetneb commented 2 years ago

Originally posted by @paulgirard at https://github.com/OpenRefine/OpenRefine/issues/4993.

I am not sure whether this is a bug or a question, above all because scores are computed on the Wikidata side, but still I am sure someone like @wetneb might have insights.

I'd like to understand how to precisely tune a reconciliation against Wikidata using a rich dataset. This example raises two different reconciliation questions:

  • reconciling against a property using multiple values (here, people's first names)
  • reconciling against a date when only the year is known (here, the birth date)

To Reproduce

Steps to reproduce the behavior:

  1. Create a project with this sample data, with number detection on. The table lists possible variations for building the reconciliation query.
    Person    name    Firstname1  firstname2  birthdate
    William Albert Ablett Ablett  William Albert  1877-07-09 00:00:00
    William Albert Ablett Ablett          1877-07-09 00:00:00
    William Albert Ablett Ablett  William Albert  1877
    William Albert Ablett Ablett          1877
    William Albert Ablett Ablett  William Albert      1877
    William Albert Ablett Ablett  William Albert      1877-07-09 00:00:00
    William Albert Ablett Ablett  William     1877-07-09 00:00:00
    William Albert Ablett Ablett  William     1877
  2. Start a reconciliation on Person using all the properties, like this:
    [
    {
    "op": "core/recon",
    "engineConfig": {
      "facets": [],
      "mode": "row-based"
    },
    "columnName": "Person",
    "config": {
      "mode": "standard-service",
      "service": "https://wikidata.reconci.link/fr/api",
      "identifierSpace": "http://www.wikidata.org/entity/",
      "schemaSpace": "http://www.wikidata.org/prop/direct/",
      "type": {
        "id": "Q5",
        "name": "être humain"
      },
      "autoMatch": false,
      "columnDetails": [
        {
          "column": "name",
          "propertyName": "nom de famille",
          "propertyID": "P734"
        },
        {
          "column": "Firstname1",
          "propertyName": "prénom",
          "propertyID": "P735"
        },
        {
          "column": "firstname2",
          "propertyName": "prénom",
          "propertyID": "P735"
        },
        {
          "column": "birthdate",
          "propertyName": "date de naissance",
          "propertyID": "P569"
        }
      ],
      "limit": 0
    },
    "description": "Reconcile cells in column Person to type Q5"
    }
    ]
  3. The scores obtained are somewhat mysterious to me.

Current Results

In the current results the best score is obtained when:

  • we only use one firstname
  • we use the year version of the date

Expected Behavior

Since the targeted entity https://www.wikidata.org/wiki/Q19832695 is very well described, listing both first names in P735 and both the full date and the year only in P569, I would have expected all variations (except rows 5 and 6) to be scored 100. I don't understand why the scores vary so much. Maybe there is a different way to reconcile a list of values against one property that achieves this expected behavior? I also don't understand why the year-only version of the date yields a better score than the full date.

Versions

  • Operating System: Linux mint
  • Browser Version: firefox 101.0
  • JRE or JDK Version: openjdk version "11.0.15" 2022-04-19
  • OpenRefine: 3.5.2

Additional context

I started a discussion on the mailing list about this, but with a less precise description: https://groups.google.com/g/openrefine/c/WK6-5kSZLRA

wetneb commented 2 years ago

Thanks for this question! I agree with you that it is very annoying that this matching score is opaque. I would like to instead expose more granular features, expressing the degree of matching of each supplied data field, for instance.

Implementing this functionality requires work in multiple areas:

Your feedback about how you would expect this to work in OpenRefine is welcome in https://github.com/OpenRefine/OpenRefine/issues/3139.

Perhaps this notion of matching features does not fit your bill, in which case I would also be interested to know what we should change about it: it is still time to adapt the specifications of the protocol.

paulgirard commented 2 years ago

Oh, exposing feature scores is a great move! Yes, I'll give my thoughts on how those would be used in OpenRefine (https://github.com/OpenRefine/OpenRefine/issues/3139).

But before that I need a few clarifications:

  1. How does reconciliation work when submitting multiple values against the same property?

To come back to the candidate features: in the example I just posted, would the recon system return one or two features for P735? Or to put it differently, are multiple column details on the same property merged or treated separately? And finally, when a property has multiple values in the service, is the score returned for one feature the maximum/average/... of the scores for all existing values?

  2. Feature score value scale

From the examples in the specs, features each seem to have their own value scale, which I understand. From a user's point of view it might be useful to have a normalized version of the score, to make it possible to combine different scores into a common score. Actually, if the feature specs provided by the recon service explain those scales, a feature-score facet system could suffice. I will add that to my comment on the OpenRefine issue.

wetneb commented 2 years ago

To come back to the candidate features: in the example I just posted, would the recon system return one or two features for P735? Or to put it differently, are multiple column details on the same property merged or treated separately? And finally, when a property has multiple values in the service, is the score returned for one feature the maximum/average/... of the scores for all existing values?

So far all those details are left to the discretion of the reconciliation service. The spec does not state any relationship between features and properties.

paulgirard commented 2 years ago

Yes, I understand this. Sorry to insist on the multiple values issue, but I think the example in this issue challenges the specs written here:

Global matching formula The score of each candidate is obtained as a weighted sum of the scores of individual features. It ranges from 0 to 100. When no candidates can be found matching the target type, candidates of wrong or no types are also returned, with their score divided by two. For each supplied property, all query values are matched against reference values and the maximum matching score of all pairs is used as the similarity score for this property. https://openrefine-wikibase.readthedocs.io/en/latest/scoring.html#global-matching-formula
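As a rough illustration of the last sentence of that quote, the per-property rule can be sketched as follows. The `pair_score` function is a stand-in built on Python's `difflib` and is an assumption for illustration only; the service's actual matchers are not reproduced here.

```python
from difflib import SequenceMatcher
from itertools import product


def pair_score(query_value: str, reference_value: str) -> float:
    """Stand-in similarity on a 0-100 scale (assumption, not the service's matcher)."""
    ratio = SequenceMatcher(None, query_value.lower(), reference_value.lower()).ratio()
    return 100.0 * ratio


def property_score(query_values, reference_values) -> float:
    """Maximum score over all (query value, reference value) pairs."""
    pairs = product(query_values, reference_values)
    return max((pair_score(q, r) for q, r in pairs), default=0.0)


# One exactly matching query value is enough to reach the maximum.
print(property_score(["John", "Albert"], ["William", "Albert"]))  # 100.0
```

Under this reading, adding a non-matching first name to the query should never lower the score for P735, which is why the varying scores look surprising.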

Isn't there an issue here? Or does that rather concern the Wikidata recon server?

thadguidry commented 2 years ago

For each supplied property, all query values are matched against reference values and the maximum matching score of all pairs is used as the similarity score for this property.

@paulgirard To be clear, there is not just one spec under discussion here; there are 2 specs (actually 3).

  1. The text above is from the spec of the Wikibase/Wikidata recon scoring algorithm (written by @wetneb). A service spec.
  2. There is a separate W3C Entity Reconciliation spec that is currently in W3C draft status. The reconciliation API spec. https://github.com/reconciliation-api
  3. OpenRefine's general recon spec (based on Freebase and no longer used since version 3.1). The Freebase API spec.

A recon service can do some scoring aggregation on your multiple column details on the same property. This is the similarity score (also called a disambiguation score in other services), which is a good term because "disambiguation" is what you are typically doing by supplying more properties, like a birthdate, to help disambiguate between 2 identical entity strings ("William Albert" b. 1922 and "William Albert" b. 1895). The similarity (disambiguation) score returned for a set of supplied properties is optional for services to provide, but when they provide this...

  1. The plan is for OpenRefine to be enhanced to allow granular faceting and range control on scores and features per disambiguating property (a column in OpenRefine terms).
  2. and in the future perhaps provide some common feedback interface dialog to somewhat control a service's algorithms, if the service has a feedback mechanism implemented. This discussion is ongoing in the W3C Entity Reconciliation group.

paulgirard commented 2 years ago

Thank you for the clarification.

But my question remains. If the Wikibase/Wikidata recon service spec is accurate, my example with multiple column details on the same property should not return those score differences. Unless the way the OpenRefine Wikibase extension crafts the recon query is not the right one in such a multiple-columns-on-the-same-property situation. Or I am not using OpenRefine correctly.

About the new system: it would definitely help, but regardless of the enhanced control over feature scoring, my question about querying multiple values on the same property remains: the right way to handle such a situation in open-refine-wikidata is unclear to me. I am very sorry to insist. I feel like I am missing an obvious explanation...

ps: the new feature scoring system would be neat for handling recon server dataset heterogeneity, by identifying candidates which can't be scored on some feature because of missing data.

thadguidry commented 2 years ago

@paulgirard The Wikibase/Wikidata recon service scoring itself is defined here by @wetneb in his code: https://github.com/wetneb/openrefine-wikibase/blob/master/wdreconcile/engine.py#L269 which is "Compute per-property score". Your example dataset has variation in the values of 3 columns (the properties), as I could see by applying a text facet on each property column.

Hence, you will see variation in the overall scores because there is variation in that matrix of different Firstname1, firstname2 and birthdate.

Taking the Person column, along with the variation in values in your 3 property columns, will result in variance in the score, as per @wetneb's code computing the overall score here: https://github.com/wetneb/openrefine-wikibase/blob/master/wdreconcile/engine.py#L302

wetneb commented 2 years ago

About the new system: it would definitely help, but regardless of the enhanced control over feature scoring, my question about querying multiple values on the same property remains: the right way to handle such a situation in open-refine-wikidata is unclear to me.

It is totally possible that the documentation of the Wikidata service is not fully accurate or that there is a bug in the code. To investigate this I would recommend formulating queries "by hand", that is, crafting the API call that executes the corresponding reconciliation query and analyzing the score that it outputs. To craft the API call you can get some help from the testbench, although sadly it does not support multiple values per property (but it can be a basis to iterate from).
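A minimal sketch of crafting such a call by hand, assuming the service accepts a URL-encoded JSON `queries` parameter on a GET request at https://wikidata.reconci.link/en/api. The helper name `build_recon_url` is hypothetical, and the network call is left commented out.

```python
import json
from urllib.parse import urlencode

ENDPOINT = "https://wikidata.reconci.link/en/api"


def build_recon_url(queries: dict, endpoint: str = ENDPOINT) -> str:
    """Serialize the query batch to JSON and pass it as the `queries` parameter."""
    return endpoint + "?" + urlencode({"queries": json.dumps(queries)})


queries = {
    "q0": {
        "query": "William Albert Ablett",
        "type": "Q5",
        "properties": [
            {"pid": "P735", "v": "Albert"},
            {"pid": "P734", "v": "Ablett"},
        ],
    }
}

url = build_recon_url(queries)
print(url)

# To actually run the query (requires network access):
# from urllib.request import urlopen
# with urlopen(url) as response:
#     print(json.load(response))
```

Changing one property value at a time and comparing the returned `score` and `features` makes it easier to see which input drives the result.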

paulgirard commented 2 years ago

:pray: Thank you both! Now I am perfectly empowered to understand how the magic happens. If I discover anything suspicious I'll let you know.

paulgirard commented 2 years ago

So, about multiple values for one property: here is what happens with the first names example. When submitting a recon query with the same property multiple times with different values/columns, OpenRefine sends multiple property objects with one value each:

{
    "q0": {
        "query": "William Albert Ablett",
        "type": "Q5",
        "properties": [
            {
                "pid": "P735",
                "v": "John"

            },
            {
                "pid": "P735",
                "v": "Albert"
            },
            {
                "pid": "P734",
                "v": "Ablett"
            }
        ]
    }
}

Which returns

{
    "q0": {
        "result": [
            {
                "description": "French painter, designer and engraver",
                "features": [
                    {
                        "id": "P735",
                        "value": 100
                    },
                    {
                        "id": "P734",
                        "value": 100
                    },
                    {
                        "id": "all_labels",
                        "value": 100
                    }
                ],
                "id": "Q19832695",
                "match": true,
                "name": "William Albert Ablett",
                "score": 81.81818181818181,
                "type": [
                    {
                        "id": "Q5",
                        "name": "human"
                    }
                ]
            }
        ]
    }
}

output of https://wikidata.reconci.link/en/api?queries={%22q0%22:%20{%22query%22:%20%22William%20Albert%20Ablett%22,%22type%22:%20%22Q5%22,%22properties%22:%20[{%22pid%22:%20%22P735%22,%22v%22:%20%22John%22},{%22pid%22:%20%22P735%22,%22v%22:%20%22Albert%22},{%22pid%22:%20%22P734%22,%22v%22:%20%22Ablett%22}]}}&timeout=20000

Now, the recon service does not handle duplicated property ids and overwrites the property score with the score of the last value encountered in the property list: https://github.com/wetneb/openrefine-wikibase/blob/master/wdreconcile/engine.py#L297 Indeed, in such a scenario the expected data format would be a single property object with an array listing the possible values.

This means that only the last value is taken into account in the global score calculation.

To illustrate, let's just reverse the order of the two P735 values in the previous example:

{
    "q0": {
        "query": "William Albert Ablett",
        "type": "Q5",
        "properties": [
            {
                "pid": "P735",
                "v": "Albert"

            },
            {
                "pid": "P735",
                "v": "John"
            },
            {
                "pid": "P734",
                "v": "Ablett"
            }
        ]
    }
}

which gives

{
    "q0": {
        "result": [
            {
                "description": "French painter, designer and engraver",
                "features": [
                    {
                        "id": "P735",
                        "value": 22
                    },
                    {
                        "id": "P734",
                        "value": 100
                    },
                    {
                        "id": "all_labels",
                        "value": 100
                    }
                ],
                "id": "Q19832695",
                "match": false,
                "name": "William Albert Ablett",
                "score": 67.63636363636364,
                "type": [
                    {
                        "id": "Q5",
                        "name": "human"
                    }
                ]
            }
        ]
    }
}

The only "given name" value taken into account is "John", which makes the score much worse.
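Both global scores above can be reproduced by a simple weighted average, sketched here. The weights (1.0 for `all_labels`, 0.4 per property) and the choice to count every submitted property entry in the denominator are assumptions inferred from the two responses, not read from engine.py.

```python
LABEL_WEIGHT = 1.0     # assumed weight of the "all_labels" feature
PROPERTY_WEIGHT = 0.4  # assumed weight of each property feature


def global_score(label_score, feature_scores, n_submitted_properties):
    """Weighted average: the denominator counts every submitted property entry,
    while the numerator only has one score per distinct property id."""
    weighted_sum = LABEL_WEIGHT * label_score
    weighted_sum += PROPERTY_WEIGHT * sum(feature_scores.values())
    total_weight = LABEL_WEIGHT + PROPERTY_WEIGHT * n_submitted_properties
    return weighted_sum / total_weight


# Three property entries were submitted (P735 twice, P734 once),
# but only two property features come back.
first = global_score(100, {"P735": 100, "P734": 100}, 3)
second = global_score(100, {"P735": 22, "P734": 100}, 3)
print(round(first, 2), round(second, 2))  # 81.82 67.64
```

If this reading is right, it would also explain why the first query scores about 81.8 rather than 100 even though every returned feature is 100: the overwritten P735 entry still weighs on the denominator.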

I think that behavior is wrong.

Now, the way to correct that needs discussion. Since the specs allow duplicated property ids in the properties list, I would vote for adding some code just ahead of the prepare_properties call to merge property objects with the same id: https://github.com/wetneb/openrefine-wikibase/blob/master/wdreconcile/engine.py#L155
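A minimal sketch of what such a merge step could look like. The function name `merge_duplicate_properties` is hypothetical and this is not code from the repository; it only assumes the `pid`/`v` shape of the query shown above.

```python
def merge_duplicate_properties(properties):
    """Group entries sharing a pid into one entry whose "v" lists all values."""
    merged = {}
    for prop in properties:
        # "v" may already be a list; normalize to a list before collecting.
        values = prop["v"] if isinstance(prop["v"], list) else [prop["v"]]
        merged.setdefault(prop["pid"], []).extend(values)
    # Keep single values as scalars so unaffected properties look unchanged.
    return [
        {"pid": pid, "v": values[0] if len(values) == 1 else values}
        for pid, values in merged.items()
    ]


properties = [
    {"pid": "P735", "v": "John"},
    {"pid": "P735", "v": "Albert"},
    {"pid": "P734", "v": "Ablett"},
]
print(merge_duplicate_properties(properties))
# [{'pid': 'P735', 'v': ['John', 'Albert']}, {'pid': 'P734', 'v': 'Ablett'}]
```

With the entries merged this way, the order of the first names in the query would no longer matter.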

Doing so would behave as if duplicated property ids were actually forbidden, but in a magical, hidden way, which is not the best thing I guess?

I would be very happy to contribute by posting a PR to work on this, but I need your decision on the correct way to handle this.

thadguidry commented 2 years ago

@paulgirard It depends on the recon service and whether it provides controlled filter constraints or not. In your case, you are looking for a logical OR.

Back in Freebase days it might look like this:

?query="William Albert Ablett"
&filter=(any P735:"Albert" P735:"John" (and P734:"Ablett"))

Even when the Freebase API was still available to handle that filter, the 2.6 version code from the Freebase days did not support constructing queries using filter arguments. The Suggest code (\OpenRefine-2.6-rc.2\main\webapp\modules\core\externals\suggest\suggest-4_3.js) did contain it, however, but we did not incorporate full support for filter() handling against optional property columns in recon.

If a recon service wanted to introduce a search feature for a mechanism like filter() to accommodate AND, OR, NOT, SHOULD then I think OpenRefine already kept some support for that on the backend, but I defer to @wetneb on that.

In our reconciliation-api spec, we do have support for AND, OR, SHOULD as noted here: https://github.com/reconciliation-api/specs/blob/master/latest/index.html#L411 But I'm not sure we really called out in the spec that you can provide multiple properties arrays (I don't think so, because we also didn't allow type_strict to directly relate to a particular properties array). So with the current version of the API spec, you would have to query twice (form 2 queries with almost the same data, varying only the type_strict and properties), which is totally not ideal compared to the very nice, flexible filter constraint handling that the Freebase API originally provided.

@wetneb given the above, perhaps you can comment on whether the query

?query="William Albert Ablett"
&filter=(any P735:"Albert" P735:"John" (and P734:"Ablett"))

could actually be accomplished with a single query using the reconciliation API, or whether it could not and we need to discuss it in this issue: https://github.com/reconciliation-api/specs/issues/88 ?


From old Freebase API docs (see https://developers.google.com/freebase/v1/search-overview and https://developers.google.com/freebase/v1/search-cookbook) :

This combining behavior can be overridden and better controlled with the filter parameter which offers a richer interface to combining constraints. It is an s-expression, possibly arbitrarily nested, where the operator is one of:

  • any, logically an OR
  • all, logically an AND
  • not
  • should, which can only be used at the top level and which denotes that the constraint is optional. During scoring, matches that don't match optional constraints have their score divided in half for each optional constraint they don't match.

paulgirard commented 2 years ago

Note: it's possible to do an OR on a property like this:

{
    "q0": {
        "query": "William Albert Ablett",
        "type": "Q5",
        "properties": [
            {
                "pid": "P735",
                "v": ["Albert", "John"]
            },
            {
                "pid": "P734",
                "v": "Ablett"
            }
        ]
    }
}

But to my knowledge it's not possible to issue such a query with OpenRefine. And as stated before, the way the recon API treats duplicated properties in a query does not look ideal.

wetneb commented 2 years ago

Actually I think it is possible to issue such a query with OpenRefine, using the records mode.

paulgirard commented 2 years ago

It is possible? You mean with the current version or with a future release? That would actually fit my needs. Nevertheless, I still think the duplicated properties situation I described here should be handled differently.

ps: Is there a way to log the reconciliation queries issued by open refine in some debug mode?

wetneb commented 2 years ago

Try the following dataset:

    ID  title                 author
    1   Flexible Solar Cells  Giovanni Palmisano
    2                         Rosaria Ciriminna

Open this in OpenRefine and turn on the records mode. Then, reconcile the "title" column and add the second column as a property (for instance, "author (P50)" on Wikidata). This should generate a query similar to this one: https://reconciliation-api.github.io/specs/latest/#example-5

paulgirard commented 2 years ago

So after some tests, I don't think that's what OpenRefine currently does.

I did a reconciliation to Q13442814 with a constraint on P2093 on your dataset:

[
  {
    "op": "core/recon",
    "engineConfig": {
      "facets": [],
      "mode": "record-based"
    },
    "columnName": "title",
    "config": {
      "mode": "standard-service",
      "service": "https://tools.wmflabs.org/openrefine-wikidata/en/api",
      "identifierSpace": "http://www.wikidata.org/entity/",
      "schemaSpace": "http://www.wikidata.org/prop/direct/",
      "type": {
        "id": "Q13442814",
        "name": "scholarly article"
      },
      "autoMatch": false,
      "columnDetails": [
        {
          "column": "author",
          "propertyName": "author name string",
          "propertyID": "P2093"
        }
      ],
      "limit": 0
    },
    "description": "Reconcile cells in column title to type Q13442814"
  }
]

With both original values I get a score of 100.

But if I replace the first author cell with my own name, the score is 85, where it should be 100, as the second author perfectly matches one of the property values.

It looks like only the first row is used, as in this query: https://wikidata.reconci.link/en/api?queries=%7B%22q0%22%3A%7B%22query%22%3A%22Flexible+Solar+Cells.%22%2C%22type%22%3A%22Q13442814%22%2C%22properties%22%3A%5B%7B%22pid%22%3A%22P2093%22%2C%22v%22%3A%22Paul+Girard%22%7D%5D%7D%7D
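Decoding that URL makes the single-value payload easier to read. This is a small sketch using only Python's standard library; the URL is the one quoted above.

```python
import json
from urllib.parse import parse_qs, urlsplit

url = (
    "https://wikidata.reconci.link/en/api?queries=%7B%22q0%22%3A%7B%22query%22"
    "%3A%22Flexible+Solar+Cells.%22%2C%22type%22%3A%22Q13442814%22%2C%22properties"
    "%22%3A%5B%7B%22pid%22%3A%22P2093%22%2C%22v%22%3A%22Paul+Girard%22%7D%5D%7D%7D"
)

# parse_qs percent-decodes the value and turns "+" back into spaces.
queries = json.loads(parse_qs(urlsplit(url).query)["queries"][0])

print(queries["q0"]["query"])
for prop in queries["q0"]["properties"]:
    print(prop["pid"], prop["v"])  # P2093 Paul Girard
```

Only one P2093 value appears in the decoded query, and the second author of the record is absent.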

Indeed, if I switch the author order I am back to 100.

wetneb commented 2 years ago

Ok! Good to know. Perhaps it requires having column groups as well. Anyway this is not really discoverable at the moment, it really ought to be changed.

paulgirard commented 2 years ago

"column groups"? Oh I didn't know about this: https://docs.openrefine.org/technical-reference/architecture#column-groups I might test by importing a json then. I'll let you know.

Anyway this is not really discoverable at the moment, it really ought to be changed.

Yeah but in the meantime having a how-to could help the discoverability.

wetneb commented 1 year ago

Since the remaining action items for this are on OpenRefine's side (https://github.com/OpenRefine/OpenRefine/issues/3139), let us close this here.