Closed wetneb closed 1 year ago
Thanks for this question! I agree with you that it is very annoying that this matching score is opaque. I would like to instead expose more granular features, expressing the degree of matching of each supplied data field, for instance.
Implementing this functionality requires work in multiple areas:
Your feedback about how you would expect this to work in OpenRefine is welcome in https://github.com/OpenRefine/OpenRefine/issues/3139.
Perhaps this notion of matching features does not fit your bill, in which case I would also be interested to know what we should change about it: it is still time to adapt the specifications of the protocol.
Oh, exposing feature scores is a great move! Yes, I'll give my thoughts on how those would be used in OpenRefine (https://github.com/OpenRefine/OpenRefine/issues/3139).
But before that I need a few clarifications:
To come back to the candidate features: in the example I just posted, would the recon system return one or two features for P735? Or, to put it differently, are multiple column details on the same property merged or treated separately? And to finish, when a property has multiple values in the service, is the score returned for one feature the maximum/average/... score over all existing values?
From the examples in the specs, each feature appears to have its own value scale, which I understand. From a user's point of view it might be useful to have a normalized version of the scores, so that different scores can be combined into a common score. Actually, if the feature specs provided by the recon service explain those scales, a feature score facet system could suffice. I will add that to my comment on the OpenRefine issue.
> To come back to the candidate features: in the example I just posted, would the recon system return one or two features for P735? Or, to put it differently, are multiple column details on the same property merged or treated separately? And to finish, when a property has multiple values in the service, is the score returned for one feature the maximum/average/... score over all existing values?
So far all those details are left to the discretion of the reconciliation service. The spec does not state any relationship between features and properties.
Yes, I understand this. Sorry to insist on the multiple values issue, but I think the example in this issue challenges the scoring documentation written here:
> Global matching formula
>
> The score of each candidate is obtained as a weighted sum of the scores of individual features. It ranges from 0 to 100. When no candidates can be found matching the target type, candidates of wrong or no types are also returned, with their score divided by two. For each supplied property, all query values are matched against reference values and the maximum matching score of all pairs is used as the similarity score for this property.

https://openrefine-wikibase.readthedocs.io/en/latest/scoring.html#global-matching-formula
Isn't there an issue here? Or does that concern the Wikidata recon server rather than this spec?
> For each supplied property, all query values are matched against reference values and the maximum matching score of all pairs is used as the similarity score for this property.
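For reference, the rule quoted above ("the maximum matching score of all pairs") can be sketched as follows. This is only an illustration of the documented rule, not the service's code: `similarity` is a stand-in built on Python's `difflib`, and the name `property_score` is made up for this example.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Stand-in string similarity on a 0-100 scale.

    Assumption: the real service uses its own matching functions,
    which are not reproduced here.
    """
    return 100.0 * SequenceMatcher(None, a.lower(), b.lower()).ratio()


def property_score(query_values, reference_values) -> float:
    """Match every query value against every reference value and keep
    the best pair, as the quoted documentation describes."""
    pairs = [(q, r) for q in query_values for r in reference_values]
    if not pairs:
        return 0.0
    return max(similarity(q, r) for q, r in pairs)


# Under this rule, a single exactly matching pair gives a perfect score,
# whatever the other query values are.
best = property_score(["John", "Albert"], ["William", "Albert"])
print(best)
```

If the service applied this rule to all values supplied for P735, the order of the values should not matter, which is what makes the score differences surprising.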
@paulgirard To be clear, there is not just one spec under discussion here; there are two specs (actually three).
A recon service can do some score aggregation on your multiple column details for the same property. The result is the similarity score (also called a disambiguation score in other services), which is a good term because "disambiguation" is typically what you are doing when you supply more properties, such as a birth date, to help distinguish between two identical entity strings ("William Albert" b. 1922 and "William Albert" b. 1895). The similarity (disambiguation) score returned for a set of supplied properties is optional for services to provide, but when they provide this...
Thank you for the clarification.
But my question remains. If the Wikibase/Wikidata recon service documentation is accurate, my example with multiple column details on the same property should not return those score differences. Unless the way the OpenRefine Wikibase extension crafts the recon query is not the right one in such a multiple-columns-on-the-same-property situation. Or I am not using OpenRefine correctly.
About the new system, it would definitely help, but regardless of the enhanced control over feature scoring, my question about querying multiple values on the same property remains: the right way to handle such a situation in open-refine-wikidata is unclear to me. I am very sorry to insist. I feel like I am missing an obvious explanation of fact...
ps: the new feature scoring system would be a neat way to handle heterogeneity in the recon server's dataset, by identifying candidates which can't be scored on some feature because of missing data.
@paulgirard The Wikibase/Wikidata recon service scoring itself is defined by @wetneb in his code, in the part commented "Compute per-property score": https://github.com/wetneb/openrefine-wikibase/blob/master/wdreconcile/engine.py#L269. Your example dataset has variation in the values of 3 columns (the properties), as shown in these text facets I applied on each property column to visualize the differences:
Hence, you will see variation in the overall scores, because there is variation in that matrix of different Firstname1, firstname2 and birthdate values.
Taking the Person column together with the variation in values in your 3 property columns will result in variation in the score, as per @wetneb's code computing the overall score here: https://github.com/wetneb/openrefine-wikibase/blob/master/wdreconcile/engine.py#L302
> About the new system, it would definitely help, but regardless of the enhanced control over feature scoring, my question about querying multiple values on the same property remains: the right way to handle such a situation in open-refine-wikidata is unclear to me.
It is entirely possible that the documentation of the Wikidata service is not fully accurate, or that there is a bug in the code. To investigate this, I would recommend formulating queries "by hand", that is, crafting the API call that executes the corresponding reconciliation query and analyzing the score that it outputs. To craft the API call you can get some help from the testbench, although sadly it does not support multiple values per property (but it can be a basis to iterate from).
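As a rough illustration of crafting such a call by hand, the sketch below serializes a batch of queries into the `queries` URL parameter and decodes it again. The helper names `build_recon_url` and `decode_recon_url` are made up for this example, and the endpoint is the one that appears in a URL elsewhere in this thread; actually sending the request (for instance with `urllib.request.urlopen`) is left out.

```python
import json
from urllib.parse import parse_qs, urlencode, urlsplit

# Assumption: endpoint taken from a URL quoted elsewhere in this thread.
ENDPOINT = "https://wikidata.reconci.link/en/api"


def build_recon_url(queries: dict, endpoint: str = ENDPOINT) -> str:
    """Serialize a batch of reconciliation queries into a GET URL."""
    return endpoint + "?" + urlencode({"queries": json.dumps(queries)})


def decode_recon_url(url: str) -> dict:
    """Recover the query batch from a URL, e.g. one copied from a log."""
    return json.loads(parse_qs(urlsplit(url).query)["queries"][0])


queries = {
    "q0": {
        "query": "William Albert Ablett",
        "type": "Q5",
        "properties": [{"pid": "P735", "v": "Albert"}],
    }
}
url = build_recon_url(queries)
print(url)
```

Editing the `queries` dictionary and rebuilding the URL makes it easy to change one property at a time and compare the returned scores.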
:pray: Thank you both! Now I am perfectly empowered to understand how the magic happens. If I discover anything suspicious I'll let you know.
So, about multiple values for one property, here is what happens with the first names example. When a recon query uses the same property several times with different values/columns, OpenRefine sends multiple property objects with one value each:
```json
{
  "q0": {
    "query": "William Albert Ablett",
    "type": "Q5",
    "properties": [
      {
        "pid": "P735",
        "v": "John"
      },
      {
        "pid": "P735",
        "v": "Albert"
      },
      {
        "pid": "P734",
        "v": "Ablett"
      }
    ]
  }
}
```
Which returns
```json
{
  "q0": {
    "result": [
      {
        "description": "French painter, designer and engraver",
        "features": [
          {
            "id": "P735",
            "value": 100
          },
          {
            "id": "P734",
            "value": 100
          },
          {
            "id": "all_labels",
            "value": 100
          }
        ],
        "id": "Q19832695",
        "match": true,
        "name": "William Albert Ablett",
        "score": 81.81818181818181,
        "type": [
          {
            "id": "Q5",
            "name": "human"
          }
        ]
      }
    ]
  }
}
```
Now, the recon service does not handle duplicated property ids: it overwrites the property score with the score of the last value encountered in the property list: https://github.com/wetneb/openrefine-wikibase/blob/master/wdreconcile/engine.py#L297. Indeed, in such a scenario the expected data format would be a single property object with an array listing the possible values.
This means that only the last value is taken into account in the global score calculation.
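The effect can be reproduced with a toy model. The sketch below is not the service's code: the function name, the reference data and the exact-match scorer are all invented for illustration. It only shows how storing scores in a dictionary keyed by property id makes the last value win.

```python
# Hypothetical reference values for one candidate (invented for this sketch).
REFERENCE = {"P735": ["William", "Albert"], "P734": ["Ablett"]}


def toy_value_score(pid: str, value: str) -> float:
    """Toy scorer: 100 on an exact match with a reference value, else 0."""
    return 100.0 if value in REFERENCE.get(pid, []) else 0.0


def score_properties_last_wins(properties) -> dict:
    """Keep one score per property id; a later entry with the same id
    silently replaces the score computed for an earlier one."""
    scores = {}
    for prop in properties:
        scores[prop["pid"]] = toy_value_score(prop["pid"], prop["v"])
    return scores


john_then_albert = [{"pid": "P735", "v": "John"}, {"pid": "P735", "v": "Albert"}]
albert_then_john = list(reversed(john_then_albert))

print(score_properties_last_wins(john_then_albert))  # P735 scored on "Albert"
print(score_properties_last_wins(albert_then_john))  # P735 scored on "John"
```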
To illustrate, let's just reverse the order of the two P735 values in the previous example:
```json
{
  "q0": {
    "query": "William Albert Ablett",
    "type": "Q5",
    "properties": [
      {
        "pid": "P735",
        "v": "Albert"
      },
      {
        "pid": "P735",
        "v": "John"
      },
      {
        "pid": "P734",
        "v": "Ablett"
      }
    ]
  }
}
```
which gives
```json
{
  "q0": {
    "result": [
      {
        "description": "French painter, designer and engraver",
        "features": [
          {
            "id": "P735",
            "value": 22
          },
          {
            "id": "P734",
            "value": 100
          },
          {
            "id": "all_labels",
            "value": 100
          }
        ],
        "id": "Q19832695",
        "match": false,
        "name": "William Albert Ablett",
        "score": 67.63636363636364,
        "type": [
          {
            "id": "Q5",
            "name": "human"
          }
        ]
      }
    ]
  }
}
```
The only "given name" value taken into account is "John", which makes the score much worse.
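The two global scores above can be reproduced with a simple weighted average. The weights below (1.0 for the label, 0.4 per property) are an assumption inferred from the numbers in this thread, not something confirmed here; the function name is invented as well. The interesting part is that the figures only come out right if the denominator counts the three submitted property objects while the numerator contains one score per distinct property id.

```python
LABEL_WEIGHT = 1.0     # assumption, inferred from the scores in this thread
PROPERTY_WEIGHT = 0.4  # assumption, inferred from the scores in this thread


def global_score(label_score, property_scores, n_submitted_properties):
    """Weighted average of the label score and the per-property scores.

    property_scores holds one score per distinct property id, whereas
    n_submitted_properties counts every property object in the query.
    """
    weighted_sum = LABEL_WEIGHT * label_score + PROPERTY_WEIGHT * sum(
        property_scores.values()
    )
    total_weight = LABEL_WEIGHT + PROPERTY_WEIGHT * n_submitted_properties
    return weighted_sum / total_weight


first = global_score(100, {"P735": 100, "P734": 100}, n_submitted_properties=3)
second = global_score(100, {"P735": 22, "P734": 100}, n_submitted_properties=3)
print(round(first, 2), round(second, 2))
```

Under these assumptions the results round to 81.82 and 67.64, matching both responses. It would also explain why the first response does not reach 100 even though every feature is at 100: the duplicated P735 entry adds weight to the denominator but contributes no score of its own.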
I think that behavior is wrong.
Now, the way to correct that needs discussion. Since the specs allow duplicated property ids in the properties list, I would vote for adding some code just ahead of the prepare_properties call to merge property objects with the same id: https://github.com/wetneb/openrefine-wikibase/blob/master/wdreconcile/engine.py#L155
Doing so would behave as if duplicated property ids were actually forbidden, but in a hidden, magical way, which is probably not the best thing?
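To make the proposal concrete, here is a minimal sketch of such a merge step. The function name `merge_duplicate_properties` is hypothetical, and where exactly it would be called in the engine is not shown; it only assumes the query format used in the examples above, where `v` is either a single value or a list.

```python
def merge_duplicate_properties(properties):
    """Merge property objects sharing the same pid into a single object
    whose "v" lists all supplied values, in order of first appearance."""
    merged = {}
    for prop in properties:
        value = prop["v"]
        values = value if isinstance(value, list) else [value]
        merged.setdefault(prop["pid"], []).extend(values)
    return [{"pid": pid, "v": values} for pid, values in merged.items()]


properties = [
    {"pid": "P735", "v": "John"},
    {"pid": "P735", "v": "Albert"},
    {"pid": "P734", "v": "Ablett"},
]
print(merge_duplicate_properties(properties))
```

Note that in this sketch single values also come out wrapped in a one-element list, so the downstream code would need to accept lists everywhere.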
I would be very happy to contribute by posting a PR to work on this but I need your decisions on what's the correct way to handle this.
@paulgirard It depends on the recon service and whether it provides controlled filter constraints. In your case, you are looking for a logical OR.
Back in Freebase days it might look like this:
```
?query="William Albert Ablett"
&filter=(any P735:"Albert" P735:"John" (and P734:"Ablett"))
```
Even when the Freebase API was still available to handle that filter, the 2.6 version code from Freebase days did not support constructing queries with filter arguments. The Suggest code (\OpenRefine-2.6-rc.2\main\webapp\modules\core\externals\suggest\suggest-4_3.js) did contain it, but we did not incorporate full support for `filter()` handling against optional property columns in recon.
If a recon service wanted to introduce a search feature with a mechanism like `filter()` to accommodate AND, OR, NOT, SHOULD, then I think OpenRefine already kept some support for that on the backend, but I defer to @wetneb on that.
In our reconciliation-api spec, we do have support for AND, OR, SHOULD as noted here:
https://github.com/reconciliation-api/specs/blob/master/latest/index.html#L411
But I'm not sure we really called out in the spec that you can provide multiple `properties` arrays. I don't think so, because we also didn't allow `type_strict` to relate directly to a particular `properties` array. So with the current version of the API spec, you would have to query twice (form 2 queries with almost the same data, varying only `type_strict` and `properties`), which is far from ideal compared to the very nice, flexible filter constraint handling that the Freebase API originally provided.
@wetneb given the above, perhaps you can comment on whether the query
```
?query="William Albert Ablett"
&filter=(any P735:"Albert" P735:"John" (and P734:"Ablett"))
```
could actually be accomplished with a single query using the reconciliation API, or whether it cannot and we need to discuss it in this issue: https://github.com/reconciliation-api/specs/issues/88 ?
From the old Freebase API docs (see https://developers.google.com/freebase/v1/search-overview and https://developers.google.com/freebase/v1/search-cookbook):

> This combining behavior can be overridden and better controlled with the filter parameter which offers a richer interface to combining constraints. It is an s-expression, possibly arbitrarily nested, where the operator is one of:
>
> - any, logically an OR
> - all, logically an AND
> - not
> - should, which can only be used at the top level and which denotes that the constraint is optional. During scoring, matches that don't match optional constraints have their score divided in half for each optional constraint they don't match.
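
To illustrate the semantics quoted above, here is a small evaluator for such nested constraints. It is not the Freebase implementation: the tuple encoding, the `("prop", pid, value)` leaf form and the function names are invented for this sketch, and matching is reduced to exact membership.

```python
def matches(expr, candidate) -> bool:
    """Evaluate a nested constraint against a candidate, given as a dict
    mapping property ids to lists of values."""
    op = expr[0]
    if op == "any":   # logical OR over the sub-expressions
        return any(matches(sub, candidate) for sub in expr[1:])
    if op == "all":   # logical AND over the sub-expressions
        return all(matches(sub, candidate) for sub in expr[1:])
    if op == "not":
        return not matches(expr[1], candidate)
    if op == "prop":  # leaf constraint: ("prop", pid, value)
        _, pid, value = expr
        return value in candidate.get(pid, [])
    raise ValueError(f"unknown operator: {op!r}")


def apply_should(score, optional_constraints, candidate) -> float:
    """Halve the score once for each optional constraint that is not met."""
    for constraint in optional_constraints:
        if not matches(constraint, candidate):
            score /= 2
    return score


candidate = {"P735": ["William", "Albert"], "P734": ["Ablett"]}
given_name = ("any", ("prop", "P735", "Albert"), ("prop", "P735", "John"))
full_filter = ("all", given_name, ("prop", "P734", "Ablett"))
print(matches(full_filter, candidate))
```

With an explicit OR like this, the order of the two given names cannot influence whether the candidate passes the filter.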
Note: it is possible to do an OR on a property like this:
```json
{
  "q0": {
    "query": "William Albert Ablett",
    "type": "Q5",
    "properties": [
      {
        "pid": "P735",
        "v": ["Albert", "John"]
      },
      {
        "pid": "P734",
        "v": "Ablett"
      }
    ]
  }
}
```
But to my knowledge it is not possible to issue such a query with OpenRefine. And, as stated before, the way the recon service treats duplicated properties in a query does not look ideal.
Actually I think it is possible to issue such a query with OpenRefine, using the records mode.
It is possible? You mean with the current version or with a future release? That would actually fit my needs. Nevertheless, I still think the duplicated properties situation I described here should be handled differently.
ps: Is there a way to log the reconciliation queries issued by OpenRefine in some debug mode?
Try the following dataset:

| ID | title | author |
|---|---|---|
| 1 | Flexible Solar Cells | Giovanni Palmisano |
| 2 | | Rosaria Ciriminna |
Open this in OpenRefine and turn on the records mode. Then, reconcile the "title" column and add the second column as a property (for instance, "author (P50)" on Wikidata). This should generate a query similar to this one: https://reconciliation-api.github.io/specs/latest/#example-5
So, after some tests, I don't think that's what OpenRefine currently does.
I did a reconciliation to Q13442814 with a constraint on P2093 on your dataset:
```json
[
  {
    "op": "core/recon",
    "engineConfig": {
      "facets": [],
      "mode": "record-based"
    },
    "columnName": "title",
    "config": {
      "mode": "standard-service",
      "service": "https://tools.wmflabs.org/openrefine-wikidata/en/api",
      "identifierSpace": "http://www.wikidata.org/entity/",
      "schemaSpace": "http://www.wikidata.org/prop/direct/",
      "type": {
        "id": "Q13442814",
        "name": "scholarly article"
      },
      "autoMatch": false,
      "columnDetails": [
        {
          "column": "author",
          "propertyName": "author name string",
          "propertyID": "P2093"
        }
      ],
      "limit": 0
    },
    "description": "Reconcile cells in column title to type Q13442814"
  }
]
```
I get a score of 100 with both values like that.
But if I update the data by using my name in the first author cell:
The score is 85, where it should be 100, as the second author perfectly matches one of the property values.
It looks like only the first row is used, as in this query: https://wikidata.reconci.link/en/api?queries=%7B%22q0%22%3A%7B%22query%22%3A%22Flexible+Solar+Cells.%22%2C%22type%22%3A%22Q13442814%22%2C%22properties%22%3A%5B%7B%22pid%22%3A%22P2093%22%2C%22v%22%3A%22Paul+Girard%22%7D%5D%7D%7D
Indeed, if I switch the author order I am back to 100.
Ok! Good to know. Perhaps it requires having column groups as well. Anyway this is not really discoverable at the moment, it really ought to be changed.
"Column groups"? Oh, I didn't know about this: https://docs.openrefine.org/technical-reference/architecture#column-groups. I might test by importing a JSON file then. I'll let you know.
> Anyway this is not really discoverable at the moment, it really ought to be changed.

Yeah, but in the meantime having a how-to could help discoverability.
Since the remaining action items for this are on OpenRefine's side (https://github.com/OpenRefine/OpenRefine/issues/3139), let us close this here.
Originally posted by @paulgirard at https://github.com/OpenRefine/OpenRefine/issues/4993.