wetneb / openrefine-wikibase

This repository has migrated to:
https://gitlab.com/nfdi4culture/ta1-data-enrichment/openrefine-wikibase

String search beginning with "Wikidata:" results in strange behavior #111

Closed diegodlh closed 1 year ago

diegodlh commented 3 years ago

I'm trying to reconcile some entities whose titles begin with "Wikidata:". See, for example, Q18507561, Q27042516, Q21503284, Q52824698.

I'm sending a POST request to the reconciliation API (both the https://wikidata.reconci.link/en/api and my local instances show the same behavior) with this data:

queries={"q0":{"query":<query_string>,"type":"Q386724","type_strict":"should","properties":[]}}

where <query_string> is, for example, "Wikidata: A New Platform for Collaborative Data Collection" (Q27042516).
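For anyone trying to reproduce this, here is a minimal sketch of how the form body above could be built in Python. The function names `build_queries` and `encode_form` are made up for illustration; only the `queries` field and the JSON structure come from the request shown above.

```python
import json
from urllib.parse import urlencode


def build_queries(query_string, type_id="Q386724"):
    """Build the reconciliation query batch shown above (single query "q0")."""
    return {
        "q0": {
            "query": query_string,
            "type": type_id,
            "type_strict": "should",
            "properties": [],
        }
    }


def encode_form(queries):
    """Encode the batch as a form body: queries=<URL-encoded JSON>."""
    return urlencode({"queries": json.dumps(queries)})


if __name__ == "__main__":
    body = encode_form(
        build_queries("Wikidata: A New Platform for Collaborative Data Collection")
    )
    # This body would be sent as an application/x-www-form-urlencoded POST
    # to the reconciliation endpoint (e.g. with urllib.request or requests).
    print(body)
```

The sketch only builds the body and does not send it, so it can be run offline.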

This request returns an empty result array:

{
  "q0": {
    "result": []
  }
}

I repeat the request removing the colon after "Wikidata" in the query string (replacing the colon with another character seems to work as well). This time the request returns the expected ID (Q27042516).

Surprisingly, if I repeat the original request now, this time it does return the expected ID too. This seems to be a caching issue. Closing the local instance and starting it again with docker-compose build and docker-compose up does not seem to reset the cache (I'm not familiar with Docker, so I'm not sure what I'm supposed to do to reset it).

I'm not sure if this is specific to query strings beginning with "Wikidata:", but I tried the query string "YAGO2: A spatially and temporally enhanced knowledge base from Wikipedia" and got the expected result on the first try (i.e., no need to try the colon-free query string).

antoine2711 commented 3 years ago

Interesting observation.

Do you think it could be related to Mediawiki namespaces?

I.e. File:, User:, Template:?

diegodlh commented 3 years ago

You are right! That's the problem indeed. It looks to me like a MediaWiki-API bug. I've reported it here.

In short, in the query+search request here, the srnamespace parameter seems to be overridden by srsearch values beginning with "<namespace>:", where <namespace> is a valid Wikidata namespace (e.g., Wikidata, User, etc.). As a result, the API returns results from non-Main namespaces (i.e., non-QIDs).

These non-QID strings are then used to fetch entities with wbgetentities, which results in an error.
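To make the request in question concrete, here is a rough sketch of the search parameters involved. `action=query`, `list=search`, `srsearch` and `srnamespace` are the MediaWiki API names; the helper name `build_search_params` and the `srlimit` default are assumptions for illustration and are not taken from this repository's code.

```python
from urllib.parse import urlencode


def build_search_params(query, namespace=0, limit=10):
    """Parameters for a MediaWiki full-text search restricted to one namespace.

    namespace=0 is the main namespace, where Wikidata items (QIDs) live.
    """
    return {
        "action": "query",
        "list": "search",
        "format": "json",
        "srsearch": query,
        "srnamespace": str(namespace),
        "srlimit": str(limit),
    }


if __name__ == "__main__":
    params = build_search_params(
        "Wikidata: A New Platform for Collaborative Data Collection"
    )
    # Even though srnamespace is "0" here, the leading "Wikidata:" in srsearch
    # appears to make the API search the Wikidata: namespace instead.
    print(urlencode(params))
```

The sketch only builds the parameters; the overriding behaviour itself happens on the API side and is not reproduced here.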

I guess we could include some QID validation step, either at the end of the _srsearch function, or somewhere in the get_items function, but I wonder if all Wikibase instances follow the same /Q[0-9]+/ QID pattern. What do you think?
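A validation step along these lines might look like the following sketch. The function name `filter_entity_ids` and the `id_prefix` argument are hypothetical, and it assumes IDs are a letter prefix followed by digits, which is exactly the assumption that may not hold for every Wikibase instance.

```python
import re


def filter_entity_ids(titles, id_prefix="Q"):
    """Keep only titles that look like entity IDs, e.g. Q[0-9]+ for items."""
    pattern = re.compile(re.escape(id_prefix) + r"[0-9]+")
    # fullmatch rejects page titles such as "Wikidata:Main Page"
    return [title for title in titles if pattern.fullmatch(title)]


if __name__ == "__main__":
    titles = ["Q27042516", "Wikidata:WikiProject Source MetaData", "P31"]
    print(filter_entity_ids(titles))
```

Only the surviving IDs would then be passed on to wbgetentities, so a stray non-QID title would no longer cause the whole lookup to fail.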

Surprisingly, if I repeat the original request now, this time it does return the expected ID too. This seems to be a caching issue.

A non-empty result is probably returned after trying a query without the colon because, at this step, the correct result is retrieved from the local cache.

antoine2711 commented 3 years ago

@diegodlh : this is more and more interesting.

My first impression here, to fix this issue, would be to prefix the search string with the Main Space of Wikidata (and maybe check first if it's not there already…). That way, we would fix our problem here.

Next thing that this makes me think about is: could I search for properties with this?! ;-) But this is a whole concept in itself, pretty OT of this issue.

For the suggestion of Qid validation, I would think this is good, but yes, Wikibase for Wikimedia Commons uses Mids, and I think WD also has Eid, and Lid (E for EntitySchema, M for MediaInfo, etc.). Here's the list for WD: Namespaces.

So I guess the check should take that into account, if we could have this tool query those elements.

Are you looking to code these things?

Regards, Antoine

diegodlh commented 3 years ago

prefix the search string with the Main Space of Wikidata (and maybe check first if it's not there already…).

What would that be? I tried "Main:", but it didn't work.

In my code I finally removed the colon from query strings beginning with "some_string:" as a workaround, although adding a space at the beginning of the query string (i.e., " Wikidata: ..." instead of "Wikidata: ...") works as well.
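A minimal sketch of the leading-space variant of this workaround, assuming it is applied to the query string before it is sent. The helper name `neutralize_prefix` and the regular expression are my own and are not part of this repository.

```python
import re

# A leading "some_string:" prefix: the first character is neither whitespace
# nor a colon, followed by any non-colon characters, then a colon.
_PREFIX_RE = re.compile(r"^[^\s:][^:]*:")


def neutralize_prefix(query):
    """Prepend a space when the query starts with something like "Wikidata:"."""
    if _PREFIX_RE.match(query):
        return " " + query
    return query


if __name__ == "__main__":
    print(repr(neutralize_prefix("Wikidata: A New Platform for Collaborative Data Collection")))
    print(repr(neutralize_prefix("A title without any prefix")))
```

This pads every "some_string:" prefix, not only real namespace names, which keeps the helper independent of the namespace list of any particular Wikibase instance.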

could I search for properties with this?!

I think so, for example here. But you could also set the srnamespace to 120 (which refers to the "Property" namespace in Wikidata).

For the suggestion of Qid validation, I would think this is good, but yes, Wikibase for Wikimedia Commons uses Mids

I guess we may have something like a wikibase_id_prefix in the config which defaults to Q for Wikidata.

and I think WD also has Eid, and Lid (E for EntitySchema, M for MediaInfo, etc.)

But those belong to different namespaces, so once a namespace is selected, one should only get QIDs, EIDs, LIDs, etc.

Are you looking to code these things?

Would you agree to wait and see what they say at the MediaWiki-API bug first?

antoine2711 commented 3 years ago

prefix the search string with the Main Space of Wikidata (and maybe check first if it's not there already…).

What would that be? I tried "Main:", but it didn't work.

In my code I finally removed the colon from query strings beginning with "some_string:" as a workaround, although adding a space at the beginning of the query string (i.e., " Wikidata: ..." instead of "Wikidata: ...") works as well.

Can you try just adding a colon with no text before it? I.e., :Wikidata: A New Platform for Collaborative Data Collection

could I search for properties with this?!

I think so, for example here. But you could also set the srnamespace to 120 (which refers to the "Property" namespace in Wikidata).

If I had more time, I would do more than just try that now… ;-)

For the suggestion of Qid validation, I would think this is good, but yes, Wikibase for Wikimedia Commons uses Mids

I guess we may have something like a wikibase_id_prefix in the config which defaults to Q for Wikidata.

and I think WD also has Eid, and Lid (E for EntitySchema, M for MediaInfo, etc.)

But those belong to different namespaces, so once a namespace is selected, one should only get QIDs, EIDs, LIDs, etc.

Yah.

Are you looking to code these things?

Would you agree to wait and see what they say at the MediaWiki-API bug first?

I'm in no hurry for that. I don't have much time to code. But I do know a few things I would change or add. ;-)

Regards, Antoine

diegodlh commented 3 years ago

Can you try just adding a colon with no text before it? I.e., :Wikidata: A New Platform for Collaborative Data Collection

Yes, I tried that and it also works. But I think it works not because it is using the "" (empty) namespace, but because ":Wikidata" doesn't match any namespace, so the srnamespace parameter doesn't get overridden. The same thing happens with " Wikidata", which doesn't match any namespace either. You can easily try any combination here.

wetneb commented 1 year ago

This is a problem on the Wikibase side, not in this repository.