wiseomran / ala

Automatically exported from code.google.com/p/ala

Unpredictable bulklookup behaviour #593

Closed GoogleCodeExporter closed 9 years ago

GoogleCodeExporter commented 9 years ago
http://bie.ala.org.au/ws/species/bulklookup.json

Results seem to be very unpredictable.
bulklookup of "Grevillea humilis" returns the record for Grevillea humilis 
subsp. humilis rather than the species itself; 

bulklookup of "Grevillea humili" returns the same subspecies record;

bulklookup of "Grevillea humil" returns the record for Anthotium humile

Also, records that don't match anything are not included in the returned data 
(the returned record list may contain fewer items than the number of submitted 
names). Since the returned names aren't guaranteed to be identical to the 
submitted names, this makes it unnecessarily difficult to match the results to 
the submitted query. Can we return an empty record for non-matched names rather 
than skipping them entirely?

[PS this may be a duplicate/already known issue; I've notified by email 
previously. But I can't find it listed as an issue here]

Original issue reported on code.google.com by antarcti...@gmail.com on 22 Feb 2014 at 10:59

GoogleCodeExporter commented 9 years ago
Thanks Ben. Assigning to Natasha.

There seem to be problems in the BIE with the genus Grevillea.

Original comment by moyesyside on 24 Feb 2014 at 7:10

GoogleCodeExporter commented 9 years ago
Some notes:

Grevillea seems to be loaded ok
http://bie.ala.org.au/species/urn:lsid:biodiversity.org.au:apni.taxon:378603

but searches result in some errors

http://bie.ala.org.au/search?q=Grevillea

Original comment by moyesyside on 24 Feb 2014 at 7:14

GoogleCodeExporter commented 9 years ago
This has been fixed with the latest release. There was an issue where the bulk 
lookup was not preferring the "exact" name over other matches.

We have not inserted empty records for non-matched names yet. I am not sure of 
the regression effects on other apps that are using the service.

Original comment by natasha....@csiro.au on 28 Feb 2014 at 5:55

GoogleCodeExporter commented 9 years ago
I suggest we make it an additional boolean request parameter to include empty 
records for the non-matched names. We can default this to false (don't include 
empty records) to avoid regression problems.

Original comment by moyesyside on 28 Feb 2014 at 8:06

GoogleCodeExporter commented 9 years ago
I think that we will need to version the API. At the moment this webservice 
expects the entire request body to be an array of names. We cannot add extra 
params to it without changing the format to a JSON map of params.

What I propose is having two web services running side by side, with the 
version included in the URI, so that both calls will work.

Original comment by natasha....@csiro.au on 2 Mar 2014 at 10:45

GoogleCodeExporter commented 9 years ago
I have released a test bie-service instance that has a new service that will 
insert a null value when the lookup does not return a value. Thus the order in 
which the results appear corresponds to the order of the supplied list.  The 
format of the JSON request body is slightly different. 

Test webservice (please be aware that there may be some performance issues on 
the test server)
http://118.138.243.151/bie-service/ws/species/lookup/bulk

Example JSON body:
{"names":"[\"Grevillea humilis\",\"Macropus rufus\", \"ZZZ nnn\"]"}

Here is the example:
http://apikitchen.com/#dE41G

Original comment by natasha....@csiro.au on 3 Mar 2014 at 3:20
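
Because the new service puts a null in the slot of every unmatched name, a client can line results up with the submitted names purely by position. A minimal sketch of that pairing, assuming the response body is a JSON array with one entry per submitted name; the `pair_results` helper and the `name` field in the sample records are hypothetical and not taken from the service:

```python
import json


def pair_results(names, response_text):
    """Pair each submitted name with the record at the same position.

    Assumes a JSON array response with one entry per submitted name and
    null where nothing matched. The record fields are illustrative only.
    """
    records = json.loads(response_text)
    if len(records) != len(names):
        raise ValueError("response length does not match the number of names")
    return list(zip(names, records))


# Hypothetical response: the third name had no match, so its slot is null.
names = ["Grevillea humilis", "Macropus rufus", "ZZZ nnn"]
response_text = '[{"name": "Grevillea humilis"}, {"name": "Macropus rufus"}, null]'

paired = pair_results(names, response_text)
unmatched = [name for name, record in paired if record is None]
print(unmatched)
```

The length check is there because positional pairing silently misaligns if the service ever skips an entry, which was the original complaint.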

GoogleCodeExporter commented 9 years ago
Thanks for getting stuck into this. A few comments:

1. Test instance works for me, although the names component of the JSON body 
now needs to contain a string representing a JSON array, rather than an actual 
JSON array. Wouldn't it make more sense to have it as 
{"names":["name1","name2"]}?

2. If a versioned API is going to create more headaches than it solves, is it 
sensible to think about allowing a URL parameter to the original web service to 
control the null-record behaviour? e.g. 
http://bie.ala.org.au/ws/species/bulklookup.json?includenull=true (and then 
with the POSTed JSON body the same as before)

3. Still not convinced that the name-matching logic is working as it should. 
e.g. for "Macropus rufus" I get the right name back, but for "Macropus rufu" I 
get Marsilea macropus, with a score of 0.04. Surely "Macropus rufus" is a 
better match to "Macropus rufu"?

Thanks

B

Original comment by antarcti...@gmail.com on 3 Mar 2014 at 6:10
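
To illustrate point 3, a generic string-similarity measure ranks "Macropus rufus" well above "Marsilea macropus" for the query "Macropus rufu". This sketch uses Python's `difflib.SequenceMatcher` purely as a stand-in; it is not the scoring the BIE service uses, and the `similarity` helper is hypothetical:

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Return a 0..1 similarity ratio between two names, ignoring case."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


query = "Macropus rufu"
candidates = ["Macropus rufus", "Marsilea macropus"]

# Pick the candidate whose character sequence is closest to the query.
best = max(candidates, key=lambda name: similarity(query, name))
print(best)
```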

GoogleCodeExporter commented 9 years ago
Thanks Ben.

1. Agree

2. My understanding is that this would work for some http servers but is not a 
recommended approach in the HTTP spec.

3. This is beginning to stretch the API in new ways. These calls weren't really 
intended to support fuzzy matches or matches for partial names. Is this really 
useful in a bulk context?

Original comment by moyesyside on 3 Mar 2014 at 9:38

GoogleCodeExporter commented 9 years ago
Re 3: my misunderstanding, perhaps. I confess that originally I wasn't 
expecting the bulklookup to give matches to partial names. It isn't necessary 
(for the R stuff) for it to do so - just exact-matches is fine. But currently 
this service *is* doing fuzzy/incomplete matching. If it's going to do it, then 
surely it should do so sensibly! My concern is that a user might submit a bunch 
of names including (say) "Macropus rufu", and not catch the fact that it's been 
matched to something totally different. (Yes they should check, but I'd argue 
that this particular example isn't a reasonably-expected result for a 
name-matching service). Does it make more sense to only return exact matches? 
(Or again, will that potentially break backwards compatibility with existing 
users?)

Original comment by antarcti...@gmail.com on 3 Mar 2014 at 10:07

GoogleCodeExporter commented 9 years ago
Thanks Ben. I agree. We should just support exact matches (with synonym 
resolution) and return null where an exact match wasn't found. Backwards 
compatibility shouldn't be an issue as this is a brand new URL path not in use.

Original comment by moyesyside on 3 Mar 2014 at 10:17
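
The agreed behaviour (exact matches only, synonyms resolved to the accepted name, null otherwise) can be sketched as follows. This is an illustration over a toy in-memory index, not the service's implementation; `lookup_exact` and both dictionaries are hypothetical:

```python
def lookup_exact(names, accepted, synonyms):
    """Return one entry per name: the accepted name, or None if no exact match.

    accepted maps a lower-cased accepted name to its display form.
    synonyms maps a lower-cased synonym to the lower-cased accepted name.
    """
    results = []
    for name in names:
        key = name.strip().lower()
        if key in accepted:
            results.append(accepted[key])
        elif key in synonyms:
            # Resolve the synonym, then return the accepted name it points to.
            results.append(accepted.get(synonyms[key]))
        else:
            # No partial or fuzzy matching: an incomplete name yields None.
            results.append(None)
    return results


# Toy index for illustration only.
accepted = {"macropus rufus": "Macropus rufus"}
synonyms = {"megaleia rufa": "macropus rufus"}

print(lookup_exact(["Macropus rufus", "Megaleia rufa", "Macropus rufu"], accepted, synonyms))
```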

GoogleCodeExporter commented 9 years ago
Yep, that'll work for us. Ta.

Original comment by antarcti...@gmail.com on 3 Mar 2014 at 10:19

GoogleCodeExporter commented 9 years ago
OK changes made and deployed to the test server.

Webservice now accepts JSON params correctly:

{"names":["Grevillea humilis","Macropus rufus", "ZZZ nnn","Macropus rufus (Desmarest, 1822)"]}

Also matches should be better.

Original comment by natasha....@csiro.au on 5 Mar 2014 at 5:49
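
The difference between the two request formats shows up when decoding them. In the first test release `names` was a string containing a JSON array, so it had to be decoded twice; in the revised format it is a real array. A small sketch using the example bodies from this thread:

```python
import json

# Body from the first test release: "names" holds a string that itself
# contains a JSON array, so it needs a second decoding step.
old_body = r'{"names":"[\"Grevillea humilis\",\"Macropus rufus\", \"ZZZ nnn\"]"}'
old_names = json.loads(json.loads(old_body)["names"])

# Revised body: "names" is a real JSON array, decoded in one step.
new_body = '{"names":["Grevillea humilis","Macropus rufus", "ZZZ nnn"]}'
new_names = json.loads(new_body)["names"]

print(old_names == new_names)
```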

GoogleCodeExporter commented 9 years ago
All looks good to me! Thanks.

One remaining issue that may or may not make sense to address is names that can 
match multiple LSIDs. e.g. "Oenanthe" can match either birds or plants. I don't 
know if it makes sense to return all matches in this case? (e.g. the returned 
data structure would be an array of arrays). An alternative would be to add a 
"is_unique" column so that potential exact-but-incorrect matches can be flagged 
for the user to sort out via some other service. 

Original comment by antarcti...@gmail.com on 7 Mar 2014 at 4:01
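
The "is_unique" idea could look roughly like this on the client or server side. It is only a sketch of the suggestion: `flag_ambiguous`, the `lsid` field, and the placeholder identifiers are all hypothetical, and nothing here reflects what the service actually returns:

```python
def flag_ambiguous(matches_by_name):
    """Keep the first match per name and add an is_unique flag.

    matches_by_name maps each submitted name to a list of candidate records.
    Names with no candidates map to None.
    """
    flagged = {}
    for name, matches in matches_by_name.items():
        if not matches:
            flagged[name] = None
            continue
        record = dict(matches[0])  # copy so the input is not modified
        record["is_unique"] = len(matches) == 1
        flagged[name] = record
    return flagged


# Placeholder identifiers, not real LSIDs.
candidates = {
    "Oenanthe": [{"lsid": "example:bird-genus"}, {"lsid": "example:plant-genus"}],
    "Macropus rufus": [{"lsid": "example:red-kangaroo"}],
    "ZZZ nnn": [],
}
print(flag_ambiguous(candidates))
```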

GoogleCodeExporter commented 9 years ago
Ah, me again. Another problem, I think.

In searching for "Grevillea" I'm getting "Grevillea banksii" as the returned 
name. I'd expect this query to give me "Grevillea" the genus. I'm pretty sure 
that this is happening because the matching is happening on common names as 
well as scientific names. Searching on "red kangaroo" gets me "Macropus rufus".

The API says that this service takes a list of scientific names, it doesn't 
mention common names. Should it then be matching on common names? I think this 
is the cause of the problem.

Original comment by antarcti...@gmail.com on 8 Mar 2014 at 4:44

GoogleCodeExporter commented 9 years ago
OK, so we have changed the behaviour of the new bulklookup.  By default we are 
not allowing common name matches.  Common name matches can be included (if 
necessary) via the following JSON body:
{"names":["Grevillea"],"vernacular":true}

This will not limit it to only common name matches, rather it replicates the 
original behaviour for regression use.

This change has been deployed to the test server.

Original comment by natasha....@csiro.au on 2 Apr 2014 at 12:17
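
A client can build either form of the body from the same helper. A minimal sketch, assuming that leaving `vernacular` out is equivalent to the default described above; `build_bulk_body` is a hypothetical name:

```python
import json


def build_bulk_body(names, vernacular=False):
    """Serialise the request body for the new bulk lookup.

    The vernacular flag is only written when common-name matching is wanted,
    on the assumption that omitting it gives the default behaviour.
    """
    body = {"names": list(names)}
    if vernacular:
        body["vernacular"] = True
    return json.dumps(body)


print(build_bulk_body(["Grevillea"]))
print(build_bulk_body(["Grevillea"], vernacular=True))
```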

GoogleCodeExporter commented 9 years ago
released

Original comment by natasha....@csiro.au on 2 Apr 2014 at 2:13

GoogleCodeExporter commented 9 years ago
Not sure that the vernacular option is working. When I try it, I get a 
400 error with:

Format of input incorrect: Unexpected character ('f' (code 102)): was 
expecting double-quote to start field name

Same code without the vernacular part works fine.

(Same behaviour on both test and live server)

Original comment by antarcti...@gmail.com on 2 Apr 2014 at 7:01

GoogleCodeExporter commented 9 years ago
I'm guessing you aren't setting a content type of application/json in the request.

Here's an example

http://apikitchen.com/#tkcW9

Original comment by moyesyside on 2 Apr 2014 at 7:50
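
For reference, setting the content type explicitly looks like this with Python's standard library. The URL is the test-server address quoted earlier in this thread and may no longer be reachable, so the sketch only constructs the request; `make_request` is a hypothetical helper:

```python
import json
import urllib.request

# Test-server URL quoted earlier in this thread; it may no longer be reachable.
URL = "http://118.138.243.151/bie-service/ws/species/lookup/bulk"


def make_request(body):
    """Build a POST request that declares its body as application/json."""
    payload = json.dumps(body).encode("utf-8")
    return urllib.request.Request(
        URL,
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = make_request({"names": ["Grevillea"], "vernacular": True})
# urllib stores header names capitalised as "Content-type".
print(request.get_method(), request.get_header("Content-type"))
# To send it: urllib.request.urlopen(request)  (not run here)
```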

GoogleCodeExporter commented 9 years ago
Turned out it was because my conversion to JSON was writing vernacular 
as an array (i.e. "vernacular":[true] not "vernacular":true). Fixed at 
my end now.

Original comment by antarcti...@gmail.com on 2 Apr 2014 at 8:33
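
Some JSON encoders write a length-one value as an array, which is easy to miss. A quick client-side check before sending the body can catch it. This is a sketch only: `check_bulk_body` is hypothetical, and the rules it enforces are inferred from the examples in this thread rather than from a published schema:

```python
import json


def check_bulk_body(body_text):
    """Return a list of problems found in a bulk lookup request body."""
    body = json.loads(body_text)
    problems = []
    if not isinstance(body.get("names"), list):
        problems.append("names must be a JSON array of strings")
    if "vernacular" in body and not isinstance(body["vernacular"], bool):
        kind = type(body["vernacular"]).__name__
        problems.append("vernacular must be a bare true/false, not " + kind)
    return problems


print(check_bulk_body('{"names":["Grevillea"],"vernacular":[true]}'))
print(check_bulk_body('{"names":["Grevillea"],"vernacular":true}'))
```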