wiseomran / ala

Automatically exported from code.google.com/p/ala

Standardise assertions across all ALA services #650

Open GoogleCodeExporter opened 9 years ago

GoogleCodeExporter commented 9 years ago
In developing the R library (ALA4R) for interfacing with ALA data, 
inconsistencies in the assertions have been detected. 

The assertions developed by the ALA are the most significant value-add to the 
records available - particularly for the research community. 'Data Quality' has 
remained one of the most significant issues raised by the research community so 
it is vital that we communicate the assertions in a comprehensive and 
consistent manner. The intent in ALA4R is to enable users to visualize the 
assertions associated with a record set. For this to happen, assertions need to 
be comprehensive and consistent.

There are at least three sources of information related to record assertions in 
the ALA:

1. http://biocache.ala.org.au/ws/assertions/codes
2. http://bit.ly/peFILi
3. Downloaded records from spatial portal, biocache, ALA4R etc

There needs to be one location where the list of assertions and (extremely 
important) their definitions can be found. This location should be linked to 
from wherever assertions are exposed. Some assertions are currently missing 
from the list in (2) above, and others are not defined.

Assertions need to be consistently named. At the moment, there are two 
different sets of names (and maybe more). I'd suggest that the names at 
http://biocache.ala.org.au/ws/assertions/codes be used throughout the systems.

As far as I can tell, the assertions appended to downloads are a consistent 
subset. If it has been decided to remove assertions for which ALL records are 
"FALSE" (as Dave M suggested happens), fine, but this may conflict (at 
least by default) with Issue #489. There is currently a maximum of 87 
assertions, so there is probably a case for removing 'non-applicable' 
assertions from the record downloads to reduce data volume.

The attached spreadsheet is a summary of the assertions from the three 
locations noted above. Highlighted cells in the spreadsheet represent missing 
information, as far as I can tell - without comprehensive definitions.

Original issue reported on code.google.com by leebel...@gmail.com on 27 Apr 2014 at 10:59

Attachments:

GoogleCodeExporter commented 9 years ago
Thanks Lee.

The difference between the names in here:

http://biocache.ala.org.au/ws/assertions/codes

and here

https://docs.google.com/spreadsheet/ccc?key=0AjNtzhUIIHeNdHJOYk1SYWE4dU1BMWZmb2hiTjlYQlE&hl=en_US#gid=0

should just be formatting (camelCase vs uppercase+underscores). Feel free to 
add the camel case equivalent as an additional column to the spreadsheet.
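
If the two lists really do differ only in formatting, the extra column could 
be generated rather than typed by hand. A minimal sketch in Python, assuming 
plain camelCase names with no acronyms or special cases; the function names 
are hypothetical, and the sample names are taken from the qa example further 
down this thread:

```python
import re


def camel_to_upper_snake(name: str) -> str:
    """Convert a camelCase assertion name to UPPER_SNAKE_CASE."""
    # Insert "_" at each boundary between a lowercase letter or digit
    # and the capital letter that follows it, then upper-case the lot.
    return re.sub(r"(?<=[a-z0-9])(?=[A-Z])", "_", name).upper()


def upper_snake_to_camel(name: str) -> str:
    """Convert an UPPER_SNAKE_CASE assertion name to camelCase."""
    head, *tail = name.lower().split("_")
    return head + "".join(part.capitalize() for part in tail)


print(camel_to_upper_snake("invalidCollectionDate"))
print(upper_snake_to_camel("DECIMAL_LAT_LONG_CONVERTED"))
```

Names containing acronyms would not round-trip with this approach, so any 
mismatch it reports would still need checking by eye.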

We are going to stick with the camel case for biocache web services as changing 
this will be too painful for us and external users of the services e.g. the R 
library Kristen's group have written (as an aside, I assume you guys have 
looked at this).

The inconsistencies in downloaded assertions are something we'll look into. My 
understanding was that we were limiting the downloaded assertions to those 
applicable to the downloaded dataset (i.e. if one record in the dataset fails 
a test, then include the column). Natasha - can you update us on how this is 
working?

Original comment by moyesyside on 27 Apr 2014 at 11:21

GoogleCodeExporter commented 9 years ago
Some comments regarding the reference to Issue 489:

By default the downloads include all the assertions that have at least one true 
value in the result set. This is not a conflict with Issue 489. Rather 489 
ensures that the assertions are reported in a consistent order 
(alphabetically).  
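
The default rule described above can be sketched in Python as follows. This 
is only an illustration, not the biocache-service code: the record layout (one 
dict of assertion name to boolean per record) and the function name are 
assumptions.

```python
def applicable_assertions(records):
    """Return the assertion names that are true for at least one record,
    sorted alphabetically."""
    names = set()
    for record in records:
        # Keep only the assertions flagged true on this record.
        names.update(name for name, flagged in record.items() if flagged)
    return sorted(names)


# Hypothetical result set: invalidCollectionDate is false for every record,
# so it is left out of the returned columns.
records = [
    {"invalidCollectionDate": False,
     "invalidScientificName": True,
     "stateCoordinateMismatch": False},
    {"invalidCollectionDate": False,
     "invalidScientificName": False,
     "stateCoordinateMismatch": True},
]
print(applicable_assertions(records))
```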

Issue 489 goes further by enhancing the downloads so that a user can turn off 
the assertions by providing a param qa=none.  

It also allows a user to supply a list of assertions that they are interested 
in. All assertions supplied will be included in the download irrespective of 
the value. Example: 
qa=decimalLatLongConverted,invalidCollectionDate,speciesOutsideExpertRange,stateCoordinateMismatch,invalidScientificName
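
A client could build the qa value for the three cases above (default, none, 
explicit list) along these lines. This is a sketch in Python: only the `qa` 
parameter name and the `none` value come from this thread; the helper name is 
hypothetical, and no download URL is assumed.

```python
from urllib.parse import urlencode


def qa_param(assertions=None):
    """Build the qa query parameter for a download request.

    None       -> no qa parameter (service default applies)
    empty list -> qa=none (assertions turned off)
    list       -> qa=<comma-separated assertion names>
    """
    if assertions is None:
        return {}
    if not assertions:
        return {"qa": "none"}
    return {"qa": ",".join(assertions)}


# safe="," keeps the commas readable instead of encoding them as %2C.
query = urlencode(
    qa_param(["invalidCollectionDate", "invalidScientificName"]), safe=","
)
print(query)
```

The resulting string would be appended to whatever query the download request 
already carries.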

The biocache-service changes for Issue 489 are in production.

We still need to modify the biocache hubs so that the download dialog allows a 
user to select a list of assertions to include in their downloads.

Original comment by natasha....@csiro.au on 27 Apr 2014 at 11:22

GoogleCodeExporter commented 9 years ago
Attached, from Jeremy Van Der Wal, regarding differences between ws and download 

Original comment by leebel...@gmail.com on 9 Jun 2014 at 7:40

Attachments: