itsravenous opened this issue 8 years ago
this will depend totally on the subject upload order being correct and not concurrent. I'm not a fan of this. We've got selection strategies in the API to deal with this and we should support this strategy there. We have plans to provide the ability to select the first page randomly and then provide a prev / next subject link to allow page turning.
Also this will suffer under a user influx / media event where we serve the start images to a large set of users and oversample before retirement (SGL, etc).
Completely understand. Sounds like we're better off waiting for the backend support. In the meantime, you and @eatyourgreens have mentioned priorities on subject sets (or subject set links?); @saschaishikawa suggested manually adding priorities to the existing OW data based on the page number in the metadata - does that sound viable?
To be honest the actual page linking feature is really just metadata on each subject, i.e. add prev: subject_id_X, next: subject_id_Y links to each one. I can help do this if needed but we really need that curated linked list to do it.
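For illustration, the curated metadata on a single subject might look something like this (the field names are hypothetical until that list exists):

```js
// Hypothetical shape of the linked-list metadata on one subject
var subject = {
  id: '1066093',
  metadata: {
    pageNumber: 42,
    prev_subject_id: '1066092',
    next_subject_id: '1066094'
  }
};
```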
Priority selection strategies work but suffer from the oversampling issue under media events; from memory we used this for Annotate with small subject sets and it didn't work so well.
Oki doki! I'll take a look at the subjects and see what metadata we have on each set to infer order
Sorry, forgot to add that we need to add API support for a param to set the selection context (seen_before / retired) on the normal /subjects/:id route, so we can mark those pages and you can wire up the UI correctly.
If we implement a linked list by adding metadata, does that mean we can only grab one subject at a time (because we won't know where the "next" link will point to)?
Yes... but you could still get it to construct a queue in the background while you're looking at the current subject
If we know the subject ids we can request in bulk via a URL like /resource_name?ids=1,2,3
https://github.com/RestPack/restpack_serializer/blob/master/README.md#by-primary-key
So we can get a random offset using the subject selection service and then allow page turning / URL linking via subject ids.
If this doesn't work let me know, it should be supported.
Grabbing a list of subjects by ids works fine. @camallen so you're suggesting we fetch the first unseen subject randomly, using cellect, something like this:
```js
var params = {subject_set_id: 2493, sort: 'cellect', page: 1, page_size: 1};
zooAPI.type('subjects').get(params).then(function (res) {
  console.log('Requested Subject(s):', res);
});
```
And then use the metadata links within that subject to build a new batch of pages and add them to the local storage queue:
```js
var params = {id: ['next_id1', 'next_id2' /* , ... */]};
zooAPI.type('subjects').get(params).then(function (res) {
  console.log('Requested Subjects:', res);
});
```
Is there a way to dump a list of subject ids for a given subject_set? Is it safe to assume sequentially increasing subject ids are in order of increasing page number? I'm trying to figure out the best way to create those linked lists.
Almost, they will be increasing, but it's a shared tablespace so the ids may not increment sequentially.
You'll have to traverse the subjects depending on the metadata (linked list vs array of subject ids); I'm not sure how you can create a page of data in one go (array of subject ids?) but you can get next / prev in one go.
As rog said, you can construct a queue in the background, and the normal mode of load / transcribe would help with this. Page turning may just want to get next on each page-turn event...?
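As an illustration of that background-queue idea, assuming the curated next_subject_id metadata from above exists (the queue here is just a sketch):

```js
// While the volunteer works on the current page, prefetch the next subject
// via its metadata link into a small local queue.
var queue = [];

function prefetchNext(currentSubject) {
  var nextId = currentSubject.metadata.next_subject_id; // from the curated list
  if (!nextId) return; // last page of the log book
  zooAPI.type('subjects').get({id: nextId}).then(function (res) {
    queue.push(res[0]);
  });
}
```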
Cool. This is helpful. Here's a proposed fix that I wrote up, mostly to wrap my head around the problem and bounce ideas around to make sure I'm not overcomplicating anything.
Proposed Fix: To curate a linked list of subjects in the proper page order.
This requires a traversal through all the subjects in order to determine a correspondence between subject id and page number, followed by a sort of subject ids by page number. Note that for a given subject, the only reliable source for determining ordering is the metadata.pageNumber field. Timestamps and subject ids (which usually increase monotonically with page numbers) are unreliable due to asynchronous uploads, etc.
1. Use the API to get the first subject, currSubject (not necessarily the first page).
2. Store currSubject.id and currSubject.metadata.pageNumber in a hash and push it to an array. Then use currSubject._meta.subjects.next_href to get the second subject. For example, if currSubject._meta.subjects.next_href is '/subjects?page=2&page_size=1', we would extract page=2 and use the API call to get the second subject. Continue until all pages have been traversed, storing the subject id and page number each time.
3. Sort the array by page number. This'll give the proper ordering of subject ids (a rough sketch of steps 1-3 follows this list).
4. For each subject id, we first get the subject hash and add next_subject_id and prev_subject_id fields to the meta hash. This modified subject can then be pushed back up via the Panoptes API.
5. Lastly, the Old Weather codebase needs to be modified to fetch a random unseen subject and, as a secondary step (with at least one additional API call), the next/prev subjects can be cached.
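Rough sketch of steps 1-3, reusing the zooAPI client from the snippets above. The per-subject _meta.subjects.next_href check and the response shape are my assumptions based on this thread, so treat it as a sketch rather than a drop-in:

```js
// Walk the set one subject per page, record (id, pageNumber) pairs,
// then sort by metadata.pageNumber (the only reliable ordering field).
function collectPageOrder(subjectSetId) {
  var pages = [];

  function fetchPage(pageNum) {
    var params = {subject_set_id: subjectSetId, page: pageNum, page_size: 1};
    return zooAPI.type('subjects').get(params).then(function (res) {
      var subject = res[0];
      pages.push({id: subject.id, pageNumber: Number(subject.metadata.pageNumber)});
      // Assumption: _meta.subjects.next_href disappears after the last page
      if (subject._meta && subject._meta.subjects.next_href) {
        return fetchPage(pageNum + 1);
      }
      return pages.sort(function (a, b) { return a.pageNumber - b.pageNumber; });
    });
  }

  return fetchPage(1);
}

// Usage: collectPageOrder(2493).then(function (pages) { /* step 4 */ });
```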
One idea would be to store the next 5 (or however many) subject ids in metadata instead. Something like:

```js
subject.metadata.next_subject_ids = [1066092, 1066093, 1066094, 1066095, 1066096];
subject.metadata.prev_subject_ids = [1066090, 1066089, 1066087, 1066086, 1066085];
```

We would just have to take care to handle the near-end cases where we have fewer array elements.
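To make the near-end handling concrete, here's one way to do it, building on a pages array sorted by pageNumber as in the sketch above; slice() clamps at the ends for us, so the arrays just come out shorter near the first and last pages:

```js
// Attach up to five next/prev ids per subject; prev ids are reversed so the
// nearest page comes first, matching the example above.
pages.forEach(function (page, i) {
  page.next_subject_ids = pages.slice(i + 1, i + 6)
    .map(function (p) { return p.id; });
  page.prev_subject_ids = pages.slice(Math.max(0, i - 5), i)
    .map(function (p) { return p.id; })
    .reverse();
});
```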
This looks good; a few thoughts.
> Use the API to get the first subject, currSubject (not necessarily the first page).
Yep, just use the normal subject endpoint selection here (most likely passing the set_id and sort param).
> Store currSubject.id and currSubject.metadata.pageNumber in a hash and push it to an array. Then use currSubject._meta.subjects.next_href to get the second subject. For example, if currSubject._meta.subjects.next_href is '/subjects?page=2&page_size=1', we would extract page=2 and use the API call to get the second subject. Continue until all pages have been traversed, storing the subject id and page number each time.
So this will traverse the list, create the linked list and then update the API subject metadata. I think we can just use the subjects export CSV file for this instead of querying the API.
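If we go the CSV route, something like this could build the ordered list offline. The subject_id and metadata column names are assumptions about the export format, and csv-parse is just one way to read it:

```js
var fs = require('fs');
var parse = require('csv-parse/lib/sync'); // npm install csv-parse

// Assumed export columns: subject_id plus a metadata column holding JSON
var rows = parse(fs.readFileSync('old-weather-subjects.csv'), {columns: true});

var pages = rows.map(function (row) {
  var meta = JSON.parse(row.metadata);
  return {id: row.subject_id, pageNumber: Number(meta.pageNumber)};
}).sort(function (a, b) { return a.pageNumber - b.pageNumber; });

// prev/next links fall straight out of the sorted order
pages.forEach(function (page, i) {
  page.prev_subject_id = i > 0 ? pages[i - 1].id : null;
  page.next_subject_id = i < pages.length - 1 ? pages[i + 1].id : null;
});
```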
> For each subject id, we first get the subject hash and add next_subject_id and prev_subject_id fields to the meta hash. This modified subject can then be pushed back up via the Panoptes API.
Yes!
> Lastly, the Old Weather codebase needs to be modified to fetch a random unseen subject and, as a secondary step (with at least one additional API call), the next/prev subjects can be cached.
Re the first point, I don't think so; it looks like it currently uses the correct URL (except that damn page param): https://panoptes.zooniverse.org/api/subjects?sort=queued&workflow_id=886&page=1&page_size=20&subject_set_id=2520. That workflow is grouped and not prioritised, so that URL will return a random subject for that set. Yes to the second step, but that is just https://panoptes.zooniverse.org/api/subjects/subject_id?selection_context=true (pending working selection context).
> One idea would be to store the next 5 (or however many) subject ids in metadata instead. Something like ...
Could be an idea to manually test per-one and per-batch (5) response times to see how this goes, as you should be able to buffer at least one (probably more) during render / after page load, etc. That'll give you a good idea about how you want to do it.
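A quick manual check could look something like this; the ids are the illustrative ones from above, and the array form of the id param assumes the client serialises it to the comma-separated ids query shown earlier:

```js
// Quick-and-dirty timing: one subject by id vs a batch of five
function timeFetch(label, params) {
  var start = Date.now();
  return zooAPI.type('subjects').get(params).then(function (res) {
    console.log(label, (Date.now() - start) + 'ms for', res.length, 'subject(s)');
  });
}

timeFetch('single', {id: ['1066092']}).then(function () {
  return timeFetch('batch of 5', {
    id: ['1066092', '1066093', '1066094', '1066095', '1066096']
  });
});
```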
At the basic level this is probably largely solved by sorting the subject fetch by created_at, eschewing cellect for a manual check on the frontend for whether the user has seen the subject. We could then add some nice UI around showing the current page in the context of the log book and allow navigation between pages. @rogerhutchings says he has some nice designs somewhere :)
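For what it's worth, a minimal sketch of that created_at approach, assuming a locally tracked seenIds set (hypothetical; however the frontend records classifications):

```js
// Fetch the set in upload order and filter out already-seen subjects client-side
var seenIds = {}; // filled in locally as the user classifies, e.g. seenIds[id] = true

var params = {subject_set_id: 2493, sort: 'created_at', page_size: 20};
zooAPI.type('subjects').get(params).then(function (subjects) {
  var unseen = subjects.filter(function (s) { return !seenIds[s.id]; });
  // show unseen[0]; prev/next navigation falls out of the array order
});
```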