zooniverse / aggregation-for-caesar

Apache License 2.0

Auto detect "bad" transcription results and shortcut the aggregation code #347

Open CKrawczyk opened 4 years ago

CKrawczyk commented 4 years ago

When the transcription aggregation parameters are not tuned correctly, a large number of lines of text can end up being treated as one line of text (in the worst case, the entire page is seen as one line). When this happens the aggregation code attempts to align several hundred unique transcriptions, which slows it down and, if it reaches 2 minutes, causes a timeout.

As this case will return gibberish anyway, there is no harm in stopping the aggregation early. This could potentially be detected by looking at how many witnesses are added to collatex (https://github.com/zooniverse/aggregation-for-caesar/blob/master/panoptes_aggregation/reducers/optics_line_text_reducer.py#L201 and https://github.com/zooniverse/aggregation-for-caesar/blob/master/panoptes_aggregation/reducers/text_utils.py#L314), and putting a cap on that number (e.g. only let the first 25 be added for alignment).
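A minimal sketch of what such a cap could look like, kept separate from the reducers themselves. The helper name `cap_witnesses`, the constant `MAX_WITNESSES`, and the returned flag are all hypothetical and not part of panoptes_aggregation or collatex. The value 25 is just the example number suggested above.

```python
from typing import List, Tuple

# Hypothetical default, taken from the "first 25" example above.
MAX_WITNESSES = 25


def cap_witnesses(
    texts: List[str], max_witnesses: int = MAX_WITNESSES
) -> Tuple[List[str], bool]:
    """Keep at most `max_witnesses` transcriptions for alignment.

    Returns the kept transcriptions (in their original order) and a flag
    that is True when some transcriptions were dropped, so the caller can
    mark the reduction as suspect.
    """
    if max_witnesses < 1:
        raise ValueError("max_witnesses must be at least 1")
    was_capped = len(texts) > max_witnesses
    return texts[:max_witnesses], was_capped


# Example: a badly clustered "line" with 300 transcriptions.
texts = [f"transcription {i}" for i in range(300)]
kept, was_capped = cap_witnesses(texts)
# Only the entries in `kept` would then be added as witnesses.
```

Returning the flag alongside the truncated list would let the reducer record that the cap was hit, which is one way to tell a user that the result for that line is likely junk.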

Impacts of this cap:

alnah005 commented 3 years ago

I'm currently trying to use the Poly line text reducer for the ACLS project, and I was wondering whether this problem could be related to the execution never completing. I think it would be helpful to use or add a verbose option in the code to show where it gets stuck. I've attached a sample of my log file, which shows that the reduction seems to work up until 99%.

Reducing: N/A% |                                               | ETA:  --:--:--
Reducing:   0% |                                               | ETA:   1:31:43
Reducing:   0% |                                               | ETA:   1:56:51
Reducing:   0% |                                               | ETA:   1:34:34
Reducing:   0% |                                               | ETA:   1:06:23
Reducing:   0% |                                               | ETA:   0:58:09
Reducing:   0% |                                               | ETA:   0:50:04
Reducing:   0% |                                               | ETA:   0:37:12

.....

Reducing:   7% |###                                            | ETA:   0:09:25
Reducing:   7% |###                                            | ETA:   0:09:24
Reducing:   7% |###                                            | ETA:   0:09:24
Reducing:   7% |###                                            | ETA:   0:09:24
Reducing:   7% |###                                            | ETA:   0:09:24
Reducing:   7% |###                                            | ETA:   0:09:24
Reducing:   7% |###                                            | ETA:   0:09:24
Reducing:   7% |###                                            | ETA:   0:09:23
Reducing:   7% |###                                            | ETA:   0:09:22
Reducing:   7% |###                                            | ETA:   0:09:21
Reducing:   7% |###                                            | ETA:   0:09:20
Reducing:   7% |###                                            | ETA:   0:09:20
Reducing:   7% |###                                            | ETA:   0:09:21

.....

Reducing:  46% |#####################                          | ETA:   0:07:38
Reducing:  46% |#####################                          | ETA:   0:07:38
Reducing:  46% |#####################                          | ETA:   0:07:38
Reducing:  46% |#####################                          | ETA:   0:07:38
Reducing:  46% |#####################                          | ETA:   0:07:38
Reducing:  46% |#####################                          | ETA:   0:07:38
Reducing:  46% |#####################                          | ETA:   0:07:37
Reducing:  46% |#####################                          | ETA:   0:07:37
Reducing:  46% |#####################                          | ETA:   0:07:37
Reducing:  46% |#####################                          | ETA:   0:07:37
Reducing:  46% |#####################                          | ETA:   0:07:38
Reducing:  46% |######################                         | ETA:   0:07:37
Reducing:  46% |######################                         | ETA:   0:07:36
Reducing:  46% |######################                         | ETA:   0:07:36
Reducing:  46% |######################                         | ETA:   0:07:36
Reducing:  46% |######################                         | ETA:   0:07:36
Reducing:  46% |######################                         | ETA:   0:07:35
Reducing:  47% |######################                         | ETA:   0:07:35
Reducing:  47% |######################                         | ETA:   0:07:35
Reducing:  47% |######################                         | ETA:   0:07:35
Reducing:  47% |######################                         | ETA:   0:07:35
Reducing:  47% |######################                         | ETA:   0:07:34

.....

Reducing:  99% |############################################## | ETA:   0:00:00
Reducing:  99% |############################################## | ETA:   0:00:00
Reducing:  99% |############################################## | ETA:   0:00:00
Reducing:  99% |############################################## | ETA:   0:00:00
Reducing:  99% |############################################## | ETA:   0:00:00
Reducing:  99% |############################################## | ETA:   0:00:00
Reducing:  99% |############################################## | ETA:   0:00:00
Reducing:  99% |############################################## | ETA:   0:00:00

It has been stuck at 99% for about an hour or so. Using the -s option I was able to get a final output, but I'm not sure how confident I can be that the final result is complete.

Update:

After letting it run for a further 6 hours, it completed.

CKrawczyk commented 3 years ago

Yeah, 6 hours for 1% sounds like this bug.

Adding a new verbose level that prints the current subject ID might be a good way to figure out where it gets stuck. At the very least you would know which subject ID will give back junk, and that could be used to figure out which classification in the extract file is messing it up.
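As a rough illustration of that idea, here is a sketch of a reduction loop that prints each subject ID before reducing it and records the ones that run long. Nothing here is the actual panoptes_aggregation code: `reduce_subjects`, its parameters, and the `(subject_id, extracts)` input shape are assumptions made for the example, and the 120 second threshold simply mirrors the 2 minute timeout mentioned earlier in this issue.

```python
import sys
import time
from typing import Any, Callable, Dict, Iterable, List, TextIO, Tuple


def reduce_subjects(
    subjects: Iterable[Tuple[Any, List[Any]]],
    reduce_fn: Callable[[List[Any]], Any],
    verbose: bool = False,
    slow_seconds: float = 120.0,
    stream: TextIO = sys.stderr,
) -> Tuple[Dict[Any, Any], List[Any]]:
    """Reduce each subject, optionally printing the subject ID first.

    Returns the reductions keyed by subject ID, plus a list of subject IDs
    whose reduction took longer than `slow_seconds`.
    """
    results: Dict[Any, Any] = {}
    slow_subjects: List[Any] = []
    for subject_id, extracts in subjects:
        if verbose:
            # Printed before the work starts, so a hang still shows the ID.
            print(f"Reducing subject {subject_id}", file=stream, flush=True)
        start = time.monotonic()
        results[subject_id] = reduce_fn(extracts)
        elapsed = time.monotonic() - start
        if elapsed > slow_seconds:
            slow_subjects.append(subject_id)
            if verbose:
                print(
                    f"Subject {subject_id} took {elapsed:.1f}s",
                    file=stream,
                    flush=True,
                )
    return results, slow_subjects
```

Printing the ID before calling the reducer matters here: if the alignment never finishes, the last ID in the log is the subject to inspect in the extract file.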