Open loveeklund-osttra opened 1 month ago
I've looked into this some more and I don't think fail_fast is what's causing the issue. From what I've been able to see, the problem arises when this load_job https://github.com/z3z1ma/target-bigquery/blob/main/target_bigquery/batch_job.py#L63 fails for some reason. I've also discovered that you can get a "silent" error, where a load_job fails without any exception being raised; I've seen this happen when the load_job fails on the last load. Unfortunately, that also moves the state forward. I'd really appreciate some help looking into this, as I don't fully understand all the parts of the target (the workers, and where errors are caught and where they aren't). I'll see if I can figure it out and come up with a fix, but if I can't, we'll have to stop using this target :(
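For context on the "silent" part: with the google-cloud-bigquery client, a load job can finish in an error state without any exception ever being raised, because an exception is only surfaced if someone calls `job.result()` (or inspects the job afterwards). A minimal sketch of a defensive check is below. The `FakeJob` class is a hypothetical stand-in for illustration, but `job_id`, `error_result`, and `errors` are real attributes of `google.cloud.bigquery.LoadJob`:

```python
# Sketch: surface a "silent" load-job failure by inspecting the finished
# job explicitly instead of relying on the worker to raise.

class LoadJobFailed(RuntimeError):
    """Raised when a finished load job carries a terminal error."""

def raise_if_failed(job):
    # On google.cloud.bigquery.LoadJob, `error_result` holds the terminal
    # error (or None on success) and `errors` the per-row details; neither
    # is surfaced unless you call job.result() or check them yourself.
    if job.error_result is not None:
        raise LoadJobFailed(
            f"load job {job.job_id} failed: {job.error_result} ({job.errors})"
        )

# Hypothetical stand-in for a finished-but-failed LoadJob:
class FakeJob:
    job_id = "job-123"
    error_result = {"reason": "invalid", "message": "bad row"}
    errors = [{"message": "bad row"}]

try:
    raise_if_failed(FakeJob())
except LoadJobFailed as exc:
    print("caught:", exc)
```

If the worker ran a check like this on every completed job before acknowledging it, the failure on the last load would at least be loud instead of silent.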
Accidentally closed the issue...
I think I somewhat understand what happens: it's something with the parallelization not waiting properly when it goes to requeue the job in `BatchJobWorker.run`. I'll try to get some more details soon.
If you want to replicate the error, check out this commit https://github.com/loveeklund-osttra/target-bigquery/tree/308859d93da38135a30433edb523c970f4bdb371, install tap-testsource, and run:

meltano run tap-testsource target-bigquery

You should see that your job doesn't fail and your state gets moved forward, even though the BigQuery load job fails.
I've tried the other loading methods as well and they all hit the same error, except gcs_stage, which actually does fail, because it triggers the load into BigQuery in cleanup rather than in the worker's run.
I added some logging statements to get some clarity into why it fails, and I think the problem is that the requeueing logic in `BatchJobWorker.run` causes the pipeline not to wait properly for the job to finish. I'm going to see if I can fix it by removing the retry logic in the workers' run methods.
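The failure mode I suspect can be sketched with a plain queue. This is a deliberate simplification of the worker/queue setup, not target-bigquery's actual code: if the main loop's "queue is drained" check runs in the window between a worker taking a failed job off the queue and requeueing it, the queue looks empty, so the state is committed even though a retry is still pending.

```python
import queue

# Simplified model of the suspected race between a retrying worker and
# the main loop's drain check (illustration only, not the target's code).
q = queue.Queue()
q.put("load-job-1")

# 1. Worker pops the job; the load fails.
job = q.get()
load_failed = True

# 2. Main loop checks for completion *before* the worker requeues:
#    the queue is empty, so it considers the batch done and commits state.
drained = q.empty()

# 3. Worker requeues the failed job for a retry, but state already moved on.
if load_failed:
    q.put(job)

print(f"drained seen by main loop: {drained}, jobs still pending: {q.qsize()}")
```

Dropping the in-worker retry (and letting the failure propagate instead) closes this window, because the job is never "in flight but invisible" to the drain check.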
I think it is the behaviour described here: https://github.com/z3z1ma/target-bigquery/blob/9d1d0b08606a716a5a36f53b3388cbd6055535a8/target_bigquery/target.py#L544C9-L549C79
I suspect what happened is that one of my workers failed on a bad row while the other was able to write out its data, resulting in state being moved forward without any data from the failed sink being written. What is the upside vs. downside referenced in that comment? Is it that data gets read from the source but not written to the target?