Open loveeklund-osttra opened 1 month ago
I've looked into this some more and I don't think fail_fast is what's causing the issue. From what I've been able to see, the problem arises when this load_job https://github.com/z3z1ma/target-bigquery/blob/main/target_bigquery/batch_job.py#L63 fails for some reason. I've also discovered that you can get a "silent" error, where a load_job fails without any exception being raised; I've seen this happen when the load_job fails on the last load. Unfortunately, that also moves the state forward. I'd really appreciate some help looking into this, as I don't fully understand all the parts of the target (the workers, and where errors are caught and where they aren't). I'll see if I can figure it out and come up with a fix, but if I can't, we'll have to stop using this target :(
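For context on the "silent" part: with the google-cloud-bigquery client, a load job can finish in an error state without any exception ever being raised, because an exception is only surfaced if someone calls `job.result()` (or inspects the job afterwards). A minimal sketch of a defensive check is below. The `FakeJob` class is a hypothetical stand-in for illustration, but `job_id`, `error_result`, and `errors` are real attributes of `google.cloud.bigquery.LoadJob`:

```python
# Sketch: surface a "silent" load-job failure by inspecting the finished
# job explicitly instead of relying on the worker to raise.

class LoadJobFailed(RuntimeError):
    """Raised when a finished load job carries a terminal error."""

def raise_if_failed(job):
    # On google.cloud.bigquery.LoadJob, `error_result` holds the terminal
    # error (or None on success) and `errors` the per-row details; neither
    # is surfaced unless you call job.result() or check them yourself.
    if job.error_result is not None:
        raise LoadJobFailed(
            f"load job {job.job_id} failed: {job.error_result} ({job.errors})"
        )

# Hypothetical stand-in for a finished-but-failed LoadJob:
class FakeJob:
    job_id = "job-123"
    error_result = {"reason": "invalid", "message": "bad row"}
    errors = [{"message": "bad row"}]

try:
    raise_if_failed(FakeJob())
except LoadJobFailed as exc:
    print("caught:", exc)
```

If the worker ran a check like this on every completed job before acknowledging it, the failure on the last load would at least be loud instead of silent.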
Accidentally closed the issue...
I think I somewhat understand what happens: it's something with the parallelization not waiting properly when it goes to requeue the job in `BatchJobWorker.run`. I'll try to get some more details soon.
If you want to replicate the error, check out this commit https://github.com/loveeklund-osttra/target-bigquery/tree/308859d93da38135a30433edb523c970f4bdb371, install tap-testsource, and run:

meltano run tap-testsource target-bigquery

You should see that your job doesn't fail and your state gets moved forward, even though the BigQuery load job fails.
I've tried the other loading methods as well and they all hit the same error, except gcs_stage, which actually does fail, because it triggers the load into BigQuery in cleanup rather than in the worker's run.
I added some logging statements to get some clarity into why it fails, and I think the problem is that the requeueing logic in `BatchJobWorker.run` causes the pipeline not to wait properly for the job to finish. I'm going to see if I can fix it by removing the retry logic in the workers' run methods.
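The failure mode I suspect can be sketched with a plain queue. This is a deliberate simplification of the worker/queue setup, not target-bigquery's actual code: if the main loop's "queue is drained" check runs in the window between a worker taking a failed job off the queue and requeueing it, the queue looks empty, so the state is committed even though a retry is still pending.

```python
import queue

# Simplified model of the suspected race between a retrying worker and
# the main loop's drain check (illustration only, not the target's code).
q = queue.Queue()
q.put("load-job-1")

# 1. Worker pops the job; the load fails.
job = q.get()
load_failed = True

# 2. Main loop checks for completion *before* the worker requeues:
#    the queue is empty, so it considers the batch done and commits state.
drained = q.empty()

# 3. Worker requeues the failed job for a retry, but state already moved on.
if load_failed:
    q.put(job)

print(f"drained seen by main loop: {drained}, jobs still pending: {q.qsize()}")
```

Dropping the in-worker retry (and letting the failure propagate instead) closes this window, because the job is never "in flight but invisible" to the drain check.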
I think it is the behaviour described here: https://github.com/z3z1ma/target-bigquery/blob/9d1d0b08606a716a5a36f53b3388cbd6055535a8/target_bigquery/target.py#L544C9-L549C79
I suspect what happened is that one of my workers failed on a bad row while the other was able to write out its data, resulting in state being moved forward without any data from the failed sink being written. What is the upside vs. downside referenced in that comment? Is it that data gets read from the source but not written to the target?