ytsaurus / ytsaurus-spyt

YTsaurus SPYT provides an integration with Apache Spark.
Apache License 2.0

Lock on a jar when running spyt-job #11

Open · MrSandmanRUS opened this issue 1 month ago

MrSandmanRUS commented 1 month ago

Currently, when a SPYT job is launched, the jar file it was launched from stays locked. This makes it impossible to update or delete the jar while the job is running. The lock is not released after the job starts; it is released only when the SPYT job finishes.

How to reproduce:

  1. Upload a jar to YTsaurus.
  2. Launch a SPYT job from this jar file.
  3. While the job is running, a lock is held on the jar file (a way to check this is sketched after these steps).
  4. If you try to delete or overwrite the jar file, the SPYT job crashes with a TimeoutException.
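A quick way to confirm the lock is to read the jar node's lock attributes with the yt.wrapper Python client. This is a hedged sketch: the proxy address and the path //home/project/app.jar are illustrative assumptions, not taken from this issue.

```python
# Hedged sketch: inspect locks on the uploaded jar while the SPYT job is running.
# The proxy address and the jar path below are illustrative assumptions.
import yt.wrapper as yt

yt.config["proxy"]["url"] = "my-cluster.example.com"  # assumption: your YT proxy

jar_path = "//home/project/app.jar"  # assumption: where the jar was uploaded

# lock_count shows how many locks are held; @locks lists each lock
# (mode such as snapshot/shared/exclusive, the owning transaction id, etc.).
print(yt.get(jar_path + "/@lock_count"))
print(yt.get(jar_path + "/@locks"))
```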

Expected behavior: the jar file is only needed at startup, while it is being loaded into memory. After the SPYT job has launched, there should be no locks on the jar; it should be possible to overwrite or delete it while the job is running without killing the job.

zlobober commented 1 month ago

What if a node that executes some part of the computation goes offline? We need to be able to recover from that, so the jar's lifetime must be at least as long as the duration of the job.

In general, all kinds of computation engines in YT take snapshot locks on all required artifacts for the whole lifespan of the corresponding computation.
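For illustration, a snapshot lock taken under a transaction pins the node's content for that transaction, even if the node is later overwritten or removed at its original path. A minimal sketch with the yt.wrapper Python client (the path is an assumption, not taken from this issue):

```python
# Hedged sketch of snapshot-lock semantics in Cypress; the path is illustrative.
import yt.wrapper as yt

jar_path = "//home/project/app.jar"  # assumption: illustrative path

with yt.Transaction():
    # The snapshot lock pins the current version of the node for this transaction:
    # the transaction keeps seeing the jar's content even if the node is
    # overwritten or removed at the original path in the meantime.
    yt.lock(jar_path, mode="snapshot")
    # ... the computation would run here; when the transaction ends,
    # the lock is released and the pinned version can be garbage-collected.
```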

Why is this even a problem for you?

MrSandmanRUS commented 1 month ago

This behavior is unexpected after using Spark on Hadoop.

In theory, once the job has launched, the jar is loaded into memory, and it should no longer matter to us what happens to the jar file.

For example, we have processes that are launched from a jar file, and we want to release an update. With Spark we could simply upload a new jar and delete the old one; here we have to put the version into the jar name and keep track of when old jar files can be deleted so the directory does not get cluttered.
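One workaround in this spirit is to upload each release under a versioned name and periodically remove old jars that no longer hold any locks. A hedged sketch, not an official recipe; the directory //home/project/jars and the naming scheme are assumptions:

```python
# Hedged sketch: versioned jar uploads plus cleanup of old, unlocked versions.
# Directory layout and file names are illustrative assumptions.
import yt.wrapper as yt

jars_dir = "//home/project/jars"   # assumption: where versioned jars are stored
latest = "app-1.2.0.jar"           # assumption: the version being released now

for name in yt.list(jars_dir):
    if name == latest:
        continue
    path = f"{jars_dir}/{name}"
    # lock_count > 0 means a running operation still holds a lock on this jar,
    # so it is not safe to delete yet.
    if yt.get(path + "/@lock_count") == 0:
        yt.remove(path)
```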

zlobober commented 1 month ago

And in practice, how does the recovery happen if an executor node fails? Storing something only in RAM does not look like a resilient option.

Cc @alextokarew

MrSandmanRUS commented 1 month ago

I assume Spark can recover the work of a failed node using the jar image already loaded in the memory of the other nodes.