weso / sparkwdsub

Spark processing of Wikidata subsets
MIT License

Assembly doesn't find some required cats libraries #1

Closed: labra closed this issue 2 years ago

labra commented 2 years ago

Although the project seems to work locally and the tests pass, launching it with spark-submit raises an error because some cats classes cannot be found.

We were using decline to parse command line arguments, and it failed because it couldn't find the cats methods that decline uses, so we changed it to plain argument handling. With that change the job seems to start, but it crashes as soon as it begins processing.
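For reference, the plain argument handling looks roughly like this (a minimal sketch; the object, method, and argument names are placeholders, not the actual sparkwdsub entry point):

object Main {
  // Parsing args by hand avoids pulling decline (and the cats syntax it
  // needs) onto the driver's startup path.
  def main(args: Array[String]): Unit = args match {
    case Array(dumpPath, outPath) => run(dumpPath, outPath)
    case _ =>
      System.err.println("Usage: sparkwdsub <dumpFile> <outDir>")
      sys.exit(1)
  }

  // Placeholder for the actual Spark job.
  def run(dumpPath: String, outPath: String): Unit = ???
}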

At the moment, this is the output of the crash:

21/08/31 14:05:37 INFO Utils: Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 36909.
21/08/31 14:05:37 INFO NettyBlockTransferService: Server created on 192.168.1.134:36909
21/08/31 14:05:37 INFO BlockManager: Using org.apache.spark.storage.RandomBlockReplicationPolicy for block replication policy
21/08/31 14:05:37 INFO BlockManagerMaster: Registering BlockManager BlockManagerId(driver, 192.168.1.134, 36909, None)
21/08/31 14:05:37 INFO BlockManagerMasterEndpoint: Registering block manager 192.168.1.134:36909 with 434.4 MiB RAM, BlockManagerId(driver, 192.168.1.134, 36909, None)
21/08/31 14:05:37 INFO BlockManagerMaster: Registered BlockManager BlockManagerId(driver, 192.168.1.134, 36909, None)
21/08/31 14:05:37 INFO BlockManager: Initialized BlockManager: BlockManagerId(driver, 192.168.1.134, 36909, None)
java.lang.NoSuchMethodError: cats.kernel.CommutativeSemigroup.$init$(Lcats/kernel/CommutativeSemigroup;)V
    at cats.UnorderedFoldable$$anon$1.<init>(UnorderedFoldable.scala:81)
    at cats.UnorderedFoldable$.<init>(UnorderedFoldable.scala:81)
    at cats.UnorderedFoldable$.<clinit>(UnorderedFoldable.scala)
    at fs2.internal.Scope.$anonfun$open$1(Scope.scala:138)
    at cats.effect.IOFiber.runLoop(IOFiber.scala:381)
    at cats.effect.IOFiber.execR(IOFiber.scala:1151)
    at cats.effect.IOFiber.run(IOFiber.scala:128)
    at cats.effect.unsafe.WorkerThread.run(WorkerThread.scala:359)
Exception in thread "main" java.lang.NoSuchMethodError: cats.kernel.CommutativeSemigroup.$init$(Lcats/kernel/CommutativeSemigroup;)V
    at cats.UnorderedFoldable$$anon$1.<init>(UnorderedFoldable.scala:81)
    at cats.UnorderedFoldable$.<init>(UnorderedFoldable.scala:81)
    at cats.UnorderedFoldable$.<clinit>(UnorderedFoldable.scala)
    at fs2.internal.Scope.$anonfun$open$1(Scope.scala:138)
    at cats.effect.IOFiber.runLoop(IOFiber.scala:381)
    at cats.effect.IOFiber.execR(IOFiber.scala:1151)
    at cats.effect.IOFiber.run(IOFiber.scala:128)
    at cats.effect.unsafe.WorkerThread.run(WorkerThread.scala:359)

My feeling is that the assembly merge strategy we use to generate the fat jar is failing, either because it discards some file required to locate those classes or because it picks up an old version of cats.

These are the build.sbt lines that declare the assembly merge strategy:

ThisBuild / assemblyMergeStrategy := {
  case PathList("META-INF", xs @ _*) => MergeStrategy.discard
  case x => MergeStrategy.first
},
. . .

and when assembling, the system says:

[info] compiling 1 Scala source to /home/labra/src/wikidata/sparkwdsub/target/scala-2.12/classes ...
[info] Strategy 'discard' was applied to 663 files (Run the task at debug level to see details)
[info] Strategy 'first' was applied to 176 files (Run the task at debug level to see details)

One intuition is that some required META-INF file is discarded...

The sbt-assembly documentation describes several merge strategies that we may need to consider, for example along the lines of the sketch below.
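For instance, instead of discarding all of META-INF, a finer-grained policy could keep the META-INF/services files that java.util.ServiceLoader needs to locate implementations. A minimal sketch, untested against this project:

ThisBuild / assemblyMergeStrategy := {
  // ServiceLoader registration files must be kept and merged, otherwise
  // the implementations they point to become unreachable at runtime.
  case PathList("META-INF", "services", _*) => MergeStrategy.filterDistinctLines
  // The rest of META-INF (manifests, signature files) can still be dropped.
  case PathList("META-INF", _*) => MergeStrategy.discard
  case _ => MergeStrategy.first
}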

thewillyhuman commented 2 years ago

Hi @labra, this issue has been widely reported with Spark 3 and cats 2.2.0 (https://github.com/typelevel/cats/issues/3628, https://issues.apache.org/jira/browse/SPARK-33077). Apparently it should already be solved... Anyway, I managed to submit the application to Spark by providing the flags indicated in https://github.com/typelevel/cats/issues/3628#issuecomment-859810657.
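For the record, the submission looked roughly like this (the main class, jar name, and arguments below are placeholders, not the actual sparkwdsub invocation):

spark-submit \
  --class <main.Class> \
  --conf spark.driver.userClassPathFirst=true \
  --conf spark.executor.userClassPathFirst=true \
  <assembly-jar> <args>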

Unfortunately, once I tried this workaround I ended up on #2.

thewillyhuman commented 2 years ago

Usage of

--conf spark.driver.userClassPathFirst=true
--conf spark.executor.userClassPathFirst=true

is more harmful than beneficial.

With these flags, every Spark class loses precedence to the user's classes and to anything else on the user's classpath, so the errors that appear afterwards are very painful to debug. The solution we have found is to shade the cats dependencies we use, so that they live under a different package name. In our case we simply add the following lines to build.sbt and mark the Spark dependencies as provided.

assembly / assemblyShadeRules := {
  // Prefix copied from the gfw_forest_loss_geotrellis commit referenced below.
  val shadePackage = "org.globalforestwatch.shaded"
  Seq(
    // Rename our cats.kernel classes inside the fat jar so they cannot
    // clash with the (older) cats.kernel that ships with Spark.
    ShadeRule.rename("cats.kernel.**" -> s"$shadePackage.cats.kernel.@1").inAll
  )
}

This solution was found in this GitHub commit https://github.com/wri/gfw_forest_loss_geotrellis/commit/bef3d50cd107ed024b997d4203486f78965e0122.
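Marking the Spark artifacts as provided keeps them out of the fat jar, so only the shaded cats travels with the application. A minimal sketch of what that part of build.sbt could look like (the version is an assumed placeholder; use the one matching the cluster):

// Sketch: the cluster supplies Spark at runtime, so exclude it from the assembly.
val sparkVersion = "3.1.2" // assumed placeholder, match the cluster's version
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % sparkVersion % Provided,
  "org.apache.spark" %% "spark-sql"  % sparkVersion % Provided
)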

thewillyhuman commented 2 years ago

Looks like this issue has been solved in the latest wdsub releases.