zendesk / scala-flow

A lightweight library intended to make developing Google Dataflow jobs in Scala easier.
Apache License 2.0

Use the Coder type class to implicitly determine coders #9

Closed · dasch closed this 7 years ago

dasch commented 7 years ago

Dataflow's built-in CoderRegistry system doesn't support registering coders for types that have type class restrictions. This presents a problem, as we'd like to use such types. Furthermore, it's a rather involved process to make a coder class take type parameters.

Since we're in Scala-land we might as well use the tools available to us: type classes.

This PR replaces the use of CoderRegistry with implicitly defined coders.
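In rough terms the idea looks like this (a simplified sketch with made-up names, not the exact code in this PR, and assuming the Dataflow SDK 1.x ParDo/DoFn API):

```scala
import com.google.cloud.dataflow.sdk.coders.Coder
import com.google.cloud.dataflow.sdk.transforms.{DoFn, ParDo}
import com.google.cloud.dataflow.sdk.values.PCollection

object ImplicitCoderSyntax {
  implicit class RichPCollection[A](val self: PCollection[A]) extends AnyVal {
    // The output coder is an implicit parameter, so a missing coder becomes a
    // compile-time error instead of a CoderRegistry failure at graph build time.
    def map[B](f: A => B)(implicit coder: Coder[B]): PCollection[B] =
      self
        .apply(ParDo.of(new DoFn[A, B] {
          override def processElement(c: DoFn[A, B]#ProcessContext): Unit =
            c.output(f(c.element()))
        }))
        .setCoder(coder)
  }
}
```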

dasch commented 7 years ago

@bcasey-zd wrote:

This is a neat alternative approach.

There's one major downside: it forces users to always use our methods, e.g. map and filter. If they use their own transforms (a generic apply call or a custom PTransform), then they have to remember to call setCoder themselves.

For that reason I'm not sure we should replace the existing scala-flow/core, but it could go into an experimental package, e.g. scala-flow/implicits-experimental. People could play with it and the feedback would be interesting.

The problem is that the other approach doesn't work with all the types we need, so unless we can fix that it's a no-go :-/

I would be in favor of being more explicit about wrapping PCollection and providing our own apply method with the added constraint (probably named something other than apply, since that conflicts with the Scala convention). This would make everything type safe and avoid runtime errors when there's no encoder for a type. I'd very much like us to pursue this direction.
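Something along these lines, roughly (SafeCollection and via are just placeholder names, and this is only a sketch of the shape):

```scala
import com.google.cloud.dataflow.sdk.coders.Coder
import com.google.cloud.dataflow.sdk.transforms.PTransform
import com.google.cloud.dataflow.sdk.values.PCollection

// A wrapper that only hands out new collections through methods that demand a
// Coder for the output type, so a pipeline that compiles has coders everywhere.
final class SafeCollection[A](val underlying: PCollection[A]) {
  // Named `via` purely as a placeholder for "our own apply".
  def via[B](t: PTransform[PCollection[A], PCollection[B]])
            (implicit coder: Coder[B]): SafeCollection[B] =
    new SafeCollection(underlying.apply(t).setCoder(coder))
}
```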

bcasey-zd commented 7 years ago

The problem is that the other approach doesn't work with all the types we need, so unless we can fix that it's a no-go :-/

The implicits are only needed for the way those Monoid types are currently structured - there are other approaches, so we're not blocked.

I would be in favor of being more explicit about wrapping PCollection and providing our own apply method with the added constraint (probably named something other than apply, since that conflicts with the Scala convention). This would make everything type safe and avoid runtime errors when there's no encoder for a type. I'd very much like us to pursue this direction.

Sure, this is more aligned with the Scio approach. The downside is that either we wrap PCollection and have to make sure all of its functionality is still available, or we don't wrap it and there are unexpected gaps that people can fall into. Right now we get full coverage for free.

I do agree that it's an interesting avenue worth pursuing - a separate sub-module (in the main project) seems like the sensible approach.

dasch commented 7 years ago

@bcasey-zd I feel that having the PCollections be type safe (in that there's a guaranteed coder if it compiles) is important. I'm not sure I have the Scala fu to make this a separate package – wouldn't I need to replace all the collection functions with versions that enforce the type constraint?

bcasey-zd commented 7 years ago

@bcasey-zd I feel that having the PCollections be type safe (in that there's a guaranteed coder if it compiles) is important. I'm not sure I have the Scala fu to make this a separate package – wouldn't I need to replace all the collection functions with versions that enforce the type constraint?

True, the compile-time safety is a nice benefit, but Dataflow does already give us "deploy-time" safety. 🤞

Another nice benefit is that the dedicated Scala coder classes should no longer be necessary; e.g. something like implicit def listCoder[T](implicit coder: Coder[T]): Coder[List[T]] = DelegateCoder[List[T], JList[T]](JListCoder.of(coder), _.asJava, _.asScala.toList) should work.
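Spelled out with imports it would be roughly this (untested, and assuming the SDK's ListCoder/DelegateCoder signatures plus Scala 2.12 SAM conversion for DelegateCoder.CodingFunction):

```scala
import java.util.{List => JList}
import scala.collection.JavaConverters._

import com.google.cloud.dataflow.sdk.coders.{Coder, DelegateCoder, ListCoder}

object ScalaCoders {
  // Derive a Coder[List[T]] from a Coder[T] by delegating to the Java ListCoder
  // and converting between scala.List and java.util.List at the boundary.
  implicit def listCoder[T](implicit elem: Coder[T]): Coder[List[T]] =
    DelegateCoder.of[List[T], JList[T]](
      ListCoder.of(elem),
      (xs: List[T]) => xs.asJava,
      (xs: JList[T]) => xs.asScala.toList
    )
}
```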

By sub-module I meant a separate sbt project rather than a separate package. Keep the same class names and packages - it'll essentially be an alternate version. Have a look at our Dataflow project to see how to add a sub-project to an sbt multi-project build. This will create a separate jar artefact, so our build will publish 2 jars.

dasch commented 7 years ago

I think we should either adopt this approach or abandon it – it's way too early to branch off and try to support two different code bases that implement the same functionality. I'd very much like to see if we can use this rather than try to subclass everything in order to get around the limitations of CoderRegistry.

dasch commented 7 years ago

Also:

dasch commented 7 years ago

@bcasey-zd wrote:

True, the compile-time safety is a nice benefit, but Dataflow does already give us "deploy-time" safety.

I've often introduced an encoding problem only to notice it much later when I tried to deploy – it would be really great to get that guarantee from the compiler.

bcasey-zd commented 7 years ago

Ok, I've combed through the Dataflow and Beam source code and I think we'll be ok - CoderRegistry is used to infer coders only if one isn't explicitly set.

The only gap is the apply method; we can work around it with the transform method. It's not perfect, but I think the benefits outweigh the cost.
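i.e. roughly something like this (only a sketch; the enrichment name and placement are illustrative):

```scala
import com.google.cloud.dataflow.sdk.coders.Coder
import com.google.cloud.dataflow.sdk.transforms.PTransform
import com.google.cloud.dataflow.sdk.values.PCollection

object TransformSyntax {
  implicit class TransformOps[A](val self: PCollection[A]) extends AnyVal {
    // Instead of calling `apply` directly with a custom PTransform (which would
    // fall back to the CoderRegistry), users call `transform`, which requires an
    // implicit Coder for the output and sets it explicitly.
    def transform[B](t: PTransform[PCollection[A], PCollection[B]])
                    (implicit coder: Coder[B]): PCollection[B] =
      self.apply(t).setCoder(coder)
  }
}
```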

I'll pull these changes into a new PR and we can take it from there.