zambezi / ez-build

Zambezi build tool
MIT License

Refactor ez-build to implement pipelines as transformation functions #33

Open mstade opened 8 years ago

mstade commented 8 years ago

Expected Behavior

Recent work in #31 spurred an idea to refactor ez-build to properly define pipelines as compositions of transformation functions, operating on a data structure representing the project. What this means is that each transformation function would focus on a single transformation step, such as linking (see #32), modifying that data structure. The data structure would include project metadata such as the project version, designated directories, and so on, but also things like files, so we can rationalize the reading and writing of files.

For example, if a transformation step reads some files and outputs others, it wouldn't do the reading and writing itself, but instead operate on a data structure representing those files, such that we can decide when those actions take place. This is useful for performance, but also for stability – not having each transformation implement file I/O means less chance of errors and greater testability, since we can just mock the project data structure.
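To make this concrete, here's a minimal sketch of what such a project data structure might look like – the names and shapes are hypothetical, not an actual ez-build API:

```js
// Hypothetical shape of the project data structure. Files are plain data,
// so transforms can read and add entries without ever touching the disk.
const project = {
  pkg:  { name: 'my-app', version: '1.2.3' },     // project metadata
  dirs: { src: 'src', lib: 'lib' },                // designated directories
  files: {
    'src/index.js': { contents: '/* source */' }   // inputs, read up front or lazily
  },
  output: {}                                        // files a transform wants written
}
```

A transform that "writes" a file would just add an entry to `output`; the actual disk I/O happens in one place, after the pipelines have run.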

The process of running a build would look something like this (see the sketch after the list):

  1. Inspect the project and build the project data structure
  2. Given the various CLI options, build or choose which pipeline(s) to run
  3. Execute the pipelines in parallel, with each pipeline returning a transformed project data structure
  4. Merge the results of executing pipelines – unresolvable conflicts (i.e. same file written twice) should be considered bugs; a key aspect of pipelines should be zero overlap (i.e. no interdependencies) such that they can indeed be executed in parallel
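As a rough sketch of those four steps, where `inspectProject`, `selectPipelines`, and `mergeProjects` are hypothetical names rather than existing ez-build functions:

```js
// Hypothetical helpers, stubbed so the sketch is self-contained.
const inspectProject  = async cwd => ({ dirs: { src: `${cwd}/src` }, files: {}, output: {} })
const selectPipelines = opts => opts.pipelines                    // e.g. [jsPipeline, cssPipeline]
const mergeProjects   = results => Object.assign({}, ...results)  // a real merge must detect conflicting writes

async function runBuild(opts) {
  const project   = await inspectProject(opts.cwd)   // 1. build the project data structure
  const pipelines = selectPipelines(opts)            // 2. pick pipelines from the CLI options
  const results   = await Promise.all(               // 3. execute the pipelines in parallel
    pipelines.map(pipeline => pipeline(project))
  )
  return mergeProjects(results)                      // 4. merge; the same file written twice is a bug
}
```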

A pipeline should itself just be a transformation function – a composition of other transformation functions – taking the project data structure as input and returning a modified data structure as output. The data structure should be immutable, so the output should always be a new copy!

Thus we need to define three things:

  1. The project data structure
  2. The transformation function signature
  3. How the transformations are applied, i.e. how file I/O occurs
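For items 2 and 3, a minimal sketch of what that could look like – `compose`, `flush`, and the `output` shape are assumptions for illustration, not decided API:

```js
import { writeFile } from 'fs/promises'

// 2. A transform takes a project and returns a new project; a pipeline is
//    just a left-to-right composition of such transforms.
const compose = (...transforms) => project =>
  transforms.reduce((result, transform) => transform(result), project)

// 3. File I/O happens in one place, once the pipelines have produced their
//    final project data structures.
async function flush(project) {
  for (const [path, file] of Object.entries(project.output)) {
    await writeFile(path, file.contents)
  }
}
```

A pipeline would then be something like `compose(parseSources, transpile, link)` (hypothetical transform names), with its result eventually handed to `flush`.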

Current Behavior

Currently, the pipeline design is a bit ad-hoc, and while file I/O is mostly centralized, it's difficult to know where and how to add new functionality (again, such as the linking described in #32) with ease.

FabienDeshayes commented 8 years ago

One question:

3. Execute the pipelines in parallel

Isn't your most common case a single pipeline? I struggle to see how you can have multiple. Even if you treat CSS and JS files separately, there might be cases where they need to be treated together in the same pipeline?

mstade commented 8 years ago

Warning: the following is largely just a brain dump that may or may not make sense.

Currently there are three pipelines that may be executed in parallel: JS, CSS, and copying files; so no, the common case isn't a single pipeline. Will they always be executed in parallel? I don't know, but even if they aren't, the idea of multiple pipelines still makes sense.

When reasoning about the build process we should think about the source as the raw materials, and a pipeline as a conveyor belt attached to zero or more transformers. At the beginning of the conveyor belt, we place a copy of the raw materials in the form of a project data structure, and each transformer along the way will modify this structure as it sees fit, before passing it along to the next transformer. Thus, a pipeline – or conveyor belt – is really just a sequence or composition of transformers. A transformer is a function that takes a project data structure as input, and returns a (potentially new) data structure as output. This would make them ripe for composition, meaning creating a new pipeline is just a matter of creating a new composition of transformation functions.
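To illustrate the shape of a transformer with one made-up example (the names here are placeholders, not actual ez-build transforms):

```js
// A transformer: project in, new project out, never mutating the input.
// This one just prepends a banner to every JS file slated for output.
function addBanner(project) {
  const output = {}
  for (const [path, file] of Object.entries(project.output)) {
    output[path] = path.endsWith('.js')
      ? { ...file, contents: `/* ${project.pkg.name} v${project.pkg.version} */\n${file.contents}` }
      : file
  }
  return { ...project, output }
}
```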

Thus, a pipeline to build "development" JavaScript, i.e. mostly unoptimized and modified to enable things like hot module replacement, would be distinctly different from a "production" JS pipeline – even though they might share the same transformers. These two pipelines will probably never be executed in parallel, but the step from having an application design that easily enables defining these kinds of pipelines to executing them in parallel is minuscule.
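For example, reusing the `compose` helper sketched above, the two pipelines might differ only in their final steps – the transform names are placeholders for the real steps (parsing, transpiling, HMR wiring, minification):

```js
// Hypothetical dev and prod JS pipelines built from mostly shared transforms.
const devJS  = compose(parseSources, transpileES2015, injectHMR)
const prodJS = compose(parseSources, transpileES2015, minify)
```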

Interestingly (or not, I don't know), since pipelines are just compositions of these transformation functions, pipelines themselves are composable. This would make it trivial to share parts of a pipeline by breaking it up into multiple pieces, where these pieces can then be used to define different pipelines where there is indeed overlap. (For instance, the dev/prod pipelines described in the previous paragraph.)

Coming back to your question: there may indeed be cases where you want to treat CSS and JS in the same pipeline – consider CSS modules, for instance. This can be implemented by composing the CSS and JS pipelines, turning them from two functions that can be executed in parallel into something that must be executed sequentially, in case the CSS modules transformation has to work on already transformed CSS, or the other way around, where the CSS pipeline must work on CSS transformed by the CSS modules transformation. At least, the naive implementation is to just compose the pipelines.
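In code, the naive version would be something like this – again with placeholder names:

```js
// Naive take on the CSS modules case: run the pipelines in sequence rather
// than in parallel, so the JS transforms can see the generated CSS modules.
const cssModulesBuild = compose(cssPipeline, generateCSSModules, jsPipeline)
```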

A more complex implementation may be to allow pipeline dependencies, but only in one direction, i.e. you can have a pipeline setup like this:

           JS                    CSS
           |                      |
           |              postcss transforms
           |                      |
           |             generate CSS modules
           |                      |
           | <------------------- |
           |                      |
     ES2015 to ES5           output files
           |
      output files

In this case, the JS pipeline depends on the CSS pipeline – specifically, on it having generated CSS modules. I'm not sure what the API for this would look like, but it's interesting because, so long as a dependency is always between transforms, there can never be a race condition, and so we can still execute things in parallel; we'd just have pipelines waiting for their dependencies to come around. There is a risk of creating circular dependencies where a pipeline will never finish: looking at the graph above, if the CSS pipeline had a dependency on the ES2015 to ES5 transform having completed before executing the postcss transforms, then neither pipeline would ever make progress. I don't imagine pipelines being dynamic, and given there's probably a nice way to declare these pipelines and dependencies, it's probably also quite easy to catch this kind of issue with static analysis when building ez-build itself.
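Purely as speculation – as said, the API is undecided – declaring a dependency between named transforms might look something like this, which would also make a static cycle check straightforward:

```js
// Speculative declaration-style API; none of these names exist in ez-build.
const cssPipeline = definePipeline('css', [postcssTransforms, generateCSSModules, emitCSS])
const jsPipeline  = definePipeline('js', [es2015to5, emitJS], {
  dependsOn: { es2015to5: 'css/generateCSSModules' }  // es2015to5 waits for that transform to finish
})
```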

I feel maybe this entire train of thought is half-baked, and we should perhaps just implement all the features we want first as part of a v1 of ez-build, and then focus on refactoring for a v2. Although, if the refactoring means no breakage, it should really just be v1.1 I guess. Go figure. :o)

FabienDeshayes commented 8 years ago

I think it makes a lot of sense @mstade. Definitely a great base of work for post-v1 :)