zubtsov / spark-jobs

Apache License 2.0

Initial idea, design & requirements #1

Open zubtsov opened 3 years ago

zubtsov commented 3 years ago

The idea of this project is to minimize the amount of boilerplate code each Spark job has and to allow a data engineer to specify just the bare minimum of configuration and focus on the business logic and the things that really matter. You don't have to invent artificial concepts for your job to make the code look more object-oriented. You don't have to think about job architecture because it's predefined and general enough to support most (if not all) use cases. In some sense it resembles Apache Maven, which defines a standard project structure; here, a standard job architecture is defined instead.

The job can be represented as a graph of named tables (vertices) and the dependencies between them (edges), which defines the order of evaluation.
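
As a rough illustration, such a graph might be modelled in Scala like the minimal sketch below. All names here (`TableNode`, `JobGraph`, etc.) are hypothetical and not taken from the repository.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

// A job is a set of named tables plus edges that define the evaluation order.
final case class TableNode(
    name: String,                                // unique table name (graph vertex)
    dependencies: Seq[String],                   // names of upstream tables (graph edges)
    build: Map[String, DataFrame] => DataFrame   // business logic, given the resolved inputs
)

object JobGraph {
  // Walks the graph depth-first and builds each table exactly once, dependencies first.
  def run(nodes: Seq[TableNode])(implicit spark: SparkSession): Map[String, DataFrame] = {
    val byName = nodes.map(n => n.name -> n).toMap

    def buildTable(name: String, done: Map[String, DataFrame]): Map[String, DataFrame] =
      if (done.contains(name)) done
      else {
        val node     = byName(name)
        val withDeps = node.dependencies.foldLeft(done)((acc, d) => buildTable(d, acc))
        val inputs   = node.dependencies.map(d => d -> withDeps(d)).toMap
        withDeps + (name -> node.build(inputs))
      }

    nodes.foldLeft(Map.empty[String, DataFrame])((acc, n) => buildTable(n.name, acc))
  }
}
```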

There should be a dedicated entity/abstraction for accessing tables by name. Tables produced within the workflow should be accessible by default, with the ability to override how they are read. External tables should be defined by an engineer.
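
A possible shape for that abstraction, assuming a simple in-memory registry for produced tables and user-supplied readers for external or overridden ones (all names are hypothetical):

```scala
import org.apache.spark.sql.DataFrame

trait TableCatalog {
  def table(name: String): DataFrame
}

final class DefaultTableCatalog(
    externalReaders: Map[String, () => DataFrame]   // e.g. () => spark.read.parquet("...")
) extends TableCatalog {

  private val produced = scala.collection.mutable.Map.empty[String, DataFrame]

  // Called by the framework after a table has been built within the workflow.
  def register(name: String, df: DataFrame): Unit = produced.update(name, df)

  override def table(name: String): DataFrame =
    produced.getOrElse(
      name,
      externalReaders
        .getOrElse(name, sys.error(s"Table '$name' is neither produced by the job nor defined as external"))
        .apply()
    )
}
```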

There must be different pre-defined:

  1. Reading approaches, e.g. full, incremental, rolling period, etc.
  2. Writing approaches, e.g. merge, table overwrite, partition overwrite, append, etc. A data engineer should be able to add new approaches or customize the existing ones (see the sketch after this list).
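
For example, the reading and writing approaches could be modelled as small open hierarchies. The names below are illustrative, not an existing API; the traits are deliberately left non-sealed so a data engineer can add project-specific modes.

```scala
trait ReadMode
object ReadMode {
  case object Full extends ReadMode
  final case class Incremental(watermarkColumn: String) extends ReadMode
  final case class RollingPeriod(days: Int) extends ReadMode
}

trait WriteMode
object WriteMode {
  case object Append extends WriteMode
  case object OverwriteTable extends WriteMode
  final case class OverwritePartitions(partitionColumns: Seq[String]) extends WriteMode
  final case class Merge(keyColumns: Seq[String]) extends WriteMode
}
```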

There should be the ability to pass context information and general metadata between table builders.

There should be the ability to configure the data format of each table (delta, avro, parquet, csv, etc.).
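
One way this could look, with the format captured in a per-table configuration object (the names are assumptions):

```scala
sealed trait DataFormat { def sparkName: String }
object DataFormat {
  case object Delta   extends DataFormat { val sparkName = "delta"   }
  case object Avro    extends DataFormat { val sparkName = "avro"    }
  case object Parquet extends DataFormat { val sparkName = "parquet" }
  case object Csv     extends DataFormat { val sparkName = "csv"     }
}

final case class TableConfig(
    name: String,
    path: String,
    format: DataFormat,
    options: Map[String, String] = Map.empty
)

// Reading would then be generic over the configured format, e.g.:
// spark.read.format(cfg.format.sparkName).options(cfg.options).load(cfg.path)
```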

There should be the ability to skip building tables based on custom metrics/queries.

There should be the ability to write both stateless & stateful jobs. For stateful jobs, there should be an abstraction/entity used to store their state externally.
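
The state abstraction could be as small as the sketch below; the file-based implementation is only for illustration, in practice it could be a key-value table, a database, etc.

```scala
// Hypothetical abstraction: the job's state lives outside the job itself so reruns can pick it up.
trait StateStore[S] {
  def read(): Option[S]     // None on the very first run
  def write(state: S): Unit
}

// Trivial file-based sketch for illustration only.
final class FileStateStore(path: java.nio.file.Path) extends StateStore[String] {
  override def read(): Option[String] =
    if (java.nio.file.Files.exists(path))
      Some(new String(java.nio.file.Files.readAllBytes(path), "UTF-8"))
    else None

  override def write(state: String): Unit =
    java.nio.file.Files.write(path, state.getBytes("UTF-8"))
}
```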

We should also think about how to incorporate & organize data quality validation & improvement logic.

We should also think about templating of configuration (to allow deployment to different DEV/QA/UAT/PROD environments, which may, for example, have different paths for tables).
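
A minimal sketch of such templating, assuming `${ENV}`-style placeholders in table paths that get substituted per environment (the helper is hypothetical):

```scala
object ConfigTemplating {
  private val Placeholder = """\$\{([A-Z_]+)\}""".r

  // Replaces every ${NAME} placeholder with the value provided for the target environment.
  def resolve(template: String, env: Map[String, String]): String =
    Placeholder.replaceAllIn(template, m =>
      scala.util.matching.Regex.quoteReplacement(
        env.getOrElse(m.group(1), sys.error(s"Missing variable: ${m.group(1)}"))))
}

// ConfigTemplating.resolve("s3://bucket-${ENV}/tables/orders", Map("ENV" -> "qa"))
//   == "s3://bucket-qa/tables/orders"
```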

There should be an ability to share repeating parts of the graph (e.g. through some kind of prototyping mechanism) to avoid duplication of configuration/code.

Table builders should be created either as Scala code or as XML configuration + SQL code.
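
The two flavours could share a common interface, sketched below; `TableBuilder`, `SqlTableBuilder` and the example builder are hypothetical names, not existing code.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

trait TableBuilder {
  def name: String
  def dependencies: Seq[String]
  def build(inputs: Map[String, DataFrame])(implicit spark: SparkSession): DataFrame
}

// Scala flavour: business logic written directly as code.
final class EnrichedOrdersBuilder extends TableBuilder {
  val name         = "enriched_orders"
  val dependencies = Seq("orders", "customers")
  def build(inputs: Map[String, DataFrame])(implicit spark: SparkSession): DataFrame =
    inputs("orders").join(inputs("customers"), "customer_id")
}

// SQL flavour: dependencies come from configuration (e.g. XML), logic comes from a SQL
// statement that refers to the dependencies as temporary views.
final class SqlTableBuilder(val name: String, val dependencies: Seq[String], sql: String)
    extends TableBuilder {
  def build(inputs: Map[String, DataFrame])(implicit spark: SparkSession): DataFrame = {
    dependencies.foreach(d => inputs(d).createOrReplaceTempView(d))
    spark.sql(sql)
  }
}
```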

You should also be able to define your own generic job, specific to your project, and use it as a template.

Use annotations + dependency injection for defining dependencies between tables? https://www.playframework.com/documentation/2.8.x/ScalaDependencyInjection
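
If the annotation route is taken, a declaration might look like the hypothetical sketch below; neither the annotation nor the discovery mechanism (reflection or compile-time processing) exists in the project yet.

```scala
import scala.annotation.StaticAnnotation

// Hypothetical annotation declaring which tables a builder depends on.
final class dependsOn(tableNames: String*) extends StaticAnnotation

// Usage sketch:
// @dependsOn("orders", "customers")
// class EnrichedOrdersBuilder extends TableBuilder { ... }
```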

You still must be able to fine-tune & optimize performance if needed.

Is it general enough to assume that a table builder can produce only one table?

In the future, we should collect different use cases to make the job more and more generic.

zubtsov commented 3 years ago

How do we handle the case when a table has multiple different building logics depending on something (the presence/absence of dependencies, for example)?
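
One option is to resolve dependencies optionally and pattern-match on what is actually present; a sketch under that assumption (table names and the API shape are made up):

```scala
import org.apache.spark.sql.DataFrame

object CustomersTableLogic {
  // Chooses the building logic based on which inputs were actually resolved.
  def build(inputs: Map[String, DataFrame]): DataFrame =
    (inputs.get("crm_customers"), inputs.get("legacy_customers")) match {
      case (Some(crm), Some(legacy)) => crm.unionByName(legacy, allowMissingColumns = true) // Spark 3.1+
      case (Some(crm), None)         => crm
      case (None, Some(legacy))      => legacy
      case (None, None)              => sys.error("No customer source is available")
    }
}
```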

zubtsov commented 3 years ago

A table can have a different schema depending on the presence/absence of the input tables.

zubtsov commented 3 years ago

Ability to add pre/post table/job actions.
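
These could be exposed as overridable hooks on the job, e.g. (a hypothetical interface):

```scala
trait JobHooks {
  def beforeJob(): Unit = ()
  def afterJob(): Unit = ()
  def beforeTable(tableName: String): Unit = ()
  def afterTable(tableName: String): Unit = ()
}
```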

zubtsov commented 3 years ago

Regex pattern for input dependency names
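
A sketch of what pattern-based resolution could look like; the helper and its signature are assumptions, not existing API.

```scala
import org.apache.spark.sql.DataFrame

object DependencyResolution {
  // Selects all available tables whose names match the given regex pattern.
  def resolveByPattern(pattern: String, available: Map[String, DataFrame]): Map[String, DataFrame] = {
    val regex = pattern.r
    available.filter { case (name, _) => regex.pattern.matcher(name).matches() }
  }
}

// DependencyResolution.resolveByPattern("orders_\\d{4}", allTables)
// would pick up orders_2019, orders_2020, ... among the available tables.
```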

zubtsov commented 3 years ago

Annotations are convenient, but they only allow defining ETL job workflows statically, before compilation. We need to think about defining ETL job workflows dynamically at runtime.

zubtsov commented 3 years ago

We also need to be able to resolve dependencies at compile time (using source code generation) rather than at runtime. That would be more efficient with the annotations approach. Without annotations, it's probably worth doing it at runtime.