wemake-services / wemake-python-styleguide

The strictest and most opinionated python linter ever!
https://wemake-python-styleguide.rtfd.io
MIT License
2.55k stars 378 forks source link

Make WPS428 configurable (Airflow example) #1737

Open bryzgaloff opened 4 years ago

bryzgaloff commented 4 years ago

Rule change request

Make WPS428 (Found statement that has no effect) configurable.

Thesis

Let WPS428 treat some operators as having side effects and not violating the original rule.

Reasoning

WPS introduces rule WPS428: Found statement that has no effect. The majority of Airflow-based codebases violate this rule because of using >> operator to define dependencies between tasks.

Please see the last line of the following example from the official documentation: t1 >> [t2, t3]. It defines that both tasks t2 and t3 are run only when t1 is completed. And this line violates WPS428 since its result is not assigned to anything.

There is also a method-style way to define dependencies: t2.set_upstream(t1), t3.set_upstream(t1). However, bitshift composition is recommended and is more common.

I don't want to disable this check completely and would rather like to ignore bitshift operators only since they are very uncommon in Airflow-based codebases for purposes other than setting up dependencies.

--

Also huge thanks for you work! It has dramatically simplified teaching my team to use best coding practices without much participation from my side!

sobolevn commented 4 years ago

Hi!

However, bitshift composition is recommended and is more common.

Very good point! I was thinking about the same method when I needed functional pipelines. But, python is not really suited for this method in my opinion, so I have decided to go with pipe() function: https://returns.readthedocs.io/en/latest/pages/pipeline.html

I don't want to disable this check completely and would rather like to ignore bitshift operators only since they are very uncommon in Airflow-based codebases for purposes other than setting up dependencies.

There's not much we can do here. I cannot also just ignore this operator completely in our source code, since

And I cannot think of any simple configuration format for this task. The first idea is that we can include and exclude nodes in the config section. But, this is not going to work as-is. Because, you still don't want 1 >> 2 to be valid even if >> is ignored from this check.

So, I would recommend to either:

  1. Use methods, methods are great! I would prefer a method over a custom dsl in 9 out of 10 cases
  2. Use noqa, I hope that this is not a really common task in your code base

Also huge thanks for you work! It has dramatically simplified teaching my team to use best coding practices without much participation from my side!

Happy to hear that! 😊

I am going to leave this open for some time, maybe you (or others) would have some ideas about what can be done in this case.

bryzgaloff commented 4 years ago

so I have decided to go with pipe() function

Thanks for an example, but this does not look suitable for Airflow. Airflow is a scheduler and the data is not actually processed within Airflow-based code: it only defines the dependencies while actual code execution is usually run on an external environment (e.g. Airflow schedules a task to be executed by either Spark or Celery worker).

The first idea is that we can include and exclude nodes in the config section

I strongly vote for this. See my arguments below.

Because, you still don't want 1 >> 2 to be valid even if >> is ignored from this check.

Yup, 1 >> 2 will still violate the rule but the trick is that bitshift operation are truly very uncommon for the domain where Airflow is used (data engineering). Usually Airflow DAG file defines dependencies only by instantiating some DAG operators (instance of an operator is called task in Airflow terms) and bitshifting them to each other. No actual code and data operating is usually present within DAG files. So as for my case, I would be definitely satisfied with having an ability to completely disable check for concrete operator (bitshift in that case).

we can include and exclude nodes in the config section

Could you please give me a little hint on which parts of WPS can I patch to achieve the desired functionality?

Use methods, methods are great!

As I have said, bitshift is really a recommended and them most common approach.

Use noqa, I hope that this is not a really common task in your code base.

Not really :) Using bitshifts is exactly very common :) Here is an example from our real project:

            unload_task = _unload_to_file(export_config)
            if not export_config.is_incremental:
                _latest_only(export_config.athena_table_name) >> unload_task
            unload_task \
                >> _upload_to_s3(export_config.athena_table_name) \
                >> [
                    _clear_export_dir_and_logs(export_config.athena_table_name),
                    *_repair_tables(export_config),
                ]

So generally I mean using this configuration is safe enough for a specific area of software engineering (data engineering). I may try to contribute if you help me with navigating through WPS code.

sobolevn commented 4 years ago

Could you please give me a little hint on which parts of WPS can I patch to achieve the desired functionality?

Here you go: https://github.com/wemake-services/wemake-python-styleguide/search?q=StatementHasNoEffectViolation

bryzgaloff commented 3 years ago

@sobolevn will you have a chance to review? https://github.com/wemake-services/wemake-python-styleguide/pull/1760