zephyrproject-rtos / zephyr

Primary Git Repository for the Zephyr Project. Zephyr is a new generation, scalable, optimized, secure RTOS for multiple hardware architectures.
https://docs.zephyrproject.org
Apache License 2.0
10.04k stars 6.18k forks source link

Zephyr is missing documentation and validation of device drivers #50788

Open bjarki-andreasen opened 1 year ago

bjarki-andreasen commented 1 year ago

Introduction

Zephyr is missing documentation and build time validation of device drivers. This results in avoidable, obscure errors during development, and having to look through source code, Kconfig files and devicetree bindings files to determine the existence and capabilities of device drivers. This proposal adds a build step which uses predefined Kconfig symbols to determine device driver capabilities, which are used to validate device drivers early, providing helpful, descriptive errors, and for generating device driver documentation.

Problem description

An example devicetree fragment shows the modem bg95 which needs a device driver which is compatible with "quectel,bg95"

&uart0 {
    bg95 {
        compatible = "quectel,bg95";
    }; 
};

Obscure errors during development

In this example, we have two drivers which are compatible with "quectel,bg95", MODEM_GSM_PPP and MODEM_QUECTEL_BG9X.

If we build an application without selecting either compatible driver, we get the obscure

undefined reference to __device_dts_ord_ ...

error. If we select both, we get

__device_dts_ord_ ... defined multiple times

These errors are obscure and they occur at the last stage of the build, costing time both in diagnosing the issue and processing the build itself.

Missing device driver documentation

If we search for the bg95 on the Zephyr documentations page, it will return only the bindings file, as it matches "quectel,bg95". Neither of the compatible device driver's Kconfig symbols MODEM_GSM_PPP and MODEM_QUECTEL_BG9X will be returned.

To consistently discover all compatible device drivers, knowing only the devices compatible string, we must go to the source code, search for all files which include the sanitized compatible "quectel_bg95", look for the CMakeLists.txt, files which include source code files, to determine which Kconfig symbol selects the individual device drivers.

To then determine which API the device drivers support, we must find the source file which includes the relevant DEVICE_DT_DEFINE(), and look to the type of the API struct passed to its API parameter. We can now search for the API on the Zephyr documentation page.

Proposed change

Use Kconfig select feature to select predefined symbols which are mapped to device driver properties by symbol name.

An example symbol for the MODEM_QUECTEL_BG9X driver follows:

    config MODEM_QUECTEL_BG9X
        bool "Enable quectel modem device driver"
        select DEVICE_DRIVER  -> Declares that this symbol enables a device driver
        select DT_COMPAT_QUECTEL_BG95  -> Declares that this device driver supports the quectel,bg95
        select DT_COMPAT_QUECTEL_BG96  -> Declares that this device driver supports the quectel,bg96
        select DEVICE_DRIVER_API_NET_IF  -> Declares that this device exposes the net_if device driver API.

Using the information declared using predefined mapped device driver properties, the obscure linker errors

undefined reference to __device_dts_ord_ ...

and

__device_dts_ord_ ... defined multiple times

will be replaced with the following errors which will show up early in the build, before compilation,

Devicetree node bg95 has status "okay" but no compatible device driver has been included in the build.
The following compatible device drivers where found: MODEM_GSM_PPP, MODEM_QUECTEL_BG9X

and

Conflicting device drivers compatible with devicetree node bg95 have been included in the build.
Only one of the following device drivers may be included in the build: MODEM_GSM_PPP, MODEM_QUECTEL_BG9X

The solution saves time by catching device driver errors in an early stage of the build, and by returning helpful concrete errors. It also enables generating documentation for all existing device drivers, ensuring that searching for bg95 on the Zephyr documentations page will return links to all supported device driver's documentation pages, which internally link to relevant devicetree bindings, Kconfig symbols and API references.

Detailed RFC

The following steps are required to implement this proposed solution:

  1. Determine naming conversions which identify device driver properties from symbol names
  2. Add device driver validation module
  3. Start adding new symbols to device drivers to have them included by device driver validation module
  4. Add device driver documentation generation module.
  5. Deprecate using depends on DT_HAS_compat_ENABLED to raise dependency errors, since this is now handled by device driver validation module.

All predefined symbols connected with the devicetree with the prefix DT_ are generated by dts which runs before Kconfig. Generic symbols like DEVICE_DRIVER will be declared in zephyr/drivers/Kconfig, Pattern matching symbols, like DEVICE_DRIVER_API_<api> may be declared where appropriate. This allows for out-of-tree device driver symbol definition.

Proposed change (Detailed)

The proposal updates kconfig.py to additionally export all symbols which select other symbols to a dict named selects.pickle. It will convert the following symbol

config MODEM_QUECTEL_BG9X
    bool "Enable quectel modem device driver"
    select DEVICE_DRIVER
    select DT_COMPAT_QUECTEL_BG95
    select DT_COMPAT_QUECTEL_BG96
    select DEVICE_DRIVER_API_NET_IF

to

selects = {
    "MODEM_QUECTEL_BG9X": ["DEVICE_DRIVER", "DT_COMPAT_QUECTEL_BG95", "DT_COMPAT_QUECTEL_BG96", DEVICE_DRIVER_API_NET_IF]
}

selects.pickle is then used with the output from dts kconfig.dts, and the predefined symbol naming convention, to identify device drivers, which are then exported as a dict named device_drivers.pickle.

device_drivers = {
    "MODEM_QUECTEL_BG9X": {
        "compatible": ["quectel,bg95", "quectel,bg96"],
        "api": ["net_if],
    },
}

device_drivers.pickle is then used with the output from dts edt.pickle and Kconfig .config to validate device drivers as described in this proposal.

device_drivers.pickle is also used to generate device driver documentation.

From the build systems perspective, it will go from this:

- Generate edt.pickle and kconfig.dts
- Generate .config
- Compile

to this:

- Generate edt.pickle and kconfig.dts
- Generate .config and selects.pickle
- Generate device_drivers.pickle using kconfig.dts and selects.pickle
- Validate device drivers using edt.pickle, .config and device_drivers.pickle
- Compile

How to implement change step by step:

  1. Update Kconfig to generate selects.pickle
  2. Add a cmake module which runs after Kconfig and dts
  3. Create script which imports selects.pickle and kconfig.dts and generates device_drivers.pickle
  4. Create script which importsdevice_drivers.pickle, edt.pickle and .config., and uses them to perform the validation described in this proposal.
  5. Add execution of newly created scripts to newly created cmake module
  6. Merge cmake module with upstream

At this point, validation of device drivers is fully supported. Adding predefined symbols to existing device driver Kconfigs should begin here. A device driver validation error will only occur if a compatible device driver, required by an enabled devicetree node, has been declared using predefined symbols. The more device driver Kconfigs are updated, the better the validation support will be.

  1. Add support for generating device driver documentation from device_drivers.pickle.
  2. Wait until all device drivers have been declared using predefined symbols.
  3. Update the device driver validation module to raise a validation error even if no compatible device driver exists, enforcing device drivers being declared by predefined symbols.

Dependencies

The proposal requires updating Kconfig files for device drivers to use the new predefined symbols. It also strongly encourages not using depends on DT_HAS_compat_ENABLED which will be replaced with select DT_COMPAT_compat.

Concerns and Unresolved Questions

Overlap between current Kconfig dependency error and device driver validation error

The devicetree currently generates symbols which are used in kconfig to determine if a compatible exists, DT_HAS_<compat>_ENABLED, which some Kconfig symbols depend on to determine if they should be selected. This causes a kconfig dependency error if a driver is selected with no supported devices in the devicetree.

There is overlap between that feature and the validation offered by this solution. Kconfig runs before this solution, resulting in the error which would have been provided by this solution being silenced, as the build fails when Kconfig raises the dependency error.

Since the error raised by this solution is more descriptive and helpful, using Kconfig to invoke a dependency error in this case should be discouraged.

Alternatives

Using YAML files to declare device driver properties, instead of predefined Kconfig symbols. Pros using YAML files:

Cons

galak commented 1 year ago
  • This approach is limited to a 1-1 link between compatible and device driver, which is not compatible with drivers which support multiple devices.

This is not true, we can easily describe 1 driver to multiple dts compatibles in Kconfig, we do it today (see drivers/adc/Kconfig.ads1x1x):

config ADC_ADS1X1X
        bool "ADS1X1X driver"
        default y
        depends on DT_HAS_TI_ADS1013_ENABLED || DT_HAS_TI_ADS1014_ENABLED || \
                   DT_HAS_TI_ADS1015_ENABLED || DT_HAS_TI_ADS1113_ENABLED || \
                   DT_HAS_TI_ADS1114_ENABLED || DT_HAS_TI_ADS1115_ENABLED || \
                   DT_HAS_TI_ADS1119_ENABLED
galak commented 1 year ago

One downside to the YAML file approach is we have to keep information in-sync information between the Kconfig and capabilities YAML file. If we shift this information to just the capabilities YAML than we introduce yet another "configuration" system.

bjarki-andreasen commented 1 year ago
  • This approach is limited to a 1-1 link between compatible and device driver, which is not compatible with drivers which support multiple devices.

This is not true, we can easily describe 1 driver to multiple dts compatibles in Kconfig, we do it today (see drivers/adc/Kconfig.ads1x1x):

config ADC_ADS1X1X
        bool "ADS1X1X driver"
        default y
        depends on DT_HAS_TI_ADS1013_ENABLED || DT_HAS_TI_ADS1014_ENABLED || \
                   DT_HAS_TI_ADS1015_ENABLED || DT_HAS_TI_ADS1113_ENABLED || \
                   DT_HAS_TI_ADS1114_ENABLED || DT_HAS_TI_ADS1115_ENABLED || \
                   DT_HAS_TI_ADS1119_ENABLED

Corrected in proposal

bjarki-andreasen commented 1 year ago

One downside to the YAML file approach is we have to keep information in-sync information between the Kconfig and capabilities YAML file. If we shift this information to just the capabilities YAML than we introduce yet another "configuration" system.

The YAML files do not configure anything. They declare that a device driver exists, which Kconfig symbol will include it, and what it will offer if it is included. If the device driver is changed to a degree where the declaration is no longer consistent with what the device driver offers, the declaration must be updated.

The only information from Kconfig which must be in-sync with the YAML files is the symbol which selects the device driver. This is a primary function of the YAML file, pointing out which specific Kconfig symbol will include the device driver.

There is a clear separation of responsibilities between YAML files and Kconfig. The YAML files declare device drivers and their capabilties. The capabilities of a device driver is determined by the source code. The source code is configured by Kconfig.

galak commented 1 year ago

We already have a solution and pattern to the problem of having a driver enabled w/o devicetree node being enabled. That is what the depends on DT_HAS_TI_ADS1013_ENABLED in Kconfig accomplishes. I'm sure there are some cases that still need to be added in Kconfig.

In general we should not be having multiple drivers for the same device. If this is really needed, Kconfig can describe the mutual exclusion between the drivers if needed.

bjarki-andreasen commented 1 year ago

We already have a solution and pattern to the problem of having a driver enabled w/o devicetree node being enabled. That is what the depends on DT_HAS_TI_ADS1013_ENABLED in Kconfig accomplishes. I'm sure there are some cases that still need to be added in Kconfig.

As mentioned in the issue, under Concerns and Unresolved Questions, this is expanded upon. The solution proposed with this issue also does the reverse check, is there a node which is enabled, for which any driver exists, which has not been enabled, and if so, notify the user of the existence of these drivers. The error reporting proposed in this solution is more helpful than causing a Kconfig dependency error as well.

In general we should not be having multiple drivers for the same device. If this is really needed, Kconfig can describe the mutual exclusion between the drivers if needed.

Having multiple drivers is beneficial for out-of-tree drivers, where you want to overwrite an existing driver with your own implementation. It is also useful for devices which can be interacted with in different ways, like a generic cover all driver, like MODEM_GSM_PPP, and a tailored driver, like MODEM_QUECTEL_BG9X. Turning it around, we should not disallow or prevent having multiple drivers which support the same devices either, it is a valid usecase.

gregshue commented 1 year ago

In general we should not be having multiple drivers for the same device. If this is really needed, Kconfig can describe the mutual exclusion between the drivers if needed.

Hmm. I would not be surprised to see a chip vendor provide a licensed module with an enhanced device driver. Providing a configuration and build system that supports this seems to be consistent with the Project mission and vision.

gregshue commented 1 year ago

select DEVICE_DRIVER -> Declares that this symbol enables a device driver

AFAICT, selecting select DEVICE_DRIVER only tells the config system that at least one device driver exists in the system. Won't that always be true outside of test cases? How is the DEVICE_DRIVER setting bound to this symbol?

bjarki-andreasen commented 1 year ago

select DEVICE_DRIVER -> Declares that this symbol enables a device driver

AFAICT, selecting select DEVICE_DRIVER only tells the config system that at least one device driver exists in the system. Won't that always be true outside of test cases? How is the DEVICE_DRIVER setting bound to this symbol?

In the current Kconfig output .config, that is correct. This proposal adds a step to Kconfig.py (which has access to all individual symbols and their contexts) which creates a new output: selects.pickle which is a dictionary containing all symbols which select any other symbol. For example:

This Kconfig symbol

config MODEM_QUECTEL_BG9X
    bool "Enable quectel modem device driver"
    select DEVICE_DRIVER
    select DT_COMPAT_QUECTEL_BG95
    select DT_COMPAT_QUECTEL_BG96
    select DEVICE_DRIVER_API_NET_IF

becomes

selects = {
    "MODEM_QUECTEL_BG9X": ["DEVICE_DRIVER", "DT_COMPAT_QUECTEL_BG95", "DT_COMPAT_QUECTEL_BG96", DEVICE_DRIVER_API_NET_IF]
}

in the new selects.pickle output (pickle is just a object serialization format like JSON, but binary).

gregshue commented 1 year ago

@bjarki-trackunit It seems to me that this proposal is changing the semantics of the 'select' keyword. If true, then @carlescufi's response in this #build-system message is appropriate here:

Ulf, the original author of Kconfiglib, was always very careful to match 1:1 the Kconfig language from upstream Linux. If we start deviating from it, we need a very good reason, because then we won't be using "Kconfig" anymore, but our own dialect.

I strongly prefer any Kconfig language deviations to be with new keywords (introducing easily recognizable extensions) rather than changing semantics (introducing inconsistencies).

bjarki-andreasen commented 1 year ago

@bjarki-trackunit It seems to me that this proposal is changing the semantics of the 'select' keyword. If true, then @carlescufi's response in this #build-system message is appropriate here:

Ulf, the original author of Kconfiglib, was always very careful to match 1:1 the Kconfig language from upstream Linux. If we start deviating from it, we need a very good reason, because then we won't be using "Kconfig" anymore, but our own dialect.

I strongly prefer any Kconfig language deviations to be with new keywords (introducing easily recognizable extensions) rather than changing semantics (introducing inconsistencies).

The semantics are not changed, select still selects a reverse dependency, we simply deduce meaning from the names of the selected symbols. Kconfig does not care if a selected symbol is named DEVICE_DRIVER, or ABCD, we can however reserve some symbol names and decide that they indicate some property.

We already do this with symbols like SERIAL_HAS_DRIVER and SOC_STM32U585XX, (just discovered TAINTEDBLOBS* as well). This solution essentially goes all the way with this approach by agreeing on a symbol naming convention that can be recognized by the build system.

The only change to Kconfig.py is just having it produce an extra file containing the symbol selections on a pr symbol basis.

mkschreder commented 1 year ago

@bjarki-trackunit do you have some code that implements at least a portion of this proposal?

P.S. The cryptic error messages are easily solved if you know that you can just search for the ord in generated headers (build/zephyr/include/generated/device_extern.h) and from there find the offending dts node.

bjarki-andreasen commented 1 year ago

@bjarki-trackunit do you have some code that implements at least a portion of this proposal?

So far I have simply made a quick and dirty addition to kconfig.py to export the symbols and their selections locally as a POC, if we agree on the overarching concept, using kconfig symbol names as metadata, we can go ahead with actual implementation in future PRs.

P.S. The cryptic error messages are easily solved if you know that you can just search for the ord in generated headers (build/zephyr/include/generated/device_extern.h) and from there find the offending dts node.

I would not say the error message is solved, you are describing a way of diagnosing the obscure error. The obscurity of the error is rooted in the "undefined reference to", and the fact that it is a compilation error.

The build system should ensure that the user has included all required drivers before compilation starts, and it should be able to help the user find an appropriate driver if it detects a device missing one.

This proposal provides a way of determining metadata from Kconfig, with that information, we can solve a host of issues with one system, instead of having many different bespoke solutions. The less rules, obscure errors and manual steps the user must recognize and perform to use the project, the better adoption will be.

mkschreder commented 1 year ago

I think a good check to implement would be to find a way to guarantee that all device tree nodes that are 'status = "okay"' are instantiated (have a corresponding driver included). I don't have an off the cuff suggestion yet on how to do this elegantly but one possible way is to simply have a pre-build check that goes over the generated zephyr.dts (final device tree) and over the full list of files included in the build and cross references device tree compatible strings to all DT_DRV_COMPAT defines found in the list of built files (Although I have a hunch that there could be a better way by using variables generated by device instantiation macros).

I would like to hear more input from others on an elegant way to implement this cross reference check.

henrikbrixandersen commented 1 year ago

I think a good check to implement would be to find a way to guarantee that all device tree nodes that are 'status = "okay"' are instantiated (have a corresponding driver included).

What problem are you trying to solve here? Having devictree nodes with status = "okay" but no Zephyr driver instantiated is a perfectly valid configuration.

bjarki-andreasen commented 1 year ago

I think a good check to implement would be to find a way to guarantee that all device tree nodes that are 'status = "okay"' are instantiated (have a corresponding driver included). I don't have an off the cuff suggestion yet on how to do this elegantly but one possible way is to simply have a pre-build check that goes over the generated zephyr.dts (final device tree) and over the full list of files included in the build and cross references device tree compatible strings to all DT_DRV_COMPAT defines found in the list of built files (Although I have a hunch that there could be a better way by using variables generated by device instantiation macros).

I would like to hear more input from others on an elegant way to implement this cross reference check.

Crawling through code is not exactly an elegant solution IMO. DT_DRV_COMPAT is only used for drivers which only support one device compatible, some drivers support multiple compatibles. We also have many different macros which instantiate devices, NET_IF_DEFINE, DEVICE_DEFINE, DEVICE_DT_DEFINE etc. and even "worse" for this solution, we have macros, which contain DEVICE_DT_DEFINE(), meaning we need to run a preprocessor first. It is rather complex and will be hard to maintain.

The core of the solution in this proposal is to tell the build system about the drivers, in a way which keeps that information in one place without duplication. We can then use this information to implement the behavior we want.

Your example with status="okay" we can handle more elegantly if the build systems knows which drivers exists for that compatible, if any. A device can have the status okay without any driver, but if a driver exists, we can inform the user that he may want to enable one of the compatible drivers.

The solution should be helpful, without blocking the user, or requiring manual steps which could be automated, and the meta information should be provided by the driver developers, to help the end users (developer and end user are often the same people, but still :))

mkschreder commented 1 year ago

I think a good check to implement would be to find a way to guarantee that all device tree nodes that are 'status = "okay"' are instantiated (have a corresponding driver included).

What problem are you trying to solve here? Having devictree nodes with status = "okay" but no Zephyr driver instantiated is a perfectly valid configuration.

What is the logical sense of that? If one sets something to "status = 'okay'" then one almost expects the driver to automatically be enabled and included so that the node is in fact "okay". The fact that we need to enable the driver as well is almost a bit of a nuisance. But of course we can have multiple drivers that implement a compatible so I understand the necessity of the extra step of picking a driver - however, I would expect that having status = okay and no driver included in build should in fact generate an error.

mkschreder commented 1 year ago

@gregshue at this time I'm just brainstorming ideas and it's good to map out possible solutions and issues with each.

What about making the mappings from device tree bindings to device drivers? One could for instance link from "dts/bindings/watchdog/nxp,lpc-wwdt.yaml" to "drivers/watchdog/wdt_mcux_wwdt.c" and always in the same source tree. So for example, if one wants to add a custom driver, one would extend the existing binding in the directory tree of the custom driver and then the linking will search directories first in the extended yaml location, then in the location of any base bindings.

What are issues of linking dts bindings to one or more device driver implementations?

bjarki-andreasen commented 1 year ago

I think a good check to implement would be to find a way to guarantee that all device tree nodes that are 'status = "okay"' are instantiated (have a corresponding driver included).

What problem are you trying to solve here? Having devictree nodes with status = "okay" but no Zephyr driver instantiated is a perfectly valid configuration.

What is the logical sense of that? If one sets something to "status = 'okay'" then one almost expects the driver to automatically be enabled and included so that the node is in fact "okay". The fact that we need to enable the driver as well is almost a bit of a nuisance. But of course we can have multiple drivers that implement a compatible so I understand the necessity of the extra step of picking a driver - however, I would expect that having status = okay and no driver included in build should in fact generate an error.

Not all nodes have drivers, even when the node is enabled, and some nodes are supported by multiple drivers, so we need the user to enable the one they want to use.

gregshue commented 1 year ago

for instance link from "dts/bindings/watchdog/nxp,lpc-wwdt.yaml" to "drivers/watchdog/wdt_mcux_wwdt.c" ...

Consider that the link needs to still be meaningful when the driver implementation and/or filename changes. (E.g., when a file is refactored into separate files, when a driver supports some of the versions of a HW block.)

... and always in the same source tree.

For end users that are limited to reuse an upstream repository (no additional commits) fixing any mapping errors must be done through configuration settings and mappings in a different repository. If both repositories have parallel directory structures, is it the same source tree? 🤔

What are issues of linking dts bindings to one or more device driver implementations?

In addition to those above, consider:

gregshue commented 1 year ago

@nashif @MaureenHelm The considerations I mentioned above are other examples of User Needs that need to be part of the conversation. Should I create issues for these as well?

mkschreder commented 1 year ago

@gregshue do you see any reason to ever have a block in device tree enabled (status = "okay") and not have a driver for it compiled in? I'm just curious what your take on this is. Here I'm talking only about nodes that have a compatible string (ie expect a driver to handle them).

gregshue commented 1 year ago

🤔 ... Different users, different needs ... Is this good enough?

As a contracted designer of bitstreams for FPGA with embedded CPU cores, I need to generate and validate device tree files describing the memory-mapped blocks in the bitstream, so that my client's own firmware engineers have DT files ready for whenever they choose to implement drivers.

Appended NOTE: The bitstream itself can be verified outside of Zephyr, so the contractor may have no need to create drivers in Zephyr.