Error in device initialization discarded

nashif commented 7 years ago

Reported by M R Rosen:

Currently, if a device fails to initialize and returns a error code, that information is discarded. Ideally, in the event a device fails to initialize, the device cannot be retrieved by the device_get_binding function (return NULL) to allow dependent devices or applications to fail according. See kernel/device.c:68 on current master branch (device_get_binding implementation immediately follows with no checks for device initialization success; as this information is not recorded). This probably requires a separate RAM structure as the initialization function and string name are and should remain in ROM.

(Imported from Jira ZEP-1618)

nashif commented 7 years ago

by M R Rosen:

In my exploration of the linker scripts, it looks like the structures for device initialization are actually in RAM? If so, just adding a flag to them or zeroing out their names upon failed initialization should work to fix this issue.

nashif commented 7 years ago

by Andrew Boie:

Is this something we actually want? The usual design pattern used in the codebase to date is to not check return values/error conditions in the kernel, for code size and performance reasons.

nashif commented 7 years ago

by M R Rosen:

As mentioned in the gerrit thread, there are some optional-at-runtime devices that would be better not to have to expand the driver APIs to test if they are present. Having device_get_binding fail is a clean way for the application or higher-level drivers to know if the device is present as whether that function returns a valid pointer is frequently used in the codebase. I generally agree that checking return values increases code complexity and space, but this is one area where it would be useful for us and possibly others. Not to push everything into Kconfig, but this might be a candidate for an optional flag whether or not to check device initialization.

nashif commented 7 years ago

by Ramesh Thomas:

The devices are not expected to fail in production. If this is for debug, then the drivers should use ASSERT to catch the error.

Note that there is already some code existing in the get_binding function to do something like that. The check for driver_api != NULL got added with the idea that drivers would set it to NULL if their init fails. In my opinion that should be removed for the same reason I gave above.

If code like these are removed then the device structure can be moved to ROM. Refer to a related discussion https://lists.zephyrproject.org/pipermail/zephyr-devel/2016-October/004146.html

nashif commented 7 years ago

by Carles Cufi:

If all the drivers are going to ASSERT() instead of returning errors then IMHO we need more than ever a scalable assert infrastructure that is permanently there (i.e. one cannot disable it) and that does not make the ROM size explode. The requirements for such an infrastructure are outlined here: GH-788

nashif commented 7 years ago

by M R Rosen:

Ramesh Thomas

Thanks for reminding me about the API pointer feature, I had forgotten about that! Though, Im not sure if all drivers use this indirect feature and if the struct is going to move into ROM, this will no longer work. Though I can appreciate the notion that things should work in production, I would argue this isnt universally true. For SoC and board-level devices, I agree, failure in these is a much larger problem that, if happening on a production system, likely means the hardware is no good. However, as mentioned from my project, there are instances in devices where hardware is added or removed after deployment/sale and I think it would be better to allow the application code to automatically handle these cases via the device infrastructure in the kernel rather than hacking it out or (what I would consider worse), requiring users to reflash the device on the install of new modules. Better the original application have all the reasonable drivers and simply require a reboot than multiple versions of the application and a reflash in the event of a module being added or removed.

nashif commented 7 years ago

by Ramesh Thomas:

If this is not a very common feature of devices that use Zephyr then a driver API seems like a more economical solution to me. This is because the expansion is contained in those specific drivers and not made common in the kernel.

I recall from some very early emails in which the API pointer clearing method was getting proposed purely for debugging init failures. The embedded system experts in the threads had brought up all the points that has been brought up here and in the recent email thread https://lists.zephyrproject.org/pipermail/zephyr-devel/2017-February/007156.html. If runtime checking of user failures is not to be done by kernel then, in my opinion, that should be removed. Also that would help moving the containing structure to ROM.

nashif commented 7 years ago

by M R Rosen:

Ramesh Thomas

I agree that it might not be the kernel's job to check for device driver failure, but I think it does need to be well-defined how the system (or how developers) handles the event of a driver failing to initialize (so both driver and application developers know how to handle initialization failure), where the "its just unhandled" isnt really a sufficient answer. In the documentation at present, the API for device_get_binding says "pointer to device structure; NULL if not found or cannot be used", telling both device driver implementers and application developers that the behavior is to return NULL in the event of a failed initialization with the assumption on the part of driver implementers that returning a non-zero return code would handle that and application developers would assume a failed initialization would result in NULL pointer (this is largely true thanks to the api pointer thing I suppose, though I dont think thats made very clear that that was the intended way to handle meeting this API requirement). In fact, the DEVICE_AND_API_INIT macro actively discourages device driver implementers from knowing and using the API pointer technique for a failed initialization. In short, its unclear at present whose job it is to handle this and all signs point to the kernel right now even though it doesnt...

At least where its an option to be enabled or disabled kernel checking device drivers return value lets projects/systems where any device driver failure can be left unhandled minimize their memory usage and runtime while projects/systems where devices might be expected to fail initialization have a clean way of detecting and thus dealing with the error. I know there are discussion about trying to minimize how much the Kconfig space is growing, but a CONFIG flag would be quite appropriate for this kind of thing. And its also true that projects can always modify Zephyr to better meets their needs, but I feel like this is more general than just my case.

nashif commented 7 years ago

by Mark Linkmeyer:

I'm moving this bug to the To Do state because it's a High bug which needs to be fixed in order to release 1.7.

nashif commented 7 years ago

by Andy Ross:

This was discussed in the triage meeting last week. Architecture changes aren't going to make it into 1.7 at this stage, and we don't have a clear design (or even a unified understanding of the need). Needs more discussion.

nashif commented 7 years ago

by Mark Linkmeyer:

You're right Andy Ross . I mistakenly moved it thinking it was a high priority bug that needed to be fixed. Sorry 'bout that!

Ah wait.... :-) I see you converted it to a story earlier today.

zephyriot / zephyr-issues

Error in device initialization discarded #1479