zephyrproject-rtos / zephyr

Primary Git Repository for the Zephyr Project. Zephyr is a new generation, scalable, optimized, secure RTOS for multiple hardware architectures.
https://docs.zephyrproject.org
Apache License 2.0
11k stars 6.7k forks source link

up_squared: samples/boards/up_squared/gpio_counter failed. #28551

Closed chen-png closed 3 years ago

chen-png commented 4 years ago

Describe the bug when running samples/boards/up_squared/gpio_counter on up_squared board, have configured the GPIO pins according to readme file, but it reported an error "ERROR: cannot set HAT pin 37 to OUT (-22)".

To Reproduce Steps to reproduce the behavior:

  1. west build -b up_squared -p auto samples/boards/up_squared/gpio_counter
  2. west flash ~/up_squared.sh
  3. see error

Logs and console output Booting Zephyr OS build v2.4.0-rc2 ERROR: cannot set HAT pin 37 to OUT (-22)

Environment (please complete the following information):

dcpleung commented 4 years ago

Have you set up the BIOS according to README and main.c? By default, all those pins are not set as GPIOs and cannot be used as such.

chen-png commented 4 years ago

Have you set up the BIOS according to README and main.c?

yes, I have configured BIOS accroding to README in main.c file.

chen-png commented 4 years ago

git bisect and found the first bad commit is 9f846374 .

jenmwms commented 3 years ago

I can reproduce this issue. Thanks @chen-png for bisecting to 9f84637 , I'll look at it and sync with @andrewboie on suggestions.

jenmwms commented 3 years ago

I'm looking into the device mapping here. A few notes about this GPIO/HAT Pin (37), it is 3rd of 4 GPIOs listed as OUT in https://github.com/zephyrproject-rtos/zephyr/blob/master/samples/boards/up_squared/gpio_counter/src/main.c, and it is the only of those 4 GPIOs that does not have a function-mode associated (see https://wiki.up-community.org/Pinout_UP2) toggled in the BIOS.

Also, looking at recent patch that might be related https://github.com/zephyrproject-rtos/zephyr/pull/31274.

jenmwms commented 3 years ago

^that patch https://github.com/zephyrproject-rtos/zephyr/pull/31274 is EHL and UART driver, not UP2 and GPIO driver. Might not help here.

jenmwms commented 3 years ago

@chen-png @dcpleung Can you please help me understand the value "(-22)" in the error message? I see it is ret from main.c line 126, but not sure which is associated with -22 (displayed differently than -E...)

jenmwms commented 3 years ago

Perhaps it is -EINVAL: Invalid argument GPIO, which is consistent with linux error codes...

dcpleung commented 3 years ago

The error values are defined in lib/libc/minimal/include/errno.h: -22 is -EINVAL.

jenmwms commented 3 years ago

It get's through the get_gpio_dev() ok (return 0, successful) in this check in main.c:

for (i = 0; i < NUM_PINS; i++) {
        if (get_gpio_dev(&counter_pins[i]) != 0) {
            return;
        }
    }

so something is happening with the arguments, for only HAT Pin 37, to gpio_pin_configure in

        ret = gpio_pin_configure(counter_pins[i].gpio_dev,
                     counter_pins[i].pin,
                     GPIO_OUTPUT_LOW);
jenmwms commented 3 years ago

Interesting, this seems to effect OUT and IN pins, depending on something TBD. If I remove HAT pin 37 from the structure _pin counter_pins[], the 3 remaining pins in the array are successfully set to OUT (no error). Then we instead observe error with setting the input pin:

Booting Zephyr OS build zephyr-v2.4.0-3009-gfc1b5de4c307 
ERROR: cannot set HAT pin 16 to IN (-22)

I'll try printk in the gpio driver (gpio_intel_apl.c) to see where it is failing. It is not clear if/how the patch at bisect would be a breaking point.

jenmwms commented 3 years ago

building on that, also interesting if I avoid HAT pin 37 and 16, and instead use HAT pin 38 as IN not OUT, I can progress to the first counter value but does not advance:

Booting Zephyr OS build zephyr-v2.4.0-3009-gfc1b5de4c307 
counter: 0x0
chen-png commented 3 years ago

building on that, also interesting if I avoid HAT pin 37 and 16, and instead use HAT pin 38 as IN not OUT, I can progress to the first counter value but does not advance:

Booting Zephyr OS build zephyr-v2.4.0-3009-gfc1b5de4c307 
counter: 0x0

did you short the related pins after you changed the input pin number?

jenmwms commented 3 years ago

did you short the related pins after you changed the input pin number?

Yes, thanks for checking. Counter does not increment. Curious. Edit: I also tried this modified configuration (using different GPIO as INPUT), back at the point when the sample works well by default, and the counter does not increment with the modification there either. That route may not help debug atm. Stepping back to study the driver more.

jenmwms commented 3 years ago

Ok, the driver gets the -EINVAL error at the failed permission check for the pin (HAT pin 37). Wonder what might cause this to start failing. Looking into this. Here are my debug prints in the driver (sample is unchanged):

--- a/drivers/gpio/gpio_intel_apl.c
+++ b/drivers/gpio/gpio_intel_apl.c
@@ -232,10 +232,14 @@ static int gpio_intel_apl_config(const struct device *dev,
         raw_pin = cfg->pin_offset + pin;

+       printk("TEST GPIO DRIVER before check_perm\n pin %d offset %d raw_pin %d\n", pin, cfg->pin_offset, raw_pin);
+
        if (!check_perm(dev, raw_pin)) {
                return -EINVAL;
        }

+       printk("TEST GPIO DRIVER after check_perm ok\n");

which gave this output (each gpio device has an iteration of the driver, see the error in the 2nd gpio in the array)

Booting Zephyr OS build zephyr-v2.4.0-3265-g401ee4ae5bf0  ***
TEST GPIO DRIVER before check_perm
 pin 17 offset 0 raw_pin 17
TEST GPIO DRIVER after check_perm ok
TEST GPIO DRIVER before check_perm
 pin 14 offset 32 raw_pin 46
ERROR: cannot set HAT pin 37 to OUT (-22)
andrewboie commented 3 years ago

I think a PR to add LOG_ERR() statements to all failed runtime checks in the driver might be useful for debugging issues like this in the future

jenmwms commented 3 years ago

PR to add LOG_ERR() statements

Great suggestion! I can work on that. I started with printk but doesn't help long term like logging statements.

jenmwms commented 3 years ago

A bit more on the printk study from earlier...here is the output from when the sample works. Interesting that the 2 pins (HAT Pin 37 and HAT Pin 16) which give us trouble as we move up the tree, both have an offset of 32:

Booting Zephyr OS build zephyr-v2.3.0-1094-g9e627ed6d2ed  ***
TEST GPIO DRIVER before check_perm
 pin 17 offset 0 raw_pin 17
TEST GPIO DRIVER after check_perm ok
TEST GPIO DRIVER before check_perm
 pin 14 offset 32 raw_pin 46
TEST GPIO DRIVER after check_perm ok
TEST GPIO DRIVER before check_perm
 pin 18 offset 0 raw_pin 18
TEST GPIO DRIVER after check_perm ok
TEST GPIO DRIVER before check_perm
 pin 19 offset 0 raw_pin 19
TEST GPIO DRIVER after check_perm ok
TEST GPIO DRIVER before check_perm
 pin 5 offset 32 raw_pin 37
TEST GPIO DRIVER after check_perm ok
counter: 0x0
counter: 0x1
jenmwms commented 3 years ago

Ok, since the permission check failed for the pin (HAT pin 37), diving into check_perm() which does two checks for the pin under configuration: check if Host SW owns the pin, and check if function of pad is GPIO. Here, I can narrow the failure down to the latter case. This is snippet where the check fails:

    offset = data->pad_base + (raw_pin << 3);
    val = sys_read32(regs(dev) + offset);
    if (val & PAD_CFG0_PMODE_MASK) {
        /* mode is not zero => not functioning as GPIO */
        printk("Function of pad is NOT as GPIO\n");
        return false;
    }
jenmwms commented 3 years ago

I suspect we are reading the incorrect address, but still investigating. The offset seems to have something to do with it based on testing above (both pins with 32-bit offset encounter issue.)

jenmwms commented 3 years ago

Indeed this looks to be an address issue. I narrowed it down to this change from https://github.com/zephyrproject-rtos/zephyr/commit/9f84637499a451eb94c53ecf527afaf5bfc9cada# image

pad_base is not correct after this change. Investigating root cause, may be due to DEVICE_MMIO_NAMED_MAP( ) or regs(dev). Also possible that this change helped expose a bug we did not see before.

Couple of observations are the now-inconsistent value of data->pad_base and the reg base address used to calculate pad_base (see above snippet). The driver's init function is hit 10 times, despite only 5 GPIOs selected in the sample. Not an issue, but a consistent pattern.

Adding printk to show data->pad_base and the equivalent result of the static inline regs(dev) like this:

@@ -533,6 +544,8 @@ int gpio_intel_apl_init(const struct device *dev)

        DEVICE_MMIO_NAMED_MAP(dev, reg_base, K_MEM_CACHE_NONE);
        data->pad_base = sys_read32(regs(dev) + REG_PAD_BASE_ADDR);
+       mm_reg_t namedMMIO = DEVICE_MMIO_NAMED_GET(dev, reg_base);
+       printk("gpio init              pad_base 0x%x,                    namedMMIO 0x%x\n", data->pad_base, (uint32_t) (namedMMIO));

results in this for the code after https://github.com/zephyrproject-rtos/zephyr/commit/9f84637499a451eb94c53ecf527afaf5bfc9cada# (see lines starting with gpio init pad_base below):

gpio init              pad_base 0xffffffff,                    namedMMIO 0xafff7001
gpio init              pad_base 0x500,                    namedMMIO 0xafff6000
gpio init              pad_base 0xffffffff,                    namedMMIO 0xafff4001
gpio init              pad_base 0x500,                    namedMMIO 0xafff3000
gpio init              pad_base 0xffffffff,                    namedMMIO 0xafff1002
gpio init              pad_base 0xffffffff,                    namedMMIO 0xaffef001
gpio init              pad_base 0x500,                    namedMMIO 0xaffee000
gpio init              pad_base 0xffffffff,                    namedMMIO 0xaffec002
gpio init              pad_base 0xffffffff,                    namedMMIO 0xaffea001
gpio init              pad_base 0x500,                    namedMMIO 0xaffe9000
*** Booting Zephyr OS build zephyr-v2.5.0-188-gfc0cdba03ed1  ***
before check_perm                              pin 17,      offset 0,   raw_pin 17,   namedMMIO 0xafff3000
>> check_perm read pad owner                   offset 40,   val 0,   REG_PAD_OWNER_BASE 0x20
>>>> Host SW owns the pin OK (bits)            offset 1,   val 0,   PAD_OWN_MASK 0x3
>>>> Function of pad is GPIO OK
check_perm ok
next device
before check_perm                              pin 14,      offset 32,   raw_pin 46,   namedMMIO 0xaffef001
>> check_perm read pad owner                   offset 52,   val ffffffff,   REG_PAD_OWNER_BASE 0x20
>>>> Host SW does NOT own the pin (bits)       offset 6,   val 3,   PAD_OWN_MASK 0x3
ERROR: cannot set HAT pin 37 to OUT (-22)

The pad_base 0xffffffff is suspicious, confirmed with @andyross. I'm not sure if the values should be same or unique when using the MMIO mapping based on regs(dev).

printk of data->pad_base when the sample worked was (see lines starting with gpio init pad_base below):

gpio init              pad_base 0x500,                    cfgRegBase 0x910000
gpio init              pad_base 0x500,                    cfgRegBase 0x910000
gpio init              pad_base 0x500,                    cfgRegBase 0x910000
gpio init              pad_base 0x500,                    cfgRegBase 0x910000
gpio init              pad_base 0x500,                    cfgRegBase 0x910000
gpio init              pad_base 0x500,                    cfgRegBase 0x910000
gpio init              pad_base 0x500,                    cfgRegBase 0x910000
gpio init              pad_base 0x500,                    cfgRegBase 0x910000
gpio init              pad_base 0x500,                    cfgRegBase 0x910000
gpio init              pad_base 0x500,                    cfgRegBase 0x910000
*** Booting Zephyr OS build zephyr-v2.3.0-1095-ga18d1d801c6f  ***
before check_perm                              pin 17,      offset 0,   raw_pin 17,   cfgRegBase 0x910000
>> check_perm read pad owner                   offset 40,   val 0,   REG_PAD_OWNER_BASE 0x20
>>>> Host SW owns the pin OK (bits)            offset 1,   val 0,   PAD_OWN_MASK 0x3
>>>> Function of pad is GPIO OK
check_perm ok
next device
before check_perm                              pin 14,      offset 32,   raw_pin 46,   cfgRegBase 0x910000
>> check_perm read pad owner                   offset 52,   val 0,   REG_PAD_OWNER_BASE 0x20
>>>> Host SW owns the pin OK (bits)            offset 6,   val 0,   PAD_OWN_MASK 0x3
>>>> Function of pad is GPIO OK
check_perm ok
next device
before check_perm                              pin 18,      offset 0,   raw_pin 18,   cfgRegBase 0x910000
>> check_perm read pad owner                   offset 40,   val 0,   REG_PAD_OWNER_BASE 0x20
>>>> Host SW owns the pin OK (bits)            offset 2,   val 0,   PAD_OWN_MASK 0x3
>>>> Function of pad is GPIO OK
check_perm ok
next device
before check_perm                              pin 19,      offset 0,   raw_pin 19,   cfgRegBase 0x910000
>> check_perm read pad owner                   offset 40,   val 0,   REG_PAD_OWNER_BASE 0x20
>>>> Host SW owns the pin OK (bits)            offset 3,   val 0,   PAD_OWN_MASK 0x3
>>>> Function of pad is GPIO OK
check_perm ok
next device
before check_perm                              pin 5,      offset 32,   raw_pin 37,   cfgRegBase 0x910000
>> check_perm read pad owner                   offset 48,   val 0,   REG_PAD_OWNER_BASE 0x20
>>>> Host SW owns the pin OK (bits)            offset 5,   val 0,   PAD_OWN_MASK 0x3
>>>> Function of pad is GPIO OK
check_perm ok
next device
>> check_perm read pad owner                   offset 48,   val 0,   REG_PAD_OWNER_BASE 0x20
>>>> Host SW owns the pin OK (bits)            offset 5,   val 0,   PAD_OWN_MASK 0x3
>>>> Function of pad is GPIO OK
counter: 0x0
jenmwms commented 3 years ago

I'm looking into why the data->pad_base is getting set as 0xffffffff, when its expected to be 0x500. It has something to do with the result of the static inline regs(dev), perhaps it is undefined behavior with using different types.

jenmwms commented 3 years ago
  1. Today OOC I tried the sample on up_squared_32 and it too fails after 9f84637 (MMIO-related changes). Before that bisect, it works. This appears to be another symptom.

I started looking at it because that patch modified how we get the reg_base. The value in gpio_intel_apl_config was a uint32_t reg_base, but updated to use DEVICE_MMIO_NAMED_ROM(reg_base) instead

In API docs, we see

#define DEVICE_MMIO_NAMED_ROM ( name ) struct z_device_mmio_rom name and that struct is defined in include/sys/device_mmio.h image

So, maybe there's an issue with these data types as it propagates.

  1. Also looking at the definition of the helper macros. The use of _dev is different than some other implementations which are thoroughly tested, such as uart_ns16550.c
jenmwms commented 3 years ago

This is still on my radar. Sorry for the delay.

jenmwms commented 3 years ago

Discussed this with Spoorthy offline and we are looking into possible issues with data types, as mentioned earlier about the struct member changing from a uint32_t to a struct with different size. This may have introduced some overflow or undefined behavior.

jenmwms commented 3 years ago

Hardcoding pad_base=0x500 to overcome the suspicious 0xffffffff value did not improve the issue. All gpio devices are init'd with it, but still same issue. Perhaps that is only one aspect.

Reviewing the prints in previous comments, I saw another difference in addition to pad_base: val was 0 when it worked. After the bisected patch, val is non-zero. Before/Working

>> check_perm read pad owner                   offset 52,   val 0,   REG_PAD_OWNER_BASE 0x20
>>>> Host SW owns the pin OK (bits)            offset 6,    val 0,   PAD_OWN_MASK 0x3

After/Error

>> check_perm read pad owner                   offset 52,   val ffffffff,   REG_PAD_OWNER_BASE 0x20
>>>> Host SW does NOT own the pin (bits)       offset 6,    val 3,          PAD_OWN_MASK 0x3
jenmwms commented 3 years ago

I'm digging around this part for regs(dev) (snippet below) because I found that even if I brute force through the check_perms by hardcoding pad_base to expected 0x500 and all the val to 0, this trips in the next stage of configuring interrupts.

reg = regs(dev) + data->pad_base + (raw_pin * 8U);

where

static inline mm_reg_t regs(struct device *dev)
{
    return DEVICE_MMIO_NAMED_GET(dev, reg_base);
}
jenmwms commented 3 years ago

Noticed a couple things when I was looking at this alongside the tests for DEVICE_MMIO_NAMED_* macros (tests/kernel/device/src/mmio.c). Using those as example implementations, I'm considering:

jenmwms commented 3 years ago

Worked backward on the mapping, leading to z_phys_map(). Building with CONFIG_ASSERT=y does throw any asserts for this sample (was making sure the z_phys_map() did not throw asserts). The *virt_ptr (which is reg_base in this case) appears to be valid address, and we observe it does function as expected for some of the GPIOs defined in this sample. Subtle bug that might be in z_phys_map(), so also looking into this more with Flavio.

jenmwms commented 3 years ago

This issue still present after move to unified gpio driver for Intel SoC https://github.com/zephyrproject-rtos/zephyr/pull/34719