New non-volatile storage system

rghaddab commented 1 week ago

Introduction

In recent years, advances to process nodes in embedded hardware have made it necessary to support non-volatile technologies different from the classical on-chip NOR flash, which is written in words but erased in pages. These new technologies do not require a separate erase operation at all, and data can be overwritten directly at any time. On top of that, complexity of firware has not stopped growing, making it necessary to ensure that a solid, scalable storage mechanism is available for all applications. This storage needs to support millions of entries with solid CRC protection and multiple advanced features.

Problem description

In Zephyr, there are currently a few alternatives for non-volatile memory storage:

NVS: Basic ID-based storage, but optimized for devices with page erase
LittleFS: Full file system, optimized for devices with page erase
FCB: Very little use, extremely bare-bones storage

None of them are optimal for the current new wave of solid-state non-volatile memory technologies, including resistive (RRAM) and magnetic (MRAM) random-access, non-volatile memory, because they rely on the "page erase" abstraction whereas these devices do not require an erase operation at all, and data can be overwritten directly. Additionally, none of the storage systems above is a good match for the widely used settings subsystem, given that they were never designed to operate as a backend for it.

The closest one is NVS, and an analysis of why it is not suitable can be found in the Alternatives section of this issue.

Proposed change

Create a new storage mechanism that fulfills the following requirements:

Simple Key-Value Storage (i.e. no file/folder abstractions)
32-bit IDs
Support for entries in multiple formats
CRC-24 for entries that require it
No limits in value length
Metadata entries can also store small (1 to 4 bytes) data entries
Optimized for bigger (e.g. 16-byte) write block sizes
Support for no-erase-required flash drivers (i.e. RRAM, MRAM, etc)
Designed from the start to be efficient when used as a backend for the settings susbystem
Designed from the start to be able to serve as a backend of the Secure Storage subsystem (link)

Potential names

ZMS: Zephyr Memory Storage
NVMS: Non-Volatile Memory Storage
IDVS: ID Value Storage
ZKVS: Zephyr Key-Value Storage

Detailed RFC

Proposed change (Detailed)

General behavior:

ZMS divides the memory space into sectors (minimum 2), and each sector is filled with key/value pair until it is full , we close it then the storage system will move forward to the next sector until it reaches the end and then it starts again from the first sector after garbage collecting it and erasing its content.

Mounting the FS:

Mounting the filesystem will start by getting the flash parameters, checking that the file system properties are correct (sector_size, sector_count ...) Then initializes the file system.

Initialization of ZMS:

As the ZMS has a fast-forward write mechanism, we must find the last sector and the last pointer of the entry where it stopped the last time. It must look for a closed sector followed by an open one, then within the open sector, it finds (recover) the last written ATE (Allocation Table Entry). After that it checks that the sector after this one is empty, or it will erase it.

Composition of a sector.

A sector is organized in this form :	Sector N
data0
data1
...
...
ate1
ate0
gc_done
empty_ate
close_ate

Close ATE is used to close a sector if a sector is full Empty ATE is used to erase a sector ATEn are entries that describe where the data is stored, its size and its crc32 Data is the written value

ZMS Key/value write :

To avoid rewriting the same data with the same ID again, it must look in all the sectors if the same ID exist then compares its data, if the data is identical no write is performed. If we must perform a write, then an ATE and Data (if not a delete) are written in a sector If the sector is full (cannot hold the current data + ATE) we have to move to the next sector, garbage collect the sector after the newly opened one then erase it. Data size that is smaller or equal to 4 bytes are written within the ATE

ZMS read (with history):

By default it looks for the last data with the same ID and retrieves its data. If history count is provided that is different than 0, older data with same ID is retrieved.

ZMS how does the cycle counter works ?

Each sector has a lead cycle counter which is a uin8_t that is used to validate all the other ATEs. The lead cycle counter is stored in the empty ATE. To become valid, an ATE must have the same cycle counter as the one stored in the empty ATE. Each time an ATE is moved from a sector to another it must get the cycle counter of the destination sector. To erase a sector, the cycle counter of the empty ATE is incremented. All the ATEs in that sector become invalid

ZMS how to close a sector ?

To close a sector a close ATE is added at the end of the sector and it must have the same cycle counter as the empty ATE When closing a sector, all the remaining space that has not been used is filled with garbage data to avoid having old ATEs with a valid cycle counter.

ZMS structure of ATE (Allocation Table Entries)

An entry has 16 bytes divided between these variables :

struct zms_ate {
    uint32_t id;     /* data id */
    uint32_t offset; /* data offset within sector */
    uint16_t len;    /* data len within sector */
    union {
        uint32_t crc_data; /* crc for data */
        uint32_t data;     /* used to store small size data */
    };
    uint8_t cycle_cnt; /* cycle counter for non erasable devices */
    uint8_t crc8;      /* crc8 check of the entry */
} __packed;

id has 32 bits
sector size is now 32 bits, that's why offset is also 32 bits => That allows to define large sectors of needed.
length of data is 16 bits (could be changed to 32 bits in the future) which can store data up to 64K
crc_data/data is a field that can store small size data (<= 4 bytes) or the crc32 for bigger data
cycle_cnt is used to validate an ATE within a sector
crc8 is the crc of the ATE (could be changed to crc24 in the future)

ZMS wear leveling feature

This storage system is optimized for devices that do not require an erase. Using storage systems that rely on an erase-value (NVS as an example) will need to emulate the erase with write operations. This will cause a significant decrease in the life expectancy of these devices and will cause more delays for write operations and for initialization. ZMS introduces a cycle count mechanism that avoids emulating erase operation for these devices. It also guarantees that every memory location is written only once for each cycle of sector write.

Dependencies

Only on flash drivers.

Concerns and Unresolved Questions

The first draft of this new storage system will not include all the features listed in the proposed change section. This is intended to minimize the effort of reviewing this new storage system for developers that are familiar to NVS filesystem. More changes will come in future patch lists

Alternatives

The one alternative we have considered would be to expand the existing NVS codebase in order to remove its described shortcomings. This is in fact how this new proposal was born, once expanding NVS was identified as suboptimal.

Among other issues, we identified the following:

NVS was never designed for devices that do not have an erase operation available
NVS limits the max value length to be the size of a sector (32KB)
NVS was designed to be simple and compact, so extending it is not necessarily a good option
NVS performs poorly when the storage mechanism gets close to being full
Slow Garbage Collector that can go through all sectors for a single write operation
Switching to the next sector is a time consuming operation
NVS was not designed to be used as a backend for the settings subsystem, causing latency (up to minutes) and other issues

More info in these Pull Requests:

butok commented 1 week ago

Zephyr platforms have a maximum write size of up to 512 bytes. Will ZMS support it?

rghaddab commented 1 week ago

Zephyr platforms have a maximum write size of up to 512 bytes. Will ZMS support it?

@butok I saw this RFC https://github.com/zephyrproject-rtos/zephyr/issues/77576 Although this storage system is still (could change in the future) not optimized for larger block write size, we could add a hidden config for that with a warning for users that want to increase the default maximum write block size.

carlescufi commented 1 week ago

Architecture WG:

Presentation by @rghaddab of the new system, including some basic technical details
@bjarki-andreasen asks whether the API is similar to NVS, Riadh answers that it is
A question on whether this new system should coexist with NVS is raised, @henrikbrixandersen and @carlescufi suggest that there's no reason why the two systems could coexist. There's a large base of NVS installations in the wild, and the two systems differ in scope and maturity.

andrisk-dev commented 1 week ago

I was thinking about one thing when learning about how NVS works - is separating data ATE From actual data worth it?

We can make the data more dense by storing ATE - data pairs from the start of the sector.	Sector start
ATE 1
DATA 1
ATE 2
DATA 2a
DATA 2b
ATE 3
DATA 3
.....
gc_done
emty_ate
close_ate

The advantages would be that the ATE and DATA could be placed right next to each other - so we waste less space in case of larger write block size. On the other end the disadvantage is that we would need to do some address calculation to find every other data ATE except of the first one. But I think it would not be a reason for a noticeable slowdown - just calculate ATE start address + ATE size + data length and align it to the next start of the write block.

de-nordic commented 1 week ago

I was thinking about one thing even when learning about how NVS works - is separating data ATE From actual data worth it?

Yes it is. It is easier to recover if something happens, otherwise you may just write something that looks like ATE to data and glitch device into attempting to read storage as your data mandates or loop. Also if you write in the loop you may basically loop the ATE/DATA storage without a way to figure out where it really ends. It is much easier to keep things working if you keep users out of area where metadata of your storage is stored.

Same happens with any block-device oriented FS, where metadata is separated from data streams.

Of course there is also a way to do that, for example introducing different alphabets for metadata and data, but this means that you end up in some 8 to N encodings (N > 8), and have to make sure that user data will not get encoded to look like metadata.

andrisk-dev commented 1 week ago

I was thinking about one thing even when learning about how NVS works - is separating data ATE From actual data worth it?

Yes it is. It is easier to recover if something happens, otherwise you may just write something that looks like ATE to data and glitch device into attempting to read storage as your data mandates or loop. Also if you write in the loop you may basically loop the ATE/DATA storage without a way to figure out where it really ends. It is much easier to keep things working if you keep users out of area where metadata of your storage is stored.

Same happens with any block-device oriented FS, where metadata is separated from data streams.

Of course there is also a way to do that, for example introducing different alphabets for metadata and data, but this means that you end up in some 8 to N encodings (N > 8), and have to make sure that user data will not get encoded to look like metadata.

OK I understand the reason now.

For the purpose of saving space I really like the small data inside an ATE feature.

For data that is a little larger than 4 bytes - would it be acceptable to write that right after ATE if the block is large enough? So maybe there could be another rule that if there is enough space in a block right after the ATE for the data than it would be stored there.

Technically this is also mixing ATE and data but we would search for ATEs only on the start of the blocks anyway.

What do you think about such feature?

rghaddab commented 1 week ago

For data that is a little larger than 4 bytes - would it be acceptable to write that right after ATE if the block is large enough?

This could be done once the multiple format entries feature is added. Which means that you can have a different format which is larger and that holds N bytes of data

de-nordic commented 1 week ago

For the purpose of saving space I really like the small data inside an ATE feature.

For data that is a little larger than 4 bytes - would it be acceptable to write that right after ATE if the block is large enough? So maybe there could be another rule that if there is enough space in a block right after the ATE for the data than it would be stored there.

Technically this is also mixing ATE and data but we would search for ATEs only on the start of the blocks anyway.

What do you think about such feature?

The original design of NVS and the ZMS here intends to work with devices with relatively small write block sizes, wbs, that can be appended without altering other data (unless area is overwritten); this allows placing metadata in small chunks of constant size and data at variable size, with no mandated boundaries (except write block size) between data.

Because ATE have same sizes, if wbs becomes large, it should be possible to start placing some data in it, for example if you have 32byte long wbs and 16 bytes of ATE, then any data of size <= 16 bytes can go into ATE, and it would not be a problem as wbs and size of ATE set boundaries, which means that the ATE and data in ATE wbs are still separated.

Eventually you may have to erase some part of storage, but that happens because device requires it, for example flash, before it can be written. Using a magnetic tape analogy: erase head has to erase data before r/w head can write in are previously used.

I understand that what you are trying to solve in your case @andrisk-dev , is a problem of relatively big write block size of your device that equals to erase block size - so you basically have a block device. You can see a difference here, where you can not really append data, directly on storage, you have to basically replace entire block contents, unless you are willing to append data at wbs of sector size.

In your case, the scheme you have presented in comment https://github.com/zephyrproject-rtos/zephyr/issues/77929#issuecomment-2326925453 could work, if you decide to divide your sector into ATE and data part assuming that you always write both as a single sector, for every sector, even if it carries continuation of data from previous sector, has that ATE part reserved and not available for users. Still, you will probably have some unused space wasted. Amiga on OFS has been doing that to datablocks, where each sector had reserved 24 bytes for OFS header, which means that user data could only take 488 bytes out of 512 byte sector (https://en.wikipedia.org/wiki/Amiga_Old_File_System, https://wiki.osdev.org/FFS_(Amiga))

What I understand is you are trying to provide your users with small reliable storage for basic data or settings, but I do not think that this PR will effectively solve your problem, at least not without significant complexity being introduced, as it is basically based on ability to freely append data at small granularity of xRAM and small wbs Flash devices, something your device does not provide. We can try to bend it your way, but I would rather focus first on making it solid solution for the devices it has been originally designed for.

andrisk-dev commented 1 week ago

Thanks for your replies @rghaddab @de-nordic ,

I understand that the first version is to be as simple as possible. I think one solution that would enable us at NXP to make most of the 512 bytes write block size is to have ATE in different format, maybe we can call it long ATE here, which could store information about multiple data records in one place. The format would include information about a number of data items stored in that ATE and a list of metadata about all of them would follow. That way, even if the data stored individually would be still sparse in flash, when reallocating the data from erased sector to a new one we could pack the data much more densely.

As this is more of a future release thing, I think the main question for now is how would the filesystem distinguish between normal entry fromat and an entry in different format. I think that should be decided now to make sure the "Support for entries in multiple formats" is possible in the future.

rghaddab commented 1 week ago

As this is more of a future release thing, I think the main question for now is how would the filesystem distinguish between normal entry fromat and an entry in different format.

This change is planned as following : The first byte of an ATE will be a format-type field that defines what is the ATE format that should be considered. For example : 0 => default format, 1 => format for big data 2=> ... All the write/read/ATE validation functions will have different behavior depending on the format that is read from the first byte. This is of course should be done in the initialization phase and we must verify that the ATEs are valid if we choose a custom format. At a certain point there will be different files containing each the corresponding function for each format. The main file will only have pointer to these functions depending on the format

dleach02 commented 5 days ago

Zephyr platforms have a maximum write size of up to 512 bytes. Will ZMS support it?

@butok I saw this RFC #77576 Although this storage system is still (could change in the future) not optimized for larger block write size, we could add a hidden config for that with a warning for users that want to increase the default maximum write block size.

@rghaddab, This needs to be a requirement on ZMS to not artificially limit the size. Optimize later if needed. Add warnings to make sure the users are aware of the impacts.

zephyrproject-rtos / zephyr