Deduplication, support select_streams on single pass, support headers-only

cboulay commented 3 years ago

This PR deduplicates code, allows use of select_streams without double traversal of file (#60) and adds the ability to extract headers only.

It was a bit tricky to do while holding onto a couple top-level functions. I hope I did it to everyone's satisfaction, though it's maybe a little messier than it could have been otherwise.

I think the logger is probably a little too verbose by default, especially when skipping over non-selected streams or grabbing only headers. But I'd like some feedback before I clean that up.

I also think some of the top level functions (even the _ prefixed ones) can be re-ordered more logically, but I wanted to limit the already-large diff.

There are quite a few changes here so I'm tagging lots of reviewers. Please test on your data files. If we're happy with my previous PR then I'll merge that in here so we can test on data files with problematic clock synchronization.

cbrnr commented 3 years ago

Am I right in assuming that resolve_streams only needs stream headers? Indeed the functions it calls (parse_xdf, parse_chunks, _read_chunks) only operate on stream headers so the answer seems like an obvious "yes", but those functions' doc strings don't state this explicitly so I was just wondering if you were planning on building these out in the future?

@cboulay yes, to implement select_streams I was lazy and used parts of existing code that I'd written in an effort to write an XDF reader from scratch. Therefore, the functions you mention only need stream headers, but they are also intended to be used for reading data. At that time I thought it was more important to ship the select streams feature than keeping the code base free from duplicated code.

In my deduplication effort, I'm reusing your _read_chunks method, but this means adding a lot of other functionality to it, only to disable that functionality with stream_headers_only=True when being called via parse_xdf (via resolve_streams). Is this compatible with your use-case?

I need parse_chunks and parse_xdf - so yes, these changes should be compatible. In any case, I will test if everything still works with these changes, just let me know when this is ready for review.

cboulay commented 3 years ago

@cbrnr , ah, I was operating under the assumption that you only needed headers, so I renamed parse_chunks to parse_stream_header_chunks to be explicit. Sorry about that. We can rename it back.

But, we should discuss a little about whether or not that function is required at all. After the changes in this PR, there remains a difference in stream header formatting between load_xdf and your functions. Namely, load_xdf's stream header dict has all of its values wrapped in a list, even the simple strings (e.g. host) and single numbers (e.g. srate). This is because of the use of defaultdict in _xml2dict. I'm not a big fan of this, but changing this now would break any downstream applications that use pyxdf, so it's there to stay. Do you absolutely need to have your stream headers "flatted" or "delistified"?

On that same train of thought, if you do build out parse_chunks further, and eventually have it handle data and build concatenated arrays, would you be satisfied if the format of the completed streams was the same as that currently returned by load_xdf? If yes, then maybe we can move the stream-building code out of load_xdf and into parse_chunks, then parse_chunks can be used by in load_xdf, with an added kwarg to not flatten the stream header dict values.

Please take a look at what I have in place now and let me know how you would like to proceed.

I think we should leave fixing up the 'verbose' usage until a later issue.

cbrnr commented 3 years ago

@cboulay I will need to dig a little bit into my code to see what I really need. This will take some time, and since I'm technically on vacation right now it would be great if you could give me two weeks or so to check on this. For now, I think that streamlining the code base should take precedence over a specific use case I might have. After all, I can always pull out the functions I need into my other projects. But if possible, let's discuss this next year 😄 (i.e. in ~2 weeks).

cboulay commented 3 years ago

Of course! We need the headers-only feature sooner-than-later but we're happy go use the branch for now. Happy New Year :partying_face:

agricolab commented 3 years ago

Generally speaking, i am totally ok in rewriting my code to accomodate this PR and think it is worthwhile.

I just have a few questions. One of the tests for my liesl toolbox failed, for a functtionality where I am using parse_chunks to inspect xdf files from the cli. That is easy to accomodate with a try catch and import parse_stream_header_chunks instead. But then it complains that there is also no source_id anymore in the output of the new parse_stream_header_chunks? I can dig a little bit deeper into the code base, but if you can tell me quickly what would be the process to recover the output of the old parse_chunks with the new functions, that'd save me some time and brain juice. Thanks!

cboulay commented 3 years ago

@agricolab - Based on what Clemens said, I think I should rename parse_stream_header_chunks back to parse_chunks, so you don't need to worry about that.

Let me explain the organization as I see it before we decide what to do.

Path 1 - using load_xdf:

open_xdf unzips if necessary, reads and checks the magic code
_read_chunks generator function yields a dict per chunk. i. Error checking ii. Basic processing to put the data into the dict (e.g., _xml2dict, _normalize_header, _read_chunk3, using StreamData to track timestamps). iii. kwargs to control select_streams, stream_headers_only
- select_streams uses match_streaminfos with only a single stream.
Yielded chunks are further processed in load_xdf to build up temp_stream_data.
Optional synchronize clocks
Optional dejitter

Path 2 - using resolve_streams

parse_xdf
open_xdf as above
_read_chunks as above
parse_chunks

Note that we should not change the format returned by Path1 as this would break 3rd party applications we don't control.

Remaining things to change:

Move the "extra" processing that happens in load_xdf to a function named parse_chunk
Use parse_chunk inside both load_xdf and parse_chunks
As much as we can, change the data formats in Path 2 so it's similar to that returned by Path 1.
- Where they cannot be changed, modify in Path 2.
- skip_desc can probably be removed because I don't think Clemens actually needs to exclude desc from the info.
Somewhere along the way I guess I lost 'source_id', or at least it's not in the same location as previously. Which is correct?

After the above changes, the main difference between Path 1 and Path 2 is that Path 2 loads all the chunks before parsing them, whereas Path 1 parses them as they come in and aggregates them after they've all been parsed. Path 1 also has optional timestamp fixes.

With select_streams and stream_headers_only, there's really not much of a performance difference between the two paths.

tstenner commented 3 years ago

I haven't tested the PR yet, but from what I've read I'm all for it.

I have only two things to add:

_read_chunks should yield classes (i.e. an Iterable[Chunk] with Chunk as below) so it's possible to add type info for IDEs and Cython. These could be shared with a not yet existing python xdf writer:

class Chunk:
    tag: int
    header_len: int
    payload_len: int

_normalize_header discards all fields not listed and is quite verbose. The default fields that don't need type conversion could be added with a list comprehension, i.e.

fields = ['name', 'type']
res = { key: header.get(key,[None]) for key in fields}
res['channel_count'] = [int(header['channel_count'][0]]

cboulay commented 3 years ago

@cbrnr Have you had a chance to look at this? Can you please take a look and make sure that your use-cases work or suggest how to make it work?

cbrnr commented 3 years ago

@cboulay sorry, I completely forgot. I will take a look today.

cbrnr commented 3 years ago

@cboulay can you maybe rename the function back to parse_chunks? Otherwise, I immediately run into an error because I need this function. After that, testing will be a bit easier for me.

cboulay commented 3 years ago

I’m unavailable today. Please go ahead and make that change on your end and I’ll resolve later.

cbrnr commented 3 years ago

Alright, all of my files still work with this PR. In addition to comments already mentioned, I suggest that we make open_xdf a private function (i.e. rename it to _open_xdf). Furthermore, the public API is currently a bit confusing with the following functions (I'm already excluding open_xdf):

load_xdf: the main function which loads an XDF file
match_streaminfos: find stream IDs matching specified criteria
resolve_streams: resolve streams in given XDF file, i.e. return information on all streams
parse_xdf
parse_chunks

I'm using parse_chunks(parse_xdf(fname)) only to extract stream infos, so we could maybe simply expose one function that returns the composition of the two. Not sure if anyone else is using one or both of these functions separately though. I'm also wondering if this could be completely replaced by resolve_streams - I'll have to take a look.

@cboulay could you compile a new to do list summarizing all the points mentioned in this thread? It doesn't mean that you have to implement all of the suggestions, but I'd like to discuss all points with everyone.

cbrnr commented 3 years ago

BTW, test_load_from_memory fails with a KeyError because testfile does not have an "info" field. This test works on master.

xdf-modules / pyxdf

Deduplication, support select_streams on single pass, support headers-only #74