cboulay opened 3 years ago
Am I right in assuming that `resolve_streams` only needs stream headers? Indeed, the functions it calls (`parse_xdf`, `parse_chunks`, `_read_chunks`) only operate on stream headers, so the answer seems like an obvious "yes", but those functions' docstrings don't state this explicitly, so I was just wondering if you were planning on building these out in the future.
@cboulay yes, to implement `select_streams`.
I was lazy and used parts of existing code that I'd written in an effort to write an XDF reader from scratch. Therefore, the functions you mention only need stream headers, but they are also intended to be used for reading data. At that time I thought it was more important to ship the `select_streams` feature than to keep the code base free of duplicated code.
In my deduplication effort, I'm reusing your `_read_chunks` method, but this means adding a lot of other functionality to it, only to disable that functionality with `stream_headers_only=True` when being called via `parse_xdf` (via `resolve_streams`). Is this compatible with your use case?
I need `parse_chunks` and `parse_xdf` - so yes, these changes should be compatible. In any case, I will test whether everything still works with these changes; just let me know when this is ready for review.
@cbrnr, ah, I was operating under the assumption that you only needed headers, so I renamed `parse_chunks` to `parse_stream_header_chunks` to be explicit. Sorry about that. We can rename it back.
But we should discuss a little whether that function is required at all. After the changes in this PR, there remains a difference in stream header formatting between `load_xdf` and your functions. Namely, `load_xdf`'s stream header dict has all of its values wrapped in a list, even the simple strings (e.g., host) and single numbers (e.g., srate). This is because of the use of `defaultdict` in `_xml2dict`. I'm not a big fan of this, but changing it now would break any downstream applications that use pyxdf, so it's here to stay.
Do you absolutely need to have your stream headers "flattened" or "delistified"?
On that same train of thought, if you do build out `parse_chunks` further, and eventually have it handle data and build concatenated arrays, would you be satisfied if the format of the completed streams were the same as that currently returned by `load_xdf`? If yes, then maybe we can move the stream-building code out of `load_xdf` and into `parse_chunks`; `parse_chunks` can then be used by `load_xdf`, with an added kwarg to not flatten the stream header dict values.
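For concreteness, "delistifying" such a header could look like the following sketch. This `delistify` helper and the sample dict are made up for illustration; pyxdf does not provide this function:

```python
def delistify(header):
    """Recursively unwrap single-element lists in a header dict.

    Hypothetical helper, shown only to illustrate the "flattened"
    header format under discussion; not part of pyxdf.
    """
    out = {}
    for key, value in header.items():
        if isinstance(value, list) and len(value) == 1:
            value = value[0]
        if isinstance(value, dict):
            value = delistify(value)
        out[key] = value
    return out


wrapped = {"name": ["EEG"], "nominal_srate": ["250"], "host": ["lab-pc"]}
print(delistify(wrapped))  # {'name': 'EEG', 'nominal_srate': '250', 'host': 'lab-pc'}
```

The recursion also unwraps nested `desc` elements, which `_xml2dict` wraps the same way.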
Please take a look at what I have in place now and let me know how you would like to proceed.
I think we should leave fixing up the `verbose` usage for a later issue.
@cboulay I will need to dig a little bit into my code to see what I really need. This will take some time, and since I'm technically on vacation right now it would be great if you could give me two weeks or so to check on this. For now, I think that streamlining the code base should take precedence over a specific use case I might have. After all, I can always pull out the functions I need into my other projects. But if possible, let's discuss this next year 😄 (i.e. in ~2 weeks).
Of course! We need the headers-only feature sooner rather than later, but we're happy to use the branch for now. Happy New Year :partying_face:
Generally speaking, I am totally OK with rewriting my code to accommodate this PR and think it is worthwhile.
I just have a few questions. One of the tests for my liesl toolbox failed, for a functionality where I am using `parse_chunks` to inspect XDF files from the CLI. That is easy to accommodate with a try/except that imports `parse_stream_header_chunks` instead. But then it complains that there is also no `source_id` anymore in the output of the new `parse_stream_header_chunks`? I can dig a little deeper into the code base, but if you can tell me quickly what the process would be to recover the output of the old `parse_chunks` with the new functions, that'd save me some time and brain juice. Thanks!
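For what it's worth, the rename part could be papered over with a fallback lookup. A minimal sketch of the pattern, demonstrated here on a stand-in namespace rather than the real pyxdf module:

```python
from types import SimpleNamespace

# Stand-in for the pyxdf module so the pattern is self-contained;
# in liesl this would be `import pyxdf` instead.
pyxdf = SimpleNamespace(parse_stream_header_chunks=lambda chunks: chunks)

try:
    parse_chunks = pyxdf.parse_chunks  # old name, if it still exists
except AttributeError:
    # fall back to the renamed function
    parse_chunks = pyxdf.parse_stream_header_chunks

print(parse_chunks([{"stream_id": 1}]))  # [{'stream_id': 1}]
```

This only handles the name change, of course, not the missing `source_id` in the output.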
@agricolab - Based on what Clemens said, I think I should rename `parse_stream_header_chunks` back to `parse_chunks`, so you don't need to worry about that.
Let me explain the organization as I see it before we decide what to do.
Path 1 - using `load_xdf`:

1. `open_xdf` unzips if necessary, reads and checks the magic code.
2. `_read_chunks` generator function yields a dict per chunk.
   i. Error checking
   ii. Basic processing to put the data into the dict (e.g., `_xml2dict`, `_normalize_header`, `_read_chunk3`, using StreamData to track timestamps).
   iii. kwargs to control `select_streams`, `stream_headers_only`
3. `select_streams` uses `match_streaminfos` with only a single stream.
4. `load_xdf` to build up `temp_stream_data`.

Path 2 - using `resolve_streams`:

1. `parse_xdf`
   i. `open_xdf` as above
   ii. `_read_chunks` as above
2. `parse_chunks`
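To make the difference concrete, here is a toy model of the two paths. All names and chunk contents are hypothetical stand-ins, not pyxdf's actual implementation:

```python
def read_chunks(chunks, stream_headers_only=False):
    """Toy stand-in for _read_chunks: yield one dict per chunk,
    optionally skipping everything except stream headers."""
    for chunk in chunks:
        if stream_headers_only and chunk["tag"] != 2:  # tag 2 = StreamHeader in XDF
            continue
        yield chunk


raw = [
    {"tag": 2, "stream_id": 1, "name": "EEG"},      # stream header
    {"tag": 3, "stream_id": 1, "n_samples": 100},   # samples
    {"tag": 2, "stream_id": 2, "name": "Markers"},  # stream header
]

# Path 1 style (load_xdf): aggregate chunks per stream as they arrive
streams = {}
for c in read_chunks(raw):
    streams.setdefault(c["stream_id"], []).append(c)

# Path 2 style (resolve_streams): collect header chunks first, parse afterwards
headers = list(read_chunks(raw, stream_headers_only=True))

print(len(streams), len(headers))  # 2 2
```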
Note that we should not change the format returned by Path 1, as this would break 3rd-party applications we don't control.
Remaining things to change:
- Move the per-chunk parsing in `load_xdf` to a function named `parse_chunk`.
- Use `parse_chunk` inside both `load_xdf` and `parse_chunks`.
- `skip_desc` can probably be removed because I don't think Clemens actually needs to exclude `desc` from the info.

After the above changes, the main difference between Path 1 and Path 2 is that Path 2 loads all the chunks before parsing them, whereas Path 1 parses them as they come in and aggregates them after they've all been parsed. Path 1 also has optional timestamp fixes.
With `select_streams` and `stream_headers_only`, there's really not much of a performance difference between the two paths.
I haven't tested the PR yet, but from what I've read I'm all for it.
I have only two things to add:
1. `_read_chunks` should yield classes (i.e., an `Iterable[Chunk]` with `Chunk` as below) so it's possible to add type info for IDEs and Cython. These could be shared with a not-yet-existing Python XDF writer:

```python
class Chunk:
    tag: int
    header_len: int
    payload_len: int
```

2. `_normalize_header` discards all fields not listed and is quite verbose. The default fields that don't need type conversion could be added with a dict comprehension, i.e.:

```python
fields = ['name', 'type']
res = {key: header.get(key, [None]) for key in fields}
res['channel_count'] = [int(header['channel_count'][0])]
```
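Filling out that suggestion into something runnable; the field lists and the typed `Chunk` here are illustrative only, as pyxdf's real `_normalize_header` handles more fields:

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    """Typed chunk record that _read_chunks could yield instead of a bare
    dict; fields follow the suggestion above."""
    tag: int
    header_len: int
    payload_len: int


def normalize_header(header):
    """Compact take on _normalize_header: copy string fields with a dict
    comprehension, then convert the numeric fields. The field lists are
    illustrative, not pyxdf's full set."""
    fields = ["name", "type", "source_id"]
    res = {key: header.get(key, [None]) for key in fields}
    res["channel_count"] = [int(header["channel_count"][0])]
    res["nominal_srate"] = [float(header["nominal_srate"][0])]
    return res


hdr = {"name": ["EEG"], "channel_count": ["8"], "nominal_srate": ["250"]}
print(normalize_header(hdr))
```

Note that the values stay wrapped in lists, matching the `load_xdf` header format discussed above.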
@cbrnr Have you had a chance to look at this? Can you please take a look and make sure that your use cases work, or suggest how to make them work?
@cboulay sorry, I completely forgot. I will take a look today.
@cboulay can you maybe rename the function back to `parse_chunks`? Otherwise, I immediately run into an error because I need this function. After that, testing will be a bit easier for me.
I’m unavailable today. Please go ahead and make that change on your end and I’ll resolve later.
Alright, all of my files still work with this PR. In addition to the comments already mentioned, I suggest that we make `open_xdf` a private function (i.e., rename it to `_open_xdf`). Furthermore, the public API is currently a bit confusing with the following functions (I'm already excluding `open_xdf`):

- `load_xdf`: the main function which loads an XDF file
- `match_streaminfos`: find stream IDs matching specified criteria
- `resolve_streams`: resolve streams in a given XDF file, i.e., return information on all streams
- `parse_xdf`
- `parse_chunks`

I'm using `parse_chunks(parse_xdf(fname))` only to extract stream infos, so we could maybe simply expose one function that returns the composition of the two. Not sure if anyone else is using one or both of these functions separately though. I'm also wondering if this could be completely replaced by `resolve_streams` - I'll have to take a look.
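That composition idea might look like the following; `parse_xdf` and `parse_chunks` here are toy stand-ins just to illustrate the single-function API, and `stream_infos` is a hypothetical name (the real `resolve_streams` may already cover this):

```python
def parse_xdf(fname):
    """Toy stand-in: pretend to read all chunks from an XDF file."""
    return [
        {"tag": 2, "stream_id": 1, "name": ["EEG"]},  # stream header
        {"tag": 3, "stream_id": 1},                   # samples
    ]


def parse_chunks(chunks):
    """Toy stand-in: keep only stream-header chunks (tag 2)."""
    return [c for c in chunks if c["tag"] == 2]


def stream_infos(fname):
    """Hypothetical single public function: parse_chunks(parse_xdf(fname))."""
    return parse_chunks(parse_xdf(fname))


print(stream_infos("recording.xdf"))  # [{'tag': 2, 'stream_id': 1, 'name': ['EEG']}]
```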
@cboulay could you compile a new to do list summarizing all the points mentioned in this thread? It doesn't mean that you have to implement all of the suggestions, but I'd like to discuss all points with everyone.
BTW, `test_load_from_memory` fails with a `KeyError` because `testfile` does not have an `"info"` field. This test works on master.
This PR deduplicates code, allows use of `select_streams` without double traversal of the file (#60), and adds the ability to extract headers only.
It was a bit tricky to do while holding onto a couple top-level functions. I hope I did it to everyone's satisfaction, though it's maybe a little messier than it could have been otherwise.
I think the logger is probably a little too verbose by default, especially when skipping over non-selected streams or grabbing only headers. But I'd like some feedback before I clean that up.
I also think some of the top level functions (even the _ prefixed ones) can be re-ordered more logically, but I wanted to limit the already-large diff.
There are quite a few changes here so I'm tagging lots of reviewers. Please test on your data files. If we're happy with my previous PR then I'll merge that in here so we can test on data files with problematic clock synchronization.