xraypy / xraylarch

Larch: Applications and Python Library for Data Analysis of X-ray Absorption Spectroscopy (XAS, XANES, XAFS, EXAFS), X-ray Fluorescence (XRF) Spectroscopy and Imaging, and more.
https://xraypy.github.io/xraylarch

Support for SOLEIL NeXus file format #409

Closed kaarelmand closed 3 months ago

kaarelmand commented 1 year ago

The file format used in the two beamlines I've been to at SOLEIL is NeXus, which is HDF5 under the hood. The files can, thus, be opened using larch.io.h5group on the command line, but XAS Viewer still refuses to open it with the following traceback:

[larch.io.specfile_reader.DataSourceSpecH5] ERROR : 'measurement' not found -> use 'set_scan' method first
[larch.io.specfile_reader.DataSourceSpecH5] ERROR : object of type 'NoneType' has no len()
[larch.io.specfile_reader.DataSourceSpecH5] ERROR : 'measurement' not found -> use 'set_scan' method first
[larch.io.specfile_reader.DataSourceSpecH5] ERROR : object of type 'NoneType' has no len()
[larch.io.specfile_reader.DataSourceSpecH5] ERROR : 'measurement' not found -> use 'set_scan' method first
[larch.io.specfile_reader.DataSourceSpecH5] ERROR : object of type 'NoneType' has no len()
[larch.io.specfile_reader.DataSourceSpecH5] ERROR : 'measurement' not found -> use 'set_scan' method first
[larch.io.specfile_reader.DataSourceSpecH5] ERROR : 'instrument/positioners' not found -> use 'set_scan' method first
[larch.io.specfile_reader.DataSourceSpecH5] ERROR : 'measurement' not found -> use 'set_scan' method first
Traceback (most recent call last):
  File "/home/kaarel/Develop/xraylarch/larch/wxxas/xasgui.py", line 1152, in onReadDialog
    self.onRead(path)
  File "/home/kaarel/Develop/xraylarch/larch/wxxas/xasgui.py", line 1167, in onRead
    self.show_subframe('spec_import', SpecfileImporter,
  File "/home/kaarel/Develop/xraylarch/larch/wxxas/xasgui.py", line 1105, in show_subframe
    self.subframes[name] = frameclass(self, **opts)
  File "/home/kaarel/Develop/xraylarch/larch/wxlib/specfile_importer.py", line 341, in __init__
    self.curscan = self.specfile.get_scan(curscan)
  File "/home/kaarel/Develop/xraylarch/larch/io/specfile_reader.py", line 703, in get_scan
    motor_names = self.get_scan_motors()
  File "/home/kaarel/Develop/xraylarch/larch/io/specfile_reader.py", line 514, in get_scan_motors
    return [i for i in counters if i in all_motors]
TypeError: 'NoneType' object is not iterable

The .nxs file in question is from the PUMA beamline at SOLEIL, and is provided as an attachment.

Aside from allowing SOLEIL .nxs files to be loaded in XAS Viewer, it would be nice to provide a more convenient access function on the command line interface. For example, to get fluorescence data out of the attached file, you would need to go rather deep into the hierarchy of the created Group:

from larch import Interpreter
from larch.io import h5group

_larch = Interpreter()
h5 = h5group("ALARA401_JV07-map1-spot1-0063.nxs", _larch)
h5.exp.scan_data.data_01

Since most of the other innumerable streams of data in these .nxs files seem to be sample and instrumentation metadata, it might make sense for this convenience function to attach the contents of h5.exp.scan_data to the root of the Group, as these are the data most relevant to XAFS analysis in practice. Perhaps the NXData identifier can be used to push up the relevant data further in the hierarchy.
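A minimal sketch of how such a convenience function might locate the relevant data, assuming only that groups carry an `NX_class` attribute. The name `find_nxdata` and the `Node`/`Leaf` stand-ins are hypothetical and are not part of Larch. The walker is duck-typed (it only needs `.items()` and `.attrs`), so it should also accept a file object opened with h5py or silx.io.open, but that has not been checked against the attached files.

```python
def _text(value):
    """Return an attribute value as str (HDF5 attributes may be bytes)."""
    return value.decode() if isinstance(value, bytes) else str(value)


def find_nxdata(group, path=""):
    """Yield the paths of all groups whose NX_class attribute is NXdata."""
    for name, node in group.items():
        child = f"{path}/{name}"
        if not hasattr(node, "items"):  # a dataset, not a group
            continue
        if _text(node.attrs.get("NX_class", "")) == "NXdata":
            yield child
        yield from find_nxdata(node, child)


class Node(dict):
    """Hypothetical in-memory stand-in for an HDF5 group (children + attrs)."""

    def __init__(self, children=None, **attrs):
        super().__init__(children or {})
        self.attrs = attrs


class Leaf:
    """Hypothetical stand-in for an HDF5 dataset: attributes, no children."""

    def __init__(self, **attrs):
        self.attrs = attrs


# Toy hierarchy mirroring the exp/scan_data/data_01 path quoted above.
root = Node({
    "exp": Node({
        "scan_data": Node({"data_01": Leaf()}, NX_class=b"NXdata"),
        "instrument": Node(NX_class="NXinstrument"),
    }, NX_class="NXentry"),
})
print(list(find_nxdata(root)))  # → ['/exp/scan_data']
```

The paths found this way could then be used to attach the contents of the matching group to the root of the returned Group.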

ALARA401_JV07-map1-spot1-0063.zip

newville commented 1 year ago

@kaarelmand Thanks - I think we would be willing to say that the files from SOLEIL should be supported, and not assume that all H5 files follow the conventions of the ESRF/Bliss/Spec beamlines.

Is "root.exp.scan_data" meant to be some universal description of scans? That doesn't seem very NeXuS-like to me ;). But, if this is the H5 schema that SOLEIL uses, then sure, let's use that.

@maurov Can we add a way to detect what conventions an H5 file uses before assuming it is Spec/Bliss H5 file? That way we might cover French and European conventions for H5 ;).
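One possible shape for such a check, as a rough sketch: look for a `measurement` group (the name the Spec/Bliss reader complains about in the traceback above) and otherwise for a child group tagged `NX_class = NXdata`. The function name, the returned labels, and the `Node` stand-in are all hypothetical; a real implementation would need to be validated against files from each facility.

```python
def _text(value):
    """Return an attribute value as str (HDF5 attributes may be bytes)."""
    return value.decode() if isinstance(value, bytes) else str(value)


def detect_h5_convention(root):
    """Guess the layout of an opened HDF5 file.

    Returns "spech5-like", "nxdata" or "unknown". These labels are made up
    for this sketch; they are not names used by Larch.
    """
    for entry in root.values():
        if not hasattr(entry, "keys"):
            continue  # a dataset at the root, not a scan or entry group
        if "measurement" in entry:
            return "spech5-like"
        for child in entry.values():
            attrs = getattr(child, "attrs", {})
            if _text(attrs.get("NX_class", "")) == "NXdata":
                return "nxdata"
    return "unknown"


class Node(dict):
    """Hypothetical in-memory stand-in for an HDF5 group (children + attrs)."""

    def __init__(self, children=None, **attrs):
        super().__init__(children or {})
        self.attrs = attrs


# Illustrative layouts only; the scan name "1.1" is an assumption.
spec_like = Node({"1.1": Node({
    "measurement": Node(),
    "instrument": Node({"positioners": Node()}),
})})
soleil_like = Node({"exp": Node(
    {"scan_data": Node(NX_class="NXdata")}, NX_class="NXentry")})

print(detect_h5_convention(spec_like))    # → spech5-like
print(detect_h5_convention(soleil_like))  # → nxdata
```

Checking for `measurement` first is a deliberate choice in this sketch, so that files already handled by the existing reader keep going down the existing path.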

maurov commented 1 year ago

@kaarelmand sure, we are willing to have Larch and xas_viewer read as many data formats as seamlessly as possible, so this should not be difficult to implement; we just need to know the structure of the HDF5 file you have sent.

I may be wrong, but to me the HDF5 file you have provided does not follow the NeXuS directives at all. From NeXuS it takes only the file extension. Below is what I see when I look at its structure:

image

The error you get is due to the fact that we use by default the spech5 API in the module larch.io.specfile_reader.DataSourceSpecH5. This is the standard data scheme used at ESRF.

I think we can extend this module to SOLEIL scheme, but we need a clear description of how the data are structured in the HDF5 container.

Otherwise, if you want to contribute directly to larch, feel free to submit a pull request. The only constraint we ask is to use silx.io.open to read the file instead of using h5py directly.

kaarelmand commented 1 year ago

Thanks for considering this!

I've attached five .nxs files from SOLEIL. Two are from the PUMA beamline: the same XAFS file attached above and an XRF mapping .nxs file; and three are from the LUCIA beamline: one for XRF mapping data, one for a normal XAFS run, and one for a flyscan XAFS -- i.e., where the actuators are not stopping for the measurements, but instead the fluorescence at each energy is integrated over some distance "on the fly"; this last one may be difficult for XAS Viewer to parse. Probably the XRF map data are not useful here (unless you want to support them in GSEMapViewer), but I included them just in case.

Based on this small sample, it seems like the root.exp.scan_data format for hosting data is standardized throughout SOLEIL. I don't think this is in conflict with NeXus directives. On the image above, exp has the class NXentry and right below it is scan_data with the class NXdata, just as is described in the NeXus design document. Of the various entries under scan_data, the first one has a primary attribute, which suggests it is the controlling variable for any plotting (energy scale or monochromator position in this case), whereas other data channels have signal attributes, suggesting these are the responding variables to be plotted. This, too, is described under the NeXus data storing rules, although it corresponds to the now-deprecated Version 1 schema for finding plottable data.
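The reading of the attributes described above could be sketched roughly as follows: the dataset carrying a `primary` attribute is taken as the x-axis, and every dataset carrying a `signal` attribute as a plottable channel. The function name and the dataset names other than `data_01` are hypothetical, and the stand-ins only mimic the `.attrs` mapping of a real HDF5 dataset.

```python
from types import SimpleNamespace


def split_axis_and_signals(nxdata):
    """Return (axis_name, signal_names) for a group of attributed datasets."""
    axis, signals = None, []
    for name, dataset in nxdata.items():
        attrs = dataset.attrs
        if "primary" in attrs and axis is None:
            axis = name  # controlling variable, e.g. the energy scale
        elif "signal" in attrs:
            signals.append(name)  # responding variable to plot against it
    return axis, signals


# Toy NXdata group: a plain dict of objects that expose .attrs.
# "energy_axis", "data_02" and "comment" are made-up names.
scan_data = {
    "energy_axis": SimpleNamespace(attrs={"primary": 1}),
    "data_01": SimpleNamespace(attrs={"signal": 1}),
    "data_02": SimpleNamespace(attrs={"signal": 2}),
    "comment": SimpleNamespace(attrs={}),
}

axis, signals = split_axis_and_signals(scan_data)
print(axis, signals)  # → energy_axis ['data_01', 'data_02']
```

Datasets with neither attribute, such as metadata strings, are simply skipped.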

I can try making a PR, although I'm very new at this and it'll take me a bit of time.

soleil-nexus_files.zip

maurov commented 1 year ago

@kaarelmand thanks for sending more example data from SOLEIL XAFS beamlines. I apologize for my earlier comment about not following NeXus directives. I completely missed the NX* entries this morning, before a good coffee and while I was aligning the beamline in parallel ;)

I propose to use silx.io.nxdata for reading the NeXus data into larch. I will have a look this week and give an update here. Is that fine for you?

@newville are you in a hurry to release 0.9.67, or could you wait to have this included in the next release? In my opinion, it would be great to have initial support for SOLEIL XAFS data in the next release.

kaarelmand commented 1 year ago

No worries at all; that course of action works great for me!

maurov commented 1 year ago

@woutdenolf I put you in the loop for this, as discussed this morning. It would be great to have a proof of concept generic NeXus reader in Larch, e.g. larch.io.nexus_reader.

If you are willing to do so, you could start by implementing the methods in larch.io.specfile_reader.DataSourceSpecH5. By doing so, it will be straightforward to make the xas_viewer GUI able to read NeXus files without too many changes.

newville commented 1 year ago

@maurov @kaarelmand @woutdenolf I have a couple of thoughts here.

a) I am not at all opposed to a "generic HDF5/NeXuS" browser to select data. I'm not sure whether using silx.io or nexpy.nexusformat code or straight h5py would be the best approach, but this seems worth pursuing. I'm willing to work toward this. Help or suggestions would be very much appreciated. I might also ask the APS "nexus people" about such questions.

b) I am also in favor of being able to identify data from "common sources" (beamlines, data-acquisition systems) and making sensible (but optional) default choices for how to read data. So, collecting example data sets and figuring out schema for "SOLEIL Nexus" (and "Elettra", etc) would be very helpful.

c) I am also in favor of working toward a common schema or at least a set of translations or aliases for the various formats based on (or related to) HDF5. Like, I'm toying with the idea of switching my HDF5 format for XRF maps to use Zarr. Also, and perhaps coincidentally, I had a meeting this morning for planning an XAFS meeting (Q2XAFS) for next August in Australia that will include real discussions on data formats. One of the "problems" identified is definitely how to handle the XAFS data in the various required-and-not-loved HDF5 formats.

So: any volunteers or suggestions for who are the right people to be in that conversation?

maurov commented 1 year ago

@newville thanks for your comments. I propose, as first step, to let @woutdenolf work out a proof of concept code working for the SOLEIL NeXus files sent by @kaarelmand (let's consider only the XAFS data and take out the XRF one for the moment).

Personally, I am afraid of those never-ending discussions on data formats that never converge to anything usable in practice and in the meantime most of the users simply convert these fancy HDF5/NeXus into ASCII files. This said, I would be more than happy to take part (virtually) in the next Q2XAFS meeting. I propose to discuss this topic elsewhere and keep this issue for the specific case initially raised by @kaarelmand .

newville commented 1 year ago

@maurov Thanks. I am fine with focussing on the issue as raised here (cannot read files from SOLEIL easily) and moving the larger-scale discussion elsewhere.

maurov commented 1 year ago

@kaarelmand

To give some news from my side, I am lost in the NeXus complexity and, as a human, I am not able to understand how to get a simple (energy, I0, mu) array from the PUMA and LUCIA data. Please, could you post a simple example of code using h5py that gets those arrays out of two XAFS data files from these SOLEIL beamlines?

Furthermore, could you send an example of data file with multiple XAFS scans? For the LUCIA ones, to me it is practically impossible to know how to move from / down to the data, because the name of the first group is specific to the sample. I think the easiest for me would be to discuss directly with beamline people at SOLEIL. Please, could you send me their contacts via private email?

@woutdenolf any news from your side on this?

woutdenolf commented 1 year ago

I will have a look at this mid October.

woutdenolf commented 1 year ago

I'm implementing a generic XAS source for Nexus. I will support 1 scan == 1 XAS scan. I could support Fullfield XAS (1 scan == many XAS scans) but as larch analysis is scan per scan, I don't think it makes sense.

@kaarelmand Some of the files do not contain XAS data. For example, xasflyscan seems to contain only XRF spectra. You will first need to convert that to XAS data (1D data, energy vs. mu, I0, I1, ...).

newville commented 1 year ago

@woutdenolf Thanks very much for working on this! I agree that support for processing and analyzing full-field XAFS is something we don't really consider, but we should consider how to do that. But, if it is clear that there is a single (or even common) way to represent such data, I would not at all be opposed to "provisional support for reading it". That would at least allow display, slicing, converting/merging into 1-D XAS spectra.

woutdenolf commented 1 year ago

We could easily add other XasDataSource classes for fullfield but we're talking about thousands of spectra. Opening all of them in xas_viewer is rather pointless imo. You would need an entirely different type of interface if you want to handle XAS imaging or tomography.

maurov commented 1 year ago

@woutdenolf thank you very much indeed for working on this. I agree with you that opening fullfield data in xas_viewer does not make sense, but Larch is also a library and aims to handle more X-ray spectroscopy data, beyond XAFS. For example, Matt uses Larch for XRF imaging and tomography on his beamline; I use Larch for XES, RIXS, or any data containing peak-like features, for peak fitting. Larch is also used for X-ray refraction data used in many techniques like DAFS, ReflEXAFS and spectral ptychography. So, in my opinion, if we want to implement a "generic NeXus file reader in Larch" we should include the possibility of reading such data from the beginning.

We will review #412 as soon as we have time for this, but I would recommend first adding your reader with an example (a Larch script or a Jupyter notebook in pure Python would be great!) of how to read and plot XAFS spectra from the NeXus files provided by @kaarelmand. At this stage I would not change the usual way Larch and xas_viewer read data. @newville what is your opinion on this?

@mretegan I think it would be nice to have your opinion on this too.

woutdenolf commented 1 year ago

class XasScan(NamedTuple):
    name: str
    description: str
    info: str
    start_time: str
    labels: List[str]
    data: ArrayLike

Ok then XasScan.data can have shape (nlabels, nenergy) for a single spectrum scan and shape (nlabels, npoints, nenergy) for a multi spectrum scan.

woutdenolf commented 1 year ago

Btw, do you prefer

class XasScan(NamedTuple):
    name: str
    description: str
    info: str
    start_time: str
    labels: List[str]
    data: ArrayLike

or

class XasScan(NamedTuple):
    name: str
    description: str
    info: str
    start_time: str
    data: Dict[str, ArrayLike]

or even

class XasScan(NamedTuple):
    name: str
    description: str
    info: str
    start_time: str
    data: pandas.DataFrame

maurov commented 1 year ago

@woutdenolf how to represent the data in memory (let's call it the "data model") can be a never ending discussion. In Larch this is currently done with the Group object and all functions work with it, so I think the base 1D data structure should stay like this.

What we are missing in my opinion is a "Group of Groups" common object in Larch that can be nested (tree-like data model). At the moment we store Groups either in lists or in dictionaries. I think it would be beneficial to enhance this aspect for the moment.
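To make the idea concrete, here is a rough sketch of what a nestable container with attribute access could look like. `TreeGroup` is a hypothetical class written for this illustration; it is not the Larch Group and does not reflect any existing Larch API. The sample and scan names and the numbers are made up.

```python
class TreeGroup:
    """Hypothetical nestable container: attributes are data or sub-groups."""

    def __init__(self, **kws):
        self.__dict__.update(kws)

    def walk(self, prefix=""):
        """Yield (dotted_path, value) for every non-group attribute."""
        for name, value in vars(self).items():
            path = f"{prefix}.{name}" if prefix else name
            if isinstance(value, TreeGroup):
                yield from value.walk(path)  # descend into the sub-group
            else:
                yield path, value


session = TreeGroup(
    sample1=TreeGroup(
        scan1=TreeGroup(energy=[7100.0, 7101.0], mu=[0.10, 0.12]),
        scan2=TreeGroup(energy=[7100.0, 7101.0], mu=[0.11, 0.13]),
    ),
)

print(session.sample1.scan1.mu)             # → [0.1, 0.12]
print([path for path, _ in session.walk()])
```

Compared with keeping groups in lists or dictionaries, a tree like this keeps the same `Thing.attribute` access at every level and can be traversed generically.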

When more than one dataset or more complex structure is needed I think we should stick to it and expand it to

newville commented 1 year ago

@woutdenolf @maurov Thanks, I'm falling a bit behind due to other beamline stuff.

Yes to "Groups" (basically an empty class to access with Thing.attribute instead of Thing['attribute']) for general containers of data.

But, if XasScan here is meant to be a predictable, static-like thing, a NamedTuple is a fine way to represent "The XAS data from a Nexus file". I would suggest that we don't really need Pandas. I sort of like "simple" for data structures. I would probably use

class XasScan(NamedTuple):
    name: str
    description: str
    info: str
    start_time: str
    labels: List[str]
    data: ArrayLike

as it seems like it maps to HDF5 a bit better, and also to how we are already reading data from some text files.
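As an illustration of how the labels-plus-array form could be turned into attribute-style access, here is a minimal sketch. It uses `types.SimpleNamespace` as a stand-in for a Larch Group and nested lists as a stand-in for `ArrayLike`; the helper name `scan_to_namespace` and all values are made up, and labels are assumed to be distinct, valid Python identifiers.

```python
from types import SimpleNamespace
from typing import List, NamedTuple, Sequence


class XasScan(NamedTuple):
    name: str
    description: str
    info: str
    start_time: str
    labels: List[str]
    # Stand-in for ArrayLike with shape (nlabels, nenergy).
    data: Sequence[Sequence[float]]


def scan_to_namespace(scan):
    """Expose each labelled row of scan.data as an attribute."""
    if len(scan.labels) != len(scan.data):
        raise ValueError("number of labels and data rows differ")
    group = SimpleNamespace(name=scan.name, labels=list(scan.labels))
    for label, row in zip(scan.labels, scan.data):
        # Assumption: a label never collides with "name" or "labels".
        setattr(group, label, row)
    return group


# Made-up values, for illustration only.
scan = XasScan(
    name="scan1",
    description="",
    info="",
    start_time="2023-01-01T00:00:00",
    labels=["energy", "i0", "mu"],
    data=[[7100.0, 7101.0], [1000.0, 1001.0], [0.10, 0.12]],
)

group = scan_to_namespace(scan)
print(group.energy)  # → [7100.0, 7101.0]
print(group.mu)      # → [0.1, 0.12]
```

Keeping `labels` alongside the rows means the original row order can still be recovered after the conversion.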