workflow4metabolomics / mtbls-dwnld


Switching from R to Python #4

Closed pkrog closed 7 years ago

pkrog commented 7 years ago

Hi @djcomlab, @proccaserra,

I'd like to start translating my R code into Python, as we discussed at the Halle workshop. Is isatools the package I have to use? I cannot find a method to load ISA-Tab data from a folder. Could you point me to the right code, please?

Regards, Pierrick

proccaserra commented 7 years ago

@pierrickrogermele, thanks for getting this going again. The latest version of the ISA-API can be found here: https://github.com/ISA-tools/isa-api/releases/tag/v0.7.5

The Read the Docs pages may help. If the data is already in MetaboLights, please see here: http://isatools.readthedocs.io/en/latest/importdata.html

Or, if you have data in a git repo: http://isatools.readthedocs.io/en/latest/github.html

For local files, the following component has the read method you are looking for: https://github.com/ISA-tools/isa-api/blob/master/isatools/isatab.py

If anything is unclear, do get in touch.

pkrog commented 7 years ago

Thank you @proccaserra , I'll look at this class.

djcomlab commented 7 years ago

Hi @pierrickrogermele, we have a function isatab.load(FP), where FP is an open file object for the investigation file (see lines around https://github.com/ISA-tools/isa-api/blob/master/isatools/isatab.py#L2676), that returns the ISA content as Python objects (see https://github.com/ISA-tools/isa-api/blob/master/isatools/model/v1.py).

The idea is that you then manipulate the ISA content as pure Python objects of the ISA data model, and use isatab.dump(isa_obj, file_path) to write out the ISA-Tab (see around line https://github.com/ISA-tools/isa-api/blob/master/isatools/isatab.py#L49).

So you could envisage something like:

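from isatools import isatab

# Load the investigation file into ISA model objects
with open("i_investigation.txt") as fp:
    investigation = isatab.load(fp)

investigation.title = "Change name of investigation"  # obviously can do more complex changes than this
isatab.dump(investigation, "/tmp/")  # write the modified ISA-Tab to the output directory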

Hope that helps!

pkrog commented 7 years ago

Hi @djcomlab ,

Yes, I did see this method and used it. I'm now trying to list the assays of a study, and possibly select one in particular. Then I'll have to get the sample names, and extract the sample metadata, the variable metadata and the matrix. The output will be in W4M format, so I won't need dump(). Maybe once I'm done, this isatab2w4m code could be included in isa-api, inside the convert section?

Best regards, Pierrick

djcomlab commented 7 years ago

Sure that would be great!

pkrog commented 7 years ago

Hi @djcomlab ,

I've found that the member data_files of the Assay class contains the list of data files. I'm using MTBLS30 (without raw files) to test, and for each assay I'm getting a list of 3 elements for data_files, two being blank and one being set to the m_*.tsv file. Why are 2 elements blank? How do I identify the text data file containing the data frame and ignore the raw files that could be listed inside the data_files member?

djcomlab commented 7 years ago

Hi Pierrick,

This is probably a bug. From what I can see in MTBLS30, the Raw Spectral Data File and Derived Spectral Data File columns are both empty, so it could be that the parser is incorrectly picking up a single data file from each, with an empty string as the file name. I think the correct output should show only the one MAF file in the data_files list.

I'll take a look when I have a minute, as I'm on a day of annual leave today.

Best/David

djcomlab commented 7 years ago

I've pushed a fix to the develop branch (although develop is broken at the moment while I am developing some other new features), so it will make it into the next isatools release of the Python package soon. In the meantime you should just ignore the Data objects with an empty filename.
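
As an illustration of that workaround, here is a minimal sketch that skips data files with a blank name and then picks out the MAF file by its m_*.tsv name pattern. It runs on stand-in objects rather than on a real isatools assay: the `filename` attribute, the helper names `named_data_files` and `find_maf`, and the example file name are assumptions made for the example, not taken from the isatools code.

```python
from fnmatch import fnmatch
from types import SimpleNamespace


def named_data_files(assay):
    """Return the assay's data files, skipping entries with a blank filename."""
    return [df for df in assay.data_files if df.filename and df.filename.strip()]


def find_maf(assay):
    """Return the first data file whose name matches m_*.tsv, or None."""
    for df in named_data_files(assay):
        if fnmatch(df.filename, "m_*.tsv"):
            return df
    return None


# Stand-in objects for illustration only; real code would use the objects
# returned by isatab.load(). The 'filename' attribute is an assumption.
assay = SimpleNamespace(data_files=[
    SimpleNamespace(filename=""),
    SimpleNamespace(filename=""),
    SimpleNamespace(filename="m_example_maf.tsv"),  # hypothetical file name
])

maf = find_maf(assay)
print(maf.filename if maf else "no MAF file found")
```

Matching on the file name is only a heuristic; if raw files are listed alongside the MAF, a check on the data file's type label may be more robust, where that information is available.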

pkrog commented 7 years ago

Hi @djcomlab , Thanks, that's what I've done (ignoring the empty strings).

Now I have the following error when trying to load MTBLS404 (Sacurine data):

Setting material objects: 211 of 211 |#################################################################################################################################| Time: 0:00:00
Generating process objects: 211 of 211 |###############################################################################################################################| Time: 0:00:00
Setting material objects: 211 of 211 |#################################################################################################################################| Time: 0:00:00
Linking processes and other nodes in paths: 211 of 211 |###############################################################################################################| Time: 0:00:00
warning: Protocol REF missing before 'Data Transformation Name', found 'Normalization Name'
Setting material objects: 234 of 234 |#################################################################################################################################| Time: 0:00:00
Generating process objects: 234 of 234 |###############################################################################################################################| Time: 0:00:01
Setting material objects: 234 of 234 |#################################################################################################################################| Time: 0:00:00
Setting material objects: 0 of 234 |                                                                                                                                  | ETA:  --:--:--
Traceback (most recent call last):
  File "./isatab2w4m.py", line 204, in <module>
    convert2w4m(**args_dict)
  File "./isatab2w4m.py", line 171, in convert2w4m
    investigation = load_investigation(input_dir)
  File "./isatab2w4m.py", line 164, in load_investigation
    investigation = ISATAB.load(f)
  File "/home/pierrick/.local/lib/python3.4/site-packages/isatools/isatab.py", line 2893, in load
    study_factors=study.factors).create_from_df(assay_tfile_df)
  File "/home/pierrick/.local/lib/python3.4/site-packages/isatools/isatab.py", line 3272, in create_from_df
    material = other_material[node_key]
KeyError: 'Labeled Extract Name:'

Do you have any idea about what could be wrong?

proccaserra commented 7 years ago

@pierrickrogermele could it be the ':' (colon) at the end of 'Labeled Extract Name:'? It should be 'Labeled Extract Name'.

djcomlab commented 7 years ago

Sorry for the delay @pierrickrogermele - I am on leave this week.

I think @proccaserra is probably correct (well spotted - I couldn't see it at first!), the parser looks for the exact label Labeled Extract Name so the extra colon would raise the KeyError.

pkrog commented 7 years ago

But I'm just calling the load() method, and a_sacurine.txt has the column correctly named "Labeled Extract Name", without any colon. Weird...

djcomlab commented 7 years ago

Ah yes, actually I think it's because of the empty values in the Labeled Extract Name column. The piece of code where the error occurs is actually looking up Labeled Extract Name:{label value} but in this case, like the other error we discussed previously, there is no label value and it's trying to look up a Labeled Extract with an empty string as the name.

I'll have to debug and fix this also.
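
To make the failure mode concrete, here is a small sketch of how a key built as column name plus ":" plus cell value turns into 'Labeled Extract Name:' when the cell is empty. This is not the isatools implementation: the `materials` dict and the `lookup` and `safe_lookup` helpers are hypothetical stand-ins that only reproduce the key pattern described above.

```python
# Illustration only: a plain dict stands in for the parser's lookup table.
materials = {"Labeled Extract Name:extract_1": "material object for extract_1"}


def lookup(column, value):
    """Build a 'column:value' key and look it up directly."""
    return materials[column + ":" + value]


def safe_lookup(column, value):
    """Skip blank cells instead of looking up a key with no value part."""
    if not value.strip():
        return None
    return materials.get(column + ":" + value)


print(lookup("Labeled Extract Name", "extract_1"))

try:
    lookup("Labeled Extract Name", "")
except KeyError as err:
    # The missing key ends with a colon because the cell value is empty.
    print("KeyError:", err)

print(safe_lookup("Labeled Extract Name", ""))
```

The trailing colon in the error message therefore comes from the key construction, not from the column header in the assay file.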

djcomlab commented 7 years ago

Hi @pierrickrogermele, just to let you know I'm back from leave and will try to fix this ASAP.

djcomlab commented 7 years ago

Hi @pierrickrogermele, just to keep you updated I've pushed a fix and will roll out a minor update to the isatools pip package tonight.

djcomlab commented 7 years ago

Hi Pierrick,

I've pushed a fix and pushed out a new isatools package release (should be at v0.8.1) via PyPI just now.

Please do a 'pip install isatools --upgrade' and try it out when you can!

Best/David

pkrog commented 7 years ago

Hi @djcomlab ,

I'm done. The script isatab2w4m has been translated into Python as isatab2w4m.py, on the develop branch. I'll open an issue in ISA-tools/isa-api where we can discuss the integration of this code into the ISA-API.

Pierrick