zqfang / GSEApy

Gene Set Enrichment Analysis in Python
http://gseapy.rtfd.io/
BSD 3-Clause "New" or "Revised" License

pandas.errors.ParserError: Error tokenizing data. C error: Expected 1 fields in line 4, saw 2 #83

Closed tjiagoM closed 4 years ago

tjiagoM commented 5 years ago

Hello,

I have to run multiple enrichments over different groups of genes, so I have a big for loop that goes over all these groups of genes and, for each one, just runs:

enr = gp.enrichr(gene_list=list(genes_array.astype('<U3')),
                 organism='human',
                 description='test',
                 gene_sets='Reactome_2016',
                 cutoff=1)

Once in a while I have this error:

Traceback (most recent call last):                       
File "my_script.py", line 83, in <module>
  cutoff=1)
File "/home_location/.local/lib/python3.6/site-packages/gseapy/enrichr.py", line 391, in enrichr
  enr.run()
File "/home_location/.local/lib/python3.6/site-packages/gseapy/enrichr.py", line 331, in run
  shortID, res = self.get_results(genes_list)
File "/home_location/.local/lib/python3.6/site-packages/gseapy/enrichr.py", line 169, in get_results
  res = pd.read_csv(StringIO(response.content.decode('utf-8')),sep="\t")
File "/home_location/miniconda/envs/env-general/lib/python3.6/site-packages/pandas/io/parsers.py", line 702, in parser_f
  return _read(filepath_or_buffer, kwds)
File "/home_location/miniconda/envs/env-general/lib/python3.6/site-packages/pandas/io/parsers.py", line 435, in _read
  data = parser.read(nrows)
File "/home_location/miniconda/envs/env-general/lib/python3.6/site-packages/pandas/io/parsers.py", line 1139, in read
  ret = self._engine.read(nrows)
File "/home_location/miniconda/envs/env-general/lib/python3.6/site-packages/pandas/io/parsers.py", line 1995, in read
  data = self._reader.read(nrows)
File "pandas/_libs/parsers.pyx", line 899, in pandas._libs.parsers.TextReader.read
File "pandas/_libs/parsers.pyx", line 914, in pandas._libs.parsers.TextReader._read_low_memory
File "pandas/_libs/parsers.pyx", line 968, in pandas._libs.parsers.TextReader._read_rows
File "pandas/_libs/parsers.pyx", line 955, in pandas._libs.parsers.TextReader._tokenize_rows
File "pandas/_libs/parsers.pyx", line 2172, in pandas._libs.parsers.raise_parser_error

pandas.errors.ParserError: Error tokenizing data. C error: Expected 1 fields in line 4, saw 2   

I'm having huge difficulty isolating the error because it doesn't always happen for the same group of genes. Could anyone give a hint about what the problem could be? I've only started using gseapy very recently.

If I cannot find the source of the error I guess it's fine, because I've been able to run all the groups just by re-running the code... which is quite annoying, as I don't know whether some enrichment might be wrong. What could I be missing here?

zqfang commented 5 years ago

How many gene groups are you querying? You get this problem because of this line of code:

 res = pd.read_csv(StringIO(response.content.decode('utf-8')),sep="\t")

I don't know exactly what happens, but I suspect the reason is network latency: gseapy waits a long time to get results back from the Enrichr server. I'll take some time to look into this.

tjiagoM commented 5 years ago

Yeah, for some groups I have a few hundred genes, but I ended up not keeping track of which group fails because it constantly changes. I will run it again and see for which groups it stops this time.

Now that you mention it, gseapy was sometimes failing with a connection reset exception, and I solved that by just adding a few milliseconds of sleep before each call to enrichr(). Could it be that the response read by StringIO contains some error/warning from the API request, and that's why pandas cannot parse it properly?
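
For reference, the workaround looks roughly like this (gene_groups is just a placeholder for my own data structure, and the exact pause length is arbitrary):

import time
import gseapy as gp

# gene_groups: hypothetical dict mapping a group name to a numpy array of gene symbols
for group_name, genes_array in gene_groups.items():
    time.sleep(0.05)  # brief pause before each request so the Enrichr API is not hit too quickly
    enr = gp.enrichr(gene_list=list(genes_array.astype('<U3')),
                     organism='human',
                     description='test',
                     gene_sets='Reactome_2016',
                     cutoff=1)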

tjiagoM commented 5 years ago

@zqfang I was going to create a new issue, but I'm now receiving another error, again inconsistently (a bit like the error in this issue). Do you think it might be related? Apologies for just throwing the exceptions here, but they appear randomly, so maybe you know better how to help me.

Traceback (most recent call last):
  File "07_explain_communitites.py", line 84, in <module>
    cutoff=0.05)
  File "/home_location/.local/lib/python3.6/site-packages/gseapy/enrichr.py", line 391, in enrichr
    enr.run()
  File "/home_location/.local/lib/python3.6/site-packages/gseapy/enrichr.py", line 309, in run
    gss = self.parse_genesets()
  File "/home_location/.local/lib/python3.6/site-packages/gseapy/enrichr.py", line 68, in parse_genesets
    enrichr_library = self.get_libraries()
  File "home_location/.local/lib/python3.6/site-packages/gseapy/enrichr.py", line 183, in get_libraries
    libs = [lib['libraryName'] for lib in libs_json['statistics']]
KeyError: 'statistics'
zqfang commented 5 years ago

I think the problems you've had have the same cause: the Enrichr server cannot handle gseapy's many concurrent requests from the same IP address in a short time. It seems the user gets blocked to prevent API abuse, so when you try to get the data back, you get nothing. I have no way to improve this other than adding a sleep after each query. Do you have any ideas?
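
One possible user-side workaround would be to retry a failed query after a pause. Just a sketch; the retry count and delays are arbitrary:

import time
import gseapy as gp

def enrichr_with_retry(gene_list, gene_sets, retries=3, delay=5):
    # retry the query a few times, waiting longer after each failure,
    # in case the server temporarily rejects requests from this IP
    for attempt in range(retries):
        try:
            return gp.enrichr(gene_list=gene_list,
                              gene_sets=gene_sets,
                              organism='human',
                              description='test',
                              cutoff=1)
        except Exception:  # e.g. the ParserError or KeyError reported above
            if attempt == retries - 1:
                raise
            time.sleep(delay * (attempt + 1))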

tjiagoM commented 5 years ago

I see, thanks for the help anyway!

I'd say that if you get a timeout from the Enrichr server, or some error in the answer coming back from Enrichr, you could just catch that and tell the user that the problem is with the Enrichr server (and maybe suggest waiting a bit or reducing the number of requests). Otherwise all these errors will just cause confusion when the problem is actually simple, as you pointed out.
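
Just to illustrate the idea (this isn't the actual gseapy code, and parse_enrichr_response is a made-up helper name): the reply could be checked before handing it to pandas, and a clear error raised instead:

import pandas as pd
from io import StringIO

def parse_enrichr_response(response, library_name):
    # raise a clear error instead of letting pandas fail with a cryptic ParserError
    text = response.content.decode('utf-8')
    first_line = text.splitlines()[0] if text else ''
    if not response.ok or '\t' not in first_line:
        raise RuntimeError(
            'Enrichr returned an unexpected reply for library %s; the server may be '
            'overloaded or rate-limiting this IP. Try waiting a bit or reducing the '
            'number of requests.' % library_name)
    return pd.read_csv(StringIO(text), sep='\t')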

zqfang commented 5 years ago

Well, good idea. A warning should be printed to the console if nothing comes back. The Enrichr server is being upgraded at the moment; if you still have the same problem, you will need to re-run.

(screenshot attached, 2019-07-10)
tsnetterfield commented 5 years ago

I am also getting the same error that @tjiagoM posted above when executing the following on a list of about 50 genes:

en_rnk_1=gp.enrichr(gene_list=rnk1_en,description='test',gene_sets='NCI-Nature_2016',outdir='./GSEA Files/Selected Gene Sets')

I updated to the latest release and am still getting this issue. Is there still a problem with the server that is causing this?

tsnetterfield commented 5 years ago

I have waited a week and I am still getting the same error:

2019-09-26 14:28:42,305 Error fetching enrichment results: TRRUST_Transcription_Factors_2019
---------------------------------------------------------------------------
ParserError                               Traceback (most recent call last)
<ipython-input-59-902aeaec60e8> in <module>
----> 1 en_rnk_1=gp.enrichr(gene_list=rnk1_en,gene_sets='TRRUST_Transcription_Factors_2019',outdir='./GSEA Files/Selected Gene Sets')

~\Anaconda3\lib\site-packages\gseapy\enrichr.py in enrichr(gene_list, gene_sets, organism, description, outdir, background, cutoff, format, figsize, top_term, no_plot, verbose)
    415     enr = Enrichr(gene_list, gene_sets, organism, description, outdir,
    416                   cutoff, background, format, figsize, top_term, no_plot, verbose)
--> 417     enr.run()
    418 
    419     return enr

~\Anaconda3\lib\site-packages\gseapy\enrichr.py in run(self)
    354                 self._logger.debug("Start Enrichr using library: %s" % (self._gs))
    355                 self._logger.info('Analysis name: %s, Enrichr Library: %s' % (self.descriptions, self._gs))
--> 356                 shortID, res = self.get_results(genes_list)
    357                 # Remember gene set library used
    358             res.insert(0, "Gene_set", self._gs)

~\Anaconda3\lib\site-packages\gseapy\enrichr.py in get_results(self, gene_list)
    182         if not response.ok:
    183             self._logger.error('Error fetching enrichment results: %s'%self._gs)
--> 184         res = pd.read_csv(StringIO(response.content.decode('utf-8')), sep="\t")
    185         return [job_id['shortId'], res]
    186 

~\Anaconda3\lib\site-packages\pandas\io\parsers.py in parser_f(filepath_or_buffer, sep, delimiter, header, names, index_col, usecols, squeeze, prefix, mangle_dupe_cols, dtype, engine, converters, true_values, false_values, skipinitialspace, skiprows, skipfooter, nrows, na_values, keep_default_na, na_filter, verbose, skip_blank_lines, parse_dates, infer_datetime_format, keep_date_col, date_parser, dayfirst, iterator, chunksize, compression, thousands, decimal, lineterminator, quotechar, quoting, doublequote, escapechar, comment, encoding, dialect, tupleize_cols, error_bad_lines, warn_bad_lines, delim_whitespace, low_memory, memory_map, float_precision)
    700                     skip_blank_lines=skip_blank_lines)
    701 
--> 702         return _read(filepath_or_buffer, kwds)
    703 
    704     parser_f.__name__ = name

~\Anaconda3\lib\site-packages\pandas\io\parsers.py in _read(filepath_or_buffer, kwds)
    433 
    434     try:
--> 435         data = parser.read(nrows)
    436     finally:
    437         parser.close()

~\Anaconda3\lib\site-packages\pandas\io\parsers.py in read(self, nrows)
   1137     def read(self, nrows=None):
   1138         nrows = _validate_integer('nrows', nrows)
-> 1139         ret = self._engine.read(nrows)
   1140 
   1141         # May alter columns / col_dict

~\Anaconda3\lib\site-packages\pandas\io\parsers.py in read(self, nrows)
   1993     def read(self, nrows=None):
   1994         try:
-> 1995             data = self._reader.read(nrows)
   1996         except StopIteration:
   1997             if self._first_chunk:

pandas/_libs/parsers.pyx in pandas._libs.parsers.TextReader.read()

pandas/_libs/parsers.pyx in pandas._libs.parsers.TextReader._read_low_memory()

pandas/_libs/parsers.pyx in pandas._libs.parsers.TextReader._read_rows()

pandas/_libs/parsers.pyx in pandas._libs.parsers.TextReader._tokenize_rows()

pandas/_libs/parsers.pyx in pandas._libs.parsers.raise_parser_error()

ParserError: Error tokenizing data. C error: Expected 1 fields in line 4, saw 2

Any insight into why this may be happening?

zqfang commented 5 years ago

@tsnetterfield, sorry for the late reply. Could you please install the latest version from the repository and try again? I've updated the data that pandas reads; hope this fixes the problem you have.

tsnetterfield commented 5 years ago

@zqfang Thanks for getting back to me! I updated my Python to 3.7.4 and am still getting the same error I posted above.

zqfang commented 5 years ago

@tsnetterfield, please install the latest gseapy using this line of code:

pip install git+git://github.com/zqfang/gseapy.git#egg=gseapy

Make sure that you are using v0.9.16.

tsnetterfield commented 5 years ago

@zqfang When I do this in the Anaconda Prompt, this is the first line that comes up:

Requirement already satisfied: gseapy from git+git://github.com/zqfang/gseapy.git#egg=gseapy in c:\users\tatiana\anaconda3\lib\site-packages (0.9.15)

Anaconda seems to only see the 0.9.15 development version for some reason.

armadillocommander commented 5 years ago

You cannot install the same package again with a different version while the old one is still present. Uninstall the old one first.

tsnetterfield commented 5 years ago

@armadillocommander thanks for the tip! I uninstalled and now have version 0.9.16. However, I am still getting the exact same parser error from above.

zqfang commented 5 years ago

@tsnetterfield, do you mind sharing your gene list input with me? I can't reproduce your bug.

tsnetterfield commented 5 years ago

my_gene_list.txt

Hi @zqfang, attached is the list I was trying to run. I tried a different list just now and got the same error.

zqfang commented 5 years ago

@tsnetterfield, sorry for the late reply; I was on vacation. However, I still could not reproduce the error you got using the same code:

en_rnk_1=gp.enrichr(gene_list="my_gene_list.txt" ,description='test',gene_sets='NCI-Nature_2016',outdir='./GSEA Files/Selected Gene Sets')

Even when I ran the code 50 times, it did not break.

zqfang commented 4 years ago

Closing now; this issue should be gone.

Eddy265 commented 3 years ago

Alternatively, you can save the file as CSV UTF-8 (Comma delimited).

smartup10 commented 3 years ago

I had the same error; I fixed it by regularizing the data in the CSV file.