wking / rss2email

open-source tool for Windows, Mac OS and UNIX for getting news from RSS feeds in email
http://pypi.python.org/pypi/rss2email/
GNU General Public License v2.0

Replace json? #81

Open sciunto opened 8 years ago

sciunto commented 8 years ago

Hi @wking and others.

As explained here: https://github.com/kurtmckee/feedparser/issues/44, I did some investigation into performance and CPU consumption. I saw somebody complaining on the web about high CPU consumption, and I have similar problems.

In addition to what I reported above and the small enhancement I tried to make, it appears that a lot of CPU time is spent in the save functions.

File: /home/fr/github/rss2email/rss2email/feeds.py
Function: save at line 340

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
   340                                               @profile
   341                                               def save(self):
   342         1          172    172.0      0.0          dst_config_file = _os.path.realpath(self.configfiles[-1])
   343         1           27     27.0      0.0          _LOG.debug('save feed configuration to {}'.format(dst_config_file))
   344       110          335      3.0      0.0          for feed in self:
   345       109       365121   3349.7     14.2              feed.save_to_config()
   346         1           34     34.0      0.0          dirname = _os.path.dirname(dst_config_file)
   347         1           40     40.0      0.0          if dirname and not _os.path.isdir(dirname):
   348                                                       _os.makedirs(dirname, mode=0o700, exist_ok=True)
   349         1            3      3.0      0.0          tmpfile = dst_config_file + '.tmp'
   350         1          162    162.0      0.0          with open(tmpfile, 'w') as f:
   351         1         2953   2953.0      0.1              self.config.write(f)
   352         1           40     40.0      0.0              f.flush()
   353         1       123012 123012.0      4.8              _os.fsync(f.fileno())
   354         1          206    206.0      0.0          _os.rename(tmpfile, dst_config_file)
   355         1      2087550 2087550.0     80.9          self._save_feeds()

File: /home/fr/github/rss2email/rss2email/feeds.py
Function: _save_feed_states at line 373

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
   373                                               @profile
   374                                               def _save_feed_states(self, feeds, stream):
   375         1           10     10.0      0.0          _json.dump(
   376         1            7      7.0      0.0              {'version': self.datafile_version,
   377         1         2910   2910.0      0.1               'feeds': list(feed.get_state() for feed in feeds),
   378                                                        },
   379         1            4      4.0      0.0              stream,
   380         1            9      9.0      0.0              indent=2,
   381         1      1939815 1939815.0     99.8              separators=(',', ': '),
   382                                                       )
   383         1           15     15.0      0.0          stream.write('\n')

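For reference, these numbers come from line_profiler (hence the @profile decorators above). A rough sketch of how a similar measurement could be reproduced through its Python API; the Feeds() constructor and load()/save() calls below are assumptions based on the profile output, not a documented recipe:

    # Hedged sketch: line-by-line profiling of the save path with line_profiler.
    # The Feeds() constructor and load()/save() calls are assumptions based on
    # the profile output above, not a documented recipe.
    import line_profiler
    import rss2email.feeds

    profiler = line_profiler.LineProfiler()
    profiler.add_function(rss2email.feeds.Feeds.save)
    profiler.add_function(rss2email.feeds.Feeds._save_feed_states)

    feeds = rss2email.feeds.Feeds()   # assumes the defaults find your config/data files
    feeds.load()
    profiler.runcall(feeds.save)      # profile a single save cycle
    profiler.print_stats()
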
Is there any particular reason to choose JSON? I guess (though I have not tested it) that a binary format would be more efficient. I don't think anybody is going to read the JSON file by hand anyway. I think there is room for optimization here.

Any thoughts?
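
To make the binary-format question measurable, here is a small self-contained benchmark sketch comparing the stdlib json encoder against the binary pickle format; the dict layout is only a guess at what rss2email stores, not its real schema:

    # Sketch: compare the stdlib json encoder against the binary pickle format
    # on synthetic feed-state-like data.  The dict layout is a guess, not
    # rss2email's real schema.
    import io
    import json
    import pickle
    import timeit

    state = {
        'version': 2,
        'feeds': [
            {'name': 'feed-{}'.format(i),
             'url': 'http://example.com/{}/rss'.format(i),
             'seen': {'entry-{}-{}'.format(i, j): None for j in range(500)}}
            for i in range(150)
        ],
    }

    def dump_json():
        json.dump(state, io.StringIO(), indent=2, separators=(',', ': '))

    def dump_pickle():
        pickle.dump(state, io.BytesIO(), protocol=pickle.HIGHEST_PROTOCOL)

    for label, func in [('json', dump_json), ('pickle', dump_pickle)]:
        best = min(timeit.repeat(func, number=10, repeat=3))
        print('{:6s} {:.3f}s for 10 dumps'.format(label, best))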

sciunto commented 8 years ago

In addition:

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
    53                                           @profile
    54                                           def run(feeds, args):
    55                                               "Fetch feeds and send entry emails."
    56         1            5      5.0      0.0      if not args.index:
    57                                                   args.index = range(len(feeds))
    58         1            3      3.0      0.0      try:
    59         2            6      3.0      0.0          for index in args.index:
    60         1          142    142.0      0.0              feed = feeds.index(index)
    61         1            3      3.0      0.0              if feed.active:
    62         1            1      1.0      0.0                  try:
    63         1      5180755 5180755.0     65.9                      feed.run(send=args.send)
    64                                                           except _error.RSS2EmailError as e:
    65                                                               e.log()
    66                                               finally:
    67         1      2677187 2677187.0     34.1          feeds.save()

sciunto commented 8 years ago

For some other projects, I discovered that ujson has better performance. A quick test with rss2email showed me that I could get a 30 to 50% improvement in the execution time of r2e.

It requires an extra dependency, but it is probably worth it. However, I still think that a binary format would be appropriate. Any other thoughts?
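
If ujson were adopted, the extra dependency could stay optional behind a guarded import. A sketch of what that might look like; the module-level _save_feed_states below is only a stand-in for the method shown in the profile above, and note that ujson's dump() does not accept the separators= keyword the current code passes:

    # Hedged sketch: optional ujson with a stdlib fallback.  The two encoders
    # accept different keyword arguments (ujson.dump() has no separators=), so
    # the kwargs are chosen next to the import.
    try:
        import ujson as _json                # C implementation, extra dependency
        _JSON_KWARGS = {'indent': 2}
    except ImportError:
        import json as _json                 # stdlib fallback, no new dependency
        _JSON_KWARGS = {'indent': 2, 'separators': (',', ': ')}

    def _save_feed_states(feeds, stream, datafile_version):
        # module-level stand-in for the Feeds._save_feed_states method profiled above
        _json.dump(
            {'version': datafile_version,
             'feeds': [feed.get_state() for feed in feeds]},
            stream,
            **_JSON_KWARGS)
        stream.write('\n')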

Phiber2000 commented 8 years ago

You're right that r2e could be faster. But switching to a binary format would be a step backwards. Being able to take a quick look at the database, or make a quick change to it, is a big advantage of r2e! But switching to ujson - why not? If it speeds up the whole thing, it would be great and easy!

One problem may be that the ujson package is not available as a repository package and has to be installed manually. For example, https://packages.debian.org/stretch/python3-ujson is still in the testing suite (and not kept up to date).

sciunto commented 8 years ago

I've been running ujson without any problem for two months. I also split my configuration file into hourly and daily versions (according to the typical feed frequency) to minimize the CPU load.

Phiber2000 commented 8 years ago

I made some measurements using simplejson - which is normally faster than the stdlib json encoder - and didn't notice a considerable runtime improvement (diff: 70 ms).

The guilty part doesn't seem to be the json handler itself...

Did you have a different experience? If so, could you post your _json.dump(...) statement from feeds.py?

jkufner commented 8 years ago

I have 150 feeds monitored by r2e, and ~/.local/share/rss2email.json has about 8000 lines after a fresh run. It takes many minutes and a few hundred megabytes of memory to complete a single run.

Why JSON? It is a completely inappropriate format. What we need here is quick indexed storage, for example Berkeley DB or SQLite. Both can be viewed with existing tools, so it is not hard to debug. The libraries are production-ready, with many tutorials around.

Using such a database would drop memory usage to a few MB per run (the database is mmapped instead of being read and parsed in full), and since the database would be properly indexed, it would increase speed significantly. SQLite can do tens of thousands of inserts per second with no trouble; BDB is even a few times faster. Retrieving data will be faster too, because the index is created once and then stored on disk alongside the data. An index lookup then does not have to load the whole index into memory; it can just read a few blocks here and there (thanks to mmap) to get the requested records (a single logarithmic-complexity search instead of multiple linear passes followed by a hash-table lookup).

Forget about binary json, it won't help.
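
For concreteness, a minimal sketch of what an SQLite-backed "seen entries" store could look like. The schema and helper names are hypothetical, not an existing rss2email API, and real adoption would also need a migration path from rss2email.json:

    # Hedged sketch: per-feed "seen entry" state in SQLite instead of one big
    # JSON blob.  Schema and helper names are hypothetical.
    import os
    import sqlite3

    def open_state(path):
        db = sqlite3.connect(os.path.expanduser(path))
        db.execute('''CREATE TABLE IF NOT EXISTS seen (
                          feed_name TEXT NOT NULL,
                          entry_id  TEXT NOT NULL,
                          PRIMARY KEY (feed_name, entry_id))''')
        return db

    def already_seen(db, feed_name, entry_id):
        # primary-key lookup; no need to parse the whole state file
        cur = db.execute('SELECT 1 FROM seen WHERE feed_name = ? AND entry_id = ?',
                         (feed_name, entry_id))
        return cur.fetchone() is not None

    def mark_seen(db, feed_name, entry_ids):
        with db:  # one transaction per feed keeps the inserts fast
            db.executemany('INSERT OR IGNORE INTO seen VALUES (?, ?)',
                           ((feed_name, e) for e in entry_ids))

    # Usage sketch:
    #   db = open_state('~/.local/share/rss2email.sqlite')
    #   if not already_seen(db, 'my-feed', entry_id):
    #       ...send the mail, then...
    #       mark_seen(db, 'my-feed', [entry_id])

A run that delivers a single new entry would then write only that entry's row, instead of rewriting the state for every feed.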

Ekleog commented 4 years ago

Hello,

This repository has been deprecated for a few years now, and has been replaced by https://github.com/rss2email/rss2email .

If this issue is still relevant to you, and not fixed with v3.12.2, could you please reopen the issue there?

Cheers, Leo