Open sciunto opened 8 years ago
In addition:
Line # Hits Time Per Hit % Time Line Contents
==============================================================
53 @profile
54 def run(feeds, args):
55 "Fetch feeds and send entry emails."
56 1 5 5.0 0.0 if not args.index:
57 args.index = range(len(feeds))
58 1 3 3.0 0.0 try:
59 2 6 3.0 0.0 for index in args.index:
60 1 142 142.0 0.0 feed = feeds.index(index)
61 1 3 3.0 0.0 if feed.active:
62 1 1 1.0 0.0 try:
63 1 5180755 5180755.0 65.9 feed.run(send=args.send)
64 except _error.RSS2EmailError as e:
65 e.log()
66 finally:
67 1 2677187 2677187.0 34.1 feeds.save()
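For reference, the table above is line_profiler output (the @profile decorator visible at line 53 is its hook, typically run through kernprof -l -v). Below is a minimal, generic sketch of collecting the same kind of per-line statistics from Python directly; my_run is a stand-in function, not rss2email code.

```python
# Generic sketch using line_profiler's Python API; my_run is a placeholder,
# not part of rss2email.
from line_profiler import LineProfiler

def my_run(items):
    total = 0
    for item in items:
        total += len(item)
    return total

profiler = LineProfiler()
wrapped = profiler(my_run)      # LineProfiler instances work as decorators
wrapped(['a', 'bb', 'ccc'])     # exercise the wrapped function
profiler.print_stats()          # prints a Line #/Hits/Time/Per Hit table like the one above
```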
In some other work, I discovered that ujson performs better. A quick test with rss2email showed that I could get a 30 to 50% improvement in the execution time of r2e.
It requires an extra dependency, but it is probably worth it. However, I still think that a binary format would be appropriate. Any other thoughts?
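To make the idea concrete, a drop-in swap could look roughly like the sketch below. This is only an illustration: the `_json` alias matches the name mentioned later in this thread for feeds.py, but the actual call sites are not reproduced here, and ujson does not accept every stdlib keyword argument (such as a custom `cls` encoder), so those call sites would need to stay simple.

```python
# Sketch of a drop-in swap behind a single alias; ujson is an optional extra
# dependency, with the stdlib module as fallback.
try:
    import ujson as _json   # C implementation, usually faster
except ImportError:
    import json as _json    # stdlib fallback

# Illustrative data only; not rss2email's actual state structure.
state = {'feeds': [{'name': 'example', 'url': 'https://example.com/feed'}]}

# ujson.dump() takes (obj, fp) like the stdlib version, so simple call sites
# keep working unchanged with either module.
with open('state.json', 'w') as fp:
    _json.dump(state, fp)
```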
You're right that r2e could be faster, but switching to a binary format would be a step backwards. Being able to take a quick look at the database, or edit it, is a big advantage of r2e! But switching to ujson, why not? If it speeds up the whole thing, it would be a great and easy win!
One problem may be that the ujson package is not available as a repository package and has to be installed manually. For example, https://packages.debian.org/stretch/python3-ujson is still in the testing suite (and not kept up to date).
I've been running ujson without any problem for two months. I also split my configuration file into hourly and daily versions (according to each feed's typical update frequency) to minimize the CPU load.
I made some measurements using simplejson, which is normally faster than the built-in json encoder, and didn't notice a considerable runtime improvement (difference: 70 ms).
The guilty part doesn't seem to be the JSON handler itself...
Did you get different results? If so, could you post your _json.dump(...) statement from feeds.py?
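For what it's worth, a quick way to compare the encoders in isolation is a micro-benchmark along these lines. The data structure is invented purely to be roughly the size of an 8000-line state file; the absolute numbers will vary by machine, and simplejson or ujson may simply not be installed.

```python
# Rough comparison of JSON encoders on a fake feed-state structure
# (the structure is an assumption; only its size matters here).
import json
import time

data = {'feeds': [{'name': 'feed-%d' % i,
                   'url': 'https://example.com/%d/feed' % i,
                   'seen': ['entry-%d' % j for j in range(50)]}
                  for i in range(150)]}

def bench(dump, label, repeats=20):
    start = time.perf_counter()
    for _ in range(repeats):
        dump(data)
    elapsed = time.perf_counter() - start
    print('%-12s %.1f ms per dump' % (label, 1000 * elapsed / repeats))

bench(json.dumps, 'json')

try:
    import simplejson
    bench(simplejson.dumps, 'simplejson')
except ImportError:
    pass

try:
    import ujson
    bench(ujson.dumps, 'ujson')
except ImportError:
    pass
```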
I have 150 feeds monitored by r2e, and ~/.local/share/rss2email.json has about 8000 lines after a fresh run. It takes many minutes and a few hundred megabytes of memory to complete a single run.
Why JSON? It is a completely inappropriate format. What we need here is quick indexed storage, for example Berkeley DB or SQLite. Both can be viewed with existing tools, so it is not hard to debug, and the libraries are production-ready with many tutorials around.
Using such a database would drop memory usage to a few MB per run (the database is mmapped instead of being read and parsed in full), and since the database would be properly indexed, it would also increase speed significantly. SQLite can do tens of thousands of inserts per second with no trouble; BDB is even a few times faster. Retrieving data would be faster too, because the index is built once and then stored on disk alongside the data. An index lookup then does not have to load the whole index into memory; it can just read a few blocks here and there (thanks to mmap) to fetch the requested records (a single logarithmic-complexity search instead of several linear passes followed by a hash-table lookup).
Forget about binary JSON; it won't help.
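To make the suggestion concrete, a minimal sketch of an SQLite-backed "seen entries" store might look like the following. The schema, file name, and helper names are invented for illustration; they are not rss2email's actual data model.

```python
# Illustrative sketch of an indexed, on-disk store using the stdlib sqlite3
# module; the table layout is hypothetical, not rss2email's real schema.
import sqlite3

conn = sqlite3.connect('rss2email.sqlite')
conn.execute('''
    CREATE TABLE IF NOT EXISTS seen_entries (
        feed_name TEXT NOT NULL,
        entry_id  TEXT NOT NULL,
        PRIMARY KEY (feed_name, entry_id)
    )
''')

def mark_seen(feed_name, entry_id):
    # INSERT OR IGNORE keeps repeated runs idempotent.
    conn.execute('INSERT OR IGNORE INTO seen_entries VALUES (?, ?)',
                 (feed_name, entry_id))
    conn.commit()

def already_seen(feed_name, entry_id):
    # Primary-key lookup touches only a few pages of the database file
    # instead of loading and parsing the whole state.
    cur = conn.execute(
        'SELECT 1 FROM seen_entries WHERE feed_name = ? AND entry_id = ?',
        (feed_name, entry_id))
    return cur.fetchone() is not None
```

The composite primary key is what provides the indexed lookup described above, so checking whether an entry was already mailed does not require reading the whole state into memory.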
Hello,
This repository has been deprecated for a few years now, and has been replaced by https://github.com/rss2email/rss2email .
If this issue is still relevant to you, and is not fixed in v3.12.2, could you please reopen the issue there?
Cheers, Leo
Hi @wking and others.
As explained here: https://github.com/kurtmckee/feedparser/issues/44 , I did some investigation into performance and CPU consumption. I saw somebody on the web complaining about high CPU consumption, and I have similar problems too.
In addition to what I reported above and the small enhancement I tried to make, it appears that a lot of CPU time is spent in the save functions.
Is there any particular reason to choose JSON? I guess (but have not tested) that a binary format would be more efficient. I don't think anybody is going to read the JSON file anyway. I think there is room for optimization here.
Any thoughts?
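One quick way to test the "binary format would be more efficient" guess without touching r2e is a small comparison like the one below; the fake state dictionary is invented and only meant to be loosely comparable in size to a real state file.

```python
# Rough comparison of the stdlib json (text) and pickle (binary) serializers
# on a fake state dict; structure and sizes are assumptions.
import json
import pickle
import time

state = {'feed-%d' % i: ['entry-%d' % j for j in range(50)] for i in range(150)}

for label, dump in (('json', lambda d: json.dumps(d).encode()),
                    ('pickle', lambda d: pickle.dumps(d))):
    start = time.perf_counter()
    for _ in range(100):
        blob = dump(state)
    elapsed = time.perf_counter() - start
    print('%-6s %5.1f ms per dump, %6d bytes' % (label, 1000 * elapsed / 100, len(blob)))
```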