z411 / trackma

Open multi-site list manager for Unix-like systems. (ex-wMAL)
https://z411.github.io/trackma
GNU General Public License v3.0

Full Rescan CPU load #598

Open serjflint opened 2 years ago

serjflint commented 2 years ago

Clicking Rescan Library (full) pegs one core at 100% for several minutes, and the Qt interface barely responds while it runs. I have ~4k titles in total across my lists but barely 100 files on disk.

Manjaro Linux Trackma-qt 0.8.4

serjflint commented 2 years ago

(screenshot: 2022-01-21_22-17) The obvious suspect is guess_show with SequenceMatcher.
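For reference, this is roughly what a SequenceMatcher-based matcher looks like; it is only an illustrative sketch, not trackma's actual guess_show, and the function and parameter names are made up:

```python
# Illustrative sketch of SequenceMatcher-based title matching (not trackma's
# actual guess_show implementation).
from difflib import SequenceMatcher

def guess_show_sketch(filename_title, library_titles, threshold=0.7):
    """Return the library title most similar to the title parsed from a filename."""
    best_title, best_ratio = None, 0.0
    for title in library_titles:
        # Each ratio() call is roughly O(len(a) * len(b)); running it for every
        # file against every library title is what keeps one core at 100%.
        ratio = SequenceMatcher(None, filename_title.lower(), title.lower()).ratio()
        if ratio > best_ratio:
            best_title, best_ratio = title, ratio
    return best_title if best_ratio >= threshold else None
```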

serjflint commented 2 years ago

I created a dirty hack to scan only currently watching titles instead of the whole library when scan_whole_list is disabled.
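A rough sketch of what that shortcut could look like (scan_whole_list is the real option name mentioned above; everything else here is illustrative):

```python
# Hypothetical sketch: only feed "currently watching" shows into the matcher
# when scan_whole_list is disabled. Field names besides scan_whole_list are
# illustrative, not trackma's actual internals.
def titles_to_scan(library, config):
    if config.get('scan_whole_list'):
        return list(library.values())
    # Restrict the fuzzy-matching candidates to shows currently being watched.
    return [show for show in library.values()
            if show.get('my_status') == 'watching']
```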

ahmubashshir commented 2 years ago

We should scan everything in a separate thread and collect the result after it finishes.
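A minimal sketch of that idea using a plain worker thread (the helper names are illustrative, not trackma's API):

```python
# Sketch: run the library scan on a worker thread so the Qt main loop keeps
# handling events; start_rescan/on_done are illustrative names.
from concurrent.futures import ThreadPoolExecutor

_executor = ThreadPoolExecutor(max_workers=1)

def start_rescan(scan_library, on_done):
    """Run scan_library() off the main thread, then pass its result to on_done."""
    future = _executor.submit(scan_library)
    # Note: the callback fires on the worker thread; a Qt frontend would still
    # need to marshal the result back to the GUI thread (e.g. via a signal).
    future.add_done_callback(lambda f: on_done(f.result()))
```

Caveat: since SequenceMatcher is pure Python, the GIL means the scan still competes with the UI for CPU time, so this mostly helps responsiveness rather than total scan time.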

serjflint commented 2 years ago

> We should scan everything in a separate thread and collect the result after it finishes.

It solves the symptoms, but not the cause. There are better ways to find a match than brute-force searching for the longest common subsequence between every file and every title:

  1. normalize the titles (both files and library; calling lower() is just the beginning)
  2. use a hash table for the library (like the one for alternative titles)
  3. limit the candidates (like I've done in the PR above, but more can be done)

A rough sketch of ideas 1 and 2 follows.
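This is only an illustrative sketch under assumed names (normalize, build_index and match are not trackma functions), showing how a normalized hash-table lookup can make the fuzzy pass a rare fallback:

```python
# Sketch of ideas 1-2: normalize titles once, index them in a dict, and only
# fall back to SequenceMatcher when the exact lookup misses. All names here
# are illustrative, not trackma's actual code.
import re
from difflib import SequenceMatcher

def normalize(title):
    # Lowercase, drop punctuation, collapse whitespace.
    return re.sub(r'\s+', ' ', re.sub(r'[^\w\s]', ' ', title.lower())).strip()

def build_index(library_titles):
    # Hash table keyed by normalized title; alternative titles could be
    # inserted as extra keys pointing at the same show.
    return {normalize(t): t for t in library_titles}

def match(filename_title, index, threshold=0.7):
    key = normalize(filename_title)
    if key in index:
        return index[key]            # O(1) exact hit, no fuzzy pass needed
    best, best_ratio = None, 0.0
    for norm, original in index.items():
        # Fuzzy fallback runs over already-normalized keys only.
        ratio = SequenceMatcher(None, key, norm).ratio()
        if ratio > best_ratio:
            best, best_ratio = original, ratio
    return best if best_ratio >= threshold else None
```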
FichteFoll commented 2 years ago

Indeed, matching is quite slow, especially when there is no match for a file.

z411 commented 2 years ago

> I created a dirty hack to scan only currently watching titles instead of the whole library when scan_whole_list is disabled.

It is not scanning the whole library, just watching, on hold and plan to watch, as you can see here:

https://github.com/z411/trackma/blob/master/trackma/lib/libmal.py#L53

If the API doesn't present this variable, it will use statuses_start, which is indeed only Watching in most cases.
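A heavily hedged sketch of that fallback as described here (statuses_start is the attribute named above; the other attribute name is an assumption used only for illustration):

```python
# Sketch of the status-selection fallback described above. statuses_start is
# mentioned in the thread; 'statuses_library' is an assumed name, not
# necessarily trackma's actual attribute.
def statuses_to_scan(api):
    statuses = getattr(api, 'statuses_library', None)  # e.g. Watching, On Hold, Plan to Watch
    if statuses:
        return statuses
    return api.statuses_start  # usually just Watching
```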

serjflint commented 2 years ago

@z411 sorry, I said it incorrectly. I agree, it is not the whole library. But in my case the difference is negligible. I have 90% of titles in "plan_to_watch". For 2k titles the current brute-force algorithm is horribly slow. I mentioned some ideas above.

z411 commented 2 years ago

Limiting the candidates is what's being done now, but I think the main fix would be to make the algorithm faster. Since my list was never too big, I never noticed the horrifying performance, but admittedly Taiga is a lot faster even though it scans the whole list; we should really learn from that. Or maybe it's not that much faster and, as @ahmubashshir said, it's just done in a separate thread so it's not noticeable. But 121 seconds is indeed a ton; the algorithm needs to be improved.