workeffortwaste / horseman

The detailed update and issue repository for the Horseman crawler.
https://gethorseman.app/

App goes blank when interacting with "larger" websites #122

Open bartholomewbartkowski opened 1 year ago

bartholomewbartkowski commented 1 year ago
[Screenshot: blank app window]

I left the app on overnight. In the morning, the app was showing just a blank window. I reran the crawl, but after around 700 URLs (I think), when I tried to interact with it, it went blank as well.

workeffortwaste commented 1 year ago

Let me look into this and see if I can reproduce it.

Can you let me know your system specs and OS version?

bartholomewbartkowski commented 1 year ago

Processor: Apple M1, 3.2 GHz, 8 cores
Memory: 16 GB RAM (12 GB free)
Video card: Apple M1 (vendor ID 106B)
Operating system: macOS 13.4

workeffortwaste commented 1 year ago

Thanks for that, I've managed to replicate the issue on an Intel device and I'm looking into the cause.

workeffortwaste commented 1 year ago

I've isolated this to the amount of data returned from the snippets exceeding the maximum the renderer can handle when they're all enabled.

The short-term fix is to enable only the snippets you need.

I'll be increasing the limits and optimising the way data is handled in the next release.

bartholomewbartkowski commented 1 year ago

Thank you!

DavidMelamed commented 1 year ago

I had this same issue. Is there any way to recover or save what it crawled so far? Is it stored somewhere? I'm also on a MacBook Pro M1. Also, if a crawl crashes like this but I managed to save some of the crawl, is there a way to pick up where it left off?

workeffortwaste commented 1 year ago

Unfortunately there's no way to recover a crawl right now. If you can sit tight a little, resolving this is my priority. I'm currently re-engineering how the data is stored so this won't happen, and opening up more memory to the app.

You're more likely to run into the problem if you're enabling all the snippets. Selectively enable only the ones you want.

The main culprits that are pushing the memory usage over the edge are the RAW HTML and Critical CSS snippets, as they return a lot of data to the table.

(Aside: My son was born a few days ago and my attention is there currently but I'll be looking at this as soon as I can)

workeffortwaste commented 11 months ago

Just an update to let you know I've been working on this and have been busy re-engineering how Horseman handles data so the Chromium memory limit isn't reached when all snippets are enabled.

DavidMelamed commented 10 months ago

Congrats on the birth of your son. I consistently run into this issue, even with only some of the snippets enabled, although admittedly I'm grabbing the full HTML. How close are you to resolving this? Would running this on a VM with more RAM solve the issue for larger sites?

workeffortwaste commented 10 months ago

Thanks. I've largely resolved this now, just onto the last few fixes and tests. Unfortunately it's taken longer than expected as it required quite a substantial rewrite of some core functionality.

Unfortunately, running it in a VM with more RAM won't resolve this, as the limit being reached is a hard-coded limit within Electron/Chromium. It's made worse on macOS, as the ceiling there is even lower.

Can I ask your reason for returning the full HTML to the table?

DavidMelamed commented 10 months ago

The few custom snippets I tried to create (I'm not a developer, so I'm relying on ChatGPT) didn't work, and I don't want to re-crawl the site.

Grabbing the HTML lets me use Code Interpreter or GPT-4 to create Beautiful Soup scripts that help me parse out what I need later.
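For example, something along these lines is roughly what I end up with (the file path and the selectors are just placeholders for whatever I'm after on a given crawl, not anything specific to Horseman's output):

```python
# Rough sketch: parse a saved page's raw HTML with Beautiful Soup.
# "page.html" and the selectors below are placeholders, not anything
# specific to Horseman's output format.
from bs4 import BeautifulSoup

with open("page.html", encoding="utf-8") as f:
    soup = BeautifulSoup(f.read(), "html.parser")

# Pull out whatever is needed later, e.g. the title and all H2 headings.
title = soup.title.string if soup.title else ""
headings = [h2.get_text(strip=True) for h2 in soup.find_all("h2")]

print(title)
print(headings)
```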

workeffortwaste commented 10 months ago

Ahh, are you using the JSON or CSV export for this?

As a side note, I'm more than happy to help you with custom snippets to get them to do what you want.
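If it's the CSV export, a rough sketch of the kind of downstream parsing described above might look like this; the file name and column headers here are assumptions for illustration, not the exact export format:

```python
# Rough sketch: read a crawl export CSV and parse each row's HTML
# with Beautiful Soup. "crawl-export.csv" and the "Raw HTML" / "URL"
# column names are illustrative assumptions, not the real export schema.
import csv
from bs4 import BeautifulSoup

with open("crawl-export.csv", newline="", encoding="utf-8") as f:
    for row in csv.DictReader(f):
        html = row.get("Raw HTML", "")
        if not html:
            continue
        soup = BeautifulSoup(html, "html.parser")
        # Example: report each crawled URL alongside its <title>.
        print(row.get("URL", ""), soup.title.string if soup.title else "")
```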