zzzprojects / html-agility-pack

Html Agility Pack (HAP) is a free and open-source HTML parser written in C# to read/write DOM and supports plain XPATH or XSLT. It is a .NET code library that allows you to parse "out of the web" HTML files.
https://html-agility-pack.net
MIT License

What is the most proper way of disposing of HtmlDocument object #370

Open FurkanGozukara opened 4 years ago

FurkanGozukara commented 4 years ago

I am running an application that depends heavily on HtmlDocument, with hundreds of threads.

So what would be the best way of disposing of an HtmlDocument after I am done with the object?

Like below?

    vrHtmlDocument = new HtmlDocument();
    vrHtmlDocument = null;

JonathanMagnan commented 4 years ago

Hello @FurkanGozukara ,

I will ask my developer to look at it and perhaps implement the IDisposable interface.

Best Regards,

Jon



zb-z commented 4 years ago

The best way of disposing of HtmlDocument is - not disposing :-) Just as XPathDocument and XmlDocument are not "disposable", neither should HtmlDocument be.

GC knows its job - don't micromanage it - you are not in Java :-)) Pointer nullification is for complex cases where there can be multiple GC roots (threads in .NET) and you have to break a tie-up or it will never remove them on its own, but that's only if you are doing your own thread pump or something equivalent. Jerking the GC around is counter-productive.

IDisposable shouldn't be implemented unless there is something inside a class that really needs explicit closing/removal because it "locks" some expensive external resources where holding on to them for a long time is de-facto a leak. For a class that is a stand-alone "utility" class it would almost always be a bug to hold on to anything but memory internally.

So if someone spots a problem like keeping file handles or sockets (God forbid thread handles - like spawning something because "async" is "popular"), it's much better to blow the whistle on that concrete problem than to ask for "disposal" just because ...
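
As a concrete illustration of that point, here is a minimal sketch of the usual pattern: keep the document as a local, use it, and simply let it go out of scope (the Example class, ParsePage method, and html parameter are illustrative names, not part of HAP):

    using HtmlAgilityPack;

    static class Example
    {
        static int ParsePage(string html)
        {
            var doc = new HtmlDocument();
            doc.LoadHtml(html);

            // After this method returns, nothing references doc any more,
            // so the GC is free to reclaim it whenever it runs - no Dispose,
            // no nulling needed.
            var links = doc.DocumentNode.SelectNodes("//a[@href]");
            return links?.Count ?? 0;   // SelectNodes returns null when nothing matches
        }
    }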

JonathanMagnan commented 4 years ago

Thank you @zb-z for your answer.

My developer looked at this and came to the same conclusion as @zb-z.

Just setting some variable to null will not really help the GC.

Do you currently have some issues with HAP? Or was it only a question?

FurkanGozukara commented 4 years ago

Well, I have to tell you that if you are running a very heavy multi-threaded app that uses HtmlDocument, even nullifying the pointer helps. Moreover, I had a huge speed improvement when I changed the garbage collector mode:

    <gcAllowVeryLargeObjects enabled="true"/>
    <gcServer enabled="true"/>

It creates so very many objects each time a document is loaded and puts a heavy burden on the garbage collector, as far as I observed. I would really prefer an implementation of the IDisposable interface.
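
For reference, both of those elements belong under the <runtime> section of the application's app.config (a minimal sketch, assuming a .NET Framework 4.5 or later project):

    <?xml version="1.0" encoding="utf-8"?>
    <configuration>
      <runtime>
        <!-- Server GC: one managed heap and GC thread per core. -->
        <gcServer enabled="true"/>
        <!-- Allow single arrays larger than 2 GB on 64-bit platforms. -->
        <gcAllowVeryLargeObjects enabled="true"/>
      </runtime>
    </configuration>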

elgonzo commented 4 years ago

The only way I could see IDisposable helping in situations like @FurkanGozukara's is if HAP were to reuse DOM object instances instead of always creating new ones. The IDisposable implementation(s) would basically move disposed DOM objects into a cache of unused DOM object instances, or something similar to that. (But I have to admit, I don't know HAP well enough to even begin to guess how much of an undertaking implementing such functionality would be... :( )

UPDATE: It is probably best to profile the memory usage first and look at which instance types account for the bulk of the heap. If it turns out that the memory pressure does not come from (mutable) DOM type instances but perhaps from immutable/non-reusable instances like strings, then an attempt at caching DOM type instances wouldn't really solve the problem.
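
To make the caching idea concrete, here is a minimal sketch of what such a pool could look like, assuming that calling LoadHtml on a reused HtmlDocument replaces its previous DOM; the HtmlDocumentPool and Lease types are hypothetical and not part of HAP:

    using System;
    using System.Collections.Concurrent;
    using HtmlAgilityPack;

    public sealed class HtmlDocumentPool
    {
        private readonly ConcurrentBag<HtmlDocument> _pool = new ConcurrentBag<HtmlDocument>();

        // Hand out a pooled document, or a fresh one if the pool is empty.
        public Lease Rent()
        {
            if (!_pool.TryTake(out var doc))
                doc = new HtmlDocument();
            return new Lease(this, doc);
        }

        private void Return(HtmlDocument doc) => _pool.Add(doc);

        // Disposing the lease puts the document back into the pool
        // instead of leaving it for the GC.
        public sealed class Lease : IDisposable
        {
            private readonly HtmlDocumentPool _owner;
            public HtmlDocument Document { get; }

            internal Lease(HtmlDocumentPool owner, HtmlDocument doc)
            {
                _owner = owner;
                Document = doc;
            }

            public void Dispose() => _owner.Return(Document);
        }
    }

Usage would look something like:

    // pool is a shared HtmlDocumentPool instance; html is the page source
    using (var lease = pool.Rent())
    {
        lease.Document.LoadHtml(html);
        var links = lease.Document.DocumentNode.SelectNodes("//a[@href]");
    }

As the update above says, whether this actually relieves the GC depends on where the allocations really come from.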

zb-z commented 4 years ago

These two config settings that Furkan posted are interesting.

Opens the question of "how is a very large object defined" (published meaning doesn't have to equal real life meaning, wouldn't be the first time in .NET that something very efficient is left undocumented). Does Furkan's code really cross the 2GB size limit with memory usage? The AllowVeryLargeObjects at the very least means "true 64-bit everything" and that can have far reaching consequences without ever really crossing the limit.

It could be that .NET, and consequently GC, uses what I call a "controlled leak tactics" when "very large objects" are on --> lumps up a lot of otherwise small objects into one "atomic" BLOB => as long as at least one is alive it "leaks" but then releases them all as one block and doesn't do any work on them (assuming that finalizers are de-facto disabled). Depending on the implementation tactics this can be as cheap as a single int with atomic decrements (playing the role of a countdown event - minus the event :-)

Controlled leak is really a C++ repertoire (you can crash OS and send kernel into panic with one "special" variant - don't try at home :-)) [and without say 128 GB RAM, never using more than 90 for your .exe/process so that kernel can breathe] It essentially turns off GC-like mechanisms ("special" variant turns off virtual memory - for real :-).

The other one (gcServer) should always be "on" for everything anyway - how can it not be on? :-) I'm a server Dev, I take these kinds of things for granted. So if you are writing something that's really not a server - make it a server - or fake it :-)) It's the virtual machine 4.0 "smart" quirk (read infinitely stupid). It has two "versions" - "server" (== normal, on par with 2.0 and 3.0) and "client" (all stupidity of the universe collected :-), probably something for UI apps, "Metro", maybe even Unity/gaming [although Unity should be "insistent" on 2.0 - that's how it gains universal compatibility, might have changed with core-something if Unity started "speaking" Core ] )

Oh, and Core should be the "always server" equivalent. @FurkanGozukara - if you have the time and energy, that would be an interesting thing to try with the same code that's giving you a hard time - see how it fares under Core 3.1.

Also, it might be that the box where that code runs is congested with threads - check with Process Explorer [ live ]. In such a case the pressure is not on memory at all but on the CPU(s). Not enough cores to feed all the little piranhas nibbling away cycles.

FurkanGozukara commented 4 years ago

@zb-z unfortunately I currently do not have time to test :( However, I have written an article about these settings and the performance issues that I encountered during my PhD thesis software development.

Its name is Focused Web Crawler Development Challenges: ECCrawler

Direct PDF link : https://acikerisim.toros.edu.tr/xmlui/bitstream/handle/1/273/Focused%20Web%20Crawler%20Development%20Challenges%20EcCrawler.pdf?sequence=1&isAllowed=y

The experimental results presented in Table 6 clearly display the huge performance improvement when we set the garbage collector mode to server. The gray-highlighted rows present the results obtained when the garbage collector mode is set to server, and the other rows show the results for the default case. As can be seen from the table, this mode change boosts the number of crawled URLs by 336% and improves the UI responsiveness by 58.74%.

zb-z commented 4 years ago

Heh, it was pretty obvious that you were writing a crawler :-) Did you publish the source on git or somewhere?

So, hardware was definitely a problem - you have to complain to your dept chief - in writing :-) Starting with (the lack of) an SSD drive - at least a little 512 GB one :-) and ending with (the lack of) a strong network connection in a university lab environment. But the software side is all your fault :-)) It had to be a server SKU (2012 R2 or even better 2008 R2 - the last golden one - no thread-wasting "innovations" :-). Win 7 at worst, definitely not 8 (or God forbid 10 :-) and VS 2015 with .NET 4.6.2 or 4.7.2 (you have to install a "bridge over" package for VS 2015 to get there). SQL Server 2014 is fine (but 2016 is better :-) and I can bet you didn't look into in-memory indexing (shipped with 2014 - the same engine (bwtree) that runs Cosmos).

And no GUI, no WPF - nada :-) IIS runs a (good) thread pool for you and, for monitoring, a simple web page - just to see that it's alive. Everything else - dump it into logs. And since you are just playing you can abuse SQL Server to be your logger - just don't ping it too often :-)

32 gig RAM should normally be OK but for a bwtree index under SQL Server you'd really want 64 (SQL Server does that "trick" with suspending virtual memory (it was written for SQL Server :-) so you'd want to let it take "his" half of the RAM). Hyperthreading off - in BIOS. Then you can breathe a bit - and have a coke :-)

The CPU usage goal is ~60% - total. Above that you are spinning and just scorching CPU (hyperthreading multiplies waste by 2-3 - by "nature" :-)

FurkanGozukara commented 4 years ago

@zb-z well, I did all the software development and testing on my personal computer with my personal internet connection. I didn't have school infrastructure to use.

I will tell you another interesting thing. While I was working on it (during 2014-2016), at some point my internet connection would completely drop until I restarted the computer, or perhaps even the internet modem; I don't remember which it was. Also, beyond a certain number of crawling threads, my per-minute crawled page count stopped increasing even though the number of threads was surely increasing, with sufficient hardware power.

I figured out why it was happening in 2018. It was happening because, like all ISPs in Turkey, my ISP was also using a CGN system, which allows only a limited number of ports per user. Let's say my ISP reserved 1000 ports for me: I was quickly filling all of them with hundreds of concurrent crawling threads. That is why sometimes my internet connection dropped completely (the ISP was probably not handling such cases properly) or the number of crawled URLs per minute stopped increasing :)

I have also written a very important article about how using CGN logs in legal cases is dangerous. It is currently under review at a very good journal (SCI-expanded).

leus commented 3 years ago

I figured out why it was happening in 2018. It was happening because, like all ISPs in Turkey, my ISP was also using a CGN system, which allows only a limited number of ports per user. Let's say my ISP reserved 1000 ports for me: I was quickly filling all of them with hundreds of concurrent crawling threads. That is why sometimes my internet connection dropped completely (the ISP was probably not handling such cases properly) or the number of crawled URLs per minute stopped increasing :)

Hundreds of crawling threads? That's a sure recipe for socket exhaustion.
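
For completeness, a minimal sketch of one common way to keep a crawler from exhausting sockets (and CGN ports): share a single HttpClient for the whole process and cap the number of requests in flight with a SemaphoreSlim. The Crawler class, FetchAsync method, and the limit of 50 are illustrative choices, not something prescribed in this thread:

    using System.Net.Http;
    using System.Threading;
    using System.Threading.Tasks;
    using HtmlAgilityPack;

    public static class Crawler
    {
        // One shared HttpClient reuses pooled keep-alive connections
        // instead of opening a fresh socket for every request.
        private static readonly HttpClient Http = new HttpClient();

        // Cap concurrent requests rather than spawning hundreds of threads.
        private static readonly SemaphoreSlim Gate = new SemaphoreSlim(50);

        public static async Task<HtmlDocument> FetchAsync(string url)
        {
            await Gate.WaitAsync();
            try
            {
                var html = await Http.GetStringAsync(url);
                var doc = new HtmlDocument();
                doc.LoadHtml(html);
                return doc;
            }
            finally
            {
                Gate.Release();
            }
        }
    }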