zzzprojects / html-agility-pack

Html Agility Pack (HAP) is a free and open-source HTML parser written in C# to read/write DOM and supports plain XPATH or XSLT. It is a .NET code library that allows you to parse "out of the web" HTML files.
https://html-agility-pack.net
MIT License
2.59k stars 374 forks

Self closing tags modified #58

Open ghost opened 6 years ago

ghost commented 6 years ago

I've noticed that several different aspects of self closing tags are not respected. Some examples:

< /br> becomes <br>
<img src="..." /> becomes <img src="..." >

Thanks, Chris

JonathanMagnan commented 6 years ago

Hello @chrisnelsondotca ,

You are right. We are currently working on making the parser respect the full HTML specification.

Unfortunately, this request will take a few weeks before being fully implemented.

Best Regards,

Jonathan

joergbattermann commented 6 years ago

@JonathanMagnan is there any viable workaround / ETA when this will be fixed? This issue -does- create structurally different HTML:

This:

<html>
<head></head>
  <body>
    <p align="left">
    Some list:<br />
    - A<br />
    - B<br />
    - C<br />
    - D
    </p>
  </body>
</html>

Becomes this:

<html>
<head></head>
  <body>
    <p align="left">
    Some list:<br>
    - A<br>
    - B<br>
    - C<br>
    - D</br></br></br></br>
    </p>
  </body>
</html>

.. which is obviously logically different xml / nesting than the input.

Thanks, -Jörg

JonathanMagnan commented 6 years ago

Hello @jbattermann ,

I'm currently waiting for one of my employees to become permanent (at the start of September) so that I can assign this task to him.

Fixing all issues of this kind requires making A LOT of changes to this library. I'm not sure yet when it will be fixed, since a lot of hours will be required.

Best Regards,

Jonathan

joergbattermann commented 6 years ago

@JonathanMagnan thanks for the update! I've found a workaround in the meantime that does switch the behaviour to the expected one (see also https://stackoverflow.com/a/5557297/2591):

setting the static dictionary / -entry like this:

HtmlAgilityPack.HtmlNode.ElementsFlags["br"] = HtmlElementFlag.Empty;

and setting the .OptionWriteEmptyNodes to true for the HtmlDocument instances, i.e. like this:

var htmlDocument = new HtmlDocument();
htmlDocument.OptionWriteEmptyNodes = true;

.. does result in the expected <br/> tags.

ColinM9991-zz commented 6 years ago

Hi all,

Another workaround I've found, along with what @jbattermann has posted, is to set the OptionOutputAsXml flag to true on the HtmlDocument object.
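For anyone who wants to try the two workarounds mentioned so far side by side, here is a minimal sketch. It assumes the HtmlAgilityPack NuGet package is referenced; the class and method names (SelfClosingWorkarounds, WithEmptyNodes, AsXml) are made up for illustration, and the exact output may differ between HAP versions.

```csharp
using System;
using HtmlAgilityPack;

// Hypothetical helper class, not part of HAP.
public static class SelfClosingWorkarounds
{
    // Workaround 1: flag <br> as an empty element and write empty nodes self-closed.
    public static string WithEmptyNodes(string html)
    {
        // ElementsFlags is a static dictionary, so this change applies to
        // every HtmlDocument in the process, not just the one created here.
        HtmlNode.ElementsFlags["br"] = HtmlElementFlag.Empty;

        var doc = new HtmlDocument();
        doc.OptionWriteEmptyNodes = true;
        doc.LoadHtml(html);
        return doc.DocumentNode.OuterHtml;
    }

    // Workaround 2: ask HAP to serialize the document as XML.
    public static string AsXml(string html)
    {
        var doc = new HtmlDocument();
        doc.OptionOutputAsXml = true;
        doc.LoadHtml(html);
        return doc.DocumentNode.OuterHtml;
    }
}
```

Calling Console.WriteLine(SelfClosingWorkarounds.WithEmptyNodes("<p>A<br />B</p>")) and the same with AsXml should make it easy to compare what each option writes for your version.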

JonathanMagnan commented 6 years ago

Hello @ColinM9991 ,

Thank you for reporting.

I have added this issue to the list to check this week. We recently started changing/improving the parser to better respect the HTML5 spec.

I will also check whether we can fix some well-known tags such as br and img.

Best Regards,

Jonathan

JonathanMagnan commented 6 years ago

Hello guys,

I have done some research, and it looks like the ending '/' is optional; most browsers don't show it when you open their console. See: http://w3c.github.io/html-reference/syntax.html#void-element

The question now is whether we should respect the input or standardize it and never show the '/' when it is optional.

Currently, since it doesn't seem to break anything and respects the HTML5 rules, my suggestion is to close this issue and move on. That will give us more time to develop HAP 2.x.

@jbattermann ,

I cannot reproduce this problem, but some ending tags have been fixed in the past month, so perhaps it has been included in another change.

Best Regards,

Jonathan

joergbattermann commented 6 years ago

Jonathan, XHTML syntax is an application of XML and therefore non-closed tags such as <br> would violate its rules / well-formedness (see also http://w3c.github.io/html-reference/documents.html#conformant-xml and for HTML5 https://www.w3.org/TR/html5/the-xhtml-syntax.html#xhtml).

So the original issue reported by @chrisnelsondotca and my example would still apply and result in logically incorrect/modified output to what was provided as input..
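To make the well-formedness point concrete, here is a small sketch that uses only System.Xml.Linq from the .NET base class library (no HAP involved). The XmlCheck class and IsWellFormedXml method are hypothetical names for illustration.

```csharp
using System.Xml;
using System.Xml.Linq;

// Hypothetical helper, for illustration only.
public static class XmlCheck
{
    // Returns true when the markup parses as a well-formed XML document.
    public static bool IsWellFormedXml(string markup)
    {
        try
        {
            XDocument.Parse(markup);
            return true;
        }
        catch (XmlException)
        {
            // Unclosed or mismatched tags, such as a bare <br>, end up here.
            return false;
        }
    }
}
```

With this, IsWellFormedXml("<p>A<br />B</p>") returns true, while IsWellFormedXml("<p>A<br>B</p>") returns false because the <br> start tag is never closed.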

If the problem still exists I'd personally keep the issue open (maybe someone comes along and wants to fix it) but it's up to you. I myself can currently live with the reported workaround.

Thanks either way! 👍

JonathanMagnan commented 6 years ago

Thank you for this additional information.

Perhaps adding an option to force this optional closing tag may be the best idea?

Best Regards,

Jonathan

joergbattermann commented 6 years ago

Yeah - for non-XHTML it would make sense to make it opt-in / off by default because in that case the specs simply allow it both ways and it is a developer's choice for sure.

For XHTML however I'd make it opt-out / on by default because developers generating / working with XHTML would probably expect it to produce well-formed XML by default.

OpenSpacesAndPlaces commented 3 years ago

@JonathanMagnan Is there a fix for this?

Using the package reference Include="HtmlAgilityPack.NetCore" Version="1.5.0.1".

I'm trying to modify an SVG and it keeps removing the self-closers.

OptionWriteEmptyNodes seems to have no effect.


The only workaround I've seen so far would be to "hack" in a post-processing step with regex: https://html-agility-pack.net/knowledge-base/10188285/html-agility-pack-stripping-self-closing-tags-from-input
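A rough sketch of what such a regex post-process could look like is below. It is not the code from the linked page; the class name, method name and the list of tag names are assumptions chosen for illustration, and it only covers a few HTML void elements (SVG elements would need their own list). It is a hack: attribute values that contain '>' and tag-like text inside scripts or comments will trip it up.

```csharp
using System.Text.RegularExpressions;

// Hypothetical post-processor, for illustration only.
public static class VoidTagPostProcessor
{
    // Matches an opening tag for a few void elements, with or without a trailing '/'.
    // Group 1 is the tag name, group 2 the raw attribute text.
    private static readonly Regex VoidTag = new Regex(
        @"<(br|hr|img|input|link|meta)\b([^<>]*?)\s*/?>",
        RegexOptions.IgnoreCase | RegexOptions.Compiled);

    // Rewrites every matched tag so that it ends in " />".
    public static string SelfCloseVoidTags(string html)
    {
        return VoidTag.Replace(html, "<$1$2 />");
    }
}
```

The idea would be to run SelfCloseVoidTags over the string that HAP produces, for example over doc.DocumentNode.OuterHtml, right before saving. Tags that are already self-closed pass through unchanged.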

sangeethnandakumar commented 2 months ago

@JonathanMagnan This is a real problem. Today I used HtmlAgilityPack to work on the HTML of an EPUB file.

1. I added an image programmatically

 string imageHtml = $"<img id='{Guid.NewGuid()}' imgindex='{chapter.SelectedImages.FirstOrDefault()}'  width='800px' src='Art/{Path.GetFileName(imagePaths.FirstOrDefault())}' style='max-width: 100%; height: auto; display: block; margin: 16px auto; mask-image: linear-gradient(to top, transparent 0%, white 10%, white 90%, transparent 100%), linear-gradient(to right, transparent 0%, white 10%, white 90%, transparent 100%), linear-gradient(to bottom, transparent 0%, white 10%, white 90%, transparent 100%), linear-gradient(to left, transparent 0%, white 10%, white 90%, transparent 100%);'/>";
 HtmlNode imageNodeBeforeFirstPara = HtmlNode.CreateNode(imageHtml);
 firstPara.ParentNode.InsertBefore(imageNodeBeforeFirstPara, firstPara);

 imageHtml = $"<img id='{Guid.NewGuid()}' imgindex='{chapter.SelectedImages.FirstOrDefault()}' width='800px' src='Art/{Path.GetFileName(imagePaths.LastOrDefault())}' style='max-width: 100%; height: auto; display: block; margin: 16px auto; mask-image: linear-gradient(to top, transparent 0%, white 10%, white 90%, transparent 100%), linear-gradient(to right, transparent 0%, white 10%, white 90%, transparent 100%), linear-gradient(to bottom, transparent 0%, white 10%, white 90%, transparent 100%), linear-gradient(to left, transparent 0%, white 10%, white 90%, transparent 100%);'/>";
 HtmlNode imageNodeAfterLastPara = HtmlNode.CreateNode(imageHtml);
 lastPara.ParentNode.InsertAfter(imageNodeAfterLastPara, lastPara);

2. This rendered like this, with the ending tag missing

image

3. The EPub viewer crashed

image

Note that this is an XHTML file; it also crashes in the browser. Adding a simple option to enable it would be very helpful.


Even if I set XMLOutput to true or enable WriteEmptyTag = true, I get the same result:

image


FIX & ASSOCIATED BUG

It worked only when I enabled the WriteEmptyTag option specifically for the tag. The same option declared globally does not work.

image

JonathanMagnan commented 2 months ago

Hello @sangeethnandakumar ,

The img tag doesn't need to be closed. HTML Agility Pack is made for HTML, not for XHTML, which is much stricter.

Unfortunately, I believe your workaround by specifying the options in the tag will be the only solution if you really want it closed. The Action<HtmlDocument> htmlDocumentBuilder we added was to specifically handle this kind of scenario and allow people to set options for the node.

(Options set on the HtmlDocument don't always work when you create a node, as you noticed)
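A minimal sketch of the per-node approach, assuming your HAP version exposes an HtmlNode.CreateNode overload that takes the Action<HtmlDocument> htmlDocumentBuilder mentioned above (older releases may not have it). NodeFactory and CreateSelfClosedNode are hypothetical names.

```csharp
using System;
using HtmlAgilityPack;

// Hypothetical helper, for illustration only.
public static class NodeFactory
{
    // Creates a node whose owning document has OptionWriteEmptyNodes enabled,
    // so the option is in place for the node itself rather than only for the
    // document it is later inserted into.
    public static HtmlNode CreateSelfClosedNode(string html)
    {
        return HtmlNode.CreateNode(html, doc => doc.OptionWriteEmptyNodes = true);
    }
}
```

For example, var img = NodeFactory.CreateSelfClosedNode("<img src='Art/cover.png' />"); followed by Console.WriteLine(img.OuterHtml); lets you check how the tag is written before inserting it with InsertBefore or InsertAfter. The file name here is a placeholder.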

Best Regards,

Jon

sangeethnandakumar commented 2 months ago

Thanks for the clarification @JonathanMagnan