zzzprojects / html-agility-pack

Html Agility Pack (HAP) is a free and open-source HTML parser written in C# to read/write DOM and supports plain XPATH or XSLT. It is a .NET code library that allows you to parse "out of the web" HTML files.
https://html-agility-pack.net
MIT License
2.63k stars 375 forks source link

Inconsistent comment rendering #525

Closed mems closed 9 months ago

mems commented 9 months ago

1. Description

It's not clear if HtmlDocument.CreateComment() need to include <!-- and -->. If it's included, OuterHtml include it twice. If it's not, WriteTo() with OptionOutputAsXml = true could raise an exception.

Note: the HTML Standard indicate HTML comments can't contains some sequences:

text [inside comments] must not start with the string ">", nor start with the string "->", nor contain the strings "", or "--!>", nor end with the string "<!-"

2. Exception

[System.ArgumentOutOfRangeException: Length cannot be less than zero.
Parameter name: length]
   at System.String.Substring(Int32 startIndex, Int32 length)
   at HtmlAgilityPack.HtmlNode.GetXmlComment(HtmlCommentNode comment)
   at HtmlAgilityPack.HtmlNode.WriteTo(TextWriter outText, Int32 level)
   at HtmlAgilityPack.HtmlNode.WriteTo()
   at Program.Test(Boolean outputAsXml) :line 21
   at Program.Main() :line 10

3. Fiddle or Project

A Fiddle that reproduce the issue: https://dotnetfiddle.net/vARPlD

// @nuget: HtmlAgilityPack -Version 1.11.54
using System;
using HtmlAgilityPack;

public class Program
{
    public static void Main()
    {
        Test();
        Test(true);
    }

    static void Test(bool outputAsXml = false)
    {
        Console.WriteLine(string.Format("---- Test with outputAsXml = {0} ----", outputAsXml));
        var doc = new HtmlDocument();
        doc.OptionOutputAsXml = outputAsXml;
        doc.LoadHtml("<!-- abc -->");
        var comment = doc.CreateComment(" cde ");
        Console.WriteLine(string.Format("comment.OuterHtml = {0}", comment.OuterHtml));
        Console.WriteLine(string.Format("comment.WriteTo() = {0}", comment.WriteTo()));
        doc.DocumentNode.AppendChild(comment);
        var comment2 = doc.CreateComment("<!-- fgh -->");
        Console.WriteLine(string.Format("comment2.OuterHtml = {0}", comment2.OuterHtml));
        Console.WriteLine(string.Format("commen2t.WriteTo() = {0}", comment2.WriteTo()));
        doc.DocumentNode.AppendChild(comment2);
        Console.WriteLine(string.Format("doc.DocumentNode.OuterHtml = {0}", doc.DocumentNode.OuterHtml));
        Console.WriteLine(string.Format("doc.DocumentNode.WriteTo() = {0}", doc.DocumentNode.WriteTo()));
    }
}

Will outuput:

---- Test with outputAsXml = False ----
comment.OuterHtml = <!-- cde -->
comment.WriteTo() =  cde 
comment2.OuterHtml = <!--<!-- fgh -->-->
commen2t.WriteTo() = <!-- fgh -->
doc.DocumentNode.OuterHtml = <!-- abc --> cde <!-- fgh -->
doc.DocumentNode.WriteTo() = <!-- abc --> cde <!-- fgh -->
---- Test with outputAsXml = True ----
comment.OuterHtml = <!-- cde -->
[Run-time exception]
JonathanMagnan commented 9 months ago

Hello @mems ,

Thank you for reporting. Indeed something doesn't work in this logic.

We will look at it.

Best Regards,

Jon

JonathanMagnan commented 9 months ago

Hello @mems ,

We have released today the v1.11.55: https://github.com/zzzprojects/html-agility-pack/releases/tag/v1.11.55

You can see this commit that will explain what we did:

In this version:

You can see the following Fiddle with your case on this new version: https://dotnetfiddle.net/6LvV3S

Which has the following output:

---- Test with outputAsXml = False ----
comment.OuterHtml = <!--  cde  -->
comment.WriteTo() = <!--  cde  -->
comment2.OuterHtml = <!-- fgh -->
commen2t.WriteTo() = <!-- fgh -->
doc.DocumentNode.OuterHtml = <!-- abc --><!--  cde  --><!-- fgh -->
doc.DocumentNode.WriteTo() = <!-- abc --><!--  cde  --><!-- fgh -->
---- Test with outputAsXml = True ----
comment.OuterHtml = <!--  cde  -->
comment.WriteTo() = <!--  cde  -->
comment2.OuterHtml = <!-- fgh -->
commen2t.WriteTo() = <!-- fgh -->
doc.DocumentNode.OuterHtml = <?xml version="1.0" encoding="iso-8859-1"?><span><!-- abc --><!--  cde  --><!-- fgh --></span>
doc.DocumentNode.WriteTo() = <?xml version="1.0" encoding="iso-8859-1"?><span><!-- abc --><!--  cde  --><!-- fgh --></span>

Let me know if we missing something.

Best Regards,

Jon

elgonzo commented 9 months ago

@JonathanMagnan,

please note that the code in HtmlDocument.cs adds surrounding spaces to the comment, something it should not do. Instead of "<!-- " and " -->", it should use "<!--" and "-->" without the spaces. (The code in HtmlNode.cs and HtmlCommentNode.cs appear to do it correctly.)

EDIT: I made a pull request for this: https://github.com/zzzprojects/html-agility-pack/pull/528

JonathanMagnan commented 9 months ago

Thank you @elgonzo ,

I understand your point.

Adding a space before and after is usually a good practice. It is our responsibility? Probably not and like @mems did in his example, he added the space on his side.

So we will probably go ahead with your pull ;)

mems commented 9 months ago

Since there is a release version that fix my issue, I close it now.

Thank you @JonathanMagnan

JonathanMagnan commented 9 months ago

Hello @mems ,

Just to let you know that the merge from @elgonzo has been merged (thank again!). So, no additional space will be added anymore when creating the comment tag manually.

Best Regards,

Jon