zzzprojects / html-agility-pack

Html Agility Pack (HAP) is a free and open-source HTML parser written in C# to read/write DOM and supports plain XPATH or XSLT. It is a .NET code library that allows you to parse "out of the web" HTML files.
https://html-agility-pack.net
MIT License

`attribute.QuoteType` does not output the correct quote type #560

Closed rvishruth closed 2 months ago

rvishruth commented 2 months ago

1. Description

attribute.QuoteType does not output the correct quote type

In the code below, despite setting the GlobalAttributeValueQuote property to AttributeValueQuote.Initial, a.QuoteType returns DoubleQuote, SingleQuote, and DoubleQuote instead of the expected None, SingleQuote, and DoubleQuote.

Follow up question: Is there any way currently to detect the quote type of an attribute's value (even if it's unquoted to begin with)?

2. Exception

N/A

3. Fiddle or Project

// @nuget: HtmlAgilityPack
using System;
using HtmlAgilityPack;
using System.Linq;

public class Program
{
    public static void Main()
    {
        // Load
        string htmlString = "<p><a href=https://example.com>Link</a><a href='https://example.com'>Link</a><a href=\"https://example.com\">Link</a></p>";
        HtmlDocument dom = new HtmlDocument()
        {
            GlobalAttributeValueQuote = AttributeValueQuote.Initial
        };
        dom.LoadHtml(htmlString);
        var documentNodeDescendants = dom.DocumentNode.Descendants().ToList();
        Console.WriteLine(String.Join(",", documentNodeDescendants.SelectMany(n => n.Attributes).Select(a => a.QuoteType)));
    }
}

Output:

DoubleQuote,SingleQuote,DoubleQuote


4. Any further technical details

JonathanMagnan commented 2 months ago

Hello @rvishruth,

Thank you for reporting.

On the bright side, the attribute is correctly written without quotes when calling Console.WriteLine(dom.DocumentNode.InnerHtml);

We will look at this issue.

> Follow up question: Is there any way currently to detect the quote type of an attribute's value (even if it's unquoted to begin with)?

I'm not sure I understand your follow-up question. Could you try again?

Best Regards,

Jon

rvishruth commented 2 months ago

Thanks @JonathanMagnan! Sorry, let me rephrase:

What is the current best way to detect if an attribute's value is unquoted?

JonathanMagnan commented 2 months ago

Thanks @rvishruth, now it's 100% clear.

You can currently get it through reflection:

var internalQuoteTypeProperty = typeof(HtmlAttribute).GetProperty("InternalQuoteType", System.Reflection.BindingFlags.Public | System.Reflection.BindingFlags.NonPublic | System.Reflection.BindingFlags.Instance);

Console.WriteLine(String.Join(",", documentNodeDescendants.SelectMany(n => n.Attributes).Select(a => internalQuoteTypeProperty.GetValue(a))));

InternalQuoteType should always hold the quote type that we parsed. We probably added it in the past to support the Initial flag without impacting projects that were already using QuoteType.

Let me know if that solution could work for you. If yes, make sure to keep BindingFlags.Public, as we may make the property public in the future.
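
For readers who want to reuse the reflection workaround, here is a minimal sketch that wraps it in a helper. The class name QuoteTypeProbe and the method GetParsedQuoteType are hypothetical names chosen for illustration and are not part of HtmlAgilityPack. The sketch assumes that InternalQuoteType exists on HtmlAttribute as described above, and returns null rather than throwing if the installed version does not expose it.

```csharp
// @nuget: HtmlAgilityPack
using System;
using System.Linq;
using System.Reflection;
using HtmlAgilityPack;

public static class QuoteTypeProbe
{
    // Looked up once. Both Public and NonPublic are requested so the lookup
    // keeps working if the property's visibility changes in a later release.
    private static readonly PropertyInfo InternalQuoteTypeProperty =
        typeof(HtmlAttribute).GetProperty(
            "InternalQuoteType",
            BindingFlags.Public | BindingFlags.NonPublic | BindingFlags.Instance);

    // Hypothetical helper: returns the quote type recorded by the parser,
    // or null when the property cannot be found or read.
    public static AttributeValueQuote? GetParsedQuoteType(HtmlAttribute attribute)
    {
        if (InternalQuoteTypeProperty == null)
        {
            return null;
        }

        return InternalQuoteTypeProperty.GetValue(attribute) as AttributeValueQuote?;
    }

    public static void Main()
    {
        var dom = new HtmlDocument
        {
            GlobalAttributeValueQuote = AttributeValueQuote.Initial
        };
        dom.LoadHtml("<p><a href=https://example.com>Link</a><a href='https://example.com'>Link</a></p>");

        foreach (var attribute in dom.DocumentNode.Descendants().SelectMany(n => n.Attributes))
        {
            var parsed = GetParsedQuoteType(attribute);
            Console.WriteLine(attribute.Name + ": " + (parsed.HasValue ? parsed.Value.ToString() : "unknown"));
        }
    }
}
```

Caching the PropertyInfo avoids repeating the reflection lookup for every attribute, and the nullable return type makes the "property not available" case explicit to callers.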

Best Regards,

Jon

rvishruth commented 2 months ago

Thank you @JonathanMagnan! This makes sense!