zzzprojects / html-agility-pack

Html Agility Pack (HAP) is a free and open-source HTML parser written in C# to read/write DOM and supports plain XPATH or XSLT. It is a .NET code library that allows you to parse "out of the web" HTML files.
https://html-agility-pack.net
MIT License

Bad ZZZ Code.AI suggestion #522

Closed scarabdesign closed 10 months ago

scarabdesign commented 1 year ago

1. Description

Unless I'm doing something wrong, it looks like Code.AI is suggesting bad code. I wasn't sure where to report this issue, but since it's linked directly in the HAP documentation, I figured this was probably the best place. https://html-agility-pack.net/select-nodes

As per the suggestion on the HTML Agility Pack documentation page, I asked the following question: https://zzzcode.ai/answer-question?id=447c1b4a-136f-411b-8c1a-7c6536a90ec0

2. Exception

blazor.webview.js:1  '.class1, .class2, .class3' has an invalid token.
   at MS.Internal.Xml.XPath.XPathParser.ParseXPathExpression(String xpathExpression)
   at System.Xml.XPath.XPathExpression.Compile(String xpath, IXmlNamespaceResolver nsResolver)
   at System.Xml.XPath.XPathNavigator.Select(String xpath)
   at HtmlAgilityPack.HtmlNode.SelectNodes(String xpath)
   at Viands.Data.ViewModels.VList.GetMenuBrief()
   at Viands.Data.ViewModels.VList.get_FilteredDescription()
   at Viands.Pages.Index.BuildRenderTree(RenderTreeBuilder __builder)
   at Microsoft.AspNetCore.Components.ComponentBase.<.ctor>b__6_0(RenderTreeBuilder builder)
   at Microsoft.AspNetCore.Components.Rendering.ComponentState.RenderIntoBatch(RenderBatchBuilder batchBuilder, RenderFragment renderFragment, Exception& renderFragmentException)

3. Fiddle or Project

Here is the code that the AI suggests:

HtmlDocument htmlDoc = new HtmlDocument();
htmlDoc.Load("path/to/your/html/file.html");

HtmlNodeCollection nodes = htmlDoc.DocumentNode.SelectNodes(".class1, .class2, .class3");

if (nodes != null)
{
    foreach (HtmlNode node in nodes)
    {
        // Do something with the selected nodes
    }
}

Here is my code:

        public string GetMenuBrief()
        {
            if (!string.IsNullOrEmpty(Description) && !IsSet && !IsTemplate)
            {
                var wholeString = new HtmlDocument();
                wholeString.LoadHtml(Description);
                var desc = new List<string>();
                wholeString.DocumentNode?.SelectNodes(".class1, .class2, .class3")?.ToList().ForEach(e =>
                {
                    desc.Add(e.InnerText.Replace("&nbsp;", " ").Trim());
                });
                return string.Join(", ", desc);
            }

            return Description;
        }

4. Any further technical details

Maybe I'm doing it wrong, but it looks like the AI is spewing bad results (as ChatGPT is wont to do).

elgonzo commented 1 year ago

Unless I'm doing something wrong, it looks like Code.AI is suggesting bad code. I wasn't sure where to report this issue [...]

While I am just a user of HAP and not associated with the HAP project or its authors/maintainers, I would like to point out that the issue tracker for HAP is not the right place to report issues with zzzcode.ai. Instead, I suggest you report the problem in the issue tracker for zzzcode.ai, whose project site (including its issue tracker) is also on GitHub: https://github.com/zzzprojects/zzzcode.ai

Regardless of your misplaced issue report and whatever zzzcode.ai produced in response to your question, note that XPath expressions were not invented for querying HTML, but for querying XML documents. As such, XPath has no concept of "classes" as in HTML. Consequently, you have to write the XPath expression so that it explicitly tests the class attribute for the occurrence of certain (sub)strings -- the class names -- using the contains() function, similar to what this SO Q&A details: https://stackoverflow.com/questions/1604471/how-can-i-find-an-element-by-css-class-with-xpath. To test for the occurrence of multiple strings in an attribute value, chain multiple such contains() tests with either the "and" or the "or" operator.

Also keep in mind that HAP relies on .NET's own System.Xml.XPath infrastructure, which is and will be limited to XPath 1.0 expressions.
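
As a rough sketch of the contains() approach described above, the CSS selector from the original report could be rewritten as an XPath 1.0 expression that chains one contains() test per class name with "or". This assumes the HtmlAgilityPack NuGet package is referenced; the NaiveClassQuery name and the sample markup are made up for illustration. Because contains() matches substrings, this form also selects elements whose class names merely contain the given text.

```csharp
using System;
using System.Linq;
using HtmlAgilityPack;

public static class NaiveClassQuery
{
    // Hypothetical sample markup, not taken from the report.
    public const string SampleHtml =
        "<div class=\"class1\">one</div>" +
        "<p class=\"class2 other\">two</p>" +
        "<span class=\"unrelated\">three</span>";

    // XPath 1.0 has no class selector, so each class name becomes a
    // substring test against the class attribute, joined with "or".
    public const string XPath =
        "//*[contains(@class, 'class1') " +
        "or contains(@class, 'class2') " +
        "or contains(@class, 'class3')]";

    public static string[] SelectTexts(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // SelectNodes returns null (not an empty collection) when nothing matches.
        var nodes = doc.DocumentNode.SelectNodes(XPath);
        return nodes == null
            ? Array.Empty<string>()
            : nodes.Select(n => n.InnerText).ToArray();
    }
}
```

Calling NaiveClassQuery.SelectTexts(NaiveClassQuery.SampleHtml) should pick up the div and the p element but skip the span.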

malamai123 commented 10 months ago

Nice 👍👍👍👍

elgonzo commented 10 months ago

@JonathanMagnan note that currently, after the fix, the AI code generator produces example code like

HtmlNodeCollection nodes = doc.DocumentNode.SelectNodes("//div[contains(@class, 'class1') or contains(@class, 'class2')]");

This is still not entirely correct. Someone who due to a lack of knowledge has to ask such a question to an AI assistant might intuitively expect that the given answer/code example selects div nodes having the CSS classes "class1" or "class2". But the answer/code example provided by the bot also selects div nodes with the classes "class13", "class21", "subclass10", etc., which arguably is not what the asking person is looking for.

The StackOverflow Q&A I linked to in my previous post demonstrates an accurate XPath expression for selecting CSS classes that should work flawlessly under any circumstances without making any assumptions about the use case, and therefore should be suggested by the bot instead:

//div[contains(concat(' ', normalize-space(@class), ' '), ' Test ')]

It's not pretty (because it involves padding the @class attribute value as well as the class name with spaces), but that's what is needed to get a robust and accurately working XPath expression for this task.
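
For several class names, the padded test above can be generated per class and joined with "or". The following is only a sketch under the assumption that the HtmlAgilityPack NuGet package is referenced; ClassXPath, HasClass, AnyOf and SelectTexts are hypothetical helper names, not part of HAP.

```csharp
using System;
using System.Linq;
using HtmlAgilityPack;

public static class ClassXPath
{
    // Builds the padded whole-token test for a single class name.
    // Assumes the name contains no apostrophe, since XPath 1.0 string
    // literals have no escape sequence for it.
    public static string HasClass(string className) =>
        $"contains(concat(' ', normalize-space(@class), ' '), ' {className} ')";

    // Matches any element carrying at least one of the given classes.
    public static string AnyOf(params string[] classNames) =>
        "//*[" + string.Join(" or ", classNames.Select(HasClass)) + "]";

    public static string[] SelectTexts(string html, params string[] classNames)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // SelectNodes returns null when nothing matches.
        var nodes = doc.DocumentNode.SelectNodes(AnyOf(classNames));
        return nodes == null
            ? Array.Empty<string>()
            : nodes.Select(n => n.InnerText).ToArray();
    }
}
```

With this, ClassXPath.SelectTexts(html, "class1", "class2") should skip an element whose only class is "class13", which the plain contains(@class, 'class1') test would have selected.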

JonathanMagnan commented 10 months ago

Hello @elgonzo ,

We don't choose what ChatGPT generates. It might either help you or lead you in the wrong direction, but the more time passes, the better it becomes.

Eventually, it will be easier to train a custom model for a specific subject, and then we will be able to provide it with a ton of examples of what the best practices should be, regardless of what it already knows. OpenAI is growing quickly with new features every month, so I believe that by the end of 2024 they will provide an easier way to train a custom model dedicated to Html Agility Pack (it's already possible with some third-party software at this moment).

elgonzo commented 10 months ago

@JonathanMagnan

Oh, I didn't know. Thanks for letting me know. I assumed you had changed something in how zzzcode.ai utilizes ChatGPT: after noticing you closed the issue, I did a quick check of the result it generates now and got a result different from what the OP originally got.

Cheers, and Happy New Year!

JonathanMagnan commented 10 months ago

Happy New Year @elgonzo ;)

scarabdesign commented 10 months ago

Whether the AI is improving or not, just like other modern early adopters of GPT, you're going to get bad results, and those bad results taint the believability of GPT. I'm not sure I'll fully trust the AI for accurate answers from any source or subject, as it's just a text prediction algorithm, not artificial intelligence. You may say it's a good thing not to trust it fully; however, when your product is so light on documentation and relies so heavily on GPT, you're generating support calls and confusion.

IMO, you may want to pop it back in the oven until it's done.