web-platform-tests / rfcs


Set policy for LLM-generated tests #202

Open jgraham opened 1 month ago

jgraham commented 1 month ago

CC @web-platform-tests/wpt-core-team

I was recently asked about the policy for using LLMs to generate tests that are submitted to wpt. Currently we don't have any explicit policy on this, but I think we need one, so it's clear to test authors what's permissible, and to downstream consumers whether wpt is in line with any other policies they might have with regard to LLMs.

To be clear, I expect that any policy here would not affect review requirements, i.e. we'd still require human review for all tests to ensure they're correct and easy to follow.

Ms2ger commented 1 month ago

Given their tendency to create output that looks plausible but isn't actually correct, I'd lean towards "don't allow" or at least increase review requirements on them. We've been relatively permissive in accepting tests based on the fact that a human has already reasoned their way through the spec while writing; if that's no longer a valid assumption, we should take that into account.

tabatkins commented 1 month ago

Yes, I'm overall on the "disallow officially" stance. If they are LLM-generated, but are good/correct enough we can't tell, then fine, but in my experience people who use LLMs like this are not particularly concerned about the quality in the first place.

jgraham commented 1 month ago

We also need to consider cases where the test is partially generated but human-modified (e.g. with Copilot-like tools).

The practical arguments I've heard in favour of allowing LLM-[partially-]generated tests are basically:

Speaking personally, I'm definitely worried about the potential for lower-quality tests being written using these technologies, but as a practical matter I don't know how much we can enforce, so I'd only want to ban them if there's some strong non-technical reason (e.g. vendors who would be unable to use wpt). However, I'd be interested in ideas for marking such tests (assuming good-faith contributors) so that we could verify whether they cause more problems in the longer term compared to purely human-generated tests.

WeizhongX commented 2 weeks ago

Maybe we need to clarify the license issue associated with LLM-generated code? One of the reasons LLMs are banned at some corporations is that the license of LLM-generated code can be unclear, and thus not safe to use.

What is the policy regarding the license now for WPTs contributed by a human?

jgraham commented 2 weeks ago

I think broadly it's the submitter's responsibility to ensure that they're able to contribute it under the terms of the 3-Clause BSD License.

zcorpan commented 2 weeks ago

I have thought about using an LLM to generate a testing plan (giving it https://web-platform-tests.org/writing-tests/making-a-testing-plan.html + the relevant part of the spec to test), and just tested it now with ChatGPT 4o for https://quirks.spec.whatwg.org/#the-html-element-fills-the-viewport-quirk and https://quirks.spec.whatwg.org/#the-body-element-fills-the-html-element-quirk

After clarifying that I want a list of 100 tests with short descriptions, it came up with various reasonable cases to test. I then asked it to write the first test, and while it made a few mistakes (it included a standards mode doctype, unnecessary margin calculations, and unnecessary `setup()` etc. boilerplate), after fixing those the test is basically correct and useful (it passes in Chrome and Safari, fails in Firefox due to bug 1259924).

As a reviewer I would probably like to know when LLMs were used and whether the contributor put in some effort to clean up and verify the correctness.

Prompts and output

- Please read this to learn how to generate a testing plan for a specification. [attachment: making-a-testing-plan.md]
- Now make a testing plan for the attached spec. [attachment: quirks-html-body.bs]
- 5 tests is not sufficient to test all interesting cases. I expect something on the order of 100 tests. Please only generate a list where each bullet point briefly explains a single test. Example: "Test `float: left` on ``, verify the border box size in the block flow direction"

> Certainly! Here is an extensive list of test cases to thoroughly cover the specification for quirks mode behavior of the `` and `` elements:
>
> ### Tests for `` Element
>
> #### Vertical Writing Mode with `width: auto`
> 1. Test `width: auto` on ``, verify the border box size in vertical writing mode.
> 2. Test `margin-left` and `margin-right` on ``, verify the border box size in vertical writing mode.
> 3. Test `margin-left: 10px; margin-right: 20px` on ``, verify the border box size in vertical writing mode.
> 4. Test `border: 5px solid` on ``, verify the border box size in vertical writing mode.
> 5. Test `padding: 10px` on ``, verify the border box size in vertical writing mode.
> 6. Test `width: auto; writing-mode: vertical-rl` on ``, verify the border box size.
>
> #### Horizontal Writing Mode with `height: auto`
> 7. Test `height: auto` on ``, verify the border box size in horizontal writing mode.
> 8. Test `margin-top` and `margin-bottom` on ``, verify the border box size in horizontal writing mode.
> 9. Test `margin-top: 10px; margin-bottom: 20px` on ``, verify the border box size in horizontal writing mode.
> 10. Test `border: 5px solid` on ``, verify the border box size in horizontal writing mode.
> 11. Test `padding: 10px` on ``, verify the border box size in horizontal writing mode.
> 12. Test `height: auto; writing-mode: horizontal-tb` on ``, verify the border box size.
>
> ### Tests for `` Element
>
> #### Vertical Writing Mode with `width: auto`
> 13. Test `width: auto` on ``, verify the border box size in vertical writing mode.
> 14. Test `margin-left` and `margin-right` on ``, verify the border box size in vertical writing mode.
> 15. Test `margin-left: 10px; margin-right: 20px` on ``, verify the border box size in vertical writing mode.
> 16. Test `border: 5px solid` on ``, verify the border box size in vertical writing mode.
> 17. Test `padding: 10px` on ``, verify the border box size in vertical writing mode.
> 18. Test `position: static` on ``, verify the border box size in vertical writing mode.
> 19. Test `position: relative` on ``, verify the border box size in vertical writing mode.
> 20. Test `position: absolute` on ``, verify the border box size in vertical writing mode.
> 21. Test `position: fixed` on ``, verify the border box size in vertical writing mode.
> 22. Test `float: none` on ``, verify the border box size in vertical writing mode.
> 23. Test `float: left` on ``, verify the border box size in vertical writing mode.
> 24. Test `float: right` on ``, verify the border box size in vertical writing mode.
> 25. Test inline-level ``, verify the border box size in vertical writing mode.
> 26. Test spanning `` (multi-column), verify the border box size in vertical writing mode.
>
> #### Horizontal Writing Mode with `height: auto`
> 27. Test `height: auto` on ``, verify the border box size in horizontal writing mode.
> 28. Test `margin-top` and `margin-bottom` on ``, verify the border box size in horizontal writing mode.
> 29. Test `margin-top: 10px; margin-bottom: 20px` on ``, verify the border box size in horizontal writing mode.
> 30. Test `border: 5px solid` on ``, verify the border box size in horizontal writing mode.
> 31. Test `padding: 10px` on ``, verify the border box size in horizontal writing mode.
> 32. Test `position: static` on ``, verify the border box size in horizontal writing mode.
> 33. Test `position: relative` on ``, verify the border box size in horizontal writing mode.
> 34. Test `position: absolute` on ``, verify the border box size in horizontal writing mode.
> 35. Test `position: fixed` on ``, verify the border box size in horizontal writing mode.
> 36. Test `float: none` on ``, verify the border box size in horizontal writing mode.
> 37. Test `float: left` on ``, verify the border box size in horizontal writing mode.
> 38. Test `float: right` on ``, verify the border box size in horizontal writing mode.
> 39. Test inline-level ``, verify the border box size in horizontal writing mode.
> 40. Test spanning `` (multi-column), verify the border box size in horizontal writing mode.
>
> ### Interaction Between `` and ``
>
> #### Different Writing Modes
> 41. Test `` horizontal and `` vertical, verify the border box size for both.
> 42. Test `` vertical and `` horizontal, verify the border box size for both.
>
> #### Same Writing Modes with Different Properties
> 43. Test `` and `` both vertical, different margins.
> 44. Test `` and `` both vertical, different paddings.
> 45. Test `` and `` both vertical, different borders.
> 46. Test `` and `` both horizontal, different margins.
> 47. Test `` and `` both horizontal, different paddings.
> 48. Test `` and `` both horizontal, different borders.
>
> ### Additional Scenarios for Edge Cases
>
> #### Complex Combinations
> 49. Test `` with `width: auto`, `` with `height: auto`.
> 50. Test `` with `height: auto`, `` with `width: auto`.
> 51. Test `` with `padding: 20px`, `` with `margin: 30px`.
> 52. Test `` with `border: 5px solid`, `` with `padding: 15px`.
> 53. Test `` with `margin: 10px`, `` with `border: 10px solid`.
>
> #### Nested Elements and Layouts
> 54. Test nested `` elements inside `` affecting the border box size.
> 55. Test nested `` elements inside `` affecting the border box size.
> 56. Test nested `