whatwg / html

HTML Standard
https://html.spec.whatwg.org/multipage/

Proposal: Meta Tag for AI Consent Management #9334

Open brennancaldwell opened 1 year ago

brennancaldwell commented 1 year ago

Introduction

With the rapid growth of artificial intelligence, and especially machine learning models that train on web data, the issue of data usage consent has become more relevant than ever. Currently, there is no standard way for website owners to express their consent or otherwise for AI models to use their data for training or crawling purposes. This proposal seeks to address this issue by introducing a new HTML meta tag called ai-consent.

The Proposed Solution

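I propose the introduction of an HTML meta tag named ai-consent. This tag would have a content attribute with one of three possible values: all (the content may be used for both AI training and search results), search-only (the content may be used for live search results, provided it is cited, but not for model training), or none (the content may not be used by AI at all).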

The tag would appear in the <head> of an HTML document. For example:

<meta name="ai-consent" content="all">

Use Cases and Examples

Below are some examples of how the ai-consent tag could be used:

  1. A news website owner wants their articles to be included in both AI training and search results. They would use:
<meta name="ai-consent" content="all">
  2. A personal blog author does not want their content included in AI model training but is fine with it being used for live search results, provided the blog is cited. They would use:
<meta name="ai-consent" content="search-only">
  3. A privacy-focused website's owner does not want their content used by AI at all. They would use:
<meta name="ai-consent" content="none">
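
To make the examples above concrete, here is a minimal sketch of how a crawler might read the tag, using Python's standard html.parser module. The names AiConsentParser and read_ai_consent are hypothetical. The proposal does not say what a missing or unrecognized value means, so the sketch takes that as a caller-supplied default (set to "none" here purely as an assumption).

```python
from html.parser import HTMLParser

# The three values used in the examples above.
KNOWN_VALUES = {"all", "search-only", "none"}


class AiConsentParser(HTMLParser):
    """Records the content of the first <meta name="ai-consent"> tag seen."""

    def __init__(self):
        super().__init__()
        self.value = None

    def handle_starttag(self, tag, attrs):
        if tag != "meta" or self.value is not None:
            return
        # Attribute values can be None (e.g. <meta name>), so normalize them.
        attr_map = {k.lower(): (v or "") for k, v in attrs}
        if attr_map.get("name", "").strip().lower() == "ai-consent":
            self.value = attr_map.get("content", "").strip().lower()


def read_ai_consent(html, default="none"):
    """Return the page's ai-consent value, or `default` if absent or unknown.

    The default is an assumption of this sketch, not part of the proposal.
    """
    parser = AiConsentParser()
    parser.feed(html)
    if parser.value in KNOWN_VALUES:
        return parser.value
    return default
```

A crawler would call read_ai_consent(page_html) after fetching a page and before deciding whether the content may go into a training set or a search index.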

Considerations

This proposal introduces a method for website owners to manage consent regarding AI data usage and is similar in intent to the noindex meta tag. However, it does not enforce the consent. It would be the responsibility of AI creators and operators to respect and enforce these tags, which might not happen short of robust regulation. Additionally, the proposed tag would need to be included in popular web crawlers' whitelists of meta tags.

Conclusion

The proposed ai-consent meta tag provides a standard method for website owners to express their consent for AI data usage. It would promote transparency and respect for website owners' data preferences, contributing to a more ethical web environment for AI.

rthrejheytjyrtj545 commented 1 year ago

Why should the author explicitly choose none to indicate that they do not agree? What is meant by the absence of this type of metadata?

Doesn't this proposal duplicate the existing license link type? Interested parties can already create a mechanism like CC REL and provide the appropriate legal background. This is an organizational issue, not a technological one.

brennancaldwell commented 1 year ago

These are great points! Thank you for pointing these out. I had considered just proposing all and search-only -- I believe the default assumption should be no consent.

I also agree that this is more a question of organization than technology. The details of implementation aren't important to me so much as agreeing on a standard for establishing consent specifically in the case of model training and search. Perhaps this can indeed be handled using a license link tag.

rthrejheytjyrtj545 commented 1 year ago

By the way, if you keep it as something similar to DNT, you can move the proposal to the Microformats Wiki (which will be officially recognized as a specification), or take the same proposal to WICG. Also, bikeshedding: something like notraining and nosnipping would sound more “vanilla”.

brennancaldwell commented 1 year ago

Thank you!

rthrejheytjyrtj545 commented 1 year ago

No problem. What I suggested to you in the comment above is a move away from metadata in favor of a link type.

You can, of course, write a specification and submit it to MetaExtensions, but this is a chore, and the rule “However, a new metadata name should not be created in any of the following cases: If the name is for something expected to have processing requirements in user agents; in that case it ought to be standardized” might apply, given that crawlers are also user agents in some sense. So <link href="." rel="training"> might be a good option...

ramijwar commented 1 year ago

wow that's awesome

saschanaz commented 1 year ago

FYI, DeviantArt and SketchFab came up with <meta name="robots" content="noai">.
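
For illustration, a minimal sketch of how a crawler could check for that directive, assuming the content attribute is the usual comma-separated list of robots directives. The names RobotsMetaParser and page_opts_out_of_ai are hypothetical, not part of any published API.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collects directives from every <meta name="robots"> tag on a page."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr_map = {k.lower(): (v or "") for k, v in attrs}
        if attr_map.get("name", "").strip().lower() == "robots":
            # Assumption: directives are comma-separated and case-insensitive.
            for token in attr_map.get("content", "").split(","):
                token = token.strip().lower()
                if token:
                    self.directives.add(token)


def page_opts_out_of_ai(html):
    """Return True if the page carries a robots meta tag containing noai."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return "noai" in parser.directives
```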

myakura commented 1 year ago

I believe that bots can crawl non-HTML resource files, such as source code or images. Isn't it better to define this in (or on top of) the robots.txt protocol? https://datatracker.ietf.org/doc/html/rfc9309
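
As a rough sketch of what the existing protocol already allows, the snippet below uses Python's urllib.robotparser to evaluate a robots.txt file that excludes one crawler from the whole origin, images and other non-HTML files included. The user-agent token ExampleAIBot, the file contents, and the helper can_crawl are made up for illustration. This only governs fetching; it says nothing about how fetched content is later used.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: block one AI crawler, allow everyone else.
ROBOTS_TXT = """\
User-agent: ExampleAIBot
Disallow: /

User-agent: *
Allow: /
"""


def can_crawl(robots_txt, user_agent, url):
    """Return True if `user_agent` may fetch `url` under the given robots.txt."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)
```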

rthrejheytjyrtj545 commented 1 year ago

@myakura, no, because there will be countless crawlers in the future, and the author cannot be made responsible for keeping track of them. In addition, no one wants to limit crawling in this case, only the use of the collected content.

jfhr commented 10 months ago

One consideration here is that crawlers would need to download each individual page to find out if it has an ai-consent meta tag. Downloading lots of pages just to find out you can't use them is a waste of money - as long as this is a voluntary standard, companies would be less incentivized to respect it at all.

The robots.txt standard avoids exactly that problem by having a single file for an entire origin. Perhaps a similar file could be introduced for ai consent management. e.g.

All: /documentation
Search-Only: /weblog
None: /personal

This could be hosted under a well-known URI such as /.well-known/ai-consent.txt
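
A minimal sketch of how a crawler might read such a file, assuming each line has the form "Level: /path-prefix" as in the example above. The longest-matching-prefix rule, the fallback for unmatched paths, and the function names are assumptions of this sketch; none of them are specified anywhere.

```python
VALID_LEVELS = {"all", "search-only", "none"}


def parse_ai_consent_txt(text):
    """Parse 'Level: /prefix' lines into a list of (prefix, level) pairs."""
    rules = []
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or ":" not in line:
            continue  # skip blank or malformed lines
        level, _, path = line.partition(":")
        level = level.strip().lower()
        path = path.strip()
        if level in VALID_LEVELS and path.startswith("/"):
            rules.append((path, level))
    return rules


def consent_for_path(rules, url_path, default="none"):
    """Return the level of the longest matching prefix, else `default`.

    Both the tie-breaking rule and the default are assumptions.
    """
    best = None
    for prefix, level in rules:
        if url_path.startswith(prefix):
            if best is None or len(prefix) > len(best[0]):
                best = (prefix, level)
    return best[1] if best else default
```

With the three example lines above, a request for /weblog/2023/post would resolve to search-only after a single fetch of the consent file, without downloading the page itself.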