evayde commented 1 year ago

Introduction

With the rapid growth of artificial intelligence, and especially machine learning models that train on web data, the following issues arise:

these models themselves train on (poorly) generated data over and over again,
users don't know whether the content is generated or not,
and search engines cannot decide the quality of content.

Currently, there is no standard way for website owners to express that AI models (partly) generated their content. This proposal seeks to address this issue by introducing a new HTML meta tag called ai-generated.

The Proposed Solution

I propose the introduction of an HTML meta tag named ai-generated. This tag would have a content attribute with the following possible values:

all: The whole main content was generated by AI
partially: The content was co-authored by AI
none: none of the content was generated by AI
unknown (internal value?): it is unknown whether the content was generated. This value should be assumed in case of an absence of the meta tag

The tag would appear in the <head> of an HTML document. For example:

<meta name="ai-generated" content="partially">
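
A minimal sketch of how a consumer (a crawler, for instance) might read this tag, assuming Python's standard-library html.parser. The names AiGeneratedMetaParser and read_ai_generated are hypothetical and not part of the proposal. An absent or unrecognized value falls back to unknown, as described above.

```python
from html.parser import HTMLParser

# Values from the proposal; "unknown" is the fallback when the tag is absent.
ALLOWED_VALUES = {"all", "partially", "none", "unknown"}


class AiGeneratedMetaParser(HTMLParser):
    """Records the first <meta name="ai-generated"> seen in a document."""

    def __init__(self):
        super().__init__()
        self.value = None

    def handle_starttag(self, tag, attrs):
        if tag != "meta" or self.value is not None:
            return
        attributes = dict(attrs)
        name = (attributes.get("name") or "").strip().lower()
        if name == "ai-generated":
            self.value = (attributes.get("content") or "").strip().lower()


def read_ai_generated(html_text):
    """Return one of the proposed values, or "unknown" if absent or invalid."""
    parser = AiGeneratedMetaParser()
    parser.feed(html_text)
    return parser.value if parser.value in ALLOWED_VALUES else "unknown"


page = '<html><head><meta name="ai-generated" content="partially"></head></html>'
print(read_ai_generated(page))
print(read_ai_generated("<html><head></head></html>"))
```

Treating an invalid value the same as a missing tag is a design choice made for this sketch only; the proposal itself does not say how invalid values should be handled.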

Use Cases

Below are some examples of when the ai-generated meta tag could be used:

1. Let search engines know the content was (partially) generated by AI

Websites use AI-generated content in different ways. In the future, search engines might be aware that the content was generated by AI (because they generated it themselves), and not providing the meta tag would automatically de-rank those websites.

2. Let users know the content was (partially) generated by AI

When browsers see this meta tag, they could visually indicate that parts of the website were authored by AI, telling the user to treat the information with caution.

3. Let AI know that this content was generated by AI

AI should be aware that the following content was already generated, and thus, the information might be flawed.

Examples

Below are examples of how to use the ai-generated meta tag:

1. The whole (main) content was generated by AI (e.g., the main chunk of text content) <meta name="ai-generated" content="all">

2. Only parts of the content were generated by AI <meta name="ai-generated" content="partially">

3. Nothing on this website was generated by AI <meta name="ai-generated" content="none">

Existing Solutions

We have two existing tags that could solve this problem, but we would have to standardize the use:

1. Meta Generator <meta name="generator" content="Chat-GPT">

The meta generator tag indicates that the structure of the document has been generated. In my opinion, this is good enough but solves a different problem. It could, however, actually be used to indicate that the structure of a website was generated by AI.

2. Meta Author This tag is more interesting as it does exactly what was proposed. But its use would have to be standardized in order to be useful:

The content was fully created by AI: <meta name="author" content="AI">

The content was co-authored by AI: <meta name="author" content="Me, AI">

The content was not created by AI: <meta name="author" content="Me">

In my opinion, having a dedicated meta tag for ai-generated is the better solution.

Other considerations

1. Why should an author use the tag? Authors need incentives to use this tag. First of all, they contribute to the quality of AI-generated content, as AI might not pick up content that had been generated. Second, we have to be able to identify the content that was generated. Adobe already tries this with Firefly, but we also need a mechanism for written content. So, in the future, Search Engines and other relevant players might punish content that was generated and doesn't explicitly state so.

2. Schema Org We could move the whole issue to Schema Org and call it a day. E.g., by proposing the ai-generated attribute to them, users could indicate whether articles etc. were generated.

3. How to show which parts of content were generated by AI? This is an unsolved problem. I am not a friend of creating a new attribute or even new tags, but currently, this might be the only way to solve it:

<span ai-generated="true">Foo</span>

Of course, this would indeed be easier if we just used the schema org solution. Or maybe a combination.
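
A rough sketch of how a consumer could pick out such spans, assuming the hypothetical ai-generated="true" attribute from the snippet above and reasonably well-formed markup (every non-void element is closed). It uses Python's standard-library html.parser; AiMarkedTextCollector and collect_ai_text are made-up names for illustration.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not affect nesting depth.
VOID_ELEMENTS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
                 "link", "meta", "source", "track", "wbr"}


class AiMarkedTextCollector(HTMLParser):
    """Collects text found inside elements carrying ai-generated="true"."""

    def __init__(self):
        super().__init__()
        self.depth = 0  # greater than 0 while inside a marked element
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_ELEMENTS:
            return
        if self.depth > 0:
            self.depth += 1  # nested element inside a marked one
        elif dict(attrs).get("ai-generated") == "true":
            self.depth = 1  # entering a marked element

    def handle_endtag(self, tag):
        if tag in VOID_ELEMENTS:
            return
        if self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth > 0 and data.strip():
            self.chunks.append(data.strip())


def collect_ai_text(html_text):
    """Return the text chunks that sit inside marked elements."""
    collector = AiMarkedTextCollector()
    collector.feed(html_text)
    return collector.chunks


print(collect_ai_text('<p>Hi <span ai-generated="true">Foo <b>bar</b></span> baz</p>'))
```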

Conclusion

The proposed ai-generated meta tag provides a standard method for website owners to express that their content was (partially) generated by AI. It would promote transparency and respect for website users, contributing to a more ethical web environment for AI.

How to declare which parts of the website are generated remains unresolved and open to discussion.

Other

I copied some of the text from this issue which proposed the ai-consent meta tag, as they were very similar. https://github.com/whatwg/html/issues/9334

Pandapip1 commented 1 year ago

It's quite possible that an image or a bit of text is AI generated, but the rest of the document isn't. Wouldn't an attribute be better?

Also, a better option might be to have a more general attribution attribute, which would be a strict superset of this feature.

evayde commented 1 year ago

It's quite possible that an image or a bit of text is AI generated, but the rest of the document isn't. Wouldn't an attribute be better?

Read "Other Considerations" number 3.

Also, a better option might be to have a more general attribution attribute, which would be a strict superset of this feature.

Wouldn't that be the Author meta tag? This was mentioned in "Existing Solutions" number 2.

Pandapip1 commented 1 year ago

Somehow I completely missed those. Yes, those are roughly what I'm thinking of.

I still think that being able to indicate which AI generated the content is valuable, and that per-element specificity is equally valuable. Worst-case, stick it on the body tag.

evayde commented 1 year ago

I still think that being able to indicate which AI generated the content is valuable

Possible with a combination of existing meta tags, imo

<meta name="author" content="Me, ChatGPT">
<meta name="ai-generated" content="partially">

This seems redundant, but it actually isn't. In this case, we set the name of the AI in the author meta content instead of just writing "AI." We cannot possibly keep track of every AI name. So this would explicitly say that AI created parts of the content, and the additional information which AI was used to create the content is added to the author meta information.

Pandapip1 commented 1 year ago

Just so you know - for multiple authors, multiple tags should be used IIRC.

<meta name="author" content="Me">
<meta name="author" content="ChatGPT">
<meta name="ai-generated" content="partially">

How about adding a flag to the author meta tag instead, like:

<meta name="author" content="Me">
<meta name="author" content="ChatGPT" ai="1">

BLamy commented 1 year ago

I think it would be really useful to be able to link embeddings databases to webpages.

For instance

<link rel="embeddings" type="openai/text-embedding-ada-002" src="./public/embeddings.sqlite">

That way a user could semantically chat with a website without having to embed the page for themselves.

Something kind of like https://til.simonwillison.net/llms/openai-embeddings-related-content

Pandapip1 commented 1 year ago

I think it would be really useful to be able to link embeddings databases to webpages.

Probably worth opening a separate issue for that. It seems out-of-scope for this proposal.

TheRealRitMan commented 1 year ago

How about if it used defined IDs and/or classes so that the area could easily be delineated and parsed? Absence of the "ai-generated" meta tag means the author is asserting there is no AI-generated content, which in turn eases backward compatibility, and the only individuals affected are AI content publishers who need to catch up with the standard.

<meta name="ai-generated" content="id=id1,id2;class=class1,class2..." />

Pandapip1 commented 1 year ago

I'd still prefer the ai="1" syntax, but I would be okay with that as a close second.

I'm really not a fan of people boycotting AI-generated work for various reasons, but at the same time I recognize that if I were one of those people, I would definitely want a standard like this. As such, I am generally +0 to this proposal as a whole.

TheRealRitMan commented 1 year ago

I absolutely agree with the bit operator as first choice, but we need to be more specific than that. In one of my use cases, I want to have a chatbot on my page, so I don't want the rest of my content to be ignored. I do not see this as boycotting AI content, simply identifying it. If the intention of the content is above board, it should not be an issue.

Pandapip1 commented 1 year ago

I think what you want is an RDFa extension, then.

TheRealRitMan commented 1 year ago

Like so? And I must disclose I used AI to help me; since that is what this thread is about, it wouldn't be right for me not to. In my defense, I haven't used markup in probably a decade.

<!DOCTYPE html>
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
   <title>AI Generated Content and Author Content Areas</title>
   <!-- Metadata for ChatGPT3.5 content -->
   <meta name="AI-generated" content="ChatGPT3.5" />
   <!-- Metadata for ChatGPT4 content -->
   <meta name="AI-generated" content="ChatGPT4" />
</head>
<body>
   <h1>Human Generated Content</h1>
   <p>by Adam</p>

   <!-- AI-generated content for ChatGPT3.5 -->
   <div id="ChatGPT3.5-content">
      <h2>ChatGPT3.5 Generated Content</h2>
      <p>This section contains content generated by ChatGPT3.5.</p>
      <!-- Add ChatGPT3.5 content here -->
   </div>

   <!-- AI-generated content for ChatGPT4 -->
   <div id="ChatGPT4-content">
      <h2>ChatGPT4 Generated Content</h2>
      <p>This section contains content generated by ChatGPT4.</p>
      <!-- Add ChatGPT4 content here -->
   </div>
</body>
</html>

Pandapip1 commented 1 year ago

No, RDFa is a standard that effectively allows you to have scoped meta tags. I'm proposing that AIPerson be an extension of Person with no new properties (but is, of course, interpreted to be an AI).

<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="UTF-8">
    <meta name="viewport" content="width=device-width, initial-scale=1.0">
    <title>Lorem Ipsum Document</title>
</head>
<body>
    <h1>About Lorem Ipsum</h1>
    <p>Lorem ipsum dolor sit amet, consectetur adipiscing elit. Nullam in vehicula arcu. Sed non urna in libero venenatis aliquam ac a odio.</p>

    <!-- RDFa Markup for Lorem Ipsum -->
    <div about="#about-lorem-ipsum" typeof="schema:CreativeWork">
        <h2 property="schema:name">Lorem Ipsum</h2>
        <p property="schema:description">Lorem ipsum dolor sit amet, consectetur adipiscing elit. Nullam in vehicula arcu. Sed non urna in libero venenatis aliquam ac a odio.</p>
        <span property="schema:datePublished">2023-09-13</span>
    </div>

    <h2>More Lorem Ipsum</h2>
    <p>Phasellus nec diam vel ipsum blandit consectetur. Nullam vestibulum, ligula at ullamcorper suscipit, neque metus sodales ex, sit amet facilisis tellus neque a libero.</p>

    <!-- RDFa Markup for More Lorem Ipsum -->
    <div about="#more-lorem-ipsum" typeof="schema:CreativeWork">
        <h3 property="schema:name">More Lorem Ipsum</h3>
        <p property="schema:description">Phasellus nec diam vel ipsum blandit consectetur. Nullam vestibulum, ligula at ullamcorper suscipit, neque metus sodales ex, sit amet facilisis tellus neque a libero.</p>
        <span property="schema:datePublished">2023-09-13</span>
    </div>

    <h2>Author</h2>

    <div about="#author" typeof="AIPerson">
        <h4 property="schema:name">AI Author</h4>
    </div>
</body>
</html>

bathos commented 1 year ago

At least with regard to some of the stated use cases, this appears to suffer from the evil bit problem.

Pandapip1 commented 1 year ago

So does robots.txt, but it's still useful and generally respected nonetheless.

TheRealRitMan commented 1 year ago

<head>
  <meta name="author" content="human author,AmazonLex,Claude,GoogleBard,Pi,LLaMA2,Copilot,ChatGPT4,etc">
  <meta name="ai-content" value="true">  <!-- SIMPLES -->
</head>

<body>

 <!-- NOW THE BROWSER KNOWS THERE IS AI CONTENT IN THIS DOCUMENT AND THE AUTHOR(S).
      END OF META ROLE; FINISH IN SCRIPT TAG -->

<h1>HUMAN GENERATED CONTENT</H1>

  <script
    type="application/ai"  // BROWSER KNOWS THIS IS LIVE AI CONTENT 
    src="/script.ai"  // IN THIS CASE LOCAL TO THE SERVER
    id="HumanAuthorMadeCustomerServiceBot" // THIS ID IS FOR THE USER'S INFORMATION
  ></script>

<h1>HUMAN GENERATED CONTENT</H1>

  <script
    type="application/ai"  // BROWSER KNOWS THIS IS LIVE AI CONTENT 
    src="https://anthropic.com/script.ai"  // IN THIS CASE OFF-SITE
    async 
    defer
    crossorigin="anonymous"
    integrity="sha256-abc123xyz456" // SECURITY WHICH NEEDS TO BE ADDRESSED AS WELL
    referrerpolicy="strict-origin-when-cross-origin"
    nonce="abc123" // CONTENT SECURITY POLICY WHITELISTS AUTHOR
    id="Claude-content"
  ></script>

<h1>HUMAN GENERATED CONTENT</H1>

  <script
    type="text/ai" // STATIC AI-GENERATED TEXT - BROWSER KNOWS THIS IS AI CONTENT
    src="/script.ai" // LOCAL SCRIPT
    id="GoogleBard-API-content" // EXAMPLE OF A TYPE OF API USE CASE
    // RENDERS WITHIN THE SCRIPT TAGS - HUMAN AUTHOR ADDS MARKUP LIKE <div id="xxx" class="yyy"></div>
  ></script>

<h1>HUMAN GENERATED CONTENT</H1>

  <script
    type="text/ai" // STATIC AI-GENERATED TEXT - BROWSER KNOWS THIS IS AI CONTENT
  ><span id="LLaMA2-content">INLINE CONTENT HERE - SORRY AGAIN GANG</span></script>

<h1>HUMAN GENERATED CONTENT</H1>

   <!--  DOES THIS SOLUTION HOLD WATER? -->

</body>

bathos commented 1 year ago

@TheRealRitMan What is that supposed to be demonstrating? What is that media type supposed to represent? What do CSP nonces or subresource integrity have to do with this concept (etc)?

TheRealRitMan commented 1 year ago

In short, the meta tag lets the browser know there is AI content. Do we need meta info to also identify where every bit of it is amongst hybrid content? I was going with @Pandapip1's suggestion of the bit operator for the meta tag: is it AI, true or false? If true, it's identified in the SCRIPT tag by the content type. Notice I had AI for a live AI because it might be a two-way connection, and ai-generated is static content furnished by script or inline.

You (and the spider) know there is AI content, you know all the authors in multiple-AI situations, and you know if it's live or static. You know where it is, and you know which author it was if the human author tells you. You know what kind of AI it is, and you know if it is remote or local. Those are a few examples. I feel like it's a bit cleaner than the RDFa route, but I am a hack, so maybe I have no business trying to define standards! If I'm being dumb, call me an idiot, it's cool :) We never had an issue this serious, and I am honored to be part of the discussion. I feel like AI is no less significant than when the Internet itself came out.

Pandapip1 commented 1 year ago

A MIME type for AI-generated content is an interesting idea, but it isn't really what MIME types are for. MIME types are for specifying what format your data is in, not for specifying other metadata. So -1 to that idea.

+1 to my RDFa and <meta author="???" ai="1"> solutions. They extend those particular standards in ways they were intended to be extended and effectively solve the problem.

+0.5 to the <meta name="ai-generated" content="id=id1,id2;class=class1,class2..." /> solution. For the latter case, I would prefer if the content was a CSS selector (e.g. <meta name="ai-generated" content="#id1, #id2, .class1, .class2">). It also solves the original problem but is a bit less clean and requires implementers to implement additional parsing beyond the normal HTML parsing.

TheRealRitMan commented 1 year ago

+1 to your upgrade of my id=id suggestion for adding the # and . identifiers for id's and classes, and removing the assignment operator.

I'm neutral on the RDFa because I found it tricky to understand. I am with you on making things as clean as possible, and the *=1 is the shortest way to indicate anything.

The MIME types were meant to indicate the type of AI content, because I realized there is more than just static content and we are going to have AI sessions, so that was forward-looking.

I'm not clear why your <meta name="ai-generated" content="#id1, #id2, .class1, .class2"> syntax doesn't work. That meta information indicates there is AI-generated content and where it resides. It has the same result as the RDFa solution with much less information, is simple to implement, and addresses the bullet points of the OP:

Users don't know whether the content is generated or not, and Search Engines cannot decide the quality of content

Either way, the spiders are going to have to make changes to weigh the content for value. +1 to you for all your input.

TheRealRitMan commented 1 year ago

Copied from @Pandapip1's tweak on my idea.

<meta name="ai-generated" content="#id1, .class1">
</head><body>
<div id="id1">AI GENERATED</div>
<div id="nonai1">NON-AI GENERATED CONTENT</div>
<div class="class1">AI GENERATED</div>
<div id="nonai2">NON-AI GENERATED CONTENT</div>

How does it get any simpler? Here is how: MOST credit to @Pandapip1

<meta author="" ai="1">

This is EVEN simpler. I don't know how much I love having this brand new technology be a child of the author tag EVEN THOUGH it is highly relevant. It does prevent you from adding a human author without any modification.

AND WE CAN'T FORGET @evayde, who offered an extensive view of the options. (IDK if these points are real or what, but I say you get MAD props for such a detailed and clear layout of the problem, the Proposed Solutions, Use Cases, Other Considerations, and Other. Whatever the most points you can get, you deserve.)

<meta name="ai-generated" content="partially">

I think this, with my addition of putting the id and classes in:

<meta name="ai-generated" content="id=id1,id2;class=class1,class2..." />

AND THEN when @Pandapip1 added the selectors and the correct way of assigning them:

<meta name="ai-generated" content="#id1, #id2, .class1, .class2">

Is the winner for being clear cut, user friendly and everyone here contributed. Anyone care to second that?

Pandapip1 commented 1 year ago

I'm fine with <meta name="ai-generated" content="css selector"> as a standard.
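
To make the idea concrete, here is a minimal sketch of how a consumer might resolve such a selector list, assuming Python's standard-library html.parser and assuming only bare "#id" and ".class" selectors are supported (full CSS selector matching would need a real selector engine). The names parse_simple_selectors, MarkedElementFinder, and find_ai_marked are hypothetical.

```python
from html.parser import HTMLParser


def parse_simple_selectors(content):
    """Split a comma-separated list into id names and class names.

    Only bare "#id" and ".class" selectors are handled in this sketch;
    anything else is silently ignored.
    """
    ids, classes = set(), set()
    for part in content.split(","):
        part = part.strip()
        if part.startswith("#") and len(part) > 1:
            ids.add(part[1:])
        elif part.startswith(".") and len(part) > 1:
            classes.add(part[1:])
    return ids, classes


class MarkedElementFinder(HTMLParser):
    """Lists start tags matched by the ai-generated selector list."""

    def __init__(self):
        super().__init__()
        self.ids, self.classes = set(), set()
        self.matches = []

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag == "meta" and (attributes.get("name") or "").lower() == "ai-generated":
            # Assumes the meta tag appears before the elements it refers to.
            self.ids, self.classes = parse_simple_selectors(attributes.get("content") or "")
            return
        element_classes = set((attributes.get("class") or "").split())
        if attributes.get("id") in self.ids or element_classes & self.classes:
            self.matches.append((tag, attributes.get("id"), attributes.get("class")))


def find_ai_marked(html_text):
    """Return (tag, id, class) tuples for elements marked as AI-generated."""
    finder = MarkedElementFinder()
    finder.feed(html_text)
    return finder.matches


doc = ('<head><meta name="ai-generated" content="#id1, .class1"></head><body>'
       '<div id="id1">AI</div><div id="nonai1">HUMAN</div>'
       '<div class="class1 x">AI</div></body>')
print(find_ai_marked(doc))
```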

TheRealRitMan commented 1 year ago

@Pandapip1 - I will say again that your ai="1" idea is simpler, but it doesn't cover hybrid situations. And since you optimized my idea, let me say you could have eliminated the double quotes since it is an integer! JK

@evayde - this is your thread what do you think about <meta name="ai-generated" content="css selector"> as a standard?

evayde commented 1 year ago

@TheRealRitMan I think that the thing with CSS selectors could be prone to a lot of errors and false positives. Especially with how CSS is used in the real world. For instance: How do we handle automatically generated class names? Maybe I am missing a use case here.

To me, it is sufficient to be able to tell that parts of the website are generated. The AI could figure it out by itself (e.g. it is able to figure out what's a navigation, what's a sidebar, what's the main content and so on). It should be a hint and not a definite guide to every generated word.

I also assume that whoever provides such a hint will most likely also use other measures to inform their users about generated content (e.g. by providing a list of sources, which could be Microdata).

On top of that, this meta tag could also act as an enabler to other (possibly proprietary) technologies. Think of WAI Aria, Schema.org, or Open Graph. So, there could be a technology or library building on that (which possibly might work with CSS class names), but in my opinion, this is out of scope of the HTML spec.

Pandapip1 commented 1 year ago

On top of that, this meta tag could also act as an enabler to other (possibly proprietary) technologies. Think of WAI Aria, Schema.org, or Open Graph. So, there could be a technology or library building on that (which possibly might work with CSS class names), but in my opinion, this is out of scope of the HTML spec.

What do you think about the other proposal, ai="1" for author tags?

evayde commented 1 year ago

@Pandapip1 I assume that you would use it as follows?

Partially by AI <meta name="author" content="name of somebody" ai="1">

Completely AI generated <meta name="author" content="" ai="1">

No AI involved <meta name="author" content="..." ai="0"> <meta name="author" content="...">

That could also work; however, it could be confusing. What do we do in these cases?

<meta name="author" content="ChatGPT" ai="0"> <- should mean no AI
<meta name="author" content="ChatGPT" ai="1"> <- should mean partially generated by AI

So, the solution might be misunderstood and open to human error, while the proposed solution is explicit. The mere existence of the author tag doesn't mean anything, so devices would have to read the contents of the tag to figure out the meaning, whereas a special ai-generated tag would explicitly state that there might be something going on with AI (or nothing at all, but it's explicit).

Also, there's another thing about my proposal: what would the following mean?

<meta name="ai-generated" content="all">
<meta name="author" content="some person">

It means everything was generated by an AI, but there's a human author. Now, it could mean that it was the person who used AI to generate the content (something that couldn't be expressed with your solution).

I don't want to simply dismiss your idea. As I mentioned earlier, I don't like to pollute HTML with more and more meta tags. And this is what I like about your approach, it reuses the author meta. Despite the shortcomings, it would still be a viable solution in my eyes.

Pandapip1 commented 1 year ago

No, there would just be one author meta tag per author, as usual. If the author is an AI, the AI flag is set.
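
A minimal sketch of how a consumer might read this variant, assuming Python's standard-library html.parser and the ai="1" flag proposed in this thread (which is not part of HTML today). The mapping onto all / partially / none is an assumption made for illustration, and AuthorCollector and summarize_authorship are hypothetical names.

```python
from html.parser import HTMLParser


class AuthorCollector(HTMLParser):
    """Collects (author, is_ai) pairs from <meta name="author"> tags."""

    def __init__(self):
        super().__init__()
        self.authors = []

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag == "meta" and (attributes.get("name") or "").lower() == "author":
            # ai="1" is the proposed flag; a missing flag is read as "not an AI".
            is_ai = attributes.get("ai") == "1"
            self.authors.append((attributes.get("content") or "", is_ai))


def summarize_authorship(html_text):
    """Derive a document-level summary from the per-author AI flags."""
    collector = AuthorCollector()
    collector.feed(html_text)
    flags = [is_ai for _, is_ai in collector.authors]
    if not flags:
        return "unknown"  # no author tags at all
    if all(flags):
        return "all"
    return "partially" if any(flags) else "none"


page = ('<meta name="author" content="Me">'
        '<meta name="author" content="ChatGPT" ai="1">')
print(summarize_authorship(page))
```

Note that this reading cannot express the case raised above, where a human author used AI to generate all of the content; that would still need a separate signal.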

MatthiasWiesmann commented 1 year ago

For schema.org, there is a proposal to map the IPTC tags values here, which I feel would be relevant:

https://github.com/schemaorg/schemaorg/issues/3392

danbri commented 11 months ago

Thanks @MatthiasWiesmann, there's a draft at https://webschemas.org/IPTCDigitalSourceEnumeration now, although we would do well to add some examples. The link to the IPTC codes is very explicit, so round-tripping between embedded-in-image metadata and published-in-a-referencing-webpage metadata ought to be straightforward in most cases.

ioaoai commented 11 months ago

Great discussion. I like the proposal.

Khrommm commented 6 months ago

I'd still prefer the ai="1" syntax, but I would be okay with that as a close second.

I'm really not a fan of people boycotting AI-generated work for various reasons, but at the same time I recognize that if I were one of those people, I would definitely want a standard like this. As such, I am generally +0 to this proposal as a whole.

AI content is low-effort content created on the basis of valuable content that took hundreds of hours of human work. Often, it is not based on sources and links because, for example, OpenAI does not provide the sources of the works on which its models were trained.

Even if it is good and substantive, it is still based on human work. Therefore, AI content should be depreciated compared to human content because it always has less value due to the impossibility or problematic nature of verifying sources and the risk of possible hallucinations contained in the text, which are sometimes difficult to detect.

Imagine a situation where, while looking for confirmation of whether what is in one text is true, you come across 10 other texts generated by AI with the same nonsense, and you become convinced that you are reading the truth even though there are no real sources.

That is why it is so important to catch it and distinguish it from human content: there is a risk that search results will be flooded with content fully generated by AI, which is often better written in terms of grammar and SEO and so, paradoxically, pushes content written by humans to a lower search position. Isn't this about all of us drowning in the hallucinations of an algorithm that will soon start learning from variations of its own output?

Pandapip1 commented 6 months ago

You make a good point. While I don't believe it's impossible to make LLMs cite their sources (and GPT-4 does when browsing the internet), I agree that tagging content as AI-generated to avoid training LLMs on LLM output is probably needed in order to avoid a feedback cycle.

Explorer09 commented 4 months ago

Hello. May I express my opinion here?

I think the two proposed taggings are good ideas, but they seem to cater to different use cases, and one isn't a substitute for the other.

<meta name="author" content="ChatGPT" ai="1"> can be useful for attributing content to the AI model that generated it. I expect the main uses of this are for the AI models themselves: (a.) to avoid being trained on data they generated themselves, lest it cause a loop in data sources; (b.) to make AIs aware that the content might be generated by a rival AI, and if that rival AI is subject to legal issues (copyright and IP infringement, or otherwise illegal content), other AIs that are fed with this data might be legally liable, too.
<meta name="ai-generated" content="all | css-selectors..."> is for more precise tagging of which content is human-made and which content is AI-generated. I expect the main uses are in browsers and search engines: (a.) browsers may want to visually highlight text, pictures, or videos as AI-generated (such as putting an AI icon on the corner of an image, or highlighting the text with a different color for analysis); (b.) browsers and search engines might implement filtering for users who want less AI-generated content in their search results.

So the use cases of the two proposed taggings don't overlap, and we can go with both.

wturrell commented 3 months ago

(I am new here.)

I like this, but I worry the phrase "ai-generated" might become out of date / too vague / meaningless (if it hasn't already) – i.e. at some stage it becomes essential to specify the type of AI, or there is some future leap beyond current LLMs.

Nostalgically, I also like the idea of reusing the term "robot" rather than "ai". I think it's quite fun, friendly terminology (robots.txt and humans.txt are easily understandable concepts). Maybe you can introduce "hybrid" or something better for partial content.

Perhaps you encourage people to reuse the crawler's User Agent string (including version number) when specifying the precise author.

I think the other consideration is audio - and to have a way of distinguishing (I don't believe there already is one) between:

an audio version of an article read by a real human
an audio version that was auto-generated
an item such as a podcast episode page where the audio is in fact the original content and the transcript is either human generated or AI transcribed.

Again, this is all partly to assist crawlers and avoid them needlessly crawling or transcribing ai-generated or simply duplicate material.

Explorer09 commented 3 months ago

May I try to give my two cents on this?

I like this, but I worry the phrase "ai-generated" might become out of date / too vague / meaningless (if it hasn't already) – i.e. at some stage it becomes essential to specify the type of AI, or there is some future leap beyond current LLMs.

My belief is that the main purpose of tagging AI content is to satisfy laws that require such labeling. Thus, a generic "AI" label is necessary. When the contents are labeled as AI-generated, it means the contents may be inaccurate, misleading, or deliberately forged. In this use case, there is no need to indicate which kind of AI generated them.

Maybe you can introduce "hybrid" or something better for partial content.

I am reserved on this. The question would be: how much AI-generated content must a work contain before it should be labeled as AI? That threshold may differ so much among media that a "partial" value could add more confusion than it resolves. I think it would be better to have a free-form text comment to indicate which parts are AI and which are not. Example: "Background and props except the trading card designs made by AI" for this Magic: The Gathering advertising image

whatwg / html

Proposal: Meta Tag for AI Generated Content #9479
