Open evayde opened 1 year ago
It's quite possible that an image or a bit of text is AI generated, but the rest of the document isn't. Wouldn't an attribute be better?
Also, a better option might be to have a more general attribution attribute, which would be a strict superset of this feature.
It's quite possible that an image or a bit of text is AI generated, but the rest of the document isn't. Wouldn't an attribute be better?
Read "Other Considerations" number 3.
Also, a better option might be to have a more general attribution attribute, which would be a strict superset of this feature.
Wouldn't that be the Author meta tag? This was mentioned in "Existing Solutions" number 2.
Somehow I completely missed those. Yes, those are roughly what I'm thinking of.
I still think that being able to indicate which AI generated the content is valuable, and that per-element specificity is equally valuable. Worst-case, stick it on the body tag.
I still think that being able to indicate which AI generated the content is valuable
Possible with a combination of existing meta tags, imo
<meta name="author" content="Me, ChatGPT">
<meta name="ai-generated" content="partially">
This seems redundant, but it actually isn't. In this case, we set the name of the AI in the author meta content instead of just writing "AI." We cannot possibly keep track of every AI name. So this would explicitly say that AI created parts of the content, and the additional information which AI was used to create the content is added to the author meta information.
Just so you know - for multiple authors, multiple tags should be used IIRC.
<meta name="author" content="Me">
<meta name="author" content="ChatGPT">
<meta name="ai-generated" content="partially">
How about adding a flag to the author meta tag instead, like:
<meta name="author" content="Me">
<meta name="author" content="ChatGPT" ai="1">
I think it would be really useful to be able to link embeddings databases to webpages.
For instance
<link rel="embeddings" type="openai/text-embedding-ada-002" src="./public/embeddings.sqlite">
That way a user could semantically chat with a website without having to embed the page for themselves.
Something kind of like https://til.simonwillison.net/llms/openai-embeddings-related-content
I think it would be really useful to be able to link embeddings databases to webpages.
Probably worth opening a separate issue for that. It seems out-of-scope for this proposal.
How about defining IDs and/or classes so that the area can easily be delineated and parsed? Absence of the "ai-generated" meta tag would mean the author is asserting there is no AI-generated content, which in turn eases backward compatibility; the only individuals affected are AI content publishers, who need to catch up with the standard.
<meta name="ai-generated" content="id=id1,id2;class=class1,class2..." />
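For illustration, here is a minimal sketch of the extra parsing this `id=...;class=...` syntax would require from a consumer. The function name `to_selector_list` and the exact grammar (groups separated by `;`, names separated by `,`) are assumptions read off the example above, not part of any spec. It converts the value into an equivalent list of simple CSS selectors.

```python
def to_selector_list(content: str) -> str:
    """Convert 'id=id1,id2;class=class1,class2' into '#id1, #id2, .class1, .class2'.

    Hypothetical helper: the grammar is inferred from the example in this thread.
    """
    prefixes = {"id": "#", "class": "."}
    selectors = []
    for group in content.split(";"):
        key, separator, names = group.partition("=")
        prefix = prefixes.get(key.strip().lower())
        if not separator or prefix is None:
            continue  # skip groups this sketch does not understand
        for name in names.split(","):
            name = name.strip()
            if name:
                selectors.append(prefix + name)
    return ", ".join(selectors)


print(to_selector_list("id=id1,id2;class=class1,class2"))
```

Unknown keys are skipped rather than rejected, which is only one possible choice; a stricter consumer might treat the whole value as invalid.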
I'd still prefer the ai="1" syntax, but I would be okay with that as a close second.
I'm really not a fan of people boycotting AI-generated work for various reasons, but at the same time I recognize that if I were one of those people, I would definitely want a standard like this. As such, I am generally +0 to this proposal as a whole.
I absolutely agree with the bit operator as first choice, but we need to be more specific than that. In one of my use cases, I want to have a chatbot on my page, so I don't want the rest of my content to be ignored. I do not see this as boycotting AI content, simply identifying it. If the intention of the content is above board, it should not be an issue.
I think what you want is an RDFa extension, then.
Like so? I must disclose that I used AI to help me; since that is what this thread is about, it wouldn't be right for me not to. In my defense, I haven't used markup in probably a decade.
<!DOCTYPE html>
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<title>AI Generated Content and Author Content Areas</title>
<!-- Metadata for ChatGPT3.5 content -->
<meta name="AI-generated" content="ChatGPT3.5" />
<!-- Metadata for ChatGPT4 content -->
<meta name="AI-generated" content="ChatGPT4" />
</head>
<body>
<h1>Human Generated Content</h1>
<p>by Adam</p>
<!-- AI-generated content for ChatGPT3.5 -->
<div id="ChatGPT3.5-content">
<h2>ChatGPT3.5 Generated Content</h2>
<p>This section contains content generated by ChatGPT3.5.</p>
<!-- Add ChatGPT3.5 content here -->
</div>
<!-- AI-generated content for ChatGPT4 -->
<div id="ChatGPT4-content">
<h2>ChatGPT4 Generated Content</h2>
<p>This section contains content generated by ChatGPT4.</p>
<!-- Add ChatGPT4 content here -->
</div>
</body>
</html>
No, RDFa is a standard that effectively allows you to have scoped meta tags. I'm proposing that AIPerson be an extension of Person with no new properties (but is, of course, interpreted to be an AI).
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="UTF-8">
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<title>Lorem Ipsum Document</title>
</head>
<body>
<h1>About Lorem Ipsum</h1>
<p>Lorem ipsum dolor sit amet, consectetur adipiscing elit. Nullam in vehicula arcu. Sed non urna in libero venenatis aliquam ac a odio.</p>
<!-- RDFa Markup for Lorem Ipsum -->
<div about="#about-lorem-ipsum" typeof="schema:CreativeWork">
<h2 property="schema:name">Lorem Ipsum</h2>
<p property="schema:description">Lorem ipsum dolor sit amet, consectetur adipiscing elit. Nullam in vehicula arcu. Sed non urna in libero venenatis aliquam ac a odio.</p>
<span property="schema:datePublished">2023-09-13</span>
</div>
<h2>More Lorem Ipsum</h2>
<p>Phasellus nec diam vel ipsum blandit consectetur. Nullam vestibulum, ligula at ullamcorper suscipit, neque metus sodales ex, sit amet facilisis tellus neque a libero.</p>
<!-- RDFa Markup for More Lorem Ipsum -->
<div about="#more-lorem-ipsum" typeof="schema:CreativeWork">
<h3 property="schema:name">More Lorem Ipsum</h3>
<p property="schema:description">Phasellus nec diam vel ipsum blandit consectetur. Nullam vestibulum, ligula at ullamcorper suscipit, neque metus sodales ex, sit amet facilisis tellus neque a libero.</p>
<span property="schema:datePublished">2023-09-13</span>
</div>
<h2>Author</h2>
<div about="#author" typeof="AIPerson">
<h4 property="schema:name">AI Author</h4>
</div>
</body>
</html>
At least with regard to some of the stated use cases, this appears to suffer from the evil bit problem.
So does robots.txt, but it's still useful and generally respected nonetheless.
<head>
<meta name="author" content="human author,AmazonLex,Claude,GoogleBard,Pi,LLaMA2,Copilot,ChatGPT4,etc">
<meta name="ai-content" value=true> <!-- SIMPLES -->
</head>
<body>
<!-- NOW THE BROWSER KNOWS THERE IS AI CONTENT IN THIS DOCUMENT AND THE AUTHOR(S)
END OF META ROLE `FINISH IN SCRIPT TAG -->
<h1>HUMAN GENERATED CONTENT</H1>
<script
type="application/ai" // BROWSER KNOWS THIS IS LIVE AI CONTENT
src="/script.ai" // IN THIS CASE LOCAL TO THE SERVER
id="HumanAuthorMadeCustomerServiceBot" // THIS ID IS FOR THE USER'S INFORMATION
></script>
<h1>HUMAN GENERATED CONTENT</H1>
<script
type="application/ai" // BROWSER KNOWS THIS IS LIVE AI CONTENT
src="https://anthropic.com/script.ai" // IN THIS CASE OFF-SITE
async
defer
crossorigin="anonymous"
integrity="sha256-abc123xyz456" // SECURITY WHICH NEEDS TO BE ADDRESSED AS WELL
referrerpolicy="strict-origin-when-cross-origin"
nonce="abc123" // CONTENT SECURITY POLICY WHITELISTS AUTHOR
id="Claude-content"
></script>
<h1>HUMAN GENERATED CONTENT</H1>
<script
type="text/ai" // STATIC AI-GENERATED TEXT - BROWSER KNOWS THIS IS AI CONTENT
src="/script.ai" // LOCAL SCRIPT
id="GoogleBard-API-content" // EXAMPLE OF A TYPE OF API USE CASE
// RENDERS WITHIN THE SCRIPT TAGS - HUMAN AUTHOR ADDS MARKUP LIKE <div id="xxx" class="yyy"></div>
></script>
<h1>HUMAN GENERATED CONTENT</H1>
<script
type="text/ai" // STATIC AI-GENERATED TEXT - BROWSER KNOWS THIS IS AI CONTENT
><span id="LLaMA2-content">INLINE CONTENT HERE - SORRY AGAIN GANG</span></script>
<h1>HUMAN GENERATED CONTENT</H1>
<!-- DOES THIS SOLUTION HOLD WATER? -->
</body>
@TheRealRitMan What is that supposed to be demonstrating? What is that media type supposed to represent? What do CSP nonces or subresource integrity have to do with this concept (etc)?
In short, the meta tag lets the browser know there is AI content. Do we need meta info to also identify where every bit of it is amongst hybrid content? I was going with @Pandapip1's suggestion of the bit operator for the meta tag: is it AI, true or false? If true, it's identified in the SCRIPT tag by the content type. Notice I had AI for a live AI, because it might be a two-way connection, and ai-generated is static content furnished by script or inline.
You (and the spider) know there is AI content. You know all the authors in multiple-AI situations, you know if it's live or static, you know where it is, and you know which author it was if the human author tells you. You know what kind of AI it is, and you know if it is remote or local. Those are a few examples. I feel like it's a bit cleaner than the RDFa route, but I am a hack, so maybe I have no business trying to define standards! If I'm being dumb, call me an idiot, it's cool :) We never had an issue this serious and I am honored to be part of the discussion. I feel like AI is no less significant than when the Internet itself came out.
A MIME type for AI-generated content is an interesting idea, but it isn't really what MIME types are for. MIME types are for specifying what format your data is in, not for specifying other metadata. So -1 to that idea.
+1 to my RDFa and <meta author="???" ai="1"> solutions. They extend those particular standards in ways they were intended to be extended and effectively solve the problem.
+0.5 to the <meta name="ai-generated" content="id=id1,id2;class=class1,class2..." /> solution. For the latter case, I would prefer if the content was a CSS selector (e.g. <meta name="ai-generated" content="#id1, #id2, .class1, .class2">). It also solves the original problem but is a bit less clean and requires implementers to implement additional parsing beyond the normal XML parsing.
+1 to your upgrade of my id=id suggestion for adding the # and . identifiers for id's and classes, and removing the assignment operator.
I'm neutral on the RDFa because I found it tricky to understand. I am with you on making things as clean as possible, and the *=1 is the shortest way to indicate anything.
The mime types were meant to indicate the type of AI content because I realized there is more than just static content, and we are going to have AI sessions so that was forward looking and
I'm not clear why your <meta name="ai-generated" content="#id1, #id2, .class1, .class2"> syntax doesn't work; that meta information indicates there is AI-generated content and where it resides. It has the same result as the RDFa solution with much less information, is simple to implement, and addresses the bullet points of the OP:
Users don't know whether the content is generated or not, and Search Engines cannot decide the quality of content
Either way, the spiders are going to have to make changes to weigh the content for value. +1 to you for all your input.
Copied from @Pandapip1's tweak on my idea:
<meta name="ai-generated" content="#id1, .class1">
</head><body>
<div id="id1">AI GENERATED</div>
<div id="nonai1">NON-AI GENERATED CONTENT</div>
<div class="class1">AI GENERATED</div>
<div id="nonai2">NON-AI GENERATED CONTENT</div>
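As a rough consumer-side sketch of the selector idea above, the following uses Python's standard `html.parser` to read the meta tag and flag matching elements. It is deliberately limited: it only understands bare `#id` and `.class` selectors, and it assumes the meta tag appears before the elements it describes. The names `parse_simple_selectors` and `AIRegionFinder` are made up for this example; a real implementation would need a full CSS selector engine.

```python
from html.parser import HTMLParser


def parse_simple_selectors(content):
    """Split '#id1, .class1' into a set of ids and a set of class names."""
    ids, classes = set(), set()
    for selector in content.split(","):
        selector = selector.strip()
        if selector.startswith("#") and len(selector) > 1:
            ids.add(selector[1:])
        elif selector.startswith(".") and len(selector) > 1:
            classes.add(selector[1:])
        # Anything else (combinators, attribute selectors, ...) is ignored here.
    return ids, classes


class AIRegionFinder(HTMLParser):
    """Records elements whose id or class is listed in the ai-generated meta tag."""

    def __init__(self):
        super().__init__()
        self.ids = set()
        self.classes = set()
        self.flagged = []  # (tag, id, sorted class list) per flagged element

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag == "meta" and (attributes.get("name") or "").lower() == "ai-generated":
            self.ids, self.classes = parse_simple_selectors(attributes.get("content") or "")
            return
        element_id = attributes.get("id")
        element_classes = set((attributes.get("class") or "").split())
        if element_id in self.ids or element_classes & self.classes:
            self.flagged.append((tag, element_id, sorted(element_classes)))


page = """<head><meta name="ai-generated" content="#id1, .class1"></head><body>
<div id="id1">AI GENERATED</div>
<div id="nonai1">NON-AI GENERATED CONTENT</div>
<div class="class1 other">AI GENERATED</div>
</body>"""

finder = AIRegionFinder()
finder.feed(page)
finder.close()
for tag, element_id, element_classes in finder.flagged:
    print(tag, element_id, element_classes)
```

Note that this matches on class names exactly as written in the markup, so automatically generated class names would need to be kept in sync with the meta tag by whatever tool generates them.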
How does it get any simpler? Here is how: MOST credit to @Pandapip1
<meta author="" ai="1">
This is EVEN simpler. I don't know how much I love having this brand new technology be a child of the author tag EVEN THOUGH it is highly relevant. It does prevent you from adding a human author without any modification.
AND WE CAN'T FORGET @evayde, who offered an extensive view of the options. (IDK if these points are real or what, but I say you get MAD props for such a detailed and clear layout of the problem, the Proposed Solution, Use Cases, Other Considerations, and Other. Whatever the most points you can get, you deserve.)
<meta name="ai-generated" content="partially">
I think this, with my addition of putting the id and classes in:
<meta name="ai-generated" content="id=id1,id2;class=class1,class2..." />
AND THEN when @Pandapip1 added the selectors and the correct way of assigning them:
<meta name="ai-generated" content="#id1, #id2, .class1, .class2">
Is the winner for being clear cut, user friendly and everyone here contributed. Anyone care to second that?
I'm fine with <meta name="ai-generated" content="css selector"> as a standard.
@Pandapip1 - I will say again that your ai="1" idea is simpler, but it doesn't cover hybrid situations. And since you optimized my idea, let me say you could have eliminated the double quotes since it is an integer! JK
@evayde - this is your thread, what do you think about <meta name="ai-generated" content="css selector"> as a standard?
@TheRealRitMan I think that the thing with CSS selectors could be prone to a lot of errors and false positives. Especially with how CSS is used in the real world. For instance: How do we handle automatically generated class names? Maybe I am missing a use case here.
To me, it is sufficient to be able to tell that parts of the website are generated. The AI could figure out the rest by itself (e.g. crawlers are already able to figure out what's a navigation, what's a sidebar, what's the main content and so on). It should be a hint and not a definite guide to every generated word.
I also assume that whoever provides such a hint will most likely also use other measures to inform their users about generated content (e.g. by providing a list of sources, which could be Microdata).
On top of that, this meta tag could also act as an enabler to other (possibly proprietary) technologies. Think of WAI Aria, Schema.org, or Open Graph. So, there could be a technology or library building on that (which possibly might work with CSS class names), but in my opinion, this is out of scope of the HTML spec.
On top of that, this meta tag could also act as an enabler to other (possibly proprietary) technologies. Think of WAI Aria, Schema.org, or Open Graph. So, there could be a technology or library building on that (which possibly might work with CSS class names), but in my opinion, this is out of scope of the HTML spec.
What do you think about the other proposal, ai="1" for author tags?
@Pandapip1 I assume that you would use it as follows?
Partially by AI
<meta name="author" content="name of somebody" ai="1">
Completely AI generated
<meta name="author" content="" ai="1">
No AI involved
<meta name="author" content="..." ai="0">
<meta name="author" content="...">
That could also work, however, it could be confusing. What do we do in these cases?
<meta name="author" content="ChatGPT" ai="0"> <- should mean no AI
<meta name="author" content="ChatGPT" ai="1"> <- should mean partially generated by AI
So, the solution might be misunderstood and open to human error, while the proposed solution is explicit: The mere existence of the author tag doesn't mean anything, so devices would have to read the contents of the tag to figure out the meaning, whereas a special ai-generated tag would explicitly state that there might be something going on with AI (or nothing at all, but it's explicit).
Also, there's another thing about my proposal. What would this mean?
<meta name="ai-generated" content="all">
<meta name="author" content="some person">
It means everything was generated by an AI, but there's a human author. Now, it could mean that it was the person who used AI to generate the content (something that couldn't be expressed with your solution).
I don't want to simply dismiss your idea. As I mentioned earlier, I don't like to pollute HTML with more and more meta tags. And this is what I like about your approach, it reuses the author meta. Despite the shortcomings, it would still be a viable solution in my eyes.
No, there would just be one author meta tag per author, as usual. If the author is an AI, the AI flag is set.
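To make the "one author meta tag per author, with an AI flag" reading concrete, here is a small sketch of how a consumer might collect authors and their flags using Python's standard `html.parser`. The `ai` attribute is the proposal from this thread, not an existing HTML attribute, and `AuthorMetaParser` / `list_authors` are names invented for this example. A missing `ai` attribute, or any value other than `"1"`, is treated as "not AI" here, which is an assumption.

```python
from html.parser import HTMLParser


class AuthorMetaParser(HTMLParser):
    """Collects (name, is_ai) pairs from <meta name="author"> tags."""

    def __init__(self):
        super().__init__()
        self.authors = []

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        if (attributes.get("name") or "").lower() != "author":
            return
        name = (attributes.get("content") or "").strip()
        is_ai = attributes.get("ai") == "1"  # proposed flag; absent means not AI
        self.authors.append((name, is_ai))


def list_authors(html):
    parser = AuthorMetaParser()
    parser.feed(html)
    parser.close()
    return parser.authors


page = """<meta name="author" content="Me">
<meta name="author" content="ChatGPT" ai="1">"""
print(list_authors(page))
```

Under this reading, a page is "partially" AI-generated when the list mixes flagged and unflagged authors, and fully AI-generated when every author is flagged.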
For schema.org, there is a proposal to map the IPTC tags values here, which I feel would be relevant:
Thanks @MatthiasWiesmann, there's a draft at https://webschemas.org/IPTCDigitalSourceEnumeration now. Although we would do well to add some examples, the link to the IPTC codes is very explicit, so round-tripping between embedded-in-image metadata and published-in-a-referencing-webpage metadata ought to be straightforward in most cases.
Great discussion. I like the proposal.
I'd still prefer the ai="1" syntax, but I would be okay with that as a close second.
I'm really not a fan of people boycotting AI-generated work for various reasons, but at the same time I recognize that if I were one of those people, I would definitely want a standard like this. As such, I am generally +0 to this proposal as a whole.
AI content is low-effort content created on the basis of valuable content that took hundreds of hours of human work. Often, it is not based on sources and links because, for example, OpenAI does not provide the sources of the works on which its models were trained.
Even if it is good and substantive, it is still based on human work. Therefore, AI content should be depreciated compared to human content because it always has less value due to the impossibility or problematic nature of verifying sources and the risk of possible hallucinations contained in the text, which are sometimes difficult to detect.
Imagine a situation where, while looking for confirmation of whether what is in one text is true, you come across 10 other texts generated by AI with the same nonsense, and you become convinced that you are reading the truth even though there are no real sources.
That is why it is so important to catch it and distinguish it from human content, because there is a risk that search results will be flooded with content fully generated by AI, which is often better written in terms of grammar and SEO, which paradoxically translates into a lower search position for content written by humans. Isn't it about us all drowning in the hallucinations of an algorithm that will soon start learning from variations of its own sweats?
You make a good point. While I don't believe it's impossible to make LLMs cite their sources (and GPT-4 does when browsing the internet), I agree that tagging content as AI-generated to avoid training LLMs on LLM output is probably needed in order to avoid a feedback cycle.
Hello. May I express my opinion here?
I think the two proposed taggings are good ideas, but they seem to cater to different use cases, and one isn't a substitute for the other.
<meta name="author" content="ChatGPT" ai="1"> can be useful for attributing content to the AI model that generated it. I expect the main users of this to be the AI models themselves: (a.) to avoid training on data they generated themselves, lest it cause a loop in data sources; (b.) to make AIs aware that the content might have been generated by a rival AI, and if that rival AI is subject to legal issues (copyright and IP infringement, or otherwise illegal content), other AIs that are fed with this data might be legally liable, too.
<meta name="ai-generated" content="all | css-selectors..."> is for more precise tagging of which content is human-made and which is AI-generated. I expect the main users to be browsers and search engines: (a.) browsers may want to visually highlight text, pictures, or videos as AI-generated (such as placing an AI icon in the corner of an image, or highlighting text in a different color for analysis); (b.) browsers and search engines might implement filtering for users who want less AI-generated content in their search results.
So the use cases of the two proposed taggings don't overlap, and we can go with both.
(I am new here.)
I like this, but I worry the phrase "ai-generated" might become out of date / too vague / meaningless (if it hasn't already) – i.e. at some stage it becomes essential to specify the type of AI, or there is some future leap beyond current LLMs.
Nostalgically, I also like the idea of reusing the term "robot" rather than "ai". I think it's quite fun, friendly terminology (robots.txt and humans.txt are easily understandable concepts). Maybe you can introduce "hybrid" or something better for partial content.
Perhaps you encourage people to reuse the crawler's User Agent string (including version number) when specifying the precise author.
I think the other consideration is audio - and to have a way of distinguishing (I don't believe there already is one) between:
Again, this is all partly to assist crawlers and avoid them needlessly crawling or transcribing ai-generated or simply duplicate material.
May I try to give my two cents on this?
I like this, but I worry the phrase "ai-generated" might become out of date / too vague / meaningless (if it hasn't already) – i.e. at some stage it becomes essential to specify the type of AI, or there is some future leap beyond current LLMs.
My belief is that the main purpose of tagging AI content is to satisfy laws that require such labeling. Thus, a generic "AI" label is necessary. When the contents are labeled as AI-generated, it means the contents may be inaccurate, misleading, or deliberately forged. In this use case, there is no need to indicate which kind of AI generated them.
Maybe you can introduce "hybrid" or something better for partial content.
I have reservations about this. The question would be: how much AI-generated content must a work contain before it should be labeled as AI? That threshold may differ among mediums, so "partial" could add more confusion than it resolves.
I think it would be better to have a free-form text comment to indicate which parts are AI and which are not. Example: "Background and props except the trading card designs made by AI" for this Magic: The Gathering advertising image
Introduction
With the rapid growth of artificial intelligence, and especially machine learning models that train on web data, the issues that
arise.
Currently, there is no standard way for website owners to express that AI models (partly) generated their content. This proposal seeks to address this issue by introducing a new HTML meta tag called ai-generated.
The Proposed Solution
I propose the introduction of an HTML meta tag named ai-generated. This tag would have a content attribute with the following possible values:
all: The whole main content was generated by AI
partially: The content was co-authored by AI
none: None of the content was generated by AI
unknown (internal value?): It is unknown whether the content was generated. This value should be assumed in case of an absence of the meta tag
The tag would appear in the <head> of an HTML document. For example:
<meta name="ai-generated" content="partially">
Use Cases
Below are some examples of when the ai-generated meta tag could be used:
1. Let search engines know the content was (partially) generated by AI
Websites use AI-generated content in different ways. In the future, search engines might be aware that the content was generated by AI (because they generated it themselves), and not providing the meta tag would automatically de-rank those websites.
2. Let users know the content was (partially) generated by AI
When browsers see this meta tag, they could visually indicate that parts of the website were authored by AI, telling the user to treat the information with caution.
3. Let AI know that this content was generated by AI
AI should be aware that the following content was already generated, and thus, the information might be flawed.
Examples
Below are examples of how to use the ai-generated meta tag:
1. The whole (main) content was generated by AI (e.g., the main chunk of text content)
<meta name="ai-generated" content="all">
2. Only parts of the content were generated by AI
<meta name="ai-generated" content="partially">
3. Nothing on this website was generated by AI
<meta name="ai-generated" content="none">
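A minimal sketch of how a crawler or browser extension might read this tag, written with Python's standard `html.parser`. It follows the value list given in the proposal (all, partially, none) and falls back to unknown when the tag is absent. Treating an unrecognized value as unknown, and using only the first matching tag, are assumptions of this sketch; `ai_generated_status` is a hypothetical helper name.

```python
from html.parser import HTMLParser

ALLOWED_VALUES = {"all", "partially", "none"}


class AIGeneratedMetaParser(HTMLParser):
    """Remembers the content of the first <meta name="ai-generated"> tag."""

    def __init__(self):
        super().__init__()
        self.value = None

    def handle_starttag(self, tag, attrs):
        if tag != "meta" or self.value is not None:
            return
        attributes = dict(attrs)
        if (attributes.get("name") or "").lower() == "ai-generated":
            self.value = (attributes.get("content") or "").strip().lower()


def ai_generated_status(html: str) -> str:
    """Return 'all', 'partially', 'none', or 'unknown' for a document."""
    parser = AIGeneratedMetaParser()
    parser.feed(html)
    parser.close()
    if parser.value in ALLOWED_VALUES:
        return parser.value
    return "unknown"  # tag missing or value not recognized


print(ai_generated_status('<head><meta name="ai-generated" content="partially"></head>'))
print(ai_generated_status("<head><title>No tag here</title></head>"))
```

A training pipeline could use the same check to skip or down-weight documents whose status is all, which is one of the use cases listed above.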
Existing Solutions
We have two existing tags that could solve this problem, but we would have to standardize the use:
1. Meta Generator
<meta name="generator" content="Chat-GPT">
The meta generator tag indicates that the structure of the document has been generated. In my opinion, this is good enough but solves a different problem. It could, however, actually be used to indicate that the structure of a website was generated by AI.
2. Meta Author
This tag is more interesting as it does exactly what was proposed. But its use would have to be standardized in order to be useful:
The content was fully created by AI:
<meta name="author" content="AI">
The content was co-authored by AI:
<meta name="author" content="Me, AI">
The content was not created by AI:
<meta name="author" content="Me">
In my opinion, having a dedicated meta tag for ai-generated is the better solution.
Other considerations
1. Why should an author use the tag?
Authors need incentives to use this tag. First of all, they contribute to the quality of AI-generated content, as AI might not pick up content that had been generated. Second, we have to be able to identify the content that was generated. Adobe already tries this with Firefly, but we also need a mechanism for written content. So, in the future, Search Engines and other relevant players might punish content that was generated and doesn't explicitly state so.
2. Schema Org
We could move the whole issue to Schema Org and call it a day. E.g., by proposing the ai-generated attribute to them, users could indicate whether articles etc. were generated.
3. How to show which parts of content were generated by AI?
This is an unsolved problem. I am not a friend of creating a new attribute or even new tags, but currently, this might be the only way to solve it:
<span ai-generated="true">Foo</span>
Of course, this would indeed be easier if we just used the schema org solution. Or maybe a combination.
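For comparison, here is a sketch of how a consumer could pull out the text inside elements carrying the hypothetical `ai-generated="true"` attribute, again using Python's standard `html.parser`. It assumes reasonably well-formed markup (every non-void element is closed), and the names `AITextExtractor` and `ai_generated_text` are invented for this example.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not affect nesting depth.
VOID_ELEMENTS = {"area", "base", "br", "col", "embed", "hr", "img",
                 "input", "link", "meta", "source", "track", "wbr"}


class AITextExtractor(HTMLParser):
    """Collects text found inside elements marked ai-generated="true"."""

    def __init__(self):
        super().__init__()
        self.depth = 0    # nesting depth inside a marked element (0 = outside)
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_ELEMENTS:
            return
        marked = dict(attrs).get("ai-generated") == "true"
        if self.depth > 0 or marked:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag in VOID_ELEMENTS:
            return
        if self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth > 0:
            self.chunks.append(data)


def ai_generated_text(html: str) -> str:
    extractor = AITextExtractor()
    extractor.feed(html)
    extractor.close()
    return "".join(extractor.chunks)


page = '<p>Written by me. <span ai-generated="true">Foo <b>bar</b></span> Also me.</p>'
print(ai_generated_text(page))
```

The depth counter is what makes nested markup inside a marked element count as AI-generated too; whether the attribute should inherit like that is exactly the kind of detail a spec would have to settle.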
Conclusion
The proposed ai-generated meta tag provides a standard method for website owners to express that their content was (partially) generated by AI. It would promote transparency and respect for website users, contributing to a more ethical web environment for AI.
How to declare which parts of the website are generated remains unresolved and open to discussion.
Other
I copied some of the text from this issue, which proposed the ai-consent meta tag, as they were very similar: https://github.com/whatwg/html/issues/9334