whatwg / html

HTML Standard
https://html.spec.whatwg.org/multipage/

Proposal: Elements for a more semantic web #8693

Open skwee357 opened 1 year ago

skwee357 commented 1 year ago

Motivation

Making the web more understandable for humans means it should be more readable to machines.

The main way to collect information today is through search engines, and sometimes via syndication feeds such as RSS. As the amount of information on the internet grows at an insane rate, we can no longer operate with the same semantics as we did 20-30 years ago.

Semi-standards such as JSON-LD, RDF, Microdata, and Microformats exist, but are not widely adopted.

The addition of various elements, such as <article>, <picture>, and <video>, was a good step forward. But I believe that as the web becomes more important, the need to find accurate data becomes the number one priority of the internet user. Therefore, we should have more semantic elements: elements such as <tag> to indicate content tagging, <author> to indicate the content author, etc.

Having a semantic web is important not only for machines (such as search engines), but also for a better user experience. For example, one might want to syndicate her own content to different blogging platforms. With a semantic webpage, syndication might become easier.

Another example: one might want to read the content in a reader-friendly environment, removing all the clutter such as advertisements. Many browsers provide an option to highlight only the content. By providing a more semantic webpage, we can help the developers of such tools create a better reading experience for their users.

Why not just use one of the above semi-standards?

Some of the above standards, like JSON-LD, are supplementary. While they can be used to enhance the semantics of the content on a web page, they can become repetitive. For example, one could mark the article of a blog post with the <article> element, but in order to expose it semantically via JSON-LD, one would have to duplicate the same article inside the JSON structure, creating heavier web pages for the sake of semantics, while the data is already there.

The same goes for other attributes. Most elements, such as author or tags, are already part of the visible web page, and exposing them via JSON-LD creates duplication.

RDF, while being a specification, is very confusing and rarely used.

Microdata is a dying format; see "The Downward Spiral of Microdata".

Microformats never became popular. Moreover, they rely on the HTML class attribute. While the idea was nice, it feels hack-y to rely on a styling attribute rather than having dedicated elements.
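To make the duplication concrete, here is a minimal sketch in Python (standard library only). The page fragment, the class name `DuplicationProbe` and the function `duplicated_fields` are all made up for illustration; the sketch only assumes a blog post whose headline and body appear once as visible markup and once more inside a JSON-LD block.

```python
import json
from html.parser import HTMLParser

# Hypothetical page: the same text lives in <article> and again in JSON-LD.
PAGE = """
<article><h1>Hello</h1><p>Semantic HTML matters.</p></article>
<script type="application/ld+json">
{"@context": "https://schema.org", "@type": "BlogPosting",
 "headline": "Hello", "articleBody": "Semantic HTML matters."}
</script>
"""


class DuplicationProbe(HTMLParser):
    """Collects the visible headline/body and the raw JSON-LD text."""

    def __init__(self):
        super().__init__()
        self._target = None  # which bucket the current text belongs to
        self.visible = {"headline": "", "articleBody": ""}
        self.jsonld_text = ""

    def handle_starttag(self, tag, attrs):
        if tag == "h1":
            self._target = "headline"
        elif tag == "p":
            self._target = "articleBody"
        elif tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._target = "jsonld"

    def handle_endtag(self, tag):
        if tag in ("h1", "p", "script"):
            self._target = None

    def handle_data(self, data):
        if self._target == "jsonld":
            self.jsonld_text += data
        elif self._target is not None:
            self.visible[self._target] += data


def duplicated_fields(page):
    """Return the JSON-LD keys whose value repeats text already on the page."""
    probe = DuplicationProbe()
    probe.feed(page)
    probe.close()
    structured = json.loads(probe.jsonld_text)
    return [key for key, text in probe.visible.items()
            if structured.get(key) == text.strip()]


print(duplicated_fields(PAGE))
```

Every key the function reports is content the author has to keep in sync in two places.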

Summary

The web needs to become cleaner for machines, so it will become cleaner for humans. Not every semantic concept can be its own HTML element, as HTML is meant to be presented to the end user. But we need to stop relying on semi-formats like OpenGraph or JSON-LD, which are mainly used by some companies to enhance their users' experience, and instead shift to a standard way to make the web more semantic.

I can't provide the exact elements/attributes that are needed, but I'm willing to be part of the discussions if this proposal moves forward.

brennanyoung commented 1 year ago

I approve of this. I'm especially interested in developing mechanisms of relationship between parts, so that alternative read-order can be marked up (e.g. for complex data views, mathematical formulae, multi-staff music notation, flowcharts etc.).

FransBond3D commented 1 year ago

When reading about the semantic web, I always feel that the authoring part is ignored. In a distributed environment, statements about facts have no meaning if it is not possible to trace the source of the fact. So, each piece of data should be versioned and signed by an author (which could also be an organisation) and, if based on other sources, have links to those sources, such that it is possible to verify the correctness of the aggregated fact.

tanepiper commented 1 year ago

Totally agree here - I brought this point up last year at the Knowledge Graph conference: good, well-written semantic HTML is the best data format for both human and machine-readable data. Back in 2012 it felt like there was an opportunity to bring Microformats and early Web Components together to make this possible, but a drive toward SPAs and API-driven applications killed it off.

I'd like to see a move back to interoperable data using (X)HTML

bahrus commented 1 year ago

I would suggest a tag called "measurement" be added for representing readonly numbers. Meter is a nice tag, but measurement would be used just for displaying a single number. It could have a "units" attribute.

bahrus commented 1 year ago

Also, a "status" tag for representing readonly boolean/indeterminate values - is married, is vegetarian, etc.

brennanyoung commented 1 year ago

@bahrus have you had a look at <output> and (for checkboxes) the readonly attribute?

You can have a "mixed" or indeterminate boolean state today with aria-checked, but not with HTML's native checked, which is always "true" if present. I'd welcome an HTML-native "mixed" attribute for indeterminate booleans.

A way of expressing units would be great (especially if the user agent could switch values to match the user's preferred set of units, e.g. metric to US or whatever), but I think that's going beyond semantics a bit. There is a prior effort at https://www.w3.org/TR/mathml-units/

For a good time, take a look at these proposed (2015) semantic roles for data presentation. Only three of these have been adopted so far, but there's plenty of useful stuff here: https://www.w3.org/wiki/SVG_Accessibility/ARIA_roles_for_charts which we could really use.

skwee357 commented 9 months ago

Does anyone know how I can push this forward?

capjamesg commented 9 months ago

I approve of semantic representation of content in general: it is good for parsers, and enables easier consumption of data for use in building applications. The IndieWeb community uses microformats as a building block, enabling decentralized communication across websites with microformats standards like u-like-of, h-entry, and e-comment. I love semantic HTML.

Semi-standards such as JSON-LD, RDF, Microdata, and Microformats exist, but are not widely adopted.

I disagree with this. JSON-LD, Microdata, and Microformats are all parsed by Google Search and used during indexing. The HTTP Archive has some numbers showing use of JSON-LD in the wild. JSON-LD is used in ActivityPub for communication, too.

I prefer microformats in general for non-structural semantics. Using HTML classes for markup allows you to keep your semantic data with HTML, without creating new elements. A parser will not care if a tag is marked up as rel=tag (an established microformats standard) or as a <tag> HTML element. The semantics would be the same; the only difference would be the authored HTML. In contrast, JSON-LD requires keeping your semantics in a separate object, which means that you need to update content with a semantic representation in two separate places (which could be in distinct parts of a codebase).

Microformats are used in Micropub and Webmention for social interaction standards. The former is used for publishing content on the web; the latter is used for semantically rich replies and interactions (i.e. comments, likes, bookmarks, RSVPs). As of June 2022, one of the primary publishers of Webmention had sent 2.3 million Webmentions from pages with microformats markup, sent to thousands of websites.
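As a rough illustration of the "a parser will not care" point, here is a small Python sketch (standard library only). The `<tag>` element is the hypothetical one from the proposal, and `TagCollector` and `extract_tags` are names invented for this example. For rel=tag links it takes the last path segment of the href as the tag, which is how I understand the rel-tag convention; for `<tag>` it assumes the element's text is the tag.

```python
from html.parser import HTMLParser


class TagCollector(HTMLParser):
    """Collects tags from rel=tag links and from a hypothetical <tag> element."""

    def __init__(self):
        super().__init__()
        self.tags = []
        self._buffer = None  # text of the <tag> element currently open, if any

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        rel_tokens = (attributes.get("rel") or "").split()
        if tag == "a" and "tag" in rel_tokens:
            # rel-tag: the tag is the last path segment of the link target.
            href = attributes.get("href") or ""
            self.tags.append(href.rstrip("/").rsplit("/", 1)[-1])
        elif tag == "tag":
            self._buffer = []

    def handle_data(self, data):
        if self._buffer is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "tag" and self._buffer is not None:
            self.tags.append("".join(self._buffer).strip())
            self._buffer = None


def extract_tags(html):
    collector = TagCollector()
    collector.feed(html)
    collector.close()
    return collector.tags


# Two authoring styles, same extracted semantics.
WITH_MICROFORMATS = ('<p><a rel="tag" href="/tags/html">HTML</a> '
                     '<a rel="tag" href="/tags/semantics">Semantics</a></p>')
WITH_ELEMENTS = "<p><tag>html</tag> <tag>semantics</tag></p>"

print(extract_tags(WITH_MICROFORMATS) == extract_tags(WITH_ELEMENTS))
```

The consumer ends up with the same list either way; only the authored markup differs.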

I am in agreement re: RDF and Microdata.

I co-authored the MDN microformats page, which has a brief intro to use cases of microformats. We have "why" pages for using various different microformats on the IndieWeb wiki, too!

When developing standards, it is important to start with a use case. What elements precisely would you use? Can these things be done in microformats, which already has a rich, well-tested, vocabulary with parsers across different languages? Do you have at least a couple of implementors who would be interested in both consuming and publishing the elements? I wrote a blog post a few months ago on use-case driven standards development that explains my philosophy on this.

snarfed commented 9 months ago

Thanks @capjamesg, agreed!

Also, very minor nit re:

JSON-LD is used in ActivityPub for communication, too.

It's a bit weaker than that. AS2 (AP activities) are JSON-LD compatible, but don't have to actually implement JSON-LD. AP/AS2 implementors can happily ignore Linked Data, @context, compaction, etc and still be fully compliant. More background: https://www.w3.org/TR/activitystreams-core/#jsonld , https://www.w3.org/TR/activitypub/#x1-overview

(Apologies, I know this seems like nitpicking, and it kind of is. I only mention it because many people feel strongly on two opposite sides - 1) AP/AS2 is/should be fully JSON-LD, vs 2) JSON-LD is strictly optional in AP/AS2, and shouldn't be any more than that - so it sometimes seems worth clarifying.)
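A quick sketch of what "happily ignore Linked Data" can look like in practice, in Python with only the standard library. The activity, the URLs and the helper names (`reference`, `summarize`) are invented for illustration. The consumer treats the payload as ordinary JSON and never expands or compacts anything.

```python
import json

# Hypothetical AS2 activity; the example.com / example.org URLs are placeholders.
RAW_ACTIVITY = """{
  "@context": "https://www.w3.org/ns/activitystreams",
  "type": "Like",
  "actor": "https://example.com/users/alice",
  "object": {"id": "https://example.org/posts/1", "type": "Note"}
}"""


def reference(value):
    """Return the id whether the property is a bare URL or an embedded object."""
    return value if isinstance(value, str) else value.get("id")


def summarize(raw):
    activity = json.loads(raw)
    # "@context" is deliberately left untouched: no JSON-LD processing here.
    return (activity["type"],
            reference(activity["actor"]),
            reference(activity["object"]))


print(summarize(RAW_ACTIVITY))
```

A consumer that does want the Linked Data view can still run a JSON-LD processor over the same bytes, which is the "compatible but not required" distinction above.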

brennanyoung commented 9 months ago

I'd like to propose a mechanism whereby an element provides secondary/extra information or details to another element.

The idea is to have an attribute similar to the for attribute for form control labels, except that this would not be the label (the "accessible name"), it would behave more like aria-describedby or aria-details.

Benefits are to bring HTML to parity with some of the relationship attributes found in ARIA, so that those relationships become machine readable and can be consumed through a variety of different kinds of presentation (i.e. a wide range of assistive technologies). There are obvious benefits for SEO engineering also.

So, to generalize, element A adds context, meaning or richness to element B. One of them points at the other via an attribute. Neither of these need be an operable/interactive element.

I am not sure whether it's better to follow the example of for, where the label points at the control, or aria-describedby / aria-details, where the element points at the id(s) of the complementary content. I think that one-to-many relationships should be supported, so that (e.g.) the elements of a legend may be used multiple times by (e.g.) the points on a map.
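To sketch how a consumer might resolve such a relationship, here is a short Python example (standard library only). It uses aria-describedby as a stand-in for whatever attribute would eventually be chosen; the markup, `DescriptionResolver` and `resolve_descriptions` are all invented for illustration. It assumes well-formed markup without unclosed void elements.

```python
from html.parser import HTMLParser


class DescriptionResolver(HTMLParser):
    """Records text per id and the id lists each element points at."""

    def __init__(self):
        super().__init__()
        self._open_ids = []   # id (or None) of every currently open element
        self.text_by_id = {}
        self.refs_by_id = {}

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        element_id = attributes.get("id")
        self._open_ids.append(element_id)
        if element_id is not None:
            self.text_by_id[element_id] = ""
            refs = (attributes.get("aria-describedby") or "").split()
            if refs:
                self.refs_by_id[element_id] = refs

    def handle_endtag(self, tag):
        if self._open_ids:
            self._open_ids.pop()

    def handle_data(self, data):
        # Attribute text to the innermost open element, if it has an id.
        if self._open_ids and self._open_ids[-1] is not None:
            self.text_by_id[self._open_ids[-1]] += data


def resolve_descriptions(html):
    """Map each referencing element id to the text of the elements it points at."""
    resolver = DescriptionResolver()
    resolver.feed(html)
    resolver.close()
    # Resolution happens after parsing, so DOM position does not matter.
    return {element_id: [resolver.text_by_id[ref].strip()
                         for ref in refs if ref in resolver.text_by_id]
            for element_id, refs in resolver.refs_by_id.items()}


# Hypothetical map: the legend comes after the points and is reused by both.
MAP_MARKUP = """
<div id="map">
  <span id="p1" aria-describedby="legend-cafe">Point 1</span>
  <span id="p2" aria-describedby="legend-cafe legend-park">Point 2</span>
</div>
<ul>
  <li id="legend-cafe">Cafe</li>
  <li id="legend-park">Park</li>
</ul>
"""

print(resolve_descriptions(MAP_MARKUP))
```

Here the referencing element points at the ids of the complementary content, so one legend entry can serve any number of points without being repeated.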

It's possible that new tokens for the rel attribute might be part of a solution, but there seems to be a reluctance to expand the tokens used with rel, and some really useful semantic ones have been deprecated. (Does anyone know why?)

The way to do this accessibly today is to "hack" the accessible name, e.g. by prepending or appending strings to it. It works, but it places a burden on the imperative code (JavaScript) which could be better, and more consistently, handled by declarative code (HTML).

Example use cases

snarfed commented 9 months ago

I'd like to propose a mechanism whereby an element provides secondary/extra information or details to another element.

Naive question: isn't this <summary> + <details>? Browsers hide the content outside <summary> by default, but I expect that's modifiable with CSS.

brennanyoung commented 9 months ago

Do you think <summary> + <details> would be a suitable way of associating (say) a datapoint with an element from a legend?

The main problem I see with that is that <summary> is supposed to be a direct child of <details>.

I am looking for a way to associate one element with another regardless of their position in the DOM tree.

Also, the visual presentation of <details> in the hidden state is currently not modifiable with CSS.