whatwg / html

HTML Standard
https://html.spec.whatwg.org/multipage/

Proposed Element: Transcript #9829

Open brennanyoung opened 9 months ago

brennanyoung commented 9 months ago

What problem are you trying to solve?

WCAG calls for transcripts as a text alternative for audio-only speech, and for spoken video soundtracks in some cases.

HTML does not have a native <transcript> element. Making a well-engineered transcript is not trivial. A babel of half-baked solutions is to be found across the web, and implementations differ greatly from site to site.

This is a missed opportunity for a content type which is extremely common across the web, leading to a degraded/uneven experience of transcripts for assistive tech users.

I am calling for a baseline transcript solution to be offered via declarative markup. Assistive technologies would thus be able to present this content with predictable affordances, regardless of where (on the web) it is found, or how it may be visually presented.

It is the inconsistency of implementation (especially regarding AT experience) that I would hope to resolve by introducing this element.

What solutions exist today?

A babel of half-baked solutions is to be found on the web, along with a small handful of well-engineered examples (such as the one offered by ableplayer), but implementations, and the expected patterns of "consumption", differ greatly from site to site.

It's not obvious what the best semantic markup for a transcript should be today, and yet a transcript has a relatively consistent format and a clear and distinct semantic role.

A possible choice today might be an <ol>, since a transcript is indeed an ordered collection. However, the typical affordances for lists offered by ATs (such as announcement of the item index and the total number of items) are of questionable value for a transcript.

A closer fit might be <dl> with the timestamp as the term and the utterances as the data. I could be convinced that this is the way forward (perhaps with some special attributes), except that when I look at most transcript implementations "in the wild" they do not use these elements. Support for <dl> in ATs is improving, but not great, with some accessibility experts recommending against it.
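For concreteness, a purely illustrative sketch of the two candidate structures mentioned above, using only today's elements:

<!-- Option 1: ordered list -->
<ol class="transcript">
  <li><time>00:00</time> lorem ipsum</li>
  <li><time>00:07</time> dolor sit amet</li>
</ol>

<!-- Option 2: description list, timestamp as term, utterance as description -->
<dl class="transcript">
  <dt><time>00:00</time></dt>
  <dd>lorem ipsum</dd>
  <dt><time>00:07</time></dt>
  <dd>dolor sit amet</dd>
</dl>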

If the existing "semantic pool" in HTML does not offer a good fit for a transcript, the temptation to reach for non-semantic divs and spans in "home-cooked" solutions will be high. Examples of transcripts using non-semantic HTML are easily found. The differing implementations give users no shared experience or expectations to carry over from one site to the next. An explicit transcript element type would usefully constrain and simplify the way transcripts are authored, for the benefit of all users.

How would you solve it?

Transcript semantics require some sort of outer wrapper: for example, an element perhaps called <transcript>.

An optional attribute might link that transcript to a time-based media element elsewhere in the DOM, perhaps reusing the for attribute. Alternatively, time-based media elements themselves might indicate the id of the transcript, in a way rather similar to aria-details. Either would be acceptable. The direction of indication is less important than establishing a standard way of associating the two elements.
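A sketch of the two directions of association described above (the <transcript> element and both attributes are hypothetical):

<!-- Hypothetical: the transcript points at the media element, reusing "for" -->
<audio id="interview" src="interview.mp3" controls></audio>
<transcript for="interview">…</transcript>

<!-- Hypothetical alternative: the media element points at the transcript by id -->
<audio src="interview.mp3" controls transcript="interview-transcript"></audio>
<transcript id="interview-transcript">…</transcript>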

It may be preferable to support (but not require) an association made as a simple descendant relationship (e.g. a transcript appearing inside the subtree of an <audio> element may be understood as the transcript for that audio, and in the case of multi-language video soundtracks, the different transcripts could have a language attribute, etc.).
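A sketch of the descendant-based association, again with hypothetical element names:

<!-- Hypothetical: association by nesting; lang distinguishes multiple transcripts -->
<video src="keynote.webm" controls>
  <transcript lang="en">…</transcript>
  <transcript lang="de">…</transcript>
</video>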

The DOM subtree of a transcript would consist of timestamps (using the existing <time> element), and utterances, which may or may not deserve their own element.

The structure of each cue could be similar to the implicit "pairing" of <dt> and <dd> which may be expressed in <dl>, although I think it may be better if each cue is explicitly wrapped so that there is no doubt which time belongs with which utterance, for example:

<cue>
<time>00:00</time>
<utterance>lorem ipsum</utterance>
</cue>

Simple CSS selectors and rules can be imagined to style, hide or show the timestamps.
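For example, assuming the hypothetical element names used above, something along these lines:

<style>
  /* Hide timestamps for readers who find them noisy (hypothetical element names) */
  transcript.hide-timestamps cue time {
    display: none;
  }
</style>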

It is a reasonable expectation that visible timestamps behave like hyperlinks, which will jump to exactly that moment of the associated media. Again, it would be ideal if user agents could construct this UI styling and behavior by default from declarative code.
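As a rough approximation of that default behaviour, this is the kind of wiring an author currently has to do by hand (element and attribute names are hypothetical, as above):

<audio id="interview" src="interview.mp3" controls></audio>
<transcript for="interview">
  <cue><a href="#" data-seconds="0"><time>00:00</time></a><utterance>lorem ipsum</utterance></cue>
  <cue><a href="#" data-seconds="7"><time>00:07</time></a><utterance>dolor sit amet</utterance></cue>
</transcript>
<script>
  // Clicking a timestamp seeks the associated media element to that moment.
  document.querySelector("transcript").addEventListener("click", (event) => {
    const link = event.target.closest("a[data-seconds]");
    if (!link) return;
    event.preventDefault();
    const media = document.getElementById(
      link.closest("transcript").getAttribute("for")
    );
    media.currentTime = Number(link.dataset.seconds);
    media.play();
  });
</script>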

Anything else?

Note: Some users prefer timestamps to be presented, others prefer them to be suppressed. A simple baked-in toggle attribute to handle this would go a very long way. (Timestamp announcements from a screen reader get old very quickly.)
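One possible shape for such a toggle (the attribute name is entirely hypothetical):

<!-- Hypothetical: suppress timestamp presentation and announcement -->
<transcript for="interview" timestamps="hidden">…</transcript>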

nigelmegitt commented 9 months ago

On the assumption that you want to reference an external transcript resource, I would suggest looking at DAPT for the payload format choice.

prlbr commented 7 months ago

<audio> and <video> elements can have nested <track> elements that reference different kinds of subtitles/transcriptions/descriptions in the WebVTT format.

Can this be leveraged or do the transcriptions have to be encoded in native HTML?
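For reference, the existing mechanism looks like this (file names illustrative):

<video src="keynote.webm" controls>
  <track kind="captions" src="keynote.en.vtt" srclang="en" label="English" default>
</video>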

brennanyoung commented 7 months ago

Leveraging <track> elements seems like an obvious choice as a data source for a user-agent default transcript view. If we can avoid the requirement to encode the transcript directly in HTML, that would be great.

The content of time-based text tracks is not included in the accessibility tree. Typically only the "current cue" will be surfaced, and that is (currently) the job of the content author, so it simply may not happen in many cases.

It's interesting that the HTML5 spec explicitly mentioned transcripts under both captions and subtitles, although I would like to note that WCAG treats transcripts and captions as separate kinds of content. The data is often near-identical, but the presentation, and the intended pattern of consumption, differ greatly.

So... among any other goals, we may be looking at a predictable baseline way for .vtt and .srt files to be converted into rich text (presumably some kind of DOM subtree), with the intention that it may be consumed independently of the playback state of the audio.
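A minimal sketch of that conversion using today's APIs, assuming a WebVTT track that the browser has loaded (names illustrative):

<audio id="interview" src="interview.mp3" controls>
  <track kind="captions" src="interview.en.vtt" srclang="en" default>
</audio>
<div id="transcript-view"></div>
<script>
  // Build a static transcript view from the track's cues once they have loaded,
  // so the text can be read independently of playback state.
  const trackElement = document.querySelector("track");
  trackElement.addEventListener("load", () => {
    const view = document.getElementById("transcript-view");
    const cues = trackElement.track.cues;
    for (let i = 0; i < cues.length; i++) {
      const row = document.createElement("p");
      const time = document.createElement("time");
      // Naive mm:ss formatting; good enough for cues under an hour.
      time.textContent = new Date(cues[i].startTime * 1000).toISOString().slice(14, 19);
      row.append(time, " ", cues[i].getCueAsHTML());
      view.append(row);
    }
  });
</script>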

brennanyoung commented 7 months ago

Regex for VTT cues: ^(\d{2}:\d{2}:\d{2}[.,]\d{3})\s-->\s(\d{2}:\d{2}:\d{2}[.,]\d{3})\n(.*(?:\r?\n(?!\r?\n).*)*)

Example replace pattern: <cue><a href=#><time>$1</time></a><utterance>$3</utterance></cue>
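A sketch of how that substitution might be applied (file and element names illustrative; the pattern ignores the WEBVTT header, cue identifiers and cue settings):

<div id="transcript-view"></div>
<script>
  // Rough conversion of raw VTT text into the hypothetical transcript markup.
  const cuePattern =
    /^(\d{2}:\d{2}:\d{2}[.,]\d{3})\s-->\s(\d{2}:\d{2}:\d{2}[.,]\d{3})\n(.*(?:\r?\n(?!\r?\n).*)*)/gm;
  fetch("interview.en.vtt")
    .then((response) => response.text())
    .then((vtt) => {
      document.getElementById("transcript-view").innerHTML = vtt.replace(
        cuePattern,
        "<cue><a href=#><time>$1</time></a><utterance>$3</utterance></cue>"
      );
    });
</script>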

Malvoz commented 7 months ago

FYI, this was also brought up in https://github.com/whatwg/html/issues/7499 / https://github.com/WICG/proposals/issues/45. I did not intend to shut down that discussion by any means by posting the following:

Apparently, there already exists a proposal for <transcript>:

/cc @chaals

Other useful resources:

cc @accessabilly