whatwg / dom

DOM Standard
https://dom.spec.whatwg.org/

Consider new method: getNodesByType #37

Closed prettydiff closed 9 years ago

prettydiff commented 9 years ago

Currently it is easy to get element node types from any given element in a document by getElementsByTagName, but other node types are less straightforward to access. This has many design implications.

Example 1 - comments

Comments exist so that code authors can provide commentary to the code in a way that is immediately available in the source code, but not parsed or processed for user consumption. Although comments are easy for humans to read they are hard to access dynamically, and so are largely worthless for storing any sort of data.

Example 2 - attributes

Attributes are easy to access from any given node, but are challenging to access without touching their respective node directly. This is not particularly helpful. It may be more important to know about the status of one or more attributes by name or their values without care for the element on which they reside. For instance, if I wanted to find a data attribute dynamically from a large section of a document, there would be no simple means to accomplish this.

Example 3 - text nodes

The most convenient way to access text content is to ignore the DOM and instead use the less safe innerHTML property. Even then, the innerHTML property does not return merely the text node value, but absolutely everything contained by the element.

getNodesByType

All of these issues can be solved by a getNodesByType method. Such a method should provide the same level of availability as the getElementsByTagName method in that:

A functional example written in JavaScript can be found at https://github.com/prettydiff/getNodesByType/blob/master/getNodesByType.js. The code example automatically applies the proposed method to all element nodes present in the document, but not to element nodes created with the createElement method.
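For readers who do not follow the link, here is a minimal sketch of what such a method could look like. It is not the code from the linked repository: the name and the `(root, type)` signature are assumptions made for illustration, and it relies only on the standard `childNodes` and `nodeType` properties. Attribute nodes are not part of `childNodes`, so this sketch does not cover them.

```javascript
// Hypothetical helper, NOT the implementation from the linked repository.
// Returns every descendant of `root` whose nodeType equals `type`,
// in tree order. Common values: 1 = element, 3 = text, 8 = comment.
function getNodesByType(root, type) {
  const found = [];
  const children = root.childNodes || [];
  for (let i = 0; i < children.length; i += 1) {
    const child = children[i];
    if (child.nodeType === type) {
      found.push(child);
    }
    // Recurse so the whole subtree is searched, not just direct children.
    found.push(...getNodesByType(child, type));
  }
  return found;
}
```

In a browser this could be called as `getNodesByType(document.body, Node.COMMENT_NODE)` to collect every comment under the body.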

In order to understand the value of this recommendation I strongly suggest playing with the linked code example on any given large markup document. It is liberating to have this level of immediate access without so much need for walking the DOM tree and without reliance upon a framework.

tabatkins commented 9 years ago

  1. What's your use-case for getting comments from the DOM more easily?
  2. Your example use-case for attributes (getting an attribute of a given name regardless of what element it lives on) is easily doable with existing APIs, such as document.querySelector("[my-attr]").getAttribute("my-attr"). It's a bit of repetition, but this is also a pretty rare case, so that's usually okay.

    As well, the fact that attributes are nodes is a legacy mistake that we want to forget as much as possible. We already tried to kill the whole thing (and had to back it out due to site breakage), but probably don't want to add new things that treat attributes as nodes.

  3. If you want all the text in an element, the best way is to use el.textContent. Do you have any use-case for getting all the actual text nodes themselves?

prettydiff commented 9 years ago

It's a bit of repetition, but this is also a pretty rare case, so that's usually okay.

At this time all the examples I presented are extremely rare use cases. I would say they are all edge cases merely because the means and availability do not exist natively or conveniently. This is like saying cell phones would never become popular because at one point they were rare and expensive, like every other emerging technology.

What's your use-case for getting comments from the DOM more easily?

Comments can store any format and size of text data, so they could store JSON formatted data for any use case. They could also conveniently store dynamically constructed configuration data from a page server for use in third party services, which benefits analytics, testing, and evaluation services.

Your example use-case for attributes

One example is immediately gathering all id attributes from a page for discerning possible duplicate values. Duplicate id attribute values represent functional limitations and accessibility barriers in HTML, particularly in the presence of forms and in page navigation.

Another example is to immediately gather all src attributes to determine if there are errant requests, duplicates, various domains, request quantity, and so forth.

As well, the fact that attributes are nodes is a legacy mistake that we want to forget as much as possible.

Why? It seems perfectly valid for attributes to be DOM nodes as children of elements in much the same way that text nodes are DOM nodes as children of elements. What makes attributes less worthy of node consideration than textual content? I would argue they are perhaps more worthy of node consideration in that they store parse-able data that describes both the immediate element and possibly children/descendants in that element.

More important still is that attribute names are subject to namespace inheritance apart from the containing element. Therefore attribute name definitions are subject to lexical scope in exactly the same manner as element names and independently so.

If you want all the text in an element, the best way is to use el.textContent

You are still limited to a single node's text in this manner. Although it is possible to get broader text nodes by walking up the DOM tree, the means of access is still limited and less immediate. Limits upon means of access produce second and third order consequences for software design, readability, and the stylistic considerations that make code easier to share and understand.

tabatkins commented 9 years ago

I would say they are all edge cases merely because the means and availability do not exist natively or conveniently.

This is a very difficult argument to make convincingly; there are tons of things that one could imagine would be more popular if they were easier to do, but that doesn't mean we should make them all easier. We have a finite amount of coding effort to spend, and large APIs are harder to learn and use. One has to have a pretty remarkably strong case, generally showing that people are working around the lack in some hacky ways, to justify adding this kind of surface.

That said, this isn't one of these cases. It's not hard to iterate over nodes; people used to do it all the time. These days people almost never do it, because querySelector() solves nearly all of the use-cases. That suggests that (a) we can tell if people need some particular type of node-querying functionality, because they'll be doing a lot of manual iteration if so, and (b) because there isn't much of that going on, it must not be that common, and thus probably not worth adding new API surface.

Comments can store any format and size of text data, so they could store JSON formatted data for any use case.

This is already doable in any number of ways: putting it in a <script type="something/custom"> and parsing the textContent out later; putting it in a normal <script> and just directly assigning the JSON blob to a variable; putting the JSON blob in an attribute on some element; etc. It doesn't immediately appear that we have much need for an additional method of achieving this, particularly if we have to add more API surface to make it convenient.
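As a rough sketch of the first option, the data can be read back with `JSON.parse` on the script element's `textContent`. The helper name `readEmbeddedJson` is hypothetical, and returning `null` on failure is a design choice of this sketch rather than anything the thread prescribes.

```javascript
// Hypothetical helper: parses JSON stored as the text of a <script> element.
// Returns null when the element is missing or its text is not valid JSON.
function readEmbeddedJson(scriptElement) {
  if (!scriptElement) {
    return null;
  }
  try {
    return JSON.parse(scriptElement.textContent);
  } catch (err) {
    // Invalid JSON: report "no data" instead of throwing.
    return null;
  }
}
```

In a browser, the element could be looked up with `document.querySelector('script[type="something/custom"]')` and passed straight in. Browsers do not execute script elements with an unrecognized type, so the blob stays inert until it is parsed.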

One example is immediately gathering all id attributes from a page for discerning possible duplicate values.

This and your src example are both trivially doable today with querySelector(), as I indicated above:

var ids = [].slice.call(document.querySelectorAll("[id]"))
    .map(function(el){return [el.getAttribute("id"), el];});
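Going one step further toward the duplicate-id use case, a hedged sketch of the grouping step follows. The function name `findDuplicateIds` is an assumption for illustration; it accepts any array-like or iterable collection of elements, such as the result of `document.querySelectorAll("[id]")`.

```javascript
// Hypothetical helper: maps each duplicated id value to the elements
// that carry it. Ids used by only one element are left out.
function findDuplicateIds(elements) {
  const byId = new Map();
  Array.from(elements).forEach(function (el) {
    const id = el.getAttribute("id");
    if (!byId.has(id)) {
      byId.set(id, []);
    }
    byId.get(id).push(el);
  });
  const duplicates = new Map();
  byId.forEach(function (els, id) {
    if (els.length > 1) {
      duplicates.set(id, els);
    }
  });
  return duplicates;
}
```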

Why? It seems perfectly valid for attributes to be DOM nodes as children of elements in much the same way that text nodes are DOM nodes as children of elements. What makes attributes less worthy of node consideration than textual content?

Their order doesn't matter (they're explicitly unordered, in fact), they can't contain further structure (just text), they don't have any intrinsic relationship between each other (unlike text/elements which are siblings), they don't live in an element's .childNodes list (comments/text/elements do). In practice they're just a String=>String dict hanging off an element.

The only reason they're Nodes at all is a mistaken attempt by early XML/SGML folk to make the distinction between an attribute and a child element containing text less apparent, as many vocabularies make a more-or-less arbitrary choice between the two for simple data. This blurring makes no sense in HTML, where attributes and content are mostly strongly delineated.

More important still is that attribute names are subject to namespace inheritance apart from the containing element. Therefore attribute name definitions are subject to lexical scope in exactly the same manner as element names and independently so.

XML Namespaces are a terrible solution to the composing-languages problem to begin with, and namespaced attributes are a hack compounded on top of that. This does not reflect well on them.

You are still limited to a single node's text in this manner.

Yes? Your example was complaining about having to use .innerHTML to get at the text; .textContent is the correct way to do the same thing.

(And note, .textContent gives you the text of this element and its descendants, joined in tree order.)
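To make that concrete, the sketch below reproduces by hand what `.textContent` returns for an element: the data of each descendant text node, joined in tree order, with comments skipped. The helper name `collectText` is hypothetical, and for simplicity it ignores CDATA sections, which only arise in XML documents.

```javascript
// Hypothetical helper that mirrors element.textContent: concatenates the
// data of every descendant text node in tree order. Comments contribute
// nothing because they are not text nodes and have no children.
function collectText(node) {
  const TEXT_NODE = 3;
  const children = node.childNodes || [];
  let text = "";
  for (let i = 0; i < children.length; i += 1) {
    const child = children[i];
    text += child.nodeType === TEXT_NODE ? child.data : collectText(child);
  }
  return text;
}
```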

Although it is possible to get broader text nodes by walking up the DOM tree, the means of access is still limited and less immediate. Limits upon means of access produce second and third order consequences for software design, readability, and the stylistic considerations that make code easier to share and understand.

If you want more text nodes, you just walk the DOM. As I said at the beginning of this comment, DOM-walking used to be very common; when we introduced querySelector() it dropped off significantly. If there was still a lot of interest in getting text nodes specifically (or any other type) it would show up in current DOM-walking usage, but it doesn't.

prettydiff commented 9 years ago

This is a very difficult argument to make convincingly; there are tons of things that one could imagine would be more popular if they were easier to do, but that doesn't mean we should make them all easier.

The DOM is an API. It is actually more than that in that it is a robust data model built upon the definitions of W3C Schema and various access requirements in the Level 3 and Level 4 specifications. Without a doubt the most important component of the DOM is its API, though. Like any API there are only 3 things that matter: convenience, performance, and integrity. I would say enhancements that speak to the spirit and nature of the technology and sufficiently achieve all of the primary design targets are sufficiently worthy of evaluation based purely upon the technical merits of the technology proposed.

I was recently reading that the patent for the telephone was offered for sale to Western Union. They were not interested and considered the technology a mere electronic toy with no potential. A fledgling no-name company called AT&T raised capital to purchase the telephone patent. The problem is that Western Union did not evaluate the technology on its technical merits or on what its potential for disruption would allow. Instead, they were convinced that a defensible position for their standing technology was sufficient cause to evaluate the future of their business according to the user market at present.

It's not hard to iterate over nodes; people used to do it all the time. These days people almost never do it, because querySelector() solves nearly all of the use-cases.

querySelector is sloooow. It requires a separate parsing scheme to translate a string into acceptable DOM arguments that are then accessed in the way the DOM is normally walked using the standard methods. An access means that more closely resembles the standard DOM methods offers superior potential for optimization.

To say that no technology should be considered if alternate to the popular querySelector method is a halting conservative position for a group tasked with maintaining and improving upon an important technology. There are access means that querySelector does not solve for, and still require walking the tree.

This is already doable in any number of ways: putting it in a <script type="something/custom"> and parsing the textContent out later; putting it in a normal <script> and just directly assigning the JSON blob to a variable;

There are many ways to solve this problem without a new method, of which all are less convenient and some of which are less secure.

This and your src example are both trivially doable today with querySelector(), as I indicated above:

The code you provided is hardly trivial though.

querySelector became popular because of convenience. People could not be bothered to learn to walk the DOM. The method I propose is also a convenience method. It just happens to be more convenient than querySelector and more in line with the standard DOM conventions.

Their order doesn't matter (they're explicitly unordered, in fact), they can't contain further structure (just text)

Most node types do not contain further structure, so this is hardly a distinguishing characteristic. In this case though, it is wrong. Attributes appear to only be a name value pair and are accessed as such, however attributes are described and constricted by property definitions. Therefore attributes are data structures unto themselves, even if more primitive than element nodes.

http://www.w3.org/TR/2004/PER-xmlschema-1-20040318/#cAttribute_Declarations

If we were only talking about HTML and ignored both forms and accessibility generally then I would say you are less wrong on this point. The DOM has wider use cases than either HTML or XML.

The order of attributes is only irrelevant to their immediate peers on the same containing element, but their order is otherwise very relevant. Attributes immediately describe the element on which they reside and then describe descendant nodes unless challenged in the lexical scope model as you walk down the scope chain. As such the order of attributes, among other attributes not on the same element, is very relevant.

The only reason they're Nodes at all is a mistaken attempt by early XML/SGML folk to make the distinction between an attribute and a child element containing text less apparent

That is not accurate. The W3C DOM Level 2 specification is based immediately upon types in the XML Schema specification and differs only as necessary to absorb certain qualities from DOM Level 1 and provide some additional backwards compatibility. Years ago I used to spend a lot of time in the W3C Schema mailing list when I was writing my own markup language. XML Schema and W3C DOM fall under the same architectural committees and used to intentionally publish spec updates in pretty close proximity.

Attribute nodes are an intentional type because of how attributes may be defined by schema or by namespace. To eliminate attributes as a type means that attributes cannot be defined apart from the elements on which they reside. This invalidates design considerations of these technologies and interferes with the scope model and document extensibility.

XML Namespaces are a terrible solution to the composing-languages problem to begin with, and namespaced attributes are a hack compounded on top of that.

I will neither agree nor disagree with your opinion. I will only say that the technology is already defined and supported by the DOM. Before the example becomes invalid this technology must sacrifice some backwards compatibility.

If you want more text nodes, you just walk the DOM.

You certainly could, but it's less convenient. If not for convenience why would anybody use the querySelector method?

tabatkins commented 9 years ago

To say that no technology should be considered if alternate to the popular querySelector method is a halting conservative position for a group tasked with maintaining and improving upon an important technology. There are access means that querySelector does not solve for, and still require walking the tree.

You continue to hyperbolize my position to the point of ridiculousness. This is not a useful argument technique.

What I actually said, repeatedly, was that we can't add API for every possible convenience someone can come up with; we have to prioritize for those usage patterns that are common. querySelector() was added because tree-walking to find element nodes that matched Selector-compatible patterns was extremely common, and fairly annoying to do. Adding API to make it simpler was a big usability win.

Since the introduction of querySelector(), tree-walking has become much more rare. That suggests that the other tree-walking use-cases that querySelector() can't solve (looking for comments or text nodes) were only ever a small percentage of all the tree-walking cases, and so it's much less compelling to try and add more API surface for them.

The code you provided is hardly trivial though.

It really is. It's a querySelectorAll() and a map() call, both staples of JS programming. (If you're not comfortable with basic functional programming, the map() can be translated into a trivial for loop.) The [].slice.call() is a common and well-known hack to translate a NodeList into an Array.
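For comparison, the loop translation mentioned above might look like the following sketch. The function name `collectIdPairs` is an assumption for illustration. Because the loop indexes the collection directly, it works on any array-like object, including a NodeList, without the `[].slice.call()` conversion.

```javascript
// Loop-based equivalent of the earlier snippet: builds [id, element]
// pairs from any array-like collection of elements.
function collectIdPairs(nodeList) {
  var pairs = [];
  for (var i = 0; i < nodeList.length; i += 1) {
    var el = nodeList[i];
    pairs.push([el.getAttribute("id"), el]);
  }
  return pairs;
}
```

In a browser this would be called as `collectIdPairs(document.querySelectorAll("[id]"))`.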

Attributes appear to only be a name value pair and are accessed as such, however attributes are described and constricted by property definitions.

Not in browsers. XMLSchema is not implemented by browsers, and is thus irrelevant for this discussion.

If we were only talking about HTML and ignored both forms and accessibility generally then I would say you are less wrong on this point. The DOM has wider use cases than either HTML or XML.

In practice, it does not. The vast, vast majority of DOM usage in browsers is over HTML or its related languages (SVG and MathML); anything else is a rounding error.

annevk commented 9 years ago

Any JavaScript libraries not written by OP that implement this?