whatwg / dom

DOM Standard
https://dom.spec.whatwg.org/

Suggestion: new method - getNodesByType #992

Open prettydiff opened 3 years ago

prettydiff commented 3 years ago

getNodesByType

This is a suggestion for a new method to return a node list of descendant nodes of a matching node type.

Purpose

More directly access any part of a document. I have used this method in my own code for many years and it has allowed me to perform actions quickly with minimal code.

Examples

Background

In the DOM specification nodes are of a type. The standard types are (including deprecated legacy types):

  1. 1 - ELEMENT_NODE
  2. 2 - ATTRIBUTE_NODE
  3. 3 - TEXT_NODE
  4. 4 - CDATA_SECTION_NODE
  5. 5 - ENTITY_REFERENCE_NODE
  6. 6 - ENTITY_NODE
  7. 7 - PROCESSING_INSTRUCTION_NODE
  8. 8 - COMMENT_NODE
  9. 9 - DOCUMENT_NODE
  10. 10 - DOCUMENT_TYPE_NODE
  11. 11 - DOCUMENT_FRAGMENT_NODE
  12. 12 - NOTATION_NODE

The numbers are listed deliberately: each one is the value returned by the nodeType property of a node of that type.

The suggestion below also defines a value of 0 (0 - ALL), where 0 returns a node list of all descendant nodes.

Availability

This method should be available on the document object and elements.

Input

A node type value. This value can be of two types: a number from 0 to 12, or a string naming a node type constant, such as "COMMENT_NODE".

Unsupported Input

If the input does not conform to the accepted criteria, the default value of 0 is applied.
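The input rules above could be sketched roughly as follows. This is only an illustration: normalizeNodeType and NODE_TYPE_NAMES are hypothetical names, and the sketch assumes exact, case-sensitive matching of the constant names, which is stricter than some implementations may want.

```javascript
// Hypothetical lookup table: the index of each name is its numeric node type.
const NODE_TYPE_NAMES = [
    "ALL",                         // 0 - proposed value, not an existing DOM constant
    "ELEMENT_NODE",                // 1
    "ATTRIBUTE_NODE",              // 2
    "TEXT_NODE",                   // 3
    "CDATA_SECTION_NODE",          // 4
    "ENTITY_REFERENCE_NODE",       // 5
    "ENTITY_NODE",                 // 6
    "PROCESSING_INSTRUCTION_NODE", // 7
    "COMMENT_NODE",                // 8
    "DOCUMENT_NODE",               // 9
    "DOCUMENT_TYPE_NODE",          // 10
    "DOCUMENT_FRAGMENT_NODE",      // 11
    "NOTATION_NODE"                // 12
];

// Hypothetical helper: maps any input onto a number from 0 to 12.
// Anything that is not a supported number or name falls back to 0 (all nodes).
function normalizeNodeType(typeValue) {
    if (typeof typeValue === "number") {
        const supported = Number.isInteger(typeValue) && typeValue >= 0 && typeValue <= 12;
        return supported ? typeValue : 0;
    }
    if (typeof typeValue === "string") {
        const index = NODE_TYPE_NAMES.indexOf(typeValue);
        return (index < 0) ? 0 : index;
    }
    return 0;
}
```

With this sketch, normalizeNodeType(8) and normalizeNodeType("COMMENT_NODE") give the same result, while unsupported input such as 13 or "bogus" falls back to 0.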

Output

A node list

Working Demonstration

Copy and paste the following code into your browser console on any page to explore the potential.

// typeValue argument must be either
// * number 0-12
// * string - named node type constant, but this demonstration code allows case insensitive input and without the "_NODE" tail
//
// example
// document.getNodesByType(8); // returns all comment nodes in the document
// document.getNodesByType("COMMENT_NODE"); // returns all comment nodes in the document
// document.getNodesByType("COMMENT"); // returns all comment nodes in the document
// document.getNodesByType("comment_node"); // returns all comment nodes in the document
// document.getNodesByType("comment"); // returns all comment nodes in the document
// document.getNodesByType(0); // returns all nodes in the document
// document.getNodesByType("ALL"); // returns all nodes in the document
// document.getNodesByType("aLl"); // returns all nodes in the document
const getNodesByType = function (typeValue) {
    const valueString = (typeof typeValue === "string") ? typeValue.toLowerCase().replace(/_node$/, "") : "",
        root = (this === document) ? document.documentElement : this,
        numb = (isNaN(Number(typeValue)) === false)
            ? Number(typeValue)
            : 0;
    let types = (numb > 12 || numb < 0)
        ? 0
        : Math.round(numb);
    // If input is a string and supported standard value
    // associate to the standard numeric type
    if (valueString === "all" || typeValue === "") {
        types = 0;
    } else if (valueString === "element") {
        types = 1;
    } else if (valueString === "attribute") {
        types = 2;
    } else if (valueString === "text") {
        types = 3;
    } else if (valueString === "cdata_section") {
        types = 4;
    } else if (valueString === "entity_reference") {
        types = 5;
    } else if (valueString === "entity") {
        types = 6;
    } else if (valueString === "processing_instruction") {
        types = 7;
    } else if (valueString === "comment") {
        types = 8;
    } else if (valueString === "document") {
        types = 9;
    } else if (valueString === "document_type") {
        types = 10;
    } else if (valueString === "document_fragment") {
        types = 11;
    } else if (valueString === "notation") {
        types = 12;
    }
    // A handy dandy function to trap all the DOM walking
    {
        const output = [],
            child = function browser_dom_getNodesByType_walking_child(x) {
                const children = x.childNodes;
                let a = x.attributes, b = a.length, c = 0;
                // Special functionality for attribute types.
                if (b > 0 && (types === 2 || types === 0)) {
                    do {
                        output.push(a[c]);
                        c = c + 1;
                    } while (c < b);
                }
                b = children.length;
                c = 0;
                if (b > 0) {
                    do {
                        if (children[c].nodeType === types || types === 0) {
                            output.push(children[c]);
                        }
                        if (children[c].nodeType === 1) {
                            //recursion magic
                            browser_dom_getNodesByType_walking_child(children[c]);
                        }
                        c = c + 1;
                    } while (c < b);
                }
            };
        child(root);
        return output;
    }
};
document.getNodesByType = getNodesByType;
Element.prototype.getNodesByType = getNodesByType;

Then call the method on the document object or any element in the page. Example:

document.getNodesByType(2); // returns all attribute nodes in the page
document.getNodesByType(8); // returns all comment nodes in the page

EDIT

liamquin commented 3 years ago

I'd far rather use document.evaluate("//comment()") (i.e. XPath) for this case, as it's much more readable even for people who don't like XPath or are allergic to the letter x :-)

Making it easier to call XPath would be a good thing for cases like this, too; in document.evaluate( '//comment()', document, null, XPathResult.ANY_TYPE, null ); only the first argument should really be needed.
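A thin wrapper along these lines might make the point concrete. It is a sketch, not an existing API: xpathNodes is a hypothetical name, and the optional doc parameter exists only so the function can be exercised outside a browser.

```javascript
// Numeric value of XPathResult.ORDERED_NODE_SNAPSHOT_TYPE.
const ORDERED_NODE_SNAPSHOT_TYPE = 7;

// Hypothetical wrapper: in a browser only the expression is required.
function xpathNodes(expression, contextNode, doc) {
    // Fall back to the global document when no context node is supplied.
    const context = contextNode || document;
    // A document has no ownerDocument, so it acts as its own owner.
    const owner = doc || context.ownerDocument || context;
    const snapshot = owner.evaluate(expression, context, null, ORDERED_NODE_SNAPSHOT_TYPE, null);
    const nodes = [];
    for (let index = 0; index < snapshot.snapshotLength; index += 1) {
        nodes.push(snapshot.snapshotItem(index));
    }
    return nodes;
}

// Browser usage:
// xpathNodes("//comment()");            // all comment nodes in the document
// xpathNodes(".//text()", someElement); // text nodes below someElement
```

The snapshot result type is used here so the returned array stays valid even if the document changes afterwards.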

Any API in which the integer literal 8 is passed to refer to the type of something has to be questioned, XMLHttpRequest notwithstanding :)

WebReflection commented 3 years ago

As much as I love the XPath solution, the TreeWalker already has this kind of filtering, right?

prettydiff commented 3 years ago

@liamquin I completely agree. I am falling in line with conventions already in place in the standard, where node types are returned as numbers 1-12. This example code does accept a string of the node type name if you would rather execute something like document.getNodesByType("COMMENT_NODE").

EDIT IRC pointed me to the node type constants: https://developer.mozilla.org/en-US/docs/Web/API/Node/nodeType

The strings allowed by the demonstration code are equivalent to calling

The implementation provided in the demonstration code is case insensitive so lowercase will work too:

I could make the demonstration code even more convenient by assuming the _NODE trailing portion of the string argument so that "comment" is accepted as well as "COMMENT_NODE".

liamquin commented 3 years ago

using constants is massively better than "8" :) and also better than strings (because a typo will always fail). But does this really add so much compared to existing methods?

prettydiff commented 3 years ago

@liamquin I have been using this approach for about 9 years, and it absolutely opens up new possibilities developers never would have considered otherwise, because the existing methods forcefully suggest the only way to navigate a document is via query of elements.

Most of the benefits provided by this are second and third order consequences to tooling. For example, this code has allowed me to write a getElementsByText method where I can get all text nodes and then filter that list matching against a string fragment and return a new list of their parent nodes.
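A getElementsByText helper of the kind described could look roughly like this. It is a minimal sketch under assumptions, not the original code: collectByType and getElementsByText are hypothetical standalone functions that take the root as an argument and rely only on childNodes, nodeType, nodeValue, and parentNode.

```javascript
const TEXT_NODE = 3; // same value as Node.TEXT_NODE

// Collects every descendant of root whose nodeType matches, in document order.
function collectByType(root, nodeType) {
    const output = [];
    const walk = function (node) {
        for (const child of Array.from(node.childNodes || [])) {
            if (child.nodeType === nodeType) {
                output.push(child);
            }
            walk(child);
        }
    };
    walk(root);
    return output;
}

// Hypothetical getElementsByText: returns the parents of all text nodes
// whose content contains the given fragment, without duplicates.
function getElementsByText(root, fragment) {
    const parents = new Set();
    for (const textNode of collectByType(root, TEXT_NODE)) {
        if (textNode.nodeValue.includes(fragment) && textNode.parentNode) {
            parents.add(textNode.parentNode);
        }
    }
    return Array.from(parents);
}

// Browser usage:
// getElementsByText(document.body, "Esterhazy");
```

A Set is used so that an element with several matching text nodes appears only once in the result.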

liamquin commented 3 years ago

I do not dispute the benefits of being able to write, tbody/tr/th[contains(., "Esterhazy")]/following-sibling::th[2] - i do this on a daily basis. I didn't need to write a new method to do it, either. And it's standards-based, yay.

prettydiff commented 3 years ago

@liamquin The goal of this method is to attain a list of nodes of the request type from descendant nodes irrespective of additional specificity. An XPath instance requires a working knowledge (or discovery thereof) of a tree instance and a non-element type can only be accessed via an element, just like all other DOM methods and properties. That is very different than the goals of the method presented here.

WebReflection commented 3 years ago

Again, the TreeWalker already offers this possibility and more, combining multiple types at once.

The whole proposal could be basically just:

const getNodesByType = function* (typeValue) {
  const type = typeof typeValue === 'number' ? typeValue : NodeFilter[
    (
      /^SHOW_/i.test(typeValue) ? typeValue : ('SHOW_' + typeValue)
    ).toUpperCase()
  ];

  if (!type)
    throw new TypeError('unexpected ' + typeValue);

  const tw = document.createTreeWalker(this, type);

  let currentNode;
  while (currentNode = tw.nextNode())
    yield currentNode;
};

Example

for (const node of getNodesByType.call(document, 'text'))
  console.log(node);

Using numbers / multiple kinds:

const show = NodeFilter.SHOW_TEXT | NodeFilter.SHOW_COMMENT;
for (const node of getNodesByType.call(document, show))
  console.log(node);

Differences

Why is TreeWalker being ignored in this discussion?

WebReflection commented 3 years ago

P.S. the NodeFilter.SHOW_ALL value is -1, not 0 ... please, let's reuse the already defined platform constants.

prettydiff commented 3 years ago

@WebReflection I suspect TreeWalker is ignored because it is incredibly esoteric. It is a stand alone utility available as a part of the browsers' WebAPIs with its own internal filtration API and internal methods.

Furthermore, you can eventually come to the same conclusion using a custom function wrapping TreeWalker as the internal workings of the demonstration logic, which walks element nodes. Neither are the same as a single method name requiring a single argument that does everything for you without any custom logic.

-1 numerical is the SHOW_ALL for TreeWalker. The demonstration logic was only based upon nodeType values and their string constant equivalents which does not provide any value for 0 or -1.
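For reference, the two numbering schemes are related: in the DOM Standard each NodeFilter SHOW_* constant is the bit at position nodeType - 1, and -1 has the same 32-bit pattern as NodeFilter.SHOW_ALL (0xFFFFFFFF as exposed to JavaScript). A small sketch of that conversion follows; nodeTypeToWhatToShow is a hypothetical name, and treating 0 as "all" follows this proposal rather than any standard.

```javascript
// Value of NodeFilter.SHOW_ALL as an unsigned 32-bit integer.
const SHOW_ALL = 0xFFFFFFFF;

// Hypothetical helper: converts a nodeType (1-12) into the matching SHOW_* bit.
// 0 is treated as "all node types", which is an assumption taken from the proposal.
function nodeTypeToWhatToShow(nodeType) {
    if (nodeType === 0) {
        return SHOW_ALL;
    }
    if (!Number.isInteger(nodeType) || nodeType < 1 || nodeType > 12) {
        throw new RangeError("unexpected node type " + nodeType);
    }
    // For example, TEXT_NODE (3) becomes 1 << 2, the value of NodeFilter.SHOW_TEXT.
    return 1 << (nodeType - 1);
}

// Browser usage:
// document.createTreeWalker(document, nodeTypeToWhatToShow(Node.COMMENT_NODE));
```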

WebReflection commented 3 years ago

How is a native DOM API esoteric, if I might ask?

Regarding -1 vs 0: there are polyfills on the Web, and these usually stick with standard documentation. Why diverge here, when an ALL search has already been defined as -1?

WebReflection commented 3 years ago

Neither are the same as a single method name requiring a single argument that does everything for you without any custom logic

how is my 20 LOC polyfill inferior to the initial proposal, since we have already a TreeWalker in the specs?

prettydiff commented 3 years ago

@WebReflection That is still 20 lines versus a single method taking a single argument as input.

WebReflection commented 3 years ago

The method takes a single argument as input, is a borrowed method, you can attach it to a prototype and be done with it?

All I’m saying is that this proposal is covered, in a better, more feature rich way, by the TreeWalker.

rniwa commented 3 years ago

There is also NodeIterator.

function* getNodesByType(root, nodeType) {
    // Return explicit NodeFilter constants instead of relying on boolean coercion.
    const nodeIterator = document.createNodeIterator(root, NodeFilter.SHOW_ALL, (node) => node.nodeType === nodeType ? NodeFilter.FILTER_ACCEPT : NodeFilter.FILTER_SKIP);
    // The first nextNode() call already visits root itself, so a separate
    // referenceNode check beforehand would yield a matching root twice.
    while (nodeIterator.nextNode())
        yield nodeIterator.referenceNode;
}

I don't think we want to invent yet another way of iterating over nodes of a particular criteria at this point unless it provides a significant improvement / value over existing methods.

prettydiff commented 3 years ago

@rniwa That misses the point for the same reason as TreeWalker. I am proposing a single method that takes a single argument. There is no internal glue required. How that method executes internally, whether using TreeWalker, NodeIterator, or simply walking the DOM (as in the original demonstration logic), is only a performance factor outside the intention of this proposal. If either NodeIterator or TreeWalker required no additional logic to get the desired results I wouldn't propose a simpler approach. Since they do require extensive boilerplate on each use, XPath, as @liamquin pointed out, remains a far superior approach.

Consider it from the perspective of querySelectors. They are likewise just as unnecessary and are thousands of times slower to execute. From all technical considerations they are a horrible approach when there were already numerous other superior conventions in place to achieve the same output. Why then were they so highly requested? They are a single method that requires a single argument. Unlike querySelectors the proposal here offers no loss of performance and may result in performance improvements at implementation.

WebReflection commented 3 years ago

they do require extensive boilerplate

is that true though?

function* getNodesByType(node, nodeType) {
    const iterator = document.createTreeWalker(node, nodeType);
    while (node = iterator.nextNode())
      yield node;
}

You can use it like this:

for (const node of getNodesByType(document, NodeFilter.SHOW_TEXT))
  console.log(node);

You can pass multiple filters too: NodeFilter.SHOW_TEXT | NodeFilter.SHOW_ELEMENT

The implementation in core will be pretty much the same of an iterator or a tree walker, so why is this method really needed?

prettydiff commented 3 years ago

@WebReflection Yes, I still see boilerplate. And your example still requires a working knowledge of TreeWalker's internal API when your use case demonstrates use of NodeFilter.SHOW_TEXT.

I did a global search of GitHub for createTreeWalker and found only 26 JavaScript results and only 145 TypeScript results. There were nearly 5000 results for an unrelated Java API of the same name indicating that nobody is using this. Either TreeWalker is not well known or the API is too burdensome for most developers. When I did a search for querySelector I found just under 9 million JavaScript results and almost 25 million JavaScript results for getElementById.

There is a substantial value to using a single method that requires a single argument and absolutely no boilerplate.

WebReflection commented 3 years ago

@prettydiff I am not sure I understand your "solution" though ...

your example still requires a working knowledge of TreeWalker's internal API

make it a module?

your use case demonstrates use of NodeFilter.SHOW_TEXT

which is part of the standard as Element.TEXT_NODE is?

nobody is using this

so your proposal is to create "yet another method nobody will know about" specially in these days where very few read standard documentation?

Either TreeWalker is not well known or the API is too burdensome for most developers

and a new method to learn that nobody would know until it's present at least in MDN is a solution?

There is a substantial value to using a single method that requires a single argument and absolutely no boilerplate.

We started with a convoluted boilerplate that was unnecessary already, but here a couple of things I believe you are not considering:

So, when the boilerplate is literally 3 lines of code, how is this proposal worth it, when it'll just create yet another method nobody used or needed to date, and those that did, rightly solved through a TreeWalker?

prettydiff commented 3 years ago

So, when the boilerplate is literally 3 lines of code, how is this proposal worth it, when it'll just create yet another method nobody used or needed to date, and those that did, rightly solved through a TreeWalker?

I can remember 2008. Back then there was no getElementsByClassName method. Chrome was not released until December of that year and support was added to Firefox at version 3 (June 17 2008) and IE9 (March 14 2011). Before this the method did not commonly exist. jQuery did not add the Sizzle engine (query selectors) until version 1.3 (January 14 2009) and was not popular until it did.

Before this classes were almost exclusively used for CSS. Nobody accessed the DOM by class names, because it was too cumbersome and not worth the effort. Nonetheless it was widely adopted and is widely used. To solve for this I would have written code almost identical to the original demonstration logic above, but that doesn't mean anybody would use it. For me writing such logic is trivial, but this may not be true for other developers.

When I search GitHub globally for getNodesByType there are more than 6400 JavaScript results, and many of the first results look like supplemental means of accessing the DOM by various libraries and utilities in common use. This demonstrates some level of desire for such a feature. Since the feature is not a standard, this convention likely also exists under countless other names.

Yes it would mean new documentation, implementation guidance, conformance errata, and test cases. External documentation like MDN and CanIUse would need to be updated as well. I suspect the greatest level of effort would be in testing for integration with various other related specifications such as SVG, WebGL, WASM, and even various XML technologies like XSLT and XSL-FO. I suspect the risk from something like this would be very low because it is supplemental and based upon existing conventions. I am flexible about any implementation, conventions, names, formalities, or approaches. My only goal is to request a simplified approach that is a single method requiring a single argument under any name regardless of any additional specifics.

WebReflection commented 3 years ago

Live collections are discouraged these days though, so getElementsByClassName is a bad example, as querySelectorAll is what modern code should use, but GitHub doesn’t tell you how deprecated are those results, yet it stores the history of the web. You can search for jQuery or Prototype too there, it doesn’t reflect reality.

That being said, why don’t you publish a module that implements what you’re after and demonstrate afterwards it’s totally needed on the DOM API too? With all the new APIs out there and issues, something covered by more than one lower level API in 3 LOC, doesn’t look important or worth time to me, but feel free to prove me wrong on that, and make a module everyone uses ‘cause those 3 LOC are a real issue.

looking forward to be proven wrong, although that’s another way to convince vendors, from time to time 👍

liamquin commented 3 years ago

@liamquin The goal of this method is to attain a list of nodes of the request type from descendant nodes irrespective of additional specificity. An XPath instance requires a working knowledge (or discovery thereof) of a tree instance and a non-element type can only be accessed via an element, just like all other DOM methods and properties. That is very different than the goals of the method presented here.

Utter total bullshit and nonsense. The example I gave earlier, tbody/tr/th[contains(., "Esterhazy")]/following-sibling::th[2], finds a th element containing the text string Esterhazy and returns the second th element that follows it, if there is one, in the same tr. You could write just, e.g. //text()[contains(., "Esterhazy")] to find a matching text node, or from any given node you can write .//text() to get only descendants at any level. No knowledge of the tree is required, and it is not the case that a non-element type can only be accessed via an element, either.
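To illustrate how node types line up with XPath node tests such as .//text(), here is a small sketch. XPATH_NODE_TESTS and xpathForNodeType are hypothetical names, and only node types that XPath 1.0 can address directly are covered.

```javascript
// Hypothetical mapping from nodeType values to XPath 1.0 node tests.
const XPATH_NODE_TESTS = {
    1: "*",                        // ELEMENT_NODE
    2: "@*",                       // ATTRIBUTE_NODE
    3: "text()",                   // TEXT_NODE
    7: "processing-instruction()", // PROCESSING_INSTRUCTION_NODE
    8: "comment()"                 // COMMENT_NODE
};

// Builds an expression that selects matching nodes at any depth below the context node.
function xpathForNodeType(nodeType) {
    const test = XPATH_NODE_TESTS[nodeType];
    if (test === undefined) {
        throw new RangeError("no XPath node test for node type " + nodeType);
    }
    return ".//" + test;
}

// Browser usage:
// document.evaluate(xpathForNodeType(8), document, null,
//     XPathResult.ORDERED_NODE_SNAPSHOT_TYPE, null);
```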