whatwg / html

HTML Standard
https://html.spec.whatwg.org/multipage/

Proposal: Implement new method for parsing markup from `Fetch` responses #10076

Closed brandonmcconnell closed 10 months ago

brandonmcconnell commented 10 months ago

Spec proposal

collapsed by default to avoid cluttering the required fields below
# Introduction

This proposal introduces a new method, `.markup()`, to be included in the Fetch API response parsing interface. This method offers a streamlined interface for parsing various markup data formats such as HTML and XML.

## Rationale

### Consistency and Flexibility

Just as `.text()` and `.json()` methods offer simplified handling of text and JSON data, `.markup()` extends this ease of use to markup languages. By providing a single method with optional configurations, developers can handle XML and HTML data more flexibly within the same framework.

### Performance and Optimization

Integrating this method into the language standard allows for engine-level optimizations, potentially outperforming custom parsing solutions and improving overall performance.

### Enhanced Readability and Maintenance

A unified method simplifies codebases, enhancing readability and ease of maintenance. This aligns with modern JavaScript's goal of concise and powerful syntax.

## Technical Specification

### `.markup()` Method

- **Purpose:** Parses the response body as XML or HTML based on specified configurations.
- **Usage:** `response.markup(options)`.
  - `options`: An optional argument specifying parsing preferences
    - `type`: `"text/html" | "text/xml"` (enforces self-closing tags, etc. for HTML)
- **Return Type:** A promise that resolves with the result of parsing the response body text as specified.

### Implementation Notes

- Should follow the structural design of `.text()` and `.json()`.
- Includes error handling for malformed content, with robustness akin to `.json()`.
- Security considerations are paramount, especially for HTML content, to prevent injection attacks.
- Should be capable of handling self-closing tags in HTML when specified in options.

## Use Cases

- **XML Feeds:** Facilitates the consumption of XML feeds, such as RSS or Atom.
- **Client-Side Templating:** Simplifies integration of HTML templates fetched from a server.
- **Web Scraping:** Aids in efficient parsing of HTML for data extraction.

## Potential Challenges

- **Security Concerns:** Ensuring safe parsing, particularly for HTML, to prevent XSS attacks.
- **Browser Support and Polyfills:** Guaranteeing consistent behavior across different JavaScript engines and providing polyfills for backward compatibility.

## Conclusion

Introducing the `.markup()` method in ECMAScript offers a versatile and optimized approach to handling XML and HTML data. This proposal seeks the TC39's consideration for this addition, which is in line with JavaScript's evolution towards a more powerful and developer-friendly language.

What problem are you trying to solve?

Currently, developers handling XML and HTML content in ECMAScript face a lack of native, streamlined methods for parsing these markup languages. This leads to reliance on custom or third-party parsing solutions, which can vary in efficiency, security, and ease of use.

// Proposed usage: response.markup() does not exist yet.
fetch("https://swapi.dev")
  .then(response => response.markup({ type: "text/html" }))
  .then(doc => console.log(doc))
  .catch(error => console.error(error));

What solutions exist today?

Presently, developers typically use custom-built parsers or third-party libraries to parse XML and HTML content. For example, libraries like xml2js or node-html-parser provide these capabilities, but they require additional dependencies and may not be optimized for all use cases. These solutions often lead to inconsistent implementations and may pose security risks, especially when parsing HTML content.

One workaround involves using the .text() method and then parsing its content using a new DOMParser.

For example:

fetch("https://swapi.dev")
  .then(response => response.text())
  .then(data => {
    const parser = new DOMParser();
    const doc = parser.parseFromString(data, "text/html");
    console.log(doc);
  })
  .catch(error => console.error(error));

This method is a bit cumbersome and does not provide any of the security benefits of the Sanitizer API.
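The workaround above can be wrapped in a small helper so that call sites stay short. The following is only a sketch: `parseMarkup` and its `parser` option are hypothetical names introduced here for illustration and are not part of Fetch or the DOM. It also shows one wrinkle a native method would have to settle: for XML types, `DOMParser` does not throw on malformed input but returns a document containing a `<parsererror>` element.

```javascript
// Sketch only: `parseMarkup` is a hypothetical helper, not a standard API.
// `parser` is injectable (an assumption made for testability); in a browser
// it defaults to a fresh DOMParser.
async function parseMarkup(response, { type = "text/html", parser = new DOMParser() } = {}) {
  if (!response.ok) {
    throw new Error(`Request failed with status ${response.status}`);
  }

  const text = await response.text();
  const doc = parser.parseFromString(text, type);

  // The HTML parser is forgiving and never reports errors, so only XML
  // types are checked. Malformed XML yields a <parsererror> element
  // instead of an exception.
  if (type !== "text/html") {
    const parserError = doc.querySelector("parsererror");
    if (parserError) {
      throw new SyntaxError(parserError.textContent);
    }
  }

  return doc;
}
```

In a browser this could be called as `const doc = await parseMarkup(await fetch(url), { type: "text/xml" });`. It still performs no sanitization, so the security concern raised above is unchanged.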

How would you solve it?

The solution is to introduce a new method, .markup(), into the ECMAScript standard. This method will unify and simplify the parsing of XML and HTML content. By offering an optional configuration argument, it allows developers to specify the content type (XML or HTML) and other parsing preferences. For instance, response.markup({ type: "text/html" }) would parse HTML content while appropriately handling self-closing tags (the default behavior). This approach ensures consistency, optimizes performance, and reduces the security risks associated with third-party parsers.

Anything else?

In addition to providing a unified method for parsing markup languages, the .markup() method will include robust error handling and security features, especially vital for HTML parsing to prevent cross-site scripting (XSS) attacks. It should natively support the Sanitizer API, similar to how the setHTML() method will.

Its design will be in line with the existing .text() and .json() methods, ensuring familiarity and ease of adoption for developers. The proposal also considers the need for backward compatibility and browser support, suggesting the development of polyfills for older environments.
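As a rough illustration of the polyfill idea, the sketch below installs a `markup()` method on a `Response`-like class only when one is not already present. The function name `installMarkupPolyfill` and its two parameters are assumptions made for this example; the method name and the `type` option are taken from the proposal, and no browser ships them.

```javascript
// Sketch only: a feature-detected polyfill for the proposed (non-standard)
// markup() method. ResponseClass and ParserClass are passed in explicitly;
// in a browser they would be Response and DOMParser.
function installMarkupPolyfill(ResponseClass, ParserClass) {
  // Never override an existing (possibly native) implementation.
  if (typeof ResponseClass.prototype.markup === "function") return;

  Object.defineProperty(ResponseClass.prototype, "markup", {
    configurable: true,
    writable: true,
    value: async function markup({ type = "text/html" } = {}) {
      // Same steps as the text() + DOMParser workaround.
      const text = await this.text();
      return new ParserClass().parseFromString(text, type);
    },
  });
}
```

In a browser, `installMarkupPolyfill(Response, DOMParser)` would make `response.markup({ type: "text/xml" })` available. Patching a built-in prototype before a method is standardized risks conflicting with a later native definition, so a standalone helper function may be the safer choice.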

annevk commented 10 months ago

Duplicate of #2142.

domenic commented 10 months ago

Seemed to me kind of like a dupe of https://github.com/whatwg/fetch/issues/16

brandonmcconnell commented 10 months ago

@annevk This proposal does not relate to streaming HTML content into elements.

annevk commented 10 months ago

At least to me #2142 covers the idea of a streaming parser API generally.

And yeah, I guess Domenic is correct that exposing a method directly on Response for this is a non-starter.