whatwg / url

URL Standard
https://url.spec.whatwg.org/
Other
533 stars 139 forks source link

Proposal: URL.parseIPv4() URL.parseIPv6() #696

Open jasnell opened 2 years ago

jasnell commented 2 years ago

I've had a few discussions with folks (most recently at the OpenJS World conference in Austin last week) about the possibility of exposing the IPv4/IPv6 parser included in the URL parser via static functions on the URL class, e.g. something like a URL.parseIPv4(input) and URL.parseIPv6(input). The intent is to both validate and normalize the input (e.g. validate that an IPv6 address is conformant, or converting some IPv4 variant into the standard 4-segment syntax, etc). Hanging these APIs of the proposed URLHost would also make sense.

The usage of the proposed APIs would be straightforward:

const normalizedIpv4 = URL.parseIPv4(input);
// normalizedIpv4 is a string with the serialized parsed result of `input`.

If the input is invalid, a TypeError is thrown.

To make quick checks possible without incurring the cost of allocating the TypeError and associated stack, corresponding URL.isIPv4() and URL.isIPv6() utilities could also be provided.

To give an idea as to whether this would be useful for the ecosystem, consider that the ip-regex module on npm currently has over 12 million downloads per week. Given that this is functionality that is already included internally by URL implementations, it would be fairly trivial to expose it via a new API, thereby eliminating yet another dependency.

I'm not super concerned about the naming of the functions or the exact shape of the API. The key for me is just that we should expose this functionality somehow.

annevk commented 2 years ago

So ip-regex would return "not IPv4" for https://jsdom.github.io/whatwg-url/#url=aHR0cHM6Ly8x77yOMS4xLjEv&base=YWJvdXQ6Ymxhbms= (note the . (U+FF0E)), but the host parser would return an IPv4 address. A standalone parseIPvv4 might also return "not IPv4" if it directly invokes the IPv4 parser. That seems problematic.

Can you elaborate on why URLHost would not satisfy these use cases?

jasnell commented 2 years ago

Can you elaborate on why URLHost would not satisfy these use cases?

It might... Just really not certain about the current status of URLHost. As I said, I'm really not all that concerned about the actual API, just the use case. I would suggest, however, that the additional object allocation with new URLHost(...) is not strictly needed in every case -- that is, it seems wasteful if all I really want is a simple true/false check to see if it's a valid ipv6 or not.

annevk commented 2 years ago

Do the use cases want potentialIPv6String -> boolean or potentialHostString -> hostTypeEnum? I could also see potentialHostString -> null or serializedHostString, though the downside of string-in-string-out is that you cannot hand out type information at the same time, so we probably still want URLHost, even if we add some static conveniences.

Anyway, I think this idea is reasonable, but whatever shape we give to the API, the lowest we should go is the "host parser" in terms of what we end up giving inputs to and taking outputs from. That is, let's not invoke the IPv4 parser directly even if the API shape is parseIPv4(). This adds some overhead, but ensures consistent processing for various edge cases. Implementations can optimize code paths depending on the shape of the API.

jasnell commented 2 years ago

Sounds reasonable. I think a hostString -> hostTypeEnum would cover the low-cost is this an ipv4 or is this an ipv6 use case with no problem.

karwa commented 2 years ago

I've been adding host-parsing APIs to my Swift library recently to handle these kinds of use-cases. There really are a lot of edge-cases to go through, so it truly does need the whole host parser every time. For example, not only do IPv4 addresses get parsed after IDNA, that also happens after percent-decoding, so if you want truly URL-compatible parsing, it needs to also return that "0x%F0%9D%9F%95f.1" is an IPv4 address (127.0.0.1):

"0x%F0%9D%9F%95f.1" -> (percent-decoding) -> "0x𝟕f.1" -> (IDNA) -> "0.7f.1" -> (IPv4) -> 127.0.0.1

It's such a long distance from input to output that IMO it's not worth providing users with anything other than the complete, parsed result. I'm not sure about how allocations work in JS specifically, but I doubt there would be any performance gain by doing anything else, especially considering that domain-to-ASCII in particular is not trivial and often needs to allocate. I think performance wins are likely going to be found at the application level by storing parsed results so you don't need to parse them again.

Also - if the API told people that "0x%F0%9D%9F%95f.1" is an IPv4 address (which it is in URLs), what would they do with that information? Probably no other parser they can find will actually turn that string in to an IPv4 address, so we might as well give them the complete result that the URL host parser interpreted.

annevk commented 1 year ago

@karwa makes sense. We should start with #288 and take it from there, based on data.