whatwg / url

URL Standard
https://url.spec.whatwg.org/

Explain why valid domain needs to run ToUnicode #817

Open hsivonen opened 7 months ago

hsivonen commented 7 months ago

What is the issue with the URL Standard?

https://url.spec.whatwg.org/#valid-domain could use an informative note that states the implications of the two-step (both ToASCII and ToUnicode) check. Given that both use UTS 46 "Processing", and that "ToASCII" does additional work after "Processing", it would be helpful to call out what the second run of "Processing" (as part of "ToUnicode") still catches.
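
For readers following along, the two-step shape being asked about can be sketched roughly as below. This is only an illustration under stated assumptions: `is_valid_domain`, `to_ascii`, and `to_unicode` are hypothetical names, and the two closures are stand-ins for the "ToASCII" and "ToUnicode" operations, which are not implemented here.

```rust
/// Sketch of the two-step check: the input is accepted only if the
/// ToASCII-like step succeeds AND the ToUnicode-like step, run on the
/// ASCII result, also succeeds.
///
/// `to_ascii` and `to_unicode` are caller-supplied stand-ins (assumptions),
/// not real IDNA implementations.
fn is_valid_domain<A, U>(input: &str, to_ascii: A, to_unicode: U) -> bool
where
    A: Fn(&str) -> Result<String, ()>,
    U: Fn(&str) -> Result<String, ()>,
{
    // First pass: "Processing" plus the additional ToASCII work.
    let ascii = match to_ascii(input) {
        Ok(ascii) => ascii,
        Err(()) => return false,
    };
    // Second pass: "Processing" again, as part of ToUnicode.
    // The open question is which inputs can reach this line and still fail.
    to_unicode(&ascii).is_ok()
}
```

If no input can pass the first closure and then fail the second, the second call is redundant, which is what the requested note would need to confirm or refute.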

hsivonen commented 6 months ago

FWIW, after making more progress with writing code, I'm even more puzzled about what the second run of "Processing" is meant to catch here.

annevk commented 6 months ago

I wonder if the difference has disappeared over time. It does seem weird that ToUnicode can apparently now fail, yet there's no explicit mention of this.

zacknewman commented 4 months ago

Glad I saw this, as I too am skeptical about the need to perform the domain-to-Unicode algorithm. I tried to generate inputs that fail on step 3 with the Rust code below, which uses the idna crate, but I have been unable to find such an input:

use core::{ops::ControlFlow, str};
use idna::Config;
fn main() {
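    // First try every single-code-point input, reusing one buffer; stop at
    // the first input that passes ToASCII but then fails ToUnicode.
    match ('\0'..=char::MAX).try_fold(String::with_capacity(8), |mut input, c| {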
        input.clear();
        input.push(c);
        if let Err(val) = idna_transform(input.as_str()) {
            println!("{val}");
            ControlFlow::Break(())
        } else {
            ControlFlow::Continue(input)
        }
    }) {
        ControlFlow::Continue(input) => {
            let mut utf8 = input.into_bytes();
            utf8.clear();
            utf8.extend_from_slice(b"xn--");
            punycode_inputs(&mut utf8, 0);
        }
        ControlFlow::Break(()) => (),
    }
}
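// Depth-first search over every "xn--" input whose suffix is 1 to 4 bytes
// drawn from the alphabet below. Returns true as soon as one input passes
// ToASCII but fails ToUnicode, and false if no such input exists.
fn punycode_inputs(utf8: &mut Vec<u8>, count: u8) -> bool {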
    if count < 4 {
        for i in [
            b'-', b'0', b'1', b'2', b'3', b'4', b'5', b'6', b'7', b'8', b'9', b'a', b'b', b'c',
            b'd', b'e', b'f', b'g', b'h', b'i', b'j', b'k', b'l', b'm', b'n', b'o', b'p', b'q',
            b'r', b's', b't', b'u', b'v', b'w', b'x', b'y', b'z',
        ] {
            utf8.push(i);
            if let Err(val) =
                idna_transform(str::from_utf8(utf8.as_slice()).unwrap_or_else(|_| {
                    unreachable!("ASCII is a subset of UTF-8, so this is fine")
                }))
            {
                println!("{val}");
                return true;
            } else if punycode_inputs(utf8, count + 1) {
                return true;
            } else {
                utf8.pop();
            }
        }
    }
    false
}
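// Returns Err(input) only when strict ToASCII succeeds but to_unicode (with
// STD3 ASCII rules enabled) then reports an error on the ASCII output.
// Inputs rejected by ToASCII are treated as Ok: they never reach step 3.
fn idna_transform(input: &str) -> Result<(), &str> {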
    idna::domain_to_ascii_strict(input).map_or_else(
        |_| Ok(()),
        |ascii| {
            Config::default()
                .use_std3_ascii_rules(true)
                .to_unicode(ascii.as_str())
                .1
                .map_err(|_e| input)
        },
    )
}

Consequently, I believe steps 3 and 4 can be removed, but I haven't mathematically proven that the domain-to-ASCII algorithm alone is sufficient. I've used these examples as well.
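
To put the coverage of the search above in perspective, here is a rough std-only count of how many inputs it tries, assuming it runs to completion without finding a failure. The function names are made up for illustration; the alphabet size (37) and the maximum suffix length (4) are read off the code above.

```rust
/// Number of Unicode scalar values in '\0'..=char::MAX, i.e. the
/// single-code-point inputs tried first.
fn single_char_inputs() -> u64 {
    ('\0'..=char::MAX).count() as u64
}

/// Number of "xn--" + suffix inputs, where the suffix has 1..=max_len bytes
/// drawn from an alphabet of `alphabet_len` bytes.
fn punycode_like_inputs(alphabet_len: u64, max_len: u32) -> u64 {
    (1..=max_len).map(|len| alphabet_len.pow(len)).sum()
}
```

Under these assumptions, `single_char_inputs()` is 1,112,064 and `punycode_like_inputs(37, 4)` is 1,926,220. The search is therefore exhaustive only for very short inputs; longer labels and multi-label domains are outside what it exercises, which is consistent with the caveat that this is not a proof.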