Hostnames should be case-insensitive, but most extractors ignore that.

johnhawkinson commented 8 years ago

Forking off this discussion from #10854, where @dstftw suggested that making all _VALID_URL checks case-insensitive was the wrong way to go:

Use (?i) in regex itself if you want case insensitivity.

I, @johnhawkinson replied:

Good point! Though very few current extractors do that:
pb3:extractor jhawk$ fgrep -Rn '(?i)'  *.py|grep VALID
commonprotocols.py:11:    _VALID_URL = r'(?i)rtmp[est]?://.+'
commonprotocols.py:37:    _VALID_URL = r'(?i)mms://.+'
dailymotion.py:41:    _VALID_URL = r'(?i)(?:https?://)?(?:(www|touch)\.)?dailymotion\.[a-z]{2,3}/(?:(?:embed|swf|#)/)?video/(?P<id>[^/?_]+)'
would it not make sense to consider changing suitable() globally to make most regexps case insensitive, rather than trying to touch every regexp? OTOH I guess it might break some things?

@dstftw replied:

No, this should not be global at least due to presence of potentially case sensitive parts and every regexp should not be touched either. Only those seen to be case insensitive in the wild should do.

and I, @johnhawkinson said:

Probably this is another issue to open, but it's very easy to make most extractors fail by uppercasing something. Compare:
pb3:Downloads jhawk$ youtube-dl -s  'http://abcnews.go.COM/ThisWeek/video/week-exclusive-irans-foreign-minister-zarif-20411932'
[generic] week-exclusive-irans-foreign-minister-zarif-20411932: Requesting header
WARNING: Falling back on generic information extractor.
[generic] week-exclusive-irans-foreign-minister-zarif-20411932: Downloading webpage
[generic] week-exclusive-irans-foreign-minister-zarif-20411932: Extracting information
pb3:Downloads jhawk$ youtube-dl -s  'http://abcnews.go.com/ThisWeek/video/week-exclusive-irans-foreign-minister-zarif-20411932'
[abcnews:video] Downloading Akamai AMP feed
[abcnews:video] 20411932: Downloading f4m manifest
[abcnews:video] 20411932: Downloading m3u8 information
pb3:Downloads jhawk$ 
Sticking to a "seen in the wild" standard masks a lot of bugs. I'll open an issue and submit a pull to at least make the YourExtractor sample code case-insentitive. But I do think the project probably can and should do better...

Finally @dstftw said:

That's simulated. No such URLs is seen in the wild so far and no one will ever intentionally upper case some part of it.

I disagree. The hostname part of a URL is by definition case-insensitive. Any extractor in youtube-dl that assumes the hostname has fixed case is buggy. And a few of them go to ugly contortions using character classes to try to be case-insensitive, like YoutubeIE:

                         (?:(?:(?:(?:\w+\.)?[yY][oO][uU][tT][uU][bB][eE](?:-nocookie)?\.com/|
                            youtu\.be|                                        # just youtu.be/xxxx

And yet ironically it doesn't allow http://YOUTU.BE (although those work anyhow, I think because of some very broad matching of the path component for Youtube).

Anyhow, the authority on this is RFC 1034: Domain Names - Concepts And Facilities, stating:

By convention, domain names can be stored with arbitrary case, but
domain name comparisons for all present domain functions are done in a
case-insensitive manner, assuming an ASCII character set, and a high
order zero bit

And also RFC3986: Uniform Resource Identifier (URI): Generic Syntax:

3.2.2.  Host

   The host subcomponent of authority is identified by an IP literal
   encapsulated within square brackets, an IPv4 address in dotted-
   decimal form, or a registered name.  The host subcomponent is case-
   insensitive.

See also RFC 1035, RFC 4343.

But that only goes so far: while the domain names are case-insensitive, the rest of the URLs are not.

But what is the risk of processing them case-insensitively? From youtube-dl's perspective, it means a URL might match _VALID_URL on the correct site but with a different case, like the extractor for https://www.youtube.com/watch?v=d9TpRfDdyU0 might be triggered by https://www.youtube.com/WATCH?v=d9TpRfDdyU0.

But so what? At worst it means an extractor might be unnecessarily invoked in a few rare cases, which is a fair thing to trade to have it work in more places.

Any any website that has different video content at /ABC and /abc where both need to work is going to need careful attention to this in the extractor anyhow. Although I'm skeptical such sites exist.

Anyhow, the compromise proposal is to just change the README.md and CONTRIBUTING.md examples such that they recommend using (?i) in regexps, so that new extractors are case insensitive. I'll submit a pull request.

I guess we could also go in en masse and prefix most VALID_URI entries with (?i) and see what breaks, if anything?

Thanks.

yan12125 commented 8 years ago

Any any website that has different video content at /ABC and /abc where both need to work is going to need careful attention to this in the extractor anyhow. Although I'm skeptical such sites exist.

I know that RFC and I agree that URL matching should follow it, but it's bad to leave bugs in codes to accomplish a goal.

Possible correct solutions can be:

Create two expressions, like _VALID_URL_HOSTNAME and _VALID_URL_PATH. The former is matched with re.I and the latter not.
Python 3.6 supports local modifiers [1]. If one day youtube-dl supports Python 3.6+ only, we can use it. Using the regex module [2] is another possibility before waiting for Python 3.6+.

[1] https://docs.python.org/3.6/whatsnew/3.6.html#re [2] https://pypi.python.org/pypi/regex

johnhawkinson commented 8 years ago

I know that RFC and I agree that URL matching should follow it, but it's bad to leave bugs in codes to accomplish a goal.

This argument cuts both ways. Leaving the code as it is now, there are approximately 932 bugs we would be leaving:

pb3:extractor jhawk$ grep '_VALID_URL =' *.py | fgrep -v '(?i)' | wc -l
     932

It's not clear how many bugs we might create if we fixed those 932 bugs by making the regexp match case-insensitive, but the net result would be a lot more fixes.

Another solution might be:

Downcase the hostname portion of the URL into lowercase before doing the regular expression comparison. (This is hard in practice because hundreds of extractors use the pattern

    def _real_extract(self, url):
        mobj = re.match(self._VALID_URL, url)
        video_id = mobj.group('id')

instead of calling self._match_id(), so this would be a lot of churn, too. On the other hand, maybe that means that stuff should be converted to _match_id(). Although sometimes there's good reason not to, if you need to match other parameters? But perhaps that means the _match_id() abstraction isn't general enough.)

Anyhow, it is clear this is not a pressing problem.

yan12125 commented 8 years ago

But perhaps that means the _match_id() abstraction isn't general enough.

Yes this function should be improved.

ytdl-org / youtube-dl

Hostnames should be case-insensitive, but most extractors ignore that. #10882