Open johnhawkinson opened 8 years ago
And any website that has different video content at /ABC and /abc where both need to work is going to need careful attention to this in the extractor anyhow. Although I'm skeptical such sites exist.
I know that RFC, and I agree that URL matching should follow it, but it's bad to leave bugs in the code to accomplish a goal.
Possible correct solutions can be:
_VALID_URL_HOSTNAME and _VALID_URL_PATH. The former is matched with re.I and the latter not.
[1] https://docs.python.org/3.6/whatsnew/3.6.html#re
[2] https://pypi.python.org/pypi/regex
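To make the hostname/path split concrete, here is a minimal sketch of how it might look. The patterns, the example.com site, and the match_url() helper are all made up for illustration; nothing here is existing youtube-dl code.

```python
import re

# Hypothetical patterns for illustration; not taken from any real extractor.
_VALID_URL_HOSTNAME = r'https?://(?:www\.)?example\.com'
_VALID_URL_PATH = r'/watch/(?P<id>[0-9A-Za-z_-]+)'


def match_url(url):
    """Match scheme and hostname case-insensitively, then the path case-sensitively."""
    host = re.match(_VALID_URL_HOSTNAME, url, re.I)
    if host is None:
        return None
    # The remainder of the URL is matched without re.I, so case matters here.
    return re.match(_VALID_URL_PATH, url[host.end():])
```

With this, https://WWW.EXAMPLE.COM/watch/AbC matches and keeps the id as AbC, while https://www.example.com/WATCH/AbC does not match.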
I know that RFC, and I agree that URL matching should follow it, but it's bad to leave bugs in the code to accomplish a goal.
This argument cuts both ways. Leaving the code as it is now, there are approximately 932 bugs we would be leaving:
pb3:extractor jhawk$ grep '_VALID_URL =' *.py | fgrep -v '(?i)' | wc -l
932
It's not clear how many bugs we might create if we fixed those 932 bugs by making the regexp match case-insensitive, but the net result would be a lot more fixes.
Another solution might be:
def _real_extract(self, url):
    mobj = re.match(self._VALID_URL, url)
    video_id = mobj.group('id')
instead of calling self._match_id(), so this would be a lot of churn, too. On the other hand, maybe that means that stuff should be converted to _match_id(). Although sometimes there's good reason not to, if you need to match other parameters. But perhaps that means the _match_id() abstraction isn't general enough.
Anyhow, it is clear this is not a pressing problem.
But perhaps that means the _match_id() abstraction isn't general enough.
Yes, this function should be improved.
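One direction such an improvement could take, as a rough sketch only: a helper that returns any number of named groups, so extractors that need more than the id don't have to call re.match themselves. ExampleIE, its pattern, and _match_groups() are hypothetical names invented for this illustration, not existing youtube-dl API.

```python
import re


class ExampleIE(object):
    # Hypothetical extractor-like class; the pattern is illustrative only.
    _VALID_URL = r'(?i)https?://(?:www\.)?example\.com/(?P<kind>clip|show)/(?P<id>\d+)'

    @classmethod
    def _match_groups(cls, url, *names):
        """Return the requested named groups from _VALID_URL as a tuple."""
        mobj = re.match(cls._VALID_URL, url)
        if mobj is None:
            raise ValueError('URL does not match _VALID_URL: %s' % url)
        return tuple(mobj.group(name) for name in names)
```

An extractor could then write kind, video_id = self._match_groups(url, 'kind', 'id'), and any case-handling policy would live in one place.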
Forking off this discussion from #10854, where @dstftw suggested that making all _VALID_URL checks case-insensitive was the wrong way to go:
I, @johnhawkinson replied:
@dstftw replied:
and I, @johnhawkinson said:
Finally @dstftw said:
I disagree. The hostname part of a URL is by definition case-insensitive. Any extractor in youtube-dl that assumes the hostname has fixed case is buggy. And a few of them go to ugly contortions using character classes to try to be case-insensitive, like YoutubeIE:
And yet ironically it doesn't allow http://YOUTU.BE (although those work anyhow, I think because of some very broad matching of the path component for Youtube).
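To illustrate the kind of contortion I mean, here is a made-up pattern (not YoutubeIE's actual regexp) that spells out both cases of each letter, next to the same pattern using (?i). The example.com hostname is an assumption for the sake of the example.

```python
import re

# Hypothetical patterns; example.com stands in for a real site.
# Character classes handle mixed case in the name, but the TLD was forgotten.
CONTORTED = r'https?://[eE][xX][aA][mM][pP][lL][eE]\.com/'
# The inline flag covers the whole pattern, TLD included.
FLAGGED = r'(?i)https?://example\.com/'

mixed_name = 'http://ExAmPlE.com/'
upper_tld = 'http://example.COM/'
```

Both patterns accept the mixed-case name, but only the (?i) version accepts the upper-case TLD, which is the same sort of gap as http://YOUTU.BE.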
Anyhow, the authority on this is RFC 1034: Domain Names - Concepts And Facilities, stating:
And also RFC 3986: Uniform Resource Identifier (URI): Generic Syntax:
See also RFC 1035, RFC 4343.
But that only goes so far: while the domain names are case-insensitive, the rest of the URLs are not.
But what is the risk of processing them case-insensitively? From youtube-dl's perspective, it means a URL might match _VALID_URL on the correct site but with a different case, like the extractor for https://www.youtube.com/watch?v=d9TpRfDdyU0 might be triggered by https://www.youtube.com/WATCH?v=d9TpRfDdyU0. But so what? At worst it means an extractor might be unnecessarily invoked in a few rare cases, which is a fair thing to trade to have it work in more places.
And any website that has different video content at /ABC and /abc where both need to work is going to need careful attention to this in the extractor anyhow. Although I'm skeptical such sites exist.
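For reference, the difference between a pattern-wide (?i) and limiting case-insensitivity to the scheme and hostname can be sketched as follows. The second pattern uses the scoped (?i:...) syntax, which needs Python 3.6 or later, so it may not be an option for code that has to run on older interpreters. Both patterns and the example.com hostname are invented for illustration.

```python
import re

# Hypothetical patterns; example.com stands in for a real site.
# Pattern-wide flag: the path is matched case-insensitively too.
GLOBAL_I = r'(?i)https?://(?:www\.)?example\.com/watch\?v=(?P<id>[\w-]+)'
# Scoped flag (Python 3.6+): only scheme and hostname ignore case.
SCOPED_I = r'(?i:https?://(?:www\.)?example\.com)/watch\?v=(?P<id>[\w-]+)'

upper_path = 'https://www.example.com/WATCH?v=d9TpRfDdyU0'
upper_host = 'https://WWW.EXAMPLE.COM/watch?v=d9TpRfDdyU0'
```

GLOBAL_I accepts both URLs, while SCOPED_I accepts the upper-case hostname but rejects /WATCH. In both cases the captured id keeps its original case.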
Anyhow, the compromise proposal is to just change the README.md and CONTRIBUTING.md examples such that they recommend using (?i) in regexps, so that new extractors are case-insensitive. I'll submit a pull request.
I guess we could also go in en masse and prefix most _VALID_URL entries with (?i) and see what breaks, if anything?
Thanks.