Closed Microeinstein closed 4 years ago
Thanks for the PR. The changes are too substantial for an unannounced PR, so I can't promise when I'll have time to do the review. At a glance I can get behind some changes but not all, and everything being in a single commit makes reviewing and accepting some changes extremely hard. It looks like you squashed many commits? Please separate them if possible (otherwise I might find the task dreadful enough and just do a clean room reimplementation...). Also, all code should be formatted with black. Some of the changes seem to do nothing but break the formatting (e.g., changes to argparse code), it would be nice to format with black and drop those distractions.
If the requested changes (to make things easier for me) are too time consuming for you then unfortunately it might end up taking a long time for me to get back to you... (I'm certainly not saying you're obligated to change anything. In case this sounds passive aggressive, that's not my intention, I'm just setting reasonable expectations.) Fortunately it's not hard to use your own fork in any setting (reference just in case you don't know how), so whatever you're trying to do shouldn't rely on my finding the time to review and merge the PR.
Some of the changes seem to do nothing but break the formatting
Oh, a little bit more on that: please don't make unnecessary changes like switching order of short and long options; this is my project so I'll switch back to my preference anyway, it'll just waste time for both of us.
The one unnecessary change I can get behind: I usually prefer qualified identifiers than bare imported identifiers (more readable without understanding context), meaning pathlib.Path
not Path
, but it seems I'm really using it a lot, and it's been a long time since pathlib
was introduced in py34, so sure, let's just import Path
alone.
I can't promise when I'll have time to do the review
Don't worry
It looks like you squashed many commits? Please separate them if possible
I didn't - I know, I should have... If I try to separate the commit now, I'll probably break something š
Fortunately it's not hard to use your own fork in any setting
Oh, nice to know!
so whatever you're trying to do shouldn't rely on my finding the time to review and merge the PR.
Yeah, you're absolutely right; I asked a pull request mainly because I thought it was the only way to distribute the modified code with pip
to final users; I don't have this urge anymore due to your suggestion, so merge only if you wish šš»
Btw I reverted some changes in particular as you asked, also the whole commit can be summarized in:
main()
into make_arguments()
__launch()
__handle_arguments()
ffmpeg_loglevel
progress_hooks
and pass them around also with verbosity
Add invoke()
to launch with arguments from external code, example:
from caterpillar import caterpillar
def hook(data):
print(data)
url = 'https://whatever'
out = 'video.mp4'
wd = 'video_temp'
hls_opts = {
'm3u8_url': url,
'output': Path(out),
'workdir': Path(wd),
'jobs': 4,
'quiet': 24,
'no_progress': True,
'ffmpeg_loglevel': 'fatal',
'progress_hooks': [lambda d: hook(d)] # lambda used to denote number of arguments
}
caterpillar.invoke(**hls_opts)
download_m3u8_playlist_or_variant
: use the first variant stream found, if a variant streams playlist has been founddownload_m3u8_segments
Thanks for making an effort.
I didn't
Ah that's what I dreaded. I can't merge all the changes in one commit even if I want to accept all of them verbatim... That would be terrible for blaming and bisecting in the future.
However, your effort isn't in vain, at least it's been made easier for me to understand the changes you want to see, so I think I'll treat them as feature requests instead and make requested changes separately.
Some thoughts:
Split main() into make_arguments() launch() handle_arguments()
Sure, could refactor out the parser. But I definitely won't use dunder unless it's absolutely necessary somehow (it never is).
Add ffmpeg_loglevel
It's nice to be able to suppress logging, but it's not that simple. Critical functionality actually relies on log level being at least info
: https://github.com/zmwangx/caterpillar/blob/f7c47e6f3b86b72b7ad461e07863566da5a762af/src/caterpillar/merge.py#L43-L74 so introducing an option like this that only affects one invocation (the final merge pass) but not all others is deceptive and confusing. I'd much rather select the log level automatically based on the existing verbosity level (controlled by repeatable -q
and -v
, as is standard), in fact I'm already suppressing logs based on that (see https://github.com/zmwangx/caterpillar/blob/f7c47e6f3b86b72b7ad461e07863566da5a762af/src/caterpillar/merge.py#L75-L79). So I'll do that if I can't think of any reason not to.
progress_hooks
I can add that, but your reporting is too elaborate for my taste. I'll implement a simpler reporting scheme instead, other logic should be in the caller as they see fit.
Add invoke() to launch with arguments from external code
Sure, I can add an API like that.
Add download_m3u8_playlist_or_variant: use the first variant stream found, if a variant streams playlist has been found
That's a useful addition. The nondescriptive remote0.m3u8
and remote1.m3u8
are a bit weird though, so I'll likely have to massage the implementation a bit.
To summarize, I'll probably implement these changes separately, but it's hard to give you committer credit when I'm not just fixing up minor things in your commits and signing them off. This makes me feel bad since I want to give you credit. So in the future, please consider scoping your commits as well as PRs (a series of PRs for unrelated things would make it a lot easier to accept or reject some).
I'll probably be able to find some time for this either today or within the next week, then I'll cut a release.
I asked a pull request mainly because I thought it was the only way to distribute the modified code with pip to final users;
Btw, just to be clear, I wasn't discouraging PRs, trying to upstream changes shows respect for an OSS project and I'm grateful for that (of course as I said scoped PRs would be a lot better execution-wise). Just saying you don't have to wait unnecessarily since I can't really promise anything (the other day I finally pushed a fix to an issue I promised to fix precisely a year ago in another project so you get the idea...)
That would be terrible for blaming and bisecting in the future.
I'll learn how to use git properly š
But I definitely won't use dunder unless it's absolutely necessary somehow
I thought using double underscores to hide those function from external code - those functions should not be called directly.
Critical functionality actually relies on log level being at least
info
[...] in fact I'm already suppressing logs based on that
Oh..! I didn't notice that code; during some tests I passed 'quiet': 24
but I still got ffmpeg output, I probably missed something.
New idea then: add an argument to explicitly disable ffmpeg output.
I can add that, but your reporting is too elaborate for my taste.
Uh, ok; the final result I wanted to reach is this, but without rewriting the hook code for the progress bar.
The nondescriptive
remote0.m3u8
andremote1.m3u8
are a bit weird though
I couldn't find better names, following this logic:
remote0.m3u8
)remote1.m3u8
)remote1.m3u8
as final playlist to considerI'll probably implement these changes separately, but it's hard to give you committer credit
No problem šš» - feel free to do what you think best suits your needs.
just to be clear, I wasn't discouraging PRs
Yeah, I had no doubts on that; I mean, I probably would have said - to the final users of my project - to download from my fork and I would have asked for a pull request anyway.
I have implemented the feature requests I deem appropriate.
ffmpeg_loglevel: appropriate. Revamped in e3ebefc, should be completely consistent now.
You were seeing extraneous output despite filtering effort since previously suppression was based on a single (repurposed, or half-assed in this context) pattern, meaning a lot of stuff slip through the crack; and the final merging pass wasn't filtered at all, and you were only fixing that. The implementation here fixes everything.
progress_hooks: appropriate. Instead implemented more general, strictly typed event hooks in bbe9bb4. I wrote a simple adapter to the reporting scheme you implemented:
#!/usr/bin/env python3
import math
import pathlib
from caterpillar import caterpillar
from caterpillar.events import (
Event,
SegmentDownloadSucceededEvent,
SegmentDownloadFailedEvent,
SegmentsDownloadInitiatedEvent,
SegmentsDownloadFinishedEvent,
)
class ProgressReporter:
def __init__(self):
self.segment_count = 0
self.handled_count = 0
self.downloaded_bytes = 0
def event_hook(self, event: Event) -> None:
# An alternative to isinstance: check event.event_type against
# the EventType enum.
if isinstance(event, SegmentsDownloadInitiatedEvent):
self.segment_count = event.segment_count
return
elif isinstance(event, SegmentsDownloadFinishedEvent):
data = dict(
status="finished",
fragment_index=self.handled_count,
fragment_count=self.segment_count,
downloaded_bytes=self.downloaded_bytes,
# Warning: this is wrong, since errored segments haven't
# been counted.
total_bytes=self.downloaded_bytes,
)
elif isinstance(event, SegmentDownloadSucceededEvent):
self.handled_count += 1
self.downloaded_bytes += event.path.stat().st_size
data = dict(
status="downloading",
# Warning: caterpillar uses imap_unordered which returns
# results out of order, so calling it fragment_index is
# not accurate.
fragment_index=self.handled_count,
fragment_count=self.segment_count,
downloaded_bytes=self.downloaded_bytes,
total_bytes_estimate=math.ceil(
self.downloaded_bytes / self.handled_count * self.segment_count
),
)
elif isinstance(event, SegmentDownloadFailedEvent):
self.handled_count += 1
return
else:
return
...
def main():
reporter = ProgressReporter()
caterpillar.process_entry(
"...",
pathlib.Path("..."),
force=True,
keep=True,
event_hooks=[reporter.event_hook],
)
if __name__ == "__main__":
main()
You should probably consider implementing the event hook API natively though.
download_m3u8_playlist_or_variant: appropriate. Implemented variant streams support in e1dd54c. Instead of blindly selecting the first variant, I try to find the best variant. See variants.py
.
invoke: not appropriate. Reason: there's already caterpillar.caterpillar.process_entry
and caterpillar.caterpillar.process_batch
that basically do this, the two are different enough that an application programmer probably won't use both indiscriminately and need their own dispatch anyway (if they use both, which they probably don't), so adding another trivial dispatcher or two seems pointless.
I think those are all the substantial requests. Let me know if you have questions or need anything else. I'll cut a release if there are no additional requests.
Unrelated Python clear-up:
I thought using double underscores to hide those function from external code - those functions should not be called directly.
No, you were thinking about name mangling, which only applies to methods, not module functions and variables, and even then only makes access difficult, not impossible. There's a very narrow use case that you'll know when you really, really need it (I was exaggerating a bit when I said "it never is", but the point is if you just want to mark something private, you shouldn't do this). https://docs.python.org/3/tutorial/classes.html#private-variables
If you just want to "hide" module functions from help
and import *
, then a single underscore is enough.
A simple example you can try:
mod.py
:
__hidden_var = "nope, not hidden"
def __hidden_func():
print("nope, not hidden")
class Foo:
def __hidden_method(self):
print("nope, still here")
main.py
:
import mod
def main():
print("mod.__hidden_var:", end="\t")
print(mod.__hidden_var)
print("mod.__hidden_func:", end="\t")
mod.__hidden_func()
foo = mod.Foo()
print("mod.Foo.__hidden_method:", end="\t")
try:
foo.__hidden_method()
except AttributeError as e:
print(f"AttributeError: {e}")
print("mod.Foo._Foo__hidden_method:", end="\t")
foo._Foo__hidden_method()
if __name__ == "__main__":
main()
$ python3 main.py
mod.__hidden_var: nope, not hidden
mod.__hidden_func: nope, not hidden
mod.Foo.__hidden_method: AttributeError: 'Foo' object has no attribute '__hidden_method'
mod.Foo._Foo__hidden_method: nope, still here
Let me know if you have questions or need anything else. I'll cut a release if there are no additional requests.
I would have some questions but can't ask right now; I'm in a hurry
No, you were thinking about name mangling, which only applies to methods, not module functions and variables, and even then only makes access difficult, not impossible. [...]
Ah, I see; thank you for the clarification.
Iām not in a hurry to cut a release, so no rush.
I released v1.1 while I still remember this... Closing, but feel free to ask questions.
ffmpeg_loglevel
- determine loglevel by verbosity level;
download_m3u8_playlist_or_variant
- try to find the best variant
I find this designs way better than mine šš»
progress_hooks
appropriate. Instead implemented more general
Uhh, ok I guess; it would be nice adding that ProgressReporter
snippet (or something like that) in the code [without using it directly], to avoid unnecessary copy&paste and re-implementations in dependents code
class ProgressReporter:
STATUS_DOWNLOADING = "downloading"
STATUS_FINISHED = "finished"
def __init__(self, callback: Callable):
self.segment_count = 0
self.handled_count = 0
self.downloaded_bytes = 0
self.callback = callback
def event_hook(self, event: Event) -> None:
status = ProgressReporter.STATUS_DOWNLOADING # default
# An alternative to isinstance: check event.event_type against
# the EventType enum.
if isinstance(event, SegmentsDownloadInitiatedEvent):
self.segment_count = event.segment_count
elif isinstance(event, SegmentDownloadSucceededEvent):
self.handled_count += 1
self.downloaded_bytes += event.path.stat().st_size
elif isinstance(event, SegmentDownloadFailedEvent):
self.handled_count += 1
elif isinstance(event, SegmentsDownloadFinishedEvent):
status = ProgressReporter.STATUS_FINISHED
if status is not None:
estimate = (
self.downloaded_bytes / self.handled_count * self.segment_count
if self.handled_count > 0
else 0
)
data = dict(
status=status,
handled_count=self.handled_count,
segment_count=self.segment_count,
downloaded_bytes=self.downloaded_bytes,
total_bytes_estimate=math.ceil(estimate),
)
self.callback(data)
invoke
not appropriate. Reason: there's already
caterpillar.caterpillar.process_entry
Ah I see; I wrote it to avoid rewriting the default values of certain arguments (and let argparse handle possibly future arguments of exclusive groups)
Everything is explained in the commits message; after merge, can this version be tagged to
1.0.2
or something please? I made a bot to download my university lessons, and this edits would be very useful to replaceyoutube-dl
and to download them in parallel!Thank you šš»