ytdl-org / youtube-dl

Command-line program to download videos from YouTube.com and other video sites
http://ytdl-org.github.io/youtube-dl/
The Unlicense

Allow including additional video properties in json output, without downloading videos | YouTube #30692

Closed kwap closed 2 years ago

kwap commented 2 years ago

Description

I'm a macOS user. (The debug output from the -v parameter is included below; it is the same for all of the commands and parameters I've tried.)

When I run the following command in terminal: youtube-dl https://www.youtube.com/c/aliabdaal/videos\?view\=0\&sort\=da\&flow\=grid --skip-download --dump-json -v | cat > videos-too-much-information.txt

I get a large file that contains a lot of information I don't need.

On the other hand, if I run youtube-dl https://www.youtube.com/c/aliabdaal/videos?view=0&sort=da&flow=grid --skip-download --dump-json --flat-playlist | cat > videos-too-little-information.txt for every video in the playlist I get a line in an output file, looking like this:

{"_type": "url", "ie_key": "Youtube", "id": "XcZnSSmeK2I", "url": "XcZnSSmeK2I", "title": "How to prepare for BMAT Section 2 Physics, even if you're not doing it at A-Level | BMAT Tips series", "description": null, "duration": null, "view_count": 31077, "uploader": null}

which does not contain the information I needed.

I tried using the -o parameter to format the output strings, like so: youtube-dl https://www.youtube.com/c/aliabdaal/videos?view=0&sort=da&flow=grid --skip-download --dump-json -o "%(id)s | %(name)s | %(title)s | %(release_date)s | %(duration)s | %(view_count)s | %(like_count)s | %(dislike_count)s | %(repost_count)s | %(average_rating)s | %(comment_count)s"

but that only produced a large file with too much information, with "_filename": "NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA" ... added for each video in the playlist.

Proposition

Introduce an additional parameter, -jo (JSON output template), so that when I run the command

youtube-dl https://www.youtube.com/c/aliabdaal/videos?view=0&sort=da&flow=grid --skip-download --dump-json -jo "%(id)s | %(name)s | %(title)s | %(release_date)s | %(duration)s | %(view_count)s | %(like_count)s | %(dislike_count)s | %(repost_count)s | %(average_rating)s | %(comment_count)s" | cat > desired_output.txt

then for each video passed as a parameter (or each video from a playlist passed as a parameter), youtube-dl will attempt to get the values of all the properties specified in the JSON output template. If a property is present and has a value, that value is returned as a string; if the property has no value, an empty string is returned; if the property doesn't exist, null is returned.
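The three cases in this proposal can be sketched in Python as follows. This is only an illustration of the requested behaviour, not anything youtube-dl implements: the function name `project` is hypothetical, and treating a JSON null as "has no value" is an assumption.

```python
import json


def project(info, fields):
    """Apply the proposed rules to one metadata dict (hypothetical helper)."""
    out = {}
    for name in fields:
        if name not in info:
            out[name] = None            # property doesn't exist -> null
        elif info[name] is None:
            out[name] = ""              # present but without a value (assumed: JSON null)
        else:
            out[name] = str(info[name])  # present with a value -> string
    return out


# Trimmed-down entry modelled on the --flat-playlist line shown earlier.
entry = {"id": "XcZnSSmeK2I", "duration": None, "view_count": 31077}
print(json.dumps(project(entry, ["id", "duration", "view_count", "name"])))
# prints {"id": "XcZnSSmeK2I", "duration": "", "view_count": "31077", "name": null}
```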

So, specifically, for the video from the example with not enough information, if I ran the command with the new -jo switch and provided the template as above, the result for each video in file desired_output.txt would look like this:

{"_type": "url", "ie_key": "Youtube", "id": "XcZnSSmeK2I", "url": "XcZnSSmeK2I", "title": "How to prepare for BMAT Section 2 Physics, even if you're not doing it at A-Level | BMAT Tips series", "release_date":0170623", "description": "My online BMAT video course (75+ videos) = https://courses.aliabdaal.com/bmat-crash-course-online\n\nToday's video tackles the approach to physics, which is arguably the most feared part of the BMAT, especially given that most medical applicants don't do physics at A-level. I talk about why you shouldn't ignore the physics questions, and give some tips about the order in which to learn stuff from the assumed knowledge guide, and then some tips about how to practice.\n\nUseful Links:\n\nBMAT Ninja - https://bmat.ninja - 1,200+ free questions that you can do online. You can pay \u00a329 for access to the worked solutions written by Oxbridge medical students, or you can apply for one of our bursaries (we give out hundreds of those each year).\n\nOfficial Section 2 Assumed Knowledge Guide - http://www.admissionstestingservice.org/for-test-takers/bmat/preparing-for-bmat/overlay.html\n\nBBC Bitesize - http://www.bbc.co.uk/education/subjects/zpm6fg8", "duration": "287", "view_count": 31077, "like_count":"619", "uploader": "Ali Abdaal", "name":null, "release_date":null, "dislike_count":null, "repost_count":null, "average_rating":""}

[debug] System config: []
[debug] User config: []
[debug] Custom config: []
[debug] Command-line args: ['https://www.youtube.com/c/aliabdaal/videos?view=0&sort=da&flow=grid', '--skip-download', '--dump-json', '-v']
[debug] Encodings: locale UTF-8, fs utf-8, out utf-8, pref UTF-8
[debug] youtube-dl version 2021.12.17
[debug] Git HEAD: 2dc375acc
[debug] Python version 3.10.2 (CPython) - macOS-12.2.1-arm64-arm-64bit
[debug] exe versions: ffmpeg 5.0, ffprobe 5.0
[debug] Proxy map: {}

dirkf commented 2 years ago

Use jq to filter the JSON output.

kwap commented 2 years ago

Use jq to filter the JSON output.

Would you mind sharing a short code sample on how to approach this? Of course not doing it for me, just a little hint on how to do it?

Lee-Carre commented 2 years ago

@kwap

Use jq to filter the JSON output.

Would you mind sharing a short code sample on how to approach this? Of course not doing it for me, just a little hint on how to do it?

It depends what you're trying to do. From my skimming of your OP, I didn't see a clear description of what output you're seeking or would find useful.

Plus, the JSON output is different between videos, playlists, & channels.

jq can do a whole lot more than merely filtering which elements are included / excluded. It can reprocess the output. For (a simple) example, converting duration (which is an integer count of seconds) into the more familiar HH:MM:SS format. Some of my more elaborate & adventurous filtersets for jq (in the context of youtube-dl) output various statistics for a given playlist.
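As a minimal illustration of the duration conversion mentioned above, here is the same idea sketched in Python rather than jq. The function name `format_duration` is hypothetical, and passing null through unchanged is an assumption made because flat-playlist entries can carry "duration": null.

```python
def format_duration(seconds):
    """Render a count of seconds as HH:MM:SS (hypothetical helper)."""
    if seconds is None:
        return None  # assumed handling for entries without a duration
    hours, remainder = divmod(int(seconds), 3600)
    minutes, secs = divmod(remainder, 60)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}"


print(format_duration(287))  # prints 00:04:47
```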

There's plenty of documentation about jq on its website. I suggest reading it. If you can learn how to use youtube-dl then jq shouldn't be a problem, either.

Beyond that, jq is popular enough that you can search Q&A sites for solutions to specific cases, to use as inspiration.

Unless you mean how to use it as a CLI tool (passing the output of youtube-dl to jq), in which case you need a more general guide on using the command-line:

kwap commented 2 years ago

Use jq to filter the JSON output.

Would you mind sharing a short code sample on how to approach this?

Personally I would not do this. After many years, I learned that sometimes it's easier just to write an actual program, instead of trying to learn a new command line tool. For example, you could do this:

youtube-dl --id --skip-download --write-info-json LQ3Mu8A7gjY

Then format like this:

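package main

import (
   "encoding/json"
   "fmt"
   "os"
)

func main() {
   // Read the metadata file written by --write-info-json.
   buf, err := os.ReadFile("LQ3Mu8A7gjY.info.json")
   if err != nil {
      panic(err)
   }
   // Decode into a generic map; JSON numbers arrive as float64,
   // so very large counts may print in exponent notation.
   var m map[string]interface{}
   if err := json.Unmarshal(buf, &m); err != nil {
      panic(err)
   }
   // Print the selected fields separated by " | ".
   fmt.Println(
      m["id"], "|", m["title"], "|", m["upload_date"], "|", m["duration"], "|",
      m["view_count"], "|", m["like_count"],
   )
}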

Result:

LQ3Mu8A7gjY | All of Me (John Legend) - Duranka Perera | 20160327 | 102 | 81006 | 1013

This is brilliant! I'll take it from there. Thank you so much, exactly what I needed to get going, I'll be able to wrap it up myself. Thank you!

kwap commented 2 years ago

Thanks for taking the time to reply. What I don't appreciate, though, is the tone of your reply, which I found snarky and condescending. Before asking for help I'd spent significant time working on the issue. I would be perfectly fine if my question remained unanswered (nobody is obliged to help me out).

I was explicit in what I was trying to achieve, right after "the result for each video in file desired_output.txt would look like this:". @89z clearly saw that.

Not mad at you, just wanted to get the facts straight. Cheers

@kwap

Use jq to filter the JSON output.

Would you mind sharing a short code sample on how to approach this? Of course not doing it for me, just a little hint on how to do it?

It depends what you're trying to do. From my skimming of your OP, I didn't see a clear description of what output you're seeking or would find useful.

Plus, the JSON output is different between videos, playlists, & channels.

jq can do a whole lot more than merely filtering which elements are included / excluded. It can reprocess the output. For (a simple) example, converting duration (which is an integer count of seconds) into the more familiar HH:MM:SS format. Some of my more elaborate & adventurous filtersets for jq (in the context of youtube-dl) output various statistics for a given playlist.

There's plenty of documentation about jq on its website. I suggest reading it. If you can learn how to use youtube-dl then jq shouldn't be a problem, either.

Beyond that, jq is popular enough that you can search Q&A sites for solutions to specific cases, to use as inspiration.

Unless you mean how to use it as a CLI tool (passing the output of youtube-dl to jq), in which case you need a more general guide on using the command-line:

dirkf commented 2 years ago

Not to address OP specifically, less experienced programmers may see every task as an opportunity for a new program. The experience of creating those programs can be a big part of learning whichever language and environment. After reaching some level of maturity, the realisation comes that every program is a potential maintenance problem, especially if the chosen platform is unstable, like Go, Rust, .Net, Java/ECMA/Script, or even C++ to some extent. These are the people who prefer POSIX shell scripts (apparently not our hosts, who originally ran GitHub in that way and then bought in some third-party tools), Perl if they have that weird bent, or Python for all the 2 vs 3 vs 3.6 vs 3.10 palaver.

Slightly more off-topic old joke

Or eventually they fall so low that this is the way to create a program:
From: pointy.haired@megacorp.example.com
To: wally@megacorp.example.com
Subject: Hello World

Please write me a Hello World program. I need it first thing tomorrow.

As yt-dl is being run by Python, that would probably be the natural choice if you were going to write a separate program to process JSON written by yt-dl. Or you could embed yt-dl as a module in your own program, so that the intermediate JSON is never written to a file or pipe. Obviously, the yt-dl codebase is full of examples of JSON processing.
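A rough sketch of the embedding approach, assuming the youtube_dl package is installed. `youtube_dl.YoutubeDL` and `extract_info(url, download=False)` are the package's embedding interface; the helper names `summarize` and `iter_video_infos` are hypothetical, and the handling of playlists assumes a single, un-nested level of entries, which may not hold for every channel URL.

```python
def summarize(info, fields=("id", "title", "upload_date", "duration",
                            "view_count", "like_count")):
    """Join the chosen fields of one info dict with ' | ' (hypothetical helper)."""
    return " | ".join(str(info.get(name)) for name in fields)


def iter_video_infos(url):
    """Yield one info dict per video without downloading (hypothetical helper)."""
    import youtube_dl  # third-party; imported lazily so summarize() works without it

    with youtube_dl.YoutubeDL({"quiet": True, "skip_download": True}) as ydl:
        info = ydl.extract_info(url, download=False)
    # Assumption: a playlist result carries its videos in a flat "entries" list.
    entries = (info.get("entries") or []) if info.get("_type") == "playlist" else [info]
    for entry in entries:
        if entry:
            yield entry


# Offline demonstration using the values from the result shown earlier in the thread.
sample = {"id": "LQ3Mu8A7gjY", "title": "All of Me (John Legend) - Duranka Perera",
          "upload_date": "20160327", "duration": 102,
          "view_count": 81006, "like_count": 1013}
print(summarize(sample))
```

With network access, something like `for info in iter_video_infos(url): print(summarize(info))` would print one line per video; fields missing from a given info dict show up as None.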

Would you mind sharing a short code sample on how to approach this? Of course not doing it for me, just a little hint on how to do it?

jq is well documented and examples of use are not unknown to major web search engines. Having said that, this

youtube-dl -j -o - 'https://www.youtube.com/c/aliabdaal/videos?view=0&sort=da&flow=grid' 2>&1 | jq 'select(has("formats"))|{_type, ie_key, id, url, title, release_date, description, duration, view_count, like_count, uploader, name, dislike_count, repost_count, average_rating}'

might produce the sort of output desired. To unpack it:

BTW ... | cat > ... is a long way, including an extra process, of doing ... > ....
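For comparison, the select-and-project step of that jq filter might look like this in Python. It is a sketch under assumptions, not a drop-in replacement: the function name `filter_lines` and the sample lines are hypothetical, and lines that are not JSON objects are skipped on the assumption that they are log output (with 2>&1 in the pipeline, stderr lines arrive on the same stream as the JSON).

```python
import json

FIELDS = ["_type", "ie_key", "id", "url", "title", "release_date", "description",
          "duration", "view_count", "like_count", "uploader", "name",
          "dislike_count", "repost_count", "average_rating"]


def filter_lines(lines, fields=FIELDS):
    """Yield the chosen fields of every JSON object that has a "formats" key."""
    for line in lines:
        line = line.strip()
        if not line.startswith("{"):
            continue  # assumed to be log output rather than JSON
        try:
            obj = json.loads(line)
        except ValueError:
            continue  # not valid JSON after all
        if isinstance(obj, dict) and "formats" in obj:
            # Like jq's {id, title, ...}, missing keys come out as null.
            yield {name: obj.get(name) for name in fields}


# Hypothetical input: one log line, one full video record, one flat entry.
sample_lines = [
    "[debug] youtube-dl version 2021.12.17",
    '{"id": "abc", "title": "t", "formats": []}',
    '{"_type": "url", "id": "xyz"}',
]
for record in filter_lines(sample_lines, ["id", "title", "duration"]):
    print(json.dumps(record))
# prints {"id": "abc", "title": "t", "duration": null}
```

To use it on real output, the same function could be fed `sys.stdin` instead of the sample list.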

Lee-Carre commented 2 years ago

#ProgrammerFight

{Starts handing out popcorn} 😋


More seriously; some of the commenters may appreciate a read of The Art of Unix Programming (by Eric Raymond, author of Cathedral & Bazaar).

dirkf commented 2 years ago

Avoiding the #ProgrammerFight, because there are always pros and cons and judgement calls,

... like Bash versus Zsh, Ksh93, Yash, or even Pdksh to some extent

that's why you write using the language defined in a mature version of POSIX.1 and validate with shellcheck.

Lee-Carre commented 2 years ago

there are always pros and cons and judgement calls

That was essentially my own thinking, too.

There are no (absolute, one-true-way) solutions, only trade-offs.

To each (one's|his) own.

Hence, instead of authoring an essay-length analysis of such, I cited that excellent book, which addresses relevant concepts (programming strategy, of a sort; high-level software architectural design), which says it all better than I could.

dirkf commented 2 years ago

Or, with PR #30723:

youtube-dl 'https://www.youtube.com/c/aliabdaal/videos?view=0&sort=da&flow=grid' --print '%(id)s | %(title)s | %(description)s | %(urls)s | %(duration)s | %(view_count)s | %(like_count)s | %(uploader)s'
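To show how a %(field)s template of this kind can be filled in, here is a small Python sketch using ordinary %-formatting against a mapping that falls back to "NA", the placeholder seen in the "_filename" output earlier in the thread. It is not the code from PR #30723; the function name `render` and the sample values are hypothetical.

```python
import collections


def render(template, info, missing="NA"):
    """Fill a %(name)s template from an info dict (hypothetical helper)."""
    # Drop null values so they fall back to the placeholder as well.
    known = {key: value for key, value in info.items() if value is not None}
    fields = collections.defaultdict(lambda: missing, known)
    return template % fields


info = {"id": "XcZnSSmeK2I", "title": "Example title", "duration": None,
        "view_count": 31077}
print(render("%(id)s | %(title)s | %(duration)s | %(like_count)s", info))
# prints XcZnSSmeK2I | Example title | NA | NA
```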