ERROR: Unable to extract playables info when trying to download from Beatport

philszalay commented 1 year ago

Checklist

[ x ] I've verified that I'm running youtube-dl version 2021.12.17
[ x ] I've checked that all provided URLs are alive and playable in a browser
[ x ] I've checked that all URLs and arguments with special characters are properly quoted or escaped
[ x ] I've searched the bugtracker for similar bug reports including closed ones
[ x ] I've read bugs section in FAQ

Verbose log

[debug] System config: []
[debug] User config: []
[debug] Custom config: []
[debug] Command-line args: ['--verbose', 'https://www.beatport.com/track/dont-care/16624764']
[debug] Encodings: locale UTF-8, fs utf-8, out utf-8, pref UTF-8
[debug] youtube-dl version 2021.12.17
[debug] Git HEAD: b8b46501e
[debug] Python version 3.11.4 (CPython) - macOS-12.6-arm64-arm-64bit
[debug] exe versions: none
[debug] Proxy map: {}
[Beatport] dont-care: Downloading webpage
ERROR: Unable to extract playables info; please report this issue on https://yt-dl.org/bug . Make sure you are using the latest version; see  https://yt-dl.org/update  on how to update. Be sure to call youtube-dl with the --verbose flag and include its complete output.
Traceback (most recent call last):
  File "/opt/homebrew/Cellar/youtube-dl/2021.12.17/libexec/lib/python3.11/site-packages/youtube_dl/YoutubeDL.py", line 815, in wrapper
    return func(self, *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/homebrew/Cellar/youtube-dl/2021.12.17/libexec/lib/python3.11/site-packages/youtube_dl/YoutubeDL.py", line 836, in __extract_info
    ie_result = ie.extract(url)
                ^^^^^^^^^^^^^^^
  File "/opt/homebrew/Cellar/youtube-dl/2021.12.17/libexec/lib/python3.11/site-packages/youtube_dl/extractor/common.py", line 534, in extract
    ie_result = self._real_extract(url)
                ^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/homebrew/Cellar/youtube-dl/2021.12.17/libexec/lib/python3.11/site-packages/youtube_dl/extractor/beatport.py", line 50, in _real_extract
    self._search_regex(
  File "/opt/homebrew/Cellar/youtube-dl/2021.12.17/libexec/lib/python3.11/site-packages/youtube_dl/extractor/common.py", line 1012, in _search_regex
    raise RegexNotFoundError('Unable to extract %s' % _name)
youtube_dl.utils.RegexNotFoundError: Unable to extract playables info; please report this issue on https://yt-dl.org/bug . Make sure you are using the latest version; see  https://yt-dl.org/update  on how to update. Be sure to call youtube-dl with the --verbose flag and include its complete output.

Description

When I try to download a track from Beatport with the command youtube-dl --verbose https://www.beatport.com/track/dont-care/16624764, I get the error shown above. In my opinion this should work without any problems.

dirkf commented 1 year ago

It should, but apparently (I have no knowledge of the site beyond what is in the extractor code) the site has been reworked with Next.js, and the targets sought by the extractor, such as the JS variable window.Playables named in the error above, no longer exist.
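For context, Next.js pages normally ship their page state as JSON inside a `<script id="__NEXT_DATA__">` element, so that is where the track data has to be read from now. The following is only a standard-library sketch of that lookup, assuming the track sits under props.pageProps.track as the rewrite below expects. The sample page, its values and both function names are made up for illustration and are not youtube-dl's API.

```python
import json
import re

# Hypothetical, heavily trimmed page: the id, name and URL are placeholders.
SAMPLE_PAGE = '''<html><body>
<script id="__NEXT_DATA__" type="application/json">
{"props": {"pageProps": {"track": {"id": 123, "name": "Example Track",
 "sample_url": "https://example.invalid/sample.mp3"}}}}
</script>
</body></html>'''


def find_next_data(webpage):
    """Return the parsed __NEXT_DATA__ JSON, or None if it is missing or invalid."""
    m = re.search(
        r'(?s)<script[^>]+id=["\']__NEXT_DATA__["\'][^>]*>(.+?)</script>',
        webpage)
    if not m:
        return None
    try:
        return json.loads(m.group(1))
    except ValueError:
        return None


def find_track(next_data, track_id):
    """Walk props -> pageProps -> track and check that the id matches."""
    node = next_data
    for key in ('props', 'pageProps', 'track'):
        if not isinstance(node, dict):
            return None
        node = node.get(key)
    if isinstance(node, dict) and node.get('id') == int(track_id):
        return node
    return None


if __name__ == '__main__':
    track = find_track(find_next_data(SAMPLE_PAGE), '123')
    print(track['sample_url'] if track else 'no matching track')
```

Both helpers return None instead of raising, which mirrors the fallback idea in the patch: if the Next.js data is absent, the caller can still try the old extraction path.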

This rewrite against the master branch passes the tests.

--- old/youtube_dl/extractor/beatport.py
+++ new/youtube_dl/extractor/beatport.py
@@ -1,23 +1,35 @@
 # coding: utf-8
 from __future__ import unicode_literals

-import re
-
 from .common import InfoExtractor
 from ..compat import compat_str
-from ..utils import int_or_none
+from ..utils import (
+    determine_ext,
+    int_or_none,
+    join_nonempty,
+    merge_dicts,
+    parse_iso8601,
+    T,
+    traverse_obj,
+    txt_or_none,
+    unified_strdate,
+    url_or_none,
+    variadic,
+)

 class BeatportIE(InfoExtractor):
     _VALID_URL = r'https?://(?:www\.|pro\.)?beatport\.com/track/(?P<display_id>[^/]+)/(?P<id>[0-9]+)'
     _TESTS = [{
         'url': 'https://beatport.com/track/synesthesia-original-mix/5379371',
-        'md5': 'b3c34d8639a2f6a7f734382358478887',
+        'md5': 'cfcc245aafcad52a837b2c5a60a472c9',
         'info_dict': {
             'id': '5379371',
             'display_id': 'synesthesia-original-mix',
-            'ext': 'mp4',
+            'ext': 'mp3',
             'title': 'Froxic - Synesthesia (Original Mix)',
+            'timestamp': 1397854513,
+            'upload_date': '20140428',
         },
     }, {
         'url': 'https://beatport.com/track/love-and-war-original-mix/3756896',
@@ -27,20 +39,86 @@
             'display_id': 'love-and-war-original-mix',
             'ext': 'mp3',
             'title': 'Wolfgang Gartner - Love & War (Original Mix)',
+            'timestamp': 1346195831,
+            'upload_date': '20120917',
         },
     }, {
         'url': 'https://beatport.com/track/birds-original-mix/4991738',
-        'md5': 'a1fd8e8046de3950fd039304c186c05f',
+        'md5': '2dff00955b13c182931a708d979801b6',
         'info_dict': {
             'id': '4991738',
             'display_id': 'birds-original-mix',
-            'ext': 'mp4',
+            'ext': 'mp3',
             'title': "Tos, Middle Milk, Mumblin' Johnsson - Birds (Original Mix)",
+            'timestamp': 1386121876,
+            'upload_date': '20131209',
         }
     }]

     def _real_extract(self, url):
-        mobj = re.match(self._VALID_URL, url)
+        mobj = self._match_valid_url(url)
+        track_id, display_id = mobj.group('id', 'display_id')
+
+        webpage = self._download_webpage(url, display_id)
+
+        next_data = self._search_nextjs_data(webpage, display_id, fatal=False)
+        if not next_data:
+            return self._old_real_extract(url)
+
+        track = traverse_obj(
+            next_data,
+            ('props', 'pageProps', lambda k, v: k == 'track' and v['id'] == int(track_id)),
+            get_all=False)
+
+        title = track['name']
+        artists = ', '.join(traverse_obj(track, ('artists', Ellipsis, 'name', T(txt_or_none)))) or None
+        title = join_nonempty(artists, title, delim=' - ')
+        title = join_nonempty(
+            title, traverse_obj(track, ('mix_name', T(lambda s: '(' + s + ')'))),
+            delim=' ')
+
+        formats = []
+        # next.js page has <= 1 sample URL
+        f_url = traverse_obj(track, ('sample_url', T(url_or_none)))
+        if f_url:
+            ext = determine_ext(f_url)
+            fmt = {
+                'url': f_url,
+                'ext': ext,
+                'format_id': ext,
+                'vcodec': 'none',
+            }
+            if ext == 'mp3':
+                fmt['preference'] = 0
+                fmt['acodec'] = 'mp3'
+                fmt['abr'] = 96
+                fmt['asr'] = 44100
+            elif ext == 'mp4':
+                fmt['preference'] = 1
+                fmt['acodec'] = 'aac'
+                fmt['abr'] = 96
+                fmt['asr'] = 44100
+            formats.append(fmt)
+        self._sort_formats(formats)
+
+        return merge_dicts({
+            'id': track_id,
+            'display_id': display_id,
+            'title': title,
+            'formats': formats,
+            'artists': artists,
+        }, traverse_obj(track, {
+            'disc_number': ('catalog_number', T(int_or_none)),
+            'timestamp': ('encoded_date', T(parse_iso8601)),
+            'categories': ('genre', 'name', T(txt_or_none), T(variadic)),
+            'thumbnail': ('image', 'uri', T(url_or_none)),
+            'upload_date': (('new_release_date', 'publish_date'), T(unified_strdate)),
+            'track_number': ('number', T(int_or_none)),
+            'album': ('release', 'name', T(txt_or_none)),
+        }, get_all=False))
+
+    def _old_real_extract(self, url):
+        mobj = self._match_valid_url(url)
         track_id = mobj.group('id')
         display_id = mobj.group('display_id')

@@ -48,8 +126,8 @@

         playables = self._parse_json(
             self._search_regex(
-                r'window\.Playables\s*=\s*({.+?});', webpage,
-                'playables info', flags=re.DOTALL),
+                r'(?s)window\.Playables\s*=\s*({.+?});', webpage,
+                'playables info'),
             track_id)

         track = next(t for t in playables['tracks'] if t['id'] == int(track_id))

The page offers sample audio extracted from the full track available for purchase. Does the site offer full downloads with login or otherwise? Also, the old site offered AAC but the test URL that did so now only has the MP3 sample.

philszalay commented 1 year ago

@dirkf thank you! I can confirm that it works now. When logged in and with a subscription it is possible to listen to the full tracks. Do you know if a download with a login is possible atm? If I provide USERNAME and PASSWORD I still get the sample audio.

dirkf commented 1 year ago

The extractor doesn't know how to log in using --password .../-p ... etc. You could try passing --cookies ... from your logged-in browser session, but the patch above explicitly asks for the sample_url. Similarly, the old code only fetched the preview tracks.

If full tracks are available when logged in, it should be possible to extract them. A user with a login would have to analyse the data, or share the login details, or provide the output of --write-pages when logged-in cookies are supplied.
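One way to collect that output is sketched below: a small helper that assembles the youtube-dl call with --cookies and --write-pages. The helper name and the cookies.txt file name are assumptions for illustration; the cookie file would have to be exported from the logged-in browser session in Netscape format.

```python
def build_debug_command(url, cookie_file):
    """Assemble a youtube-dl call that reuses browser cookies and dumps pages."""
    return [
        'youtube-dl',
        '--verbose',
        '--cookies', cookie_file,  # cookie file exported from the browser
        '--write-pages',           # save the downloaded pages for analysis
        url,
    ]


if __name__ == '__main__':
    cmd = build_debug_command(
        'https://www.beatport.com/track/dont-care/16624764', 'cookies.txt')
    # Print the command; pass the list to subprocess.call() to actually run it
    # (that needs youtube-dl on PATH and network access).
    print(' '.join(cmd))
```

The dumped pages can contain session data, so they should be checked before being attached to a public issue.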

dirkf commented 1 year ago

You need to install the master, or nightly, code. join_nonempty() is one of quite a few new utility functions added since 2021-12.
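To show what the patch relies on, here is a rough stand-in for the helper, assuming it joins the non-empty values with a delimiter (defaulting to '-'). The names join_nonempty_sketch and build_title are hypothetical, the code is Python 3 only, and the track fields in the example are assumed from the test title in the patch, not taken from the site.

```python
def join_nonempty_sketch(*values, delim='-'):
    """Join the truthy values as strings, skipping None and empty ones."""
    return delim.join(str(v) for v in values if v)


def build_title(track):
    """Build 'Artists - Name (Mix)' the way the patched extractor does."""
    artists = ', '.join(
        a['name'] for a in (track.get('artists') or []) if a.get('name')
    ) or None
    title = join_nonempty_sketch(artists, track.get('name'), delim=' - ')
    mix = track.get('mix_name')
    return join_nonempty_sketch(title, '(%s)' % mix if mix else None, delim=' ')


if __name__ == '__main__':
    # Assumed field values, chosen to reproduce the test title in the patch.
    print(build_title({
        'name': 'Synesthesia',
        'mix_name': 'Original Mix',
        'artists': [{'name': 'Froxic'}],
    }))
```

Because missing parts are skipped, a track without artists or a mix name still gets a clean title with no stray separators.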
