zyrouge / node-genius-lyrics

Simple lyrics fetcher that uses Genius. Also has official API implementations.
https://genius-lyrics.js.org
MIT License

HTML parsing error due to captcha #41

Open leecheeyong opened 1 year ago

leecheeyong commented 1 year ago

I got this error when I tried to use this module without an API key, in which case it scrapes the Genius website for lyrics.

undefined:1
<!DOCTYPE html>
^

SyntaxError: Unexpected token < in JSON at position 0
    at JSON.parse (<anonymous>)
    at SongsClient.<anonymous> (/home/runner/terminal/node_modules/genius-lyrics/dist/songs/client.js:47:37)
    at Generator.next (<anonymous>)
    at fulfilled (/home/runner/terminal/node_modules/genius-lyrics/dist/songs/client.js:5:58)
    at processTicksAndRejections (node:internal/process/task_queues:96:5)

const Genius = require("genius-lyrics");
const Client = new Genius.Client();

(async () => {
    const searches = await Client.songs.search("faded");

    const firstSong = searches[0];
    console.log("About the Song:\n", firstSong, "\n");
    console.log(firstSong);
})();
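
The trace suggests that an HTML document reached `JSON.parse`, which fails on the leading `<`. A minimal sketch of a guard that turns this into a clearer error follows. `parseJsonBody` is a hypothetical helper for illustration, not part of genius-lyrics.

```javascript
// Hypothetical helper (not part of genius-lyrics): parse a response body as
// JSON, but raise a clearer error when the server sent HTML instead.
function parseJsonBody(body) {
    const text = String(body).trimStart();
    if (text.startsWith("<")) {
        throw new Error(
            "Expected JSON but received HTML (possibly a captcha or challenge page)"
        );
    }
    return JSON.parse(text);
}
```

A caller could catch this error and decide whether to retry later or from another host.
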
zyrouge commented 1 year ago

Code?

leecheeyong commented 1 year ago
const Genius = require("genius-lyrics");
const Client = new Genius.Client();

(async () => {
    const searches = await Client.songs.search("faded");

    const firstSong = searches[0];
    console.log("About the Song:\n", firstSong, "\n");
    console.log(firstSong);
})();
leecheeyong commented 1 year ago

I made a request to the Genius lyrics page with axios, and the response begins with:

<!doctype html>
<html>
  <head>
    <title>Alan Walker – Faded Lyrics | Genius Lyrics</title>
zyrouge commented 1 year ago

That's weird, since the results are different. The error is probably due to a Cloudflare check, but your request from axios didn't trigger that check. Currently, there is no way of bypassing the Cloudflare check, but I'll see what I can do. I'll keep this open until then.

leecheeyong commented 1 year ago

I do believe that it isn't. I've dealt with many Cloudflare checks, and a Cloudflare check page normally begins with:

<!DOCTYPE html><html lang="en"><head>
    <meta charset="UTF-8">
    <meta http-equiv="X-UA-Compatible" content="IE=edge">
    <meta name="viewport" content="width=device-width, initial-scale=1.0">
    <title>Genius</title>

and the body content would have

 <div class="text alert">
        <h1>Scrrrr!!</h1>
        <div class="dek">Sorry, we have to make sure you're a human before we can show you this page.</div>
        <div class="cloudflare_content">
          <form id="challenge-form" class="challenge-form" action="/young-captain-na-li-dou-shi-ni-lyrics?__cf_chl_f_tk=osDcIXsJsHgwkalPyNRqy_UNjB9G_PMl3Ud7QnrFmwU-1672043083-0-gaNycGzNCBE" method="POST" enctype="application/x-www-form-urlencoded">
    <div id="cf-please-wait" style="display: none;">

A screenshot of the page:

[image]
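
A rough way to tell this challenge page apart from a real lyrics page is to look for the markers visible in the HTML above. This is only a heuristic sketch: the marker strings are taken from the snippet quoted in this comment, and `looksLikeChallengePage` is a hypothetical name.

```javascript
// Heuristic sketch: the marker strings come from the HTML quoted in this
// thread. The function name is hypothetical and the check is not exhaustive.
const CHALLENGE_MARKERS = ["challenge-form", "cf-please-wait", "__cf_chl_"];

function looksLikeChallengePage(html) {
    return CHALLENGE_MARKERS.some((marker) => html.includes(marker));
}
```

Checking the response body this way before parsing would let a scraper report "blocked by a challenge page" instead of failing with a parse error.
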
zyrouge commented 1 year ago

This could maybe be fixed with proper headers. I'll have to analyse it, though. Meanwhile, you could compare the headers sent by this library with the headers sent by your axios request and check the differences.
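
A minimal sketch of such a comparison, assuming both sets of request headers have already been captured as plain objects. `diffHeaders` is a hypothetical helper, and the header values below are made up for illustration.

```javascript
// Hypothetical helper: list the header names whose values differ between two
// requests. Header names are compared case-insensitively.
function diffHeaders(a, b) {
    const normalize = (headers) =>
        Object.fromEntries(
            Object.entries(headers).map(([k, v]) => [k.toLowerCase(), String(v)])
        );
    const left = normalize(a);
    const right = normalize(b);
    const names = new Set([...Object.keys(left), ...Object.keys(right)]);
    return [...names]
        .filter((name) => left[name] !== right[name])
        .sort()
        .map((name) => ({ name, left: left[name], right: right[name] }));
}

// Example values are invented for illustration only.
const differences = diffHeaders(
    { "User-Agent": "library-agent/1.0", Accept: "application/json" },
    { "user-agent": "axios/1.0.0", Accept: "application/json", "Accept-Encoding": "gzip" }
);
console.log(differences);
```

A header missing on one side shows up with `undefined` for that side.
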

leecheeyong commented 1 year ago

It is kind of weird, though. axios shouldn't work if the Cloudflare anti-scraping feature were turned on. This confuses me a lot.

leecheeyong commented 1 year ago

It doesn't seem like the headers matter:

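// Assumed imports: the thread says the request was made with axios.
const axios = require("axios");
const cheerio = require("cheerio");

axios({ url: "https://genius.com/young-captain-na-li-dou-shi-ni-lyrics" }).then((response) => {
    const { data } = response;
    const $ = cheerio.load(data);

    // Try the older ".lyrics" container first, then the "Lyrics__Container" divs.
    const selectors = [
        () => $(".lyrics").text().trim(),
        () =>
            $("div[class*='Lyrics__Container']")
                .toArray()
                .map((x) => {
                    const ele = $(x);
                    ele.find("br").replaceWith("\n");
                    return ele.text();
                })
                .join("\n")
                .trim(),
    ];

    for (const x of selectors) {
        const lyrics = x();
        if (lyrics?.length) {
            // removeChorus is not defined in this snippet; apply it only if it exists.
            return typeof removeChorus === "function" ? removeChorus(lyrics) : lyrics;
        }
    }
});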

This code sets no headers, so the request is sent with the default axios User-Agent.

AliAryanTech commented 1 year ago

+1, same here. I guess setting a 'User-Agent' header or using puppeteer would work, though.
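
For reference, a sketch of what setting a 'User-Agent' header could look like. It only builds a config object in the `{ url, method, headers }` shape that axios accepts. `buildRequestConfig` is a hypothetical helper, the User-Agent string is an arbitrary example, and whether this avoids the check is untested here.

```javascript
// Sketch only: builds a request config in the shape axios accepts
// ({ url, method, headers }). The User-Agent string is an arbitrary example.
function buildRequestConfig(url, userAgent) {
    return {
        url,
        method: "get",
        headers: { "User-Agent": userAgent },
    };
}

const config = buildRequestConfig(
    "https://genius.com/young-captain-na-li-dou-shi-ni-lyrics",
    "Mozilla/5.0 (X11; Linux x86_64)"
);
// With axios installed, this could be sent as: axios(config).then(...)
```
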

AliAryanTech commented 1 year ago

I found an API to use for it, and it's working fine.

leecheeyong commented 1 year ago

What do you mean by that ?

AliAryanTech commented 1 year ago

Just a delayed API that can scrape lyrics.

leecheeyong commented 1 year ago

Could you mention the API here? Thanks.

pendragons-code commented 1 year ago

As of 25-03-2023, a fix for the rate limit has not been implemented. Note that the issue only occurs when using an enterprise-like IP address; on something like a home network, it does not affect you at all.

Most of the common remedies have already been tried, so expect changing headers not to work.

I like this project and I really hope to see it succeed.

AliAryanTech commented 1 year ago

I fixed this problem; everything is working now.

Bibyyy commented 1 year ago

Hello, I have the same problem again when I host my bot on Replit. How can I fix it?

AliAryanTech commented 1 year ago

Use my API: https://weeb-api.vercel.app/docs (check the Misc section).

FelixSimonB commented 3 months ago

Why not just share what you did?

AliAryanTech commented 3 months ago

Here:

FelixSimonB commented 3 months ago

That's your API, not the fix. What I meant was: how did you fix the captcha error? Or did you use another package to do it?

I also tested your API, and the formatting of some lyrics is a bit off (https://weeb-api.vercel.app/lyrics?url=https://genius.com/Justin-bieber-never-say-never-lyrics), but thanks for the free public API 👍

zyrouge commented 3 months ago

There is no way to bypass the captcha unless you solve it. Using hosts that are not blacklisted would also avoid the captcha, which is how a proxy would help in this case. Maybe there is something I don't know of that enables bypassing it, perhaps something like a cookie.

leecheeyong commented 3 months ago

I think the conclusion to this issue would be the following:

  1. Avoid enterprise-like IPs
  2. Solve the captcha once in a while (when it occurs)
  3. Use services like 2captcha to solve the captcha
  4. Hop between IP addresses
  5. Try other APIs
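
As an illustration of option 4, here is a minimal sketch of rotating through a list of outbound proxies and skipping any that have already hit the captcha. `ProxyRotator` is a hypothetical class, the proxy identifiers are placeholders, and the actual request logic is left out.

```javascript
// Hypothetical sketch of "hop between IP addresses": cycle through a list of
// outbound proxies, skipping any that have been marked as blocked.
class ProxyRotator {
    constructor(proxies) {
        this.proxies = [...proxies];
        this.blocked = new Set();
        this.index = 0;
    }

    // Mark a proxy as blocked, e.g. after it received a captcha page.
    markBlocked(proxy) {
        this.blocked.add(proxy);
    }

    // Return the next usable proxy, or null when all of them are blocked.
    next() {
        for (let i = 0; i < this.proxies.length; i++) {
            const candidate = this.proxies[(this.index + i) % this.proxies.length];
            if (!this.blocked.has(candidate)) {
                this.index = (this.index + i + 1) % this.proxies.length;
                return candidate;
            }
        }
        return null;
    }
}
```

A caller would request through `next()`, call `markBlocked()` when the response turns out to be a captcha page, and fall back to another option from the list once `next()` returns `null`.
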