z7r1k3 / creeper

Web Crawler and Scraper
GNU General Public License v3.0

Unable to Crawl JS Dependent Sites #3

Open Texan1835 opened 4 years ago

Texan1835 commented 4 years ago

Ran this scraper exactly as written; the only change was pointing the log path to a .txt file in a Windows folder. It captures about half the email addresses on a given webpage, but never captures phone numbers. I'm running the code against an entire website, not just a single webpage, but the problem also occurs when I run it against a single webpage with multiple phone numbers listed. Environment: Windows 10, Python 3.8, PyCharm. Please note: I'm a newbie to Python, so it's possible the error is on my end.

Edit: I ran the scraper against this link because it lists lots of phone numbers and emails: https://www.hamradio.com/contact.cfm

Result:

`Crawling https://www.hamradio.com/contact.cfm

Emails:

Phone Numbers:

Process finished with exit code 0`

z7r1k3 commented 4 years ago

Are those phone numbers linked? I.e., if you view the source, does it have an href="tel:1234567890"?

Currently, only linked phones/emails are supported, but I do plan to eventually add support to search the entire page for anything that looks like a phone/email, linked or not.
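
For anyone wondering what "linked" means here, a rough sketch of pulling `tel:` and `mailto:` targets out of anchor tags follows. This is not the crawler's actual code; `LinkedContactParser` and `extract_linked_contacts` are hypothetical names, and it only uses Python's standard library.

```python
from html.parser import HTMLParser
from urllib.parse import unquote


class LinkedContactParser(HTMLParser):
    """Collects targets of mailto: and tel: hrefs (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.emails = []
        self.phones = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        # An href with no value comes through as None, so fall back to "".
        href = dict(attrs).get("href") or ""
        scheme, _, target = href.partition(":")
        scheme = scheme.lower()
        if scheme == "mailto":
            # Drop any ?subject=... query string after the address.
            self.emails.append(unquote(target.split("?", 1)[0]))
        elif scheme == "tel":
            self.phones.append(unquote(target))


def extract_linked_contacts(html):
    parser = LinkedContactParser()
    parser.feed(html)
    return parser.emails, parser.phones


sample = (
    '<a href="tel:1234567890">Call</a><br>'
    '<a href="mailto:info@example.com?subject=Hi">Mail</a><br>'
    "Phone: 555-0100"
)
emails, phones = extract_linked_contacts(sample)
print(emails, phones)  # ['info@example.com'] ['1234567890']
```

Note that the plaintext `Phone: 555-0100` in the sample is ignored, which matches the behavior described above.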

Texan1835 commented 4 years ago

It is not formatted like that. It uses br tags.

Code for phone looks like this on the webpage: `
Phone: 713-533-7373
Toll Free: 800-854-6046

`

Email code:

` anaheim@hamradio.com

`


z7r1k3 commented 4 years ago

Unfortunately the crawler doesn't support scraping for plaintext phones/emails yet, although that is on the to-do list. For now it has to be an actual tel or mailto link.
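
A plaintext search could look roughly like the sketch below. It is only an illustration of the idea, not what the crawler does today: the patterns are deliberately simple assumptions (US-style 10-digit phone numbers, a loose email shape) and `find_plaintext_contacts` is a hypothetical name. The `<br>` separators in the sample are also assumed, based on the description above.

```python
import re

# Loose email shape; intentionally not a full RFC 5322 matcher.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")

# US-style numbers such as 713-533-7373 or (713) 533-7373.
PHONE_RE = re.compile(
    r"(?<!\d)(?:\(\d{3}\)\s?|\d{3}[-.\s])\d{3}[-.\s]\d{4}(?!\d)"
)


def find_plaintext_contacts(text):
    """Return (emails, phones) found anywhere in the text, linked or not."""
    return EMAIL_RE.findall(text), PHONE_RE.findall(text)


text = (
    "Phone: 713-533-7373<br>"
    "Toll Free: 800-854-6046<br>"
    "anaheim@hamradio.com"
)
emails, phones = find_plaintext_contacts(text)
print(emails)  # ['anaheim@hamradio.com']
print(phones)  # ['713-533-7373', '800-854-6046']
```

Patterns like these will produce false positives and miss international formats, so they would need tuning before being relied on.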

As for that .cfm link, since .cfm isn't on the whitelist, the crawler treats it as an unsupported filetype. I'll go ahead and add it, but as a workaround you should be able to pass that link in as the original scraping URL. Is that what you tried, and did it still not work?
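
To illustrate the kind of check being described, here is a minimal sketch of an extension whitelist. The set contents and the `is_crawlable` name are assumptions for the example and are not taken from the crawler's source.

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse

# Hypothetical whitelist; "" covers URLs with no file extension at all.
ALLOWED_EXTENSIONS = {"", ".html", ".htm", ".php", ".asp", ".aspx", ".cfm"}


def is_crawlable(url):
    """Return True if the URL's path ends in a whitelisted extension."""
    path = urlparse(url).path
    suffix = PurePosixPath(path).suffix.lower()
    return suffix in ALLOWED_EXTENSIONS


print(is_crawlable("https://www.hamradio.com/contact.cfm"))  # True
print(is_crawlable("https://example.com/brochure.pdf"))      # False
```

With `.cfm` left out of the set, the first URL would be rejected, which is the behavior described above.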

z7r1k3 commented 4 years ago

After debugging, I found that the crawler is unable to view the webpage because the site requires JavaScript.

As such, it would appear this site (and any site like it) is unsupported. I may add a fix for this in the future if it becomes common enough, but I'll need to dig into it a bit first.

This is all the HTML the crawler gets to see:

<html><title>You are being redirected...</title>
<noscript>Javascript is required. Please enable javascript before you are allowed to see this page.</noscript>
<script>var s={},u,c,U,r,i,l=0,a,e=eval,w=String.fromCharCode,sucuri_cloudproxy_js='',S='eD0nMG1TYicuc3Vic3RyKDMsIDEpICsgJycgKycnKyJlIi5zbGljZSgwLDEpICsgIjRzdWN1ciIuY2hhckF0KDApKyJjbCIuY2hhckF0KDApICsgICcnICsgCiIwc3VjdXIiLmNoYXJBdCgwKSsnYScgKyAgICcnICsnMmEnLnNsaWNlKDEsMikrJzMnICsgICJmIiArICIiICsiM3N1Ii5zbGljZSgwLDEpICsgImNzdWN1ciIuY2hhckF0KDApKyIiICsndUc5Jy5jaGFyQXQoMikrJz1mJy5zbGljZSgxLDIpK1N0cmluZy5mcm9tQ2hhckNvZGUoMHgzNCkgKyAnNicgKyAgJzAnICsgICAnJyArJzQnICsgICI2ayIuY2hhckF0KDApICsgICcnICsgCiI5dyIuY2hhckF0KDApICsgIjgiICsgIjIiICsgIiIgKyczJyArICBTdHJpbmcuZnJvbUNoYXJDb2RlKDB4MzUpICsgIjgiLnNsaWNlKDAsMSkgKyAgJycgKyAKIjRzdWN1ciIuY2hhckF0KDApKyAnJyArJzAnICsgICI1bSIuY2hhckF0KDApICsgICcnICsgCiJhc3UiLnNsaWNlKDAsMSkgKyAiIiArU3RyaW5nLmZyb21DaGFyQ29kZSgweDM0KSArICdWeD4wJy5zdWJzdHIoMywgMSkgKyAnJyArImZzdWN1ciIuY2hhckF0KDApKyJjIiArICAnJyArJyc7ZG9jdW1lbnQuY29va2llPSdzc3VjdXJpJy5jaGFyQXQoMCkgKyAndXMnLmNoYXJBdCgwKSsnYycrJ3VzdScuY2hhckF0KDApICsnc3VjdXJyJy5jaGFyQXQoNSkgKyAnaScrJ19zdScuY2hhckF0KDApICsnc3VjdXJpYycuY2hhckF0KDYpKydsJy5jaGFyQXQoMCkrJ29zdWN1Jy5jaGFyQXQoMCkgICsnc3UnLmNoYXJBdCgxKSsnc3VjdXJkJy5jaGFyQXQoNSkgKyAncCcuY2hhckF0KDApKydyJysnJysnc3VjdXJpbycuY2hhckF0KDYpKyd4c3VjdScuY2hhckF0KDApICArJ3lzJy5jaGFyQXQoMCkrJ18nKyd1JysndScrJ2knKydkJysnX3N1Y3VyJy5jaGFyQXQoMCkrICdiJysnc3VjdXJmJy5jaGFyQXQoNSkgKyAnOScrJzlzJy5jaGFyQXQoMCkrJ2JzdWN1cmknLmNoYXJBdCgwKSArICdlJy5jaGFyQXQoMCkrJ2NzdScuY2hhckF0KDApICsnNnN1YycuY2hhckF0KDApKyAnZnMnLmNoYXJBdCgwKSsiPSIgKyB4ICsgJztwYXRoPS87bWF4LWFnZT04NjQwMCc7IGxvY2F0aW9uLnJlbG9hZCgpOw==';L=S.length;U=0;r='';var A='ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/';for(u=0;u<64;u++){s[A.charAt(u)]=u;}for(i=0;i<L;i++){c=s[S.charAt(i)];U=(U<<6)+c;l+=6;while(l>=8){((a=(U>>>(l-=8))&0xff)||(i<(L-2)))&&(r+=w(a));}}e(r);</script></html>
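
One way a crawler could at least recognize a response like this, rather than silently reporting zero results, is a simple marker check. The sketch below is a heuristic only: the marker strings are taken from the response quoted above, `looks_like_js_challenge` is a hypothetical name, and it does nothing to get past the challenge itself.

```python
# Substrings seen in the challenge page above, lowercased for matching.
CHALLENGE_MARKERS = (
    "sucuri_cloudproxy_js",
    "you are being redirected",
    "javascript is required",
)


def looks_like_js_challenge(html):
    """Heuristically detect a JavaScript challenge/interstitial page."""
    lowered = html.lower()
    return any(marker in lowered for marker in CHALLENGE_MARKERS)


challenge = (
    "<html><title>You are being redirected...</title>"
    "<noscript>Javascript is required. Please enable javascript "
    "before you are allowed to see this page.</noscript></html>"
)
normal = '<html><body><a href="tel:1234567890">Call us</a></body></html>'

print(looks_like_js_challenge(challenge))  # True
print(looks_like_js_challenge(normal))     # False
```

A check like this would let the tool print a clear "page requires JavaScript" message instead of empty email and phone lists.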

z7r1k3 commented 4 years ago

Reopening since the OP stated that it successfully scraped emails from another site, but not the phone numbers.

I'm assuming it's because the phone numbers on the other site (not hamradio) were in plaintext, but I will give the OP a chance to respond on the off-chance there's something else going on here.

OP, can you provide the URL that "Captures about half the email addresses on a given webpage, but never captures phone numbers"? Or at least a snippet of the source code?

z7r1k3 commented 4 years ago

No reply. Closing, as nothing presented in this issue is currently supported.

z7r1k3 commented 3 years ago

Reopening: since this is a website with proper links, etc. in the HTML, it should be supported.

There is no timeline for fixing this issue, but it is officially on the agenda.