Web Scraper Improvements

aubundy commented 3 years ago

The current web scraper works maybe 95% of the time. Sometimes a website detects the bot and won't display the article. Other times a user will save a PDF or video they watched, but the current scraper is built to only handle articles, so the content doesn't display correctly. For a start, I created a checklist below of these edge cases/other features that would make the scraper more robust.

[x] Work better with bot detectors - #40
[ ] Better error handling. Python errors should be passed into an array (along with the values array) that is printed at the end of reading_scraper.py to be picked up by nodejs
[ ] PDFs
[ ] Tests
[ ] YouTube and Vimeo videos
[ ] Web-based Podcasts
[ ] Better error handling/general refactoring for better readability
[ ] Working with articles written in other languages
[ ] As improvements are made to the scraper to make it more accurate, we need a way to go back and update old readings

Feel free to break off one of the tasks above into its own issue for better collaboration

aubundy commented 3 years ago

Update article scraper

hiki270bis commented 3 years ago

Hi, I don't know if can discuss here. As a newbie, I would like to try and start working on some of these issues. Are the devs who wrote the scraper part still on the project?

aubundy commented 3 years ago

Yes, we'd love your help! I wrote the scraper and can try to answer any questions you have

hiki270bis commented 3 years ago

Oh, nice to hear! I don't know many of the packages, so maybe when I have a doubt about one of them maybe I can ask you fro clarifications? Obviously I'll first read the docs. I tried to run the script from the terminal and it works if I just pass the link as an arg in the cmd line, but in the app, where does it get the url from? Can I assume it does work as if I am passing an arg from cmd line? And also the results can I assume that that's the way the script has to present them in order then to be picked up by (I guess) the rest of the javascript which then displays them on the site? And, last thing. Why are the reading_scraper and the reading_summary in two different files, and do they communicate somehow? 'Cause I don't see one importing the other but maybe that's just me missing something?

aubundy commented 3 years ago

Good questions!

Oh, nice to hear! I don't know many of the packages, so maybe when I have a doubt about one of them maybe I can ask you fro clarifications? Obviously I'll first read the docs.

Yes, that won't be a problem.

I tried to run the script from the terminal and it works if I just pass the link as an arg in the cmd line, but in the app, where does it get the url from? Can I assume it does work as if I am passing an arg from cmd line? And also the results can I assume that that's the way the script has to present them in order then to be picked up by (I guess) the rest of the javascript which then displays them on the site?

Yes, it should. In the app, the client sends a post request to /api/users/:id/readings. The createReading function in controllers/readings receives the user id and article url and sends them as arguments to the reading_scraper.py that parses the article and returns the values array in line 153 to be inserted into the db. The db then saves the title, domain, description, image, word count, and url.

And, last thing. Why are the reading_scraper and the reading_summary in two different files, and do they communicate somehow? 'Cause I don't see one importing the other but maybe that's just me missing something?

reading_summary.py generates a 500 word summary of the article. The idea was for someone who was interested in an article that was posted, but didn't want to read the whole thing, they could click a "summarize" button and read a summary in the app. It's not in use right now - I removed the frontend code that sends a request to this. I probably should have removed the file by now.

hiki270bis commented 3 years ago

Ok, so which issue do you think we should tackle first? I feel like the pdf one is challenging but not impossible. I googled around and there are few packages that allow you to scrape the pdf content, so I think one could come up with some sort of content to show in the preview, even if I don't know how easy would it be to separate the title from the rest. As for the picture to show, I didn't find anything about showing a smaller preview image of the pdf, so I don't know if I would be able to do it on my own, but there's still the option, when the article is on pdf, to show a custom icon image of pdf. Working better with bot detector seems tempting as well but I am not confident I could solve it. Besides, I don't know how powerful beautiful soup is as a scraper. Maybe we could consider using scrapy but again, I have no experience, I once tried to install that but couldn't get it to work properly, in the end I resolved to use selenium, but I don't think it would be very good for this kind of project, although I'm still inexperienced Videos and podcast maybe it's just me but it would be a feature to add after we solved the pdf thing. As for the "better error handling/general refactoring for better readability", which are the main issues? I've only written scripts for myself (and just few), so I only know about the "good practice" what the ZTM course told. Sorry I wrote a lot but I'm looking forward to work on this project

aubundy commented 3 years ago

I agree, the PDF viewer could be a good start. Feel free to use any package you choose. I've only come across pdfreader

I think the Goose package has built in video parsing, so that might be a quicker task as well.

And no need to worry about the refactoring yet. It'll probably be good to eventually put the scraper into it's own folder with the different functions separated into their own files for better readability

hiki270bis commented 3 years ago

ok, then I'll try to build a separate module for the pdf scraping. Don't know how long it will take me but I'll keep you updated

maxemileffort commented 3 years ago

How do we feel about using OCR API's? They can be used to pull text out of images and PDFs.

Google's: Link OCR Space: Link

The difference between the two is that Google's handles larger files but lets you have fewer calls, while OCR Space does the opposite (smaller files, but more calls). Google's is usually a higher quality as well.

Both have really good documentation. The free tiers are pretty generous for development, but nothing else really.

I'd like to tackle bot detection, but I have a question before I get started.

Locally, I can run a browser where I can mimic user behavior (searching for article titles through Google, having extra tabs open, changing some fingerprints that bot detectors look for, random smooth scrolls, etc.). This gets me through Amazon and Google bot detection, but I haven't tried any other sites.

Will a production environment have the capacity to implement that? Or should we stick to libraries that grab html?

hiki270bis commented 3 years ago

But using an API wouldn't mean we have to download the file and then pass it through the app? Just asking

aubundy commented 3 years ago

@maxemileffort I think OCR might be too much for what we would be using it for. PDF data should be able to be scraped without OCR. But often with scrapers, it's good to have a fallback option since tools will often have different strengths. Maybe we could send a PDF doc to an OCR API if a package we are using didn't gather the data we wanted

hiki270bis commented 3 years ago

Hi. So hopefully I didn't mess up with Git Hub: made a new branch and added the file I wrote of some kind of pdf scraping module I came out with. Some things need to be worked out but before I go further I'd like to know what you think about it, if I am at least on the right track. My main concern is that it seems to be a bit slow, but maybe that's my poor machine. Well, before going on I'll wait for some feedback (I don't mind starting all over again if you can point me to a better direction). Here's the link to the branch in my repository. https://github.com/hiki270bis/breads-server/tree/scraperbranchhiki Should I do a pull request to have the branch showed in the zero-to-mastery page or PR are just when you want to merge the branch with master? The file is the last one, scrap_pdf.py

aubundy commented 3 years ago

@hiki270bis It looks like you're on the right track. I thought of a few things to do before committing to the original branch -

Could you run ~10 PDFs through this and time how long it takes
Comment out return_words() and see how much faster it scrapes
One thing we need to add is PDF detection so we know whether to run the PDF scraper or the web scraper whenever a url is received. I'm guessing the first part of get_pdf_url() could work for this.

The final step would be to connect with node.js and try it out in a fully running local environment where a user saves a PDF reading.

aubundy commented 3 years ago

@maxemileffort I just realized I didn't respond to the second part of your comment. Sorry about that! Yes, the production environment should be able to run a headless browser like you mentioned. We could fall back to that if the original scraper doesn't return the data we need. I can send you a list in discord of article urls that have been posted where we were unable to get article data. You could test the headless browser on this list for a start

hiki270bis commented 3 years ago

@aTmb405 Ok, these are my results:

counting words: used 10 random PDFs; some with 500 words, others with 10000+. It took 109 secs counting pages: the same 10 PDFs. It took 45 secs counting neither pages nor words: the same 10 PDFs, took 42 secs

I think that the decision whether to run the web or the article scraper should be taken in the main; I wrote get_pdf_url() because in some instances (es: https://www.researchgate.net/publication/315905287_INTRODUCTION_TO_ANTHROPOLOGY) the article is inside a pdf viewer, but the main link is not to the actual file. This function starts with the assumption that there is a pdf article to scrape somewhere in the page. I know it wouldn't probably be the best, but just for now, couldn't we add a radio button or something in the form for the user to indicate if he/she's uploading a pdf link?

Also there are many cases where the scraper wouldn't work, I'll try and fix them. The hardest thing might actually be to transform the lines[] into a title+summary, since there's no a standard way to write pdf's and sometimes there's, for instance, a date or other non interesting things in the top left or elsewhere which will be returned as the first line. I'll try to fix that too but I don't think I'll be able to come out with something that works perfectly every time, so if you have suggestions, those are welcome.

Finally, as I'm completely new to github, I'll leave here the link to the branch in my repo where I made few adjustments (like counting pages), but is this really necessary or since it's a branch of the master you would still be notified? Even if I named it in a different way so that it won't be mixed up with the master, I still shouldn't make a push to the ztm repo, right? And if instead it's the case, which is the right command? Also, what command should I use to update my local version of the code so that it's up to speed with the rest of the project? https://github.com/hiki270bis/breads-server/tree/scraperbranchhiki

aubundy commented 3 years ago

Thanks @hiki270bis! I have a few questions/thoughts

For those time results, are they from running all 10 right after the other, or is that an average time?
I'm thinking we should focus on urls that end with .pdf for now, and then branching out to links with pdfs in them. We could then check for pdf in node.js and send the url to the appropriate scraper
For PDF titles, we'd need to think about what makes titles unique compared to other potential first line items (like no dates, or an average whitespace on either side of it). This would take some playing around with. Are you comfortable with Regex? This could make this task easier
I also think this is something worth finding a good tool for. Would it be possible to try out 1 or 2 other packages to compare results?

hiki270bis commented 3 years ago

@aTmb405

Those are the results from running all ten PDFs one after the other; you meant to check the average time? Ok I'll do that then.
Yes Regex is what I was thinking about to solve the title etc. issue, but I don't think it will be a perfect solution.
I will check for other packages and compare the results, but I set on this one after doing some research and this was considered one of the best, so I don't know if there will be some substantial improvement
Also, I found out that most of the errors I got from scraping various PDF were due to the return_words() method which for some reason wasn't able to count them. Now I wrapped all of the methods in try/except, so they won't crash the script, and instead will return None

hiki270bis commented 3 years ago

Well apparently I was wrong: I switched to pymupdf and the loading time dropped: the two lists below compare the times from the two different scrapers for the same pdf links

with pdfplumber: with both words and pages [24.76241636276245, 2.762158155441284, 5.287302494049072, 7.731441974639893, 24.89242386817932, 6.081347703933716, 10.33959150314331, 7.7054407596588135, 9.47854208946228, 17.02997398376465, 33.96194243431091]
with pymupdf: with both words and pages [5.303016662597656, 4.170005798339844, 2.7500038146972656, 3.900005578994751, 1.550002098083496, 1.3200018405914307, 0.872002363204956, 1.9300026893615723, 0.7500009536743164, 11.090015649795532, 4.360006093978882]

Also, pymupdf has a nice rendering method that creates an image of a pdf page so we could use that to show

usual link to my branch with the updated code https://github.com/hiki270bis/breads-server/tree/scraperbranchhiki

aubundy commented 3 years ago

@hiki270bis Awesome. Thanks for testing that out! Could you test one more thing for me? And then we just need to make sure it prints an array similar to reading_scraper.py.

Could you compare the return_first_lines() output with the PDF title so we can get a snapshot of how accurate it currently is?

hiki270bis commented 3 years ago

@aTmb405 No problem! I'm having fun. About titles and stuff, through regex I thought we could exclude lines with dates, hours and url, as well as empty strings or lines with only symbols or white spaces or numbers or any combinations of just these three, or and also lines with "copyright" or "journal" (not sure about journal though) in it, and "page n" . I thought we could exclude "volume", "issue" and "number", but these are words which could well be in the title or in the text, so I opted to let them be. It seems to work well. Tell me if you have any suggestion about other things to exclude. The real problem is that there is no guarantee that first line == title, as it could also be, for instance, the author. Also, the title could be split in two or more lines (this seems to be the more common issue), but I can't think of a way to come around it. On the whole, I'd say it gets it right about 50-70% of the times.

If this is ok, next I'll work on the array output

my branch repo https://github.com/hiki270bis/breads-server/tree/scraperbranchhiki

hiki270bis commented 3 years ago

@aTmb405 by the way, what in format should the image be returned? I see that in reading_scraper what's returned is the link to the img, but both if I render the first page or if I extract the image, it's going to be an img object... would a pillow image work?

aubundy commented 3 years ago

@hiki270bis I would say for right now, we don't need to have an image for a PDF. There is a placeholder image used on the front end for articles that don't have an image, and that would work for PDFs as well.

I can't remember what reading_scraper.py returns for articles without images. Probably null

hiki270bis commented 3 years ago

@aTmb405 I commented out the self.image and gave it a standard value of None which should be what reading_scraper returns when fails to get an image, but the method is there so whenever we want to implement it we just need to uncomment self.img. Also i wrote it so that when running the file as main it will return the print(json.dumps(values)). But if you prefer to call the script as a module from scraper_reading you can just do print(json.dumps(self.values)). It's my first time with json (and actually contributing to a project) so I hope there are no mistakes. I think it's more or less done. If there's nothing more to add and everything works well should I make a pull request?

aubundy commented 3 years ago

Awesome. Thanks for doing that. The final step is to post a few PDF urls in a local setup of the app to make sure it runs correctly. If it works for you locally, go ahead and sync your branch with the remote branch and make a PR

hiki270bis commented 3 years ago

@aTmb405 The final step is to post a few PDF urls in a local setup of the app to make sure it runs correctly I'm sorry I'm inexperienced, what does it mean? Do I have to put up a local server and run the whole web-app? If so, I'm not sure how to call my script... should I do it from reading_scraper.py ? Like, if url ends with .pdf then run scrap_pymupdf.py ? Also, sorry for the many questions, but I see that my branch is behind master of about 20 commits. How can I update my branch so that it's up to date?

maxemileffort commented 3 years ago

Hey @aTmb405 I refactored to a class and tried to fix the useragent error. Everything seems to be working locally now, but the pull request says it won't merge automatically. Is that going to be an issue? I'm interested to see if the error is fixed on the heroku server.

Here's the new code, if you want to take a look: https://github.com/maxemileffort/breads-server/blob/master/reading_scraper.py

I started some error handling but I wasn't sure if I should handle them in the scraper or on the server.

aubundy commented 3 years ago

@maxemileffort Thanks for doing that. Which PR are you referring to? You might need to create a new one for this.

I think errors related to scraping should be handled in the scraper, but we need to think of a good way to pass errors to the nodejs server. The package we are using looks for print statements and passes those to nodejs. This is why the return values are printed at the end. Maybe we could print an object that has a values array and error array. We could then check if the errors array is empty inside nodejs.

maxemileffort commented 3 years ago

@maxemileffort Thanks for doing that. Which PR are you referring to? You might need to create a new one for this.

I think errors related to scraping should be handled in the scraper, but we need to think of a good way to pass errors to the nodejs server. The package we are using looks for print statements and passes those to nodejs. This is why the return values are printed at the end. Maybe we could print an object that has a values array and error array. We could then check if the errors array is empty inside nodejs.

I started to make a pull request but when I saw the warning, I stopped. Happy to start a new one if it’s not going to be an issue.

I’ve been messing around with the backend to see how we could handle the errors. You’re right about handling it in the script I think. It’s probably simpler to handle it that way. Currently it just makes a card that says “invalid url” and has some delicious looking bread for the pic. 😆

aubundy commented 3 years ago

@maxemileffort Yes feel free to start a new one and we'll go from there to get it merged

aubundy commented 3 years ago

@hiki270bis

Do I have to put up a local server and run the whole web-app?

For now, yes. We could add tests to the list of needed improvements for the scraper so that this step will become unnecessary in the future.

If so, I'm not sure how to call my script... should I do it from reading_scraper.py ? Like, if url ends with .pdf then run scrap_pymupdf.py ?

Right now I would set up the pdf scraper to be imported into reading_scraper.py. Where the functions are called at the bottom of the program, add a conditional check for urls that end with .pdf, and if they do, scrape with the pdf scraper instead.

Also, sorry for the many questions, but I see that my branch is behind master of about 20 commits. How can I update my branch so that it's up to date?

You could run git pull upstream master and resolve merge conflicts from there or reset your commits, stash them, merge the upstream branch, and then recommit them.

Here are the steps to stash your commits after resetting -

git stash
git pull upstream main
git stash pop

maxemileffort commented 3 years ago

@aTmb405 so now that the scraper is working, I've taken the time to try and work on errors and currently all I am doing is passing back a mostly empty values array with a title of "Invalid URL." and the actual URL that was passed in by the user.

The app then builds a card with those 2 values. While typing this I had the idea to add another value for the domain that asks the user to re-check the URL before re-submitting the link...

Is this a direction you want the error handling to go, or was there another type of error that you were seeing before that you wanted to get a handle on?

hiki270bis commented 3 years ago

Sorry, I'm having trouble testing the script as I have never done something like that. I tried to use simpleHTTPserver but I can't seem to be able to start the webapp, and I have never used mysql server, so it's going to take some time for me to learn how to use it since I can't follow the steps in the contributing page...

aubundy commented 3 years ago

@hiki270bis no worries. I will try to clone your personal branch and run it through some tests this week, if that sounds good to you

aubundy commented 3 years ago

The app then builds a card with those 2 values. While typing this I had the idea to add another value for the domain that asks the user to re-check the URL before re-submitting the link...

Is this a direction you want the error handling to go, or was there another type of error that you were seeing before that you wanted to get a handle on?

@maxemileffort I think there's a few extra edge cases we need to catch and that's one of them. So you're saying instead of returning "Unable to get title" for an invalid URL, we should return something like "Invalid URL - please double check your link". I like that idea.

I think there could also be a number of errors that prevent a scraping to finish. Right now, there's no way for us to send a message to the user that something went wrong while scraping their article. The scraper only returns a list of values. If there's an error, the article isn't scraped, but the user doesn't see any message about this.

I'm thinking we could return a dictionary that has the same values list as well as an error dictionary. We could fill the error dictionary with what's thrown in python, then in the node.js function, we'd do a check to see if the error dictionary is empty or not and then go from there.

We could also add more keys to this dictionary in the future like which part of the scraper returned the correct data: the original url, the cached version, etc.

aubundy commented 3 years ago

@hiki270bis I haven't had time to try cloning your branch, and I'm not sure when I will. I'd be happy to help you troubleshoot setting up the app locally so you can test your additions

zero-to-mastery / breads-server

Web Scraper Improvements #30