xdebbie / forkkit

Web crawler to mine album review scores and metadata from pitchfork.com
MIT License

Forkkit - Pitchfork's album reviews scraper

A scraper written in Python for forkked

Database as of 25th May 2020 => 20,077 albums reviewed


Installing the scraper

  1. Clone the repository
  2. In the terminal, create a virtual environment by typing
    $ virtualenv -p python3 .
    This project was developed with Python 3.7
  3. To activate the environment and install the requirements, type in the terminal
    $ . bin/activate
    $ pip install -r requirements.txt

Installing the required libraries

  1. The script uses the excellent object-relational mapping (ORM) tool peewee, which you probably don't have installed. To get it, type
    $ pip install peewee
  2. It also uses the requests-html library for the heavy lifting (parsing the HTML pages). To install it, type
    $ pip install requests-html
  3. To fetch the artwork URLs, I had to use BeautifulSoup, because each URL is stored in the src attribute of an img tag nested inside a div with a class. src is an attribute rather than an HTML tag of its own, so the requests-html method does not really work for fetching it.
    $ pip install beautifulsoup4
  4. Dates are parsed and reformatted from 'January 1 2020' into the YYYY-MM-DD format, so that the SQL database handles them better. The htmldate library is used for that. It can be installed, upgraded, or installed directly from its repository with
    $ pip install htmldate
    $ pip install --upgrade htmldate
    $ pip install git+https://github.com/adbar/htmldate.git
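
To illustrate points 3 and 4, here is a minimal sketch of the two ideas: reading the src attribute of an img tag, and turning a date such as 'January 1 2020' into YYYY-MM-DD. It is not the code used in forkkit.py. It deliberately uses only Python's standard library (html.parser and datetime) as stand-ins for BeautifulSoup and htmldate so that it runs without extra installs. The names ImgSrcParser and to_iso_date, the sample markup and the example.com URL are all hypothetical.

```python
from datetime import datetime
from html.parser import HTMLParser


class ImgSrcParser(HTMLParser):
    """Collect the src attribute of every <img> tag (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs. src is one of those pairs,
        # not a tag of its own, so it has to be read from the img tag.
        if tag == "img":
            src = dict(attrs).get("src")
            if src:
                self.sources.append(src)


def to_iso_date(text):
    """Convert a date such as 'January 1 2020' to 'YYYY-MM-DD'."""
    return datetime.strptime(text, "%B %d %Y").strftime("%Y-%m-%d")


# Hypothetical markup, assumed to resemble an artwork block on a review page.
sample_html = (
    '<div class="artwork">'
    '<img src="https://example.com/cover.jpg" alt="cover">'
    "</div>"
)

parser = ImgSrcParser()
parser.feed(sample_html)
print(parser.sources)
print(to_iso_date("January 1 2020"))
```

The date format string assumes English month names and the default locale. Dates written differently (for example with a comma after the day) would need another format string.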

Creating the database

  1. To create the database file with the preset tables, type
    $ python3 models.py
  2. Voilà! You should now have an albums.db file in your folder
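
To check that the tables were really created, the file can be inspected with Python's built-in sqlite3 module. The snippet below is only a sketch of one way to do that and is not part of the project. list_tables is a hypothetical helper, and the table names it returns depend on what models.py defines.

```python
import sqlite3


def list_tables(db_path):
    """Return the table names found in an SQLite file, sorted alphabetically."""
    connection = sqlite3.connect(db_path)
    try:
        rows = connection.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
        ).fetchall()
    finally:
        # Close the connection even if the query fails.
        connection.close()
    return [name for (name,) in rows]


# Example usage, assuming albums.db is in the current folder:
# print(list_tables("albums.db"))
```

Note that sqlite3.connect creates an empty file if the path does not exist, so an empty list may simply mean that models.py has not been run in this folder yet.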

What exactly does the scraper do?

Running the scraper

  1. To run the scraper, type in the terminal
    $ python3 forkkit.py
    and wait - gathering all this data may take a while!

Changing the variables

  1. In the forkkit.py file, you can change a couple of variables:

Notes

Special thanks to @nabaskes