stash-c4s-pyscraper
THIS PLUGIN IS NOW DEPRECATED DUE TO CHANGES TO THE C4S SITE - PLEASE DO NOT USE
The plugin has been left on GitHub for posterity, or in case any of its component code is of any use. It's all open source, so feel free to rip it apart and cannibalise it.
Python-based Clips4Sale scraper for Stash
Dependencies
This is a Python scraper, and as such, Python 3 needs to be installed.
- The official Stash docker container already contains Python and all needed modules.
- For Windows systems, install Python from python.org (instructions), NOT from the Windows store.
- For Linux systems please consult the relevant distro instructions.
- For Mac systems either use Homebrew, e.g.
brew install python3
or use the python.org installer (instructions)
pip requirements
Features
- Addresses a bug where descriptions are sometimes rendered without apostrophes. This is achieved by pulling in the description from an alternative source (l.clips4sale.com as opposed to clips4sale.com) and combining the two sources into one perfect description. It compares the description word-by-word from each source, and if there's a mismatch, replaces it with the word from l.clips4sale.com (i.e. the one with apostrophes).
- Strips errant HTML tags from the text of the description.
- Replaces curly single quotation marks with plain apostrophes, so that "don‘t" and "don’t" both become "don't".
- Introduced a fallback for thumbnail processing. There appear to be two types of thumbnail for C4S content - one is a static thumbnail (most likely a custom thumbnail uploaded by the content creator) and the other is a still frame from a GIF preview. This script favours the static thumbnail, and falls back to the GIF still frame if that isn't available. In short, there should always be a picture of some sort this time. No more missing images.
- Line breaks are now processed correctly. Before, there were instances of paragraphs immediately following on from one another (not even leaving a space) - this should now be rectified.
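For anyone picking the code apart, the description clean-up described above can be sketched roughly as follows. This is not the plugin's actual code: the function names are made up for illustration, and the merge is deliberately stricter than the feature as described, in that it only takes the word from the alternative source when the two words differ purely by apostrophes. It also assumes both sources split into the same number of words, and gives up (returning the primary text) if they don't.

```python
import html
import re

# Hypothetical helper names; the plugin's real functions may differ.
CURLY_SINGLE_QUOTES = "\u2018\u2019"  # the two curly single quotation marks


def strip_tags(text: str) -> str:
    """Remove anything that looks like an HTML tag, then unescape entities."""
    return html.unescape(re.sub(r"<[^>]+>", "", text))


def normalise_apostrophes(text: str) -> str:
    """Turn curly single quotation marks into plain ASCII apostrophes."""
    return text.translate({ord(ch): "'" for ch in CURLY_SINGLE_QUOTES})


def merge_descriptions(primary: str, alternative: str) -> str:
    """Walk both texts word by word, restoring apostrophes from `alternative`.

    Whitespace (including line breaks) is kept from `primary`.
    """
    # With a capturing group, re.split keeps the separators:
    # words land at even indices, whitespace runs at odd indices.
    tokens = re.split(r"(\s+)", primary.strip())
    alt_words = alternative.split()
    if len(tokens[0::2]) != len(alt_words):
        return primary  # no safe alignment, so leave the text alone
    for i, alt in enumerate(alt_words):
        word = tokens[2 * i]
        if word != alt and alt.replace("'", "") == word:
            tokens[2 * i] = alt
    return "".join(tokens)


def clean_description(primary: str, alternative: str) -> str:
    """Strip tags and normalise quotes on both sources, then merge them."""
    first = normalise_apostrophes(strip_tags(primary))
    second = normalise_apostrophes(strip_tags(alternative))
    return merge_descriptions(first, second)
```

Calling `clean_description` with the text from clips4sale.com as `primary` and the text from l.clips4sale.com as `alternative` would give a description with the primary source's layout and the alternative source's apostrophes.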
Pitfalls
- I have tested this on a couple of hundred video clips, and the only issue I've encountered is the occasional 'Invariant violation' error - this happens in about one in every 50-100 scrapes or so, and can be rectified by just running the scraper again.
- Takes quite a bit longer than the YML/XPath-based scraper. It's probably quite inefficient code - I think there are three calls made to the C4S website. I'm sure this could be slimmed down.
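On the speed point, one possible way to slim things down would be to remember each page after the first download, so that nothing is fetched twice in a single scrape. The sketch below is only an idea, not something the plugin does: `PageCache` is a made-up name, `fetch` stands in for whatever function the script uses to download a page, and it only helps if some of those calls actually repeat a URL, which I haven't verified.

```python
from typing import Callable, Dict


class PageCache:
    """Fetch each URL at most once and reuse the body afterwards."""

    def __init__(self, fetch: Callable[[str], str]) -> None:
        # `fetch` is any callable that takes a URL and returns the page text.
        self._fetch = fetch
        self._pages: Dict[str, str] = {}

    def get(self, url: str) -> str:
        # Only hit the network the first time a URL is requested.
        if url not in self._pages:
            self._pages[url] = self._fetch(url)
        return self._pages[url]
```

Every part of the scraper that needs a page would then call `cache.get(url)` instead of downloading it directly.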
Special thanks
I'd like to thank the writers of the following scrapers on the CommunityScrapers git repository - their code has proven very useful in helping this come together:
Also, mention has to go to ChatGPT, without which I wouldn't have been able to fumble blindly through this project.