Closed — pmurley closed this issue 4 months ago
Glad you like it! The original crawler code was very tightly coupled to our lab infrastructure and databases; it has since been replaced by a more maintainable version and will probably never be released as it was.
However, we are aware that it would be helpful to have an example driver for automating Chrome in a VisibleV8-aware manner and parsing basic information out of the resulting logs. Adding such an example to the repository is on my TODO list and is progressing (slowly) in a personal branch. I'll leave this issue open until that stuff gets merged into master.
In the meantime, to start addressing a few of your questions: we currently use Puppeteer as the base of our crawler, and we use its request interception API as the mechanism to block external navigations. Some things we did (like creating an "isolated world" for JS execution and injecting a script with a URL/name we control) were not available via the Puppeteer API, so we resorted to the Chrome DevTools Protocol directly via Puppeteer's CDPSession escape hatch.
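To make that concrete, here is a minimal sketch of the two mechanisms described above: request interception to abort main-frame navigations away from the target origin, and a CDPSession used to create an isolated world and register an injected script. This is not the VisibleV8 lab crawler; all names (`crawl`, `shouldBlockNavigation`, the `'crawler-probe'` world name) are my own, and the blocking policy (abort off-origin main-frame navigations) is an assumption about what "blocking external navigations" means here.

```javascript
// Hypothetical sketch, not the actual VisibleV8 crawler code.

// Pure policy helper (assumed policy: block top-level navigations that
// leave the target page's origin; never block subresource requests).
function shouldBlockNavigation(url, isMainFrameNavigation, targetUrl) {
  if (!isMainFrameNavigation) return false;
  return new URL(url).origin !== new URL(targetUrl).origin;
}

async function crawl(targetUrl) {
  const puppeteer = require('puppeteer'); // assumed dependency
  const browser = await puppeteer.launch();
  const page = await browser.newPage();

  // Puppeteer request interception: every request must be explicitly
  // continued or aborted once interception is enabled.
  await page.setRequestInterception(true);
  page.on('request', (req) => {
    const isMainFrameNav =
      req.isNavigationRequest() && req.frame() === page.mainFrame();
    if (shouldBlockNavigation(req.url(), isMainFrameNav, targetUrl)) {
      req.abort();
    } else {
      req.continue();
    }
  });

  await page.goto(targetUrl, { waitUntil: 'networkidle2' });

  // CDPSession escape hatch: raw DevTools Protocol commands that have no
  // Puppeteer-level equivalent.
  const client = await page.target().createCDPSession();
  const { frameTree } = await client.send('Page.getFrameTree');
  await client.send('Page.createIsolatedWorld', {
    frameId: frameTree.frame.id,
    worldName: 'crawler-probe', // hypothetical world name
  });
  // Inject a script whose source (and hence attribution in the VV8 logs)
  // we control; the instrumentation payload is elided here.
  await client.send('Page.addScriptToEvaluateOnNewDocument', {
    source: '/* instrumentation goes here */',
  });

  await browser.close();
}
```

Keeping the navigation policy in a pure helper like `shouldBlockNavigation` makes it testable without launching a browser; the interception handler itself just applies it to each intercepted request.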
Take a look at wspr-ncsu/visiblev8-crawler! We have support for a custom crawler.js in there!
I learned a lot from your paper, so thank you! Great work on this, and thanks for being willing to release all of this code.
I'm wondering if you might be willing to release some or all of the code you used to drive Chromium as well. I'm interested in some of the mechanics: how you integrated gremlins.js, how you blocked external navigations during crawls, the flags you set on the browser, etc. Any details you can provide beyond what's in the paper would be helpful. Thanks again for this work -- really cool stuff.