yasserg / crawler4j

Open Source Web Crawler for Java
Apache License 2.0

Timeoutable regular expressions in RobotstxtServer #429

Open dgoiko opened 4 years ago

dgoiko commented 4 years ago

Fixes #425 by adding Matchers that throw a RuntimeException on timeout, plus a TimeoutablePathRule (a subclass of PathRule) that uses them.

By default the system does not use them, but they can be enabled via RobotstxtConfig.

NOTE: The code for the timeoutable Matchers is based on this Stack Overflow answer, and it decreases regex performance. Ideally, a native, efficient regex library with timeout support would be used instead, but this is a valid workaround.
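
For readers unfamiliar with the technique, here is a minimal sketch of one common way to build a Matcher that throws a RuntimeException on timeout: wrap the input in a CharSequence whose charAt checks a deadline. This is not the code from this PR. The names TimeoutRegexSketch, RegexTimeoutException, DeadlineCharSequence and timeoutableMatcher are hypothetical and chosen only for illustration; only java.util.regex is assumed.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative sketch only; class and method names are hypothetical,
// not taken from crawler4j or from this PR.
class TimeoutRegexSketch {

    /** Unchecked exception thrown when matching exceeds its time budget. */
    static class RegexTimeoutException extends RuntimeException {
        RegexTimeoutException(String message) {
            super(message);
        }
    }

    /** CharSequence wrapper that checks a deadline on every character read. */
    static class DeadlineCharSequence implements CharSequence {
        private final CharSequence inner;
        private final long deadlineNanos;

        DeadlineCharSequence(CharSequence inner, long deadlineNanos) {
            this.inner = inner;
            this.deadlineNanos = deadlineNanos;
        }

        @Override
        public char charAt(int index) {
            // The regex engine reads input through charAt, so a runaway
            // backtracking match ends up here and gets interrupted.
            if (System.nanoTime() - deadlineNanos > 0) {
                throw new RegexTimeoutException("Regex matching timed out");
            }
            return inner.charAt(index);
        }

        @Override
        public int length() {
            return inner.length();
        }

        @Override
        public CharSequence subSequence(int start, int end) {
            // Keep the same deadline for any sub-sequence.
            return new DeadlineCharSequence(inner.subSequence(start, end), deadlineNanos);
        }

        @Override
        public String toString() {
            return inner.toString();
        }
    }

    /** Returns a Matcher whose operations throw once timeoutMillis has elapsed. */
    static Matcher timeoutableMatcher(Pattern pattern, CharSequence input, long timeoutMillis) {
        long deadlineNanos = System.nanoTime() + timeoutMillis * 1_000_000L;
        return pattern.matcher(new DeadlineCharSequence(input, deadlineNanos));
    }

    public static void main(String[] args) {
        Pattern rule = Pattern.compile("/private/.*");
        try {
            boolean blocked = timeoutableMatcher(rule, "/private/data", 1000).matches();
            System.out.println("matches: " + blocked);
        } catch (RegexTimeoutException e) {
            // A caller could treat a timed-out rule as non-matching, or log and skip it.
            System.out.println("timed out: " + e.getMessage());
        }
    }
}
```

The extra deadline check on every character read is the likely source of the performance cost mentioned in the note, which is a reasonable argument for keeping the feature opt-in.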