xrma / crawler4j

Automatically exported from code.google.com/p/crawler4j

How to crawl .js files? #185

Open GoogleCodeExporter opened 9 years ago

GoogleCodeExporter commented 9 years ago
I've tested the crawler and everything works. The only problem is that it 
never picks up `.js` files. Is there anything I can do to download `.js` 
files as well? Thanks.

Original issue reported on code.google.com by alirezan...@gmail.com on 19 Jan 2013 at 11:50

GoogleCodeExporter commented 9 years ago
The solution for me was to patch some crawler4j classes so that they handle 
<script> elements. I also had to patch Tika's HtmlHandler, which simply 
ignores any <script> tag inside the HTML <head> :-(. See the attached files 
for the patched classes and search for the keyword PATCH.
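
For readers without the attachments, here is a minimal standalone sketch of the underlying idea: pull the `src` URLs out of <script> tags and resolve them against the page URL. It is not the attached patch and does not use crawler4j or Tika APIs. The class name `ScriptLinkExtractor`, the method `extractScriptUrls`, and the example URLs are hypothetical, and the regex only handles quoted `src` values, so treat it as an illustration rather than a robust HTML parser.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of crawler4j or Tika.
public class ScriptLinkExtractor {

    // Matches a quoted src attribute inside an opening <script> tag.
    // Group 1 holds a double-quoted value, group 2 a single-quoted one.
    private static final Pattern SCRIPT_SRC = Pattern.compile(
            "<script\\b[^>]*?\\ssrc\\s*=\\s*(?:\"([^\"]*)\"|'([^']*)')",
            Pattern.CASE_INSENSITIVE);

    // Returns the absolute URLs of all external scripts referenced in the HTML.
    public static List<String> extractScriptUrls(String html, String pageUrl) {
        URI base = URI.create(pageUrl);
        List<String> urls = new ArrayList<>();
        Matcher m = SCRIPT_SRC.matcher(html);
        while (m.find()) {
            String src = m.group(1) != null ? m.group(1) : m.group(2);
            src = src.trim();
            if (src.isEmpty()) {
                continue;
            }
            try {
                // Resolve relative references such as "lib/util.js" against the page URL.
                urls.add(base.resolve(src).toString());
            } catch (IllegalArgumentException e) {
                // Skip src values that are not valid URI references.
            }
        }
        return urls;
    }

    public static void main(String[] args) {
        String html = "<html><head><script src=\"/js/app.js\"></script></head>"
                + "<body><script type='text/javascript' src='lib/util.js'></script>"
                + "</body></html>";
        for (String url : extractScriptUrls(html, "http://example.com/site/index.html")) {
            System.out.println(url);
        }
    }
}
```

URLs found this way would still have to be queued by the crawler and pass whatever URL filter it applies before the `.js` files are actually fetched.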

Original comment by m4rcow...@gmail.com on 21 Feb 2013 at 8:56

Attachments: