Currently the code blindly assumes that every linked resource is an HTML page and tries to parse it as such. As a result, fetches crash on non-HTML resources, e.g. PDF files.
PageAnalyzer should check the content type and parse only HTML responses. For any other type, it should skip parsing and just return a result based on the status code.
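
A rough sketch of the proposed check, in Python for illustration only. The names `analyze`, `media_type`, `PageResult` and `HTML_TYPES` are hypothetical and not taken from the existing PageAnalyzer code. It also assumes that "based on status code" means treating anything below 400 as OK, and that `application/xhtml+xml` should count as HTML.

```python
from dataclasses import dataclass
from html.parser import HTMLParser
from typing import List, Optional

# Assumption: media types that are safe to hand to the HTML parser.
HTML_TYPES = {"text/html", "application/xhtml+xml"}


class _LinkCollector(HTMLParser):
    """Collects href values from <a> tags (stand-in for the real parsing)."""

    def __init__(self) -> None:
        super().__init__()
        self.links: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


@dataclass
class PageResult:
    ok: bool          # derived from the status code only
    parsed: bool      # True only when the body was parsed as HTML
    links: List[str]


def media_type(content_type: Optional[str]) -> str:
    """Strip parameters such as '; charset=utf-8' and normalise case."""
    if not content_type:
        return ""
    return content_type.split(";", 1)[0].strip().lower()


def analyze(status_code: int, content_type: Optional[str], body: bytes) -> PageResult:
    # Assumption: any status below 400 counts as a working link.
    ok = 200 <= status_code < 400

    # Non-HTML (or missing) content type: never touch the body.
    if media_type(content_type) not in HTML_TYPES:
        return PageResult(ok=ok, parsed=False, links=[])

    collector = _LinkCollector()
    collector.feed(body.decode("utf-8", errors="replace"))
    return PageResult(ok=ok, parsed=True, links=collector.links)
```

With this shape, a PDF response such as `analyze(200, "application/pdf", pdf_bytes)` returns a result without the body ever reaching the parser, while `text/html; charset=utf-8` is still parsed as before. A missing Content-Type header is treated as non-HTML here; whether that is the right default is open for discussion.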