Sitemap parsing should use Tika's implementations in place of the current
ones in two places:
1. The parser currently has two public entry points, and both take the media
type (content type) as an argument. I suggest adding two new parsing methods
that use Tika to detect the media type instead. Their signatures would be as
follows:
public AbstractSiteMap parseSiteMap(URL url);
public AbstractSiteMap parseSiteMap(File file);
The content of these methods will be something like:
byte[] bytes = IOUtils.toByteArray(url);
String contentType = new Tika().detect(bytes);
return parseSiteMap(contentType, bytes, url);
The new methods I suggest above will be very convenient for the light user who
only wants to parse a simple sitemap without getting into any nitty gritty - I
believe many people will appreciate it.
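To illustrate what detecting the type from the bytes means, here is a minimal JDK-only sketch. It is a simplified stand-in, not Tika's detection algorithm: the class `SniffSketch` and the helper `sniff` are hypothetical names, and the only signals it looks at are the gzip magic bytes (0x1f 0x8b) and a leading `<` for XML. In the proposal above, `new Tika().detect(bytes)` would do this job far more thoroughly.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical, simplified stand-in for content-based type detection.
class SniffSketch {

    // Guesses a content type from the leading bytes only.
    static String sniff(byte[] bytes) {
        // gzip streams always start with the two magic bytes 0x1f 0x8b
        if (bytes.length >= 2
                && (bytes[0] & 0xff) == 0x1f
                && (bytes[1] & 0xff) == 0x8b) {
            return "application/gzip";
        }

        // Decode a short prefix and skip an optional BOM and leading whitespace
        int prefixLength = Math.min(bytes.length, 64);
        String head = new String(bytes, 0, prefixLength, StandardCharsets.UTF_8);
        if (head.startsWith("\uFEFF")) {
            head = head.substring(1);
        }
        head = head.trim();

        // XML sitemaps, RSS and Atom feeds all begin with a '<'
        if (head.startsWith("<")) {
            return "application/xml";
        }

        // Anything else is treated as a plain-text URL list
        return "text/plain";
    }
}
```

The returned string could then be passed to the existing parseSiteMap(contentType, bytes, url) method, which is what the convenience overloads would do with Tika's result.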
2. Change the MIME type handling to use Tika's MediaType.
So instead of this code:
if (url.getPath().endsWith(".xml") || contentType.contains("text/xml") ||
contentType.contains("application/xml") ||
contentType.contains("application/x-xml")
|| contentType.contains("application/atom+xml") || contentType.contains("application/rss+xml")) {
// Try parsing the XML which could be in a number of formats
return processXml(url, content);
} else if (url.getPath().endsWith(".txt") || contentType.contains("text/plain")) {
// plain text
return (AbstractSiteMap) processText(content, url.toString());
} else if (url.getPath().endsWith(".gz") || contentType.contains("application/gzip") || contentType.contains("application/x-gzip") || contentType.contains("application/x-gunzip")
|| contentType.contains("application/gzipped") || contentType.contains("application/gzip-compressed") || contentType.contains("application/x-compress")
|| contentType.contains("gzip/document") || contentType.contains("application/octet-stream")) {
return processGzip(url, content);
}
I want to use something like the following:
String mediaType = MediaType.parse(contentType).toString();
if (mediaType.contains(MediaType.APPLICATION_XML.getSubtype())) {
return processXml(url, content);
} else if (mediaType.contains(MediaType.APPLICATION_ZIP.getSubtype())) {
return processGzip(url, content);
} else if (mediaType.contains(MediaType.TEXT_PLAIN.getType())) {
return (AbstractSiteMap) processText(content, url.toString());
}
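One thing to watch with substring checks like the ones above: the subtype "zip" also matches application/zip, and the type "text" also matches text/html. The following JDK-only sketch shows a stricter dispatch that compares the parsed type and subtype instead. The names `MediaTypeDispatchSketch`, `Kind` and `classify` are hypothetical, and the manual string splitting only stands in for what Tika's MediaType.parse would provide. It also leaves out application/octet-stream and the rarer gzip aliases that the current code accepts.

```java
import java.util.Locale;

// Hypothetical sketch: dispatch on the parsed type/subtype, not on substrings.
class MediaTypeDispatchSketch {

    enum Kind { XML, GZIP, TEXT, UNKNOWN }

    static Kind classify(String contentType) {
        if (contentType == null) {
            return Kind.UNKNOWN;
        }

        // Drop parameters such as "; charset=UTF-8" and normalize the case
        String base = contentType.split(";", 2)[0].trim().toLowerCase(Locale.ROOT);
        int slash = base.indexOf('/');
        if (slash < 0) {
            return Kind.UNKNOWN;
        }
        String type = base.substring(0, slash);
        String subtype = base.substring(slash + 1);

        // text/xml, application/xml, application/x-xml, atom+xml, rss+xml
        if (subtype.equals("xml") || subtype.equals("x-xml") || subtype.endsWith("+xml")) {
            return Kind.XML;
        }
        // application/gzip and application/x-gzip, but not application/zip
        if (subtype.equals("gzip") || subtype.equals("x-gzip")) {
            return Kind.GZIP;
        }
        // text/plain only, not every text/* type
        if (type.equals("text") && subtype.equals("plain")) {
            return Kind.TEXT;
        }
        return Kind.UNKNOWN;
    }
}
```

Each Kind would map to one of the existing processXml, processGzip and processText calls, with UNKNOWN left to whatever fallback the parser already has.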
Original issue reported on code.google.com by avrah...@gmail.com on 19 Apr 2014 at 8:20