whizu / whizu.java

Whizu™ is a lightweight Java library that makes it easy to create fast, good-looking jQuery Mobile applications. It focuses on the development of HTML5 web applications, with developer productivity and runtime performance as key objectives.
www.whizu.org

Allow for Whizu pages to be indexed by Google #39

Open · rdhauwe opened this issue 11 years ago

rdhauwe commented 11 years ago

The content of a Whizu page is mostly generated on the server side and injected into the client with JavaScript. It should be verified that Google supports this and can index the generated content. If not, Whizu should support page preprocessing so that the (static) content is streamed with the initial page request.
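As a sketch of what such preprocessing could look like, the servlet below writes the static content directly into the initial response instead of leaving it to a client-side AJAX call. PrerenderServlet and renderStaticContent are hypothetical names for the sketch, not part of the actual Whizu API:

import java.io.IOException;
import java.io.PrintWriter;
import javax.servlet.ServletException;
import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;

public class PrerenderServlet extends HttpServlet {

    @Override
    protected void doGet(HttpServletRequest request, HttpServletResponse response)
            throws ServletException, IOException {
        response.setContentType("text/html;charset=UTF-8");
        PrintWriter out = response.getWriter();
        // Instead of serving an empty shell that is filled in later by
        // JavaScript, render the static part of the page into the initial
        // response so a crawler sees real content.
        out.println("<!DOCTYPE html>");
        out.println("<html><head><title>Whizu page</title></head><body>");
        out.println(renderStaticContent(request.getRequestURI()));
        out.println("</body></html>");
    }

    // Hypothetical helper: would produce the same markup that the
    // JavaScript injection normally builds on the client.
    private String renderStaticContent(String uri) {
        return "<div id=\"content\">...</div>";
    }
}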

rdhauwe commented 11 years ago

Google does index pages that contain JavaScript ;-) at least if the AJAX call that generates the JavaScript is not blocked by robots.txt. So how do we let Google crawl the AJAX call but prevent the call itself from being indexed?

rdhauwe commented 11 years ago

http://googlewebmastercentral.blogspot.be/2011/11/get-post-and-safely-surfacing-more-of.html
http://googleblog.blogspot.be/2007/07/robots-exclusion-protocol-now-with-even.html
https://sites.google.com/site/webmasterhelpforum/en/faq--crawling--indexing---ranking

We've extended our support for META tags so they can now be associated with any file. Simply add any supported META tag to a new X-Robots-Tag directive in the HTTP Header used to serve the file. Here are some illustrative examples:

Don't display a cache link or snippet for this item in the Google search results:

X-Robots-Tag: noarchive, nosnippet

Don't include this document in the Google search results:

X-Robots-Tag: noindex

Tell us that a document will be unavailable after 7th July 2007, 4:30pm GMT:

X-Robots-Tag: unavailable_after: 7 Jul 2007 16:30:00 GMT

rdhauwe commented 11 years ago

https://developers.google.com/webmasters/control-crawl-index/docs/robots_meta_tag?hl=nl

I added the following header to the event listener responses:

response.setHeader("X-Robots-Tag", "noindex");
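As an illustration, the same header could be set for all event-listener URLs at once with a servlet filter. NoIndexFilter and its URL mapping are assumptions for the sketch, not existing Whizu code:

import java.io.IOException;
import javax.servlet.Filter;
import javax.servlet.FilterChain;
import javax.servlet.FilterConfig;
import javax.servlet.ServletException;
import javax.servlet.ServletRequest;
import javax.servlet.ServletResponse;
import javax.servlet.http.HttpServletResponse;

// Illustrative filter: marks every response it wraps as non-indexable.
// It would be mapped in web.xml to the event-listener URL pattern.
public class NoIndexFilter implements Filter {

    @Override
    public void init(FilterConfig config) {
    }

    @Override
    public void doFilter(ServletRequest request, ServletResponse response,
            FilterChain chain) throws IOException, ServletException {
        // Set the header before the response is committed.
        ((HttpServletResponse) response).setHeader("X-Robots-Tag", "noindex");
        chain.doFilter(request, response);
    }

    @Override
    public void destroy() {
    }
}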

I removed the event listeners from robots.txt, because blocking them there prevents Google from crawling and indexing their content.
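Concretely, the change amounts to dropping the Disallow rule for the event-listener path. The /whizu/event/ path below is an assumed example, not the project's actual URL mapping:

# Before: event listeners blocked, so Google never fetches them
# and never sees any X-Robots-Tag header on their responses.
User-agent: *
Disallow: /whizu/event/

# After: nothing disallowed; the responses themselves now control
# indexing via the X-Robots-Tag header.
User-agent: *
Disallow: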

rdhauwe commented 11 years ago

I changed

response.setHeader("X-Robots-Tag", "noindex");

into

response.setHeader("X-Robots-Tag", "noarchive");

rdhauwe commented 11 years ago

See 'Making AJAX Applications Crawlable': https://developers.google.com/webmasters/ajax-crawling/

This explains why the current approach doesn't work. The event listeners should be re-added to robots.txt to avoid indexing. The same effect is to be achieved with noindex instead of noarchive:

response.setHeader("X-Robots-Tag", "noindex");

Proposed solution:

Now that you have your original URL back and you know what content the crawler is requesting, you need to produce an HTML snapshot. How do you do that? There are various ways; here are some of them:

* If a lot of your content is produced with JavaScript, you may want to use a headless browser such as HtmlUnit to obtain the HTML snapshot. Alternatively, you can use a different tool such as crawljax or watij.com.
* If much of your content is produced with a server-side technology such as PHP or ASP.NET, you can use your existing code and only replace the JavaScript portions of your web page with static or server-side created HTML.
* You can create a static version of your pages offline, as is the current practice. For example, many applications draw content from a database that is then rendered by the browser. Instead, you may create a separate HTML page for each AJAX URL.
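A minimal sketch of the first option, using HtmlUnit to obtain such a snapshot; the target URL and the 10-second wait are assumptions:

import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.HtmlPage;

public class HtmlSnapshot {

    public static void main(String[] args) throws Exception {
        // Assumed URL of a Whizu page whose content is injected with JavaScript.
        String url = "http://www.whizu.org/example";
        WebClient webClient = new WebClient();
        try {
            webClient.getOptions().setJavaScriptEnabled(true);
            HtmlPage page = webClient.getPage(url);
            // Give background AJAX calls up to 10 seconds to finish so the
            // injected content is part of the snapshot.
            webClient.waitForBackgroundJavaScript(10000);
            // The serialized DOM is the HTML snapshot that would be served
            // for _escaped_fragment_ requests.
            System.out.println(page.asXml());
        } finally {
            webClient.close(); // closeAllWindows() on older HtmlUnit versions
        }
    }
}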