Open rdhauwe opened 11 years ago
Google does index pages that contain JavaScript ;-) at least if the AJAX call that generates the content is not blocked by robots.txt. So how do we let Google crawl the AJAX call but prevent it from being indexed?
http://googlewebmastercentral.blogspot.be/2011/11/get-post-and-safely-surfacing-more-of.html
http://googleblog.blogspot.be/2007/07/robots-exclusion-protocol-now-with-even.html
https://sites.google.com/site/webmasterhelpforum/en/faq--crawling--indexing---ranking
We've extended our support for META tags so they can now be associated with any file. Simply add any supported META tag to a new X-Robots-Tag directive in the HTTP Header used to serve the file. Here are some illustrative examples:
Don't display a cache link or snippet for this item in the Google search results:
X-Robots-Tag: noarchive, nosnippet
Don't include this document in the Google search results:
X-Robots-Tag: noindex
Tell us that a document will be unavailable after 7th July 2007, 4:30pm GMT:
X-Robots-Tag: unavailable_after: 7 Jul 2007 16:30:00 GMT
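The `unavailable_after` directive in the quoted example carries a human-readable GMT timestamp. As a hedged sketch (the class and method names are my own, not part of Whizu or the servlet API), the header value could be built in Java like this:

```java
import java.time.ZoneOffset;
import java.time.ZonedDateTime;
import java.time.format.DateTimeFormatter;
import java.util.Locale;

public class UnavailableAfter {
    // Pattern matching Google's example value: "7 Jul 2007 16:30:00 GMT"
    static final DateTimeFormatter FMT =
            DateTimeFormatter.ofPattern("d MMM yyyy HH:mm:ss 'GMT'", Locale.ENGLISH);

    // Build the X-Robots-Tag value for a page that expires at the given instant.
    static String headerValue(ZonedDateTime expiry) {
        return "unavailable_after: " + FMT.format(expiry.withZoneSameInstant(ZoneOffset.UTC));
    }

    public static void main(String[] args) {
        ZonedDateTime expiry = ZonedDateTime.of(2007, 7, 7, 16, 30, 0, 0, ZoneOffset.UTC);
        // Would be passed to response.setHeader("X-Robots-Tag", ...)
        System.out.println(headerValue(expiry));
    }
}
```

The value would then be set with the same `response.setHeader("X-Robots-Tag", ...)` call used elsewhere in this thread.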
https://developers.google.com/webmasters/control-crawl-index/docs/robots_meta_tag?hl=nl
I added the following to event listeners:
response.setHeader("X-Robots-Tag", "noindex");
I removed the event listeners from robots.txt because blocking them there prevents Google from crawling and indexing their content.
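For reference, the removed robots.txt rule would have looked something like the following (the `/ajax/` path is hypothetical; any `Disallow` rule covering the event-listener URLs blocks Googlebot from crawling their responses entirely):

```
User-agent: *
Disallow: /ajax/
```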
I changed
response.setHeader("X-Robots-Tag", "noindex");
into
response.setHeader("X-Robots-Tag", "noarchive");
See 'Making AJAX Applications Crawlable': https://developers.google.com/webmasters/ajax-crawling/
This explains why the current approach doesn't work. The event listeners should be re-added to robots.txt to avoid indexing. The same effect can be achieved with noindex instead of noarchive:
response.setHeader("X-Robots-Tag", "noindex");
Proposed solution:
Now that you have your original URL back and you know what content the crawler
is requesting, you need to produce an HTML snapshot. How do you do that? There
are various ways; here are some of them:
* If a lot of your content is produced with JavaScript, you may want to use a headless
browser such as HtmlUnit to obtain the HTML snapshot. Alternatively, you can use
a different tool such as crawljax or watij.com.
* If much of your content is produced with a server-side technology such as PHP or
ASP.NET, you can use your existing code and only replace the JavaScript portions
of your web page with static or server-side created HTML.
* You can create a static version of your pages offline, as is the current practice.
For example, many applications draw content from a database that is then rendered
by the browser. Instead, you may create a separate HTML page for each AJAX URL.
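Under the AJAX-crawling scheme linked above, the crawler rewrites `#!` URLs into `?_escaped_fragment_=` requests, so the server has to map a crawler request back to the original URL before producing the HTML snapshot. A minimal sketch (class and method names are illustrative, not from Whizu):

```java
import java.net.URLDecoder;
import java.nio.charset.StandardCharsets;

public class EscapedFragment {
    static final String TOKEN = "?_escaped_fragment_=";

    // Map a crawler request URL back to the original #! URL,
    // e.g. ".../page?_escaped_fragment_=key=value" -> ".../page#!key=value"
    static String toOriginalUrl(String crawlerUrl) {
        int i = crawlerUrl.indexOf(TOKEN);
        if (i < 0) {
            return crawlerUrl; // not an AJAX-crawling request; leave unchanged
        }
        String base = crawlerUrl.substring(0, i);
        // The fragment arrives URL-encoded (%26 for &, etc.) and must be decoded.
        String fragment = URLDecoder.decode(
                crawlerUrl.substring(i + TOKEN.length()), StandardCharsets.UTF_8);
        return base + "#!" + fragment;
    }

    public static void main(String[] args) {
        System.out.println(toOriginalUrl("http://example.com/page?_escaped_fragment_=key=value"));
    }
}
```

The snapshot for the recovered `#!` URL can then be produced with any of the three options listed above.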
The content of a Whizu page is mostly generated on the server side and injected into the client with JavaScript. It should be verified that Google supports this and can index the generated content. If not, Whizu should support page preprocessing so that the (static) content is streamed with the initial page request.