Nov 12 2009
GoogleBot Follows URLs in JavaScript
I was very surprised the other day when I received a notification telling me that GoogleBot was having trouble executing JavaScript on my new app, Good Camel Games. The app is built with GWT and has tons of JavaScript, so I implemented a JavaScript error handler on the client that sends me a notification whenever an error occurs while executing some JavaScript. This is actually quite simple to implement:
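<script type="text/javascript">
// Report any uncaught JavaScript error to the server-side "/error" handler.
window.onerror = function(message, url, line) {
  window.location.href = '/error?msg=' + escape("Error: " + message + '\nUrl: ' + url + '\nLine: ' + line);
  return true; // Suppress the browser's default error reporting.
};
</script>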
Then, on the server side, I just have a servlet that handles “/error” and sends me a notification with the content of the request parameter “msg” plus some other info about the client.
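The servlet itself is not shown in this post, so here is only a rough sketch of the same idea, written in JavaScript rather than Java to match the client snippet. The function name `buildErrorNotification`, the shape of the `headers` object, and the bracketed `msg: [...]` layout are assumptions made for illustration; they are not the actual servlet code.

```javascript
// Hypothetical sketch: the real handler is a Java servlet that is not shown here.
// Builds the text of a notification from the request URL and the request headers.
function buildErrorNotification(requestUrl, headers) {
  // The base URL is only needed so that a root-relative URL can be parsed.
  const parsed = new URL(requestUrl, "http://localhost");
  // A request to "/error?msg=" yields an empty string; a missing parameter yields null.
  const msg = parsed.searchParams.get("msg") ?? "";
  return [
    "From: " + (headers["from"] ?? "(unknown)"),
    "user-agent: " + (headers["user-agent"] ?? "(unknown)"),
    "msg: [" + msg + "]", // Assumed layout: the value wrapped in brackets.
  ].join("\n");
}

// Usage: an empty "msg" value produces an empty pair of brackets.
console.log(buildErrorNotification("/error?msg=", {
  "from": "googlebot(at)googlebot.com",
  "user-agent": "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)",
}));
```

With a layout like this, an empty "msg" value is easy to tell apart from a real error report.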
So I received that error notification (just pasting the relevant parts here):
From: googlebot(at)googlebot.com
user-agent: Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)
msg: []
After looking at it carefully, I realized that GoogleBot does not actually execute JavaScript. You can see this because the “msg” parameter is there but empty, which means GoogleBot sent a request to “/error?msg=” without the parameter value (“Error: …”). So GoogleBot apparently just scans the JavaScript, takes whatever resembles a URL, and tries to index that URL.
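To make that hypothesis concrete, here is a minimal sketch of what "looking for URLs without executing the script" could mean. It is only a guess: the function `extractUrlLikeStrings` and its regular expressions are hypothetical and are not GoogleBot's actual logic. It scans the script source for quoted string literals and keeps those that look like an absolute URL or a root-relative path.

```javascript
// Hypothetical helper: a guess at static URL discovery, not GoogleBot's real algorithm.
function extractUrlLikeStrings(source) {
  // Match single- or double-quoted string literals, allowing backslash escapes.
  const literalPattern = /(['"])((?:\\.|(?!\1)[^\\\n])*)\1/g;
  const found = [];
  let match;
  while ((match = literalPattern.exec(source)) !== null) {
    const value = match[2];
    // Keep only literals that look like an absolute URL or a root-relative path.
    if (/^(https?:\/\/|\/)[^\s'"]+$/.test(value)) {
      found.push(value);
    }
  }
  return found;
}

// The error handler's redirect line, kept as raw text (it is never executed).
const handlerSource = String.raw`window.location.href='/error?msg=' + escape("Error: " + message + '\nUrl: ' + url + '\nLine: ' + line);`;
console.log(extractUrlLikeStrings(handlerSource)); // [ '/error?msg=' ]
```

A scanner like this picks up '/error?msg=' but never evaluates the concatenation that follows it, which would explain a request arriving with an empty "msg" value.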
After some research on the web, I found I was not the only one:
- This guy from seomoz.org, who had unexpected parts of his site indexed: New Reality: Google Follows Links in JavaScript.
- This guy explaining “Stealth Links” and mentioning JavaScript URLs as one of them: Stealth links and Googlebot.
- This guy with interesting stories about JavaScript crawling by GoogleBot: GoogleBot Can Crawl JavaScript With Clean URLs.
- This guy who got a call from GoogleBot: Googlebot called me on my Timpani Live Help.
- All in line with a Larry Page quote: “We added the ability to search for code in more than 40 different programming languages [...]” from Google Works on Unified Search Engine.
- This guy talking about the implications of that new feature in GoogleBot: Google Crawls Javascript, News at 11!
To be very sure, I added another test to the home page of Good Camel Games:
<script type="text/javascript">
function thisFunctionIsNeverCalled() {
  window.location.href = '/this_page_does_not_exists';
}
</script>
If GoogleBot hits the URL “/this_page_does_not_exists”, I will receive a 404 error notification. I added that code today, and I will update this post when I receive the notification. Hopefully, Google will some day have a true JavaScript engine to crawl JavaScript-heavy applications.
UPDATE – November 20 2009: I confirm that GoogleBot tried to access the page “/this_page_does_not_exists”. This confirms that GoogleBot does not execute the JavaScript; it just looks for URLs in it. Here is the 404 error notification I received (just pasting the relevant parts here):
From: googlebot(at)googlebot.com
user-agent: Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)
Request URL: http://games.goodcamel.com/this_page_does_not_exists