
Searchable Ajax Apps


This tutorial walks you through building a simple widget that listens for changes in the browser location hash and updates the content of the page. It demonstrates how to make a site Google crawlable and searchable.

The App

We'll make a mini app that updates the contents of the page with an Ajax request when a user clicks a navigation link. Then, we'll make this searchable with the ajaxy/scripts/crawl.js script.

The crawl script generates HTML pages that Google can use as a representation of the content of an Ajax application. Read Google's documentation on its Ajax crawling API before continuing with this tutorial.

Setup

Download and install the latest version of JavaScriptMVC.

After installing JavaScriptMVC, open a command line to the steal.config.root folder (where you unzipped JavaScriptMVC).

We'll use the application generator to generate an application skeleton folder. Run:

[WINDOWS] > js jmvc/generate/app ajaxy
[Lin/Mac] > ./js jmvc/generate/app ajaxy

The Code

In the generated ajaxy folder, you'll find ajaxy.html and ajaxy.js. We'll add a content area and a few links to ajaxy.html. When we click on the links, we'll make ajaxy.js load content into the content area.

Change ajaxy.html so it looks like:

<!DOCTYPE HTML>
<html lang="en">
    <head>
        <title>Ajaxy</title>
        <meta name="fragment" content="!">
    </head>
    <body>
        <a href='#!videos'>Videos</a>
        <a href='#!articles'>Articles</a>
        <a href='#!images'>Images</a>
        <div id='content'></div>
        <script type='text/javascript' 
            src='../steal/steal.js?ajaxy,development'>     
        </script>
    </body>
</html>

Notice that the page includes a <meta name="fragment" content="!"> tag. This tells Google to process ajaxy.html as a page with Ajax content.

Next, add some content to show when these links are clicked. Create the following files with the content shown:

ajaxy/fixtures/articles.html

<h1>Articles</h1>
<p>Some articles.</p>

ajaxy/fixtures/images.html

<h1>Images</h1>
<p>Some images.</p>

ajaxy/fixtures/videos.html

<h1>Videos</h1>
<p>Some videos.</p>

Finally, change ajaxy.js to look like:

steal('jquery',
      'can/construct/proxy',
      'can/control',
      'can/route',
      'steal/html',
      function($, can){

var Ajaxy = can.Control({
    "{route} change" : function(route, ev){
        this.updateContent(route.page)
    },
    updateContent : function(hash){
        // postpone reading the html 
        steal.html.wait();

        $.get("fixtures/" + hash + ".html", {}, this.proxy('replaceContent'), "text")
    },
    replaceContent : function(html){
        this.element.html(html);

        // indicate the html is ready to be crawled
        steal.html.ready();
    }
})

new Ajaxy('#content', { route: can.route(":page", { page: "videos" }) });

});

When a route change ("{route} change") event occurs, Ajaxy uses the route.page value to make a request ($.get) for content in the fixtures folder. For more information on routing, see can.route.

When the content is retrieved, it replaces the element's html (this.element.html(...)).

Ajaxy also calls updateContent when the page first loads, because can.route fires an initial change event with the default route data.
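You can see what the route is doing by experimenting in the browser console once the page has loaded. A rough sketch, assuming can.route's usual #! hash handling and its attr() accessor:

// ajaxy.js has already registered the route:
//   can.route(":page", { page: "videos" });

// with no hash (or #!videos), the default applies:
can.route.attr("page")      // -> "videos"

// clicking <a href='#!articles'> changes the hash, fires the
// "{route} change" event, and Ajaxy requests fixtures/articles.html:
can.route.attr("page")      // -> "articles"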

Crawling and scraping

To crawl your site and generate Google-searchable HTML, run:

[WINDOWS] > js ajaxy\scripts\crawl.js
[Lin/Mac] > ./js ajaxy/scripts/crawl.js

This script performs the following actions:

  1. Opens a page in a headless browser.
  2. Waits until its content is ready.
  3. Scrapes its contents.
  4. Writes the contents to a file.
  5. Adds any links in the page that start with #! to the list of pages to index.
  6. Changes the URL hash to the next indexable page.
  7. Repeats from step 2 until all pages have been loaded.

Pausing the HTML scraping

By default, the contents are scraped immediately after the page's scripts have loaded or the route has changed. The Ajax request for content happens asynchronously, so we have to tell steal.html to wait before scraping the content.

To do this, Ajaxy calls:

steal.html.wait();

before the Ajax request. And when the page is ready, Ajaxy calls:

steal.html.ready();
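The same pattern works for any control that fills in its content asynchronously, not just this tutorial's $.get. A minimal sketch (the setTimeout simply stands in for an arbitrary asynchronous operation):

steal.html.wait();                           // don't take the snapshot yet

setTimeout(function(){
    $('#content').html('<h1>Loaded later</h1>');
    steal.html.ready();                      // now the page can be scraped
}, 500);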

Getting Google To Crawl Your Site

If you haven't already, read up on Google's Ajax crawling API.

When Google wants to crawl your site, it will send a request to your page with an _escaped_fragment_= parameter.

When your server sees this parameter, redirect Google to the generated HTML page. For example, when the Google spider requests http://mysite.com?_escaped_fragment_=val, it is attempting to crawl http://mysite.com#!val. You should redirect this request to http://mysite.com/html/val.html.
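How you handle the redirect depends on your server. As one illustration, here is a minimal sketch using Node with Express; the Express server, the port, and the html/ output folder are assumptions for this example, not part of the generated app (point the path at wherever your crawl script writes its output):

var express = require('express');
var path = require('path');
var app = express();

app.get('/', function(req, res, next){
    var fragment = req.query._escaped_fragment_;
    if (fragment !== undefined) {
        // Google requested ?_escaped_fragment_=val, i.e. its crawlable
        // form of #!val, so serve the pre-generated snapshot instead.
        // (A real server should also sanitize this value.)
        var page = fragment || 'videos';     // empty fragment -> default page
        res.sendFile(path.join(__dirname, 'html', page + '.html'));
    } else {
        next();                              // normal visitors get the Ajax app
    }
});

app.use(express.static(__dirname));
app.listen(8080);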

Yes, it's that easy!

Phantom for Advanced Pages

By default, the crawl script uses EnvJS to open your page and build a static snapshot. For some pages, EnvJS won't be powerful enough to accurately simulate everything. If your page experiences errors, you can use PhantomJS (headless WebKit) to generate the snapshots instead, which may work better.

To turn on Phantom:

  1. Install PhantomJS by following its installation instructions.
  2. Open ajaxy/scripts/crawl.js and change the second parameter of steal.html.crawl to an options object with a browser option, like this:
steal('steal/html', function(){
    steal.html.crawl("ajaxy/ajaxy.html", {
        out: 'ajaxy/out',
        browser: 'phantomjs'
    })
})