• function

crawl

 

crawl(url, opts)

Parameters

  1. url {String}

    the starting page to crawl

  2. opts {String | Object}

    the location to put the crawled content: either an output path, or an options object such as { out: 'ajaxy/out', browser: 'phantomjs' }.

Loads an Ajax-driven page and generates the HTML for Google to crawl. Check out the tutorial for a more complete walkthrough.

This crawler indexes an entire Ajax site. It:

  1. Opens a page in a headless browser.
  2. Waits until its content is ready.
  3. Scrapes its contents.
  4. Writes the contents to a file.
  5. Adds any links in the page that start with #! to the list of pages to index.
  6. Changes window.location.hash to the next indexable page.
  7. Returns to step 2 and repeats until all pages have been loaded.
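
The loop described in the steps above can be sketched in plain JavaScript. This is only an illustration under stated assumptions, not the crawler's actual source: the pages map, findHashBangLinks, and crawlSketch are hypothetical names, and an in-memory lookup stands in for the headless browser and the file writes.

```javascript
// Hypothetical in-memory "site": maps a hash to the HTML rendered for it.
// A real crawl would render each page in a headless browser instead.
const pages = {
  "": '<a href="#!about">About</a> <a href="#!contact">Contact</a>',
  "#!about": '<a href="#!contact">Contact</a> <a href="#plain">Skip</a>',
  "#!contact": '<a href="#!about">About</a>'
};

// Collect every href value that starts with "#!".
function findHashBangLinks(html) {
  const links = [];
  const pattern = /href="(#![^"]*)"/g;
  let match;
  while ((match = pattern.exec(html)) !== null) {
    links.push(match[1]);
  }
  return links;
}

// Visit the start page, then every "#!" link found, each exactly once.
function crawlSketch(render, startHash) {
  const queue = [startHash];
  const seen = new Set(queue);
  const output = {}; // stands in for the files written in step 4

  while (queue.length > 0) {
    const hash = queue.shift();  // step 6: move to the next page
    const html = render(hash);   // steps 1-3: open, wait, scrape
    output[hash] = html;         // step 4: "write" the contents
    for (const link of findHashBangLinks(html)) { // step 5: queue new links
      if (!seen.has(link)) {
        seen.add(link);
        queue.push(link);
      }
    }
  }
  return output;
}

const result = crawlSketch((hash) => pages[hash], "");
console.log(Object.keys(result)); // visited pages, in crawl order
```

The seen set is what keeps the loop from revisiting pages that link to each other, and links that start with a plain # (such as #plain here) are never queued.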

2. Wait until content is ready.

By default, steal.html waits only until all scripts have finished loading before scraping the page's contents. To delay scraping further, use steal.html.delay and steal.html.ready.
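
One way to picture the delay and ready pairing is as a gate that holds back the scrape until every delay has been matched by a ready. The sketch below is a model of that idea only: createReadyGate, scriptsLoaded, and the counting behavior are assumptions made for illustration, not steal.html's real implementation or API.

```javascript
// Hypothetical model of a delay/ready gate; not steal.html's real code.
function createReadyGate(onReady) {
  let pending = 0;    // delays that have not yet been matched by a ready
  let loaded = false; // whether all scripts have finished loading
  let fired = false;  // ensures the scrape runs only once

  function fireIfClear() {
    if (!fired && loaded && pending === 0) {
      fired = true;
      onReady();
    }
  }

  return {
    // Ask the crawler to hold off scraping.
    delay() { pending += 1; },
    // Signal that one delayed piece of content has arrived.
    ready() {
      if (pending > 0) pending -= 1;
      fireIfClear();
    },
    // Called once scripts finish loading; scrapes only if nothing is pending.
    scriptsLoaded() {
      loaded = true;
      fireIfClear();
    }
  };
}

const events = [];
const gate = createReadyGate(() => events.push("scrape"));

gate.delay();          // e.g. an Ajax request has started
gate.scriptsLoaded();  // scripts are done, but content is still pending
events.push("ajax done");
gate.ready();          // content rendered, so the scrape can run

console.log(events);
```

In this model the scrape is recorded only after the delayed content reports that it is ready, even though the scripts finished loading earlier.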

4. Write the contents to a file.

You can change where the contents of the file are written by changing the second parameter passed to crawl.

By default, crawl uses EnvJS, but you can use PhantomJS for more advanced pages:

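// Load the steal/html plugin, then crawl starting from ajaxy/ajaxy.html.
steal('steal/html', function(){
    steal.html.crawl("ajaxy/ajaxy.html", {
        out: 'ajaxy/out',       // where the crawled content is written
        browser: 'phantomjs'    // use PhantomJS instead of the default EnvJS
    });
});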