Making information graphics these days often requires scraping data from web sites, and Ruby is currently my go-to language for most scraping tasks. The process of building a web scraper often involves a lot of trial and error, and I don’t necessarily want to pound the same site with HTTP requests again and again as I tweak and debug code.
So I wrap HTTP requests in a tiny class that saves each response to the file system; if you request the same URL again, it loads the cached data instead of making another HTTP request:
require 'net/http'
require 'uri'

class HTTPCacher
  def initialize( base_dir )
    @base_dir = base_dir
  end

  # Fetch the URL, or return the previously cached response if one
  # already exists for this key
  def get( url, key )
    cached_path = @base_dir + '/' + key
    if File.exist?( cached_path )
      puts "Getting file #{key} from cache"
      return IO.read( cached_path )
    else
      puts "Getting file #{key} from URL #{url}"
      resp = Net::HTTP.get_response( URI.parse( url ) )
      data = resp.body
      # Write the response body to disk so the next request for this key
      # can skip the network entirely
      File.open( cached_path, 'w' ) do |f|
        f.write data
      end
      return data
    end
  end
end
Usage is pretty simple. Create a new HTTPCacher object
getter = HTTPCacher.new( '/path/to/data/dir/here' )
and then make a get request, passing two parameters: the URL to fetch and the key you want to cache it under. Any further request with that cache key will load the file straight from the filesystem.
data = getter.get( 'http://otter.topsy.com/search.json?q=ipad&window=auto', 'ipad.html' )
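Since the Topsy endpoint above returns JSON, a typical pattern is to parse whatever get hands back; whether the data came from the network or from the cache, it's the same bytes. A small sketch (the JSON parsing step is my addition, not part of the class):

require 'json'

getter = HTTPCacher.new( '/path/to/data/dir/here' )

# First call makes the HTTP request and writes the response to ipad.html in the cache dir
data = getter.get( 'http://otter.topsy.com/search.json?q=ipad&window=auto', 'ipad.html' )

# Calling get again with the same key reads straight from disk -- no HTTP request is made
data = getter.get( 'http://otter.topsy.com/search.json?q=ipad&window=auto', 'ipad.html' )

results = JSON.parse( data )
puts results.class   # Hash or Array, depending on what the endpoint returns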
Note that keeping your keys unique between URLs is entirely up to you. If you request two different URLs but pass the same key, the cacher can’t tell them apart and will return the first URL’s cached data on the second request.
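If you’d rather not pick keys by hand, one option is to derive the key from the URL itself, for example by hashing it, so each distinct URL automatically gets its own cache file. A minimal sketch, assuming you’re happy reopening the class and adding a get_by_url helper (my own extension, not part of the original):

require 'digest'

class HTTPCacher
  # Derive the cache key from the URL itself so two different URLs
  # can never collide on the same file name
  def get_by_url( url )
    get( url, Digest::MD5.hexdigest( url ) )
  end
end

data = getter.get_by_url( 'http://otter.topsy.com/search.json?q=ipad&window=auto' )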