Making information graphics these days often requires scraping data from websites, and Ruby is currently my go-to language for most scraping tasks. Building a web scraper involves a lot of trial and error, and I don't want to pound the same site with HTTP requests again and again as I tweak and debug code.
So, I wrap HTTP requests in a tiny class that saves each response to the file system; if you request the same URL again, it loads the cached data instead of making another HTTP request:
require 'net/http'
require 'uri'

class HTTPCacher
  def initialize( base_dir )
    @base_dir = base_dir
  end

  def get( url, key )
    cached_path = File.join( @base_dir, key )
    if File.exist?( cached_path )
      # We've fetched this key before -- read the response from disk
      puts "Getting file #{key} from cache"
      return IO.read( cached_path )
    else
      # First request for this key -- fetch it and cache the response body
      puts "Getting file #{key} from URL #{url}"
      resp = Net::HTTP.get_response( URI.parse( url ) )
      data = resp.body
      File.open( cached_path, 'w' ) do |f|
        f.write( data )
      end
      return data
    end
  end
end
Usage is pretty simple. Create a new HTTPCacher object
getter = HTTPCacher.new( '/path/to/data/dir/here' )
and then make a get request, passing two parameters: the URL, and the key you want to cache it under. Any further requests with that cache key will load the file straight from the filesystem.
data = getter.get( 'http://otter.topsy.com/search.json?q=ipad&window=auto', 'ipad.html' )
Note that making sure your keys are unique across URLs is entirely up to you. If you request two different URLs but pass the same key, the cacher can't tell them apart and will return the first URL's cached data on the second request.
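If you'd rather not manage keys by hand, one option (a sketch of my own, not part of the class above; the cache_key_for helper is hypothetical) is to derive the key from the URL itself by hashing it, so every distinct URL gets a distinct key:

require 'digest'

# Hypothetical helper: hash the URL so distinct URLs never collide
def cache_key_for( url )
  Digest::SHA1.hexdigest( url )
end

url = 'http://otter.topsy.com/search.json?q=ipad&window=auto'
data = getter.get( url, cache_key_for( url ) )

The downside is that the cached filenames are no longer human-readable, so for small scrapes I tend to stick with hand-picked keys.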