Over the past few months, one of my favorite tools has become CasperJS, a navigation and testing utility that runs on top of PhantomJS, a headless web browser. It’s a great tool for web scraping, which lets you automate the retrieval of data from webpages, among other things. Sometimes, though, you want to test a target web page from a variety of IP addresses, or you find yourself behind a block of banned IP addresses, or you just need to anonymize your activity. This is where Tor comes in handy.
Web scraping is a lot of fun, but make sure you are following the commonly accepted rules of web scraping:
- Make sure you’re following the target site’s Terms of Service. This means respecting `robots.txt` and any other restrictions there may be.
- Limit your requests. Scraping bots can navigate webpages much faster than humans, and you don’t want to accidentally DoS a site with an out-of-control scraper.
- Be nice to the server. If you don’t need images, modify your scraper so it doesn’t download images (PhantomJS has a `--load-images=false` flag for this). If you want to be really nice to the server, put your e-mail address in the scraper’s HTTP headers so the server admin can contact you if your scraper is giving them a problem.
(These rules were partially adapted from this list)
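As a rough illustration of the last two rules, here is a minimal sketch of CasperJS settings that skip images and identify the scraper. `loadImages` and `userAgent` are standard page settings; the `politeOptions` helper, the bot name, the contact address, and the example.com URLs are placeholders I made up. The `typeof phantom` guard is also my addition, so the file does nothing when loaded outside CasperJS.

```javascript
// Build options for require('casper').create() that go easy on the server.
// politeOptions, the bot name and the e-mail address are made up for this sketch.
function politeOptions(contactEmail) {
    return {
        pageSettings: {
            loadImages: false, // per-page counterpart of --load-images=false
            // The User-Agent travels as an HTTP header with every request,
            // so an admin reading the access log can see who to contact.
            userAgent: 'my-scraper/0.1 (contact: ' + contactEmail + ')'
        }
    };
}

// PhantomJS exposes a global `phantom`; only touch CasperJS when it exists.
if (typeof phantom !== 'undefined') {
    var casper = require('casper').create(politeOptions('you@example.com'));

    casper.start('http://example.com/page-one');
    casper.wait(2000); // limit your requests: pause two seconds between pages
    casper.thenOpen('http://example.com/page-two');
    casper.run();
}
```

The two-second pause is an arbitrary value; pick whatever the target site can comfortably handle.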
Now, on to how to “Tor-ify” CasperJS.
Install Tor
Well, obvs. I normally use RHEL-based Linux distros, so I’m going to link to those instructions. Once `torproject.repo` is in your `/etc/yum.repos.d`, you should be able to `yum install tor` with no problem. Then start the service with `service tor start`.
After you’ve confirmed that Tor can run on your machine, feel free to shut it down, as we’ll be coming back to that later.
Write a Script for Testing
How will you be able to tell that CasperJS is properly proxying through Tor? Why not write a script that scrapes whatismyip.com?
By force of habit, I normally add `.trim()` onto the end of any text I pull out of a DOM node, since there’s no point in keeping useless whitespace.
Now, let’s start writing our CasperJS script.
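A minimal sketch of creating a Casper instance with logging switched on. `verbose` and `logLevel` are standard CasperJS options; picking `'debug'` rather than `'info'` is my assumption, and the `typeof phantom` guard is my addition so the fragment is harmless outside CasperJS.

```javascript
// Options passed to require('casper').create().
var casperOptions = {
    verbose: true,     // print log messages to the terminal as they happen
    logLevel: 'debug'  // show everything: debug, info, warning and error
};

// PhantomJS exposes a global `phantom`; only create the instance under CasperJS.
if (typeof phantom !== 'undefined') {
    var casper = require('casper').create(casperOptions);
}
```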
I like to keep logging on for most of my Casper scripts, just because it’s helpful to see what’s going on and the output looks cool. Now we’ll define the first step of our scraping process:
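A minimal sketch of that first step, assuming the usual `casper.start(url, callback)` form. The `reportLoaded` name and its echo message are mine, added only so there is something to see once the page has loaded; the `typeof phantom` guard is also my addition.

```javascript
var TARGET_URL = 'http://whatismyip.com';

// Step callback: CasperJS calls it with `this` bound to the casper instance.
function reportLoaded() {
    this.echo('Loaded ' + this.getCurrentUrl());
}

// PhantomJS exposes a global `phantom`; only wire things up under CasperJS.
if (typeof phantom !== 'undefined') {
    var casper = require('casper').create({ verbose: true, logLevel: 'debug' });

    // First step: open the page, then run the callback once it has loaded.
    casper.start(TARGET_URL, reportLoaded);
}
```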
This simply tells CasperJS to use PhantomJS to load up http://whatismyip.com.
To pull anything out of the loaded page, we need CasperJS’s `evaluate` function. So we’ll add to our first step:
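Here is a sketch of the kind of function that gets handed to `this.evaluate`. I don’t want to guess at the page’s markup, so `IP_SELECTOR` is a made-up placeholder: inspect the live page and substitute the selector of the node that actually holds the address. The `readIp` name and the `'not found'` fallback are also mine.

```javascript
// Placeholder selector (an assumption): replace it after inspecting the page.
var IP_SELECTOR = '#ip-address';

// Meant to run inside the page: `document` and `console` are the page's own.
function readIp(selector) {
    var node = document.querySelector(selector);
    // trim() drops the whitespace that usually surrounds text in a DOM node.
    var ip = node ? node.textContent.trim() : 'not found';
    console.log('My IP is: ' + ip);
    return ip;
}

// Inside the first step it would be called as:
//     this.evaluate(readIp, IP_SELECTOR);
// Arguments after the function are passed through to it inside the page.
```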
Once you’re inside `this.evaluate`, you have access to the `document` object. Also, notice how we do output with `console.log` instead of `this.echo`. This is because the function passed to `evaluate` runs in the page’s own context, where you no longer have access to the `this` that refers to the Casper object.
Ok, so tying everything together for this really complicated script:
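Put together, the whole thing might look like the sketch below, saved as `whatismyip.js`. It is a reconstruction under assumptions rather than a definitive listing: `IP_SELECTOR` is a made-up placeholder for whatever node holds the address on the live page, `readIp` is my name for the page-side function, and the `typeof phantom` guard is my addition so the file does nothing outside CasperJS.

```javascript
// whatismyip.js
// Placeholder selector (an assumption): replace it after inspecting the page.
var IP_SELECTOR = '#ip-address';

// Runs inside the page, so it uses the page's `document` and `console`.
function readIp(selector) {
    var node = document.querySelector(selector);
    var ip = node ? node.textContent.trim() : 'not found';
    console.log('My IP is: ' + ip); // relayed to the terminal as a remote message
    return ip;
}

// PhantomJS exposes a global `phantom`; only run the scrape under CasperJS.
if (typeof phantom !== 'undefined') {
    var casper = require('casper').create({ verbose: true, logLevel: 'debug' });

    casper.start('http://whatismyip.com', function () {
        // Hand the function and its argument over to the page context.
        this.evaluate(readIp, IP_SELECTOR);
    });

    casper.run(); // kicks off the queued steps
}
```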
You always have to put `casper.run()` at the end of your scripts to kick off the process of running through the steps. Now, if you watch your terminal output, you should see your IP address in the `[info] [remote]` section of the output. Yay!
Now, start up Tor again and we’ll pass in some parameters and see if the IP address changes.
Tor is a SOCKS proxy, not an HTTP proxy. (Most of my previous exposure to Tor was through the Tor Browser Bundle, so this was interesting to me.) But PhantomJS can also run through a SOCKS proxy, so no worries. Add these parameters when you start the script: `--proxy=127.0.0.1:9050 --proxy-type=socks5`
(If your Tor proxy isn’t running on 127.0.0.1:9050, you should change that parameter. You’ll see where the Tor proxy is running when you run `service tor start`.)
So, running my final Tor-ified CasperJS setup looks a little like this:
casperjs --proxy=127.0.0.1:9050 --proxy-type=socks5 whatismyip.js
Note: while using Tor, requests can take significantly longer to go through. I’ve seen some requests with this script take as long as 114 seconds to resolve, but such is the nature of Tor.
Won’t you do the world a favor and add a relay to the Tor network?