Tor-ifying CasperJS
Over the past few months, one of my favorite tools has become CasperJS, a navigation and testing utility that runs on top of PhantomJS, a headless web browser. It is a great tool for web scraping, which you can use to automate the retrieval of data from webpages, among other things. Sometimes, though, you want to test a target web page from a variety of different IP addresses, or you find yourself behind a block of banned IP addresses, or you just need to anonymize your activity. This is where Tor comes in handy.
Web scraping is a lot of fun, but make sure you are following the commonly accepted rules of web scraping:
- Make sure you’re following the target site’s Terms of Service. This means respecting robots.txt and any other restrictions there may be.
- Limit your requests. Scraping bots can navigate webpages much faster than humans can, and you don’t want to accidentally DoS a site with an out-of-control scraper.
- Be nice to the server. If you don’t need images, modify your scraper so it doesn’t download them (PhantomJS has a --load-images=false flag for this). If you want to be really nice to the server, put your e-mail address in the scraper’s HTTP headers so the server admin can contact you if your scraper is giving them a problem.
(These rules were partially adapted from this list)
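As a rough illustration of the last two rules, here is a minimal sketch of a "polite" CasperJS setup. It assumes you are happy to carry the contact address in the User-Agent header; the scraper name, the e-mail address, the two-second delay, and the example.com URLs are all placeholders, not anything from a real project. The `typeof phantom` guard is only there so the helper can be loaded outside PhantomJS; a real script can drop it.

```javascript
// Sketch only: name, address, delay, and URLs below are placeholders.
function politeOptions(contactEmail) {
  return {
    pageSettings: {
      loadImages: false,  // same idea as PhantomJS's --load-images=false
      loadPlugins: false,
      // The User-Agent is an HTTP header, so the admin can see who to contact.
      userAgent: 'my-scraper (contact: ' + contactEmail + ')'
    }
  };
}

var DELAY_MS = 2000; // pause between page loads so the server gets a break

// Only runs under PhantomJS/CasperJS, where the `phantom` global exists.
if (typeof phantom !== 'undefined') {
  var casper = require('casper').create(politeOptions('you@example.com'));
  casper.start('http://example.com/page1');
  casper.wait(DELAY_MS);
  casper.thenOpen('http://example.com/page2');
  casper.run();
}
```

How long to wait between requests is a judgment call; the point is simply that the delay is explicit rather than "as fast as the bot can go".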
Now, on to how to “Tor-ify” CasperJS.
Download Tor
Well, obvs. I normally use RHEL-based Linux distros, so I’m going to link to those instructions. Once torproject.repo is in your /etc/yum.d or /etc/yum.repos.d, you should be able to yum install tor with no problem. Then start the service with service tor start.
After you’ve confirmed that Tor can run on your machine, feel free to shut it down, as we’ll be coming back to that later.
Write a Script for Testing
How will you be able to tell that CasperJS is properly proxying through Tor? Why not write a script that scrapes whatismyip.com?
Looking at the source of whatismyip.com, the page looks pretty simple to scrape: the IP address is contained in a div with a handy id that we can pull data from. My workflow for writing scripts with CasperJS is to fire up the target webpage, write some code in Chrome Developer Tools, then copy that code back into my CasperJS script. For this page, a bit of vanilla JavaScript is enough to grab the IP address.
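A minimal sketch of that kind of lookup might look like the following. The element id used here (ip-address) is hypothetical, since the page's actual id is not given in this post; check the page source and substitute the real one. The helper takes the document as a parameter so it can be tried in the browser console or against a stand-in object.

```javascript
// Hypothetical id: replace it with the one found in the page source.
var IP_ELEMENT_ID = 'ip-address';

// Returns the trimmed text of the IP element, or null if it is not there.
function readIp(doc) {
  var node = doc.getElementById(IP_ELEMENT_ID);
  return node ? node.textContent.trim() : null;
}

// In the Chrome Developer Tools console you would call: readIp(document)
```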
As a force of habit, I normally add .trim() on to the end of any text I pull out of a DOM node since there’s no point in keeping useless whitespace.
Now, let’s start writing our CasperJS script.
I like to keep logging on for most of my Casper scripts, just because it’s helpful to see what’s going on and the output looks cool. Now we’ll define the first step of our scraping process:
This simply tells CasperJS to use PhantomJS to load up http://whatismyip.com.
An important note about CasperJS: even though everything is written in JavaScript, you can’t actually manipulate or read the page you’ve loaded into CasperJS without running the evaluate function. So we’ll add to our first step:
Once you’re inside this.evaluate, you have access to the document object. Also, notice how we do output with console.log instead of this.echo. This is because once inside the evaluate function, you no longer have access to the this that refers to the Casper object.
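One way to work with that boundary, shown here as a sketch rather than as the approach used in this post, is to have the evaluated function return the value and do the echo back on the Casper side. The helper name reportIp and the element id passed to it are made up for illustration. It takes the Casper instance as a parameter, so inside a step you would call it as reportIp(this, 'ip-address').

```javascript
// Sketch: fetch a value inside the page, then print it from the Casper side.
function reportIp(casperInstance, elementId) {
  // This inner function runs in the page: it can see `document`
  // but not the Casper object, so it just returns a plain string.
  var ip = casperInstance.evaluate(function (id) {
    var node = document.getElementById(id);
    return node ? node.textContent.trim() : null;
  }, elementId);

  // Back outside evaluate, the Casper object is available again.
  casperInstance.echo('IP address: ' + ip);
  return ip;
}
```

Values crossing the evaluate boundary need to be simple, serializable data such as strings and numbers, which is why the helper returns text rather than the DOM node itself.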
Ok, so tying everything together for this really complicated script:
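The steps described so far can be sketched as one script, saved as whatismyip.js. This is a reconstruction under assumptions, not the original listing: the element id is hypothetical, and the `typeof phantom` guard exists only so the page-side helper can be loaded outside PhantomJS (a real script can call casper directly).

```javascript
// whatismyip.js: a reconstruction sketch, not the original listing.
var TARGET_URL = 'http://whatismyip.com';
var IP_ELEMENT_ID = 'ip-address'; // hypothetical id; check the page source

// Runs inside the page via evaluate(), so it only uses `document`,
// `console`, and the argument it is handed.
function readIp(elementId) {
  var node = document.getElementById(elementId);
  var ip = node ? node.textContent.trim() : 'not found';
  console.log('IP address: ' + ip); // shows up as [info] [remote] in the log
  return ip;
}

// Only runs under PhantomJS/CasperJS, where the `phantom` global exists.
if (typeof phantom !== 'undefined') {
  // Logging on, so every step and remote message is visible.
  var casper = require('casper').create({ verbose: true, logLevel: 'debug' });

  // First (and only) step: load the page, then read the IP from the DOM.
  casper.start(TARGET_URL, function () {
    this.evaluate(readIp, IP_ELEMENT_ID);
  });

  casper.run();
}
```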
You always have to put casper.run() at the end of your scripts to kick off the process of running through the steps. Now, if you watch your terminal output, you should see your IP address in the [info] [remote] section of the output. Yay!
Now, start up Tor again and we’ll pass in some parameters and see if the IP address changes.
Proxy PhantomJS
Tor is a SOCKS proxy, not an HTTP proxy. (Most of my previous exposure to Tor was through the Tor Browser Bundle, so this was interesting to me). But PhantomJS can also run through a SOCKS proxy, so no worries. Add these parameters when you start the script:
--proxy=127.0.0.1:9050
--proxy-type=socks5
(If your Tor proxy isn’t running on 127.0.0.1:9050, you should change that parameter. You’ll see where the Tor proxy is running when you run service tor start.)
So, running my final Tor-ified CasperJS setup looks a little like this:
casperjs --proxy=127.0.0.1:9050 --proxy-type=socks5 whatismyip.js
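If you end up launching scrapers from another JavaScript program, it can help to build the proxy flags in one place so the host and port are easy to change. The sketch below only assembles and prints the command line; the helper name torProxyArgs is made up for illustration, and the host and port are the defaults mentioned above.

```javascript
// Builds the two PhantomJS proxy flags for a local Tor SOCKS proxy.
function torProxyArgs(host, port) {
  return ['--proxy=' + host + ':' + port, '--proxy-type=socks5'];
}

// Assemble the full command for the test script and print it.
var command = ['casperjs']
  .concat(torProxyArgs('127.0.0.1', 9050), ['whatismyip.js'])
  .join(' ');

console.log(command);
// prints: casperjs --proxy=127.0.0.1:9050 --proxy-type=socks5 whatismyip.js
```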
Note: while using Tor, requests can take significantly longer to go through. I’ve seen some requests with this script take as long as 114 seconds to resolve, but such is the nature of Tor.
Won’t you do the world a favor and add a relay to the Tor network?