Over the past few months, one of my favorite tools has become CasperJS, which is a navigation and testing utility that runs on top of PhantomJS, a headless web browser. This is a great tool for doing web scraping, which you can use to automate the retrieval of data from webpages, among other things. Sometimes though, you want to test a target web page from a variety of different IP addresses, or find yourself behind a block of banned IP addresses, or just need to anonymize your activity. This is where Tor comes in handy.
Web scraping is a lot of fun, but make sure you are following the commonly accepted rules of web scraping:
Make sure you’re following the target site’s Terms of Service. This means respecting robots.txt and any other restrictions there may be.
Limit your requests. Scraping bots can navigate webpages much faster than humans, and you don’t want to accidentally DoS a site with an out-of-control scraper.
Be nice to the server. If you don’t need images, modify your scraper so it doesn’t download images (PhantomJS has a --load-images=false flag for this). If you want to be really nice to the server, put your e-mail address in the scraper’s HTTP headers so the server admin can contact you if your scraper is giving them a problem.
(These rules were partially adapted from this list)
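Here is a rough sketch of how the last two rules might look in a CasperJS script. The helper name politeOptions, the scraper name, the example.com URLs, the e-mail address, and the two second delay are all placeholders I made up for illustration; the only CasperJS pieces used are the pageSettings option, casper.wait, and casper.thenOpen.

```javascript
// Hypothetical helper: builds CasperJS options for a "polite" scraper.
// The scraper name and e-mail are placeholders, not values from this post.
function politeOptions(contactEmail) {
  return {
    pageSettings: {
      // Skip images, comparable to PhantomJS's --load-images=false flag.
      loadImages: false,
      // Put a contact address in the User-Agent header for the server admin.
      userAgent: 'my-scraper/0.1 (contact: ' + contactEmail + ')'
    }
  };
}

var DELAY_MS = 2000; // assumed pause between page loads

// Only runs under CasperJS/PhantomJS, where the `phantom` global exists.
if (typeof phantom !== 'undefined') {
  var casper = require('casper').create(politeOptions('you@example.com'));
  casper.start('http://example.com/page/1');
  casper.wait(DELAY_MS); // slow down before the next request
  casper.thenOpen('http://example.com/page/2');
  casper.run();
}
```

This does not check robots.txt or the Terms of Service for you; that part is still a manual step.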
Now, on to how to “Tor-ify” CasperJS.
Well, obvs. I normally use RHEL-based Linux distros, so I’m going to link to those instructions. Once torproject.repo is in your /etc/yum.d or /etc/yum.repos.d, you should be able to yum install tor with no problem. Then start the service with service tor start.
After you’ve confirmed that Tor can run on your machine, feel free to shut it down, as we’ll be coming back to that later.
Write a Script for Testing
How will you be able to tell that CasperJS is properly proxying through Tor? Why not write a script that scrapes whatismyip.com?
var ip = document.getElementById('greenip').textContent.trim()
As a force of habit, I normally add .trim() on to the end of any text I pull out of a DOM node since there’s no point in keeping useless whitespace.
Now, let’s start writing our CasperJS script.
I like to keep logging on for most of my Casper scripts, just because it’s helpful to see what’s going on and the output looks cool. Now we’ll define the first step of our scraping process:
This simply tells CasperJS to use PhantomJS to load up http://whatismyip.com.
Once you’re inside this.evaluate, you have access to the document object. Also, notice how we do output with console.log, instead of this.echo. This is because once inside the evaluate function, you no longer have access to the this that refers to the Casper object.
Ok, so tying everything together for this really complicated script:
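A minimal sketch of what such a script could look like, assuming the IP address still lives in an element with the id greenip. The function name logIp and its optional doc argument are my own additions so the extraction logic can be tried outside a browser; inside the page it falls back to the global document.

```javascript
// Runs inside the page via evaluate, so it can touch the DOM but not the
// casper object. Output goes through console.log, which CasperJS reports
// as a [remote] message.
function logIp(doc) {
  var node = (doc || document).getElementById('greenip');
  var ip = node ? node.textContent.trim() : null;
  console.log(ip);
  return ip;
}

// Only runs under CasperJS/PhantomJS, where the `phantom` global exists.
if (typeof phantom !== 'undefined') {
  // Keep logging on so each step shows up in the terminal.
  var casper = require('casper').create({
    verbose: true,
    logLevel: 'debug'
  });

  // First step: load the page.
  casper.start('http://whatismyip.com');

  // Second step: pull the IP out of the DOM.
  casper.then(function () {
    this.evaluate(logIp);
  });

  // Kick off the steps defined above.
  casper.run();
}
```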
You always have to put casper.run() at the end of your scripts to kick off the process of running through the steps. Now, if you watch your terminal output, you should see your IP address in the [info] [remote] section of the output. Yay!
Now, start up Tor again and we’ll pass in some parameters and see if the IP address changes.
Tor is a SOCKS proxy, not an HTTP proxy. (Most of my previous exposure to Tor was through the Tor Browser Bundle, so this was interesting to me). But PhantomJS can also run through a SOCKS proxy, so no worries. Add these parameters when you start the script:
(If your Tor proxy isn’t running on 127.0.0.1:9050, you should change that parameter. You’ll see where the Tor proxy is running when you run service tor start.)
So, running my final Tor-ified CasperJS setup looks a little like this:
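As a sketch, the command can be assembled like this. The --proxy and --proxy-type options are PhantomJS command-line options; the function name torCommand and the script name whatismyip.js are placeholders, so substitute whatever you called your own script.

```javascript
// Builds the command line for running a CasperJS script through Tor's
// SOCKS proxy. Host and port default to the address used in this post.
function torCommand(script, host, port) {
  var args = [
    '--proxy=' + (host || '127.0.0.1') + ':' + (port || 9050),
    '--proxy-type=socks5', // Tor speaks SOCKS, not HTTP
    script
  ];
  return ['casperjs'].concat(args).join(' ');
}

console.log(torCommand('whatismyip.js'));
// casperjs --proxy=127.0.0.1:9050 --proxy-type=socks5 whatismyip.js
```

If the proxy is doing its job, the IP address logged by the script should differ from the one you saw without these options.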
I’ve been using Drupal to rapidly prototype an MVP (and I have many, many thoughts on Drupal, but I won’t get into them here). One of the features was to have a list of followers (who follow a user) and a list of the users that a certain user follows, which is a very typical setup. I’m writing this post because I had one hell of a time putting this thing together in Drupal 7, and I’m hoping it will save someone else from going through the pain I went through.
First, let’s rubber duck and clearly define our requirements for the two lists.
Following: A list of users that a specific user has flagged ‘follow’. In other words, a list of users flagged by another user.
Followers: A list of the users that have flagged a specific user to ‘follow’. In other words, a list of users that have flagged another user.
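To make the two definitions concrete, here is a small sketch that treats every ‘follow’ as a (flagger, flagged) pair. This is plain JavaScript with made-up uids, not Drupal code; it only illustrates which side of the pair each list filters on.

```javascript
// Each record means: `flagger` flagged `flagged` with the 'follow' flag.
// The uids below are made-up sample data.
var flags = [
  { flagger: 1, flagged: 2 },
  { flagger: 1, flagged: 3 },
  { flagger: 2, flagged: 3 },
  { flagger: 3, flagged: 1 }
];

// Following: the users that `uid` has flagged.
function following(uid) {
  return flags
    .filter(function (f) { return f.flagger === uid; })
    .map(function (f) { return f.flagged; });
}

// Followers: the users that have flagged `uid`.
function followers(uid) {
  return flags
    .filter(function (f) { return f.flagged === uid; })
    .map(function (f) { return f.flagger; });
}

console.log(following(1)); // [ 2, 3 ]
console.log(followers(3)); // [ 1, 2 ]
```

The two lists use the same data and differ only in which column is matched against the uid from the URL.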
I’m using the terminology flag here since we’ll be using the flag module to put together the following/followers functionality. The two modules we’ll need are:
First, you’re going to need to create a new flag for ‘follow’. Setting up a new flag is fairly trivial, and I called my flag ‘follow_user’ and set it to be a global flag.
Now, create a new view for Followers. You’re going to need to set the following options:
Title (very important, but just for you)
Path: /user/%/followers (or anything you’d like, as long as the % is in there)
Contextual Filter: User(uid) (this is to grab the uid out of the % in the path)
Set the Contextual Filter above to ‘do not use a relationship’. This step is important.
Create a ‘Flags: User Flag’ relationship for any user (not just the current user)
Create a ‘Flag: User’ relationship to grab the user data from the above relationship. (the relationship drop down should reference the flag created in the step above)
Create a ‘User:Name’ field that uses the above relationship
And that’s how you do the followers tab. For the following tab, create another view with basically the same parameters (the path will be different, obvs), but the major difference is with the Contextual Filter:
Make sure the contextual filter is referencing the Flag User relationship (you may need to set up the relationships first, then add the Contextual Filter).
Also make sure that the ‘User:Name’ field does not reference the relationship (or else it will just repeat the name of the current user you are viewing).
For reference sake, here is what the views look like in my admin:
Facebook is making changes to its Data Use Policy and Terms of Service (Statement of Rights and Responsibilities). Per the current policies, these changes have to be put up to a vote by Facebook users. Of course, the vote is only valid if 30% of the user base participates in the voting (that’s about 300 million people). At the time of this writing, only about 16 thousand users have voted and a majority of those (90%) have voted against the new policies.
In addition to minor changes clarifying data collection, privacy settings, and affiliates, the major change that comes with the new policies is the abolition of voting itself. If the new policies are passed, then there will be no more voting on new policies. Instead, there will be a seven day comment period before they are put in place.
Given the virality of the fake Facebook copyright notice, I’m surprised that the voter turnout is so low. On the other hand, Facebook has not done much in terms of publicising this effort. I suspect they want to do away with voting altogether, as it has never had a meaningful impact on the site’s policies and operations.
In lieu of a document comparing the two policies (other than this pdf from Facebook), I’ve created a little voters guide.
Here are the upcoming Facebook policy changes:
Data Use Policy
Old Policy: Required to provide name, email, birthday, and gender.
New Policy: Have to provide information such as that in the old policy, but may use a telephone number.
Information received by Facebook
Old Policy: Does not mention Facebook’s affiliates.
New Policy: Language modified to include Facebook’s affiliates (see note).
Note on Affiliates
Facebook defines affiliates to be businesses that are legally part of the same group Facebook is a part of. I take this to mean Facebook Ireland, which is a company set up to take advantage of Ireland’s tax laws, and Facebook Hyderabad.
Messaging (@facebook.com email address)
Old Policy: Contained details on how to control who messages you.
New Policy: Now anyone in a conversation can message you.
How Facebook uses your information New Policy:
Insertion of a clause – “in addition to helping people see and find things that you do and share” – prefacing examples of how Facebook uses your information.
New Policy: A reminder that even though you may hide a post, people may see it elsewhere, like on someone else’s timeline or search results.
Finding you on Facebook
Only friends will be able to find you via e-mail address or phone number, depending on your privacy settings
People will be able to find you through a post to a public page or if you are tagged in a friend’s post or photo
Personalized Ads New Policy: Clarifies how personalized ads work.
Sponsored Stories New Policy: Subscribers, in addition to friends, will see sponsored stories.
Data Retention New Policy: Facebook may retain information from suspended accounts for up to a year to detect repeat offenders.
Facebook will send up to two reminders to friends you invite.
Facebook will send a few reminders to friends you invite.
Affiliates New Policy: Facebook may share information about affiliates (see note above).
Opportunity to Comment and Vote New Policy: Voting is removed, as is the 7000 comment threshold for triggering a vote. Now users have seven days to comment before a change goes into effect.
Statement of Rights and Responsibilities
Your Facebook Timeline
You will not use a personal Facebook account primarily for commercial gain.
You will not use a personal Facebook account primarily for commercial gain, and will create a Facebook Page to do so.
Promotions from Pages New Policy: If you run a promotion on your timeline from your page, you agree to the Pages Terms of Service.
Amendments to the Policy New Policy: Voting removed as well as the 7000 comment threshold (similar to the changes to the Data Use Policy). Seven day comment rule also applies here.
This week I’m deep in the process of converting a variety of PHP/MySQL backed sites (mostly Joomla and Symfony 1.4) over to Octopress, mostly because I don’t want to deal with the overhead of running MySQL (In the past, I’ve had to upgrade my Slicehost VPS in order to keep MySQL from hanging). Some of the pages had a little Facebook likebox included with them to collect likes.
My search for an Octopress aside that would create a likebox was fruitless, so I made one.
How it works
It’s a simple plugin that generates some interesting markup for embedding videos (you essentially embed the raw source into an iframe and hope the server on the other end will serve up a player with the html5 or flash stream).
The size of the player/viewport is controlled by CSS that defines a responsive, intrinsic ratio for the video.
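For anyone unfamiliar with the intrinsic ratio trick: the wrapper gets zero height and a percentage padding-bottom, and since percentage padding is calculated from the width of the containing block, the box keeps its aspect ratio as it resizes. The sketch below only shows how that percentage is worked out; the function name ratioPadding is hypothetical and this is not the plugin’s actual code.

```javascript
// Returns the padding-bottom value for a wrapper that should keep a
// width:height aspect ratio (height as a percentage of width).
function ratioPadding(width, height) {
  return (height / width * 100) + '%';
}

console.log(ratioPadding(16, 9)); // 56.25%
console.log(ratioPadding(4, 3));  // 75%
```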
Deanna’s mother takes Worf’s son to a skitzoid paradise and explains to him that he is only a body filled with organs. Meanwhile, the space parasites that were eating the ship are ejected into an asteroid belt, and the Enterprise becomes a body without organs.
So I recently bought a Nokia N900 from ebay. It arrived in a nice little package and while the seller promised the phone would be new, it was actually a refurb.
The phone came pre-loaded with a name and some contacts entirely in Chinese. I tried calling the numbers with my SIM card, but I only received an automated recording saying the number you have dialed is no longer available and then some code with the letters NY in it… which was interesting in its own right as I was calling from New York State.
I set out to perform a factory reset on my N900. A factory reset, which to me entails wiping the contacts, calendars, cookies, and any other personal information from the device and resetting it to its fresh-out-of-the-factory state, seems like it would be a simple operation. However, as I would soon come to learn, nothing is simple with the Nokia N900.
The Nokia N900 is more computer than phone. It has a 600 MHz processor and runs a Debian-based Linux (Maemo). iPhones and Android phones have a built-in software factory reset, and older Nokia phones can simply use a four-digit numerical code to reset. Since the N900 is anything but simple, doing a factory reset is on the opposite end of the spectrum of ease from the above methods.
The Maemo wiki details a two-step process for performing a factory reset (i.e. flashing the device). There are two types of storage on the N900: the eMMC storage, which holds user-specific settings (/home/user, for example), and the rootfs storage, which holds the root file system. According to the wiki, it’s very important to flash the eMMC storage first, then take care of the rootfs. Luckily, the commands for flashing the two are virtually identical.
First, I needed to download the flasher program from the Maemo development site. The Linux-based package I downloaded was a nicely compiled binary that found my USB ports and ran fine. YMMV.
Second, I needed to download both the eMMC image and the rootfs image from here. Note: the device ID number is also printed on a sticker on the box; it’s not necessary to pull out the battery (yet). Renaming the eMMC and rootfs .bin files to something less verbose makes life easier.
Once all the *.bin files were placed in my flasher-3.5 folder, doing the flashing was simple.
Don’t forget, eMMC first, then rootfs
./flasher-3.5 -F eMMC.bin -f
A message saying Suitable USB device not found, waiting indicates that the N900 needs to be turned off, then plugged into a USB port while the ‘u’ key is held down. I guess this boots the device into a special USB flashing mode, and I still can’t figure out if this key is easier to hold down with my finger or the stylus.
Flashing the eMMC takes a few minutes, then to be sure everything worked, I pulled out the battery (per the wiki instructions (!)). Once again, leaps and bounds from the iPhone and Android in terms of simplicity.
After the eMMC is flashed, the rootfs needs to be flashed. It’s the same command, only the -R flag is tagged on to reboot the device.
./flasher-3.5 -F rootfs.bin -f -R
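To recap the sequence as described in this post, here is a tiny sketch that prints the two commands in order. It only builds the strings and does not run the flasher; eMMC.bin and rootfs.bin are the shortened file names used above, and the function name flasherCommand is made up.

```javascript
// The two flashing steps, in the order this post follows.
var steps = [
  { image: 'eMMC.bin', reboot: false },
  { image: 'rootfs.bin', reboot: true } // -R reboots after the final flash
];

// Builds the flasher command for one step.
function flasherCommand(step) {
  var cmd = './flasher-3.5 -F ' + step.image + ' -f';
  return step.reboot ? cmd + ' -R' : cmd;
}

steps.forEach(function (step) {
  console.log(flasherCommand(step));
});
// ./flasher-3.5 -F eMMC.bin -f
// ./flasher-3.5 -F rootfs.bin -f -R
```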
Ok, yay! Now the N900 is back to factory settings and essentially a new device. All that I had to do was:
Inspired by this Stack Overflow question (which I think is a totally legitimate question and should not have been closed), here is how I did something similar.
I needed a way to easily embed vimeo videos in posts. I wanted to use the vimeo video id to reference the video, and I wanted the top of my markdown files to look like this:
title: "What a cool video "
date: 2012-09-02 21:35
The video id (in the vimeo param) is used in Vimeo’s generic embed code. Now came time to dig through the Octopress directory structure. It’s pretty complex, and tree is quite helpful for visualizing this. In the source directory, you have layouts and includes. layouts is mostly for page layouts made up of different components in the includes folder.
Once inside includes, I found it easiest to simply modify article.html. Best practices probably dictate going back and tweaking something with the layout, but I don’t know enough about Octopress and Ruby and was having tremendous difficulties with variable scoping (and also escaping Ruby control structures in a codeblock).
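The idea of turning a video id into embed markup can be sketched outside of Octopress like this. To be clear, this is plain JavaScript rather than the template code that goes in article.html; the function name vimeoEmbed and the sample id are hypothetical, and the iframe attributes are a pared-down version of a typical Vimeo player embed.

```javascript
// Turns a Vimeo video id into iframe embed markup.
function vimeoEmbed(videoId) {
  // Accept only digits so the id can be dropped into the URL safely.
  if (!/^\d+$/.test(String(videoId))) {
    throw new Error('Expected a numeric Vimeo id, got: ' + videoId);
  }
  return '<iframe src="https://player.vimeo.com/video/' + videoId + '"' +
    ' frameborder="0" allowfullscreen></iframe>';
}

console.log(vimeoEmbed(12345678)); // 12345678 is a placeholder id
```

In the template version, the id would come from the post’s front matter instead of a function argument.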