Where does our data come from?

You might wonder how we got the data we work with.

We wanted to retrieve as much information as we could about published fanfictions, in the most precise way available to us. The fanfiction archive that we use (and, I have to say, love) the most is AO3. As a developer, the language I am most familiar with is PHP, so it only made sense to try and code a crawler… in PHP.

Sadly, AO3 does not offer an API, a tool that we developers love. An API is a service that lets other programs access data efficiently and easily; with the Twitter API, for instance, I could retrieve the last 50 tweets of any user pretty easily. I can only hope that one day AO3 will have one, but that's a lot of work, and AO3 is run by (great) volunteers, so it might not be a priority.

Anyway, with no API available, I did it the old-fashioned way: by making HTTP requests and crawling the HTML. Oopsie. The main idea is to request ("call") the same URL of AO3's search results as I would as a human, but my code iterates on the "page" parameter until no more results are available and everything has been crawled. The URL I called is this one:

http://archiveofourown.org/works/search/?utf8=%E2%9C%93&work_search%5Bquery%5D=&work_search%5Btitle%5D=&work_search%5Bcreator%5D=&work_search%5Brevised_at%5D=1+year+ago&work_search%5Bcomplete%5D=0&work_search%5Bsingle_chapter%5D=0&work_search%5Bword_count%5D=%3E+10000&work_search%5Blanguage_id%5D=&work_search%5Bfandom_names%5D=&work_search%5Brating_ids%5D=&work_search%5Bcharacter_names%5D=&work_search%5Brelationship_names%5D=&work_search%5Bfreeform_names%5D=&work_search%5Bhits%5D=&work_search%5Bkudos_count%5D=&work_search%5Bcomments_count%5D=&work_search%5Bbookmarks_count%5D=&work_search%5Bsort_column%5D=created_at&work_search%5Bsort_direction%5D=&commit=Search&page=n
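A minimal sketch of that pagination loop, assuming the markup detail that AO3 wraps each search result in an element with a `work` class; the function names are my own illustration, not the exact code I ran. The HTTP call is passed in as a callable so the loop logic stands on its own; in the real script it wraps PHP-cURL.

```php
<?php
// Sketch of the pagination loop: fetch page 1, 2, 3, … of the search
// results until a page comes back with no works, then stop.
function crawlAllPages(callable $fetch): array
{
    $pages = [];
    for ($page = 1; ; $page++) {
        $html = $fetch($page);
        // AO3 wraps each search result in an element with a "work"
        // class; a page with none of those means we've reached the end.
        if (strpos($html, 'class="work') === false) {
            break;
        }
        $pages[] = $html; // in the real script: parse and store instead
    }
    return $pages;
}

// The real fetcher, using PHP-cURL against the long search URL above
// (everything up to but not including the "page" parameter):
function makeCurlFetcher(string $baseUrl): callable
{
    return function (int $page) use ($baseUrl): string {
        $ch = curl_init($baseUrl . '&page=' . $page);
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // return HTML as a string
        $html = curl_exec($ch);
        curl_close($ch);
        return $html === false ? '' : $html;
    };
}
```

Injecting the fetcher also makes the loop testable without hitting AO3's servers at all.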

For every call (made with PHP-cURL) I browsed the DOM with the PHP class DOMDocument and identified the data we deemed interesting. As it turned out, I stored all of the data available, because there was not that much of it. Let's see what kind of information I had access to:

[Screenshot: an AO3 search result, showing the data available for each work]

So basically: rating, status, tags, fandom(s), length, kudos, hits, etc. I just had to identify how each piece of data was formatted in the DOM (HTML) using the browser's code inspector. For instance, the title "Nine Eleven Ten" is located in the screenshot at "div" > "h4" > first "a". Then I stored the values in the database.
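To give an idea of what that looks like, here is a sketch of extracting titles with DOMDocument and DOMXPath. The `heading` class on the `h4` matches what I saw in the inspector at the time; the function name and the fallback details are illustrative.

```php
<?php
// Sketch: pull the title of every work out of one page of search
// results, following the "div" > "h4" > first "a" path from the
// screenshot.
function extractTitles(string $html): array
{
    $doc = new DOMDocument();
    // AO3's HTML5 markup triggers warnings in DOMDocument's older
    // parser, so suppress them instead of aborting.
    @$doc->loadHTML($html);
    $xpath = new DOMXPath($doc);
    $titles = [];
    // The title is the first <a> inside each <h4 class="heading">.
    foreach ($xpath->query('//h4[contains(@class,"heading")]/a[1]') as $a) {
        $titles[] = trim($a->textContent);
    }
    return $titles;
}
```

The same pattern (one XPath query per field) covers the rating, fandoms, kudos, and so on.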

[Screenshot: the crawled values stored in the database]

Then I exported the data to CSV files so we could analyse it.
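The export itself is nothing fancy; a minimal sketch with PHP's built-in fputcsv, where the column names and row shape are my illustration rather than the actual schema:

```php
<?php
// Sketch of the CSV export: one header row, then one row per work.
// In the real script the rows come from the database; here they are
// plain arrays, and the column names are illustrative.
function exportToCsv(string $path, array $works): void
{
    $out = fopen($path, 'w');
    fputcsv($out, ['title', 'fandom', 'words', 'kudos', 'hits']);
    foreach ($works as $work) {
        fputcsv($out, $work);
    }
    fclose($out);
}
```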

This kind of technique is neither robust nor sustainable. Because my script depends on how the data is formatted in AO3's search results, if AO3's developers choose to change the way they display it tomorrow, I will have to adapt my script. That might take a while; they might also choose to expose less data in the search results, and I would not really have any way to get it back.

Moreover, I am limited to publicly posted fanfictions: since I am not logged in as a user when the script runs, it cannot access restricted works.

Because I admire the work of the people at the Organization for Transformative Works, I tried to play nice and not overload their servers (and I didn't really want to get banned, either). I deliberately slowed my script down to mimic a human user, waiting a few seconds between each page load.
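The throttle is just a randomized pause between requests, something like the sketch below. The 3-to-7-second bounds are my own illustration, not a figure from AO3's guidelines.

```php
<?php
// Sketch of the throttle: pause a random few seconds between page
// loads so the crawl looks more like a human browsing and keeps the
// load on AO3's servers low.
function politePause(int $minSeconds = 3, int $maxSeconds = 7): int
{
    $pause = rand($minSeconds, $maxSeconds);
    sleep($pause);
    return $pause; // returned so the script can log how long it waited
}
```

In the crawl loop, one call to this between page fetches is enough.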

If you want to see the code I wrote to do all this, please reach out. It's not amazing, but it does the trick.

Note: I stumbled upon an example of an "AO3 API", if you want to check it out: https://github.com/linuxdemon1/Ao3-API

 

Author: mme psychosis

I like web development stuff and feminism. I try to be on tumblr but to be honest I don't get it: http://mpsychosis.tumblr.com/ + http://witchsandwich.tumblr.com/ Also on twitter: https://twitter.com/_mmepsychosis
