Data Export

In which I document the experience of exporting data from several services, commenting on the UI, UX, terminology and utility/tangibility/formatting of the data produced.

The defining problem with data silos on the web is that your content is holed up, unavailable to you via any means other than a limited, often short-lived web interface, and possibly an API if you’re a developer.

In an effort to combat the negative perceptions resulting from this problem, an increasing number of silos are offering a way to export all of your data in one go. Having casually tried a couple of these services, I was struck by the inconsistencies both between silos and within them. Who and what were these services really designed for?

Terminology

There is little consistent terminology between different data export services. Both “Export” and “Archive” are used as verbs and nouns. Facebook also uses “download your information/data” a lot, as well as “archive” with no mention of “export”. Google use their own term “takeout”, but are inconsistent and also use “takeaway” and “archive”.

WordPress and Diaspora both use “export”, probably because they’re more developer-oriented. Another factor could be that WordPress allows the exported data to be imported, and Diaspora was going to, but at the time of writing that hasn’t been built. There is no import counterpart to “archive” as there is to “export”, so “archive” implies one-way movement of data, whereas “export/import” implies two-way movement.

A Google search for Export returns 870,000,000 results vs 2,840,000,000 for Archive.

Twitter

Twitter, after long being admonished for only allowing access to a user’s last 3200 tweets via their API, announced Twitter Archives on . They gave the following reasons why people might want to make use of the archive:

You may have found yourself wanting to go back in time and explore your past Tweets. Maybe you wanted to recall your reaction to the 2008 election, reminisce on what you said to your partner on your 10th anniversary, or just see your first few Tweets. We know lots of you would like to explore your Twitter past.

The usage they’re targeting is clearly that of the reminiscing historian, which fits their use of the word “archive”. There’s no mention anywhere of the possibility that your data might not be safe with Twitter, that you might want to keep a personal copy, or that you’d want to move it somewhere else.

It took 00:29:10 for my archive of approximately 2,240 tweets to arrive. The download UI had no progress bar or ETA, only a vague indication it might take “a while”.

I was notified of the export’s completion via email. Downloading and unzipping revealed a “tweets” folder (3.9MB in total) containing some folders, an index.html, tweets.csv and a README.txt. The readme was helpful (see it here), providing casual users with the quickest way of getting at the data (Just double-click index.html from the root folder), explaining tweets.csv for the more advanced, then the raw JSON files for developers.

The instructions in the readme are incorrect: they state that “In the data folder, your Twitter archive is present in two formats: JSON and CSV exports by month and year.” In fact, the data folder contains another folder, which in turn holds some javascript files and a “tweets” folder with the JS archive by month; the CSV file sits in the same folder as the README.

It’s also worth noting that these are not JSON files unless you strip the first line off. This is another case where the readme is incorrect, as it tells you to strip the first and last lines.
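
If you do want real JSON out of those files, it’s a trivial transformation; a sketch in Python, with the path pattern taken from my own archive rather than any documentation:

    import json

    def load_tweet_month(path):
        with open(path, encoding='utf-8') as f:
            # Everything after the first line (the JavaScript assignment)
            # is a plain JSON array of tweet objects.
            return json.loads(f.read().partition('\n')[2])

    # The path pattern is an assumption based on my own archive.
    tweets = load_tweet_month('data/js/tweets/2013_01.js')
    print(len(tweets), 'tweets this month')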

The browser interface itself is rather well put together — combining data and navigation is a particularly effective touch. It feels a little odd that “reply” and “favourite” Twitter action links are provided — how often does anyone reply to or favourite their own tweets? The search feature is rather impressive — a highly useful extra which no other export browser provided.

The browser interface is all created dynamically by javascript, which puts a slight damper on the otherwise wonderful fact that, whilst in the browser, the tweets are marked up using brand-spanking-new microformats2 h-entry! Hopefully this is an indicator that h-entry will make its way onto twitter.com soon.
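
Out of curiosity, here’s a sketch of parsing those h-entries back out using the mf2py microformats2 parser (not something the archive ships with). Because the markup is generated client-side, you have to feed it a saved copy of the rendered page rather than the raw index.html from the zip:

    import mf2py  # pip install mf2py

    # 'rendered-archive.html' is a hypothetical saved copy of the archive's
    # browser interface after the javascript has run.
    with open('rendered-archive.html', encoding='utf-8') as f:
        parsed = mf2py.parse(doc=f.read())

    entries = [item for item in parsed['items'] if 'h-entry' in item['type']]
    print(len(entries), 'h-entries found')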

Conspicuously missing from the export are any photos uploaded to Twitter, which for people who post many photos is a gaping hole — it looks like there’s currently no official user-facing way to export photos. TwitPic Exporter might be a good 3rd party exporter.

My favourite part of the Twitter export, however, is the tweets.csv file. This does pretty much what it says on the tin — each of your tweets is a row in a very large spreadsheet. You get the tweet ID, in-reply-to tweet and user IDs, retweeted tweet and user IDs, timestamp, source (Twitter client used), text and any expanded URLs within the text.

What I love about this is the way it makes your tweet data so much more tangible — any spreadsheet application can open CSV files, giving anyone the ability to play with their data. However, I feel that the data presented is not optimal for this target audience of tech-savvy-but-not-developer people — in particular all the opaque IDs and the lack of retweet/favourite counts. For example, why can I not easily order my tweets by favourite count, or see if there is a correlation between length and favourites/retweets? If only a little more data were provided, these tweet archives could become excellent datasets for use in education.
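
Even as it stands, the file is easy to poke at from code as well as from a spreadsheet. A sketch using only Python’s standard library; the column names (‘source’ and ‘in_reply_to_status_id’) are guesses based on the description above, so check them against the header row of your own tweets.csv:

    import csv
    from collections import Counter

    with open('tweets.csv', newline='', encoding='utf-8') as f:
        rows = list(csv.DictReader(f))

    # Which clients do I tweet from, and how much of my output is replies?
    clients = Counter(row['source'] for row in rows)
    replies = sum(1 for row in rows if row['in_reply_to_status_id'])

    print(clients.most_common(5))
    print(replies, 'of', len(rows), 'tweets are replies')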

WordPress(.com)

All WordPress blogs have both export and import functionality within the admin interface. This is consistent with their use of the “export/import” terminology as it supports two-way movement of data. It took about a second for my 12KB RSS-like XML export of only three posts to download (this was a test blog I used to try out the WordPress UI).

Interestingly next to the download option there’s another, paid export option which enlists the help of a “happiness engineer” to take your content and put it onto a 3rd party self-hosted WordPress blog.

You get the option to export everything, just your posts, just your pages or just your “feedbacks”. The “everything” option mentions several other bits of content which don’t seem to appear in the granular selection, nor is there a way to select, for example, both posts and pages but not “feedbacks”.

The export tool explains that you can import this content into other WordPress installs. There’s also a comment block at the top of the XML file with much more detailed import instructions — I’m not exactly sure why these are hidden away; they’re quite helpful!
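
The format is WordPress’s extended RSS (WXR), essentially RSS with a couple of extra namespaces, so getting the posts back out programmatically is straightforward. A rough sketch using Python’s standard library (the filename is made up, and the wp: namespace version varies between WordPress releases, so copy it from your own file’s root element):

    import xml.etree.ElementTree as ET

    NS = {
        'content': 'http://purl.org/rss/1.0/modules/content/',
        'wp': 'http://wordpress.org/export/1.2/',  # check against your file
    }

    tree = ET.parse('wordpress-export.xml')
    for item in tree.getroot().iter('item'):
        title = item.findtext('title')
        post_type = item.findtext('wp:post_type', namespaces=NS)
        body = item.findtext('content:encoded', default='', namespaces=NS)
        print(post_type, title, len(body), 'characters')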

The export doesn’t seem to include any images or other media you’ve uploaded.

Potential future research: try actually importing content; dig into the export format’s markup, resolution and utility; try exporting when I’ve uploaded some images and see if they’re included.

Facebook

The ability to download your data from Facebook was announced by Mark Zuckerberg on the Facebook blog on . The blog post doesn’t actually cover what reasons people might have for wanting to download their content, hinting at a feature that either Facebook doesn’t want you to use, or one which they didn’t research users’ need for before developing it:

If you want a copy of the information you've put on Facebook for any reason, you can click a link and easily get a copy of all of it in a single download.

Facebook inconsistently refer to the product of their data exporter as “your information”, “your facebook data”, “your archive” and “your personal archive”.

After finding and clicking the download a copy link, there was another button and two informational modals to navigate before my archive was even pending. One of them has the option to cancel, the other does not.

It took 01:16:20 for my archive to be ready. Like Twitter, the downloading UI had no progress indicator or indication of how long it might take.

I was notified of the export’s completion by email. I then had to re-enter my password (an exploitable phishing hole?) before getting download access.

Turns out there are actually two different export options, the standard and the extended. I downloaded them both.

Standard

The standard archive consisted of a folder named “barnaby.walters”, total size 2.5MB and containing a README.txt with the following contents:

Downloaded by Barnaby Walters (http://www.facebook.com/barnaby.walters) on June 15, 2013 at 12:48

There’s no indication of how to view my content. The folder contained both “photos” and “photo” folders; the former is split up into “Cover”, “Profile”, “Timeline” and then “Timeline Photos - 149404371782841”.

Photos are named inconsistently. Cover photo and profile picture names (inconsistent naming too) are 15 numeric digits long, whereas timeline photos are named using the first 51 characters of the comment/description/status update. Whilst this is rather a delightful touch, it would be better to have some indication of the date in the name.

The export also contains “html/” and “index.html”. On viewing the latter I get rather a nice little website with my face on, and a menu.

Profile is exactly what you’d expect, a table of some of my profile fields. Wall is everything I’ve posted, ever, all in one 819KB HTML file. Given that I am not the most active of Facebook users, and that the HTML has a distinct lack of links to navigate by, this file is likely to be unmanageably huge for anyone more prolific.

Most of the HTML is marked up using classic microformats, which is a nice touch. The bits which aren’t (Messages, Photos) have granular, semantic classnames, meaning you could easily write a scraper. Unfortunately almost all the content is marked up using tables.
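
To give an idea of how such a scraper might start, here’s a sketch using BeautifulSoup. I’m not reproducing Facebook’s actual classnames, so the ‘.hentry’ selector is only a guess based on the classic microformats mentioned above, and ‘html/wall.html’ a guess at the filename; substitute whatever your own html folder actually contains:

    from bs4 import BeautifulSoup  # pip install beautifulsoup4

    # 'html/wall.html' is a guessed filename for the wall page.
    with open('html/wall.html', encoding='utf-8') as f:
        soup = BeautifulSoup(f.read(), 'html.parser')

    # '.hentry' is the classic hAtom entry class; swap in the real classnames.
    for post in soup.select('.hentry'):
        print(post.get_text(' ', strip=True)[:80])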

The biggest problem with the HTML is the lack of links. Out of all the things which could be links, the only ones which actually are are URLs I’ve linked to in Facebook posts myself. Profiles, post permalinks and page permalinks are all plain text.

Extended

I had never heard of this before, but saw a small link on the first download page for the standard export. I requested it be built, and it was delivered shortly after (no times for this one, but it didn’t take much more than 10 minutes).

When downloaded, I got a folder confusingly entitled “dyi_100001398384863”, containing an index.html and a html folder — no README this time, not that the other one was very helpful. The HTML files are all minified, making manual inspection awkward.

The extended export contains a large amount of extra, less substantial information, including records of your active browser sessions, administrative records (e.g. password change), ad clicks (I had none, and am not exactly sure what gets stored), “Ads Topics”, deleted friends, facial recognition data (again none for me), pending friend requests, account activity from the last month or so, and a fair amount of assorted configuration/settings/admin type data.

It’s all presented in simple table form, no sign of any microformats here but if you’re a desperate developer you could extract it all fairly easily.

The most interesting thing I found in here was the “Ads Topics” — presumably the various things facebook has decided I should be shown ads about. A short sample is included below — note the (?) hashtags:

Ads Topics
#ABC notation
#Accordion
#Amplifier
#Apogee Electronics
#Appalachian dulcimer
#Best Buy

Overall Facebook’s archive is a mixed bag. Some parts (the microformatted HTML viewer, well-named full-size photos) are excellent examples of delivering a good archive browsing/using experience. Other aspects, like the fact that all of your posts ever are on one huge page, and the unhelpful readme, are awkward and unhelpful. The extended archive contains some interesting information, but it’s presented as if their database threw up.

Diaspora

Under “Account Settings” there’s an “Export Data” section, containing several buttons offering me “XML” and “Photos” downloads. Trying to download photos resulted in this less-than-helpful error message:

Photo exporting currently unavailable

I have tried exporting my photos several times over the period of two months, and photo export has never been available. I suspect it simply doesn’t work.

On exporting my “XML”, after a few seconds download time I was rewarded with a 713KB XML file entitled “barnaby_diaspora_data.xml”. This is a proprietary format with no documentation I can find. There is no import utility, and currently I know of no software which can do anything useful with it.

On digging around in the XML, it appears highly disorganised. There’s a user element which contains my private key and a person, which contains my profile data, including exported_key, which turns out to be my public key. aspects contains an element for each of my aspects, which in turn contains an element containing the post ID of every post I’ve authored targeted at that aspect. contacts contains a list of everyone I know, and which of my aspects they’re in, but their actual user data is in the people element.

And then, the killer — the posts element is empty. This “export tool” did not actually export the content I, the user, created, making it the most useless export tool in the history of the web. There, I said it.

Not only that, but the consistent use of the word “export” implies that the ability to re-import your data exists, or might exist in the future. Even if it does get built, the export tool will have to include people’s content before it’s any use.
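
For anyone who wants to verify this against their own export, a few lines of standard-library Python will do it; a sketch, with the element names taken from the description above and the filename being the one my export came with:

    import xml.etree.ElementTree as ET

    root = ET.parse('barnaby_diaspora_data.xml').getroot()

    # List the top-level elements and how many children each has.
    for child in root:
        print(child.tag, len(list(child)), 'children')

    # The moment of truth: how many posts did the export actually include?
    posts = root.find('.//posts')
    print('posts:', 0 if posts is None else len(list(posts)))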

Google

Google seemingly takes much pride in “Takeout”, its data export utility. The data export UI itself is fairly well designed, giving you control over exactly which services’ data you download, but no indication of exactly what you get or what you might be able to do with it. Each customised download is stored in the download history and can be downloaded until it expires after a week.

The download UI shows you how big each component of the export is and how much of it has been created so far. You can either wait and watch the pretty blue bars, or opt to be notified via email.

It took 00:34:53 for my complete archive to be ready. The granular progress indicators are an effective touch, although a summary ETA would have been useful to gain a quick indication of how long I had to wait.

If your export is large enough (e.g. when exporting youtube videos) it might be split into multiple files. I had two, 1.5 and 1.6GB, which took just over an hour in total to download.

There’s no readme or other indication of how to browse your data. The contents of the export are split into folders with friendly, human names (apart perhaps from +1s). I had +1s, Drive, Google+ Circles, Google+ Stream, Hangouts, Profile and Reader. I’ll comment on each of these in turn.

+1s

Nothing much to see here, just a “bookmarks.html” file. On closer inspection this seems to be marked up using standard, if archaic, Netscape bookmark flavoured HTML (ALL CAPS, P elements inside DLs), meaning it can be imported as bookmarks into browsers such as Safari, Opera or Internet Explorer — the bookmarks turn up in a folder called “Google +1s”. It’s also a perfectly valid, well marked-up HTML document which could be published or incorporated into a personal site.
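
The URLs are also trivial to get at programmatically, since the format is just HTML full of A elements; a sketch using the standard library:

    from html.parser import HTMLParser

    class BookmarkLister(HTMLParser):
        def handle_starttag(self, tag, attrs):
            # The parser lowercases the shouty Netscape-era tag names for us.
            if tag == 'a':
                print(dict(attrs).get('href'))

    with open('bookmarks.html', encoding='utf-8') as f:
        BookmarkLister().feed(f.read())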

Overall a potentially excellent experience with tangible, usable data — if only some usage instructions had been provided.

Drive

A folder containing all of your old Google Docs files/any other Drive files, in their original formats. Overall this is a good result as people should be able to use the files right away. No usage instructions, but this is one instance where they’re not really necessary, as the output format matches the in-app UI almost exactly (a filesystem).

Google+ Circles

A folder of .vcf files, each containing the vcards of the people in that circle. Unfortunately the only information provided for each user is their name and “homepage” — a G+ URL with a long number at the end.

This might have been extremely useful if more user information had been included and some usage instructions provided. Most casual users are not likely to be familiar with .vcf files and may be unsure what to do with them.
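
If you’d rather just skim what little data is there, the .vcf files are plain text; a sketch that flattens them into a list, assuming the standard FN and URL vCard properties and the folder name as it appears in my export:

    from pathlib import Path

    for vcf in Path('Google+ Circles').glob('*.vcf'):
        for line in vcf.read_text(encoding='utf-8').splitlines():
            # FN should be the formatted name, URL the G+ "homepage".
            if line.startswith(('FN', 'URL')):
                print(vcf.stem, line)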

Google+ Stream

The stream export is a folder of HTML files, one for each activity I’ve created (posts, hangouts, events). They share a name with the activity they represent, and are all marked up fairly well with the classic hAtom microformat. There are no usage instructions, master feed view or way of navigating between the posts once they’re in the browser.

Interestingly, whilst there’s no indication of the creation datetime of each post in the filename, the file creation date is accurate. This seems vastly less useful than either grouping the files in folders by month or just starting the filenames with the datetime.
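
The missing master view is easy enough to bodge around if you can write a little code; a sketch that generates a crude index page, newest first, assuming the folder name from my export and that the file dates survive unzipping:

    import html
    from pathlib import Path

    stream = Path('Google+ Stream')
    posts = sorted(
        (p for p in stream.glob('*.html') if p.name != 'index.html'),
        key=lambda p: p.stat().st_mtime,
        reverse=True,
    )

    # Write a bare list of links, one per exported activity.
    links = ''.join(
        f'<li><a href="{html.escape(p.name)}">{html.escape(p.stem)}</a></li>'
        for p in posts
    )
    (stream / 'index.html').write_text(f'<ul>{links}</ul>', encoding='utf-8')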

Hangouts

I think this one is a joke — it’s a folder which contains a single, solitary “chat.json” file. This is a JSON file containing almost no human-readable data. After poring over it for five minutes I have no idea what it’s trying to tell me, how it’s useful or what I might be able to do with it. Non-developer users are only going to be more confused.

Profile

Another JSON file, this time “Barnaby_Walters.json”. This seems to be an ActivityStreams-esque user object, and contains basic profile data. However, there are no usage instructions and no indication of how this file might be used.

Reader

Yet more JSON files with no instructions, explanation or indication of utility. If you’re a developer and you’re desperate to extract exactly what you’ve shared, liked, starred or noted, or if you’re making a competing feed reader, then you might be able to get something useful out of these. Otherwise they’re a bunch of opaque, confusing blobs.

Subscriptions.xml is an OPML file containing the folder structure of all the feeds you’re subscribed to. As OPML is fairly standard and well-supported in the feed reader world, this is actually potentially very useful to the exportee, but with no usage instructions you have to already know what it does in order to use it.
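
Even without a feed reader to hand, it’s easy to inspect; a sketch that lists every subscribed feed:

    import xml.etree.ElementTree as ET

    root = ET.parse('subscriptions.xml').getroot()
    for outline in root.iter('outline'):
        # Folder outlines have no xmlUrl attribute; feed outlines do.
        if 'xmlUrl' in outline.attrib:
            print(outline.get('title'), outline.get('xmlUrl'))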

Google Conclusion

Like much of the rest of Google as a company, their export tool seems fragmented and inconsistent. What at first looked like a human-friendly folder structure turned out to mainly have unexplained JSON files, with a few slightly brighter prospects like bookmarks.html and the subscriptions.xml.

A little work adding usage instructions could go a long way, and a lot of work making all of the exported data more comprehensive and actionable could make this a genuinely useful tool. A lot of interesting data is provided, there’s just no easy way to get at most of it.

In Conclusion

Of the silos I researched, it’s difficult to say which one comes out on top. Twitter’s export has the best-designed experience and arguably the most useful data, but its best parts (the readme and tweets.csv) respectively contain incorrect instructions and present suboptimal data.

WordPress is the only silo into which the exported data could be re-imported, although some of the data from Google’s export is directly reusable in other services, for example the Reader OPML file and the G+ circles VCF files. As it’s widely acknowledged that backups are useless unless you can restore from them, only the WordPress export actually counts as a backup.

There’s little consistency between the data export experiences along any axis, possibly because each service seems to be designed for completely different audiences with completely different purposes, from casual browsing and nostalgia (e.g. Facebook, Twitter), to research (e.g. Twitter) and even actual reuse of data through importing tools (WordPress, some of Google). Mostly though, they’re neglected, niche tools, aimed mainly at geeks.

The Future

If data archiving/exporting tools are to be used, they must be useful. In order to be useful, tools must be made which make use of them. In order for people to be able to make tools, the export services must be documented. This blog post is a start.

If you work at a silo, in the department responsible for data export or elsewhere: Champion and advocate an excellent data export experience. Try the data export utility for your own content (you dogfood, right?) and assess where it might be improved. Advocate/build importing as well as exporting. Decide who+what you are designing for. Document the bloody thing, and keep the documentation up to date.

If your silo is used to create new content, build importing of data exported from other silos, but not before you’ve polished your own data export service.

If you own your identity on the web: take a look at the exportable data from some silos which you use(d) and try importing it into your own site. Document the process, and open-source the code you used.

I’ve started this process by importing my old Diaspora content into my indieweb site, releasing the code as diaspora-export and documenting the process. I encourage others to do the same, collect knowledge on indiewebcamp.com/export and discuss data liberation and content ownership on the IRC channel on freenode.