Aaron Parecki

Some thoughts on the XRay and jf2 JSON formats

Since beginning the jf2 spec, I've continued developing XRay, and its format has diverged from the original jf2. Tonight I spent a while trying to reconcile the changes to submit a PR to the spec. I was unable to come up with a short PR, and instead got drawn into thinking about the motivations behind a simpler mf2 JSON format to begin with.

I use XRay in a number of projects for various purposes.

  • My website runs every external URL through XRay, which consumes the Microformats on the page and converts them to a simplified form. This is used whenever I reply to a post to display the reply context, as well as to fetch the post contents when I make a repost.
  • Loqi uses XRay to create a one-line summary of URLs pasted into IRC.
  • webmention.io uses XRay to parse the source URL of webmentions to extract useful data about the webmention, and makes this data available via an API.
  • IndieNews uses XRay to parse submitted URLs to display the name and author of the posts.
  • Quill uses XRay to show a preview of in-reply-to URLs.
  • My rudimentary reader uses XRay to extract the h-entry data from posts to display in my reader.

There are a number of things that XRay does when extracting the mf2 data.

  • Finds the author of a post following the authorship algorithm
  • Follows the comments presentation algorithm to remove the name property if it's a duplicate of the content.
  • Figures out the primary object on the page, or whether the page represents a list of posts, which is sometimes tricky. (some discussion on representative object)
  • Is vocabulary-aware, so always returns a consistent set of properties, and doesn't return unknown properties. e.g. published is always a single string, and category is always an array.
  • Sanitizes all HTML, allowing only a small subset of HTML tags and Microformats classes on the HTML elements.
  • For any values that might be embedded objects, e.g. a person-tag or in-reply-to property, always returns the URL in the value and moves the embedded object to a refs object, making it easier to consume.
  • The author property is a simplified h-card containing only name/photo/url properties that are single values.
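
To make the refs behavior above more concrete, here is a minimal Python sketch. It is not XRay's code, just an illustration under a few assumptions: the input is the `properties` object of a parsed mf2 item, embedded objects carry their own `properties` and `value` keys as in Microformats JSON, and the function name `extract_refs` and the `ref_props` list are hypothetical.

```python
from typing import Any


def extract_refs(
    properties: dict[str, list[Any]],
    ref_props: tuple[str, ...] = ("in-reply-to", "category"),
) -> tuple[dict[str, list[Any]], dict[str, Any]]:
    """Replace embedded objects with their URL and collect them in refs.

    ref_props is an assumed list of properties that may hold embedded
    objects; a real implementation would take it from the vocabulary.
    """
    flat: dict[str, list[Any]] = {}
    refs: dict[str, Any] = {}
    for name, values in properties.items():
        out = []
        for value in values:
            is_embedded = isinstance(value, dict) and "properties" in value
            if name in ref_props and is_embedded:
                urls = value["properties"].get("url", [])
                if urls:
                    # Keep only the URL in place, move the object to refs.
                    refs[urls[0]] = value
                    out.append(urls[0])
                else:
                    # No URL to key on: fall back to the plain "value".
                    out.append(value.get("value", ""))
            else:
                out.append(value)
        flat[name] = out
    return flat, refs
```

A consumer can then always treat in-reply-to as a list of URL strings, and look a URL up in refs only when it wants the extra detail.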

As you can see, a lot of what XRay is doing is cleaning up some of the "messy" parts of Microformats JSON. The mess is not so much in the specific JSON format as in the overall structure, such as how the author of a post can be in many different places in a parsed Microformats JSON object. This is not to place blame on Microformats, since what it's doing is creating a JSON representation of the original HTML, and allowing authors flexibility in how they publish HTML, rather than prescribing specific formats, is a core principle.

What this means is XRay is actually acting more as an interpreter of the Microformats JSON, in order to deliver a cleaned-up version to consumers. Most of my projects that use XRay could actually be considered "clients": they take the parsed post and present it somewhere, whether that's displayed in my reader, output to me in IRC, or re-rendered as a post on IndieNews.

My primary need for an alternative Microformats JSON format is actually a client-to-server serialization, where the client is getting a cleaned-up version of external posts, and can assume that the server it's talking to is responsible for taking the messy data and normalizing it to something it expects. In this sense, the use case of jf2 is a client-to-server serialization, whereas the Microformats JSON is a server-to-server serialization. This would then be a core building block for Microsub, a spec that provides a standardized way for clients to consume and interact with feeds collected by a server.

The main current challenge in defining a spec for this use case is how tied to specific vocabularies it should be. For example, Microformats JSON says that every value should always be an array. However, there are a few properties for which it never makes sense to have multiple values, e.g. published, uid, and location, and wrapping them in arrays creates additional complexity for consumers. It's easier to consume these when they can be relied upon to always be a single value. Similarly, the author of an h-entry may be an object or a string, and that variation makes it more complicated to consume, so XRay's format always returns a consistent value. However, this is tied to the h-entry vocabulary, since other Microformats vocabularies don't have an author property. In general, the success I've had with XRay's format is due to the fact that it makes hard decisions about what properties it returns, and is consistent about whether those properties are single- or multi-valued, in order to provide a consistent API to consumers.
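
The kind of hard-coded decision described here can be sketched in a few lines of Python. This is only an illustration, not XRay's implementation: the `SINGLE_VALUED` and `MULTI_VALUED` tables, and the function names `simplify_entry` and `simplify_author`, are assumptions made up for the example, and the tables cover just the properties mentioned in this post.

```python
from typing import Any, Optional

# Hypothetical vocabulary table for h-entry; a real one would be longer.
SINGLE_VALUED = {"published", "uid", "location"}
MULTI_VALUED = {"category"}


def first(values: list) -> Optional[Any]:
    """Return the first item of a list, or None when it is empty."""
    return values[0] if values else None


def simplify_author(value: Any) -> dict:
    """Reduce a string or an embedded h-card to a name/photo/url dict."""
    if isinstance(value, str):
        # A bare string may be either a URL or a plain name.
        is_url = value.startswith(("http://", "https://"))
        return {
            "name": None if is_url else value,
            "photo": None,
            "url": value if is_url else None,
        }
    props = value.get("properties", {})
    return {key: first(props.get(key, [])) for key in ("name", "photo", "url")}


def simplify_entry(properties: dict) -> dict:
    """Apply the vocabulary table; unknown properties are dropped."""
    result = {}
    for name, values in properties.items():
        if name in SINGLE_VALUED and values:
            result[name] = values[0]        # always a single value
        elif name in MULTI_VALUED:
            result[name] = list(values)     # always an array
        elif name == "author" and values:
            result["author"] = simplify_author(values[0])
    return result
```

The tables are exactly the part that ties the format to the h-entry vocabulary: every new property needs a decision about which set it belongs in.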

I am just not sure how to balance providing that simplicity for consuming clients with allowing flexibility in publishing, without hard-coding too much into a spec that might be obsoleted later.

Aaron Parecki

I just made a little app on the website so you can add domains you're offering to the website right now! Go ahead and add a few before the event tomorrow! Looking forward to seeing everyone there! https://domainswap.xyz/

Aaron Parecki

I forgot this is such a terrible time to grocery shop here

Ben Werdmüller

Well, this is kind of a shitshow. Livejournal was such an important part of my early web life. Really sad. http://io9.gizmodo.com/russian-owned-livejournal-bans-political-talk-adds-ris-1794143772

Ben Werdmüller

I'm hosting Homebrew Website Club at @mattervc SF tomorrow. Join us at 6:30 for demos & talk about a stronger independent web.

Ben Werdmüller

Really nice update from - and I'll defend this! - the best TV show ever made. http://www.bbc.com/news/entertainment-arts-39444025