I’d like to take a given blog and generate an OPML feed subscription list from it. The idea is to look at their blogroll, their posts, and possibly even their commenters to get a list of who they’re reading and who pays attention to them: their neighborhood, I guess. I’d eventually like to turn it into a reading list generator. A reading list is a dynamic feed subscription list, implemented as an OPML file and identified by the URL where it lives. When the reading list changes (e.g., in the case I’m thinking of, a new blog appears on the target’s blogroll or the target references a new blog in a post), the OPML at that URL changes. Reading-list-aware software then uses the new OPML instead of the old, automatically offering new feeds to the consumer of the reading list.
This seems relatively simple. OPML is simple. Now that I (sort of) know Python, parsing HTML is simple. Making OPML is pretty simple. Yet I still find myself doing some yak shaving. Yak shaving is when you start to do a seemingly simple task but find yourself in a recursive loop solving prerequisite tasks that have their own prerequisites that in turn have their own prerequisites. I could just hack together what I want. But there are things I want to know first.
Seth Godin says “Don’t Shave That Yak”: “The minute you start walking down a path toward a yak shaving party, it’s worth making a compromise. Doing it well now is much better than doing it perfectly later.” I can’t resist shaving a bit of that ugly matted fur off though. I need to research XML and OPML a bit more before I’m ready to code more. I’m turning it into a blog post though, because it gives me a sense of accomplishment. Also, some of my astute readers might catch errors or blind spots in my thinking.
OPML stands for Outline Processor Markup Language and serves as the most common (or perhaps only?) format for storing and exchanging feed subscription information. If you export your subscriptions from Bloglines, for instance, you’ll get an OPML file.
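To make that concrete, here is a minimal sketch of producing a subscription list with Python's standard library. The `build_opml` helper and the example.com feed are my own placeholders, and the `type`, `text`, `xmlUrl`, and `htmlUrl` attributes are the ones I understand aggregators to use for feed outlines; treat it as an illustration rather than a reference implementation.

```python
import xml.etree.ElementTree as ET

def build_opml(title, feeds):
    """Build an OPML subscription list from (name, feed_url, site_url) tuples."""
    opml = ET.Element("opml", version="2.0")
    head = ET.SubElement(opml, "head")
    ET.SubElement(head, "title").text = title
    body = ET.SubElement(opml, "body")
    for name, xml_url, html_url in feeds:
        # One <outline> element per feed subscription.
        ET.SubElement(body, "outline", {
            "type": "rss",
            "text": name,
            "xmlUrl": xml_url,
            "htmlUrl": html_url,
        })
    return ET.tostring(opml, encoding="unicode")

# Hypothetical feed, for illustration only.
sample = build_opml("Example neighborhood", [
    ("Example Blog", "http://example.com/feed.xml", "http://example.com/"),
])
print(sample)
```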
OPML is a markup language defined in XML. I know a bit about XML but I haven’t read an O’Reilly book about it or even bought one (did you know technical information magically migrates from books on your shelf into your brain?) so my understanding might be shaky. XML allows you to define custom markup languages for sharing data across systems. It provides a human-readable format. It’s based on SGML, a very complex markup language definition language that nobody ever figured out how to use. XHTML is the XML-ized version of HTML. There are various mechanisms like DTDs (document type definitions) or XML Schema for defining what’s allowable in a certain XML language, but they don’t all support namespaces. Namespaces qualify element and attribute names with a URI, which gives you a context in which a name is guaranteed unique, so vocabularies from different sources can be mixed in one document without colliding.
You can use XSLT (XSL Transformations, where XSL = Extensible Stylesheet Language) to transform data from one XML representation into another representation, XML or otherwise. What’s XSL? Hey, that yak fur there doesn’t look like it needs trimming right now. I’ve heard or read that XSL is quite complex to use… but I’m wondering if I should use that instead of a Python script to go from HTML to OPML. Python is fun. XSLT doesn’t sound fun. What do I get from XSLT vs. using a scripting language? I don’t know.
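For comparison, the scripting route for the first half of the job (pulling outbound links out of a page) might look roughly like this. It is only a sketch: `LinkCollector` and the sample markup are made up for illustration, and discovering the feed URL behind each link is a separate step that isn't shown.

```python
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag that points at an http(s) URL."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            # Skip in-page anchors and relative links.
            if href and href.startswith("http"):
                self.links.append(href)

# Hypothetical blogroll markup, for illustration only.
page = (
    '<ul><li><a href="http://example.com/">Example</a></li>'
    '<li><a href="#top">Top</a></li></ul>'
)
collector = LinkCollector()
collector.feed(page)
links = collector.links
print(links)
```

One thing in the script's favor: `html.parser` copes with ordinary, not-quite-well-formed HTML, whereas an XSLT processor expects well-formed XML as input.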
One question I had when thinking about my subscription list generator was this: what if I want to store some extra data in the OPML, like where the feed came from (blogroll vs. referenced in post vs. from comment) and how many times it appeared? I found some good info from BlogBridge on attributes in OPML that might solve this requirement. Incidentally, BlogBridge supports reading lists, and I think it may be the only news aggregator other than Dave Winer’s NewsRiver (that runs inside his OPML editor, I think) to do so. Anyway, that BlogBridge post has a bunch of suggestions for what should be made core OPML attributes. For example, “rating” could indicate the OPML owner’s opinion of a certain feed relative to other feeds in the outline. However, some commenters feel that an attribute like that should be specified in a separate namespace. All this taken together makes me think I can add application-specific attributes to my OPML subscription list by creating a separate namespace.
Hence: must learn about namespaces. More yak shaving. Or, I can just add the attributes without a namespace. I assume then the OPML wouldn’t validate but my point’s not to make valid OPML yet… I understand most current news aggregators do not produce valid OPML.
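Here is a rough sketch of what the namespaced version could look like using Python's `xml.etree.ElementTree`. The namespace URI, the `nb` prefix, and the `source` and `count` attribute names are all invented for this example; nothing here comes from the OPML spec or from BlogBridge.

```python
import xml.etree.ElementTree as ET

# Hypothetical namespace URI and prefix; any URI I control would do.
NS = "http://example.com/ns/neighborhood"
ET.register_namespace("nb", NS)

outline = ET.Element("outline", {
    "type": "rss",
    "text": "Example Blog",
    "xmlUrl": "http://example.com/feed.xml",
    # Application-specific attributes live in the custom namespace.
    f"{{{NS}}}source": "blogroll",
    f"{{{NS}}}count": "3",
})
namespaced = ET.tostring(outline, encoding="unicode")
print(namespaced)
```

The serialized element carries an `xmlns:nb` declaration plus `nb:source` and `nb:count`, so a tool that only knows plain OPML can ignore the extras while my own code can still find them.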
I want to extend OPML so that I can then suck my own OPML into a repository of news sources, not so I can publish it to other OPML-aware tools. So it probably doesn’t matter whether my OPML validates or not or whether I use DTDs or XML schemas (with namespaces? still not clear how those relate) or do something else.
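As a quick sanity check on that reasoning, here is a sketch of reading such a file back in. `ElementTree` is a non-validating parser, so un-namespaced extra attributes pass straight through. The `source` and `count` attributes and the `read_feeds` helper are hypothetical names I made up for this example.

```python
import xml.etree.ElementTree as ET

# Hypothetical OPML with two non-standard attributes, "source" and "count".
doc = """<opml version="2.0">
  <head><title>Example neighborhood</title></head>
  <body>
    <outline type="rss" text="Example Blog"
             xmlUrl="http://example.com/feed.xml"
             source="blogroll" count="3"/>
  </body>
</opml>"""

def read_feeds(opml_text):
    """Return one dict per outline element that carries an xmlUrl attribute."""
    root = ET.fromstring(opml_text)
    feeds = []
    for node in root.iter("outline"):
        if node.get("xmlUrl"):
            feeds.append({
                "title": node.get("text"),
                "feed": node.get("xmlUrl"),
                "source": node.get("source"),
                "count": int(node.get("count", "0")),
            })
    return feeds

feeds = read_feeds(doc)
print(feeds)
```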
The yak already looks better. Tomorrow when the kids are in school I’ll tackle my little script. I already did one version of it so I have a good starting point. But I’ve read a lot about OPML and attention since then so I am inspired to go way far beyond what I already did.
But perhaps I need to look into attention.xml? Sigh.

2 Comments
Nice post, Anne. You asked for feedback, so here it is.
You wrote: “OPML stands for Outline Processor Markup Language and serves as the most common (or perhaps only?) format for storing and exchanging feed subscription information.”
OPML is the only one I know of that does this, but the flip side is that OPML can do more than just act as a list of feeds. It could list any URL that happens to be static too. Examples of alternative OPML-formatted lists could be: a wishlist (pointing to books on Amazon), or the cars I’ve owned (search cars AND OPML), or other OPML lists. You get the picture.
On the relationship between DTDs and namespaces: think of a namespace as a URI to another doc (the DTD) within your xml file. The DTD contains the ‘rules’ you want to follow.
See: http://msdn.microsoft.com/msdnmag/issues/01/05/xml
and this:
http://www.w3schools.com/dtd/default.asp
Hi Alex, thanks for the info… very helpful. I’m pretty intrigued by OPML and have gotten the OPML Editor running on my machine but I’m spending too much time blogging to actually do anything with it.
I think what I read on Wikipedia was that DTDs don’t support namespaces, which seems completely wrong based on what you’ve said. You seem to be saying DTDs define namespaces which is what I was thinking before I read the Wikipedia XML entry. So I need to check that out as well as the links you gave.