Friday 07 February 2003

Meaningless Archiving meets Numerology

Without actually mentioning my rant about weekly archives, gord from Poetic Geek takes me to task in the nicest possible way over the form of my individual archives:

Let’s take Jonathon Delacour for example, he’s both an excellent writer and seems to care about semantics. His most recent post is archived at http://weblog.delacour.net/archives/000833.html. What does this URL tell you about the piece it points to? Almost nothing, only that it is stored in the archives and there are 832 previous pieces. Imagine the URL looked like this: http://weblog.delacour.net/archives/2003/02/05/can_I_cite_you_on_that.html. What does this tell us about the post: It was posted on February 5th, 2003 and is entitled “can I cite you on that”. This is much more informative. It would also allow people to construct URL’s based on information they have on the date the post was made. This cannot be accomplished with the previous method.
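
To make gord's suggestion concrete, here is a minimal Python sketch of how a dated, title-based URL could be assembled. The function names `dirify` and `dated_url` are hypothetical, and the lowercasing and punctuation rules are assumptions for illustration only, not a description of what Movable Type does.

```python
import re
from datetime import date


def dirify(title: str) -> str:
    """Hypothetical helper: lowercase the title, drop punctuation,
    and join the remaining words with underscores."""
    words = re.findall(r"[a-z0-9]+", title.lower())
    return "_".join(words)


def dated_url(base: str, posted: date, title: str) -> str:
    """Build base/YYYY/MM/DD/dirified_title.html (assumed layout)."""
    return f"{base}/{posted:%Y/%m/%d}/{dirify(title)}.html"


if __name__ == "__main__":
    print(dated_url("http://weblog.delacour.net/archives",
                    date(2003, 2, 5),
                    "Can I cite you on that?"))
```

The date alone is enough to rebuild the folder part of the URL, which is the "construct URLs from what you know" benefit gord describes.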

Burningbird commented:

Good points. Unfortunately, if we’ve had the weblogs for a while and have been linked to, it makes it difficult to change. That can be a lot of redirects to manage.

That would have been my response too. When I switched to Movable Type, I think I decided on the simple numbered archive format because:

  • The dated folder structure seemed unnecessarily complex (Radio UserLand does it that way too).
  • I thought the file name based on the entry title looked ugly.
  • And therefore I ignored my usual MarkUp Mentor, Mark Pilgrim, and followed the example of Burningbird and Phil Ringnalda instead (not that I’m blaming Bb or Phil—I liked the numbered format).

But if I’d read gord’s post first, I would almost certainly have followed his advice—and not because he ends it with:

Whichever method you use, it will most certainly be more helpful than the padded entry ID. Jonathon is by no means the only person doing this, he’s just by far the cutest.

In any case, it’s clearly not too late, since I could follow gord’s suggestion to:

leave your current pages as they are so that no one’s links are broken and then just have it generate all new pages that way.

That’s yet another of the nice things about MT—it just leaves the old archive pages in place. So, unless someone offers a compelling argument that I should leave things as they are, I may well implement gord’s archive system. Although there is one important issue: given that numbers have great significance for me, at which post number do I switch from one system to the other? (For example, I’ve just realized that this entry will be #835, which is a good number because 8=3+5. But I’ll want to post some other entries while I’m waiting for counter-opinions. So the turnover point might have to be #844. We’ll see.)
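
For anyone who wants to check the arithmetic, this small Python sketch lists the upcoming entry numbers whose first digit equals the sum of the remaining digits, the pattern behind #835 and #844. The helper name and the upper bound of 899 are arbitrary choices made for the illustration.

```python
def leading_digit_is_sum(n: int) -> bool:
    """True when the first digit equals the sum of the other digits."""
    digits = [int(ch) for ch in str(n)]
    return digits[0] == sum(digits[1:])


# Candidate switch-over points from #835 up to #899.
candidates = [n for n in range(835, 900) if leading_digit_is_sum(n)]

if __name__ == "__main__":
    print(candidates)
```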


Comments

Brilliant.

(by the way, I am obsessive compulsive) Now I will be watching your blog all the time for interesting number combinations.

If space isn't a premium, you can implement MT's multiple archiving and have it both ways. That way your URLs don't drop off the face of the earth, you get to keep figuring out interesting number patterns, and people get metadata rich URLs.

I likey. I might just have to implement something similar at poeticgeek.

Sorry to compare you to "Joanne", she doesn't exist, but if she did, she'd be a twit, which you most certainly are not. That, and she collects road kill and stuffs them in her basement. If you are at all like me, then DIY taxidermy is a big turn-off.

Posted by gord on 7 February 2003 (Comment Permalink)

Jonathon, this is what I did a while ago, and therefore have no objections to you doing this :)
Of course, it leaves them in an old style for ever more (unless you are prepared to maintain them)

Compare and contrast
http://www.dellah.com/orient/2002/03/11/daily_css_fun.shtml
http://www.dellah.com/orient/archives/000004.shtml

Posted by Paul Freeman on 7 February 2003 (Comment Permalink)

Paul, thanks for the practical example. Are you running multiple archives? Or did you switch from the numbered scheme to the gord scheme?

Gord, no problem about Joanne. She might be my female alter ego. One question arises, though... how does Google deal with the duplicate archive copies of each post?

Posted by Jonathon on 7 February 2003 (Comment Permalink)

I just wrote about this myself:

http://www.johnsjottings.com/archives/2003/02/05/i_like_mod_rewrite.html

Now in my case it was easy for me to use mod_rewrite to do the redirection so that I could get rid of the original files, because I didn't have many hundreds of postings like you do. So for you the best option would clearly be leaving the old files around.

What I like best about the improved URL format is that it makes reading my log files so much easier.

Posted by john on 7 February 2003 (Comment Permalink)

Be just a tad more careful than that. There is more to this than meets the eye.

If you have old pages with comments and trackbacks, and you embed comments and/or trackbacks in your pages, what gets re-built is the new archive page. But if someone links to your old archive page, they'll comment, but it will never show because the individual page is never re-built.

This is particularly sensitive for me because I re-build individual archive pages for new trackbacks as well as comments.

Over time, most activity would go to your new pages, with the new archive architecture. Until this complete migration takes effect, for a time you'll have people following links to old pages (perhaps from Google, or links from other weblogs), they'll make comments and then be confused as to why the comment isn't showing in the individual archive page, because they're viewing the old page, not the new.

When I re-architected from PHP to HTM files, I made this old/new archive change. But I also did a re-direct for all older pages that still get comments every once in a while, such as Parable of the Languages.

If you don't get comments on older posts, or you don't care if the people don't see these showing up (they still work, comments and trackback are entry id based, not uri based), then you shouldn't have a problem.

(And I just said 'if you don't care...' to the man worried about which number to stop on.)

However, even if you didn't care about this, you'll also have to change database fields that reflect your old archive address, such as mt_trackback.

Posted by Burningbird on 8 February 2003 (Comment Permalink)

Weird number geek that I am, I thought the question of which number to end it on was more interesting.

You've listed a couple of subtraction examples. How about division (842, 881 if you really wanted to wait)? Prime numbers (839, 853 - back to your subtraction - 857, 859, etc.)? 864 is even and a relatively close (863.9 plus) multiple of pi (275 times).
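
David's candidates can be checked with a few lines of Python. This is only a sketch: the helper names are made up, "division" is read here as first digit divided by second digit equals third digit, and the ranges are arbitrary.

```python
import math


def is_prime(n: int) -> bool:
    """Trial division; fine for three-digit entry numbers."""
    if n < 2:
        return False
    return all(n % d for d in range(2, math.isqrt(n) + 1))


def first_over_second_is_third(n: int) -> bool:
    """True for three-digit n = abc when a / b == c (e.g. 842: 8 / 4 = 2)."""
    a, b, c = (int(ch) for ch in str(n))
    return b != 0 and a == b * c


primes = [n for n in range(835, 861) if is_prime(n)]
division = [n for n in range(835, 900) if first_over_second_is_third(n)]
pi_multiple = 275 * math.pi  # David's "863.9 plus"

if __name__ == "__main__":
    print(primes, division, round(pi_multiple, 3))
```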

Have some fun thinking about it.

Posted by David on 8 February 2003 (Comment Permalink)

If you want to follow John's example, you could use Movable Type to create a .htaccess file containing all the rewrite rules.

This avoids all the problems with having multiple copies of your archive pages, without the effort needed to create hundreds of redirects by hand.

Posted by paul hammond on 8 February 2003 (Comment Permalink)

Hmm, Paul I just made a Lazyweb request for exactly what you propose - has someone already done it?

Posted by john on 8 February 2003 (Comment Permalink)

Paul, are you thinking of Jonathon using MT to automatically create separate re-direct entries for re-directing all of his pages to their new archive locations? Re-direct RegEx match won't work because there's really nothing to trigger the regular expression matching.

That's a lot of entries in .htaccess.

Did you have a different approach? I'd be curious about that myself.

Posted by Burningbird on 8 February 2003 (Comment Permalink)

I think Shelley is on the right track here. Create a separate MT individual archive for the new style (for those who don't know: create the template, then go to Blog Config/Archiving and associate it with the Individual Entry type -- you can associate as many templates as you want with each type), then change the old template to simply contain a META HTTP-EQUIV=REFRESH to the new URL (which you can specify using MT tags). (I know it's not the ideal way to do redirects, but it may be the best solution in this situation. Even Google is smart enough to pick up on it and treats it as a redirect, as if you had specified it in .htaccess.) Rebuild all your archives once, verify that it works, then delete the old template (since new entries won't need the redirect).
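
As an illustration of what each rebuilt old page would contain under this approach, here is a hedged Python sketch that produces a meta-refresh stub for a given new URL. In practice the MT template would emit this markup itself; the function name `refresh_stub` and the fallback link are assumptions.

```python
from html import escape


def refresh_stub(new_url: str) -> str:
    """Return a minimal HTML page that immediately refreshes to new_url.

    A plain link is included as a fallback for clients that ignore
    the refresh (an addition made for this sketch).
    """
    target = escape(new_url, quote=True)
    return (
        "<html><head>\n"
        f'<meta http-equiv="refresh" content="0; url={target}">\n'
        "</head><body>\n"
        f'<p>This entry has moved to <a href="{target}">{target}</a>.</p>\n'
        "</body></html>\n"
    )


if __name__ == "__main__":
    print(refresh_stub(
        "http://weblog.delacour.net/archives/2003/02/05/can_I_cite_you_on_that.html"))
```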

Then again, I was up very late last night, so it's possible I'm not making a lick of sense.

Posted by Mark on 8 February 2003 (Comment Permalink)

John - does the following mt template code help?

RewriteEngine On
<MTEntries lastn="1000000">
RewriteRule <$MTEntryID pad="1" $>.html /blog/<$MTArchiveDate format="..." $>.html [R,L]
</MTEntries>

I've not tested this exact code, and it would need adapting to your url scheme, but it should give you enough to play with. Watch out for line wraps though.

BurningBird - You're right that regexps don't help, and that you'd need individual rules for each entry. I've used this technique on a blog with around 70 entries, where it works very well. I can see that it might not scale to thousands of entries though. If this is a problem for you, the only alternative seems to be meta-refresh html tags, which are less than ideal but would work.
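
For anyone who would rather generate the rules outside MT, the same one-rule-per-entry idea can be sketched in Python. The entry list, the six-digit padding, and the assumption that the .htaccess file sits in the old archives directory are all illustrative; the real IDs and new paths would have to come from an export of the weblog.

```python
import re
from typing import Iterable, List, Tuple


def rewrite_rules(entries: Iterable[Tuple[int, str]]) -> List[str]:
    """Build one mod_rewrite redirect per (entry_id, new_path) pair.

    Assumes old pages are named with a zero-padded six-digit ID and
    that the rules live in the directory holding those pages.
    """
    lines = ["RewriteEngine On"]
    for entry_id, new_path in entries:
        old_name = re.escape(f"{entry_id:06d}.html")  # escapes the dot
        lines.append(f"RewriteRule ^{old_name}$ {new_path} [R=301,L]")
    return lines


if __name__ == "__main__":
    sample = [(833, "/archives/2003/02/05/can_I_cite_you_on_that.html")]
    print("\n".join(rewrite_rules(sample)))
```

Anchoring the pattern with ^ and $ and escaping the dot keeps one entry's rule from accidentally matching another file name.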

Posted by paul hammond on 8 February 2003 (Comment Permalink)

I changed the URL scheme in my site a while back, and implemented a Perl script to take care of 404 errors using MovableType's own API to find the proper redirection for old entries. I can share the script if somebody is interested.
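
The general shape of such a 404 handler might look like the following Python sketch. It is not Ronaldo's Perl script and does not use Movable Type's API: the lookup table is a plain dictionary standing in for whatever the real script queries, and the old URL pattern is assumed.

```python
import re
from typing import Dict, Optional, Tuple

# Assumed shape of the old numbered archive URLs.
OLD_ARCHIVE = re.compile(r"^/archives/(\d{6})\.html$")


def resolve_old_url(path: str,
                    new_urls: Dict[int, str]) -> Optional[Tuple[int, str]]:
    """Return (301, new_location) for a known old entry URL.

    Returns None when the path is not an old archive URL or the entry
    is unknown, so the caller can serve its normal 404 page.
    """
    match = OLD_ARCHIVE.match(path)
    if match is None:
        return None
    location = new_urls.get(int(match.group(1)))
    if location is None:
        return None
    return 301, location


if __name__ == "__main__":
    table = {833: "/archives/2003/02/05/can_I_cite_you_on_that.html"}
    print(resolve_old_url("/archives/000833.html", table))
    print(resolve_old_url("/archives/999999.html", table))
```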

Posted by Ronaldo on 8 February 2003 (Comment Permalink)

If you're using MovableType with MySQL then this might help you:
http://slackerbit.ch/archives/2002/12/14/switching_from_flat_to_googlefriendly_archive_urls.html

Posted by Alex on 8 February 2003 (Comment Permalink)

Jonathon, I just switched. I wasn't aware of the multiple archive option at the time.

Trackback wasn't enabled on any of my old archive pages (I switched a while ago), and I rarely get comments on the old pages. Google seems to have done a good job of finding the new pages (except for one old page which gets me 1000 hits a day; it is very odd).

As a test, I made a comment on this old page. Once completed, it took me to the new page. The comment didn't appear on the old page, as Shelley explained. I still got the email though.

That last comment from Alex looks quite useful. Might try that out. If not, I may eventually get round to writing a Python script to sort it out. I suspect it is different for me: I did the change a while ago, so there are relatively few pages affected.

Posted by Paul Freeman on 8 February 2003 (Comment Permalink)

Mmm, that didn't take long. I guess it's a matter of figuring out which method works best for me. I'm instinctively drawn to one that implements a redirect but doesn't leave the old numerical archive entries around (my anal-retentive temperament showing through). Alex's does that. But Burningbird suggests creating "a hashed database of redirects stored directly in the file system."

Which will work better, I wonder?

And David, thanks for suggesting the division, prime number and even sequence variations. I'll pass on the 275 times pi method though, relatively close doesn't hack it for me.

Posted by Jonathon on 8 February 2003 (Comment Permalink)

I'm not pushing my approach, but you can't use Alex's. Hosting Matters won't allow you to play with httpd.conf.

Posted by Burningbird on 8 February 2003 (Comment Permalink)

I'm not going to push my approach either, even if it bears fruit, having already twice inspired you to do things that I'd rather not be doing myself (entry ids and .php), but I will note that a .htaccess file using mod_rewrite on my 499 entries doesn't seem to cause any major server stress. That I've noticed. In light testing.

Posted by Phil Ringnalda on 8 February 2003 (Comment Permalink)

While I think gord has an interesting point, and one with merit, I don't believe it has any more merit than the current system of archiving. The analogy of the generic filing system, where memos are indexed with a seemingly meaningless number, fails to acknowledge the more realistic scenario where "Joanne" probably has an indexing key which unlocks the numerical meaning attached to the memos. Any good filing/archiving system has a method for filing, and a method of retrieval. The default method of sequential numbering is neither better nor worse than a system where archived pages have names that are in some way tied to the content contained within.

When designing a form of part numbering or alias scheme, there are basically three paths you can go down, and they all utilize a key to understanding the scheme. One path is to set about designing a scheme that defines and attaches meaning to the entire part number/alias in an effort to make it possible to discern meaning without necessarily having a key at hand. This is certainly the best and most useful scheme to use when you are dealing with a small number of variables.

The next path would be to define a scheme that defines and attaches as much meaning as possible to the part number/alias. The part number/alias still needs to be meaningful, but the variables are so numerous that a scheme which covers them all would become unwieldy, and impossible to easily translate. It's at this point that the indexing key becomes necessary. You can understand the generalities of what you're dealing with, but you can't learn the specifics from the alias alone, and therefore need a lookup key to tell you exactly what that part number/alias stands for.

The last path is to eschew attempts at meaning, and attach a generic convention that carries with it little to no meaning at all unless it is used in conjunction with a lookup key. This last path is usually best suited to situations where you are dealing with only a few items, and/or those items don't need to be defined by their properties.

In the case of weblog archiving schemes, even Jonathon's current scheme is acceptable because the actual URI is secondary to any aliases it can be given, as well as information that is contained within the page itself. Webpages can come with their own key describing how to index and retrieve them. It's called meta data.

So, when a search engine comes along, it not only indexes the URI of a page, it also relationally attaches the meta information associated with the page, and gives it an alias based on the TITLE of the page.

When you or I talk about someone else's post, we usually attach our own alias (and hopefully some useful meta information) to the URI when we link to it from our own weblogs. It's more common (and useful) to say, "Did you see Jonathon's [a href="http://weblog.delacour.net/archives/000835.html" title="Jonathon Delacour - Meaningless Archiving meets Numerology"]recent post[/a].", or "I was recently reading [a href="http://weblog.delacour.net/archives/000835.html" title="Jonathon Delacour | Weblog Archives | Entry 835"]Meaningless Archiving meets Numerology[/a]". We don't usually write, or see, "I just finished reading [a href="http://weblog.delacour.net/archives/000835.html" ]000835.html[/a] and thought it was f**king brilliant!"

Jonathon's archived weblog entries are not categorized on his site by their sequential number. They are grouped by month and category, and each individual entry is titled, thereby also being useful in bookmarking a particular page. Browsers, like search engines, use a page's title as an alias for the URI.

MovableType itself doesn't restrict you to using the sequential numbering scheme to identify entries either. It uses the title, as well as the entry ID, to help you identify posts within the application.

I really don't see all that much in the way of enhanced usability value by giving individual posts more specific names. The only place where I see it being of any value at all is, as John previously mentioned, in log files. But is that enough of a reason to change? In my opinion, no.

In gord's example URI (http://weblog.delacour.net/archives/2003/02/05/can_I_cite_you_on_that.html) it is purported that this tells us, "It was posted on February 5th, 2003 and is entitled 'can I cite you on that'.", but I would have to say ONLY if I understand the scheme that you've used. It could possibly mean May 2nd, 2003, or even page 2003, chapter 2, verse 5. Don't assume that the user is going to fully understand what scheme your URI is meant to imply.
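
The ambiguity is easy to demonstrate. In this short Python sketch the same path segment parses cleanly under two different assumed orderings and yields two different dates; nothing in the segment itself says which reading is intended.

```python
from datetime import datetime

segment = "2003/02/05"

# Reading 1: year/month/day
as_ymd = datetime.strptime(segment, "%Y/%m/%d").date()

# Reading 2: year/day/month
as_ydm = datetime.strptime(segment, "%Y/%d/%m").date()

if __name__ == "__main__":
    print(as_ymd.isoformat())  # 5 February 2003 under reading 1
    print(as_ydm.isoformat())  # 2 May 2003 under reading 2
```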

As I said in the beginning, gord's point has merit, and it is neither better nor worse than the archiving schemes already in use, but I don't personally see the value in switching to it.

Posted by michael on 9 February 2003 (Comment Permalink)

" Any good filing/archiving system has a method for filing, and a method of retrieval. The default method of sequential numbering is neither better nor worse than a system where archived pages have names that are in some way tied to the content contained within."
--michael

I actually agree with you. The added benefit of the dirified naming system is that given a "fish out of water", aka a url out of context, you can know much more about the post it points at. I am most probably not making much sense as last night was my birthday. What I am trying to say is that meaningful URLs are what I am after. I like metadata rich URLs. The more it tells me about what it points to, the more likely I am to make the right choice as to whether or not to follow it.

Posted by gord on 10 February 2003 (Comment Permalink)

Bad gord. Commenting before reading the whole comment.

Year/Month/Day is the ISO standard for dates I believe. I guess it is possible that a user might not know what it means. I just really like getting URLs in my e-mail, for example, and being able to tell a lot about the post at which it points.

The numbering system works fine given that you are always following links in a hypertext document. But when you are presented with a "fish out of water", as described in my previous comment, you know nothing about the link you are about to follow. Why not have a system that is both human /and/ machine useful?

Posted by gord on 10 February 2003 (Comment Permalink)

I would have to again counter that the "fish out of water" URI is only understandable to someone who already understands the scheme you are using. So while to you or me, this may be a URI scheme that imparts information, the first time it's sent to Betty Sue Womack (fictitious), who isn't really web, or weblog-savvy, it has no meaning whatsoever.

In the context of creating a persistent URI scheme (Cool URIs don't change - http://www.w3.org/Provider/Style/URI ), I would say at the very least it's a good idea to include the date the entry went live since that is about the only thing which does not change, but titles can change, files can be moved, and file extensions can change. If you want to really make the URIs friendly, and by that I mean persistent over time, you'd drop the file extension altogether, wouldn't even bother specifying that the posts are in the "archives" directory, and give each post a simple page name that has no chance of changing. Using mod_rewrite you could set up a rule something like the following (I'm no mod_rewrite guru, so feel free to correct it if it's wrong):

RewriteRule ^/([0-9]+)/([0-9]+)/([0-9]+)/([a-zA-Z0-9]+) archives/$1/$2/$3/$4.html

which would point to the actual page leaving you free to use whatever directory structure or backend code you like. It wouldn't matter whether you use .html, .php, .asp, .xml, .cfm or whatever, and you're free to change the backend without worrying about breaking the URI. You're also free to move files to a different directory if you needed/wanted to. I still think using the entry title is an iffy proposition though.
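
The mapping that rule is aiming for can be tried out in Python without touching a server. This sketch reuses the rule's pattern as written and treats the file extension as a swappable backend detail; the function name and the sample page name are hypothetical.

```python
import re
from typing import Optional

# Same pattern as the rule above: /year/month/day/name, no extension.
PUBLIC_URL = re.compile(r"^/([0-9]+)/([0-9]+)/([0-9]+)/([a-zA-Z0-9]+)")


def backend_path(url_path: str, extension: str = "html") -> Optional[str]:
    """Map a public, extensionless URL to the file that serves it.

    Returns None when the URL does not follow the dated scheme.
    """
    match = PUBLIC_URL.match(url_path)
    if match is None:
        return None
    year, month, day, name = match.groups()
    return f"archives/{year}/{month}/{day}/{name}.{extension}"


if __name__ == "__main__":
    print(backend_path("/2003/02/07/numerology"))
    print(backend_path("/2003/02/07/numerology", extension="php"))
```

One thing the sketch makes visible: the character class `[a-zA-Z0-9]` stops at the first underscore, so page names containing underscores would need `_` added to it.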

Say you write a post and misspell a word in the title. It's possible, it happens, not everyone spellchecks their entries before posting them. Let's call this fictitious post "The Hipocracy of Weblogs", since I once corrected someone's spelling of "hypocrisy". You don't notice it for a few minutes, or even perhaps until someone emails you. You're now faced with the quandary of correcting the misspelling and breaking any links already pointing to the post, or leaving the error in place to preserve the URI. On a high traffic site posts can be linked to within minutes of going live, so this is a possible scenario.

Let's say you want to change a post title. It's possible, it happens. Phil even recently mentioned changing the title of one of his posts (SimpleComments - http://philringnalda.com/blog/2003/02/simplecomments.php ); don't remember if that actually happened or not (Phil?). So here again you are faced with the quandary of sticking with your original title regardless of whether or not it warrants changing.

The more I think about it, in the context of a persistent URI, the less I like using the title. I'm not going to say that the sequential numbering scheme is free from the dilemma of change, since this is a possibility for those that change their CMS tools and consolidate entries from multiple tools, but much like the custom RSS descriptions that some people are doing, creating a custom page name that is unchangeable sounds like the best solution to me.
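
A rough Python sketch of that idea: the page name is stored as its own field when the entry is published and is never derived from the title afterwards, so correcting the title leaves the URI alone. The class, the field names, and the sample page name are all hypothetical.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class Entry:
    posted: date      # publication date; does not change
    title: str        # free to be corrected or reworded later
    page_name: str    # chosen once at publication, then left alone

    def permalink(self) -> str:
        """The URI depends only on the date and the fixed page name."""
        return f"/{self.posted:%Y/%m/%d}/{self.page_name}"


if __name__ == "__main__":
    entry = Entry(date(2003, 2, 10), "The Hipocracy of Weblogs",
                  "weblog_hypocrisy")
    before = entry.permalink()
    entry.title = "The Hypocrisy of Weblogs"  # fix the misspelling
    print(before == entry.permalink())
```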

Lastly, two things. A URI's usefulness is relative to who is using it, and in what context. If you want to curb the "fish out of water" possibility, provide your users some type of "E-mail this post" link which will format the email in such a way as to give the URI context and meaning.
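
One way such a link could be built is sketched below in Python: a mailto: URL whose subject and body carry the title, author, and date alongside the bare URI. The function name and the wording of the message are assumptions; a real weblog would generate this from its own templates.

```python
from urllib.parse import quote


def email_this_post_link(title: str, author: str,
                         posted_on: str, url: str) -> str:
    """Return a mailto: link that gives the URI some context."""
    subject = f"{author}: {title}"
    body = f'"{title}" by {author}, posted {posted_on}\r\n{url}'
    # safe="" percent-encodes everything that is not unreserved.
    return f"mailto:?subject={quote(subject, safe='')}&body={quote(body, safe='')}"


if __name__ == "__main__":
    print(email_this_post_link(
        "Meaningless Archiving meets Numerology",
        "Jonathon Delacour",
        "7 February 2003",
        "http://weblog.delacour.net/archives/000835.html"))
```

With this in place the recipient sees who wrote the post, what it is called, and when it appeared, whatever the URI itself looks like.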

And please explain how using a page title is more machine useful than anything else?

Posted by michael on 10 February 2003 (Comment Permalink)

I seriously doubt that a page will be more read because it's named one thing over another. We read pages because a) we get to them through Google; b) we get to them through a site's own search; or c) we get to them through links, which we hope have meaningful annotation, as well as some given context.

I personally never pay attention to an actual URL unless it's one of those obscenely long things you get sometimes from the bigger publication sites. If Jonathon wants to add processing burden to his server in order to change to some kind of 'meaningful' URLs, that's his choice. My philosophy is to each their own, but then, that's a rather different perspective on this whole thing, isn't it?

(Writing. Wow! We're here for writing! I thought we were here for the links and markup. )

Posted by Burningbird on 10 February 2003 (Comment Permalink)

While I wouldn't want to come across as fickle, Michael's last point -- Let's say you want to change a post title -- kind of blows the dirified file name argument out of the water for me, since this is exactly the kind of thing I'm likely to do (even months after an entry has been posted). The idea of having the numbered archive files in a YY/MM/DD folder structure remains attractive but do I want to add processing burden to my server? Perhaps not.

Bb, isn't the writing meant to provide a foundation for the superstructure of links and markup? Or is it the other way round?

Posted by Jonathon on 10 February 2003 (Comment Permalink)

This discussion is now closed. My thanks to everyone who contributed.

© Copyright 2007 Jonathon Delacour