wissel.net

Usability - Productivity - Business - The web - Singapore & Twins

Hyperlinks need to live forever - Blog edition


THE bummer mistake in any web revamp is a total disregard for page addresses. The best you usually find is a nice 404 page with a notice that things have been revamped and an invitation to search. What a waste of human time, and what disregard for a site's users!
The links to the original pages live outside the site's control, and Jakob Nielsen already stated in 1998 that pages need to live forever. So what can you do when swapping blog platforms?
If your new platform runs behind an Apache HTTP server (or IBM HTTP Server, IHS, which is based on it), mod_rewrite lets you alter incoming addresses (the old links) into the new destinations based on a pattern match (other HTTP servers have similar functions, but that's a story for another time).
HTTP knows two redirection codes that matter here:
  • 302 for temporary redirections
  • 301 for permanent ones.
You want to use the latter, so that at least the search engines update their links.
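To check which of the two codes an old link actually answers with, a small script can request the old address without following the redirect. This is only a sketch: the helper names are made up for illustration, and the host in the usage note is a placeholder, not something from this post.

```python
import http.client
from http import HTTPStatus
from typing import Optional, Tuple


def is_permanent_redirect(status: int) -> bool:
    """True for 301 (permanent), False for 302 (temporary) and anything else."""
    return status == HTTPStatus.MOVED_PERMANENTLY


def fetch_redirect(host: str, path: str) -> Tuple[int, Optional[str]]:
    """Send a HEAD request for an old link and report status and target.

    http.client never follows redirects on its own, so the status code
    returned here is the one the server really sent for the old address.
    """
    conn = http.client.HTTPConnection(host, timeout=10)
    try:
        conn.request("HEAD", path)
        response = conn.getresponse()
        return response.status, response.getheader("Location")
    finally:
        conn.close()
```

Calling fetch_redirect("blog.example.com", "/myblog.nsf/d6plinks/ABCDEF") against your own server (the host name is hypothetical) should report 301 together with the new address in the Location header once the redirects are in place.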
Now your new URL pattern most likely uses a different structure than the old one, so a simple Regex might not help for that transition. E.g. your existing format might be /myblog.nsf/d6plinks/ABCDEF while the new pattern would be /blog/2001/10/is-this-on.html.
For this case mod_rewrite provides the RewriteMap, where you can use your old value (ABCDEF in our case) to find the new URL. Unfortunately mod_rewrite is very close to dark magic. A RewriteMap can range from a simple key/value lookup to invoking an external program to get the result. For the key/value lookup you need to make your key case insensitive, so all the possible case variations work. This is what I figured out:
RewriteEngine on
RewriteMap lowercase int:tolower
RewriteMap blog-map dbm:/var/www/blogmap.map
RewriteRule ^/myblog.nsf/d6plinks/(.*) /blog/${blog-map:${lowercase:$1}} [NC,R=301,L]
Let me pick that apart for you:
  1. RewriteEngine on
    This switches the rewrite engine on. It requires that mod_rewrite is loaded (check your documentation for that).
  2. RewriteMap lowercase int:tolower
    This enables an internal function that converts the incoming string to lower case.
  3. RewriteMap blog-map dbm:/var/www/blogmap.map
    This defines the actual lookup. The simplest case would be a text file with the key and the result on one line, separated by a space. However, that might not perform well enough for larger numbers of links, so I chose an indexed table format. It is very easy to create, since the tool is included in the Apache install. I generated my translation list as a text file and then invoked httxt2dbm -v -i /var/www/blogmap.txt -o /var/www/blogmap.map, and the indexed file was created/updated. Because the rule lowercases the incoming key before the lookup, the keys in the text file have to be lower case as well.
  4. RewriteRule ^/myblog.nsf/d6plinks/(.*) /blog/${blog-map:${lowercase:$1}} [NC,R=301,L]
    This is the rewrite rule with a nested set of parameters that first converts the key to lower case and then looks up the new URL. If a key isn't found, the lookup returns an empty string and the rule redirects to /blog/, which suits my needs; you might want to handle things differently.
    In detail:
    1. ^/myblog.nsf/d6plinks/(.*) matches all links inside d6plinks; the () "captures" ABCDEF (from our example), so it can be used as $1
    2. ${lowercase:$1} converts ABCDEF into abcdef
    3. ${blog-map: ... } finally looks it up in the map file
    4. [NC,R=301,L] are the switches governing the execution of the rewrite rule:
      • NC stands for NoCase. It allows the pattern to match /MyBlog.nsf/ /MYBLOG.NSF/ /myblog.NSF/ etc. It doesn't, however, convert the string
      • R=301 issues a permanent redirect response (default is 302, temporary)
      • L stops the evaluation of further redirection rules
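The pieces above can be tried out away from the server. The following Python sketch is not part of my setup: the function names are invented for illustration, it escapes the dot in myblog.nsf (the rule above leaves it unescaped), and it assumes the map values are stored relative to /blog/, because the rule prepends that prefix. It writes the translation list in the text format that httxt2dbm reads and imitates what the rule does with an incoming path.

```python
import re
from pathlib import Path
from typing import Dict, Optional

# Same pattern as the RewriteRule; IGNORECASE plays the role of the NC flag.
OLD_LINK = re.compile(r"^/myblog\.nsf/d6plinks/(.*)", re.IGNORECASE)


def write_map_source(pairs: Dict[str, str], target: str) -> None:
    """Write 'key value' lines, one per old link, with lower-cased keys."""
    lines = [
        f"{old_id.lower()} {new_path}"
        for old_id, new_path in sorted(pairs.items())
    ]
    Path(target).write_text("\n".join(lines) + "\n", encoding="utf-8")


def load_map_source(source: str) -> Dict[str, str]:
    """Read the text map back into a dictionary, skipping blanks and comments."""
    blog_map: Dict[str, str] = {}
    for line in Path(source).read_text(encoding="utf-8").splitlines():
        if line.strip() and not line.startswith("#"):
            key, value = line.split(None, 1)
            blog_map[key] = value.strip()
    return blog_map


def rewrite(path: str, blog_map: Dict[str, str]) -> Optional[str]:
    """Return the redirect target for an old link, or None if the rule
    would not apply to this path at all."""
    match = OLD_LINK.match(path)
    if match is None:
        return None
    key = match.group(1).lower()            # ${lowercase:$1}
    return "/blog/" + blog_map.get(key, "")  # a missing key yields "/blog/"
```

With a map entry of abcdef pointing to 2001/10/is-this-on.html, rewrite("/MyBlog.NSF/d6plinks/ABCDEF", blog_map) yields /blog/2001/10/is-this-on.html, while an unknown key falls back to /blog/, which is the behaviour described in step 4. The text file written by write_map_source is what you would then feed to httxt2dbm.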
As usual YMMV

Posted on 05 December 2012 | Comments (4) | categories: Blog Software WebDevelopment

Comments

  1. posted by David Schaffer on Wednesday 05 December 2012 AD:
    Thanks for the nice tutorial on the rewrite rules. However, if a site is heavily reorganized there's often not much you can do with pattern matching. Sometimes you need individual redirects for the most popular pages and a good site map for the rest.
  2. posted by Stephan H. Wissel on Thursday 06 December 2012 AD:
    David,
    I respectfully disagree. All reorganisations I came across change substantially in the base path - and that is all you need. The MAP will then help you to match an arbitrary old page to the exact new page.
    AND the preservation of the old links needs to be an integrated part of the revamp considerations. Most commonly you actually CAN identify which pages are old by using a pattern. What never works is using a simple pattern transformation to get from old to new. But that is exactly the point of this blog post: once you get hold of an old page, you simply look it up.
    Of course that requires the creation of the mapping table in the revamp project. Without that mapping nothing will work.
    A variation of this approach would be to alter the 404 page and look up whether there is a replacement page.
  3. posted by Martin Leyrer on Thursday 06 December 2012 AD:
    Tim Berners-Lee in 1998: Cool URIs don't change

  4. posted by Martin Leyrer on Thursday 06 December 2012 AD:
    @David: mod_rewrite is voodoo. Damned cool voodoo, but still voodoo

    Using the RewriteMap directive you could even handcraft or pre-generate substitution tables, so you do not have to rely only on regular expressions. With that you could cover the most complex sites and have no excuse to send your users to a 404 page. ;-)