Tic, Tag, Toe

Or rather “tagging, tags, and blog tag policy” or even “what’s the best / most optimal tag nomenclature / syntax”. After redesigning the blog interface I decided to start to rationalise my tags – and to institute a ‘tag policy’.

Tag Policy

  1. Use “-” to delimit multi-word tags
  2. Use all lower case characters
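
The two rules are simple enough to apply mechanically. Here is a minimal Python sketch of how an existing tag might be rewritten to follow the policy; `normalize_tag` is a hypothetical helper of my own naming (not part of Roller), and the set of delimiters it collapses is an assumption based on the ones tested further down.

```python
import re

# Delimiters tried in this post (assumed list): "+", "_", "-", whitespace,
# and the percent-encoded forms "%2B" and "%20" (matched case-insensitively).
_DELIMITERS = re.compile(r"(?:%2B|%20|[+_\s-])+", re.IGNORECASE)


def normalize_tag(tag: str) -> str:
    """Apply the policy: "-" between words, all lower case."""
    collapsed = _DELIMITERS.sub("-", tag.strip())
    # Drop any delimiter left dangling at either end, then lower-case.
    return collapsed.strip("-").lower()


print(normalize_tag("Web+2.0"))           # web-2.0
print(normalize_tag("Tag%20Policy"))      # tag-policy
print(normalize_tag("google_translate"))  # google-translate
```

Runs of mixed delimiters collapse to a single “-”, so a tag such as “a - b” becomes “a-b” rather than “a---b”.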

But “Why?”

For a long time I had been using the “+” symbol to link multi-word tags, but I found that Google Translate (which I use for the language translation capability, at the top right of the page if you’re reading the blog at http://blogs.sun.com/eclectic/) was having problems processing URLs which contain “+” or “%2B”.

Here are a couple of little tables I whipped up documenting the issues I came up against when using multi-word tags, after trying out a number of delimiters, not just “+”, against a variety of technologies.

Delimiters tested were: “+”, “%2B”, “_”, “ ”, “%20” and “-”. Sites / technology tested were: Roller Weblogger (4.0-dev, the version we currently run http://blogs.sun.com on), Google Translate, Google Search, Technorati, Del.icio.us and Slynker.

| Site / technology | “+” (plus sign) | “%2B” (encoded plus sign) | “_” (underscore character) |
| --- | --- | --- | --- |
| Roller Weblogger 4.0-dev | Will save and retrieve posts which use tags with “+” in the editor. Will not resolve tag URLs which use “+” (actually the main site will, but individual blogs can’t). | Will save and retrieve posts which use tags with “%2B” in the editor. Will resolve tag URLs which use “%2B”. | Will save and retrieve posts which use tags with “_” in the editor. Will resolve tag URLs which use “_”. |
| Google Search | Will search and retrieve multi-word tags as they are written, i.e. with the “+”. Search produces a small number of results because “+” is rarely used to separate written words. | Will search and retrieve multi-word tags as they are written, i.e. with the “%2B”. Search produces a small number of results because “%2B” is rarely used to separate written words. | Will search and retrieve multi-word tags as they are written, i.e. with the “_”. Search produces a small number of results because “_” is rarely used to separate written words. |
| Google Translate | Attempts to resolve tag URLs which use “+”, encoding the URL to use “%2B” instead (which Roller can serve, see above), then promptly fails. | Fails to resolve the correct URL to translate using “%2B”. | Resolves tag URLs which use “_” and continues to translate them successfully. |
| Technorati | Resolves tag URLs which use “+” correctly. Replaces the “+” with “ ” and produces good results based upon that. | Resolves tag URLs which use “%2B” correctly. Replaces “%2B” with “ ” and produces good results based upon that. | Resolves tag URLs which use “_” correctly. Produces smaller, but not unreasonable, results, because “_” is rarely used to separate written words. |
| Del.icio.us | Resolves tag URLs which use “+” correctly. Produces results based upon using “+”. | Resolves tag URLs which use “%2B” correctly. Replaces “%2B” with “+” and produces results based upon using “+”. | Resolves tag URLs which use “+” correctly. Produces results based upon using “+”. |
| Slynker | Fails to resolve “+”. Produces no results. | Attempts to resolve tag URLs which use “%2B”, encoding the URL to use “%252B” instead. Produces results based upon using “+”. | Resolves tag URLs which use “_” correctly. Produces results based upon using “_”. |

| Site / technology | “ ” (space) | “%20” (encoded space) | “-” (minus sign) |
| --- | --- | --- | --- |
| Roller Weblogger 4.0-dev | Will save posts which use tags with “ ” in the editor. Will not retrieve posts which use tags with “ ” in the editor; instead it separates the words, retrieving them all in alphabetical order. Will resolve tag URLs which use “ ”, encoding the URL to use “%20” instead. | Will save and retrieve posts which use tags with “%20” in the editor. Will resolve tag URLs which use “%20”. | Will save and retrieve posts which use tags with “-” in the editor. Will resolve tag URLs which use “-”. |
| Google Search | Will search and retrieve multi-word tags as they are written, i.e. with the “ ”. Search produces a large number of results. | Will search and retrieve multi-word tags as they are written, i.e. with the “%20”. Search produces a small number of results because “%20” is rarely used to separate written words. | Will search and retrieve multi-word tags as they are written, i.e. with the “-”, and will replace the “-” with “ ” as well, thus retrieving the largest amount of related information. |
| Google Translate | Attempts to resolve tag URLs which use “ ”, encoding the URL to use “%20” instead (which Roller can serve, see above), then promptly fails. | Fails to resolve the correct URL to translate using “%20”. | Resolves tag URLs which use “-” and continues to translate them successfully. |
| Technorati | Resolves tag URLs which use “ ” correctly, after re-encoding the URL with “%20”. Produces good results based upon using “ ”. | Resolves tag URLs which use “%20” correctly, replaces the “%20” with “ ” and produces good results based upon that. | Resolves tag URLs which use “-” correctly. Produces smaller, but not unreasonable, results, because “-” is rarely used to separate written words. |
| Del.icio.us | Resolves tag URLs which use “ ” correctly, after re-encoding the URL with “%20”. Produces results based upon using “ ”. | Resolves tag URLs which use “%20” correctly. Replaces “%20” with “ ” and produces results based upon using “ ”. | Resolves tag URLs which use “-” correctly. Produces results based upon using “-”. |
| Slynker | Attempts to resolve tag URLs which use “ ”, encoding the URL to use “%20” instead. Produces results based upon using “ ”. | Resolves tag URLs which use “%20” correctly. Replaces “%20” with “ ” and produces results based upon using “ ”. | Resolves tag URLs which use “_” correctly. Produces results based upon using “_”. |

As you’ve probably surmised by now, the issue is really about the convergence of two technologies and the incompatibilities they currently have: tagging blog posts (and other stuff too) and URL encoding. It is not down to the limitations that differing web1.0 and web2.0 platforms have around tag syntax, specifically multi-word tags, but to how correctly these platforms adhere to RFC 1738, the Uniform Resource Locators (URL) specification.

The problem is that tagging generally uses a relatively free-form syntax (driven mainly by the communities which use and propagate said tag nomenclature, or “Folksonomy”), whereas URL encoding has a variety of reserved characters which conflict with the characters used in tags.
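
To make the conflict concrete, here is a small sketch using Python’s standard `urllib.parse` module. It is only an illustration of the encoding rules; I am assuming nothing about how Roller, Technorati, Slynker or Google Translate are actually implemented, and the tag `web+2.0` is just an example value.

```python
from urllib.parse import quote, unquote, unquote_plus

tag = "web+2.0"  # example tag, chosen for illustration only

# Percent-encoding the literal plus sign gives the "%2B" form.
encoded = quote(tag, safe="")
print(encoded)          # web%2B2.0

# Encoding an already-encoded string escapes the "%" itself as "%25",
# which is how a "%252B" form can appear.
double_encoded = quote(encoded, safe="")
print(double_encoded)   # web%252B2.0

# A decoder following form-style rules reads "+" as a space...
print(unquote_plus(tag))  # web 2.0
# ...while a plain percent-decoder leaves it alone.
print(unquote(tag))       # web+2.0

# "-" and "_" are not reserved, so encoding leaves them untouched.
print(quote("web-2.0", safe=""))  # web-2.0
print(quote("web_2.0", safe=""))  # web_2.0
```

The same “+” can therefore mean a plus sign to one decoder and a space to another, while “-” and “_” mean the same thing before and after encoding.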

Characters for special use in defining URL syntax include the following “Reserved Characters”, which should be encoded where possible (although, as the data in the tables above shows, even the encoded URLs fail to produce the expected, or required, results).

| Character | Hex | Dec |
| --- | --- | --- |
| “$” (the dollar sign) | 24 | 36 |
| “&” (ampersand symbol) | 26 | 38 |
| “+” (plus sign) | 2B | 43 |
| “,” (comma symbol) | 2C | 44 |
| “/” (forward slash) | 2F | 47 |
| “:” (the colon) | 3A | 58 |
| “;” (the semi-colon) | 3B | 59 |
| “=” (equal sign) | 3D | 61 |
| “?” (the question mark) | 3F | 63 |
| “@” (the ‘at’ symbol) | 40 | 64 |
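
The hex and decimal values listed above are simply the ASCII codes of each character, and the encoded form is “%” followed by the hex code. A short Python sketch that reproduces them (the `RESERVED` string and `rows` list are names I made up for this example):

```python
from urllib.parse import quote

# The ten reserved characters listed above.
RESERVED = "$&+,/:;=?@"

# (character, hex code, decimal code, percent-encoded form)
rows = [(ch, f"{ord(ch):02X}", ord(ch), quote(ch, safe="")) for ch in RESERVED]

for ch, hex_code, dec_code, encoded in rows:
    print(f"{ch}  hex {hex_code}  dec {dec_code}  encoded {encoded}")
```

Passing `safe=""` matters here: by default `quote` treats “/” as safe and would leave it unencoded.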

Given that the above are “Reserved Characters” when it comes to URL encoding, that they include some of the most popular delimiters used in multi-word tags (specifically “+”, which is used a great deal, especially on Technorati), and that, as I found in the investigation above, they cause a number of issues when used both in multi-word tags and in URLs, I have decided to standardise on “-” as the multi-word tag delimiter of choice.

For me it has a number of advantages:

  1. it is saved and retrieved correctly in tags in the Roller edit post page
  2. the URL is encoded correctly in Roller too
  3. it resolves correctly whilst using Google Translate
  4. it returns all search results for both “-” and “ ” in Google Search – an unexpected bonus, in terms of returning search results (and thus being included in said search results)
  5. it returns reasonable results from Technorati, based upon “-”
  6. it returns reasonable results from Del.icio.us, based upon “-”
  7. it returns reasonable results from Slynker, based upon “-”

As to the issue of upper versus lower case, I have standardised on all lower case, as this has little effect in searches (outside of Technorati, which returns slightly differing results, albeit with a low delta between the results returned).

You may be able to see that I have started to retroactively replace the tags so far created with this new standard – however I have focused on the most popular tags for the time being, and I will continue to use this format from now on.

I found this article on “URL Encoding (or: ‘What are those “%20” codes in URLs?’)” provided a nice overview of the issues of URL encoding, and of RFC 1738 itself.