Tic, Tag, Toe - Horkan

Or rather “tagging, tags, and blog tag policy” or even “what’s the best / most optimal tag nomenclature / syntax”. After redesigning the blog interface I decided to start to rationalise my tags – and to institute a ‘tag policy’.

Tag Policy

Use “-” to delimit multi-word tags
Use all lower case characters
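
The two rules above are simple enough to sketch in a few lines of Python. This is only an illustration, not something the blog software does for me: the function name `normalize_tag` is made up, and the set of characters it treats as old-style word separators (whitespace, “+” and “_”) is my own assumption.

```python
import re


def normalize_tag(raw: str) -> str:
    """Apply the tag policy: all lower case, words joined with "-"."""
    # Assumption: whitespace, "+" and "_" are the old separators to replace.
    words = re.split(r"[\s+_]+", raw.strip().lower())
    # Drop empty pieces left behind by leading or trailing separators.
    return "-".join(word for word in words if word)


print(normalize_tag("Blog Tag Policy"))   # blog-tag-policy
print(normalize_tag("Sun+Microsystems"))  # sun-microsystems
```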

But “Why?”

For a long time I had been using the “+” symbol to link multi-word tags, but I found that Google Translate (which I use for the language translation capability, up on the top right of the page if you’re reading the blog at https://blogs.sun.com/eclectic/) was having problems processing URLs which contain “+” or “%2B”.

Here’s a little table I whipped up documenting the issues I was coming up against using multi-word tags, after trying out a number of delimiters, not just “+”, against a variety of technologies.

Delimiters tested were: “+”, “%2B”, “_”, “ ”, “%20” and “-”. Sites / technology tested were: Roller Weblogger (4.0-dev, the version we currently run https://blogs.sun.com on), Google Translate, Google Search, Technorati, Del.icio.us and Slynker.

	“+” (plus sign)	“%2B” (encoded plus sign)	“_” (underscore character)
Roller Weblogger 4.0-dev	Will save and retrieve posts which use tags with “+” in the editor. Will not resolve tag URLs which use “+” (actually the main site will, but individual blogs can’t).	Will save and retrieve posts which use tags with “%2B” in the editor. Will resolve tag URLs which use “%2B”.	Will save and retrieve posts which use tags with “_” in the editor. Will resolve tag URLs which use “_”.
Google Search	Will search and retrieve multi-word tags as they are written, i.e. with the “+”; the search produces a small number of results because “+” is rarely used to separate written words.	Will search and retrieve multi-word tags as they are written, i.e. with the “%2B”; the search produces a small number of results because “%2B” is rarely used to separate written words.	Will search and retrieve multi-word tags as they are written, i.e. with the “_”; the search produces a small number of results because “_” is rarely used to separate written words.
Google Translate	Attempts to resolve tag URLs which use “+”, encoding the URL to use “%2B” instead (which Roller can serve, see above), then promptly fails.	Fails to resolve the correct URL to translate using “%2B”.	Resolves tag URLs which use “_” and continues to translate them successfully.
Technorati	Resolves tag URLs which use “+” correctly. Replaces the “+” with “ ” and produces good results based upon that.	Resolves tag URLs which use “%2B” correctly. Replaces “%2B” with “ ” and produces good results based upon that.	Resolves tag URLs which use “_” correctly. Produces smaller, but not unreasonable, results, because “_” is rarely used to separate written words.
Del.icio.us	Resolves tag URLs which use “+” correctly. Produces results based upon using “+”.	Resolves tag URLs which use “%2B” correctly. Replaces “%2B” with “+” and produces results based upon using “+”.	Resolves tag URLs which use “+” correctly. Produces results based upon using “+”.
Slynker	Fails to resolve “+”. Produces no results.	Attempts to resolve tag URLs which use “%2B”, encoding the URL to use “%252B” instead. Produces results based upon using “+”.	Resolves tag URLs which use “_” correctly. Produces results based upon using “_”.

	“ ” (space)	“%20” (encoded space)	“-” (minus sign)
Roller Weblogger 4.0-dev	Will save posts which use tags with “ ” in the editor. Will not retrieve posts which use tags with “ ” in the editor; instead it separates the words, retrieving them all in alphabetical order. Will resolve tag URLs which use “ ”, encoding the URL to use “%20” instead.	Will save and retrieve posts which use tags with “%20” in the editor. Will resolve tag URLs which use “%20”.	Will save and retrieve posts which use tags with “-” in the editor. Will resolve tag URLs which use “-”.
Google Search	Will search and retrieve multi-word tags as they are written, i.e. with the “ ”; the search produces a large number of results.	Will search and retrieve multi-word tags as they are written, i.e. with the “%20”; the search produces a small number of results because “%20” is rarely used to separate written words.	Will search and retrieve multi-word tags as they are written, i.e. with the “-”, and will replace the “-” with “ ” as well, thus retrieving the most related information.
Google Translate	Attempts to resolve tag URLs which use “ ”, encoding the URL to use “%20” instead (which Roller can serve, see above), then promptly fails.	Fails to resolve the correct URL to translate using “%20”.	Resolves tag URLs which use “-” and continues to translate them successfully.
Technorati	Resolves tag URLs which use “ ” correctly, after re-encoding the URL with “%20”. Produces good results based upon using “ ”.	Resolves tag URLs which use “%20” correctly. Replaces the “%20” with “ ” and produces good results based upon that.	Resolves tag URLs which use “-” correctly. Produces smaller, but not unreasonable, results, because “-” is rarely used to separate written words.
Del.icio.us	Resolves tag URLs which use “ ” correctly, after re-encoding the URL with “%20”. Produces results based upon using “ ”.	Resolves tag URLs which use “%20” correctly. Replaces “%20” with “ ” and produces results based upon using “ ”.	Resolves tag URLs which use “-” correctly. Produces results based upon using “-”.
Slynker	Attempts to resolve tag URLs which use “ ”, encoding the URL to use “%20” instead. Produces results based upon using “ ”.	Resolves tag URLs which use “%20” correctly. Replaces “%20” with “ ” and produces results based upon using “ ”.	Resolves tag URLs which use “-” correctly. Produces results based upon using “-”.

As you’ve probably surmised by now, the issue is actually about the convergence of two technologies, tagging blog posts (and other stuff too) and URL encoding, and the incompatibilities they currently have. It is not so much down to the limitations that differing web1.0 and web2.0 platforms have around tag syntax, specifically multi-word tags, as to how correctly these platforms adhere to RFC 1738, the Uniform Resource Locators (URL) specification.

The problem is that tagging generally uses a relatively free form syntax (driven mainly by the communities which use and propagate said tag nomenclature, or “Folksonomy”), when and where possible, but that URL encoding has a variety of reserved characters, which conflict with the characters used in tags.
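
To make the conflict a bit more concrete, here is a small sketch using Python’s standard `urllib.parse` module. It is not how any of the sites tested actually work internally (I don’t know that); it just shows the kinds of transformation that seem to be in play: percent-encoding a tag once, encoding it a second time (which turns “%2B” into “%252B”), and form-style decoding, which treats “+” as a space. The helper name `encode_once` and the sample tag are made up for the example.

```python
from urllib.parse import quote, unquote_plus


def encode_once(tag: str) -> str:
    # safe="" tells quote() to percent-encode every reserved character.
    return quote(tag, safe="")


for delimiter in ["+", "_", " ", "-"]:
    tag = f"tag{delimiter}policy"
    once = encode_once(tag)
    twice = encode_once(once)          # what a second, mistaken encoding pass does
    form_decoded = unquote_plus(tag)   # form-style decoding turns "+" into " "
    print(repr(tag), once, twice, repr(form_decoded))
```

With “-” and “_” all three transformations leave the tag untouched, which matches how well those two delimiters behaved in the tables above.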

Characters for special use in defining URL syntax include the following “Reserved Characters”, which should be encoded where possible (although, as the data in the tables above shows, even the encoded URLs fail to produce the expected, or required, results).

Character	Hex	Dec
“$” (the dollar sign)	24	36
“&” (ampersand symbol)	26	38
“+” (plus sign)	2B	43
“,” (comma symbol)	2C	44
“/” (forward slash)	2F	47
“:” (the colon)	3A	58
“;” (the semi-colon)	3B	59
“=” (equal sign)	3D	61
“?” (the question mark)	3F	63
“@” (the ‘at’ symbol)	40	64
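
The hex and decimal values listed here are easy to check, and the same few lines can tell you whether a candidate delimiter is one of these reserved characters. This is a rough sketch: `RESERVED` is simply the ten characters listed above, and the helper names `describe` and `is_reserved` are invented for the example.

```python
from urllib.parse import quote

# The ten "Reserved Characters" listed above.
RESERVED = "$&+,/:;=?@"


def describe(ch: str):
    """Return (character, hex code, decimal code, percent-encoded form)."""
    code = ord(ch)
    return ch, f"{code:02X}", code, quote(ch, safe="")


def is_reserved(delimiter: str) -> bool:
    return delimiter in RESERVED


for ch in RESERVED:
    print("{}\t{}\t{}\t{}".format(*describe(ch)))

for delimiter in ["+", "_", "-"]:
    print(repr(delimiter), "reserved" if is_reserved(delimiter) else "not reserved")
```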

The above are “Reserved Characters” when it comes to URL encoding, and they include some of the most popular delimiters used in multi-word tags (specifically “+”, which is used a great deal, especially on Technorati). As I found in the investigation above, they cause a number of issues when used both in multi-word tags and in URL encoding, so I have decided to standardise on “-” as the multi-word tag delimiter of choice.

For me it has a number of advantages:

it is saved and retrieved correctly in tags on the Roller edit post page
the URL is encoded correctly in Roller too
it resolves correctly whilst using Google Translate
it returns all search results for both “-” and “ ” in Google Search – an unexpected bonus, in terms of returning search results (and thus being included in said search results)
it returns reasonable results from Technorati, based upon “-”
it returns reasonable results from Del.icio.us, based upon “-”
it returns reasonable results from Slynker, based upon “-”

As to the issue of upper versus lower case, I have standardised on all lower case, as this has little effect in searches (outside of Technorati, which returns slightly differing results, albeit with a low delta between the results returned).

You may be able to see that I have started to retroactively replace the tags so far created with this new standard – however I have focused on the most popular tags for the time being, and I will continue to use this format from now on.

I found that this article, “URL Encoding (or: ‘What are those “%20” codes in URLs?’)”, provided a nice overview of the issues of URL encoding, and of RFC 1738 itself.

Links for this article:

Recovered link: https://horkan.com/2008/01/08/tagging-tags-blog-tag-policy
Archived link: https://web.archive.org/web/20100715132403/https://blogs.sun.com/eclectic/entry/tagging_tags_blog_tag_policy
Original link: ~~https://blogs.sun.com/eclectic/entry/tagging_tags_blog_tag_policy~~