The Universe of Discourse
Sun, 26 Aug 2012

How to build a good tagging system
The world is full of shitty tagging systems, where the tags don't mean anything and you have no confidence that recipes involving duck will be tagged "duck". Despite the hard work of a few people on se.math, the tags there are not really useful for searching for what you want. The tags I put on my own bookmarks on pinboard are okay, but making them work is a lot of effort.

I have only seen one tagging system that works really well. If not for this example, I might dismiss the whole idea of collaborative tagging. I have spent a lot of time thinking about what makes this system work where all the others have failed.

The example is Danbooru, a widely used image uploading and sharing site. A warning: Many, although by no means most, of the images are pornographic. You can browse just the "safe" images at safebooru. Don't be tempted to dismiss this example just because it is frivolous. They do about ten things right that almost nobody else does right.

TECHNICAL

There is no limit on the number of tags an image can have

There is no limit on the number of tags you can apply to a single image. I'm having trouble constructing arguments in favor of this because it seems so obvious. If the image contains a sponge, and someone wants to tag it with "sponge", well, why on earth not? Maybe nobody else cares about sponges, but really, you never know, and what harm is there in having additional information, as long as it is accurate?

If you do a tag search for "bare feet", you want images that contain bare feet, preferably all of them, and certainly not some random subset of them. If you get more than you can use, you can always include more tags in your query to cut down the results of the search.

But if there were a limit on the number of tags, say five, that an image could have, then users would be in a difficult position of deciding which five were most important; people would remove one tag to make room for another. There might be three perfect images in the database for "bare feet" plus "blue dress" plus "beach", but if "bare feet" has been removed from one, "blue dress" from another, and "beach" from the third in order to satisfy an arbitrary limit on the number of tags, you will never find the ones you want. And what if someone wants to find a picture of a person with bare feet wearing a necktie; what a shame if there is one and its "bare feet" tag has been removed to make room for "necktie".
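Here is a minimal sketch of that argument in Python. Everything in it is made up for illustration: the image ids, the tag sets, and the `search` and `truncate` helpers are my assumptions, not anything Danbooru actually runs. A search is just "every wanted tag is present", and `truncate` simulates an arbitrary limit by keeping only the first few tags alphabetically.

```python
# Toy in-memory index: image id -> set of tags. All ids and tags are invented.
images = {
    "img1": {"bare feet", "blue dress", "beach", "sponge"},
    "img2": {"bare feet", "blue dress", "beach"},
    "img3": {"bare feet", "necktie"},
    "img4": {"blue dress", "beach"},
}


def search(index, *wanted):
    """Return the ids of images that carry every tag in `wanted`."""
    wanted = set(wanted)
    return sorted(img for img, tags in index.items() if wanted <= tags)


def truncate(index, limit):
    """Simulate a tag limit: keep only the first `limit` tags, alphabetically."""
    return {img: set(sorted(tags)[:limit]) for img, tags in index.items()}


# One tag gives a broad result; adding tags narrows it.
print(search(images, "bare feet"))                         # ['img1', 'img2', 'img3']
print(search(images, "bare feet", "blue dress", "beach"))  # ['img1', 'img2']

# With a two-tag limit the same query finds nothing, although the images exist.
print(search(truncate(images, 2), "bare feet", "blue dress", "beach"))  # []
```

Which tags get dropped under a real limit would depend on the users rather than on the alphabet, but the effect on the three-tag query is the same: the matching images are still in the database and the search can no longer see them.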

The tag system might need limits on what tags can exist, or what should be applied at all. But if a tag accurately describes the image to which it is applied, it should be there. There should not be an arbitrary upper limit on the amount of descriptive information that can be associated with an image.

Tag implications

Ontologies are often hierarchical. Hair can be long; if long it can be worn in pigtails. One user might be interested in a search for pictures of people with long hair, and another only with pigtails. You don't want user A to miss out on the pictures with pigtails because they searched for "long hair" and not "long hair" + "pigtails". But you also don't want to have to manually add "long hair" to every "pigtails" picture.

Danbooru has a "tag implication" system where certain tags are deemed to "imply" others: when you add the tag "pigtails" to an image, the "long hair" tag is automatically added at the same time. When you add the tag "blue dress" to an image, "dress" is added also.

You might not be aware that there was consensus last month to create a new tag, "head covering", which applies to hats, helmets, headscarves, and so on, and so you might not know that when you tag an image with "helmet" it should get "head covering" also. But the implication will be in place, and the tag will be automatically applied. There's a good chance that you will notice this, and you'll learn something useful about the ontology that will help you with later searches.

The tag implication system is nontrivial. Care has to be taken to prevent tag loops. Danbooru users spend a lot of time discussing tag implications and which ones should be set up. But they are an essential feature and worth the effort.
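To make the loop problem concrete, here is a small sketch of how implications could work, assuming they are stored as a table from each tag to the tags it directly implies. The table contents and the `expand` and `add_implication` functions are hypothetical; I don't know how Danbooru implements this. `expand` follows implications transitively, and `add_implication` refuses any new implication whose target already leads back to the source.

```python
# Hypothetical implication table: tag -> tags it directly implies.
implications = {
    "pigtails": {"long hair"},
    "blue dress": {"dress"},
    "helmet": {"head covering"},
}


def expand(tags, implies):
    """Return `tags` plus everything they imply, directly or indirectly."""
    result = set()
    pending = list(tags)
    while pending:
        tag = pending.pop()
        if tag not in result:  # skipping seen tags also guarantees termination
            result.add(tag)
            pending.extend(implies.get(tag, ()))
    return result


def add_implication(implies, tag, implied):
    """Record `tag` -> `implied`, refusing any edge that would close a loop."""
    # The new edge creates a loop exactly when `tag` is already reachable
    # from `implied` (this includes the degenerate case tag == implied).
    if tag in expand({implied}, implies):
        raise ValueError(f"{tag!r} -> {implied!r} would create a tag loop")
    implies.setdefault(tag, set()).add(implied)


print(sorted(expand({"pigtails", "blue dress"}, implications)))
# ['blue dress', 'dress', 'long hair', 'pigtails']

add_implication(implications, "long hair", "hair")          # fine
try:
    add_implication(implications, "hair", "pigtails")       # would close a loop
except ValueError as err:
    print(err)
```

The mechanical check is the easy part. Deciding which implications ought to exist is the part that takes all the discussion.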

Tag aliases

Good, simple API

Tag wiki: easy to use, provides examples

Easily-viewed record of tag changes

Tag pools instead of personal tags

SOCIAL

* Requirement that tags be *objective*
* Good curation
* Good *tools* for curation: when community agrees to merge two tags, the admin does it promptly
* Community dedication to tagging

CRITICISM

* Image size tags are out of place


Rewriting published history in Git
My earlier article about my habits using Git attracted some comment, most of which was favorable. But one recurring comment was puzzlement about my seeming willingness to rewrite published history. In practice, this was not at all a problem, I think for three reasons:

  1. Rewriting published history is not nearly as confusing as people seem to think it will be.
  2. I worked in a very small shop with very talented developers, so the necessary communication was easy.
  3. Our repository setup and workflow were very well-designed and unusually effective, and made a lot of things easier, including this one.
This article is about item 3. Here's what they do at my previous workplace to avoid most of the annoyances of people rewriting published history.

If there are N developers, there are N+1 repositories.

There is a master repository to which only a few very responsible persons can push. It is understood that history in this repository should almost never be rewritten, only in the most exceptional circumstances. We usually call this master repository gitbox. It has only a couple of branches, typically master and deployed. You had better not push incomplete work to master, because if you do someone is likely to deploy it. When you deploy a new version from master, you advance deployed up to master to match.

In addition, each developer has their own semi-public repository, named after them, which everyone can read, but which nobody but them can write. Mine is mjd, and that's what we call it when discussing it, but my personal git configuration calls it origin. When I git push origin master I am pushing to this semi-public repo.

It is understood that this semi-public repository is my sandbox and I am free to rewrite whatever history I want in it. People building atop my branches in this repo, therefore, know that they should be prepared for me to rewrite the history they see there, or to contact me if they want me to desist for some reason.

When I get the changes in my own semi-public repository the way I want them, then I push the changes up to gitbox. Nothing is considered truly "published" until it is on the master repo.

When a junior programmer is ready to deploy to the master repository, they can't do it themselves, because they only have read access on the master. Instead, they publish to their own semi-public repository, and then notify a senior programmer to review the changes. The senior programmer will then push those changes to the master repository and deploy them.

The semi-public mjd repo has lots of benefits. I can rewrite my branches 53 times a day (and I do!) but nobody will care. Conversely, I don't need to know or care how much my co-workers vacillate.

If I do work from three or four different machines, I can use the mjd repo to exchange commits between them. At the end of the day I will push my work-in-progress up to the mjd repo, and then if I want to look at it later that evening, I can fetch the work-in-progress to my laptop or another home computer.

I can create and abandon many topic branches without cluttering up the master repository's history. If I want to send a change or a new test file to a co-worker, I can push it to mjd and then point them at the branch there.

A related note: There is a lot of FUD around the rewriting of published history. For example, the "gitinfo" robot on the #git IRC channel has a canned message:

Rewriting public history is a very bad idea. Anyone else who may have pulled the old history will have to git pull --rebase and even worse things if they have tagged or branched, so you must publish your humiliation so they know what to do. You will need to git push -f to force the push. The server may not allow this. See receive.denyNonFastForwards (git-config)

I think this grossly exaggerates the problems. Very bad! Humiliation! The server may deny you! But dealing with a rebased upstream branch is not very hard. It is at worst annoying: you have to rebase your subsequent work onto the rewritten branch and move any refs that pointed to that branch. If you don't have any subsequent work, you might still have to move refs, if you have any that point to it, but you might not have any.
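It may help to see what a repository is checking when it refuses a rewrite, which is what the receive.denyNonFastForwards setting in the canned message is about. An update is a fast-forward when the old branch tip is an ancestor of the new one; a rewritten branch fails that test. What follows is a toy model in Python, not Git's implementation: the commit graph, `is_ancestor`, `accepts_push`, and the `allow_rewrite` flag are all invented for illustration. I also haven't said whether gitbox enforced the rule mechanically or only by convention, so read `allow_rewrite=False` as standing for "a repository where rewriting is not accepted".

```python
# Toy commit graph: commit id -> list of parent ids. The ids are made up.
parents = {
    "A": [],
    "B": ["A"],
    "C": ["B"],    # original branch tip
    "C2": ["B"],   # the same change after being amended or rebased
    "D": ["C"],    # new work committed on top of C
}


def is_ancestor(graph, old, new):
    """True if `old` is reachable from `new` by following parent links."""
    seen, pending = set(), [new]
    while pending:
        commit = pending.pop()
        if commit == old:
            return True
        if commit not in seen:
            seen.add(commit)
            pending.extend(graph[commit])
    return False


def accepts_push(graph, old_tip, new_tip, allow_rewrite):
    """A repo that forbids rewrites accepts only fast-forward updates."""
    return allow_rewrite or is_ancestor(graph, old_tip, new_tip)


print(accepts_push(parents, "C", "D", allow_rewrite=False))   # True: fast-forward
print(accepts_push(parents, "C", "C2", allow_rewrite=False))  # False: C was rewritten
print(accepts_push(parents, "C", "C2", allow_rewrite=True))   # True: sandbox repo
```

In the model, anyone who had built D on top of C is in exactly the situation described above once the branch moves to C2: D still points at the old C, and has to be rebased onto C2.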

[ Thanks to Rik Signes for helping me put this together. ]
