|The Universe of Discourse|
Sun, 26 Aug 2012
How to build a good tagging system
I have only seen one tagging system that works really well. If not for this example, I might dismiss the whole idea of collaborative tagging. I have spent a lot of time thinking about what makes this system work where all the others have failed.
The example is Danbooru, a widely used image uploading and sharing site. A warning: Many, although by no means most, of the images are pornographic. You can browse just the "safe" images at safebooru. Don't be tempted to dismiss this example just because it is frivolous. They do about ten things right that almost nobody else does right.
TECHNICAL
There is no limit on number of tags an image can have
There is no limit on the number of tags you can apply to a single image. I'm having trouble constructing arguments in favor of this because it seems so obvious. If the image contains a sponge, and someone wants to tag it with "sponge", well, why on earth not? Maybe nobody else cares about sponges, but really, you never know, and what harm is there in having additional information, as long as it is accurate?
If you do a tag search for "bare feet", you want images that contain bare feet, preferably all of them, and certainly not some random subset of them. If you get more than you can use, you can always include more tags in your query to cut down the results of the search.
But if there were a limit on the number of tags, say five, that an image could have, then users would be in a difficult position of deciding which five were most important; people would remove one tag to make room for another. There might be three perfect images in the database for "bare feet" plus "blue dress" plus "beach", but if "bare feet" has been removed from one, "blue dress" from another, and "beach" from the third in order to satisfy an arbitrary limit on the number of tags, you will never find the ones you want. And what if someone wants to find a picture of a person with bare feet wearing a necktie; what a shame if there is one and its "bare feet" tag has been removed to make room for "necktie".
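The effect of a tag limit on multi-tag searches can be sketched in a few lines. This is a minimal illustration, not Danbooru's implementation: it assumes a search is the intersection of per-tag sets of image ids, and the image ids, `build_index`, and `search` are all hypothetical names invented for the example.

```python
from collections import defaultdict


def build_index(images):
    """Map each tag to the set of image ids that carry it."""
    index = defaultdict(set)
    for image_id, tags in images.items():
        for tag in tags:
            index[tag].add(image_id)
    return index


def search(index, *tags):
    """Return the ids of images that carry every tag in the query."""
    if not tags:
        return set()
    result = set(index.get(tags[0], set()))
    for tag in tags[1:]:
        # Each extra tag can only narrow the result, never widen it.
        result &= index.get(tag, set())
    return result


# Hypothetical database with no limit on tags per image.
images = {
    1: {"bare feet", "blue dress", "beach", "sponge"},
    2: {"bare feet", "blue dress", "beach"},
    3: {"bare feet", "necktie"},
}
index = build_index(images)
print(sorted(search(index, "bare feet")))                         # [1, 2, 3]
print(sorted(search(index, "bare feet", "blue dress", "beach")))  # [1, 2]

# Simulate an arbitrary limit: "bare feet" was dropped from image 2
# to make room for other tags.
limited = dict(images)
limited[2] = {"blue dress", "beach"}
limited_index = build_index(limited)
print(sorted(search(limited_index, "bare feet", "blue dress", "beach")))  # [1]
```

Adding tags to a query narrows the result, which is why extra accurate tags on an image cost the searcher nothing, while a single missing tag silently removes the image from every query that mentions it.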
The tag system might need limits on what tags can exist, or what should be applied at all. But if a tag accurately describes the image to which it is applied, it should be there. There should not be an arbitrary upper limit on the amount of descriptive information that can be associated with an image.
Tag implications
Ontologies are often hierarchical. Hair can be long; if long it can be worn in pigtails. One user might be interested in a search for pictures of people with long hair, and another only with pigtails. You don't want user A to miss out on the pictures with pigtails because they searched for "long hair" and not "long hair" + "pigtails". But you also don't want to have to manually add "long hair" to every "pigtails" picture.
Danbooru has a "tag implication" system where certain tags are deemed to "imply" others: when you add the tag "pigtails" to an image, the "long hair" tag is automatically added at the same time. When you add the tag "blue dress" to an image, "dress" is added also.
You might not be aware that there was consensus last month to create a new tag, "head covering", which applies to hats, helmets, headscarves, and so on, and so you might not know that when you tag an image with "helmet" it should get "head covering" also. But there will be an implication, and the tag will be automatically applied. There's a good chance that you will notice this, and you'll learn something useful about the ontology that will help you with later searches.
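One way to picture tag implications is as a table from each tag to the tags it implies, applied whenever tags are added to an image. The sketch below is an assumption about how such a system could work, not a description of Danbooru's code; in particular, following chains of implications transitively is a guess, and `IMPLICATIONS` and `expand_tags` are hypothetical names.

```python
# Hypothetical implication table built from the examples in the text.
IMPLICATIONS = {
    "pigtails": {"long hair"},
    "blue dress": {"dress"},
    "helmet": {"head covering"},
}


def expand_tags(tags, implications):
    """Return the given tags plus every tag they imply, directly or indirectly."""
    expanded = set(tags)
    pending = list(tags)
    while pending:
        tag = pending.pop()
        for implied in implications.get(tag, ()):
            # Only queue tags we have not seen, so the loop always terminates.
            if implied not in expanded:
                expanded.add(implied)
                pending.append(implied)
    return expanded


print(sorted(expand_tags({"pigtails", "helmet"}, IMPLICATIONS)))
# ['head covering', 'helmet', 'long hair', 'pigtails']
```

With the implied tags stored on the image itself, a plain search for "long hair" finds the "pigtails" pictures without the search code knowing anything about the ontology.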
The tag implication system is nontrivial. Care has to be taken to prevent tag loops. Danbooru users spend a lot of time discussing tag implications and which ones should be set up. But they are an essential feature and worth the effort.
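Preventing tag loops amounts to a reachability check before a new implication is accepted. The following is a minimal sketch under the same assumed table-of-sets representation; `would_create_loop` is a hypothetical helper, not part of any real tagging system's API.

```python
def would_create_loop(implications, tag, implied):
    """Return True if adding "tag implies implied" would create a cycle.

    A cycle appears exactly when `tag` is already reachable from `implied`
    by following existing implications (or when the two tags are the same).
    """
    seen = set()
    stack = [implied]
    while stack:
        current = stack.pop()
        if current == tag:
            return True
        if current in seen:
            continue
        seen.add(current)
        stack.extend(implications.get(current, ()))
    return False


# Hypothetical existing table: pigtails -> long hair.
existing = {"pigtails": {"long hair"}}

print(would_create_loop(existing, "blue dress", "dress"))    # False
print(would_create_loop(existing, "long hair", "pigtails"))  # True
```

A curation tool could run a check like this when an admin proposes a new implication and refuse the ones that come back `True`.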
Good, simple API
Tag wiki: easy to use, provides examples
Easily-viewed record of tag changes
Tag pools instead of personal tags
SOCIAL
* Requirement that tags be *objective*
* Good curation
* Good *tools* for curation: when community agrees to merge two tags, the admin does it promptly
* Community dedication to tagging
CRITICISM
* Image size tags are out of place
Rewriting published history in Git
If there are N developers, there are N+1 repositories.
There is a master repository to which only a few very responsible persons can push. It is understood that history in this repository should almost never be rewritten, only in the most exceptional circumstances. We usually call this master repository gitbox. It has only a couple of branches, typically master and deployed. You had better not push incomplete work to master, because if you do someone is likely to deploy it. When you deploy a new version from master, you advance deployed up to master to match.
In addition, each developer has their own semi-public repository, named after them, which everyone can read, but which nobody but them can write. Mine is mjd, and that's what we call it when discussing it, but my personal git configuration calls it origin. When I git push origin master I am pushing to this semi-public repo.
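A developer's local configuration for this arrangement might look something like the fragment below. It is only a sketch: the host name and repository paths are placeholders invented for illustration, and the only details taken from the text are that `origin` points at the semi-public `mjd` repository and that the master repository is called `gitbox`.

```ini
# .git/config (fragment); URLs are hypothetical placeholders
[remote "origin"]
	# my semi-public sandbox; "git push origin master" lands here
	url = git@git.example.com:mjd.git
	fetch = +refs/heads/*:refs/remotes/origin/*
[remote "gitbox"]
	# the master repository; only a few people can push here
	url = git@git.example.com:gitbox.git
	fetch = +refs/heads/*:refs/remotes/gitbox/*
```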
It is understood that this semi-public repository is my sandbox and I am free to rewrite whatever history I want in it. People building atop my branches in this repo, therefore, know that they should be prepared for me to rewrite the history they see there, or to contact me if they want me to desist for some reason.
When I get the changes in my own semi-public repository the way I want them, then I push the changes up to gitbox. Nothing is considered truly "published" until it is on the master repo.
When a junior programmer is ready to deploy to the master repository, they can't do it themselves, because they only have read access on the master. Instead, they publish to their own semi-public repository, and then notify a senior programmer to review the changes. The senior programmer will then push those changes to the master repository and deploy them.
The semi-public mjd repo has lots of benefits. I can rewrite my branches 53 times a day (and I do!) but nobody will care. Conversely, I don't need to know or care how much my co-workers vacillate.
If I do work from three or four different machines, I can use the mjd repo to exchange commits between them. At the end of the day I will push my work-in-progress up to the mjd repo, and then if I want to look at it later that evening, I can fetch the work-in-progress to my laptop or another home computer.
I can create and abandon many topic branches without cluttering up the master repository's history. If I want to send a change or a new test file to a co-worker, I can push it to mjd and then point them at the branch there.
A related note: There is a lot of FUD around the rewriting of published history. For example, the "gitinfo" robot on the #git IRC channel has a canned message:
Rewriting public history is a very bad idea. Anyone else who may have pulled the old history will have to git pull --rebase and even worse things if they have tagged or branched, so you must publish your humiliation so they know what to do. You will need to git push -f to force the push. The server may not allow this. See receive.denyNonFastForwards (git-config)
I think this grossly exaggerates the problems. Very bad! Humiliation! The server may deny you! But dealing with a rebased upstream branch is not very hard. It is at worst annoying: you have to rebase your subsequent work onto the rewritten branch and move any refs that pointed to that branch. If you don't have any subsequent work, you might still have to move refs, if you have any that point to it, but you might not have any.
[ Thanks to Rik Signes for helping me put this together. ]