December 31st, 2003

breaking bad

The mathematics of sex

espresso_addict has posted a couple of times recently about the gender genie, which claims to predict a blog writer's gender on the basis of textual analysis. My mathematical curiosity is piqued by the claimed success rate of 80%, compared with the almost 100% failure rate I have observed from my own and my online friends' testing of the system.

My suspicion is that the algorithm's developers have inadvertently produced a high level of false positives by

- testing more samples of male writers than of female
- producing an algorithm which generates more male than female positives

If the trend were strong enough in both cases, you could get an 80% success rate from something which is effectively nothing more than a biased random number generator.
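To make the arithmetic concrete, here is a rough sketch of the expected success rate of a guesser that ignores the text entirely and just says 'male' some fixed fraction of the time. The function name and the 90% figures are my own illustrative assumptions, not numbers from the gender genie or its developers.

```python
def expected_accuracy(p_male_sample: float, p_male_guess: float) -> float:
    """Expected success rate of a guesser that never looks at the text.

    p_male_sample: fraction of test samples written by men (assumed figure)
    p_male_guess:  fraction of the time the guesser says 'male' (assumed figure)
    """
    p_female_sample = 1.0 - p_male_sample
    p_female_guess = 1.0 - p_male_guess
    # Correct when it says 'male' for a male writer or 'female' for a female writer.
    return p_male_sample * p_male_guess + p_female_sample * p_female_guess


# Hypothetical case: 90% male samples, guesser says 'male' 90% of the time.
print(round(expected_accuracy(0.9, 0.9), 2))  # 0.82
```

So with both biases at about 90%, a guesser with no real predictive power at all would land at roughly the claimed 80%.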

I'm not suggesting any sinister motives behind this. The livejournal community in which we are embedded includes more female than male contributors, but is not prejudiced against males. Published fiction and non-fiction, however, includes more male than female contributors, and this is the testing environment that the researchers used. (Alternatively, they may have developed the algorithm by working with their own friends and colleagues, again with some gender bias.)

In contrast, we have tested this system by

- retaining the algorithm which generates high levels of 'male' predictions
- testing more samples of female writing

This produces a very high failure rate from the same algorithm.
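The reversal can be checked with a small simulation. This is only a sketch under assumed figures: a guesser that says 'male' 90% of the time regardless of the text, tried on a sample that is 90% female. Neither number comes from the gender genie itself, and the function name is made up for illustration.

```python
import random


def simulate_failure_rate(p_male_sample, p_male_guess, n=100_000, seed=0):
    """Fraction of wrong guesses from a guesser that ignores the text.

    Both probabilities are illustrative assumptions, not measured values.
    """
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    wrong = 0
    for _ in range(n):
        actual_male = rng.random() < p_male_sample
        guessed_male = rng.random() < p_male_guess  # independent of the writer
        if actual_male != guessed_male:
            wrong += 1
    return wrong / n


# Same 'male'-heavy guesser, but now the sample is mostly female writers.
print(simulate_failure_rate(p_male_sample=0.1, p_male_guess=0.9))
```

Under these assumptions the failure rate comes out at roughly 80%, the mirror image of the success rate on a mostly male sample. Getting close to 100% failure would need an even stronger bias, or a very small and mostly female group of testers.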

I suspect that this system has received so much publicity and so little critical comment because it harmonises well with current obsessions, and because it allows us (like a horoscope, for example) to read predictions about ourselves, which is always fun.