Pure
Hi Black,
you said,
//With all the analysing one fact still remains: the determination of the gender is not correct in a lot of cases, 40%.//
This is not correct. See the excerpt from the 2002 paper
_Automatically Categorizing Written Texts by Author Gender_
by Moshe Koppel, Shlomo Argamon, and Anat Rachel Shimoni
http://www.cs.biu.ac.il/~koppel/papers/male-female-llc-final.pdf
One of the difficulties in obtaining greater accuracy overall is the difference between fiction and non-fiction. These differences are generally greater than the difference between male and female writing styles, and thus training on fiction and non-fiction documents together actually harms results. When we train together, accuracy on fiction test documents is 74.5% and on non-fiction is 79.7%. When we train only on fiction documents (thus using a substantially smaller training set), results of 36-fold cross-validation (maintaining ten examples per fold) actually increase to 79.5%. Likewise, when training on non-fiction only, accuracy on non-fiction test documents increased to 82.6%.
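For anyone curious what the cross-validation in that excerpt actually involves, here is a minimal sketch. The interleaved fold assignment and the numbers are illustrative; the paper's own learner and exact fold construction are not shown.

```python
# Generic sketch of k-fold cross-validation: each fold serves once as the
# test set while the model trains on the remaining folds. The interleaved
# fold assignment here is illustrative, not the paper's.
def k_fold_indices(n_items, k):
    """Yield (train_indices, test_indices) pairs for k-fold cross-validation."""
    folds = [list(range(i, n_items, k)) for i in range(k)]
    for i, test in enumerate(folds):
        train = [j for f in folds[:i] + folds[i + 1:] for j in f]
        yield train, test

# 36-fold CV over a hypothetical 360-document corpus gives ten test
# documents per fold, matching the "ten examples per fold" setup above.
splits = list(k_fold_indices(360, 36))
```

Accuracy is then averaged over the 36 test folds, so every document is scored exactly once while the model never sees its own test documents during training.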
As to your statement,
//I wager because the premisses were biased to begin with. The difference between male and female writing? Tricky, tricky, tricky.//
I'm not sure what premisses you're thinking of. The hypothesis is that there are differences. The null hypothesis is 'no differences.' The hypothesis is not an assumption or a 'premise.'
It was supported to a fair degree.
Were there in fact no differences, the listed words would fail to discriminate at all; in fact, they do so about 80% of the time.
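To make the "discriminating words" idea concrete, here is a minimal sketch of a Genie-style scorer. The word lists and weights below are invented placeholders, not the actual coefficients from the paper or the Gender Genie site.

```python
# Sketch of a Gender-Genie-style scorer: each keyword carries a weight,
# and the signed sum over a text decides the guess. Weights here are
# illustrative placeholders, NOT the published Koppel/Argamon values.
import re

# Hypothetical weights: positive leans "male", negative leans "female".
WEIGHTS = {
    "the": 7, "a": 6, "of": 4,        # determiners et al. (male-leaning)
    "she": -6, "her": -9, "not": -4,  # pronouns and negation (female-leaning)
}

def gender_score(text):
    """Return (score, guess): positive score -> 'male', negative -> 'female'."""
    words = re.findall(r"[a-z']+", text.lower())
    score = sum(WEIGHTS.get(w, 0) for w in words)
    return score, ("male" if score >= 0 else "female")
```

Note how heavily such a scorer can lean on pronouns, which is exactly the POV weakness discussed in this thread, and why a pronoun-free variant of the algorithm matters.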
By the way, I did write to Koppel about some of these issues, and he has stated that a version of the algorithm NOT using pronouns exists, and has expressed an interest in some aspects of our discussion.
Note to Gary,
//According to Lauren's poll of who enjoys which categories, women are partial to stories of male homosexuality, that is: erotica with no female characters. So suppose you've written such a story and want to give it maximum feminine reader appeal. Gender Genie will say you haven't hit the mark because you have no feminine pronouns. To improve the story you find ways to introduce feminine pronouns.//
With all due respect, there are several mistaken assumptions here, the principal one being that female authorship implies female readership, and similarly for males.
You have chosen to use the Genie to check for appeal to females, and that was never its intention. So the statement I've quoted is not accurate at all. I tried a male/male story, and the Genie assessment was 'male authored.' Period. How enjoyable the story might be to Literotica women is completely irrelevant; it's solely your conclusion if you mistakenly decide that the mark was missed. Females read books they know to be male authored all the time. (Similarly for males who read, say, the Harry Potter stories.) And again it is a mistake--or an assumption without evidence--that female *readership* would be increased by introducing female pronouns.
//I don't think it's worth the trouble of analysing the algorithm, unless you're a computer programmer who wants to know how it works.//
OK, that's your personal view. I find dissecting it, especially to eliminate the weakness associated with POV, to be interesting.
Further, the bigger question mentioned by Koppel, the differences between fiction and non-fiction writing, is also of interest to me, since I write both, and any differences of the Koppel type are largely due to unconscious processes.
I quote from the 2002 paper:
An interesting phenomenon that is evident in Table 3 is that the differences between male and female usages of various features parallel more extreme differences between fiction and non-fiction: determiners, which are used more by men, are used more by all authors in non-fiction; pronouns and negation, which are used more by women, are used more by all authors in fiction.
The extreme differences between fiction and non-fiction suggest that distinguishing between the two genres ought to be an easier task than distinguishing between male and female authors. And indeed it is. Using the same corpus and same learning methodology as above on the fiction/non-fiction problem, ten runs of 56-fold cross-validation yields accuracy of 98%. Table 4 shows results for each of the three feature sets.
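The feature classes mentioned in that passage--determiners, pronouns, negation--are simply relative word frequencies, which can be sketched as follows. The word lists are small illustrative samples, not the paper's full feature sets.

```python
# Sketch of the kind of counts behind the paper's Table 3: relative
# frequency (per 1000 words) of a few word classes. The vocabularies
# are tiny illustrative samples, not the paper's full feature sets.
import re

CLASSES = {
    "determiners": {"the", "a", "an", "these", "those"},
    "pronouns": {"i", "you", "she", "he", "her", "him", "they"},
    "negation": {"not", "no", "never"},
}

def feature_rates(text):
    """Frequency per 1000 words of each feature class."""
    words = re.findall(r"[a-z']+", text.lower())
    n = len(words) or 1
    return {name: 1000 * sum(w in vocab for w in words) / n
            for name, vocab in CLASSES.items()}
```

Comparing such rates across a fiction corpus and a non-fiction corpus would show the pattern the paper describes: determiners higher in non-fiction, pronouns and negation higher in fiction.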