Writing styles of genders, ages, and authors: A PhD thesis

Discussion of natural languages, or language in general.
Chuma
Avisaru
Avisaru
Posts: 387
Joined: Sat Oct 28, 2006 9:01 pm
Location: Hyperborea

Writing styles of genders, ages, and authors: A PhD thesis

Post by Chuma »

I'm finally finishing my thesis, yay! It's in computer science, but leaning heavily towards linguistics. The main focus is on text classification; that is, how can we get a machine to statistically analyse a text and figure out, for example, who wrote the text, the author's gender or age, whether the text is fact or fiction, or so on.

Highlights include:

- why most methods for classification are not as accurate as they think, and what we can do about it
- how to detect trolls on web forums
- whether it's possible to accurately guess a person's age, gender, profession, and astrological sign
- apparently women and children can be identified by the same stylistic patterns
- lots of pretty graphs

It will be printed by the end of next week, and I'll put it online for those who want to read it.

But if anyone wants to take a look right away, and maybe let me know if I've missed anything, that would be great. Can't really upload it while it's unfinished, but let me know and I can send it to you.

And thanks to all of you on the board for keeping me interested in language all these years!

xxx
Lebom
Lebom
Posts: 94
Joined: Wed Dec 14, 2011 1:04 pm

1984 ways to detect terrorism...

Post by xxx »

Bouh... another skynet agent...

Chuma
Avisaru
Avisaru
Posts: 387
Joined: Sat Oct 28, 2006 9:01 pm
Location: Hyperborea

Re: Writing styles of genders, ages, and authors: A PhD thes

Post by Chuma »

Actually, I was originally funded by the military, who wanted to use the technology to spy on people on the internet. But that didn't work out, so I guess that makes me a rebel now?

Ars Lande
Avisaru
Avisaru
Posts: 382
Joined: Thu Oct 14, 2010 7:34 am
Location: Paris

Re: Writing styles of genders, ages, and authors: A PhD thes

Post by Ars Lande »

Chuma wrote:I'm finally finishing my thesis, yay! It's in computer science, but leaning heavily towards linguistics. The main focus is on text classification; that is, how can we get a machine to statistically analyse a text and figure out, for example, who wrote the text, the author's gender or age, whether the text is fact or fiction, or so on.

Highlights include:

- why most methods for classification are not as accurate as they think, and what we can do about it
- how to detect trolls on web forums
- whether it's possible to accurately guess a person's age, gender, profession, and astrological sign
- apparently women and children can be identified by the same stylistic patterns
- lots of pretty graphs

It will be printed by the end of next week, and I'll put it online for those who want to read it.

But if anyone wants to take a look right away, and maybe let me know if I've missed anything, that would be great. Can't really upload it while it's unfinished, but let me know and I can send it to you.

And thanks to all of you on the board for keeping me interested in language all these years!
Congratulations! I'm afraid I know pretty much nothing about corpus analysis - so I'd probably be a hindrance rather than a help. It does sound fascinating though, and I'll be happy to read it when it's ready.

I'd be interested in some highlights though :) so what can be detected by state-of-the-art techniques? Did the astrological sign thing work out?

Axiem
Avisaru
Avisaru
Posts: 260
Joined: Tue Oct 22, 2013 8:15 pm

Re: Writing styles of genders, ages, and authors: A PhD thes

Post by Axiem »

Ars Lande wrote:Did the astrological sign thing work out?
I will be thoroughly shocked if it's possible to guess horoscope sign from writing style.

Chuma
Avisaru
Avisaru
Posts: 387
Joined: Sat Oct 28, 2006 9:01 pm
Location: Hyperborea

Re: Writing styles of genders, ages, and authors: A PhD thes

Post by Chuma »

It does indeed not seem to be possible. Age and gender can be done surprisingly well, but job is so far mostly unsuccessful, and astrological sign shows no sign of being possible.

Salmoneus
Sanno
Sanno
Posts: 3197
Joined: Thu Jan 15, 2004 5:00 pm
Location: One of the dark places of the world

Re: Writing styles of genders, ages, and authors: A PhD thes

Post by Salmoneus »

How are you able to eliminate the sociological element?

That is, can you actually detect women, or are you just detecting people who act in a way that is stereotypical for women in modern Anglophone culture? The latter is obviously possible, but also not interesting. The former would be interesting, but I'm not sure methodologically how you could prove that that's what you were doing? I suppose for a start you would need to include samples from a wide range of cultures.

It does seem a surprisingly fascistic right-wing project you have there.
Blog: http://vacuouswastrel.wordpress.com/

But the river tripped on her by and by, lapping
as though her heart was brook: Why, why, why! Weh, O weh
I'se so silly to be flowing but I no canna stay!

mèþru
Smeric
Smeric
Posts: 1984
Joined: Thu Oct 29, 2015 6:44 am
Location: suburbs of Mrin

Re: Writing styles of genders, ages, and authors: A PhD thes

Post by mèþru »

Well gender is a social construct. I imagine that the child detection also works with children of Western cultures only.
ìtsanso, God In The Mountain, may our names inspire the deepest feelings of fear in urkos and all his ilk, for we have saved another man from his lies! I welcome back to the feast hall kal, who will never gamble again! May the eleven gods bless him!
kårroť

Axiem
Avisaru
Avisaru
Posts: 260
Joined: Tue Oct 22, 2013 8:15 pm

Re: Writing styles of genders, ages, and authors: A PhD thes

Post by Axiem »

My assumption is that this, along with other things along these lines I've seen, is implicitly working within the bounds of general Anglophone culture. That is, instead of directly guessing the author's sex, it's measuring how much the author's style (etc.) matches up with the ascertained trends in style (etc.) from a corpus sorted into two buckets.

In a roundabout way, it's seeing whether people in the different categories (such as sexes or ages or horoscope signs) do in general tend to have a distinct style (etc.) in their writing. Assuming it works like other things like it I've seen, it's more trend-analysis than prescriptively working from stereotypes.

I don't see how this is particularly right-wing at all, though.

mèþru
Smeric
Smeric
Posts: 1984
Joined: Thu Oct 29, 2015 6:44 am
Location: suburbs of Mrin

Re: Writing styles of genders, ages, and authors: A PhD thes

Post by mèþru »

Axiem wrote:I don't see how this is particularly right-wing at all, though.
Gathering information about people from their writings? Surveillance? It's not really right-wing. I think Salmoneus is thinking of authoritarianism, which can be either left or right.

Vijay
Smeric
Smeric
Posts: 2244
Joined: Sat Feb 06, 2016 3:25 pm
Location: Austin, TX, USA

Re: Writing styles of genders, ages, and authors: A PhD thes

Post by Vijay »

Axiem wrote:I don't see how this is particularly right-wing at all, though.
Categorizing people into neat little demographic groups based on personal data collected and compiled from surveys is pretty characteristically right-wing and doesn't make sense given how diverse people actually are. (People don't really fit into such neat little demographic groups).

Axiem
Avisaru
Avisaru
Posts: 260
Joined: Tue Oct 22, 2013 8:15 pm

Re: Writing styles of genders, ages, and authors: A PhD thes

Post by Axiem »

By that logic, pretty much all of psychology and sociology is right-wing.

Vijay
Smeric
Smeric
Posts: 2244
Joined: Sat Feb 06, 2016 3:25 pm
Location: Austin, TX, USA

Re: Writing styles of genders, ages, and authors: A PhD thes

Post by Vijay »

Except that genders and ages aren't defined by writing style.

Chuma
Avisaru
Avisaru
Posts: 387
Joined: Sat Oct 28, 2006 9:01 pm
Location: Hyperborea

Re: Writing styles of genders, ages, and authors: A PhD thes

Post by Chuma »

Well that took a surprisingly political turn.

The methods can definitely be used by governments, fascist or otherwise. And you can rest assured that various government agencies with hazy ideas of the right to privacy are researching the same things right now, along with IT companies and others, who have no intention of letting you know what information they are extracting or how. But this is publicly funded, publicly available research, that can be used by everyone according to their needs, so to speak. And naturally anyone who prefers not to be identified would need more or less the same kind of knowledge, used in reverse.

So, yes, there is a connection between the subject matter and serious political questions of surveillance and privacy, but the implication that my research would benefit oppressive governments seems rather misguided.
Salmoneus wrote:How are you able to eliminate the sociological element?
I'm fairly certain that's impossible, and I'm not sure why it would be useful. Presumably the differences between how (for example) men and women write are at least nearly entirely sociological, so eliminating them would leave... nothing?
Salmoneus wrote:Anglophone culture?
I've tested the system on three languages, all European, so probably a similar culture. Since the methods are all statistical, they can just as easily be applied to any language, although whether it would be more effective in some than others is a question for future work.
Salmoneus wrote:not interesting
I can only hope the grading committee doesn't feel the same. :)
Axiem wrote:it's more trend-analysis than prescriptively working from stereotypes
Yes, and while the immediate goal is to see if people can be identified, looking at the differences can be quite interesting from a sociological perspective. For example, I'm quoting a study by Lakoff, where he discusses at length how the writing differences between genders reflect the social differences. Except he doesn't use any data other than his own opinions...

That said, the topic is computational linguistics, not sociology. The actual, specific differences between genders, ages and individuals are just a curious side note.

Vijay
Smeric
Smeric
Posts: 2244
Joined: Sat Feb 06, 2016 3:25 pm
Location: Austin, TX, USA

Re: Writing styles of genders, ages, and authors: A PhD thes

Post by Vijay »

Chuma wrote:Well that took a surprisingly political turn.
All research is political. ;)

Chuma
Avisaru
Avisaru
Posts: 387
Joined: Sat Oct 28, 2006 9:01 pm
Location: Hyperborea

Re: Writing styles of genders, ages, and authors: A PhD thes

Post by Chuma »

Yes, I just wonder what reactions my master's thesis about triangles would have drawn. Angry Christians calling it an attack on monogamy? Muslims upset that a triangle is half a Star of David? :D

Axiem
Avisaru
Avisaru
Posts: 260
Joined: Tue Oct 22, 2013 8:15 pm

Re: Writing styles of genders, ages, and authors: A PhD thes

Post by Axiem »

Vijay wrote:Except that genders and ages aren't defined by writing style.
No one's claiming that?

Again, it's saying "can we analyze trait X from samples provided by groups A and B, and use that information on a sample of unknown group to relatively reliably predict which group provided that sample?"

So if there are discernible trends that happen to split generally along gender lines (for example, 90% of the female samples in the corpus included at least three question marks, but only 10% of the male samples did), then the question is: can we relatively reliably predict the gender of a sample, for example when you encounter one with at least three question marks?

My point about the horoscopes is basically saying I don't think there can be discernible trends that split based on astrological sign. What this research basically does is ask the question "do people with a sign of Aries use the written language in a significantly different way from people with a sign of Taurus?".

Likewise, it's asking "do male people use the written language in a significantly different way from female people?". If they don't, then the analysis should support this by not being able to guess the gender of a sample's author any better than chance.
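
To make the question-mark example concrete, here is a minimal sketch. Everything in it is made up for illustration: the two toy groups, the function names, and the threshold of three question marks. It also assumes both groups are equally common, which a real corpus may not satisfy.

```python
def fraction_with_trait(samples, min_question_marks=3):
    """Share of samples containing at least `min_question_marks` question marks."""
    hits = sum(1 for text in samples if text.count("?") >= min_question_marks)
    return hits / len(samples)


def guess_group(text, rate_a, rate_b, min_question_marks=3):
    """Guess "A" or "B" for an unknown sample from a single trait.

    Returns None when the trait is equally common in both groups,
    i.e. when it gives no information beyond chance.
    """
    if rate_a == rate_b:
        return None
    has_trait = text.count("?") >= min_question_marks
    # Pick the group in which the observed outcome is more common.
    if has_trait:
        return "A" if rate_a > rate_b else "B"
    return "A" if rate_a < rate_b else "B"


# Hypothetical toy corpus sorted into two buckets.
group_a = ["Really? Why? How?", "What? When? Where? Who?", "Is it? Is it not? Sure?", "Fine."]
group_b = ["No.", "Yes.", "Maybe?", "Huh? What? Eh?"]

rate_a = fraction_with_trait(group_a)
rate_b = fraction_with_trait(group_b)
```

With these toy groups the trait shows up in three of four A samples and one of four B samples, so an unknown text with three or more question marks is guessed as A. If the rates were equal, the function would decline to guess, which is the "no better than chance" outcome.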



...at this point, I'm quite curious as to the content of the paper, I must admit. Like other research in this vein, did you produce a tool that one can provide a text and be given a guess as to the author's age/gender/etc?

Chuma
Avisaru
Avisaru
Posts: 387
Joined: Sat Oct 28, 2006 9:01 pm
Location: Hyperborea

Re: Writing styles of genders, ages, and authors: A PhD thes

Post by Chuma »

Axiem wrote:What this research basically does is ask the question "do people with a sign of Aries use the written language in a significantly different way from people with a sign of Taurus?".
Right. And since it's fairly obvious that they don't (to any measurable degree), the main reason for including astrological sign is actually to check that the method doesn't report unreasonable claims.
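
One way such a sanity check could be phrased, as a sketch only: compare the observed accuracy with what uniform random guessing would give, using an exact one-sided binomial tail. The function name and the numbers in the usage note are hypothetical, not results from the thesis.

```python
from math import comb


def p_value_above_chance(n_texts, n_correct, n_classes):
    """Probability of getting at least `n_correct` right out of `n_texts`
    by guessing uniformly at random among `n_classes` categories."""
    p = 1 / n_classes
    return sum(
        comb(n_texts, k) * p**k * (1 - p) ** (n_texts - k)
        for k in range(n_correct, n_texts + 1)
    )
```

For twelve signs and 120 test texts, chance would get about 10 right. A method scoring around 10 gives a large p-value (nothing detected, as one would hope for astrology), while a score of 30 would give a tiny one and suggest the method is picking up on something it shouldn't.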
Axiem wrote:did you produce a tool that one can provide a text and be given a guess as to the author's age/gender/etc?
In a sense, yes. I've made some programs that do that, but it's not really a user-friendly tool. I wouldn't recommend anyone else try using it. I've been thinking of developing it into something more distributable, but then someone would probably have to pay me for that. It also seems that, at least when it comes to categories like age and gender, context is important; the gender characteristics are different depending on what kind of texts you're looking at. So if you want to find out something about, say, a forum post(er), you'd need to compare with other data from forums, not texts from books or newspaper articles. So in effect, it's not enough to provide an unknown text, you also need to provide the suitable comparison data.

That, and the amount of data in the unknown text should be at least 10 000 words or so.

alynnidalar
Avisaru
Avisaru
Posts: 491
Joined: Fri Aug 15, 2014 9:35 pm
Location: Michigan, USA

Re: Writing styles of genders, ages, and authors: A PhD thes

Post by alynnidalar »

Chuma wrote:It also seems that, at least when it comes to categories like age and gender, context is important
That's interesting. So, if you were to compare e.g. a forum post to a combined selection of texts, books, and newspaper articles, you'd get worse results than comparing simply forum posts alone? That makes sense, but it isn't something I realized before.

I'm curious about something else--did you end up with any data from trans people? I'm wondering how/if that might impact analysis.
I generally forget to say, so if it's relevant and I don't mention it--I'm from Southern Michigan and speak Inland North American English. Yes, I have the Northern Cities Vowel Shift; no, I don't have the cot-caught merger; and it is called pop.

mèþru
Smeric
Smeric
Posts: 1984
Joined: Thu Oct 29, 2015 6:44 am
Location: suburbs of Mrin

Re: Writing styles of genders, ages, and authors: A PhD thes

Post by mèþru »

Now, a new life goal (xkcd or smbc style): I will buy a newspaper and print articles in internet slang just to make data analysis easier.

Chuma
Avisaru
Avisaru
Posts: 387
Joined: Sat Oct 28, 2006 9:01 pm
Location: Hyperborea

Re: Writing styles of genders, ages, and authors: A PhD thes

Post by Chuma »

alynnidalar wrote:So, if you were to compare e.g. a forum post to a combined selection of texts, books, and newspaper articles, you'd get worse results than comparing simply forum posts alone?
Yes, pretty sure, although I haven't done that experiment myself. My main focus has been on identifying specific authors, so that's been a little hard to test.

One thing I have found is that more informal text is easier to identify, so forum text is the easiest. With enough data per author (tens of thousands of words), you can spot a forum writer out of hundreds, likely thousands, with near-perfect accuracy, using very simple methods.
alynnidalar wrote:I'm curious about something else--did you end up with any data from trans people?
Not that I'm aware of. The blog corpus I've analysed is only marked for two genders, and even if it did mark other options, there probably wouldn't be enough data to make a good comparison. But it's an interesting question.

LinguistCat
Avisaru
Avisaru
Posts: 250
Joined: Thu Apr 13, 2006 7:24 pm
Location: Off on the side

Re: Writing styles of genders, ages, and authors: A PhD thes

Post by LinguistCat »

Assuming you stuck to binary trans people (trans men and trans women exclusively), my bet is they would pattern most closely with their gender identity but with some minor traits from their assigned at birth gender. It might be more difficult including nonbinary trans people, but it would be interesting to see if you could tell nonbinary people from both men and women. I suppose that would have to be another study though.
The stars are an ocean. Your breasts, are also an ocean.

gach
Avisaru
Avisaru
Posts: 472
Joined: Mon Feb 17, 2003 11:03 am
Location: displaced from Helsinki

Re: Writing styles of genders, ages, and authors: A PhD thes

Post by gach »

Chuma wrote:
Salmoneus wrote:How are you able to eliminate the sociological element?
I'm fairly certain that's impossible, and I'm not sure why it would be useful. Presumably the differences between how (for example) men and women write are at least nearly entirely sociological, so eliminating them would leave... nothing?
I'm also not sure that there would be a particularly central non-sociological element to text classification. Clearly the sociological elements are at least the overwhelmingly most relevant factors for text classification, for example in tasks like identifying a sample of texts produced by a given person or estimating the likelihoods that a given text was produced by any one person from a group of people. Whether these sociological factors then tell us anything about the physiology etc. of these people is very much a secondary derivative result.

What sort of methodology do you use for the text classification? I'm assuming that you take some sort of a probabilistic approach.
Chuma wrote:
Axiem wrote:did you produce a tool that one can provide a text and be given a guess as to the author's age/gender/etc?
In a sense, yes. I've made some programs that do that, but it's not really a user-friendly tool. I wouldn't recommend anyone else try using it. I've been thinking of developing it to something more distributable, but then someone would probably have to pay me for that.
I find that it's pretty typical that when your research result is an algorithm, then that's what you publish, and you leave a sleek productisable implementation to any interested readers. You'll, of course, have your own code, but that's very likely tailored to meet the needs of your own research. Often there's no time or resources for anything else.

Chuma
Avisaru
Avisaru
Posts: 387
Joined: Sat Oct 28, 2006 9:01 pm
Location: Hyperborea

Re: Writing styles of genders, ages, and authors: A PhD thes

Post by Chuma »

The basic approach is quite simple: For each candidate (a person or a category), count how often they use certain words, and compare that with the unknown text. There are various ways of making the comparison, which I'm not going into. One thing I look at is using elements of grammar instead of words; people have tried that before, and generally found that it's not that useful, but my results suggest that it has at least one significant advantage when identifying an author - it's less topic-dependent. I argue that most studies have not sufficiently accounted for topic dependence, which means that they have likely overestimated the accuracies of their methods.
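
A minimal sketch of the counting-and-comparing idea, not the thesis code. It assumes relative word frequencies as the profile and cosine similarity as the comparison; the actual comparison measures aren't spelled out here, so that choice is a stand-in. The function names and the toy candidates are made up.

```python
import math
import re
from collections import Counter


def profile(text):
    """Relative frequency of each word in a text."""
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(words)
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}


def similarity(p, q):
    """Cosine similarity of two frequency profiles (0 = no shared words, 1 = same profile)."""
    dot = sum(value * q.get(word, 0.0) for word, value in p.items())
    norm_p = math.sqrt(sum(v * v for v in p.values()))
    norm_q = math.sqrt(sum(v * v for v in q.values()))
    if norm_p == 0 or norm_q == 0:
        return 0.0
    return dot / (norm_p * norm_q)


def most_similar(unknown_text, candidates):
    """Name of the candidate whose known text is most similar to the unknown text."""
    unknown_profile = profile(unknown_text)
    return max(
        candidates,
        key=lambda name: similarity(unknown_profile, profile(candidates[name])),
    )


# Hypothetical known texts for two candidates.
candidates = {
    "alice": "well i reckon the thing is that the cat sat on the mat",
    "bob": "lol omg that is so so funny lol",
}
guess = most_similar("omg lol that cat is so funny", candidates)
```

Swapping `profile` for one that counts grammatical elements (part-of-speech tags, say) instead of words would leave the rest unchanged, which is roughly where the topic-dependence point comes in.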

My method gives a similarity rating for two compared texts, so you just have to look at which candidate is the most similar to the unknown text. An advantage over some other methods is that you can also use the similarity for more detailed results. Some methods can only answer "which is the most likely candidate", but this can also potentially answer for example "is this candidate in the top three", "is there any likely candidate at all", or, given an a priori probability, even "what is the probability of this being the correct candidate".
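
Those extra questions can be sketched as follows, given a similarity score per candidate. The scores, names, and threshold are invented for illustration. The last function makes a loud assumption: it treats each score as proportional to the likelihood of the text given that candidate, which is a simplification for the sketch and not something established above.

```python
def rank(scores):
    """Candidate names ordered from most to least similar."""
    return sorted(scores, key=scores.get, reverse=True)


def in_top_k(scores, name, k=3):
    """Is this candidate among the k most similar?"""
    return name in rank(scores)[:k]


def any_likely(scores, threshold):
    """Is there any candidate at all whose similarity reaches the threshold?"""
    return max(scores.values()) >= threshold


def posterior(scores, priors):
    """Bayes-style update: prior times score, normalised to sum to 1.

    ASSUMPTION: the score is used directly as a likelihood-like weight.
    """
    weighted = {name: scores[name] * priors[name] for name in scores}
    total = sum(weighted.values())
    if total == 0:
        return dict(priors)  # scores carry no information; fall back to the priors
    return {name: weight / total for name, weight in weighted.items()}


# Hypothetical similarity scores for one unknown text.
scores = {"alice": 0.26, "bob": 0.87, "carol": 0.41, "dave": 0.12}
uniform = {name: 1 / len(scores) for name in scores}
probabilities = posterior(scores, uniform)
```

A method that only outputs the single best candidate can answer the first question but none of the others, since those need the scores themselves.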

In this case, the algorithm isn't the main result. It's a simple algorithm, I've just tested it in new ways.

My thesis also includes some mostly unrelated research on automata theory, in which there are actual algorithms being presented.
