Writing styles of genders, ages, and authors: A PhD thesis

Posted: Thu Jun 01, 2017 7:07 am
by Chuma
I'm finally finishing my thesis, yay! It's in computer science, but leaning heavily towards linguistics. The main focus is on text classification; that is, how can we get a machine to statistically analyse a text and figure out, for example, who wrote the text, the author's gender or age, whether the text is fact or fiction, or so on.

Highlights include:

- why most methods for classification are not as accurate as they think, and what we can do about it
- how to detect trolls on web forums
- whether it's possible to accurately guess a person's age, gender, profession, and astrological sign
- apparently women and children can be identified by the same stylistic patterns
- lots of pretty graphs

It will be printed by the end of next week, and I'll put it online for those who want to read it.

But if anyone wants to take a look right away, and maybe let me know if I've missed anything, that would be great. Can't really upload it while it's unfinished, but let me know and I can send it to you.

And thanks to all of you on the board for keeping me interested in language all these years!

1984 ways to detect terrorism...

Posted: Fri Jun 02, 2017 11:12 am
by xxx
Bouh... another skynet agent...

Re: Writing styles of genders, ages, and authors: A PhD thes

Posted: Fri Jun 02, 2017 12:04 pm
by Chuma
Actually, I was originally funded by the military, who wanted to use the technology to spy on people on the internet. But that didn't work out, so I guess that makes me a rebel now?

Re: Writing styles of genders, ages, and authors: A PhD thes

Posted: Fri Jun 02, 2017 3:32 pm
by Ars Lande
Chuma wrote:I'm finally finishing my thesis, yay! It's in computer science, but leaning heavily towards linguistics. The main focus is on text classification; that is, how can we get a machine to statistically analyse a text and figure out, for example, who wrote the text, the author's gender or age, whether the text is fact or fiction, or so on.

Congratulations! I'm afraid I know pretty much nothing about corpus analysis - so I'd probably be a hindrance rather than a help. It does sound fascinating though, and I'll be happy to read it when it's ready.

I'd be interested in some highlights though :) so what can be detected by state-of-the-art techniques? Did the astrological sign thing work out?

Re: Writing styles of genders, ages, and authors: A PhD thes

Posted: Fri Jun 02, 2017 4:27 pm
by Axiem
Ars Lande wrote:Did the astrological sign thing work out?
I will be thoroughly shocked if it's possible to guess horoscope sign from writing style.

Re: Writing styles of genders, ages, and authors: A PhD thes

Posted: Sat Jun 03, 2017 7:14 am
by Chuma
It does indeed not seem to be possible. Age and gender can be done surprisingly well, but job is so far mostly unsuccessful, and astrological sign shows no sign of being possible.

Re: Writing styles of genders, ages, and authors: A PhD thes

Posted: Sat Jun 03, 2017 9:44 am
by Salmoneus
How are you able to eliminate the sociological element?

That is, can you actually detect women, or are you just detecting people who act in a way that is stereotypical for women in modern Anglophone culture? The latter is obviously possible, but also not interesting. The former would be interesting, but I'm not sure methodologically how you could prove that that's what you were doing? I suppose for a start you would need to include samples from a wide range of cultures.

It does seem a surprisingly fascistic right-wing project you have there.

Re: Writing styles of genders, ages, and authors: A PhD thes

Posted: Sat Jun 03, 2017 10:25 am
by mèþru
Well gender is a social construct. I imagine that the child detection also works with children of Western cultures only.

Re: Writing styles of genders, ages, and authors: A PhD thes

Posted: Sat Jun 03, 2017 11:22 am
by Axiem
My assumption is that this, along with other things along these lines I've seen, is implicitly bounded by general Anglophone culture. That is, instead of directly guessing the author's sex, it's measuring how closely the author's style (etc.) matches the trends in style (etc.) found in a corpus sorted into two buckets.

In a roundabout way, it's seeing whether people in the different categories (such as sexes, ages, or horoscope signs) do in general tend to have a distinct style (etc.) in their writing. Assuming it works like other things like it I've seen, it's more trend-analysis than prescriptively working from stereotypes.

I don't see how this is particularly right-wing at all, though.

Re: Writing styles of genders, ages, and authors: A PhD thes

Posted: Sat Jun 03, 2017 1:27 pm
by mèþru
Axiem wrote:I don't see how this is particularly right-wing at all, though.
Gathering information about people from their writings? Surveillance? It's not really right-wing. I think Salmoneus is thinking of authoritarianism, which can be either left or right.

Re: Writing styles of genders, ages, and authors: A PhD thes

Posted: Sat Jun 03, 2017 1:36 pm
by Vijay
Axiem wrote:I don't see how this is particularly right-wing at all, though.
Categorizing people into neat little demographic groups based on personal data collected and compiled from surveys is pretty characteristically right-wing and doesn't make sense given how diverse people actually are. (People don't really fit into such neat little demographic groups.)

Re: Writing styles of genders, ages, and authors: A PhD thes

Posted: Sat Jun 03, 2017 1:58 pm
by Axiem
By that logic, pretty much all of psychology and sociology is right-wing.

Re: Writing styles of genders, ages, and authors: A PhD thes

Posted: Sat Jun 03, 2017 2:07 pm
by Vijay
Except that genders and ages aren't defined by writing style.

Re: Writing styles of genders, ages, and authors: A PhD thes

Posted: Sat Jun 03, 2017 6:46 pm
by Chuma
Well that took a surprisingly political turn.

The methods can definitely be used by governments, fascist or otherwise. And you can rest assured that various government agencies with hazy ideas of the right to privacy are researching the same things right now, along with IT companies and others, who have no intention of letting you know what information they are extracting or how. But this is publicly funded, publicly available research, that can be used by everyone according to their needs, so to speak. And naturally anyone who prefers not to be identified would need more or less the same kind of knowledge, used in reverse.

So, yes, there is a connection between the subject matter and serious political questions of surveillance and privacy, but the implication that my research would benefit oppressive governments seems rather misguided.
Salmoneus wrote:How are you able to eliminate the sociological element?
I'm fairly certain that's impossible, and I'm not sure why it would be useful. Presumably the differences between how (for example) men and women write are at least nearly entirely sociological, so eliminating them would leave... nothing?
Salmoneus wrote:Anglophone culture?
I've tested the system on three languages, all European, so probably a similar culture. Since the methods are all statistical, they can just as easily be applied to any language, although whether it would be more effective in some than others is a question for future work.
Salmoneus wrote:not interesting
I can only hope the grading committee doesn't feel the same. :)
Axiem wrote:it's more trend-analysis than prescriptively working from stereotypes
Yes, and while the immediate goal is to see if people can be identified, looking at the differences can be quite interesting from a sociological perspective. For example, I'm quoting a study by Lakoff, where she discusses at length how the writing differences between genders reflect the social differences. Except she doesn't use any data other than her own opinions...

That said, the topic is computational linguistics, not sociology. The actual, specific differences between genders, ages and individuals are just a curious side note.

Re: Writing styles of genders, ages, and authors: A PhD thes

Posted: Sat Jun 03, 2017 6:55 pm
by Vijay
Chuma wrote:Well that took a surprisingly political turn.
All research is political. ;)

Re: Writing styles of genders, ages, and authors: A PhD thes

Posted: Sat Jun 03, 2017 7:24 pm
by Chuma
Yes, I just wonder what reactions my master's thesis about triangles would have drawn. Angry Christians calling it an attack on monogamy? Muslims upset that a triangle is half a Star of David? :D

Re: Writing styles of genders, ages, and authors: A PhD thes

Posted: Sat Jun 03, 2017 7:28 pm
by Axiem
Vijay wrote:Except that genders and ages aren't defined by writing style.
No one's claiming that?

Again, it's saying "can we analyze trait X from samples provided by groups A and B, and use that information on a sample of unknown group to relatively reliably predict which group provided that sample?"

So if there are discernible trends that happen to split generally along gender lines (for example, 90% of the female samples in the corpus included at least three question marks, but only 10% of the male samples did), then the question is whether we can relatively reliably predict the gender of a new sample, for example one with at least three question marks.
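To put rough numbers on that example, here is a minimal sketch using Bayes' rule. The 90% and 10% figures are the made-up ones from the example above; the function name and the prior (the share of female authors among the texts you expect to see) are assumptions added purely for illustration, not anything taken from the thesis.

```python
def posterior(prior_a, likelihood_a, likelihood_b):
    """P(group A | feature present) for two groups, by Bayes' rule.

    prior_a      -- assumed share of group A among all samples
    likelihood_a -- share of group A samples showing the feature
    likelihood_b -- share of group B samples showing the feature
    """
    prior_b = 1.0 - prior_a
    evidence = likelihood_a * prior_a + likelihood_b * prior_b
    return likelihood_a * prior_a / evidence


# Feature: "at least three question marks" (hypothetical rates 90% vs 10%).
print(posterior(0.5, 0.9, 0.1))  # equal numbers of female and male samples
print(posterior(0.2, 0.9, 0.1))  # only 20% of samples are female
```

With equal priors the guess "female" is right about 90% of the time, but if only a fifth of the samples are female, the same feature gives roughly 69%. So how reliable the prediction is depends on the mix of authors as well as on the trend itself.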

My point about the horoscopes is basically saying I don't think there can be discernible trends that split based on astrological sign. What this research basically does is ask the question "do people with a sign of Aries use the written language in a significantly different way from people with a sign of Taurus?".

Likewise, it's asking "do male people use the written language in a significantly different way from female people?". If they don't, then the analysis should support this by not being able to guess the gender authorship of a sample with any likelihood beyond standard guessing.
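One hedged way to make "beyond standard guessing" concrete is a one-sided binomial test: how likely is it to get at least this many texts right by pure guessing? The sketch below assumes equally sized classes, and both the function name and the numbers are hypothetical, not results from the thesis.

```python
from math import comb


def p_at_least(correct, trials, n_classes):
    """Probability of getting at least `correct` right out of `trials`
    by pure guessing among `n_classes` equally likely classes."""
    p = 1.0 / n_classes
    return sum(
        comb(trials, k) * p**k * (1 - p) ** (trials - k)
        for k in range(correct, trials + 1)
    )


# Hypothetical numbers: 120 test texts and 12 astrological signs,
# so guessing is expected to get about 10 right.
print(p_at_least(10, 120, 12))  # 10 right: entirely consistent with guessing
print(p_at_least(30, 120, 12))  # 30 right: very hard to explain by guessing
```

If a category like astrological sign came out with a tiny probability here, that would be a warning sign about the method or the data rather than evidence for astrology.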



...at this point, I'm quite curious as to the content of the paper, I must admit. Like other research in this vein, did you produce a tool that one can provide a text and be given a guess as to the author's age/gender/etc?

Re: Writing styles of genders, ages, and authors: A PhD thes

Posted: Sat Jun 03, 2017 7:51 pm
by Chuma
Axiem wrote:What this research basically does is ask the question "do people with a sign of Aries use the written language in a significantly different way from people with a sign of Taurus?".
Right. And since it's fairly obvious that they don't (to any measurable degree), the main reason for including astrological sign is actually to check that the method doesn't report unreasonable claims.
Axiem wrote:did you produce a tool that one can provide a text and be given a guess as to the author's age/gender/etc?
In a sense, yes. I've made some programs that do that, but it's not really a user-friendly tool. I wouldn't recommend anyone else try using it. I've been thinking of developing it to something more distributable, but then someone would probably have to pay me for that. It also seems that, at least when it comes to categories like age and gender, context is important; the gender characteristics are different depending on what kind of texts you're looking at. So if you want to find out something about, say, a forum post(er), you'd need to compare with other data from forums, not texts from books or newspaper articles. So in effect, it's not enough to provide an unknown text, you also need to provide the suitable comparison data.

That, and the amount of data in the unknown text should be at least 10 000 words or so.

Re: Writing styles of genders, ages, and authors: A PhD thes

Posted: Sat Jun 03, 2017 8:56 pm
by alynnidalar
Chuma wrote:It also seems that, at least when it comes to categories like age and gender, context is important
That's interesting. So, if you were to compare e.g. a forum post to a combined selection of texts, books, and newspaper articles, you'd get worse results than comparing simply forum posts alone? That makes sense, but it isn't something I realized before.

I'm curious about something else--did you end up with any data from trans people? I'm wondering how/if that might impact analysis.

Re: Writing styles of genders, ages, and authors: A PhD thes

Posted: Sat Jun 03, 2017 9:02 pm
by mèþru
Now, a new life goal (xkcd or smbc style): I will buy a newspaper and print articles in internet slang just to make data analysis easier.

Re: Writing styles of genders, ages, and authors: A PhD thes

Posted: Sun Jun 04, 2017 6:23 am
by Chuma
alynnidalar wrote:So, if you were to compare e.g. a forum post to a combined selection of texts, books, and newspaper articles, you'd get worse results than comparing simply forum posts alone?
Yes, pretty sure, although I haven't done that experiment myself. My main focus has been on identifying specific authors, so that's been a little hard to test.

One thing I have found is that more informal text is easier to identify, so forum text is the easiest. With enough data per author (tens of thousands of words), you can spot a forum writer out of hundreds, likely thousands, with near-perfect accuracy, using very simple methods.
alynnidalar wrote:I'm curious about something else--did you end up with any data from trans people?
Not that I'm aware of. The blog corpus I've analysed is only marked for two genders, and even if it did mark other options, there probably wouldn't be enough data to make a good comparison. But it's an interesting question.

Re: Writing styles of genders, ages, and authors: A PhD thes

Posted: Sun Jun 04, 2017 11:52 am
by LinguistCat
Assuming you stuck to binary trans people (trans men and trans women exclusively), my bet is they would pattern most closely with their gender identity but with some minor traits from their assigned at birth gender. It might be more difficult including nonbinary trans people, but it would be interesting to see if you could tell nonbinary people from both men and women. I suppose that would have to be another study though.

Re: Writing styles of genders, ages, and authors: A PhD thes

Posted: Sun Jun 04, 2017 5:10 pm
by gach
Chuma wrote:
Salmoneus wrote:How are you able to eliminate the sociological element?
I'm fairly certain that's impossible, and I'm not sure why it would be useful. Presumably the differences between how (for example) men and women write are at least nearly entirely sociological, so eliminating them would leave... nothing?
I'm also not sure that there would be a particularly central non-sociological element to text classification. Clearly the sociological elements are by far the most relevant factors for text classification, for example in tasks like identifying a sample of texts produced by a given person or estimating the likelihood that a given text was produced by any one person from a group of people. Whether these sociological factors then tell us anything about the physiology etc. of these people is very much a secondary, derivative result.

What sort of methodology do you use for the text classification? I'm assuming that you take some sort of a probabilistic approach.
Chuma wrote:
Axiem wrote:did you produce a tool that one can provide a text and be given a guess as to the author's age/gender/etc?
In a sense, yes. I've made some programs that do that, but it's not really a user-friendly tool. I wouldn't recommend anyone else try using it. I've been thinking of developing it to something more distributable, but then someone would probably have to pay me for that.
I find it's pretty typical that when your research result is an algorithm, that's what you publish, and you leave a sleek productisable implementation to any interested readers. You'll, of course, have your own code, but that's very likely tailored to meet the needs of your own research. Often there's no time or resources for anything else.

Re: Writing styles of genders, ages, and authors: A PhD thes

Posted: Sun Jun 04, 2017 5:39 pm
by Chuma
The basic approach is quite simple: For each candidate (a person or a category), count how often they use certain words, and compare that with the unknown text. There are various ways of making the comparison, which I'm not going into. One thing I look at is using elements of grammar instead of words; people have tried that before, and generally found that it's not that useful, but my results suggest that it has at least one significant advantage when identifying an author - it's less topic-dependent. I argue that most studies have not sufficiently accounted for topic dependence, which means that they have likely overestimated the accuracies of their methods.
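For readers who want something concrete, the counting-and-comparing idea could be sketched roughly like this. The post deliberately leaves the comparison method open, so the cosine similarity used here is only a stand-in chosen for illustration, and the tokenisation, function names, and toy texts are all assumptions rather than the thesis's actual setup.

```python
import math
import re
from collections import Counter


def profile(text):
    """Relative frequency of each word in a text."""
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(words)
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}


def cosine(p, q):
    """Cosine similarity between two frequency profiles (0 = nothing shared)."""
    dot = sum(value * q.get(word, 0.0) for word, value in p.items())
    norm_p = math.sqrt(sum(v * v for v in p.values()))
    norm_q = math.sqrt(sum(v * v for v in q.values()))
    if norm_p == 0 or norm_q == 0:
        return 0.0
    return dot / (norm_p * norm_q)


def rank_candidates(candidates, unknown):
    """Candidates sorted from most to least similar to the unknown text."""
    u = profile(unknown)
    scores = {name: cosine(profile(text), u) for name, text in candidates.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Toy data; real use would need far more text per candidate.
candidates = {
    "a": "well i think that is really quite nice i think",
    "b": "the results of the analysis show the effect of the method",
}
print(rank_candidates(candidates, "i think that is nice really"))
```

Counting all words like this is exactly the topic-dependent setup warned about above; swapping the word tokens for grammatical elements would leave the rest of the sketch unchanged.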

My method gives a similarity rating for two compared texts, so you just have to look at which candidate is the most similar to the unknown text. An advantage over some other methods is that you can also use the similarity for more detailed results. Some methods can only answer "which is the most likely candidate", but this can also potentially answer for example "is this candidate in the top three", "is there any likely candidate at all", or, given an a priori probability, even "what is the probability of this being the correct candidate".
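As a rough illustration of what keeping the similarity ratings buys you, the three questions could be answered along these lines. Everything here is hypothetical: the scores, the threshold, and above all the step that treats a non-negative similarity as proportional to a likelihood, which is a strong simplifying assumption and not necessarily how the thesis turns similarities into probabilities.

```python
def in_top_k(scores, name, k=3):
    """Is this candidate among the k most similar?"""
    ranked = sorted(scores, key=scores.get, reverse=True)
    return name in ranked[:k]


def any_likely(scores, threshold):
    """Does any candidate reach a similarity we are willing to call a match?"""
    return max(scores.values()) >= threshold


def posterior(scores, priors):
    """Combine similarities with a priori probabilities.

    Assumption: each similarity is treated as proportional to the
    likelihood of the unknown text given that candidate.
    """
    weighted = {name: scores[name] * priors[name] for name in scores}
    total = sum(weighted.values())
    return {name: w / total for name, w in weighted.items()}


# Hypothetical similarity ratings for four candidates.
scores = {"ann": 0.8, "bob": 0.6, "cy": 0.3, "dee": 0.1}
uniform = {name: 0.25 for name in scores}

print(in_top_k(scores, "cy"))      # cy is third, so within the top three
print(any_likely(scores, 0.9))     # nobody reaches 0.9 with these scores
print(posterior(scores, uniform))  # probabilities that sum to 1
```

A method that only returns the single best candidate cannot answer the second question at all, which matters when the real author may not be among the candidates.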

In this case, the algorithm isn't the main result. It's a simple algorithm, I've just tested it in new ways.

My thesis also includes some mostly unrelated research on automata theory, in which there are actual algorithms being presented.