Writer idiosyncrasies

Chuma
Avisaru
Posts: 387
Joined: Sat Oct 28, 2006 9:01 pm
Location: Hyperborea

Writer idiosyncrasies

Post by Chuma »

I'm working on a research project about automatic author identification. There's a program that looks at a great number of texts by known authors, and then uses machine learning to guess who has written other texts. There are all sorts of features the program can look at in the texts, but the ones commonly used are rather "stupid" from a linguistic perspective. Typical examples include counts of words and counts of sequences of words. We are adding slightly more complicated things by looking at syntax.

But what I would like right now is ideas on more intelligent linguistic features, perhaps specific to English. What features would you use to identify a writer?

Counting words in general has the considerable disadvantage of often identifying subject rather than author, and it would be desirable to be able to identify an author even when he writes about a very different subject. My first idea is to use particular words which some people use much more than others, like "actually" and "basically", and dialectal or individual expressions like "off of" and "to not". Since we also analyse syntax, it doesn't have to be just words, it can also be syntactic structures. Pairs of synonyms (or equivalent structures) might also be a good feature.
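A rough sketch of the marker-word idea, in case it helps to make it concrete. The marker list holds only the four examples mentioned above, and the function name `marker_rates` is made up for illustration; a real feature set would need a much longer list.

```python
import re

# Hypothetical marker list: just the examples from the post, not a tested set.
MARKERS = ["actually", "basically", "off of", "to not"]

def marker_rates(text, markers=MARKERS):
    """Return occurrences of each marker per 1000 word tokens."""
    tokens = re.findall(r"[a-z']+", text.lower())
    total = len(tokens) or 1          # avoid division by zero on empty input
    joined = " ".join(tokens)         # punctuation removed, single spaces
    rates = {}
    for marker in markers:
        hits = len(re.findall(r"\b" + re.escape(marker) + r"\b", joined))
        rates[marker] = 1000.0 * hits / total
    return rates
```

Normalising by text length matters here, since otherwise longer texts would simply look like heavier users of every marker.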

Any ideas?
Last edited by Chuma on Fri Jun 01, 2012 4:11 am, edited 1 time in total.

Gulliver
Avisaru
Posts: 433
Joined: Mon May 05, 2003 2:58 pm
Location: The West Country

Re: Writer idiosyncrasies

Post by Gulliver »

Punctuation use? That's part of written language as much as anything else is. I tend to over-hyphenate, and quite often write slightly too long sentences, separated by awkward commas.

I'd also look for things like "different from"/"different to" and "try and"/"try to".

Words of a Germanic origin vs words of a Latin/Romance origin? Many almost-pairs have different connotations and could be preferred by a writer ("motherhood"/"maternity", "nowadays"/"at present").
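For what it's worth, a pair-preference feature along these lines could be sketched as below. The pair list and the name `pair_preferences` are only illustrative, and plain string matching will also count cases where the two variants are not really interchangeable.

```python
import re

# Illustrative pairs taken from this post; a usable list would be far longer.
PAIRS = [
    ("different from", "different to"),
    ("try and", "try to"),
    ("nowadays", "at present"),
]

def pair_preferences(text, pairs=PAIRS):
    """For each pair, return the share of the first variant among all
    occurrences of either variant, or None if neither occurs."""
    norm = " ".join(re.findall(r"[a-z']+", text.lower()))
    prefs = {}
    for first, second in pairs:
        n_first = len(re.findall(r"\b" + re.escape(first) + r"\b", norm))
        n_second = len(re.findall(r"\b" + re.escape(second) + r"\b", norm))
        total = n_first + n_second
        prefs[(first, second)] = n_first / total if total else None
    return prefs
```

Returning None rather than 0 when neither variant appears keeps "never uses the first form" separate from "no evidence either way".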

Ars Lande
Avisaru
Posts: 382
Joined: Thu Oct 14, 2010 7:34 am
Location: Paris

Re: Writer idiosyncrasies

Post by Ars Lande »

There's still something to be said about counting words.

The idea is to compare a given work to comparable works, within the same genre and subject, and to search for words that are more commonly used than they are in the reference corpus.
You'd have the same problem with syntax: there is considerable variation between the syntax of an academic paper and that of a novel.

Torco
Smeric
Posts: 2372
Joined: Thu Aug 30, 2007 10:45 pm
Location: Santiago de Chile

Re: Writer idiosyncrasies

Post by Torco »

This is really uninspired, but I've noticed that authors tend to use specific sorts of phrases, or phrasal templates: stuff like "he *verb*-ed [something] sardonically". I think that looking with special interest at the beginning and ending of a phrase could be interesting in this manner:

A good way to go about this would be to use some sort of clustering algorithm, most likely a hierarchical one, on phrases, feeding it the first and last word of each phrase and having it identify commonalities. This could be generalized to searching for sequences of parts of speech, something like "the most common structure in A is pronoun-verb-preposition-noun-adverb", whereas that same sequence almost never appears in B.

Authors not only use different words, but they also use them differently. Perhaps you can, I mean, distinguish author A from author B not from the relative frequency of the word "perhaps", but from the environment that word appears in: maybe someone uses 94% of their "perhaps" right after a quotation mark, whereas another uses it mostly before adverbs and adjectives, a la "gazing at her bosom in a deep, perhaps religious trance".
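The "environment of a word" idea could be sketched roughly like this. It only looks at the single token on either side, and the function name `contexts` and the simple tokenizer are assumptions for illustration; telling adverbs from adjectives would need a part-of-speech tagger on top.

```python
import re
from collections import Counter

def contexts(text, target, side="before"):
    """Count the tokens found immediately before (or after) `target`.
    Words and individual punctuation marks are both kept as tokens."""
    tokens = re.findall(r"[A-Za-z']+|[^\sA-Za-z']", text)
    lowered = [tok.lower() for tok in tokens]
    offset = -1 if side == "before" else 1
    counts = Counter()
    for i, tok in enumerate(lowered):
        j = i + offset
        if tok == target and 0 <= j < len(lowered):
            counts[lowered[j]] += 1
    return counts
```

Dividing each count by the total number of occurrences of the target would give the kind of percentage mentioned above.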

Christopher Schröder
Avisaru
Posts: 310
Joined: Wed Jul 30, 2008 6:05 pm

Re: Writer idiosyncrasies

Post by Christopher Schröder »

If we count the Differences between these two Paragraphs —

There was nobody inside but a miserable shoeless criminal, who had been taken up for playing the flute, and who, the offence against society having been clearly proved, had been very properly committed by Mr. Fang to the House of Correction for one month; with the appropriate and amusing remark that since he had so much breath to spare, it would be more wholesomely expended on the treadmill than in a musical instrument. He made no answer: being occupied mentally bewailing the loss of the flute, which had been confiscated for the use of the county: so Nancy passed on to the next cell, and knocked there.

— Charles Dickens, Oliver Twist

Then she began to set the room in order, for it was the sitting-room as well as the kitchen. She shook the mats out at the front-door and put them straight; the hearthrug was a rabbit-skin. She dusted the clock and the ornaments on the mantelpiece, and she polished and rubbed the tables and chairs.

— Beatrix Potter, The Tale of the Pie and the Patty Pan

— we have the following.

1. Sentence Length — though separated by Semicolons, the Boldface Text in the first Example Paragraph reads as a single Sentence, and is longer than the entire second.

2. Sentence Structure — Dickens's Sentences are much more complicated than Potter's, and, though he separates them by Punctuation (Colons and Semicolons) which would now often be considered to start a new Sentence, the above Paragraph reads as though it contained only two, as opposed to the four in Potter's.

3. Punctuation — the former Text, having been written in 1838, is punctuated very differently — using many more Colons than the latter (and most modern Texts), and Semicolons are used where Commas probably would be today — from the latter, which was first printed in 1905, and which is much closer to modern English Punctuation, though, being a little more than a hundred Years old, has some Differences, as in the Hyphenation of "front-door".

4. Commentary — Dickens more than once offers his personal Thoughts, often sarcastically, on the Actions of the Characters he mentions, cf. "very properly committed", "appropriate and amusing remark"; Potter does not do this, instead presenting the Actions in Order and without Commentary.

5. Vocabulary — though they both tend towards Words of Germanic Origin, Dickens has the more elevated Vocabulary, using Words such as "expend, confiscate, appropriate", whereas Potter would probably have used "use up", "take away", and "suitable"; neither, however, employs uncommon Vocabulary, and neither is overly-colloquial.
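Points 1 and 3 above are the easiest to turn into numbers. A minimal sketch, assuming a naive sentence splitter (it will miscount abbreviations such as "Mr.") and a made-up function name `style_profile`:

```python
import re
from collections import Counter

def style_profile(text):
    """Return mean sentence length in words and the number of commas,
    semicolons and colons per 100 words."""
    # Naive split: a sentence ends at . ! or ? followed by whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    # Words may contain internal apostrophes or hyphens ("front-door").
    words = re.findall(r"[A-Za-z]+(?:['-][A-Za-z]+)*", text)
    n_words = len(words) or 1
    marks = Counter(ch for ch in text if ch in ",;:")
    return {
        "mean_sentence_len": len(words) / max(len(sentences), 1),
        "per_100_words": {m: 100.0 * marks[m] / n_words for m in ",;:"},
    }
```

On texts like the two quoted above, one would expect the semicolon and colon rates to separate the authors more sharply than the comma rate, though that is a guess rather than a measured result.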

In other Parts of the Texts, we also have —

6. Addressing the Audience — Dickens does this more than once, Potter does not.

7. Length — This can be useful if Authors have written multiple Works of similar Length, which is true for these two Authors — Dickens tended towards lengthy Novels which can take some Weeks to read, whereas Potter wrote Stories which may be read in a few Minutes.

A few other Things that come to Mind —

8. Spelling — may be as simple as "Color" v. "Colour", though even modern Editions of older Texts may show certain intelligible, but now nonstandard, Spellings; Frances Burney had "choak" and "controul" for "choke" and "control", Jane Austen sometimes also had "choak", and "teaze" for "tease"; for later Texts, as has already been mentioned, the various regional Differences in English Usage could also be helpful.

9. Capitalisation and Typesetting — if you took the Name off this Post, you would probably know I wrote it, because all the Nouns in it begin in Uppercase; this is not common past the early Eighteenth Century, and is often regularised out of older Texts by Editors, limiting its Helpfulness, as are certain older Typesetting Practices, such as the Use of the long "s".

10. Tense of Narration — Rather self-explanatory.

11. Descriptions — Some Authors — Ann Radcliffe comes to mind — engage in lengthy Descriptions which may ramble on for several Pages.

12. Register — Touched on a Bit with Vocabulary, but Register probably deserves to be mentioned on its own; some Authors write far more formally than others; along with this, various regional Usages.
"Think only of the past as its remembrance gives you pleasure."
-Jane Austen, Pride and Prejudice

Salmoneus
Sanno
Posts: 3197
Joined: Thu Jan 15, 2004 5:00 pm
Location: One of the dark places of the world

Re: Writer idiosyncrasies

Post by Salmoneus »

This is exactly what computers are bad at. You have to spot the patterns that link texts by the same author - but you can't enumerate what those patterns might be (beyond the really simple ones) before you've examined the texts at hand.
Blog: http://vacuouswastrel.wordpress.com/

But the river tripped on her by and by, lapping
as though her heart was brook: Why, why, why! Weh, O weh
I'se so silly to be flowing but I no canna stay!

Wattmann
Avisaru
Posts: 352
Joined: Mon Jan 23, 2012 4:50 am

Re: Writer idiosyncrasies

Post by Wattmann »

Salmoneus wrote: This is exactly what computers are bad at. You have to spot the patterns that link texts by the same author - but you can't enumerate what those patterns might be (beyond the really simple ones) before you've examined the texts at hand.
We still don't have a computer that parses natural language flawlessly, which is what is limiting us.
Warning: Recovering bilingual, attempting trilinguaility. Knowledge of French left behind in childhood. Currently repairing bilinguality. Repair stalled. Above content may be a touch off.

finlay
Sumerul
Posts: 3600
Joined: Mon Dec 22, 2003 12:35 pm
Location: Tokyo

Re: Writer idiosyncrasies

Post by finlay »

Christopher Schröder wrote: If we count the Differences between these two Paragraphs —
please, this capitalization thing is getting tiresome.

hwhatting
Smeric
Posts: 2315
Joined: Fri Sep 13, 2002 2:49 am
Location: Bonn, Germany

Re: Writer idiosyncrasies

Post by hwhatting »

On punctuation, spelling, capitalisation - these are things that are most liable to be corrected by editors, so one cannot be sure whether certain idiosyncrasies are the author's or are dictated by the house rules of a publishing house. These items are also most frequently up-dated / modernised in new editions of classical authors, so for them to be diagnostic, the program needs to be fed text from original editions.

Chuma
Avisaru
Posts: 387
Joined: Sat Oct 28, 2006 9:01 pm
Location: Hyperborea

Re: Writer idiosyncrasies

Post by Chuma »

Gulliver: Those are some great examples. Punctuation use - or at least punctuation frequency - has been used with great success. Those synonyms are exactly what I'm talking about - I just need a big list of them. I wonder if there is such a list available somewhere?
Ars Lande wrote: The idea is to compare a given work to comparable works, within the same genre and subject, and to search for words that are more commonly used than they are in the reference corpus.
You'd have the same problem with syntax: there is considerable variation between the syntax of an academic paper and that of a novel.
That's true, and we're not about to give up on word counting; it is quite effective. More inexplicable is the fact that sequences of characters are also very effective; you can identify an author by the fact that his a's are often followed by b's, apparently.
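By "sequences of characters" I mean character n-grams. A minimal sketch of how such a profile can be computed (the function name `char_ngrams` is just for illustration, and this is not the actual code used in the project):

```python
from collections import Counter

def char_ngrams(text, n=2):
    """Relative frequencies of overlapping character n-grams."""
    text = " ".join(text.lower().split())   # lowercase, collapse whitespace
    grams = Counter(text[i:i + n] for i in range(len(text) - n + 1))
    total = sum(grams.values()) or 1
    return {gram: count / total for gram, count in grams.items()}
```

With n=2, the value for "ab" is literally how often an a is followed by a b, relative to all pairs of adjacent characters.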

Anyway - some features are more prone to this problem than others. We can find texts that are more or less in the same genre (we've looked at blog posts, for example), but the same subject is trickier. If a guy writes about his cat one day and his dog the next, the word frequencies could be quite different, at least for "cat" and "dog", but the syntax is probably similar.

CS: You have some more good examples there. I should be able to find a long list of dialect differences - both spelling ones like "color" vs. "colour" and actual different words. It will perhaps mainly separate writers into two groups, but that's not a bad start.

Sal: True, but if we're talking about vast amounts of text by hundreds of authors, it is something humans are also bad at. The precision of the computer programs to date isn't overwhelming, but big enough to be of some use. I don't have the figures right here. The computer scientists involved have a tendency to dismiss any theory of linguistics and claim that it's all a matter of algorithms and statistics, so I'd like to prove them wrong and show that linguistic knowledge is helpful when dealing with language.

Wattmann: Yes, when we look at the syntax, we usually use the interpretation of the syntax which the parser has given us, so the fact that the parser isn't perfect is a problem. But there are parsers that do well enough to be useful.

Hwhatting: Good point. We are mostly looking at recent texts found on the internet, though.

Gulliver
Avisaru
Posts: 433
Joined: Mon May 05, 2003 2:58 pm
Location: The West Country

Re: Writer idiosyncrasies

Post by Gulliver »

Chuma wrote: Gulliver: Those are some great examples. Punctuation use - or at least punctuation frequency - has been used with great success. Those synonyms are exactly what I'm talking about - I just need a big list of them. I wonder if there is such a list available somewhere?
Thank you. I am amazing.

I have had a quick look (because doing other people's work is clearly better than doing my own) but I can't find such a list. It would probably be easier to look for Latinate/French or Germanic morphemes than whole words. Bound morphemes used in word-building, as opposed to free morphemes, might be quicker to isolate and might show a trend more readily. I could be wrong, though.

Romance: -ity, -ate, -tion, -ize, re-, im-, -al, -ial etc
Germanic: -hood, -ship, -ish, -er etc

That's how I would do it, if I had to use an automated way. Obviously, you'd miss some with a slightly squiffy spelling, and there are several words that would throw up false positives: "gumption" contains -tion but came into English through Scots and is of uncertain origin, and -ling is a Germanic diminutive suffix (starling, sapling, hatchling, earthling) but would be confused with the "ling" in "revealing". There are also some words that don't have an equivalent, so there is only really one choice (try talking about the reunification of Germany without sounding overly verbose and clumsy). You'd probably need an exception list.
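Something like the following is roughly what I have in mind, as a sketch only. It handles the suffixes but not the prefixes, the exception set contains just the one example from above, and the `min_stem` cut-off is an arbitrary assumption to stop "late" counting as -ate.

```python
import re
from collections import Counter

# Suffix lists from the post; EXCEPTIONS is a hypothetical starting point.
ROMANCE = ("ity", "ate", "tion", "ize", "ial", "al")
GERMANIC = ("hood", "ship", "ish", "er")
EXCEPTIONS = {"gumption"}

def has_suffix(word, suffixes, min_stem):
    """True if the word ends in one of the suffixes and leaves a stem
    of at least `min_stem` letters."""
    return any(word.endswith(s) and len(word) - len(s) >= min_stem
               for s in suffixes)

def suffix_origin_counts(text, min_stem=2):
    """Count words carrying a Romance or a Germanic suffix."""
    counts = Counter()
    for word in re.findall(r"[a-z]+", text.lower()):
        if word in EXCEPTIONS:
            continue
        if has_suffix(word, ROMANCE, min_stem):
            counts["romance"] += 1
        elif has_suffix(word, GERMANIC, min_stem):
            counts["germanic"] += 1
    return counts
```

The -er suffix in particular will catch a lot of unrelated words ("other", "water"), so the exception list would have to grow quickly.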

Chuma
Avisaru
Posts: 387
Joined: Sat Oct 28, 2006 9:01 pm
Location: Hyperborea

Re: Writer idiosyncrasies

Post by Chuma »

Feel free to let me know what your job is, and maybe I can do that for you in return. :)

It seems that a very important factor is choosing features that are common, otherwise the statistics aren't effective (unless you have huge amounts of data, but that's rarely the case). So bound morphemes could be a good idea. I might try, as a start, to just check for some affixes, and then maybe I'll look into an exception list, or something.

Actually, I might just check all prefixes and suffixes up to a certain length - that is, prefixes and suffixes in the formal language sense, so a "prefix" is any string of letters that makes up the start of a word. I don't think anyone has done that yet.
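A minimal sketch of what that would look like, assuming a simple letters-only tokenizer; the name `affix_counts` and the "^"/"$" markers for word start and end are my own conventions here:

```python
import re
from collections import Counter

def affix_counts(text, max_len=3):
    """Count every word-initial and word-final letter string of up to
    `max_len` letters. Prefixes are stored as "^abc", suffixes as "abc$"."""
    counts = Counter()
    for word in re.findall(r"[a-z]+", text.lower()):
        for k in range(1, min(max_len, len(word)) + 1):
            counts["^" + word[:k]] += 1
            counts[word[-k:] + "$"] += 1
    return counts
```

Note that short words end up counted whole, so with a large enough `max_len` this overlaps with ordinary word counting.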

Gulliver
Avisaru
Posts: 433
Joined: Mon May 05, 2003 2:58 pm
Location: The West Country

Re: Writer idiosyncrasies

Post by Gulliver »

What kind of corpus are you using? You said web-based, but does that mean blogs or news articles or what?

Many newspaper articles are proof-read, which would reduce the usefulness of punctuation-counting. However, my uncle, who used to proofread professionally, recently lamented the decline in punctuation standards, so this might not be true any longer.

Other ideas: does the writer write Monday or monday, do they refer to their Mum or their mum or their Mom or their mom? Different from or different to? Is i a word on its own, or do they use the standard I? Capitalisation (capitalization?) might be another indicative marker.
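A quick sketch of how two of these could be counted, with a made-up function name `capitalisation_features`. It is crude: "i" in "i.e." is counted as a lowercase pronoun, and a sentence-initial "Monday" tells you nothing.

```python
import re

DAYS = {"monday", "tuesday", "wednesday", "thursday",
        "friday", "saturday", "sunday"}

def capitalisation_features(text):
    """Count lowercase vs. uppercase standalone "i" and day names."""
    feats = {
        "lower_i": len(re.findall(r"\bi\b", text)),
        "upper_I": len(re.findall(r"\bI\b", text)),
        "lower_day": 0,
        "upper_day": 0,
    }
    for tok in re.findall(r"[A-Za-z]+", text):
        if tok.lower() in DAYS:
            key = "upper_day" if tok[0].isupper() else "lower_day"
            feats[key] += 1
    return feats
```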
Chuma wrote: Feel free to let me know what your job is, and maybe I can do that for you in return. :)
MA basic project work. It's not even assessed, but I'm taking ages over it. If it were assessed, I might be more inclined to do it properly.

Chuma
Avisaru
Posts: 387
Joined: Sat Oct 28, 2006 9:01 pm
Location: Hyperborea

Re: Writer idiosyncrasies

Post by Chuma »

Indeed. I think it's based on blogs, but I'm not sure exactly. Maybe we have several corpses. Will check.

My project is supposed to be mainly dealing with syntax, so it would be nice to find some good syntactic features. But "different to/from" could perhaps be seen as a syntactic feature, sort of, since we need syntax to figure out whether the two words have that particular relation - "it was different in a way from the other one" does, but "the house looked different from the other side" does not. Might be able to do something with that.
