Idea for a Data Schema for Conlangs

snappdragon
Sanci
Posts: 72
Joined: Fri Oct 14, 2016 5:56 pm
Location: Searching for $15 so I can get the PCK

Idea for a Data Schema for Conlangs

Post by snappdragon »

I am both a language nerd and a computer nerd. I had an idea that would put them together. I'm contemplating creating an XML (or JSON) schema for storing conlangs in a computer-readable format. This would be useful for applications like PolyGlot, or if someone made a website that acted as a conlang database (I'm not counting conlang.wikia.com! Sorry!) that needed an easy way to store the conlangs.

However, there are a few issues to think about beforehand:
  • Phonetics - How should phones/phonemes, tones, diacritics, stress, timing, etc. be expressed in the file? One idea is to have the IPA symbol as an "ID" of sorts and use either words or numbers to store the meaty phonological information (place/manner, position/height) as separate attributes for phones/phonemes. Other things need some thought.
  • Morphology - How would morphological rules be formatted?
  • Syntax - How would sentence, clause, and phrase structuring rules be formatted?
  • Semantics and Pragmatics - How would we even store this??
  • Lexicon - In what format should the words be stored?
While I do have some ideas for how to pull this together, it'd be smart to get some feedback first (and to know if such a thing would even be necessary). Have a good day.
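As a rough illustration of the phonetics idea in the list above (the IPA symbol as an ID, with place/manner or height/backness stored as separate attributes), here is a minimal Python sketch. The feature names, the `select` helper, and the five-phoneme inventory are all assumptions made up for the example, not an established format.

```python
import json

# Hypothetical inventory: keys are IPA symbols, values are feature attributes.
# The feature names ("place", "manner", ...) are illustrative, not a standard.
inventory = {
    "p": {"type": "consonant", "place": "bilabial", "manner": "plosive", "voiced": False},
    "b": {"type": "consonant", "place": "bilabial", "manner": "plosive", "voiced": True},
    "m": {"type": "consonant", "place": "bilabial", "manner": "nasal", "voiced": True},
    "i": {"type": "vowel", "height": "close", "backness": "front", "rounded": False},
    "u": {"type": "vowel", "height": "close", "backness": "back", "rounded": True},
}


def select(inventory, **features):
    """Return the IPA symbols whose attributes match every given feature."""
    return sorted(
        symbol
        for symbol, attrs in inventory.items()
        if all(attrs.get(name) == value for name, value in features.items())
    )


# Query the inventory by feature instead of by symbol.
voiced_bilabials = select(inventory, place="bilabial", voiced=True)

# The same structure serializes directly to JSON for storage.
serialized = json.dumps(inventory, ensure_ascii=False, sort_keys=True)
```

Tones, stress, and timing are left out here on purpose; they would probably need their own structures rather than more per-phoneme attributes.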
~$ snappdragon
Linguistic novice, worldbuilding newbie. Also, wants to be a game developer.

Travis B.
Sumerul
Posts: 3570
Joined: Mon Jun 20, 2005 12:47 pm
Location: Milwaukee, US

Re: Idea for a Data Schema for Conlangs

Post by Travis B. »

To do this would require representing a Turing-complete language in the format of XML or JSON, which is an utterly horrible idea that should never be attempted. (Yes, I am referring to XSLT here, which shows you why Turing-complete XML is a bad idea.)
Dibotahamdn duthma jallni agaynni ra hgitn lakrhmi.
Amuhawr jalla vowa vta hlakrhi hdm duthmi xaja.
Irdro. Irdro. Irdro. Irdro. Irdro. Irdro. Irdro.

snappdragon
Sanci
Posts: 72
Joined: Fri Oct 14, 2016 5:56 pm
Location: Searching for $15 so I can get the PCK

Re: Idea for a Data Schema for Conlangs

Post by snappdragon »

Travis B. wrote:To do this would require representing a Turing-complete language in the format of XML or JSON, which is an utterly horrible idea that should never be attempted. (Yes, I am referring to XSLT here, which shows you why Turing-complete XML is a bad idea.)
The computer isn't actually reading/using the language. It's just a way of storing it in a way that it can easily manipulate. I'm sorry if that wasn't clear :p

Travis B.
Sumerul
Posts: 3570
Joined: Mon Jun 20, 2005 12:47 pm
Location: Milwaukee, US

Re: Idea for a Data Schema for Conlangs

Post by Travis B. »

But even if you are merely representing things like dictionaries, how do you do so in a universal fashion; e.g. entries may need declension/conjugation tables, but how does one specify a universal fashion of representing them (especially since the manner in which words can inflect is limited only by what is possible within the morphology of human languages)?

snappdragon
Sanci
Posts: 72
Joined: Fri Oct 14, 2016 5:56 pm
Location: Searching for $15 so I can get the PCK

Re: Idea for a Data Schema for Conlangs

Post by snappdragon »

Travis B. wrote:But even if you are merely representing things like dictionaries, how do you do so in a universal fashion; e.g. entries may need declension/conjugation tables, but how does one specify a universal fashion of representing them (especially since the manner in which words can inflect is limited only by what is possible within the morphology of human languages)?
This is why one gets suggestions and feedback, so they may figure these things out.

Since I'll most likely be using XML, one possibility would be to define the inflection tables as elements inside of that word's definition element (which would have the word, part of speech, and meaning as attributes). Or some other way, if I can think of a better one.
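A minimal sketch of what that XML layout could look like, built and read back with Python's standard `xml.etree.ElementTree`. The element and attribute names (`entry`, `inflection`, `form`, `case`, `number`) and the example word are invented for illustration; nothing here is a settled schema.

```python
import xml.etree.ElementTree as ET

# Hypothetical element and attribute names; the example word is invented.
entry = ET.Element("entry", {"word": "kata", "pos": "noun", "meaning": "stone"})
table = ET.SubElement(entry, "inflection")
for case, number, form in [
    ("nominative", "singular", "kata"),
    ("nominative", "plural", "katai"),
    ("accusative", "singular", "katan"),
    ("accusative", "plural", "katain"),
]:
    # Each cell names its own categories, so tables can differ per language.
    cell = ET.SubElement(table, "form", {"case": case, "number": number})
    cell.text = form

xml_text = ET.tostring(entry, encoding="unicode")

# Reading it back: look up one cell by its category values.
parsed = ET.fromstring(xml_text)
acc_pl = parsed.find("./inflection/form[@case='accusative'][@number='plural']").text
```

Because every `form` element carries its own category attributes, the table has no fixed shape, but the meaning of each category name is still left undefined.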

Salmoneus
Sanno
Posts: 3197
Joined: Thu Jan 15, 2004 5:00 pm
Location: One of the dark places of the world

Re: Idea for a Data Schema for Conlangs

Post by Salmoneus »

snappdragon wrote:
This is why one gets suggestions and feedback, so they may figure these things out.

Since I'll most likely be using XML, one possibility would be to define the inflection tables as elements inside of that word's definition element (which would have the word, part of speech, and meaning as attributes). Or some other way, if I can think of a better one.
That doesn't seem to address the issues Travis is raising, just the computer-trivia aspect.

For instance, what is in a declension table? Values for each combination of relevant categories. What are the relevant categories? Language dependent! So you couldn't just fill in a default table; each language would use different tables of different sizes with different categories in them. [Not to mention that for some languages, where there may be millions of potential word forms, tables are an incredibly inefficient way of storing that data.]
And that's the EASY bit. Because then you have to define what those categories mean, and that's language dependent and very complicated ("dative" in one language may have little in common with "dative" in another).
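To make the size problem concrete, here is a small Python sketch that counts the cells in a fully cross-classified table. Both category inventories are invented for the example; the point is only that the table shape follows from language-specific categories and grows as their product.

```python
from itertools import product
from math import prod

# Invented category inventories for two hypothetical languages.
lang_a = {"case": ["nom", "acc"], "number": ["sg", "pl"]}
lang_b = {
    "case": ["nom", "acc", "dat", "gen", "loc", "ins"],
    "number": ["sg", "du", "pl"],
    "possessor": ["none", "1sg", "2sg", "3sg", "1pl", "2pl", "3pl"],
    "definiteness": ["indef", "def"],
}


def cell_count(categories):
    """Number of cells in a fully cross-classified table."""
    return prod(len(values) for values in categories.values())


def cells(categories):
    """Enumerate every combination of category values as a dict."""
    names = list(categories)
    return [dict(zip(names, combo)) for combo in product(*categories.values())]


size_a = cell_count(lang_a)  # 2 * 2
size_b = cell_count(lang_b)  # 6 * 3 * 7 * 2
```

Each added category multiplies the count, which is why storing rules that generate forms tends to scale better than storing every cell. The sketch says nothing about what the category labels mean, which is the harder problem raised above.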

We have already invented a maximally efficient way of storing language descriptions on computers. It's called "a word processor document." Those 1,000-page grammars aren't written for fun, and it's not just bad management that they're often organised in totally different ways from one another - it's because a simple, universal system to objectively and succinctly describe languages is not possible.
Blog: http://vacuouswastrel.wordpress.com/

But the river tripped on her by and by, lapping
as though her heart was brook: Why, why, why! Weh, O weh
I'se so silly to be flowing but I no canna stay!

Curlyjimsam
Lebom
Posts: 205
Joined: Wed Dec 29, 2004 11:57 am
Location: Elsewhere

Re: Idea for a Data Schema for Conlangs

Post by Curlyjimsam »

A unified way of representing information about a language might have some uses in spite of its drawbacks. Yes, we might end up with somewhat partial descriptions (though all descriptions of natural or pseudo-natural languages are partial descriptions), but that's not to say the idea is entirely without merit. People could still write up descriptions of their languages in more traditional ways as well. I can imagine what snappdragon is suggesting here being useful if, say, you wanted to compare related properties of lots of different languages quickly, without having to leaf through the grammars to find the right bit in each one.

(Admittedly, I'm not sure how much application doing that sort of thing would have, but you can imagine a few - an interest in seeing what sort of properties constructed languages have and how this compares to real-world languages, a desire to know which properties are overused so they can be avoided, etc.)

Lolinder
Niš
Posts: 1
Joined: Tue May 11, 2010 5:52 pm

Re: Idea for a Data Schema for Conlangs

Post by Lolinder »

Summary: While XML and JSON schemas may not be flexible enough for this, in my (admittedly non-expert) opinion it is possible (and will be necessary) to create a computer-readable markup language to represent natural-language grammars.

Salmoneus: I don't think snappdragon's goal here is to create a maximally efficient way to store the information on a computer. Indeed, you're probably right that the grammar texts are as efficient in terms of hard-drive space as we're likely to get. But what snappdragon is talking about is representing the grammars in a way that is usable by a computer, which is a totally different question.

The "each language has its own totally unique grammar format" approach works well for human consumption. But off the top of my head, here are a few (admittedly complicated and at the moment far-off) applications for the kind of thing that snappdragon is talking about:
  • Applications like PolyGlot (already mentioned)
  • Translation software: if you can design a comprehensive way to represent the rules of a grammar, you wouldn't need to build a new program for each new language you wanted to add to, say, Google Translate. I could take a language file for a conlang, upload it to a program like Google Translate, and it could run the translations. (This one would be very complicated to implement successfully, given the problem of ambiguity, but still doable eventually)
  • Linguistic typology: like we're already doing with DNA and phylogenetic trees, if you can represent grammars and lexicons in a computer-readable format, you could write machine learning algorithms to determine relationships between languages, generate hypothetical ancestor languages, etc. A lot of data could be mined that, without a computer, is inaccessible.
And I'm sure there are more. The point is, computers are a key part of any field in the 21st century, and so computer-readable formats are going to be crucial to any field, linguistics included. I'd actually be surprised if there isn't already a project working on this.
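As a toy version of the typology application in the list above, the sketch below compares languages by the fraction of shared features on which they differ. The feature names, values, and the `distance` and `closest` helpers are assumptions for illustration; real phylogenetic methods are far more involved than this.

```python
# Invented feature values for three hypothetical languages.
languages = {
    "Lang A": {"word_order": "SOV", "alignment": "nom-acc", "tones": False},
    "Lang B": {"word_order": "SOV", "alignment": "erg-abs", "tones": False},
    "Lang C": {"word_order": "VSO", "alignment": "erg-abs", "tones": True},
}


def distance(a, b):
    """Fraction of shared features on which two languages differ."""
    shared = a.keys() & b.keys()
    if not shared:
        return 1.0  # nothing comparable, treat as maximally distant
    return sum(a[f] != b[f] for f in shared) / len(shared)


def closest(name, languages):
    """Name of the language with the smallest distance to the given one."""
    others = [other for other in languages if other != name]
    return min(others, key=lambda other: distance(languages[name], languages[other]))


nearest_to_a = closest("Lang A", languages)
```

This only works if every description uses the same feature names with the same meanings, which is exactly the standardization question the thread is arguing about.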

If we accept the necessity of this, then the question becomes: how? The concerns raised are valid when it comes to XML and JSON schemas: they aren't meant for something as complicated and fluid as the definition of a human language. But that doesn't mean there cannot exist a computer language for it.

I haven't (yet) put a lot of thought into what the various obstacles might be (and I'm only a very amateur linguist), but regarding the specific concern about declension tables: yes, the relevant categories are language dependent. So let the grammar define, at the beginning, what the relevant categories are. Then you can use those categories throughout the rest of the file, because you defined them for the computer. The same approach could be applied to the definition of Dative, and anything else. This is where XML falls very short, but it's mechanically no different than defining new objects in any OOP language.
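A minimal Python sketch of the "declare the categories first, then use them" idea. The `Grammar` class and its `check` method are hypothetical names for this example; the sketch only validates labels and does not try to capture what a category such as Dative means.

```python
class Grammar:
    """Holds the categories a language declares, then checks forms against them."""

    def __init__(self, categories):
        # e.g. {"case": ["ergative", "absolutive"], "number": ["singular", "plural"]}
        self.categories = {name: set(values) for name, values in categories.items()}

    def check(self, tags):
        """Raise ValueError if a tag uses an undeclared category or value."""
        for name, value in tags.items():
            if name not in self.categories:
                raise ValueError(f"undeclared category: {name}")
            if value not in self.categories[name]:
                raise ValueError(f"undeclared value for {name}: {value}")
        return True


# The language file declares its own categories up front...
grammar = Grammar({"case": ["ergative", "absolutive"], "number": ["singular", "plural"]})

# ...and every later use is checked against that declaration.
ok = grammar.check({"case": "ergative", "number": "plural"})
```

A call such as `grammar.check({"case": "dative"})` raises `ValueError` here, because this hypothetical language never declared a dative.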

This would, admittedly, get very tedious for very common language features, but could easily be handled like a library in a general-purpose computer language. Your language is nominative-accusative? Great, we have an import for that!
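The "import" idea could be sketched like this in Python: a shared table of reusable category definitions that a language pulls in and then extends. `PRESETS`, `build_categories`, and the preset names are all made up for the example.

```python
# Hypothetical "standard library" of reusable category definitions.
PRESETS = {
    "nominative-accusative": {"case": ["nominative", "accusative"]},
    "ergative-absolutive": {"case": ["ergative", "absolutive"]},
    "singular-plural": {"number": ["singular", "plural"]},
}


def _extend(categories, additions):
    """Add values to each category, keeping order and skipping duplicates."""
    for category, values in additions.items():
        merged = categories.setdefault(category, [])
        for value in values:
            if value not in merged:
                merged.append(value)


def build_categories(imports, overrides=None):
    """Merge imported presets, then extend them with language-specific values."""
    categories = {}
    for name in imports:
        _extend(categories, PRESETS[name])
    _extend(categories, overrides or {})
    return categories


# A language imports two presets and adds a dative case and a dual number.
categories = build_categories(
    ["nominative-accusative", "singular-plural"],
    overrides={"case": ["dative"], "number": ["dual"]},
)
```

The presets only save typing; a language whose "accusative" behaves unusually would still have to describe that difference itself.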

All that said, I'll readily admit this isn't an easy task. But I've been looking for an opportunity to design a domain-specific computer language, and this seems like a very interesting challenge. How about you guys present us with the obstacles to a uniform representation, and snappdragon and I can work on overcoming them? :)

Axiem
Avisaru
Posts: 260
Joined: Tue Oct 22, 2013 8:15 pm

Re: Idea for a Data Schema for Conlangs

Post by Axiem »

Are natural languages recursively enumerable?

alice
Avisaru
Posts: 707
Joined: Wed Oct 30, 2002 4:43 pm
Location: Three of them

Re: Idea for a Data Schema for Conlangs

Post by alice »

To put this idea into perspective, I've been trying something similar just for phonology, and even that's proving to be a major headache.
Zompist's Markov generator wrote:it was labelled "orange marmalade," but that is unutterably hideous.

xxx
Lebom
Posts: 94
Joined: Wed Dec 14, 2011 1:04 pm

imhoderate

Post by xxx »

We risk losing the main ...
The immoderate use of computers begins to have even negative repercussions on the conceptualization in science ...
How much more for this ephemeral art which tries to bring us closer to ourselves, and which has no last end in the real external world ...

snappdragon
Sanci
Posts: 72
Joined: Fri Oct 14, 2016 5:56 pm
Location: Searching for $15 so I can get the PCK

Re: imhoderate

Post by snappdragon »

xxx wrote:We risk losing the main ...
The immoderate use of computers begins to have even negative repercussions on the conceptualization in science ...
How much more for this ephemeral art which tries to bring us closer to ourselves, and which has no last end in the real external world ...
Uhm... the only thing I can respond to this is #ExistentialPhilosophy

snappdragon
Sanci
Posts: 72
Joined: Fri Oct 14, 2016 5:56 pm
Location: Searching for $15 so I can get the PCK

Re: Idea for a Data Schema for Conlangs

Post by snappdragon »

Lolinder wrote:Summary: While XML and JSON schemas may not be flexible enough for this, in my (admittedly non-expert) opinion it is possible (and will be necessary) to create a computer-readable markup language to represent natural-language grammars.
All that said, I'll readily admit this isn't an easy task. But I've been looking for an opportunity to design a domain-specific computer language, and this seems like a very interesting challenge. How about you guys present us with the obstacles to a uniform representation, and snappdragon and I can work on overcoming them? :)
Yes. Just. Yes. I'm going to message you shortly. Be prepared :)

xxx
Lebom
Posts: 94
Joined: Wed Dec 14, 2011 1:04 pm

Re: imhoderate

Post by xxx »

snappdragon wrote:Uhm... the only thing I can respond to this is #ExistentialPhilosophy
what about leaving art/philosophy/science to engineers...
