Summary: While XML and JSON schemas may not be flexible enough for this, in my (admittedly non-expert) opinion it is possible (and will be necessary) to create a computer-readable markup language to represent natural-language grammars.
Salmoneus: I don't think snappdragon's goal here is to create a maximally efficient way to store the information on a computer. Indeed, you're probably right that the grammar texts are about as efficient in terms of hard-drive space as we're likely to get. But what snappdragon is talking about is representing the grammars in a way that is usable by a computer, which is a totally different question.
The "each language has its own totally unique grammar format" approach works well for human consumption. But off the top of my head, here are a few (admittedly complicated and at the moment far-off) applications for the kind of thing that snappdragon is talking about:
- Applications like PolyGlot (already mentioned)
- Translation software: if you can design a comprehensive way to represent the rules of a grammar, you wouldn't need to build a new program for each new language you wanted to add to, say, Google Translate. I could take a language file for a conlang, upload it to a program like Google Translate, and it could run the translations. (This one would be very complicated to implement successfully, given the problem of ambiguity, but still doable eventually)
- Linguistic typology: like we're already doing with DNA and phylogenetic trees, if you can represent grammars and lexicons in a computer-readable format, you could write machine learning algorithms to determine relationships between languages, generate hypothetical ancestor languages, etc. A lot of data could be mined that, without a computer, is inaccessible.
And I'm sure there are more. The point is, computers are a key part of any field in the 21st century, and so computer-readable formats are going to be crucial to any field, linguistics included. I'd actually be surprised if there isn't already a project working on this.
If we accept the necessity of this, then the question becomes: how? The concerns raised are valid when it comes to XML and JSON schemas: they aren't meant for something as complicated and fluid as the definition of a human language. But that doesn't mean there cannot exist a computer language for it.
I haven't (yet) put a lot of thought into what the various obstacles might be (and I'm only a very amateur linguist), but regarding the specific concern about declension tables: yes, the relevant categories are language-dependent. So let the grammar define, at the beginning, what the relevant categories are. Then you can use those categories throughout the rest of the file, because you've defined them for the computer. The same approach could be applied to the definition of Dative, or to anything else. This is where XML falls very short, but mechanically it's no different from defining new classes in any OOP language.
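To make the idea concrete, here's a rough Python sketch of "declare your categories first, then use them." Everything here (the class names, the category values, the whole API) is hypothetical and invented for illustration; it isn't from any existing tool. The only point is that a grammar file can define its own language-specific categories up front, and the machine can then enforce that later sections only use what was declared:

```python
# Hypothetical sketch: a grammar declares its own categories up front,
# and later parts of the file may only reference declared values.
from dataclasses import dataclass, field


@dataclass
class Category:
    name: str
    values: tuple  # e.g. ("nominative", "accusative", "dative")


@dataclass
class Grammar:
    categories: dict = field(default_factory=dict)

    def define_category(self, name, *values):
        # The "beginning of the file": the language declares what exists.
        self.categories[name] = Category(name, tuple(values))

    def check(self, category, value):
        # The "rest of the file": only declared values are legal.
        cat = self.categories[category]
        if value not in cat.values:
            raise ValueError(f"{value!r} is not a declared {category}")
        return True


g = Grammar()
g.define_category("case", "nominative", "accusative", "dative")
g.define_category("number", "singular", "plural")

g.check("case", "dative")      # fine: declared above
# g.check("case", "ergative")  # would raise: this language never declared it
```

A language with an ergative-absolutive alignment would simply declare different case values at the top, and the same machinery would apply unchanged.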
This would, admittedly, get very tedious for very common language features, but could easily be handled like a library in a general-purpose computer language. Your language is nominative-accusative? Great, we have an import for that!
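The "import" idea could look something like the sketch below. Again, this is purely hypothetical (no such library exists, and the names are made up); it just shows how a few standard alignments could be shipped as predefined category sets so an individual grammar file doesn't have to spell them out:

```python
# Hypothetical "standard library" of common alignments, so a grammar
# file can import one instead of redefining it from scratch.
CASE_LIBRARIES = {
    "nominative-accusative": ("nominative", "accusative"),
    "ergative-absolutive": ("ergative", "absolutive"),
}


def import_alignment(categories, alignment):
    """Add a standard case alignment to a grammar's category table."""
    categories["case"] = CASE_LIBRARIES[alignment]
    return categories


# Your language is nominative-accusative? Great, we have an import for that!
cats = import_alignment({}, "nominative-accusative")
```

A grammar could still extend or override the imported set (add a dative, say), which is exactly how libraries in general-purpose languages are used.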
All that said, I'll readily admit this isn't an easy task. But I've been looking for an opportunity to design a domain-specific computer language, and this seems like a very interesting challenge. How about you guys present us with the obstacles to a uniform representation, and snappdragon and I can work on overcoming them?