Summary: While XML and JSON schemas may not be flexible enough for this, in my (admittedly non-expert) opinion it is possible (and will be necessary) to create a computer-readable markup language to represent natural-language grammars.
Salmoneus: I don't think snappdragon's goal here is to create a maximally efficient way to store the information on a computer. Indeed, you're probably right that the grammar texts are about as efficient in terms of hard-drive space as we're likely to get. But what snappdragon is talking about is representing the grammars in a way that is usable by a computer, which is a totally different question.
The "each language has its own totally unique grammar format" approach works well for human consumption. But off the top of my head, here are a few (admittedly complicated and at the moment far-off) applications for the kind of thing that snappdragon is talking about:
- Applications like PolyGlot (already mentioned)
- Translation software: if you can design a comprehensive way to represent the rules of a grammar, you wouldn't need to build a new program for each new language you wanted to add to, say, Google Translate. I could take a language file for a conlang, upload it to a program like Google Translate, and it could run the translations. (This one would be very complicated to implement successfully, given the problem of ambiguity, but still doable eventually)
- Linguistic typology: like we're already doing with DNA and phylogenetic trees, if you can represent grammars and lexicons in a computer-readable format, you could write machine learning algorithms to determine relationships between languages, generate hypothetical ancestor languages, etc. A lot of data could be mined that, without a computer, is inaccessible.
And I'm sure there are more. The point is, computers are a key part of any field in the 21st century, and so computer-readable formats are going to be crucial to any field, linguistics included. I'd actually be surprised if there isn't already a project working on this.
If we accept the necessity of this, then the question becomes: how? The concerns raised are valid when it comes to XML and JSON schemas: they aren't meant for something as complicated and fluid as the definition of a human language. But that doesn't mean there cannot exist a computer language for it.
I haven't (yet) put a lot of thought into what the various obstacles might be (and I'm only a very amateur linguist), but regarding the specific concern about declension tables: yes, the relevant categories are language-dependent. So let the grammar define, at the beginning, what the relevant categories are. Then you can use those categories throughout the rest of the file, because you've defined them for the computer. The same approach could be applied to the definition of Dative, or to anything else. This is where XML falls very short, but mechanically it's no different from defining new classes in any OOP language.
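To make the idea concrete, here's a rough Python sketch of "declare your categories first, then use them." Everything here (the class names, the category values, the whole API) is hypothetical and invented for illustration; it isn't from any existing tool. The only point is that a grammar file can define its own language-specific categories up front, and the machine can then enforce that later sections only use what was declared:

```python
# Hypothetical sketch: a grammar declares its own categories up front,
# and later parts of the file may only reference declared values.
from dataclasses import dataclass, field


@dataclass
class Category:
    name: str
    values: tuple  # e.g. ("nominative", "accusative", "dative")


@dataclass
class Grammar:
    categories: dict = field(default_factory=dict)

    def define_category(self, name, *values):
        # The "beginning of the file": the language declares what exists.
        self.categories[name] = Category(name, tuple(values))

    def check(self, category, value):
        # The "rest of the file": only declared values are legal.
        cat = self.categories[category]
        if value not in cat.values:
            raise ValueError(f"{value!r} is not a declared {category}")
        return True


g = Grammar()
g.define_category("case", "nominative", "accusative", "dative")
g.define_category("number", "singular", "plural")

g.check("case", "dative")      # fine: declared above
# g.check("case", "ergative")  # would raise: this language never declared it
```

A language with an ergative-absolutive alignment would simply declare different case values at the top, and the same machinery would apply unchanged.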
This would, admittedly, get very tedious for very common language features, but could easily be handled like a library in a general-purpose computer language. Your language is nominative-accusative? Great, we have an import for that!
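The "import" idea could look something like the sketch below. Again, this is purely hypothetical (no such library exists, and the names are made up); it just shows how a few standard alignments could be shipped as predefined category sets so an individual grammar file doesn't have to spell them out:

```python
# Hypothetical "standard library" of common alignments, so a grammar
# file can import one instead of redefining it from scratch.
CASE_LIBRARIES = {
    "nominative-accusative": ("nominative", "accusative"),
    "ergative-absolutive": ("ergative", "absolutive"),
}


def import_alignment(categories, alignment):
    """Add a standard case alignment to a grammar's category table."""
    categories["case"] = CASE_LIBRARIES[alignment]
    return categories


# Your language is nominative-accusative? Great, we have an import for that!
cats = import_alignment({}, "nominative-accusative")
```

A grammar could still extend or override the imported set (add a dative, say), which is exactly how libraries in general-purpose languages are used.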
All that said, I'll readily admit this isn't an easy task. But I've been looking for an opportunity to design a domain-specific computer language, and this seems like a very interesting challenge. How about you guys present us with the obstacles to a uniform representation, and snappdragon and I can work on overcoming them?