Haedus SCA - Bugfix (01/24)
Posted: Sun Nov 03, 2013 5:53 pm
EDIT: Updated Link - updated documentation and additional features to follow in a few days, hopefully.
This is the successor to my old ASCA program that nobody asked for. It is, however, much better tested - I've been working professionally as a computational linguist for the last year and have learned a lot about software design.
I don't have a clever name for it, but it's an SCA and it's part of a set of tools I'm developing for future dissertation research. Let's just call it HTS for now. I'm interested in people's feedback on usability, documentation, and any problems - as stated above, I needed to develop this as part of another project, but I wanted to make it available for public use.
Download Here (2014.01.24) - contains jar, an example batch file, and example PIE to Kuma-Koban lexicon and rules. If you have an ASCA rules file, it will be compatible with this, though because HTS handles combining characters differently, the intent might be slightly off.
I'll try to be as concise as possible here, but there is one really critical thing to understand before using this tool: HTS does not manipulate strings.
It is designed to operate on phonetic segments, and handles combining diacritics and modifier letters intelligently. A rule targeting p will not affect pʰ, ever, under any circumstances. Symbols like <ʰ>, and others like <ˣ>, <ʶ>, combining diacritics, alphanumeric sub- and superscripts, are understood to not be characters on their own, but modifiers on another (preceding) character.
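To make the segment idea concrete, here is a rough Python sketch (not taken from the HTS source, which is Java). It assumes a modifier can be recognised from its Unicode category: combining marks, modifier letters, and sub- or superscript digits. The helper names `is_modifier`, `segment`, and `replace_segment` are made up for this illustration.

```python
import unicodedata

def is_modifier(ch):
    """Guess whether ch modifies the preceding character (assumed criteria)."""
    cat = unicodedata.category(ch)
    if cat in ("Mn", "Mc", "Me", "Lm"):  # combining marks and modifier letters
        return True
    name = unicodedata.name(ch, "")
    # Sub- and superscript digits are category "No" in Unicode
    return cat == "No" and ("SUBSCRIPT" in name or "SUPERSCRIPT" in name)

def segment(word):
    """Split a word into segments: a base character plus its trailing modifiers."""
    segments = []
    for ch in word:
        if segments and is_modifier(ch):
            segments[-1] += ch   # attach to the preceding segment
        else:
            segments.append(ch)  # start a new segment (also for a leading modifier)
    return segments

def replace_segment(word, source, target):
    """Replace whole segments only, so a rule for p leaves pʰ alone."""
    return "".join(target if seg == source else seg for seg in segment(word))

print(segment("pʰater"))                  # ['pʰ', 'a', 't', 'e', 'r']
print(replace_segment("pʰapa", "p", "b")) # pʰaba
```

A plain string replacement of p with b would have produced bʰaba here, which is exactly the behaviour a segment-based tool avoids.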
Important Caveat: all diacritics currently combine with the segment they follow, so pre-aspiration written as ʰp is not possible at the moment. It will be once support for phonetic features is added, though it will need to be pre-defined.
Also, rules and lexicon entries are normalized using canonical composition in Unicode, so you don't have to worry about manually converting between forms with combining accents and their precomposed equivalents.
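For anyone curious what canonical composition does in practice, this small Python sketch shows the idea with the standard `unicodedata` module. It only illustrates NFC itself; the function name `to_nfc` is made up here, and how HTS calls the equivalent Java routine is not shown in this post.

```python
import unicodedata

def to_nfc(text):
    """Apply Unicode canonical composition (NFC)."""
    return unicodedata.normalize("NFC", text)

decomposed = "k\u0301"    # k followed by COMBINING ACUTE ACCENT
precomposed = "\u1e31"    # single code point: LATIN SMALL LETTER K WITH ACUTE

# The two spellings differ as raw strings but are identical after NFC
assert decomposed != precomposed
assert to_nfc(decomposed) == to_nfc(precomposed)

# A sequence with no precomposed form stays as base + combining mark
syllabic_r = "r\u0329"    # r with COMBINING VERTICAL LINE BELOW
assert len(to_nfc(syllabic_r)) == 2
```

So a rule written with a precomposed ḱ and a lexicon entry typed as k plus a combining acute end up comparing equal once both are normalized.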
Running HTS
This is a stand-alone Java application - you will need the JRE installed at the moment, but there are no external dependencies. Run it with a script or from the command line. HTS needs three arguments: lexicon, rules, and output, in that order.
Code:
java -jar toolboxSCA.jar PIE_lex.txt PIE-PKK.rules PKK_lex.txt

Rules
The rule format is supposed to be pretty intuitive. You can add comments using % anywhere in a line - anything after % will be ignored. The following characters are reserved and should only be used in commands:
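Code:
# % _ / > ( ) { } * + ? =

HTS uses spaces to delimit lists, but is insensitive to the number of spaces (or tabs). This lets you format your rules into nice blocks if you like.
Variables are simple - unlike in ASCA, there are no restrictions on naming. The following block demonstrates some of what can be done: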
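Code:
Q = kʷʰ kʷ gʷ
K = kʰ k g
KY = cʰ c ɟ
P = pʰ p b
T = tʰ t d
[PLOSIVE] = P T KY K Q
[OBSTRUENT] = [PLOSIVE] s

As you can see, a variable is assigned with a label, the = operator, and a space-delimited list of segments or variables. Also take note of K and KY. When HTS checks for variable names, it will always match the longest one first. So if you also have a variable called Y that might show up after K, and you don't want the pair read as KY, give it a different name.
You can also modify variables by just redefining them, or you can add a symbol to a variable like this: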
Code:
C = C s

This will have the effect of adding s to C.
Transformation rules are a bit more complicated. Unlike programs that use the // notation, HTS uses > for transformation and / to denote the condition:
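Code:
h₁ h₂ h₃ h₄ > ʔ x ɣ ʕ
bʰ dʰ ǵʰ gʰ gʷʰ > pʰ tʰ ḱʰ kʰ kʷʰ
% GRASSMANN'S LAW
CH = pʰ tʰ cʰ kʰ
J = b d ɟ g
CH > J / _R?VV?C*CH
y w > i u / _{C X #}

As in other rule languages, # indicates a word boundary. In conditions, you can use regular expressions. The semantics of ?, *, and + are traditional, but HTS uses sets with {} rather than pipes or square brackets.
You can delete characters using 0: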
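Code:
xa xə > 0 / [LongV]L_
X > 0 / _{C #}
X > 0 / C_
ʔ > 0 / #_

As you can see in the first line, any number of elements on the left can be deleted by a single 0 on the right.
As in the following examples, you can combine segments and variables: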
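Code:
r̩X l̩X > ə̄r ə̄l / _{C #}
r̩X l̩X > ər əl / _V

You can also use two variables in a transformation, but this is discouraged in most cases. In principle, you can do the following, but there are usually better ways to write an equivalent rule.
Code:
NX > Nə / #_C

In conditions, you can use sets. As in ASCA, they are denoted by curly braces {}. As elsewhere, the items listed inside are delimited by spaces. A condition _{C X #} will match if the symbol is followed by C, X, or a word boundary.
Regular Expressions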
The first thing to remember is that in HTS, parentheses do not mean "optional". I'll probably make this configurable in a later version, but for now parentheses are used when you need to apply ?, *, or + to a series of segments or variables, or to separate potentially-conflicting variable names like these: (K)(Y), K(Y), (K)Y.
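To see why the parentheses matter for names like K, Y, and KY, here is a rough Python sketch of longest-first name matching. It is only an illustration of the matching order, not the HTS parser; the function name `tokenize` is made up, and anything that is not a known variable is simply passed through one character at a time.

```python
def tokenize(expression, variables):
    """Split an expression into variable names, parentheses, and other symbols.

    Variable names are tried longest first, so "KY" wins over "K" followed by "Y".
    """
    names = sorted(variables, key=len, reverse=True)
    tokens = []
    i = 0
    while i < len(expression):
        if expression[i] in "()":
            tokens.append(expression[i])  # parentheses always end a name
            i += 1
            continue
        for name in names:
            if expression.startswith(name, i):
                tokens.append(name)
                i += len(name)
                break
        else:
            tokens.append(expression[i])  # not a variable: keep the symbol as is
            i += 1
    return tokens

variables = {"K", "Y", "KY"}
print(tokenize("KY", variables))      # ['KY']
print(tokenize("(K)(Y)", variables))  # ['(', 'K', ')', '(', 'Y', ')']
print(tokenize("K(Y)", variables))    # ['K', '(', 'Y', ')']
```

Without the parentheses the two-letter name is read as a single variable; with them, K and Y are read separately.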
If you are not familiar with regular expressions, the semantics here should be fairly straightforward. ? indicates that the preceding expression (group, segment, or variable) can be matched zero or one times - this is equivalent to using parentheses to make something optional in other notations. * will match the preceding expression zero or more times - using variables from our examples, C* indicates any number of consonants. + is similar, matching an expression one or more times.
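If it helps, the three quantifiers can be sketched as a naive backtracking matcher over a list of segments. This is Python for brevity and is not how HTS does it internally. A pattern is assumed here to be a list of (allowed segments, quantifier) pairs, and the contents of R, V, and C below are invented for the example, since only CH is defined in this post.

```python
def matches(pattern, segments, pos=0):
    """Return True if pattern matches the segments starting at pos.

    Only a prefix has to match; trailing segments are ignored.
    """
    if not pattern:
        return True
    (allowed, quant), rest = pattern[0], pattern[1:]
    here = pos < len(segments) and segments[pos] in allowed
    if quant == "":   # exactly one
        return here and matches(rest, segments, pos + 1)
    if quant == "?":  # zero or one
        return (here and matches(rest, segments, pos + 1)) or matches(rest, segments, pos)
    if quant == "*":  # zero or more
        return (here and matches(pattern, segments, pos + 1)) or matches(rest, segments, pos)
    if quant == "+":  # one or more: one required, then behave like *
        return here and matches([(allowed, "*")] + rest, segments, pos + 1)
    raise ValueError("unknown quantifier: " + quant)

# Hypothetical variable contents, for illustration only
R = {"r", "l"}
V = {"a", "e", "i", "o", "u"}
C = {"p", "t", "k", "s", "r", "l", "pʰ", "tʰ", "kʰ"}
CH = {"pʰ", "tʰ", "cʰ", "kʰ"}

# The part after the underscore in: CH > J / _R?VV?C*CH
condition = [(R, "?"), (V, ""), (V, "?"), (C, "*"), (CH, "")]

print(matches(condition, ["e", "u", "tʰ"]))       # True
print(matches(condition, ["r", "e", "s", "tʰ"]))  # True
print(matches(condition, ["e", "s", "t"]))        # False
```

Note that the second case only succeeds because C* gives back the tʰ it could have consumed, which is the backtracking part.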
Not that you'd be likely to need to, but I built HTS to parse expressions into state machines, so you have a lot of power in writing conditions (provided you don't need back-references). For example, you can write conditions like the following:
Code:
_{ab* (cd?)+ ((ae)*f)+}tr
_{ab {cd xy} ef}tr

though gods protect you if you need the first, and the second has redundant nesting.
Future Development
I've been developing HTS with hooks in the code for future additions, like being able to load additional rules or variable definitions from inside another rule file, or reading and writing lexicons in the same way.
Also, I'd like to add support for compound conditions, like
Code:
X > 0 / _{C #} OR C_

and possibly also exceptions for blocking conditions.
The next step is integrating features into the rule system. The segment-based approach I've taken already supports this and is used by sequence alignment code that I might release soon as well - this uses dynamic programming and a hybrid articulatory-perceptual feature system to align sequences as a first step in researching semi-automated reconstruction.