Haedus SCA - Bugfix (01/24)

Morrígan · Post by **Morrígan** » Sun Nov 03, 2013 5:53 pm

EDIT: Updated Link - updated documentation and additional features to follow in a few days, hopefully.

This is the successor to my old ASCA program that nobody asked for. It is, however, much better tested - I've been working professionally as a computational linguist for the last year and have learned a lot about software design.
I don't have a clever name for it, but it's an SCA and it's part of a set of tools I'm developing for future dissertation research. Let's just call it HTS for now. I'm interested in people's feedback on usability, documentation, and any problems - as stated above, I needed to develop this as part of another project, but I wanted to make it available for public use.

Download Here (2014.01.24) - contains jar, an example batch file, and example PIE to Kuma-Koban lexicon and rules. If you have an ASCA rules file, it will be compatible with this, though because HTS handles combining characters differently, the intent might be slightly off.

I'll try to be as concise as possible here, but there is one really critical thing to understand before using this tool: HTS does not manipulate strings.

It is designed to operate on phonetic segments, and handles combining diacritics and modifier letters intelligently. A rule targeting p will not affect pʰ, ever, under any circumstances. Symbols like <ʰ>, and others like <ˣ>, <ʶ>, combining diacritics, alphanumeric sub- and superscripts, are understood to not be characters on their own, but modifiers on another (preceding) character.

Important Caveat: all diacritics currently combine with the segments they follow, so pre-aspiration as denoted ʰp is not possible at the moment, though it will be once support for phonetic features is added (though it will need to be pre-defined).

Also, rules and lexicon entries are normalized using canonical composition in Unicode, so you don't have to worry about manually converting between forms with combining accents and their precomposed equivalents.
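
To make the segment idea concrete, here is a minimal Python sketch of that kind of segmentation. It is not HTS's actual code: the choice of Unicode categories and the names `is_modifier` and `segment` are assumptions made for illustration.

```python
import unicodedata

# Assumption: these Unicode categories cover the "modifier" symbols listed above
# (Lm = modifier letters such as ʰ ʷ ˣ ʶ; Mn/Mc/Me = combining marks).
MODIFIER_CATEGORIES = {"Lm", "Mn", "Mc", "Me"}

def is_modifier(ch):
    """True if ch should attach to the preceding base character."""
    if unicodedata.category(ch) in MODIFIER_CATEGORIES:
        return True
    # Super- and subscript digits (e.g. the ₁ in h₁) fall outside those categories.
    return ch in "¹²³" or 0x2070 <= ord(ch) <= 0x209F

def segment(word):
    """Split a word into segments after canonical composition (NFC)."""
    segments = []
    for ch in unicodedata.normalize("NFC", word):
        if segments and is_modifier(ch):
            segments[-1] += ch      # attach to the preceding segment
        else:
            segments.append(ch)     # start a new segment
    return segments
```

With this, `segment("pʰat")` gives `["pʰ", "a", "t"]`, so a rule looking for the segment `p` finds nothing to match in that word.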

Running HTS
This is a stand-alone Java application - you will need the JRE installed at the moment, but there are no external dependencies. Run it with a script or from the command line. HTS needs three arguments: lexicon, rules, and output, in that order.

Code: Select all

java -jar toolboxSCA.jar PIE_lex.txt PIE-PKK.rules PKK_lex.txt

Rules
The rule format is supposed to be pretty intuitive. You can add comments using % anywhere in a line - anything after % will be ignored. The following characters are reserved and should only be used in commands:

Code: Select all

# % _ / > ( ) { } * + ? =

HTS uses spaces to delimit lists, but is insensitive to the number of spaces (or tabs). This lets you format your rules into nice blocks if you like.

Variables are simple - unlike in ASCA, there are no restrictions on naming. The following block demonstrates some of what can be done:

Code: Select all

Q  = kʷʰ kʷ gʷ
K  = kʰ  k  g
KY = cʰ  c  ɟ
P  = pʰ  p  b
T  = tʰ  t  d
[PLOSIVE]   = P T KY K Q
[OBSTRUENT] = [PLOSIVE] s

As you can see, a variable is assigned with a label, the = operator, and a space-delimited list of segments or variables. Also take note of K and KY. When HTS checks for variable names, it will always match the longest one first. So basically, if you also have a variable called Y and you think it's ever going to show up after K and don't want it confused with KY, give it a different name.
You can also modify variables by just redefining them, or you can add a symbol to a variable like this:

Code: Select all

C = C s

This will have the effect of adding s to C.
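
As a rough illustration of how such definitions could be processed, the sketch below expands variables eagerly as each line is read. This is an assumption about the behaviour, not HTS's implementation, and `define` is a hypothetical helper name.

```python
def define(variables, line):
    """Apply one 'NAME = a b c' line to the variables dict (in place)."""
    line = line.split("%", 1)[0]           # drop a trailing % comment
    name, _, body = line.partition("=")
    expanded = []
    for token in body.split():             # any run of spaces or tabs delimits
        # A token naming an existing variable is replaced by its contents
        # (assumption: expansion happens at definition time).
        expanded.extend(variables.get(token, [token]))
    variables[name.strip()] = expanded
    return variables
```

Because the right-hand side is expanded before the name is rebound, `C = C s` simply appends `s` to whatever `C` held before.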

Transformation rules are a bit more complicated. Unlike programs that use the // notation, HTS uses > for transformation, and / to denote the condition:

Code: Select all

h₁ h₂ h₃ h₄ > ʔ x ɣ ʕ
bʰ dʰ ǵʰ gʰ gʷʰ > pʰ tʰ ḱʰ kʰ kʷʰ
% GRASSMANN'S LAW
CH = pʰ tʰ cʰ kʰ
J  = b  d  ɟ  g
CH > J / _R?VV?C*CH
y w > i u / _{C X #}

As in other rule languages, # indicates a word-boundary. In conditions, you can use regular expressions. The semantics of ?, *, and + are traditional, but HTS uses sets with {} rather than pipes or square brackets.

You can delete characters using 0:

Code: Select all

xa xə > 0 / [LongV]L_
X > 0 / _{C #}
X > 0 / C_
ʔ  > 0 / #_

As you can see in the first line, any number of elements on the left can be deleted by a single 0 on the right.
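
The pairing of sources with targets, including the single-0 case, can be sketched in Python as follows. `parse_transform` is a hypothetical name, and rejecting mismatched list lengths is an assumption; the post does not say what HTS does there.

```python
def parse_transform(rule):
    """Split 'a b > x y / cond' into ([(source, target), ...], condition)."""
    rule = rule.split("%", 1)[0]                   # drop a trailing % comment
    transform, _, condition = rule.partition("/")
    left, _, right = transform.partition(">")
    sources = left.split()
    targets = ["" if t == "0" else t for t in right.split()]   # 0 means "delete"
    if targets == [""]:
        targets = [""] * len(sources)              # a single 0 deletes every source
    if len(sources) != len(targets):
        # Assumption: mismatched lists are treated as an error in this sketch.
        raise ValueError("source/target count mismatch: " + rule.strip())
    return list(zip(sources, targets)), condition.strip()
```

For `xa xə > 0 / [LongV]L_` this yields the pairs `("xa", "")` and `("xə", "")` with the condition `[LongV]L_`.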

As in the following examples, you can combine segments and variables:

Code: Select all

r̩X l̩X > ə̄r ə̄l   / _{C #}
r̩X l̩X > ər əl / _V

You can also use two variables in a transformation, but this is discouraged in most cases. In principle, you can do the following, but there are usually better ways to write an equivalent rule.

Code: Select all

NX > Nə / #_C

In conditions, you can use sets. Like in ASCA, they are denoted by curly braces {}. As in other places, the items listed inside are delimited by spaces. A condition _{C X #} will match if the symbol is followed by C, X, or a word-boundary.

Regular Expressions
The first thing to remember is that in HTS, parentheses do not mean "optional". I'll probably make this configurable in a later version, but for now parentheses are used when you need to apply ?, *, or + to a series of segments or variables, or to separate potentially-conflicting variable names like these: (K)(Y), K(Y), (K)Y.

If you are not familiar with regular expressions, the semantics here should be fairly straightforward. ? indicates that the preceding expression (group, segment, or variable) can be matched zero or one times - this is equivalent to the use of parentheses to make something optional. * will match the preceding expression zero or more times - using variables from our examples, C* indicates any number of consonants. + is similar, matching an expression one or more times.
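
One way to get a feel for these operators is to translate a context into an ordinary Python regular expression. The sketch below does that with a hypothetical `to_regex` helper. It is a swapped-in technique, not how HTS works internally: it operates on characters rather than segments, so it loses the p vs. pʰ guarantee, and it assumes spaces only appear inside {} sets.

```python
import re

def to_regex(expr, variables, boundary="$"):
    """Translate an HTS-style context (the part after _) into a Python regex string."""
    # Normalise whitespace so that, inside {...}, one space separates alternatives.
    expr = re.sub(r"\s+", " ", expr.strip())
    expr = re.sub(r"\{ ", "{", expr)
    expr = re.sub(r" \}", "}", expr)
    names = sorted(variables, key=len, reverse=True)    # longest variable name first
    out, i = [], 0
    while i < len(expr):
        name = next((n for n in names if expr.startswith(n, i)), None)
        if name is not None:
            alts = sorted(variables[name], key=len, reverse=True)
            out.append("(?:" + "|".join(map(re.escape, alts)) + ")")
            i += len(name)
            continue
        ch = expr[i]
        if ch in "{(":
            out.append("(?:")           # sets and groups both become regex groups
        elif ch in "})":
            out.append(")")
        elif ch == " ":
            out.append("|")             # a space inside a set separates alternatives
        elif ch in "?*+":
            out.append(ch)              # same meaning as in ordinary regexes
        elif ch == "#":
            out.append(boundary)        # word boundary ($ for a following context)
        else:
            out.append(re.escape(ch))
        i += 1
    return "".join(out)
```

Given variables `K`, `Y` and `KY`, the input `KY` resolves to the two-letter variable, while `(K)(Y)` resolves to `K` followed by `Y`, which is the separating role of parentheses described above.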

Not that you'd be likely to need to, but I built HTS to parse expressions into state machines, so you have a lot of power in writing conditions (provided you don't need back-references). For example, you can write conditions like the following:

Code: Select all

_{ab* (cd?)+ ((ae)*f)+}tr
_{ab {cd xy} ef}tr

though gods protect you if you need the first, and the second has redundant nesting.

Future Development
I've been developing HTS with hooks in the code for future additions, like being able to load additional rules or variable definitions from inside another rule file, or reading and writing lexicons in the same way.

Also, I'd like to add support for compound conditions, like

Code: Select all

X > 0 / _{C #} OR C_

and possibly also exceptions for blocking conditions.

The next step is integrating features into the rule system. The segment-based approach I've taken already supports this and is used by sequence alignment code that I might release soon as well - this uses dynamic programming and a hybrid articulatory-perceptual feature system to align sequences as a first step in researching semi-automated reconstruction.

Herra Ratatoskr · Post by **Herra Ratatoskr** » Mon Nov 04, 2013 9:18 pm

Sweet! I've been looking for a good SCA, and hopefully this will cover what I need it to do. Thanks for the work you put into this.

Bristel · Post by **Bristel** » Mon Nov 04, 2013 9:20 pm

Is it supposed to work with Mac? I tried opening the jar file but nothing happens.

Morrígan · Post by **Morrígan** » Mon Nov 04, 2013 9:50 pm

Herra Ratatoskr wrote:Sweet! I've been looking for a good SCA, and hopefully this will cover what I need it to do. Thanks for the work you put into this.

Thanks, I hope you find it useable.

Bristel wrote:Is it supposed to work with Mac? I tried opening the jar file but nothing happens.

Yes, as long as you have the JRE 1.6 or later installed - and you almost certainly do - it will work. It sounds like you are trying to execute the jar directly, which isn't the correct action here. You need to run it from the console, or use a shell script / batch file like the one I provided. If you don't run it through the console, you won't see what the error message is.

Bristel · Post by **Bristel** » Tue Nov 05, 2013 12:22 am

Morrígan wrote:
Bristel wrote:Is it supposed to work with Mac? I tried opening the jar file but nothing happens.
Yes, as long as you have the JRE 1.6 or later installed - and you almost certainly do - it will work. It sounds like you are trying to execute the jar directly, which isn't the correct action here. You need to run it from the console, or use a shell script / batch file like the one I provided. If you don't run it through the console, you won't see what the error message is.

Ah ok, thanks. I'll try to figure it out from there. I'm not familiar with the Terminal program, but I can get my roommate to help, he's the Unix/Mac guy in my house. LOL

Basilius · Post by **Basilius** » Wed Nov 06, 2013 12:47 pm

It looks better than most SCA's I've seen previously...

However, from the description I couldn't infer (perhaps because I lack focus) how I can do one of the following two things.

(1) Add a secondary stress mark to every third unstressed syllable nucleus in an arbitrarily long sequence of posttonic syllables, counting left-to-right.

(2) Delete the nucleus of every odd-numbered syllable in an arbitrarily long sequence of contiguous syllables previously marked as subject to reduction, counting right-to-left*.

*Essentially, Havlík’s Law.

?

Basilius · Post by **Basilius** » Wed Nov 06, 2013 1:08 pm

...Oh, and something more fundamental.

Does it support negative categories (sets)? Like, how can I write "a string of any number of symbols which are neither vowels nor alveolar plosives"?

Also,

Morrígan wrote:Important Caveat: all diacritics currently combine with the segments they follow, so pre-aspiration as denoted ʰp is not possible at the moment, though it will be once support for phonetic features is added (though it will need to be pre-defined).

Also, rules and lexicon entries are normalized using canonical composition in unicode, so you don't have to worry about manually converting between forms with combining accents and their precomposed equivalents.

Ideally, I'd prefer to be able to switch these two features off.

Morrígan · Post by **Morrígan** » Wed Nov 06, 2013 2:13 pm

Basilius wrote:(1) Add a secondary stress mark to every third unstressed syllable nucleus in an arbitrarily long sequence of posttonic syllables, counting left-to-right.

Assuming a CV(N) syllable, I think the following should work - it turns out to be less straightforward than I originally thought. Also, you'd need to either define variables, or just list the individual symbols

Code: Select all

[Unstressed] > [Secondary] / #(CVN?)(CVN?)C_
[Unstressed] > [Secondary] / #(CVN?CVN?)+(CVN?)C_

Basilius wrote:(2) Delete the nucleus of every odd-numbered syllable in an arbitrarily long sequence of contiguous syllables previously marked as subject to reduction, counting right-to-left.

Using v for a marked vowel,

Code: Select all

v > 0 / C_N?(CvN?CvN?)*#

I haven't tested these, but they would certainly be my first try.
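
Since the rule is untested, a quick way to see what its condition selects is to mimic it with Python's `re` module. This is only an approximation under stated assumptions: C is taken to be p, t, k and N to be n, all matches are judged against the original word (simultaneous application), and `delete_marked` is a hypothetical name.

```python
import re

# Stand-ins (assumptions): C = p t k, N = n, and "v" is the marked vowel.
# The lookbehind is the C before _, the lookahead is N?(CvN?CvN?)*#.
HAVLIK = re.compile(r"(?<=[ptk])v(?=n?(?:[ptk]vn?[ptk]vn?)*$)")

def delete_marked(word):
    """Delete every v that satisfies the C_N?(CvN?CvN?)*# condition."""
    return HAVLIK.sub("", word)
```

Under these assumptions `delete_marked("tvtvtv")` returns `"ttvt"`: the first and third marked vowels counting from the end are deleted, the second survives. A word like `"tvta"` comes back unchanged, because the condition requires nothing but marked syllables between the target and the word end.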

Does it support negative categories (sets)? Like, how can I write "a string of any number of symbols which are neither vowels nor alveolar plosives"?

Actually no, nor can you use the "any character" . symbol. I don't think this would be difficult to add, however.

Ideally, I'd prefer to be able to switch these two features off.

Yeah, I think that's reasonable, and easily implemented. I've thought about a number of flags that could be added to the beginning of a rules file, including the commands that load lexicons.

ObsequiousNewt · Post by **ObsequiousNewt** » Wed Nov 06, 2013 4:12 pm

What's the best way to implement cheshirization, like this?

Morrígan · Post by **Morrígan** » Wed Nov 06, 2013 4:45 pm

ObsequiousNewt wrote:What's the best way to implement cheshirization, like this?

I have no idea what you are referring to, and I don't understand the rules described there.

Basilius · Post by **Basilius** » Thu Nov 07, 2013 1:42 pm

Morrígan wrote:
Basilius wrote:(1) Add a secondary stress mark to every third unstressed syllable nucleus in an arbitrarily long sequence of posttonic syllables, counting left-to-right.
Assuming a CV(N) syllable, I think the following should work - it turns out to be less straightforward than I originally thought. Also, you'd need to either define variables, or just list the individual symbols
Code: Select all
[Unstressed] > [Secondary] / #(CVN?)(CVN?)C_
[Unstressed] > [Secondary] / #(CVN?CVN?)+(CVN?)C_

Wait... I read the first line as saying "replace [Unstressed] with [Secondary] in the third syllable counting from word beginning".

And the second, as "replace [Unstressed] with [Secondary] in every syllable whose number is 2n+2, counting from word beginning".

If I'm wrong, how do you write *these*?

OTOH while looking at your examples, I realized the power of your scripting language! OK, let me try...

Assuming that all nuclei are vowels, and accents are marked with spacing acute (main) and spacing grave (secondary) put after the vowel, I'd write "add secondary accent to the next vowel after the stressed one" like this:

Code: Select all

0 > ` / ´[Nonvowel]*[Vowel]_

- and "add secondary accent to the third vowel after the stressed one" like this:

Code: Select all

0 > ` / ´[Nonvowel]*[Vowel][Nonvowel]*[Vowel][Nonvowel]*[Vowel]_

- and "add secondary accent to every third vowel after the stressed one" like this:

Code: Select all

0 > ` / ´([Nonvowel]*[Vowel][Nonvowel]*[Vowel][Nonvowel]*[Vowel])+_

Do you think this will work?

(Also, I hope there are no recursive bits in your algorithm, since "0 > [something]" could also be applied to places where [something] has already been added, potentially driving the program into an endless cycle; but I'm sure you've taken care of this...)

Basilius wrote:(2) Delete the nucleus of every odd-numbered syllable in an arbitrarily long sequence of contiguous syllables previously marked as subject to reduction, counting right-to-left.
Using v for a marked vowel,
Code: Select all
v > 0 / C_N?(CvN?CvN?)*#

I read this one as "delete every v whose number, counting from word end, is odd"; nearly what I meant, but it won't delete a vowel separated from end of word by a syllable whose vowel is not reducible. Correct?

My own attempt:

Code: Select all

0 > ; / v_[nonvowel]*([truevowel] #) 
% ";" is a placeholder to mark ends of v-syllable strings, [truevowel] is a [vowel] other than "v".
v > 0 / _([nonvowel]*v[nonvowel]*v)*[nonvowel]*(; #)
; > 0 / _

Will this work?

But I'm impressed anyway. Your operators are powerful!

Does it support negative categories (sets)? Like, how can I write "a string of any number of symbols which are neither vowels nor alveolar plosives"?
Actually no, nor can you use the "any character" . symbol. I don't think this would be difficult to add, however.

In my experience, negative categories (and a couple operators enabling the user to merge/subtract categories) are extremely useful. Things like "^V-´" ("neither vowel nor accent mark") appear in, like, every second line in my SC's.

You may want to look at how GSCA treats this. (And so far GSCA has been the only Really Working SCA for me... although it has, mmmmmmm, some undocumented peculiarities...)

Ideally, I'd prefer to be able to switch these two features off.
Yeah, I think that's reasonable, and easily implemented. I've thought about a number of flags that could be added to the beginning of a rules file, including the commands that load lexicons.

That would be cool!

Also, may I mention a feature I've missed since ever? The ability to redefine the pre-set bits (categories etc.) in the middle of my SC file. Like, the full list of vowels appearing in one set of my SC's can be, like, dozens, so checking its completeness after each minor revision is a pain; OTOH suppose I can just redefine "V" as the vowel inventory I have at *this* particular stage of my language's development - it would be so much easier to keep things consistent... Also, suppose a few last lines in my SC file just produce a practical transcription which uses "y" for /j/...

alice · Post by **alice** » Thu Nov 07, 2013 2:24 pm

This is not very helpful, but my SCA already does most of this since its last rewrite... unfortunately it's embedded in a much bigger program for handling vocabulary in general.

Morrígan · Post by **Morrígan** » Thu Nov 07, 2013 2:55 pm

Basilius wrote:
Morrígan wrote:
Basilius wrote:(1) Add a secondary stress mark to every third unstressed syllable nucleus in an arbitrarily long sequence of posttonic syllables, counting left-to-right.
Assuming a CV(N) syllable, I think the following should work - it turns out to be less straightforward than I originally thought. Also, you'd need to either define variables, or just list the individual symbols
Code: Select all
[Unstressed] > [Secondary] / #(CVN?)(CVN?)C_
[Unstressed] > [Secondary] / #(CVN?CVN?)+(CVN?)C_
Wait... I read the first line as saying "replace [Unstressed] with [Secondary] in the third syllable counting from word beginning".
And the second, as "replace [Unstressed] with [Secondary] in every syllable whose number is 2n+2, counting from word beginning".

I'm pretty sure I messed up here, actually. The second rule would mean another two syllables, I think.

Basilius wrote:
Code: Select all
0 > ` / ´[Nonvowel]*[Vowel]_
- and "add secondary accent to the third vowel after the stressed one" like this:
Code: Select all
0 > ` / ´[Nonvowel]*[Vowel][Nonvowel]*[Vowel][Nonvowel]*[Vowel]_
- and "add secondary accent to every third vowel after the stressed one" like this:
Code: Select all
0 > ` / ´([Nonvowel]*[Vowel][Nonvowel]*[Vowel][Nonvowel]*[Vowel])+_
Do you think this will work?

You can't actually use 0 like this right now, but I've considered adding it - there is room in the code.

Basilius wrote:
Basilius wrote:(2) Delete the nucleus of every odd-numbered syllable in an arbitrarily long sequence of contiguous syllables previously marked as subject to reduction, counting right-to-left.
Using v for a marked vowel,
Code: Select all
v > 0 / C_N?(CvN?CvN?)*#
I read this one as "delete every v whose number, counting from word end, is odd"; nearly what I meant, but it won't delete a vowel separated from end of word by a syllable whose vowel is not reducible. Correct?

Correct - I originally thought that's what you'd meant, but forgot to change it after I'd realized.

Basilius wrote:My own attempt:

Code: Select all

0 > ; / v_[nonvowel]*([truevowel] #) 
% ";" is a placeholder to mark ends of v-syllable strings, [truevowel] is a [vowel] other than "v".
v > 0 / _([nonvowel]*v[nonvowel]*v)*[nonvowel]*(; #)
; > 0 / _

Will this work?

Sets have {} not (). But otherwise (aside from 0 > ;) yeah, that looks right for what you described.

Basilius wrote:In my experience, negative categories (and a couple operators enabling the user to merge/subtract categories) are extremely useful. Things like "^V-´" ("neither vowel nor accent mark") appear in, like, every second line in my SC's.

I'm not totally comfortable with doing this myself. I just tend to write rules that are as specific as possible - which is also why I designed HTS to deal with segments, so that rules would map as directly as possible to phonetic processes.

Basilius wrote:You may want to look at how GSCA treats this. (And so far GSCA has been the only Really Working SCA for me... although it has, mmmmmmm, some undocumented peculiarities...)

I considered using this, but I found the syntax upsetting.

Basilius · Post by **Basilius** » Fri Nov 08, 2013 3:28 pm

Morrígan wrote:
Basilius wrote:In my experience, negative categories (and a couple operators enabling the user to merge/subtract categories) are extremely useful. Things like "^V-´" ("neither vowel nor accent mark") appear in, like, every second line in my SC's.
I'm not totally comfortable with doing this myself. I just tend to write rules that are as specific as possible - which is also why I designed HTS to deal with segments, so that rules would map as directly as possible to phonetic processes.

I think I understand you. But..

In the SC's for a conlang of mine, I have this rule:

* Pharyngealization becomes suprasegmental, spreading left-to-right from every pharyngeal(ized) segment until it runs into either word boundary or one of two segments that block its spread (X and Y).

Writing this without a way to say "any symbol other than X or Y" would be a pain.
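
To make the request concrete, this is the behaviour being described, written as plain Python rather than in any SCA syntax. The mark ˤ, the blockers X and Y, and the function name `spread` are placeholders chosen for illustration, and the word is assumed to be already split into segments.

```python
def spread(segments, mark="ˤ", blockers=("X", "Y")):
    """Spread `mark` rightwards from each marked segment until a blocker or the word end."""
    out, spreading = [], False
    for seg in segments:
        if seg in blockers:
            spreading = False        # a blocker stops the spread and stays plain
        elif mark in seg:
            spreading = True         # a pharyngeal(ized) segment starts a spread
        elif spreading:
            seg += mark              # any other segment picks up the mark
        out.append(seg)
    return out
```

The `elif spreading` branch is exactly the "any symbol other than X or Y" case: it applies to everything that is neither a blocker nor already marked.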

Morrígan · Post by **Morrígan** » Fri Nov 08, 2013 4:07 pm

Basilius wrote:
Morrígan wrote:
Basilius wrote:In my experience, negative categories (and a couple operators enabling the user to merge/subtract categories) are extremely useful. Things like "^V-´" ("neither vowel nor accent mark") appear in, like, every second line in my SC's.
I'm not totally comfortable with doing this myself. I just tend to write rules that are as specific as possible - which is also why I designed HTS to deal with segments, so that rules would map as directly as possible to phonetic processes.
I think I understand you. But..

In the SC's for a conlang of mine, I have this rule:

* Pharyngealization becomes suprasegmental, spreading left-to right from every pharyngeal(ized) segment until it runs into either word boundary or one of two segments that block its spread (X and Y).

Writing this without a way to say "any symbol other than X or Y" would be a pain.

Agreed, there are some really good reasons to include that, and I don't think it will be difficult to support. Another thing I might add is a way to have rules be processed from right-to-left (default is always left-to-right), on a per-rule basis.

There are a few tweaks like this I can squeeze in pretty quickly. I can't do features for a while, since there are some conceptual issues I need to work out. I should have time this weekend to make most of the improvements you've suggested. Most of these are things I left room for in the code.

My only high-level concern is that I model phonetics in a way that pretends that suprasegmentals don't exist. Everything occurs at the level of segments.

KathTheDragon · Post by **KathTheDragon** » Fri Nov 08, 2013 5:26 pm

Morrígan wrote:Another thing I might add is a way to have rules be processed from right-to-left (default is always left-to-right), on a per-rule basis.

This.

Herra Ratatoskr · Post by **Herra Ratatoskr** » Wed Dec 11, 2013 12:33 am

In the mood for a little tech support?

I'm trying to run a few changes, and I'm getting an error:

Code: Select all

Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: 0 
    at org.haedusfc.soundchnage.commands.Condition.<init><Condition.java:49>
    at org.haedusfc.soundchange.commands.Rule.<init><Rule.java:49>
    at org.haedusfc.soundchange.SoundChangeApplier.processLexicon<SoundChangeApplier.java:78>
    at org.haedusfc.soundchange.Main.main<Main.java:36>

I'm guessing I probably messed something up in my rule file, but I'm not sure what to look for. Would you like me to upload the rule file, or does the error message suggest where I might have gone wrong?

Morrígan · Post by **Morrígan** » Wed Dec 11, 2013 9:55 am

Herra Ratatoskr wrote:In the mood for a little tech support?

I'm trying to run a few changes, and I'm getting an error:
Code: Select all
Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: 0 
    at org.haedusfc.soundchnage.commands.Condition.<init><Condition.java:49>
    at org.haedusfc.soundchange.commands.Rule.<init><Rule.java:49>
    at org.haedusfc.soundchange.SoundChangeApplier.processLexicon<SoundChangeApplier.java:78>
    at org.haedusfc.soundchange.Main.main<Main.java:36>
I'm guessing I probably messed something up in my rule file, but I'm not sure what to look for. Would you like me to upload the rule file, or does the error message suggest where I might have gone wrong?

This is probably a bug in my code, and probably not anything that you did wrong. Usually any mistakes you make will produce an intelligible error message saying which of your rules is bad. It's almost certainly that I failed to account for some case and it blew up.

[REDACTED]

EDIT: Wait, I had a thought. Do you have a rule that ends with "/ _" for 'unconditioned'? I just created a test, and it looks like I forgot to make sure this was OK to do. I intended the following to be equivalent

Code: Select all

a > b / _
a > b

but it looks like I forgot to make sure the former was handled correctly.
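
The intended equivalence can be sketched in Python like this, with a hypothetical `split_condition` helper that treats a missing condition and a bare `_` the same way. It says nothing about what the real Condition.java does.

```python
def split_condition(condition):
    """Return (before, after) contexts; no condition or a bare '_' gives two empty contexts."""
    condition = condition.strip()
    if not condition:
        return "", ""                      # 'a > b' with no '/ ...' part
    before, underscore, after = condition.partition("_")
    if not underscore:
        raise ValueError("condition needs a '_' placeholder: " + condition)
    return before.strip(), after.strip()
```

As a side note, one way an empty-array error can arise in Java is that `"_".split("_")` returns a zero-length array, since trailing empty strings are dropped. Whether that is what happened here is only a guess.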

Also as a general development note, I did figure out how to implement the "anything not [X OR Y]" functionality, but have only started implementing it. Not sure when I can get a version out, things have been a bit nuts lately and I've spent a lot of my free time playing two characters simultaneously in EVE...

Herra Ratatoskr · Post by **Herra Ratatoskr** » Wed Dec 11, 2013 4:09 pm

That was it. My changes are working now. Just curious, will you ever be open sourcing this? I'd love to be able to play around with this a bit.

Morrígan · Post by **Morrígan** » Tue Dec 17, 2013 10:13 am

Herra Ratatoskr wrote:Just curious, will you ever be open sourcing this? I'd love to be able to play around with this a bit.

At some point relatively soon, yes. It has always been my intention to provide open-source tools when I came up with the idea of the Haedus* group. At the moment though, the SCA code shares a Maven module with research that is unpublished and may not be for several years, so I need to split these up, possibly into completely different projects, though it might make versioning a pain.

*That is, Haedus: Fabrica Codicis or Haedus FC, which I realized is a slightly unfortunate name, as it is definitively not a football club.

Morrígan · Post by **Morrígan** » Mon Dec 23, 2013 10:01 am

So, I decided to actually look at the code for the first time in weeks and fixed that bug, so you can legally write unconditioned rules with or without a "/ _"

If I have the time or energy, I'll see if I can continue working on getting negatives in the code. I refactored the state-machine code to allow it, but it's going to take some more effort to amend the parser used to generate the machines. Additionally, getting some of the other features, like controlling segmentation and normalization, is a little harder to do than anticipated because of the structure of the code, but I've looked into that and started taking steps to make changes.

Herra Ratatoskr · Post by **Herra Ratatoskr** » Tue Dec 24, 2013 4:49 pm

Sounds like fun

I'm looking forward to it!

chris_notts · Post by **chris_notts** » Thu Dec 26, 2013 12:36 pm

Morrígan wrote:So, I decided to actually look at the code for the first time in weeks and fixed that bug, so you can legally write unconditioned rules with or without a "/ _"

If I have the time or energy, I'll see if I can continue working on getting negatives in the code. I refactored the state-machine code to allow it, but it's going to take some more effort to amend the parser used to generate the machines. Additionally, getting some of the other features, like controlling segmentation and normalization, is a little harder to do than anticipated because of the structure of the code, but I've looked into that and started taking steps to make changes.

You implemented your regular expressions using state machines then? When I wrote HaSC, my implementation of regular expressions was basically via a simple recursive function that walked through a syntax tree of the regular expression matching as it went. The theoretical worst case performance of this is terrible, but I found that because words are relatively short, for any sound change I could come up with performance was acceptable. And the code is also a lot easier to write, modify and add new features to than a more sophisticated approach would be. It also makes implementing features such as back-references trivial which are a nightmare in other approaches.

Of course, using a lazy language (Haskell) also makes the performance of this approach more acceptable than it would be written naively in Java.

A snippet of the code is here to demonstrate what I mean (some of it is messy, but it works so I leave it alone):

Code: Select all

-- define how to match anything
pMatch Any currState = [currState {msRem = tail currRem, msIsStart = False} | not (null currRem)]
  where currRem = msRem currState

-- define how to match against an individual phone
pMatch (Match incOthers fets capture) currState = [newState | matches]
    where 
        newState = currState {msRem = tail currRem, msFetCap = I.union capVars currFetCap, 
                              msIsStart = False }
        matches = not (null currRem) && phoneMatches currFetCap incOthers fets phoneVals
        capVars = populateCaptures capture phoneVals

        phoneVals = getPhoneVals (head currRem)        
        currRem = msRem currState 
        currFetCap = msFetCap currState

-- define how to not match a pattern
pMatch (Not nAdvance pattern) currState 
    = [currState { msRem = newRem, msIsStart = currIsStart && nAdvance == 0 } | matches] 
    where
        matches = null (pMatch pattern currState) && length currWord >= nAdvance
        currIsStart = msIsStart currState
        currWord = msRem currState
        newRem = drop nAdvance currWord

-- define how to match a word boundary
pMatch WordStart currState = [currState | msIsStart currState]
pMatch WordEnd currState   = [currState | null (msRem currState)]

-- define how to match a sequence
pMatch (Sequence xs) currState = L.foldl' genNewStates [currState] xs    
  where genNewStates currStates currPattern = L.concatMap (pMatch currPattern) currStates

-- define how to capture a pattern match
pMatch (Capture id pattern) currState = map update (pMatch pattern currState)
  where    
    update x = x {msPhoneCap = I.insert id diff phoneCap, 
                  msMatchCap = I.insert id (toMatch diff) matchCap} 
        where diff = Sequence $ listDiff (msRem x)
              phoneCap = msPhoneCap x
              matchCap = msMatchCap x    

    initRem      = msRem currState    
    initLength   = length initRem
    listDiff x   = take (initLength - length x) initRem 

-- match alternatives
pMatch (Alt xs) currState = L.concatMap (`pMatch` currState) xs
    
-- define how to match repetitions. This is a replacement for an older implementation
-- which was a lot harder to understand.       
pMatch (Rep greedy minRep maxRep pattern) currState
    = concat $ if greedy then reverse matches else matches
    where
        matches = take (maxRep - minRep + 1) .
                  drop minRep .                    
                  takeWhile (not . null) . 
                  iterate genNewState $ [currState]

        genNewState = L.concatMap (pMatch pattern)
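
For readers who don't read Haskell, the same tree-walking idea can be sketched in Python. This is a much-reduced analogue, not a port of HaSC: there are no features, captures, negation or greediness, patterns are plain tuples, and every name here is made up for the example. Like `pMatch`, it returns all the ways a pattern can match from a given position, here as a set of end positions.

```python
def match(pattern, word, pos):
    """Return the set of positions where `pattern` can end when started at `pos`."""
    kind = pattern[0]
    if kind == "seg":                      # ("seg", "a"): one literal segment
        ok = pos < len(word) and word[pos] == pattern[1]
        return {pos + 1} if ok else set()
    if kind == "end":                      # ("end",): word boundary at the end
        return {pos} if pos == len(word) else set()
    if kind == "seq":                      # ("seq", p1, p2, ...): one after another
        ends = {pos}
        for part in pattern[1:]:
            ends = {e for p in ends for e in match(part, word, p)}
        return ends
    if kind == "alt":                      # ("alt", p1, p2, ...): any one of them
        return {e for part in pattern[1:] for e in match(part, word, pos)}
    if kind == "rep":                      # ("rep", p, n): p repeated at least n times
        _, part, min_count = pattern
        ends = {pos} if min_count == 0 else set()
        frontier, count = {pos}, 0
        while frontier:
            frontier = {e for p in frontier for e in match(part, word, p)}
            count += 1
            if count >= min_count:
                if frontier <= ends:       # nothing new is reachable: stop
                    break
                ends |= frontier
        return ends
    raise ValueError("unknown pattern kind: " + str(kind))
```

Here `word` is a list of segments, so `("seg", "p")` does not match the segment `"pʰ"`. The worst case is poor for the same reason noted above, but words are short.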

Morrígan · Post by **Morrígan** » Thu Dec 26, 2013 3:03 pm

chris_notts wrote: You implemented your regular expressions using state machines then? When I wrote HaSC, my implementation of regular expressions was basically via a simple recursive function that walked through a syntax tree of the regular expression matching as it went. The theoretical worst case performance of this is terrible, but I found that because words are relatively short, for any sound change I could come up with performance was acceptable. And the code is also a lot easier to write, modify and add new features to than a more sophisticated approach would be. It also makes implementing features such as back-references trivial which are a nightmare in other approaches.

I kind of adapted the approach used by Thompson. You're right though that it may be overkill, since the domain isn't likely to produce worst-case machines. When I was developing ASCA, I took an approach more like yours, only I did a terrible job of it.

In part, I wanted to do a full DFA implementation because it would be more robust and should be able to accommodate possible future requirements, and because it seemed like a fun project. Parsing and matching is done in under 500 lines of code. I actually use expression trees as part of the translation process - maybe it really would be easier (and possibly quicker in most cases) to use this directly.

What I have isn't too complicated really, since adding things like negatives only requires changes to the code that translates a given condition into an expression tree, and the expression tree into a machine.
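
For anyone unfamiliar with the Thompson-style approach, the rough idea of turning an expression tree into a machine can be sketched as follows. This is not HTS's code, which is not shown in this thread: the `Expr` and `NFA` types and every function name are hypothetical, and the sketch simulates the NFA directly instead of converting it to a DFA.

```haskell
import Data.List (nub)

-- Hypothetical expression tree (an assumption for illustration).
data Expr
    = Sym Char        -- one segment
    | Cat Expr Expr   -- sequence
    | Or Expr Expr    -- alternation
    | Star Expr       -- zero or more repetitions

-- Edges labelled Nothing are epsilon moves.
type Edge = (Int, Maybe Char, Int)
data NFA = NFA { start :: Int, accept :: Int, edges :: [Edge] }

-- Build a fragment using state numbers from n upward;
-- also return the next unused state number.
build :: Int -> Expr -> (NFA, Int)
build n (Sym c) = (NFA n (n + 1) [(n, Just c, n + 1)], n + 2)
build n (Cat a b) =
    let (fa, n1) = build n a
        (fb, n2) = build n1 b
        link = (accept fa, Nothing, start fb)
    in (NFA (start fa) (accept fb) (link : edges fa ++ edges fb), n2)
build n (Or a b) =
    let (fa, n1) = build (n + 2) a
        (fb, n2) = build n1 b
        new = [ (n, Nothing, start fa), (n, Nothing, start fb)
              , (accept fa, Nothing, n + 1), (accept fb, Nothing, n + 1) ]
    in (NFA n (n + 1) (new ++ edges fa ++ edges fb), n2)
build n (Star a) =
    let (fa, n1) = build (n + 2) a
        new = [ (n, Nothing, start fa), (n, Nothing, n + 1)
              , (accept fa, Nothing, start fa), (accept fa, Nothing, n + 1) ]
    in (NFA n (n + 1) (new ++ edges fa), n1)

-- All states reachable from the given ones by epsilon moves alone.
closure :: NFA -> [Int] -> [Int]
closure nfa states
    | null new  = states
    | otherwise = closure nfa (states ++ new)
  where
    new = nub [ t | (s, Nothing, t) <- edges nfa
                  , s `elem` states, t `notElem` states ]

-- Advance the whole set of live states over one input symbol.
step :: NFA -> [Int] -> Char -> [Int]
step nfa states c =
    closure nfa (nub [ t | (s, Just x, t) <- edges nfa, s `elem` states, x == c ])

-- True if the machine built from the expression accepts the whole input.
accepts :: Expr -> String -> Bool
accepts e input = accept nfa `elem` foldl (step nfa) (closure nfa [start nfa]) input
  where (nfa, _) = build 0 e
```

Because the simulation carries a set of live states, it never backtracks, which is where the protection against worst-case patterns comes from. Adding a new construct means adding one constructor to the tree and one `build` clause.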

chris_notts · Post by **chris_notts** » Thu Dec 26, 2013 4:20 pm

Morrígan wrote: I kind of adapted the approach used by Thompson. You're right though that it may be overkill, since the domain isn't likely to produce worst-case machines. When I was developing ASCA, I took an approach more like yours, only I did a terrible job of it.

One reason I like Haskell is that it makes walking trees etc. very easy because of pattern matching. If you want to write an interpreter and don't care too much about high performance, it is very easy in Haskell.

Morrígan wrote: In part, I wanted to do a full DFA implementation because it would be more robust and should be able to accommodate possible future requirements, and because it seemed like a fun project. Parsing and matching is done in under 500 lines of code. I actually use expression trees as part of the translation process - maybe it really would be easier (and possibly quicker in most cases) to use this directly.

What I have isn't too complicated really, since adding things like negatives requires making changes to the code which translates a given condition into an expression tree, and the expression tree into a machine.

Negation in the general case needs a little thinking about. In regular expressions, the most common case is generally negating a class of characters (= individual tokens). This is easy: match any character not in the list and advance by a single token. In HaSC you can negate any pattern, which makes it a bit more complicated. For example, how many characters should the following expression consume: (C+)~ ? Or what about (a | ba | cama)~ ? My answer was that the negation should advance by the length of the minimal possible match for the expression being negated. So:

(C+)~ = advance by 1 token if the next token is not a consonant (i.e. the same as C~)
(a | ba | cama)~ = advance by 1 token if the next tokens are not 'a', 'ba', or 'cama', because the minimal match is 'a' which is 1 token long
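
The rule in the two examples above can be sketched roughly like this. It is only an illustration, not HaSC's source: the `Pattern` type and the names `minLen` and `pMatch` as written here are assumptions, and failing when fewer than the minimal number of tokens remain is my guess, since the post does not cover that case.

```haskell
-- Hypothetical pattern type (an assumption, not HaSC's actual one).
data Pattern
    = Tok Char          -- one literal token
    | Class String      -- any one token from a class, e.g. consonants
    | Sq [Pattern]      -- sub-patterns in order
    | Alt [Pattern]     -- alternatives (assumed non-empty)
    | Plus Pattern      -- one or more repetitions
    | Neg Pattern       -- negation

-- Length of the shortest input the pattern can match.
minLen :: Pattern -> Int
minLen (Tok _)   = 1
minLen (Class _) = 1
minLen (Sq ps)   = sum (map minLen ps)
minLen (Alt ps)  = minimum (map minLen ps)
minLen (Plus p)  = minLen p
minLen (Neg p)   = minLen p

-- Every remaining input reachable by matching; [] means failure.
pMatch :: Pattern -> String -> [String]
pMatch (Tok c) (x:rest) | c == x = [rest]
pMatch (Tok _) _ = []
pMatch (Class cs) (x:rest) | x `elem` cs = [rest]
pMatch (Class _) _ = []
pMatch (Sq ps) s  = foldl (\ss p -> concatMap (pMatch p) ss) [s] ps
pMatch (Alt ps) s = concatMap (`pMatch` s) ps
-- Assumes the repeated pattern always consumes at least one token.
pMatch (Plus p) s =
    let once = pMatch p s
    in once ++ concatMap (pMatch (Plus p)) once
pMatch (Neg p) s
    | not (null (pMatch p s)) = []   -- inner pattern matched: negation fails
    | length s < n            = []   -- assumption: too few tokens left to advance
    | otherwise               = [drop n s]
  where
    n = minLen p
```

With `alts = Alt [Tok 'a', Sq [Tok 'b', Tok 'a'], Sq (map Tok "cama")]`, `minLen alts` is 1, so `pMatch (Neg alts) "bo"` advances one token and returns `["o"]`, while `pMatch (Neg alts) "ba"` returns `[]`.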

Another tricky issue with negation is backreferences and captured values. For example:

((1=C)1)~

(1=C)1 means "match two identical consonants". This is a tricky one because in order to correctly calculate the minimum length of the pattern as a whole, you would need to know the minimum length of 1, which means storing it. But the minimum length of a backreference might not even be locally known if the backreference comes from another part of the sound change, rather than within the negated pattern itself. Therefore, currently HaSC does not correctly compute the minimum match length when backreferences are involved, and instead assumes any backreference has a minimum match length of zero, even though in this case clearly the correct behaviour should be to advance two tokens on failure to match.
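
To make the difference concrete, here is a small sketch of the two ways of computing the minimum length. The `Pattern` type and both function names are hypothetical. The second function is only one possible fix, and it covers just bindings made earlier inside the same pattern; a backreference bound elsewhere in the sound change still counts as zero.

```haskell
-- Hypothetical pattern type (an assumption, not HaSC's actual one).
data Pattern
    = Class String       -- any one token from a class
    | Sq [Pattern]       -- sub-patterns in order
    | Bind Int Pattern   -- (n=P): match P and remember it as n
    | Ref Int            -- backreference to binding n

-- The behaviour described above: every backreference counts as zero tokens.
minLenZeroRefs :: Pattern -> Int
minLenZeroRefs (Class _)  = 1
minLenZeroRefs (Sq ps)    = sum (map minLenZeroRefs ps)
minLenZeroRefs (Bind _ p) = minLenZeroRefs p
minLenZeroRefs (Ref _)    = 0

-- Alternative: carry the minimum length of each binding seen so far.
-- Unknown bindings (e.g. made outside this pattern) still count as zero.
minLenTracked :: Pattern -> Int
minLenTracked = fst . go []
  where
    go :: [(Int, Int)] -> Pattern -> (Int, [(Int, Int)])
    go env (Class _)  = (1, env)
    go env (Sq ps)    = foldl step (0, env) ps
      where step (n, e) p = let (m, e') = go e p in (n + m, e')
    go env (Bind k p) = let (m, env') = go env p in (m, (k, m) : env')
    go env (Ref k)    = (maybe 0 id (lookup k env), env)
```

For `Sq [Bind 1 (Class "ptk"), Ref 1]`, i.e. (1=C)1, `minLenZeroRefs` returns 1 and `minLenTracked` returns 2.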

Another issue is that any backreference captured in a negated pattern is lost. Suppose the next two tokens are "ba", for example. ((1=C)1)~ will successfully match the first character before failing on the second, so the negation as a whole succeeds. But the binding of 1 to "b" will be lost because of the way negation is implemented, and because I think there's a danger in allowing the users access to backreferences which might not be bound to anything. E.g. if the next two tokens were "ae", the pattern match would have failed immediately and 1 would never be bound, so whether 1 has any value at all would vary from word to word.

EDIT: you could instead just have negative lookahead, which consumes nothing when the inner pattern fails to match, but I think that's more confusing for users than negatives that also consume tokens.
