Raul Miller: Regular Expressions in K


Notes

It's also possible to implement regular expression support in K. Regular expression implementations tend to involve arrays and finite state machines, so K is almost an ideal language for working with regular expressions.

One approach to regular expressions involves "deterministic" state transitions (where the state transition corresponds to a character from some string of characters), and "non-deterministic" state transitions (where the state transition happens without any corresponding character).

In this code, I initially repreent a regular expression as an array where the first dimension represents a "state index", the second dimension is indexed by character number, and the third dimension represents the set of states which are reached from that "state index" and that "character number".

The state machine itself represents a set of states. The initial state is 0 (which is always a "non-deterministic" state), and the final state is the one with the highest numbered state index.

Here's an example regular expression matching some floating point numbers:

num: rep[re[,"0123456789"]]
sign: maybe[re[,"+-"]]
fp: fix seq[(sign;num;maybe[seq[(re[,"."];num)]];maybe[seq[(re[,"Ee"];sign;num)]])]

match[fp; "1"] 		/ 1
match[fp; "3.1415e0"] 	/ 1
match[fp; "dog"] 	/ 0

This would display most of the interesting state transitions:

fp[;256,_ic " 1-e"]

Note that this example (fp), while relatively simple to implement, is very bulky. It's possible to use more compact representations. For example, instead of indexing by character index, you could identify character classes where all characters within a class are treated identically.

And you might get rid of all states which have a non-deterministic state transition. It's worthwhile thinking about why this could work. The most important thing to note: the only "non-deterministic state" I care about is state 0 -- that's where I find the initial deterministic states.

So Shrink creates an alternate regular expression data representation which is a three element array:

[0] the list of starting states
[1] index by character number to get character class
[2] first dimension is indexed by character class
    second dimension is indexed by (new) state index
    third dimension is list of (new) states

And, Match uses this alternative representation.

Thus:

FP: Shrink fp
Match[FP; "1"]		/ 1
Match[FP; "3.1415e0"]	/ 1
Match[FP; "dog"]	/ 0

It's possible to extract subexpressions, as well. I've not yet implemented this, but here's a design sketch:

Regular expressions have associated with them a dictionary of subexpressions (initially empty). Each subexpression also has an offset corresponding to the terminal state for that subexpression.

When combining regular expressions, this offset needs to be updated.

When matching regular expressions, use Next\ instead of Next/ so that the terminal states associated with subexpressions can be located.

Run the corresponding subexpression backwards from the character for that terminal state to find the corresponding begining.