Rx Interface

[FIXME: this is taken from Gary and Mark's quick summaries and should be reviewed and expanded. Rx is pretty stable, so could already be done!]

Guile includes an interface to Tom Lord's Rx library (currently only to POSIX regular expressions). Using the library is a two-step process: compile a regular expression into an efficient structure, then use that structure in any number of string comparisons.

For example, given the regular expression `abc.' (which matches any string containing `abc' followed by any single character):

guile> (define r (regcomp "abc."))
guile> r
#<rgx abc.>
guile> (regexec r "abc")
#f
guile> (regexec r "abcd")
#((0 . 4))
guile>

The definitions of regcomp and regexec are as follows:

primitive: regcomp pattern [flags]
Compile the regular expression pattern using POSIX rules. The optional flags argument should be specified using the following symbolic names:
Variable: REG_EXTENDED
use extended POSIX syntax
Variable: REG_ICASE
use case-insensitive matching
Variable: REG_NEWLINE
allow anchors to match after newline characters in the string, and prevent . and [^...] from matching newlines.

The logior procedure can be used to combine multiple flags. The default is POSIX basic syntax, in which + and ? are literals and \+ and \? are operators. Backslashes in pattern must be doubled when the pattern is written as a literal string, e.g., "\\(a\\)\\?".
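The flag and escaping rules above can be sketched as follows. This is a minimal illustration that uses only regcomp, regexec, logior, and the flags described in this section; the patterns and the names basic-rx and ext-rx are invented for the example.

```scheme
;; Default (basic) syntax: grouping and the "optional" operator are written
;; \( \) and \?, and every backslash is doubled inside the string literal.
(define basic-rx (regcomp "\\(a\\)\\?"))

;; Extended syntax, case-insensitive: + and ? act as operators without
;; backslashes.  logior merges the two flags into one argument.
(define ext-rx (regcomp "ab+c?" (logior REG_EXTENDED REG_ICASE)))

;; The compiled structures can then be reused for any number of matches.
(regexec basic-rx "a")
(regexec ext-rx "xABBy")
```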

primitive: regexec regex string [match-pick] [flags]

Match string against the compiled POSIX regular expression regex. match-pick and flags are optional. Possible flags (which can be combined using the logior procedure) are:

Variable: REG_NOTBOL
The beginning-of-line operator won't match the beginning of string (presumably because it's not the beginning of a line).

Variable: REG_NOTEOL
Similar to REG_NOTBOL, but prevents the end-of-line operator from matching the end of string.
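As a sketch of how these flags might be passed: the example assumes the optional arguments are positional, as the signature suggests, so a match-pick value has to be supplied before flags. The pattern and the name anchored-rx are illustrative only.

```scheme
;; "^abc" is anchored at the beginning of a line.
(define anchored-rx (regcomp "^abc"))

;; Ordinary call: the anchor may match at the start of the string.
;; #t as match-pick selects the default return value.
(regexec anchored-rx "abcd" #t)

;; With REG_NOTBOL the start of the string is not treated as the start of
;; a line, so the anchor is not allowed to match there.
(regexec anchored-rx "abcd" #t REG_NOTBOL)

;; Both flags combined with logior.
(regexec anchored-rx "abcd" #t (logior REG_NOTBOL REG_NOTEOL))
```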

If no match is possible, regexec returns #f. Otherwise match-pick determines the return value:

#t or unspecified: a newly allocated vector is returned, containing pairs with the indices of the matched part of string and of any matched substrings.

"": a list is returned: the first element contains a nested list with the matched part of string surrounded by the unmatched parts. Remaining elements are matched substrings (if any). All returned substrings share memory with string.

#f: regexec returns #t if a match is made, otherwise #f.

vector: the supplied vector is returned, with the first element replaced by a pair containing the indices of the matched portion of string and further elements replaced by pairs containing the indices of matched substrings (if any).

list: a list will be returned, with each member of the list specified by a code in the corresponding position of the supplied list:

a number: the numbered matching substring (0 for the entire match).

#\<: the beginning of string to the beginning of the part matched by regex.

#\>: the end of the matched part of string to the end of string.

#\c: the "final tag", which seems to be associated with the "cut operator", which does not seem to be available through the POSIX interface.

For example, (list #\< 0 1 #\>). The returned substrings share memory with string.
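The match-pick variants described above might be exercised as in the following sketch, which reuses the pattern from the start of this section. The subject string and the names r, s, and slots are invented for illustration; make-vector is the standard Scheme procedure.

```scheme
(define r (regcomp "abc."))
(define s "xxabcdyy")

;; #f: only report whether a match exists.
(regexec r s #f)

;; #t (or omitted): a new vector of index pairs.
(regexec r s #t)

;; "": a list whose first element holds the match with its surrounding text.
(regexec r s "")

;; A caller-supplied vector: its elements are replaced by index pairs.
(define slots (make-vector 1 #f))
(regexec r s slots)

;; A list of codes: text before the match, the whole match, text after it.
(regexec r s (list #\< 0 #\>))
```

Because the list and "" forms return substrings that share memory with string, it is probably safest to copy a returned substring before modifying it.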

Here are some other procedures that might be used when using regular expressions:

primitive: compiled-regexp? obj
Test whether obj is a compiled regular expression.

primitive: regexp->dfa regex [flags]

primitive: dfa-fork dfa

primitive: reset-dfa! dfa

primitive: dfa-final-tag dfa

primitive: dfa-continuable? dfa

primitive: advance-dfa! dfa string

