diff options
Diffstat (limited to 'subprojects/rerex/README.md')
-rw-r--r-- | subprojects/rerex/README.md | 158 |
1 files changed, 158 insertions, 0 deletions
diff --git a/subprojects/rerex/README.md b/subprojects/rerex/README.md new file mode 100644 index 00000000..a2571cee --- /dev/null +++ b/subprojects/rerex/README.md @@ -0,0 +1,158 @@ +Rerex +===== + +Rerex (REally REgular eXpressions) is a simple and efficient NFA-based regular +expression implementation in C99. + +Rerex parses a regular expression into an NFA, then simulates execution using +Thompson's algorithm, so matching is linear in the size of the input. For +details on the theory behind this approach, see [Programming Techniques: +Regular expression search algorithm], the classic paper by Ken Thompson. +[Regular Expression Matching Can Be Simple And Fast] is also a good read, which +compares this approach with those used in most modern regular expression +implementations. The matching algorithm in Rerex is effectively the same as +the one described there, though with a slightly different implementation. +Parsing and construction are completely different, using a recursive-descent +parser which constructs the automata on the fly. + +This is a minimalist implementation with a limited feature set which is mainly +suitable for basic applications like checking markup or programming language +syntax. It does, however, strive to be clear and high-quality code which is a +bit more fully-featured than a simple example, and suitable for real-world use. +For example, character set expressions are supported, and meaningful errors are +reported with the offset of the problematic character. + +The generated automata are not always optimal, though there are a few simple +optimizations that avoid generating redundant states. Though there is room for +improvement there (at the cost of additional complexity), matching performs +relatively well. + +Implementation +-------------- + +The implementation is clean and warning-free C99, tested and known to work with +at least clang, GCC, and MSVC on POSIX, MacOS, and Windows. It weighs in at +under 600 [SLOC][], which can be compiled to about 5k on x86. + +Additionally, a simple test suite is included which covers 100% of the code. +It is run continuously on all of the above mentioned platforms on x64, as well +as in Linux on 32 and 64 bit ARM via emulation. + +Supported Grammar +----------------- + +Rerex supports a basic subset of classic regular expression syntax that is +supported by almost all implementations. The grammar is written here in [XML +EBNF notation][] with terminals in uppercase: + + DOT ::= '.' + OPERATOR ::= '*' | '+' | '?' + SPECIAL ::= DOT | OPERATOR | '(' | ')' | '[' | ']' | '^' | '{' | '|' | '}' + ESCAPABLE ::= SPECIAL | '-' + ESCAPE ::= '\' SPECIAL + CHAR ::= ESCAPE | [#x20-#x7E] - SPECIAL + ELEMENT ::= ([#x20-#x7E] - ']') | ('\' ']') + Range ::= ELEMENT | ELEMENT '-' ELEMENT + Set ::= Range | Range Set + Atom ::= CHAR | DOT | '(' Expr ')' | '[' Set ']' + Factor ::= Atom | Atom OPERATOR + Term ::= Factor | Factor Term + Expr ::= Term | Term '|' Expr + +Notable Limitations +------------------- + +Rerex is not meant to replace fully-featured regular expression libraries, so +the feature set is somewhat limited. The most glaring omissions include: + + - Only supports printable ASCII input + - Only supports anchored matching (no back references or group extraction) + - Only reads from a null-terminated string (not, for example, files) + - No support for counted replication with `{}` + +Should I Use This? +------------------ + +If you just want to understand how regular expressions work, and have some +clean code that's easy to understand and tinker with, sure. + +If you need a minimal regular expression matching implementation in portable C +for basic ASCII tasks, maybe. + +If you need a fully-featured regular expression implementation for +international text with group extraction and other more advanced features, +probably not. + +In any case, if you do, letting me know or linking to this page would be +appreciated. + +Dependencies +------------ + +None, except the C standard library. + +Building +-------- + +Rerex consists only of a single header and configuration-free source file, so +building it with any compiler manually should be straightforward. It should, +in theory, work just about anywhere. + +A [Meson][] build definition is included which can be used to do a proper +system installation with a `pkg-config` file, generate IDE projects, run the +tests, and so on. For example, the library and tests can be built and run like +so: + + meson setup -Dtests=true build + cd build + ninja test + +See the [Meson documentation][] for more details on using Meson. + +Usage +----- + +Rerex has a small API based around two objects: a _pattern_, which is an +immutable compiled representation of a regular expression, and a _matcher_ +which can be used to match strings against a pattern. See [the +header](include/rerex/rerex.h) for details, there is no external documentation +aside from this README. + +This simple example function uses everything in the API to provide a high-level +string-based match function: + +```c +bool matches(const char* regexp, const char* string) +{ + RerexPattern* pattern = NULL; + size_t end = 0; + RerexStatus st = rerex_compile(regexp, &end, &pattern); + + if (st) { + fprintf(stderr, "(string):%zu: error: %s\n", end, rerex_strerror(st)); + return false; + } + + RerexMatcher* matcher = rerex_new_matcher(pattern); + bool matches = rerex_match(matcher, string); + + rerex_free_matcher(matcher); + rerex_free_pattern(pattern); + + return matches; +} +``` + + -- David Robillard <d@drobilla.net> + +[XML EBNF notation]: http://www.w3.org/TR/REC-xml/#sec-notation + +[SLOC]: https://en.wikipedia.org/wiki/Source_lines_of_code + +[Meson]: https://mesonbuild.com/ + +[Meson documentation]: https://mesonbuild.com/Quick-guide.html + +[Programming Techniques: Regular expression search algorithm]: https://dl.acm.org/doi/10.1145/363347.363387 + +[Regular Expression Matching Can Be Simple And Fast]: https://swtch.com/~rsc/regexp/regexp1.html |