aboutsummaryrefslogtreecommitdiffstats
path: root/subprojects/rerex/README.md
diff options
context:
space:
mode:
Diffstat (limited to 'subprojects/rerex/README.md')
-rw-r--r--subprojects/rerex/README.md158
1 files changed, 158 insertions, 0 deletions
diff --git a/subprojects/rerex/README.md b/subprojects/rerex/README.md
new file mode 100644
index 00000000..a2571cee
--- /dev/null
+++ b/subprojects/rerex/README.md
@@ -0,0 +1,158 @@
+Rerex
+=====
+
+Rerex (REally REgular eXpressions) is a simple and efficient NFA-based regular
+expression implementation in C99.
+
+Rerex parses a regular expression into an NFA, then simulates execution using
+Thompson's algorithm, so matching is linear in the size of the input. For
+details on the theory behind this approach, see [Programming Techniques:
+Regular expression search algorithm], the classic paper by Ken Thompson.
+[Regular Expression Matching Can Be Simple And Fast] is also a good read, which
+compares this approach with those used in most modern regular expression
+implementations. The matching algorithm in Rerex is effectively the same as
+the one described there, though with a slightly different implementation.
+Parsing and construction are completely different, using a recursive-descent
+parser which constructs the automata on the fly.
+
+This is a minimalist implementation with a limited feature set which is mainly
+suitable for basic applications like checking markup or programming language
+syntax. It does, however, strive to be clear and high-quality code which is a
+bit more fully-featured than a simple example, and suitable for real-world use.
+For example, character set expressions are supported, and meaningful errors are
+reported with the offset of the problematic character.
+
+The generated automata are not always optimal, though there are a few simple
+optimizations that avoid generating redundant states. Though there is room for
+improvement there (at the cost of additional complexity), matching performs
+relatively well.
+
+Implementation
+--------------
+
+The implementation is clean and warning-free C99, tested and known to work with
+at least clang, GCC, and MSVC on POSIX, MacOS, and Windows. It weighs in at
+under 600 [SLOC][], which can be compiled to about 5k on x86.
+
+Additionally, a simple test suite is included which covers 100% of the code.
+It is run continuously on all of the above mentioned platforms on x64, as well
+as in Linux on 32 and 64 bit ARM via emulation.
+
+Supported Grammar
+-----------------
+
+Rerex supports a basic subset of classic regular expression syntax that is
+supported by almost all implementations. The grammar is written here in [XML
+EBNF notation][] with terminals in uppercase:
+
+ DOT ::= '.'
+ OPERATOR ::= '*' | '+' | '?'
+ SPECIAL ::= DOT | OPERATOR | '(' | ')' | '[' | ']' | '^' | '{' | '|' | '}'
+ ESCAPABLE ::= SPECIAL | '-'
+ ESCAPE ::= '\' SPECIAL
+ CHAR ::= ESCAPE | [#x20-#x7E] - SPECIAL
+ ELEMENT ::= ([#x20-#x7E] - ']') | ('\' ']')
+ Range ::= ELEMENT | ELEMENT '-' ELEMENT
+ Set ::= Range | Range Set
+ Atom ::= CHAR | DOT | '(' Expr ')' | '[' Set ']'
+ Factor ::= Atom | Atom OPERATOR
+ Term ::= Factor | Factor Term
+ Expr ::= Term | Term '|' Expr
+
+Notable Limitations
+-------------------
+
+Rerex is not meant to replace fully-featured regular expression libraries, so
+the feature set is somewhat limited. The most glaring omissions include:
+
+ - Only supports printable ASCII input
+ - Only supports anchored matching (no back references or group extraction)
+ - Only reads from a null-terminated string (not, for example, files)
+ - No support for counted replication with `{}`
+
+Should I Use This?
+------------------
+
+If you just want to understand how regular expressions work, and have some
+clean code that's easy to understand and tinker with, sure.
+
+If you need a minimal regular expression matching implementation in portable C
+for basic ASCII tasks, maybe.
+
+If you need a fully-featured regular expression implementation for
+international text with group extraction and other more advanced features,
+probably not.
+
+In any case, if you do, letting me know or linking to this page would be
+appreciated.
+
+Dependencies
+------------
+
+None, except the C standard library.
+
+Building
+--------
+
+Rerex consists only of a single header and configuration-free source file, so
+building it with any compiler manually should be straightforward. It should,
+in theory, work just about anywhere.
+
+A [Meson][] build definition is included which can be used to do a proper
+system installation with a `pkg-config` file, generate IDE projects, run the
+tests, and so on. For example, the library and tests can be built and run like
+so:
+
+ meson setup -Dtests=true build
+ cd build
+ ninja test
+
+See the [Meson documentation][] for more details on using Meson.
+
+Usage
+-----
+
+Rerex has a small API based around two objects: a _pattern_, which is an
+immutable compiled representation of a regular expression, and a _matcher_
+which can be used to match strings against a pattern. See [the
+header](include/rerex/rerex.h) for details, there is no external documentation
+aside from this README.
+
+This simple example function uses everything in the API to provide a high-level
+string-based match function:
+
+```c
+bool matches(const char* regexp, const char* string)
+{
+ RerexPattern* pattern = NULL;
+ size_t end = 0;
+ RerexStatus st = rerex_compile(regexp, &end, &pattern);
+
+ if (st) {
+ fprintf(stderr, "(string):%zu: error: %s\n", end, rerex_strerror(st));
+ return false;
+ }
+
+ RerexMatcher* matcher = rerex_new_matcher(pattern);
+ bool matches = rerex_match(matcher, string);
+
+ rerex_free_matcher(matcher);
+ rerex_free_pattern(pattern);
+
+ return matches;
+}
+```
+
+ -- David Robillard <d@drobilla.net>
+
+[XML EBNF notation]: http://www.w3.org/TR/REC-xml/#sec-notation
+
+[SLOC]: https://en.wikipedia.org/wiki/Source_lines_of_code
+
+[Meson]: https://mesonbuild.com/
+
+[Meson documentation]: https://mesonbuild.com/Quick-guide.html
+
+[Programming Techniques: Regular expression search algorithm]: https://dl.acm.org/doi/10.1145/363347.363387
+
+[Regular Expression Matching Can Be Simple And Fast]: https://swtch.com/~rsc/regexp/regexp1.html