Tag Archives: regular expressions

NLT is complete

After an ugly start — a bug which was introduced with bugfix for another bug — I managed to go straight to the last missing piece in NLT, table driven lexer. It does not mean NLT is finished, but it is complete — all structural features are there and the table for lexer was the last one.

User can specify to compile all regex and string patterns while building a lexer into one big table. When the lexer runs there are no longer repetitions, classes, ranges, just a table with transitions. I was afraid of handling Unicode ranges, but after reading “Regular Expression Matching in the Wild” by Russ Cox I knew what to do, especially I was not building general purpose regex engine, but simplified multi regex one (no back references for example).

Without any optimization at all it is twice as fast as previous approach — checking patterns one by one during runtime. And I didn’t notice any difference in time when building the lexer, which is rather good news.

From now on, NLT is more for maintenance — fixing bugs, optimization, better error reporting, and so on.


Hard time with optimization

It is about time to think about NLT performance — the results I saw in profiler output really surprised me. I thought that parser is the culprit of poor performance (I tested regular LALR), but except for creating huge logs I was mistaken. Running regular expressions against input (lexer) takes most of the time.

Another surprise is how a tiny line can be a major drag — converting enum to int takes about 9% of total time. This is actually due to flaws in C# — firstly lack of enum constraint on generic types. All I have is some struct (despite I know it is an enum really) so I have to cast data to object and then back to int. In effect I have unboxing just after boxing, which should lead to no-op — but it does not ¹.

All in all, I managed to shave running time to around 30% of original performance — I will keep enum as it is now, for regular expressions I am thinking about writing my own engine capable of matching multiple regexes at the same time (the whole purpose of the lexer is to pick up the longest match). Since it is a challenge not to be taken lightly (time) I moved to other issues.

I am not a fan of using ad-hoc symbols, but I see its purpose — it is easier to explain some production without introducing extra code. NLT grammar looked too heavy for my taste, so I added support for literals used as terminals in parser productions. One can write:

expr Expr -> e1:expr "+" e2:expr

instead of introducing beforehand appropriate terminal like “PLUS”.

¹ C# non-boxing conversion of generic enum to int?.

Tagged , , ,