Category Archives: NLT

Bitter retreat — unfolding aliases

Some time ago I wrote about a new way to compare states in a DFA. Take, for example, two states:

expr -> PLUS . expr { func1 };
expr -> MINUS . expr { func1 }; 

and argue they can be merged in the DFA because, among other things, they have exactly the same action. Well, I was wrong — sure, the code is shared, the code is the same, but the execution path (flow) can vary. And the difference hinges on such a basic thing as the action parameters — if the data associated with PLUS and MINUS are not used, the merger is possible. If they are used (the most likely case), you have to create separate states.

So the correction to the action criterion is this — if the action is the same and it does not use any of the parameters already read, we can merge.
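As a rough sketch of the corrected criterion (in Python rather than NLT's C#, with made-up field names — `action_params` standing for the parameters the action code actually reads):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Item:
    # a hypothetical DFA item: symbols consumed so far, the action code,
    # and the names of the parameters the action actually reads
    read_symbols: tuple
    action: str
    action_params: frozenset = frozenset()

def can_merge(a: Item, b: Item) -> bool:
    # the same action code is necessary but not sufficient...
    if a.action != b.action:
        return False
    # ...none of the already-consumed symbols may be read by the action,
    # otherwise the PLUS item and the MINUS item would feed different
    # data into the shared code
    return not (a.action_params & set(a.read_symbols)) \
       and not (b.action_params & set(b.read_symbols))

plus = Item(("PLUS",), "func1", frozenset({"PLUS"}))
minus = Item(("MINUS",), "func1", frozenset({"MINUS"}))
print(can_merge(plus, minus))  # False: func1 reads its parameter
```

With the parameters unused the same pair would merge — which is exactly why this weaker criterion lets far fewer states collapse.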

Such a modification makes the merger criterion much weaker, and it is no longer possible to stop aliases from exploding when unfolding them. Thus I had to revert almost all of my work. Well, life…


Taming the conflicts — aliases

Some time ago I noticed that even an innocent rule which makes the grammar clearer and terser can, surprisingly, introduce conflicts. This got me thinking — if the grammar is ambiguous in its essence, that's OK, but when it is all about rewriting rules such behaviour is simply unacceptable. It is like forbidding temporary variables in a programming language.

I approached this issue only as far as needed to solve the problem — namely, I look for identity aliases:

elem -> namedElem | anonElem;

and unfold them. The production:

id -> LBRACE elem RBRACE { $elem };

becomes:

id -> LBRACE namedElem RBRACE { $namedElem };
id -> LBRACE anonElem RBRACE { $anonElem };

Obviously there is room for improvement here (for starters, unfolding wrappers like `id` above as well), but for now it is enough (Skila is waiting).

Nevertheless, that move caused a new problem — unfolding aliases multiplied the productions and all the tables exploded. The Skila parser went from 874 entries to 2214, and even if I didn't mind, Mono refused to run such a behemoth. Unfolding is surely an operation that should be possible, so if I get an explosion of nodes it means the node merging is too picky. So I revised how the DFA is created.

When merging DFA nodes it is always said that all the states of the two nodes have to be equal. So far I haven't seen a definition of state equality other than comparing the production and the number of symbols read (LR(0) state equality). Consider these states:

comp -> LPAREN . expr RPAREN { func1 };
expr -> . LPAREN NUM RPAREN { func2 };

Different productions, different number of symbols read — it is easy to see they are not equal. But what about:

expr -> PLUS . expr { func1 };
expr -> MINUS . expr { func2 };

Also not equal — yet this “natural” notion of equality is deceptive, because it wastes a lot of space. The production itself should not matter. What matters is the symbol on the left-hand side, the symbols that remain to be read, and the action the user entered.

The states in both previous examples are indeed not equal, but the following ones are:

expr -> PLUS . expr { func1 };
expr -> MINUS . expr { func1 }; // note: the same function

OK, “equal” might be too strong a word — they are compatible, but that is enough for a merger. They expect the same reduction, they target the same node after the reduction, they wait for the same input, and the action which will be executed is the same. This is all we need (currently I also add the number of symbols read to the mix, because of the error recovery mechanism — as soon as I fix it I will get rid of this factor as well).
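A minimal sketch of this compatibility relation (Python, with hypothetical field names — NLT's actual internals differ):

```python
def compatible(a, b):
    # two items may be merged when they agree on: the LHS symbol (same
    # target node after reduction), the symbols still to be read (same
    # expected input), and the user action
    return (a["lhs"], a["to_read"], a["action"]) == \
           (b["lhs"], b["to_read"], b["action"])

# expr -> PLUS . expr { func1 }  vs  expr -> MINUS . expr { func1 }
a = {"lhs": "expr", "to_read": ("expr",), "action": "func1"}
b = {"lhs": "expr", "to_read": ("expr",), "action": "func1"}
print(compatible(a, b))  # True: different productions, yet compatible
```

Note that the production name and the already-consumed prefix (PLUS vs MINUS) simply do not appear in the comparison.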

And the results? With the new merge algorithm and unaliasing turned off, the number of entries was reduced by 10%. After enabling alias unfolding I got an additional gain of almost 1%. In raw numbers, it went from 874 to 782 nodes.

All in all, not bad — automatically removing conflicts in the grammar led to a reduction of the table entries. The code is uploaded, so you can take it for a test drive.


When GLR is looking at the horizon

Technically GLR works fine, but it annoys me that I have to postpone some decisions until run time. Consider this Skila code:

abstract class...
static def... // function
static let... // field

and grammar for it:

class_mod = ε | class_mod ABSTRACT | class_mod STATIC | ...
class = class_mod CLASS ...
func_mod = ε | func_mod ABSTRACT | func_mod STATIC | ...
func = func_mod DEF ...
field = STATIC? LET 

This creates shift-reduce/reduce-reduce conflicts, and when parsing `abstract override static ...` GLR forks on each keyword, waiting for the right moment to kill the incorrect parse trees. Could we do better? Yes — we know in advance that we are waiting for the `class`, `def` or `let` keyword. All we have to do is incorporate that knowledge into the generated parser, and instead of forking the tree we will check the incoming data.

To achieve this we have to make only small modifications to the parser generator — along with the first sets and follow sets we add two new ones. The cover set for a given symbol is the bag of symbols which appear on the right-hand side of the productions for this symbol (directly or indirectly). For example the cover set for `func` would contain, among other things, `DEF`, `func_mod` and `ABSTRACT`. The second new set is — surprise, surprise — the horizon set. It is very similar to the follow set, but unlike it, it does not mindlessly tell us what can show up right after a given symbol (like `ABSTRACT` for `class_mod`). The horizon set ignores reductions/expansions like the ones shown in the first and third lines of the grammar and focuses solely on true reductions of the symbol — reductions which cannot be recursively triggered again. So for `class_mod` the horizon set has only one symbol — `CLASS`.
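The cover set computation is a plain transitive-closure fixpoint; a small sketch in Python (the grammar encoding and the names are mine, not NLT's):

```python
def cover_sets(grammar):
    # cover set of a symbol: every symbol reachable on the RHS of its
    # productions, directly or indirectly (transitive-closure fixpoint)
    cover = {lhs: set() for lhs in grammar}
    changed = True
    while changed:
        changed = False
        for lhs, productions in grammar.items():
            for rhs in productions:
                for sym in rhs:
                    add = {sym} | cover.get(sym, set())
                    if not add <= cover[lhs]:
                        cover[lhs] |= add
                        changed = True
    return cover

# fragment of the grammar from the post (terminals are uppercase)
grammar = {
    "func_mod": [[], ["func_mod", "ABSTRACT"], ["func_mod", "STATIC"]],
    "func": [["func_mod", "DEF"]],
}
c = cover_sets(grammar)
print(sorted(c["func"]))  # ['ABSTRACT', 'DEF', 'STATIC', 'func_mod']
```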

Having the first sets, follow sets and cover sets, computing the horizon sets is easy — ignore all productions whose left-hand side (LHS) is covered by the given symbol, and for the other productions take the first symbols of whatever comes right after our symbol. If that yields too little data, take the rest from the follow set of the LHS of the current production. In other words, it is exactly the same algorithm as for the follow sets, only this time we narrow down the productions with the help of the cover sets, and since the follow sets are already computed we don't have to bother with recurrent computation — we just grab the data we need.
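The same algorithm can be sketched like this (Python; `first`, `follow`, `cover` and `nullable` are assumed precomputed, and the whole encoding is hypothetical):

```python
def horizon_set(sym, grammar, cover, first, follow, nullable):
    # horizon of `sym`: terminals visible after a *true* reduction of
    # `sym`, skipping productions whose LHS is covered by `sym` (the
    # recursive expansions such as `class_mod -> class_mod ABSTRACT`)
    horizon = set()
    for lhs, productions in grammar.items():
        if lhs in cover.get(sym, set()):
            continue  # `sym` covers this LHS: not a true reduction
        for rhs in productions:
            for i, s in enumerate(rhs):
                if s != sym:
                    continue
                tail = rhs[i + 1:]
                for t in tail:  # first symbols of what follows `sym`...
                    horizon |= first[t]
                    if t not in nullable:
                        break
                else:  # ...padded from follow(LHS) when the tail can vanish
                    horizon |= follow[lhs]
    return horizon

grammar = {
    "class_mod": [[], ["class_mod", "ABSTRACT"], ["class_mod", "STATIC"]],
    "class": [["class_mod", "CLASS"]],
}
first = {t: {t} for t in ("ABSTRACT", "STATIC", "CLASS")}
cover = {"class_mod": {"class_mod", "ABSTRACT", "STATIC"},
         "class": {"class_mod", "ABSTRACT", "STATIC", "CLASS"}}
follow = {"class_mod": {"CLASS"}, "class": set()}
nullable = {"class_mod"}
print(horizon_set("class_mod", grammar, cover, first, follow, nullable))  # {'CLASS'}
```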

For each non-terminal the horizon set is checked against its cover set — if the two sets overlap, the horizon set is scrapped as unusable. Further computations are performed only when all the involved horizon sets are usable (non-empty).

Armed with all four sets we are ready to resolve ambiguities. In case of a reduce-reduce conflict we check whether the horizons overlap — if they don't, we can use this data at run time. Shift-reduce resolution is a bit more elaborate — having a production in the shift state, we process the symbols of its RHS one by one, testing whether the cover set of the current symbol overlaps, on one hand, with the horizon of the reduction and, on the other, with the LHS of the reduction (when the verification of the entire production is unsuccessful we make one final check — instead of the cover set of the current symbol we use the after-lookaheads). As previously, no overlap means the conflict is resolved — this time using not only the horizon of the reduce action, but also the non-overlapping symbol, added as a cut-off marker to avoid excessive reading of the input.

In our example, to resolve the reduce-reduce conflict at run time, `class_mod` and `func_mod` would run over the stream of input symbols until they find their horizons — `CLASS` or `DEF` respectively; whichever is found first, the appropriate reduction wins. However, because we have a shift production in the mix, we stop short on `LET` (this is the cut-off for both reduce actions), and if that happens the shift action wins.
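The run-time part then boils down to a simple scan of the upcoming input; a sketch under the same assumptions (names and encoding are mine):

```python
def resolve_by_horizon(upcoming, horizons, cutoff):
    # scan the upcoming input: the first reduce action whose horizon
    # symbol shows up wins; hitting a cut-off symbol first means the
    # shift action wins instead
    for token in upcoming:
        if token in cutoff:
            return "shift"
        for action, horizon in horizons.items():
            if token in horizon:
                return action
    return None  # undecided: fall back to forking, as in plain GLR

horizons = {"reduce class_mod": {"CLASS"}, "reduce func_mod": {"DEF"}}
cutoff = {"LET"}
print(resolve_by_horizon(["ABSTRACT", "STATIC", "DEF"], horizons, cutoff))  # reduce func_mod
print(resolve_by_horizon(["STATIC", "LET"], horizons, cutoff))              # shift
```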

The code is uploaded, so take a look if you are interested in the implementation.


The moment of truth — code generation

Yesterday I decided enough is enough and it is time to focus on code generation. I cannot say Skila is currently feature-rich, but it has all the core elements I consider essential for a start — self types, extension methods, enums, default methods in interfaces, lambdas, covariance, variadic parameters with limits, and generics. New features will be added once the compiler gets more mature and I have actual working code on my first platform of choice — PHP.

Of course I am not busy only with Skila but with NLT as well — I've just uploaded a new version of the suite; after half a year of work it was about time.


NLT is complete

After an ugly start — a bug which was introduced with a bugfix for another bug — I managed to go straight to the last missing piece of NLT, the table-driven lexer. This does not mean NLT is finished, but it is complete — all the structural features are there, and the lexer table was the last one.

The user can specify that, while building the lexer, all regex and string patterns are compiled into one big table. When the lexer runs there are no longer repetitions, classes or ranges — just a table with transitions. I was afraid of handling Unicode ranges, but after reading “Regular Expression Matching in the Wild” by Russ Cox I knew what to do, especially since I was not building a general-purpose regex engine, but a simplified multi-regex one (no back-references, for example).
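The run-time loop of such a table-driven lexer is the classic longest-match scan; a toy sketch in Python (the real combined table is built from the patterns — here it is written out by hand):

```python
def longest_match(table, accepting, text, pos=0):
    # run the combined DFA over the input, remembering the last
    # accepting state -- the longest-match loop of a table-driven
    # lexer, with no per-pattern re-scanning
    state, best = 0, None
    for i in range(pos, len(text)):
        state = table.get((state, text[i]))
        if state is None:
            break
        if state in accepting:
            best = (accepting[state], i + 1)  # token kind, end offset
    return best

# toy table matching INT = [0-9]+ and PLUS = "+"
table = {(0, c): 1 for c in "0123456789"}
table.update({(1, c): 1 for c in "0123456789"})
table[(0, "+")] = 2
accepting = {1: "INT", 2: "PLUS"}
print(longest_match(table, accepting, "42+1"))  # ('INT', 2)
```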

Without any optimization at all it is twice as fast as the previous approach — checking the patterns one by one at runtime. And I didn't notice any difference in the time needed to build the lexer, which is rather good news.

From now on, NLT is mostly about maintenance — fixing bugs, optimization, better error reporting, and so on.


GLR is here

Two wrong approaches to optimization later, the GLR parser is ready for download.

In my case the last missing step was upgrading the LALR parser to MLR — I followed the great explanation given by David R. Tribble (of course, all errors in the implementation are mine). One thing that caused me confusion was the term “lookahead” — a lot of people seem to use it for two different things, so I introduced the terms “after lookahead” and “next lookahead” to clarify which is which. The first is used when building the parser and denotes the terminals which can show up after the entire production is consumed (so it is irrelevant how much of the production has been processed). The “next lookahead” is used when parsing — it denotes the terminals which can show up next in the input stream (for reduce states of productions the two sets are identical).

Besides that, implementing the MLR parser was easier than the LALR one — you don't have to compute follow sets, and there is less work with the NFA and DFA.

As for optimization, I went with pre-computing first sets for entire chunks of the symbols present in productions, and with stamping NFA states whenever their after-lookaheads get updated. That got me to the point where the lexer is again the bottleneck of the entire processing.
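The chunk pre-computation can be sketched as follows (Python; the caching of the chunk results, which is the actual optimization, is left out):

```python
def first_of_chunk(chunk, first, nullable):
    # FIRST of a whole sequence of symbols: union the first sets left
    # to right, stopping at the first non-nullable symbol; the boolean
    # says whether the entire chunk can derive the empty string
    result = set()
    for sym in chunk:
        result |= first[sym]
        if sym not in nullable:
            return result, False
    return result, True

first = {"opt": {"A"}, "B": {"B"}}
nullable = {"opt"}
fs, eps = first_of_chunk(("opt", "B"), first, nullable)
print(sorted(fs), eps)  # ['A', 'B'] False
```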

I don’t know about you, but I think I deserved a Grand Imperial Porter. I will get another one after implementing the multi-regex.


A reminder from the closed source world

Now and then I get a stern reminder to stick with open source when it comes to the base system. I have to replicate the system — it refuses to do so; I have to install the updates — it cannot find the root partition; and so on. 16 years ago I switched from Microsoft Windows to SUSE Linux and found it plagued with bugs as well. The difference is that with Linux I can boot the computer from a pendrive, I can run a console, I can actually diagnose the problem — and the same story goes for programs. And when I need something badly I can modify the source code, compile it, and voilà — I have my custom tool.

Around a week ago I finished the GLR parser as part of the NLT suite. The only annoyance is that it is too slow to publish, so I need to profile it. And then the Visual Studio profiler stopped working; I switched to PerfView and instead of the usual 3% of broken stacks I got something more like 80%. Just my luck — go figure out the bottleneck now.

And still, I am grateful for this refreshing experience — it is good to know what closed source stands for: reinstall everything, or buy newer versions and pray the problem will go away. Right…


17 March, 2014 — Compilers course again

Without any doubt I highly recommend enrolling in the Compilers course by Prof. Alex Aiken. Since I had some time to make preparations, I decided to include a skeleton project for the NLT generator, and the way the COOL assignments are organized made me refine some features of NLT.

Among others, there is one whose existence derives from a limitation of C# — you cannot define an enum which inherits from another enum, despite the fact that an enum in C# is a strong alias of int (and you can inherit from classes containing ints as you like). In other words, if you have to define constant integer values gradually, you have to use plain old ints (and lose value safety).

I am far from reaching enums in Skila but I won’t make the same mistake. Anyway, both frameworks — COOL C# and NLT — are uploaded and waiting for you!


Placeholders in parser rules

Inspired by a question at the StackOverflow site I started a new subproject in NLT — a set of ready-to-use file readers. The first one was for the PBXProj format, and when writing the syntax rules for it I quickly became tired of repeating the same names over and over. PBXProj is a rather simple format (similar to JSON), and the rules are short, mostly with unique symbols:

dict_list -> (COMMA- d:dict)+ { d };

But why introduce a name for an object when in code the symbol name would be a perfect fit? You know, “opportunity” is just another name for a “problem”.

dict_list -> (COMMA- dict)+ { $dict };

Once NLT spots a placeholder it tries to bind it to a symbol (the symbol has to be unique and anonymous, i.e. without an object name specified). Once it does, it adds an implicit object name for the symbol and uses it in the code.
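This binding step can be sketched like so (Python, with a hypothetical `(object name, symbol)` encoding of the RHS — not NLT's actual representation):

```python
import re

def bind_placeholders(rhs, action):
    # replace each $name placeholder with an implicit object name,
    # provided `name` occurs exactly once on the RHS and carries no
    # explicit object name yet
    bound_rhs, bound_action = list(rhs), action
    for name in set(re.findall(r"\$(\w+)", action)):
        hits = [i for i, (obj, sym) in enumerate(rhs)
                if sym == name and obj is None]
        if len(hits) != 1:
            raise ValueError(f"${name} must match exactly one anonymous symbol")
        obj = f"_{name}"  # the implicit object name
        bound_rhs[hits[0]] = (obj, name)
        bound_action = bound_action.replace(f"${name}", obj)
    return bound_rhs, bound_action

# dict_list -> (COMMA- dict)+ { $dict }
rhs = [(None, "COMMA"), (None, "dict")]
print(bind_placeholders(rhs, "$dict"))
```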

I don’t recommend using placeholders in long syntax rules, but in short ones they are the way to go.


Hard time with optimization

It is about time to think about NLT performance — the results I saw in the profiler output really surprised me. I thought the parser was the culprit of the poor performance (I tested the regular LALR one), but except for creating huge logs, I was mistaken. Running the regular expressions against the input (the lexer) takes most of the time.

Another surprise is how a tiny line can be a major drag — converting an enum to an int takes about 9% of the total time. This is actually due to flaws in C# — firstly, the lack of an enum constraint on generic types. All I have is some struct (despite knowing it is really an enum), so I have to cast the data to object and then back to int. In effect I have unboxing right after boxing, which should lead to a no-op — but it does not ¹.

All in all, I managed to shave the running time down to around 30% of the original — I will keep the enum handling as it is for now; for the regular expressions I am thinking about writing my own engine capable of matching multiple regexes at the same time (the whole purpose of the lexer is to pick the longest match). Since it is a challenge not to be taken lightly (time!), I moved on to other issues.

I am not a fan of using ad-hoc symbols, but I see their purpose — it is easier to explain some production without introducing extra code. The NLT grammar looked too heavy for my taste, so I added support for literals used as terminals in parser productions. One can write:

expr Expr -> e1:expr "+" e2:expr

instead of introducing an appropriate terminal like “PLUS” beforehand.

¹ C# non-boxing conversion of generic enum to int?.
