Monthly Archives: November 2013

More on repeated groups

From time to time the vast difference between numbers surprises me — take for example LALR(1) and LALR(2). It is just one symbol more, but this means going from object to collection of objects. The internal algorithm also is way more complex that one could expected from the “2-1” difference.

Same story with repetitions — one-or-more:

(COMMA- id:identifier)+

was pretty easy to implement with straightforward methods “seed” and “append”. Zero-or-more:

(COMMA- id:identifier)*

reuses previous implementation by reformulating the problem to “nothing or one-or-more” (tell the mathematician to pull out the nail half-way in a wall and a guy will hammer it down entirely just to reduce the case to well-known problem). But two-or-more:

(COMMA- id:identifier)++

made me struggle with it. You have to register new names for the objects (you cannot use “id” twice in the same block of code), you have to skip ignored symbol (here “COMMA”) just once, etc. So once simple “seed” looks more like a monster now with plenty of conditions.

And why not go further and add general repetitions with limits (like in regex “X{3,5}”)? I didn’t have time to thoroughly think about it but my impression is that with one symbol as lookahead there is only a difference between 0, 1, and 2 initial repetitions. With higher limits the control should be performed as semantic check, not syntax one.

Another feature requested by real-life case was naming arbitrary symbols within repeated sequence — let’s say you have a space-separated elements and you have to keep track of the spaces, now it is possible:

seq_list -> (s:space- e:elem)+ { new Seq(s,e) };

While working on new example for parsing chemical formulas (a problem recurring on various forums) I started wondering how big is memory hunger of LR parsers. Is it still a problem for modern computers? You know writing shift/reduce resolution is fun once or twice, but when you see “stupid” conflicts you wish your parser could be a little smarter. After some search I found a promising approach — not LR, not LALR, but MLR. From what I understood MLR collapses nodes as LALR does, but if it leads to a conflict it keeps given nodes intact, in LR fashion.

Tempting, tempting…

Tagged ,

Operator constraints — coping with ambiguity

Lately I worked on solving the ambiguity problem with NLT and I considered adding weight to shift or reduce move to allow both, but promote only one. I’ve already implemented it but luckily I came up with more elegant solution — decorating productions with arbitrary label on one hand and forbidding given labels to be used in reduce move on the other. Clean and powerful, just I like it.

Sounds interesting? This time I won’t write more about it here, because I’ve posted an article at Code Project covering this topic. Have a good read!

Tagged ,

Polishing precedence

Introducing generics (or template — at this stage the implementation does not matter) was like opening Pandora box. Benign angle brackets added so much local ambiguity that using regular NLT features I couldn’t parse the code (OK, I admit, I try to avoid forking whenever I can).

Consider such code:

if X<Y...

Is it just basic conditional, and should be read as:

if (X<Y) ...

or is it first part of generic expression:

if (X<Y>) ...

There are more operators involved here — what to do in case of “+” and “<” conflict? And what to do in case of two “<”? To keep long story short I refined the way the operators are picked up and also introduced more precise pattern to define resolution of shift/reduce conflict, for example:

rs shift LANGLE expression(AND OR EXOR) expression;

This translates to rule — make a shift when incoming symbol (lookahead) is left angle bracket and there is AND, OR, or EXOR operator on stack and there is conflict between expression and expression. Such pattern makes the code:

if 5<3 and 2<7 then

to parse as:

if (5<3) and (2<7) then

One note: without template syntax setting the priority of basic operator precedence (like in yacc) was enough.

I had to drop one feature I added some time ago, naively thinking it could limit the number of the conflicts. Until today when there was empty reduce rule conflicting with shift rule NLT reduced first and then shifted. With template syntax I finally found counter example — consider defining new class:

class Foo _ : Bar

(underscore denotes template arguments placement), and passing named argument to a function:

foo(x _ : 5);

Here underscore denotes a problem — we have identifier (exactly like previously) and possibly empty template (because of the grammar). The question is should we make this empty template or not? Of course not, you cannot write:

foo(x<String> : 5);

but with mechanism of creating reduce+shift sequence it was exactly what happened — fixed!

Tagged , ,