
GLR is here

Two wrong approaches to optimization later, the GLR parser is ready for download.

In my case the last missing step was upgrading the LALR parser to MLR — I was following the great explanation given by David R. Tribble (of course all errors in the implementation are mine). One thing that caused my confusion was the term “lookahead” — it seems a lot of people use it for two different things, so I introduced the terms “after lookahead” and “next lookahead” to clarify which is which. The first one is used when building a parser and denotes the terminals which can show up after the entire production is consumed (so it is irrelevant how much of the production has been processed). The “next lookahead” is used when parsing — it denotes the terminals which can show up next in the input stream (for reduce states of productions those two sets are identical).
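To make the distinction concrete, here is a minimal sketch for the toy grammar “S -> A b ; A -> a” — the grammar, the names and the representation are just my illustration, not code from my parser:

# "after lookahead": terminals that can appear after the *whole* production,
#                    fixed at table-construction time, independent of the dot.
# "next lookahead":  the terminal actually visible next in the input stream
#                    while parsing.

after_lookahead = {
    ("A", ("a",)): {"b"},       # whatever reduces to A may only be followed by 'b'
    ("S", ("A", "b")): {"$"},   # end-of-input marker
}

def next_lookahead(input_tokens, position):
    """Terminal visible next in the stream at runtime."""
    return input_tokens[position] if position < len(input_tokens) else "$"

tokens = ["a", "b"]
# While still inside  A -> . a  the next lookahead is 'a',
# but the after lookahead of the production stays {'b'}.
print(next_lookahead(tokens, 0))                 # 'a'
print(after_lookahead[("A", ("a",))])            # {'b'}
# At the reduce point  A -> a .  both notions coincide: the next token is 'b'.
print(next_lookahead(tokens, 1))                 # 'b'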

Besides that, implementing the MLR parser was easier than the LALR one — you don’t have to compute follow sets, and there is less work with the NFA and DFA.

As for optimization, I went with pre-computing first sets for entire chunks of the symbols present in productions, and with stamping NFA states whenever their after-lookaheads get updated. It allowed me to get to the point where the lexer is again the bottleneck of the entire processing.
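A minimal sketch of the first idea — caching the FIRST set of a whole chunk of symbols so it is not recomputed during every lookahead propagation; the names and the toy FIRST table are assumptions made for illustration:

from functools import lru_cache

FIRST = {"a": {"a"}, "b": {"b"}, "A": {"a"}, "B": {"b", ""}}   # "" marks nullable

@lru_cache(maxsize=None)
def first_of_chunk(symbols):
    """FIRST set of a sequence (chunk) of symbols, with the usual rule:
    keep adding FIRST of the next symbol while the previous ones are nullable."""
    result = set()
    for sym in symbols:
        sym_first = FIRST[sym]
        result |= sym_first - {""}
        if "" not in sym_first:        # symbol not nullable -> chunk FIRST is done
            return frozenset(result)
    return frozenset(result | {""})    # every symbol nullable -> chunk nullable too

print(sorted(first_of_chunk(("B", "A"))))   # ['a', 'b'] -- B may vanish, exposing A

The stamping part is then just a matter of bumping a version number on an NFA state whenever its after-lookahead set grows, so unchanged states can be skipped on the next propagation pass.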

I don’t know about you, but I think I deserved a Grand Imperial Porter. I will get another one after implementing multi-regex.


More on repeated groups

From time to time the vast difference between numbers surprises me — take for example LALR(1) and LALR(2). It is just one symbol more, but it means going from an object to a collection of objects. The internal algorithm is also way more complex than one could expect from the “2-1” difference.
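A tiny illustration of that jump (just my own shorthand, not the parser’s actual data structures): with one symbol of lookahead a lookahead set holds plain terminals, with two it holds ordered pairs, and every set operation suddenly works on sequences.

lalr1_lookaheads = {"b", "$"}                    # single terminals
lalr2_lookaheads = {("b", "c"), ("b", "$")}      # ordered pairs of terminals

def conflicts(action_a, action_b):
    """Two actions conflict when their lookahead sets intersect,
    regardless of whether the members are symbols or tuples of symbols."""
    return bool(action_a & action_b)

print(conflicts(lalr1_lookaheads, {"$"}))           # True
print(conflicts(lalr2_lookaheads, {("b", "d")}))    # False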

Same story with repetitions — one-or-more:

(COMMA- id:identifier)+

was pretty easy to implement with the straightforward “seed” and “append” methods (sketched below). Zero-or-more:

(COMMA- id:identifier)*

reuses the previous implementation by reformulating the problem as “nothing or one-or-more” (tell a mathematician to pull out a nail hammered half-way into a wall, and the guy will hammer it in entirely just to reduce the case to a well-known problem). But two-or-more:

(COMMA- id:identifier)++

made me struggle. You have to register new names for the objects (you cannot use “id” twice in the same block of code), you have to skip the ignored symbol (here “COMMA”) just once, etc. So the once-simple “seed” now looks more like a monster with plenty of conditions.
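For reference, a minimal sketch of the “seed”/“append” idea for “(COMMA- id:identifier)+” — the names and calling convention are my illustration, not the generator’s real API:

def seed(first_identifier):
    """First repetition: start the collection."""
    return [first_identifier]

def append(collected, _comma, next_identifier):
    """Each further repetition: the ignored COMMA is dropped, the id is kept."""
    collected.append(next_identifier)
    return collected

# one-or-more: a, b, c
ids = append(append(seed("a"), ",", "b"), ",", "c")
print(ids)                      # ['a', 'b', 'c']

# two-or-more ("++") additionally has to guarantee the element count,
# e.g. by rejecting the result when fewer than two ids were collected.
assert len(ids) >= 2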

And why not go further and add general repetitions with limits (like “X{3,5}” in regexes)? I haven’t had time to think it through thoroughly, but my impression is that with one symbol of lookahead there is only a difference between 0, 1, and 2 initial repetitions. With higher limits the control should be performed as a semantic check, not a syntactic one.
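A minimal sketch of that split: let the syntax accept one-or-more repetitions and enforce the exact “{3,5}”-style bounds as a semantic check on the collected list (the function name is mine, for illustration only):

def check_repetition_bounds(items, low, high):
    if not (low <= len(items) <= high):
        raise ValueError(
            f"expected between {low} and {high} elements, got {len(items)}")
    return items

check_repetition_bounds(["a", "b", "c"], 3, 5)      # passes
# check_repetition_bounds(["a"], 3, 5)              # would raise ValueError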

Another feature requested by a real-life case was naming arbitrary symbols within a repeated sequence — let’s say you have space-separated elements and you have to keep track of the spaces; now it is possible:

seq_list -> (s:space- e:elem)+ { new Seq(s,e) };

While working on a new example for parsing chemical formulas (a problem recurring on various forums), I started wondering how big the memory hunger of LR parsers really is. Is it still a problem for modern computers? You know, writing shift/reduce resolutions is fun once or twice, but when you see “stupid” conflicts you wish your parser could be a little smarter. After some searching I found a promising approach — not LR, not LALR, but MLR. From what I understood, MLR collapses nodes as LALR does, but if that leads to a conflict it keeps the given nodes intact, in LR fashion.
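That merge-only-when-safe idea could look roughly like this — a minimal sketch under my own simplifying assumptions (only reduce/reduce conflicts are checked, and the state representation is hypothetical), not a real MLR implementation:

def would_conflict(state_a, state_b):
    """state_*: dict mapping a reducible production to its lookahead set.
    Merging conflicts if two *different* productions end up sharing a lookahead."""
    for prod_a, la_a in state_a.items():
        for prod_b, la_b in state_b.items():
            if prod_a != prod_b and la_a & la_b:
                return True
    return False

def merge_if_safe(state_a, state_b):
    if would_conflict(state_a, state_b):
        return None                      # keep the states separate, LR-style
    merged = {p: set(la) for p, la in state_a.items()}
    for prod, la in state_b.items():
        merged.setdefault(prod, set()).update(la)
    return merged

s1 = {("A", ("a",)): {"x"}, ("B", ("a",)): {"y"}}
s2 = {("A", ("a",)): {"y"}}              # 'y' would now clash with B -> a
print(merge_if_safe(s1, s2))             # None -> states stay split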

Tempting, tempting…
