More on repeated groups

From time to time the vast difference between numbers surprises me — take for example LALR(1) and LALR(2). It is just one symbol more, but this means going from object to collection of objects. The internal algorithm also is way more complex that one could expected from the “2-1” difference.

Same story with repetitions — one-or-more:

(COMMA- id:identifier)+

was pretty easy to implement with straightforward methods “seed” and “append”. Zero-or-more:

(COMMA- id:identifier)*

reuses previous implementation by reformulating the problem to “nothing or one-or-more” (tell the mathematician to pull out the nail half-way in a wall and a guy will hammer it down entirely just to reduce the case to well-known problem). But two-or-more:

(COMMA- id:identifier)++

made me struggle with it. You have to register new names for the objects (you cannot use “id” twice in the same block of code), you have to skip ignored symbol (here “COMMA”) just once, etc. So once simple “seed” looks more like a monster now with plenty of conditions.

And why not go further and add general repetitions with limits (like in regex “X{3,5}”)? I didn’t have time to thoroughly think about it but my impression is that with one symbol as lookahead there is only a difference between 0, 1, and 2 initial repetitions. With higher limits the control should be performed as semantic check, not syntax one.

Another feature requested by real-life case was naming arbitrary symbols within repeated sequence — let’s say you have a space-separated elements and you have to keep track of the spaces, now it is possible:

seq_list -> (s:space- e:elem)+ { new Seq(s,e) };

While working on new example for parsing chemical formulas (a problem recurring on various forums) I started wondering how big is memory hunger of LR parsers. Is it still a problem for modern computers? You know writing shift/reduce resolution is fun once or twice, but when you see “stupid” conflicts you wish your parser could be a little smarter. After some search I found a promising approach — not LR, not LALR, but MLR. From what I understood MLR collapses nodes as LALR does, but if it leads to a conflict it keeps given nodes intact, in LR fashion.

Tempting, tempting…

Tagged ,

Leave a Reply

Fill in your details below or click an icon to log in: Logo

You are commenting using your account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s

%d bloggers like this: