Tag Archives: repetition

More on repeated groups

From time to time the vast difference between numbers surprises me — take for example LALR(1) and LALR(2). It is just one symbol more, but this means going from object to collection of objects. The internal algorithm also is way more complex that one could expected from the “2-1” difference.

Same story with repetitions — one-or-more:

(COMMA- id:identifier)+

was pretty easy to implement with straightforward methods “seed” and “append”. Zero-or-more:

(COMMA- id:identifier)*

reuses previous implementation by reformulating the problem to “nothing or one-or-more” (tell the mathematician to pull out the nail half-way in a wall and a guy will hammer it down entirely just to reduce the case to well-known problem). But two-or-more:

(COMMA- id:identifier)++

made me struggle with it. You have to register new names for the objects (you cannot use “id” twice in the same block of code), you have to skip ignored symbol (here “COMMA”) just once, etc. So once simple “seed” looks more like a monster now with plenty of conditions.

And why not go further and add general repetitions with limits (like in regex “X{3,5}”)? I didn’t have time to thoroughly think about it but my impression is that with one symbol as lookahead there is only a difference between 0, 1, and 2 initial repetitions. With higher limits the control should be performed as semantic check, not syntax one.

Another feature requested by real-life case was naming arbitrary symbols within repeated sequence — let’s say you have a space-separated elements and you have to keep track of the spaces, now it is possible:

seq_list -> (s:space- e:elem)+ { new Seq(s,e) };

While working on new example for parsing chemical formulas (a problem recurring on various forums) I started wondering how big is memory hunger of LR parsers. Is it still a problem for modern computers? You know writing shift/reduce resolution is fun once or twice, but when you see “stupid” conflicts you wish your parser could be a little smarter. After some search I found a promising approach — not LR, not LALR, but MLR. From what I understood MLR collapses nodes as LALR does, but if it leads to a conflict it keeps given nodes intact, in LR fashion.

Tempting, tempting…

Tagged ,

More auto-magic comes to NLT

What is parser generator without even simplest form of repetition? I had to add it to NLT as well:

namespaces -> ns:namespace_+
              { new Namespaces(currCoords(),ns) };

Behind the scenes NLT creates a list — List<T> — where “T” is either “object” or the type defined for given symbol (here: “Namespace” for “ns”). So when I pass “ns” in code all it takes is making “Namespaces” class accept “IEnumerable<Namespace>”.

When I tried to write the production for parsing parameters I ran into a problem:

params_ -> (COMMA p:param)*

Of course this is incorrect, it says that the list of parameters starts with the comma.

params_ -> p:param (COMMA p:param)*

I didn’t like it very much — not only I would have to make an exception of allowing duplicate names, but the user would have to repeat some of the symbols. Redundancy is bad. So I came up with an idea:

params_ -> (COMMA- p:param)*
           { new Parameters(currCoords(),p) };

Minus character tells NLT to exclude such symbol from initial repetition and the rule is written in compact way. I don’t know if this syntax will scale up to other problems, but for now it works.

One thing I miss here is local projection — the list of the parameters is created with raw parameters but on-fly transformations are helpful to express more complex scenarios:

params_ := (COMMA- p:param -> {p.ToUpperCase()})*
           { new Parameters(currCoords(),p) };

The main action is the same, but the rule contains transformation for each parameter which is added to the list. I wrote it next to shuffle mode on my “to-do” list.

Another Skila-related issue emerged while working on NLT — there are a lot of constraints which are too complex to handle them in parser rules, however they do not entirely fit into semantic analysis (consider checking if all of the elements inside the group are not marked with minus sign).

Let‘s say I add another line to action — after creating an object I check it, and if it is invalid I raise an exception. It works, but it also means that an error during check actually stops all subsequent checks. NLT provides a mechanism for that — pass an extra argument, ParseControl, and set its state accordingly to the error. This approach has two problems — it requires more code and it does not prevent from using incorrect object.

I solved the first issue in the simplest way — I introduced “IParseControl” interface and implemented it in desired classes. This way I can create an object, validate its state in constructor and set state of the “IParseControl” part accordingly to the error. The action code is clean as before — just creating an object.

The remaining issue is how to prevent from using invalid object? Because it is really interesting not only for parsing I restate this problem fully:

How to create an object, validate it and if it is invalid forbid using it except of passing reference of it?

Maybe I am wrong but in C# world the solution is quite elaborate — let’s assume we are about to validate class “Foo”:

  • make all constructors private in “Foo”,
  • make all fields private in “Foo”,
  • inherit from “Foo” a new type “InvalidFoo” which overrides every property and every method with single statement — throwing an exception,
  • add static method “Create” to “Foo” — it creates instance of “Foo”, validates it, if everything is OK it returns it, but if not — it returns an instance of “InvalidFoo” instead.

I don’t see simpler solution in any language which does not support objects validation natively. I have in mind something similar to Ruby’s taint checking. And so this issue makes an entry in Skila “to-do” list.

Update 2013-11-06: while maybe it is interesting from theoretical point of view, I am puzzled that I didn’t saw a tree while standing in the forest. In NLT I simply even don’t need that mechanism, because once user returns invalid object no other user action is executed. In other words invalid object won’t make its way into any other object. Passing back and forth parse control object might look cleaner, but throwing an exception enforce more tight control and requires less code (1KB).

Tagged , , , , , ,