Some time ago I noticed that even innocent rule which makes the grammar more clear and terse can, surprisingly, introduce conflicts. This made me thinking — if the grammar in its essence is ambiguous, that’s OK, but when it is all about rewriting rules such behaviour is simply unacceptable. It is like forbidding creating temporary variables in programming languages.
I approached this issue as far as it solves the problem — namely I look for identity aliases:
elem -> namedElem | anonElem;
and unfold them. Production:
id -> LBRACE elem RBRACE { $elem };
becomes:
id -> LBRACE namedElem RBRACE { $namedElem }; id -> LBRACE anonElem RBRACE { $anonElem };
Obviously there is room for improvements here (for starters unfolding wrappers like `id
` above as well) but for now it is enough (Skila is waiting).
Nevertheless that move caused new problem — unfolding aliases multiplied productions and all tables exploded. Skila parser went from 874 entries to 2214, and even if I didn’t mind, Mono refused to run such behemoth. Unfolding for sure is operation that should be possible to do, so if I have an explosion of nodes it means the node merging is too picky. So I revised how DFA is created.
When merging DFA nodes it is always said, all the states of two nodes have to be equal. So far I didn’t read a definition of states equality other than comparison of production and number of symbols read (SLR(0) state equality). Consider such states:
comp -> LPAREN . expr RPAREN { func1 }; expr -> . LPAREN NUM RPAREN { func2 };
Different productions, different number of symbols read — it is easy to see they are not equal. But what about:
expr -> PLUS . expr { func1 }; expr -> MINUS . expr { func2 };
Also not equal, but this “natural” notion of equality is deceptive — because there is a lot of wasted space. Production actually should not matter. In fact the symbol on the left hand side matters, the symbols that are left to be read too, and the action user entered.
States in both previous examples are not equal indeed, but the following ones are:
expr -> PLUS . expr { func1 }; expr -> MINUS . expr { func1 }; // note: the same function
Ok, equal might be too strong word — they are compatible, but is is enough for merger. They are expecting the same reduction, they target the same node after reduction, they are waiting for the same input, and the action which will be executed is the same. This is all we need (currently I added the number of the read symbols to the mix, because of error recovery mechanism — as soon I fix this I will get rid of this factor as well).
And the results? With new merge algorithm and unaliasing turned off, the number of entries was reduced by 10%. After enabling unfolding aliases I got almost 1% additional gain. In raw numbers it went from 874 to 782 nodes.
All in all not bad, automatically removing conflicts in the grammar led to reduction of the table entries. Code is uploaded so you can take it for a test drive.