It took me too long (10 days), but I finally did it. LALR(k) is working, and the only missing piece, ironically, is a parsing problem in Skila that regular LALR(k) is better suited for than forking LALR(1). Even with such a poor performance on my part, I am happy to have another tool in my toolbox. If you are going to implement an LALR(k) parser on your own, here is what I’ve learned.
FIRSTk sets
Make sure you are reading the math notation the right way: the algorithm given in “The Theory of Parsing, Translation, and Compiling (vol. 1)” by Alfred V. Aho and Jeffrey D. Ullman is rock solid once you understand it. Compared to my old, naive LALR(1) code, the Aho & Ullman implementation is much shorter and cleaner. Just pay attention to two issues:
- the empty set (∅) is not the same as a set containing the empty element ({ε}); this makes a difference for multi-concatenation of sets (⊕k): if the first set is empty, you will get an empty set as the result,
- initialize F0 for non-terminals with fewer than k terminals only if α (the rest of the right-hand side after those terminals) is empty. If it is not, it has to be exactly k terminals, and if this condition cannot be fulfilled you don’t return anything for the given production (returning the empty string, ε, in this case will render your algorithm invalid).
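Both pitfalls can be illustrated with a minimal Python sketch. It follows my reading of the book's algorithm and is not the code from my parser: the names `concat_k` and `first_k`, and the grammar encoded as a dict of tuples, are assumptions made for this example only.

```python
from itertools import product

EMPTY = set()    # the empty set: no sequences at all
EPSILON = {()}   # a set holding one sequence, the empty one


def concat_k(k, *sets):
    """Multi-concatenation: pick one sequence from each set, join, trim to k."""
    result = {()}  # start from {epsilon}, the neutral element
    for s in sets:
        result = {(a + b)[:k] for a, b in product(result, s)}
    return result


def first_k(k, grammar, terminals):
    """grammar maps each non-terminal to a list of right-hand sides (tuples)."""
    first = {t: {(t,)} for t in terminals}
    # F0: take the leading terminals x of every production A := x alpha.
    for lhs, rhss in grammar.items():
        first[lhs] = set()
        for rhs in rhss:
            x = []
            for sym in rhs:
                if sym in terminals and len(x) < k:
                    x.append(sym)
                else:
                    break
            alpha = rhs[len(x):]
            # Keep x only if it is k long, or shorter with alpha empty.
            # Otherwise the production contributes nothing (not epsilon).
            if len(x) == k or not alpha:
                first[lhs].add(tuple(x))
    # Iterate until no set grows any more.
    changed = True
    while changed:
        changed = False
        for lhs, rhss in grammar.items():
            for rhs in rhss:
                new = concat_k(k, *(first[sym] for sym in rhs))
                if not new <= first[lhs]:
                    first[lhs] |= new
                    changed = True
    return first


# S := a S b | epsilon
grammar = {"S": [("a", "S", "b"), ()]}
first = first_k(2, grammar, {"a", "b"})
# first["S"] holds (), ('a', 'b') and ('a', 'a')
```

Note how `concat_k(2, EMPTY, {("a",)})` yields the empty set while `concat_k(2, EPSILON, {("a",)})` yields `{("a",)}`; mixing the two up silently loses or invents entries.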
FOLLOWk sets
Either I cannot read (again), or the definition of FOLLOW sets given in “The Theory of Parsing” leads to a very easy implementation that is not useful and, no wonder, not used anymore. In the better-known book “Compilers: Principles, Techniques, and Tools” by Alfred V. Aho et al., the definition is already changed, but no algorithm for building the sets is given. Let me try to describe my algorithm:
Here A is a non-terminal; U is a symbol (terminal or non-terminal); α and β are sequences of symbols, possibly empty; u is a sequence of terminals, possibly empty; ⊕k is the cartesian product of two sets, where each pair of sequences is concatenated and trimmed to k symbols.
1. for every symbol, initialize its entry as an empty set,
2. set the FOLLOW set of the start symbol to a sequence of k pseudo-terminal EOFs,
3. build the initial FOLLOW sets from the FIRST sets by iterating over all symbols and looking for each of them on the RHS of every production. Let’s say the production is defined as A := α U β, where U is the symbol we iterate over. Compute FIRSTk(β): if it is the empty set, add the non-terminal A to FOLLOWk[U]; if it is not, add the set FIRSTk(β) ⊕k A (unlike reading math, I love using math notation because it makes me look smarter than I really am),
4. at this point we have two kinds of entries: the ones with terminals only, and the ones with a single non-terminal at the end, in the form u A. The latter will be used as generators,
5. iterate over all generators, replacing their entries with u ⊕k FOLLOWk[A],
6. repeat (5) until no change is made to the FOLLOWk sets,
7. remove all generator entries.
Please note that, unlike FIRSTk sets, every entry in FOLLOWk sets is k terminals long.
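The steps above can be sketched in Python roughly as follows. This is an illustration under assumptions, not my actual implementation: `follow_k`, `concat_k`, the `"$"` spelling of EOF, and the precomputed `first` table passed in as a plain dict are all names invented for this example. One deliberate deviation: instead of literally replacing a generator, the sketch keeps it and adds its expansion, so the sets only ever grow and the loop is guaranteed to stop; the generators are dropped at the very end.

```python
from itertools import product

EOF = "$"  # hypothetical spelling of the pseudo-terminal


def concat_k(k, *sets):
    """Pick one sequence from each set, concatenate, trim to k symbols."""
    result = {()}
    for s in sets:
        result = {(a + b)[:k] for a, b in product(result, s)}
    return result


def follow_k(k, grammar, start, first):
    """grammar: non-terminal -> list of RHS tuples; first: symbol -> FIRSTk set."""
    nonterminals = set(grammar)

    def is_generator(entry):
        return bool(entry) and entry[-1] in nonterminals

    follow = {sym: set() for sym in first}            # step 1
    follow[start].add((EOF,) * k)                     # step 2
    for lhs, rhss in grammar.items():                 # step 3
        for rhs in rhss:
            for i, sym in enumerate(rhs):
                first_beta = concat_k(k, *(first[s] for s in rhs[i + 1:]))
                if first_beta:
                    follow[sym] |= concat_k(k, first_beta, {(lhs,)})
                else:
                    follow[sym].add((lhs,))
    changed = True
    while changed:                                    # steps 5 and 6
        changed = False
        for sym in follow:
            for entry in [e for e in follow[sym] if is_generator(e)]:
                u, gen = entry[:-1], entry[-1]
                new = concat_k(k, {u}, follow[gen])
                if not new <= follow[sym]:
                    follow[sym] |= new
                    changed = True
    return {sym: {e for e in entries if not is_generator(e)}   # step 7
            for sym, entries in follow.items()}


# S := a S b | epsilon, with FIRST2 sets written out by hand
grammar = {"S": [("a", "S", "b"), ()]}
first = {"a": {("a",)}, "b": {("b",)}, "S": {(), ("a", "b"), ("a", "a")}}
follow = follow_k(2, grammar, "S", first)
# follow["S"] holds ('$', '$'), ('b', '$') and ('b', 'b')
```

For this toy grammar every surviving entry is exactly 2 terminals long, which matches the note above.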
Lookaheads and DFA
Computing lookaheads is basically the same algorithm as step (3) of computing FOLLOWk sets, only now the iteration is not over all symbols, searching for them in productions, but over productions, looking for the split points between stack and input. After all the computation it is necessary to either remove dead NFA states from the DFA or mark them in some way, because otherwise you will get a lot of conflicts for your grammar. A dead NFA state is a state with a non-terminal following the split point.
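Filtering dead states could look something like the sketch below. The `Item` class, its fields, and the helper names are hypothetical; it only assumes that an NFA state is a production plus a split point, and that a DFA state is a set of such items.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Item:
    lhs: str
    rhs: tuple
    dot: int  # split point: rhs[:dot] is on the stack, rhs[dot:] is input


def is_dead(item, nonterminals):
    """Dead means a non-terminal directly follows the split point."""
    return item.dot < len(item.rhs) and item.rhs[item.dot] in nonterminals


def live_items(state, nonterminals):
    """Keep only the items that should take part in conflict detection."""
    return {item for item in state if not is_dead(item, nonterminals)}


state = {
    Item("S", ("a", "S", "b"), 1),  # dead: S follows the split point
    Item("S", ("a", "S", "b"), 0),  # live: terminal a follows
    Item("S", (), 0),               # live: nothing follows, a reduction
}
live = live_items(state, {"S"})
```

Marking instead of removing would just mean storing the result of `is_dead` next to each item and skipping flagged items when checking for conflicts.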
Final words
After finding all the bugs you will notice that the hardest part is the UX of the parser: how to report the DFA in a meaningful way, and how to define precedence rules efficiently and correctly. Those issues are still ahead of me, but stay in touch; once I deal with them I will surely share my experience with you. And a last bit of advice: if you are not sure whether you should go with LALR(1) or LALR(k), I suggest going with the latter. It will force you to use a proper algorithm (Aho & Ullman), not, with all respect, something half-baked by yourself (which will work, yet with a high chance of being inefficient). It will force you to clean up your code related to the DFA as well. All LALR(k>1) parsers are in the same league; the only distinct parser is LALR(1), so starting from it is very likely to be a waste of time. I love that feeling of peace of mind after a finished job. Now I can enjoy the upcoming weekend with Skila…