Tag Archives: first sets

When GLR is looking at the horizon

Technically GLR works fine, but for me it is annoying that I have to postpone some decisions until run time. Consider this Skila code:

abstract class...
static def... // function
static let... // field

and grammar for it:

class_mod = ε | class_mod ABSTRACT | class_mod STATIC | ...
class = class_mod CLASS ...
func_mod = ε | func_mod ABSTRACT | func_mod STATIC | ...
func = func_mod DEF ...
field = STATIC? LET 

This creates a shift-reduce/reduce-reduce conflict, and when parsing `abstract override static ...` GLR forks on each keyword, waiting for the right moment to kill the incorrect parse trees. Could we do better? Yes: we know in advance that we are waiting for the `class`, `def` or `let` keyword. All we have to do is incorporate that knowledge into the generated parser, and instead of forking the tree we will check the incoming data.

To achieve this we have to make just small modifications to the parser generator: along with the first sets and follow sets we add two new ones. The cover set for a given symbol is a bag of the symbols which appear on the right hand side of the productions for this symbol (directly or indirectly). For example, the cover set for `func` would contain, among other things, `DEF`, `func_mod` and `ABSTRACT`. The second new set is (surprise, surprise) the horizon set. It is very similar to the follow set, but unlike it, it does not mindlessly tell us what can show up right after a given symbol (like `ABSTRACT` for `class_mod`). The horizon set ignores such reductions/expansions as shown in the first and the third line of the grammar and focuses solely on true reductions of the symbol, that is, reductions that cannot be recursively triggered again. So for `class_mod` the horizon set has only one symbol: `CLASS`.

Having first sets, follow sets and cover sets, computing horizon sets is easy: ignore all productions whose left hand side (LHS) is covered by the given symbol, and for the other productions take the first symbols of what comes right after our symbol. If there is too little data, take the rest from the follow set of the LHS of the current production. In other words, it is exactly the same algorithm as for the follow sets, only this time we narrow down the productions with the help of the cover set, and because we already have the follow sets computed we don’t have to bother with recursive computations, we just grab the needed data.

For each non-terminal the horizon set is checked against its cover set; if the two sets overlap, the horizon set is scrapped as unusable. Further computations are performed only when all involved horizon sets are usable (non-empty).
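The cover and horizon computations described above can be sketched roughly as follows. This is only an illustration for k=1, not the code of the actual generator: the grammar is a simplified copy of the example (the elided `...` parts are dropped and `STATIC?` is expanded by hand), and all function names (`first_of`, `cover_sets`, `horizon_sets`) are made up for this sketch.

```python
EPS = "ε"

# Simplified version of the example grammar; terminals are the upper-case names.
GRAMMAR = {
    "class_mod": [[], ["class_mod", "ABSTRACT"], ["class_mod", "STATIC"]],
    "class": [["class_mod", "CLASS"]],
    "func_mod": [[], ["func_mod", "ABSTRACT"], ["func_mod", "STATIC"]],
    "func": [["func_mod", "DEF"]],
    "field": [["LET"], ["STATIC", "LET"]],
}

def first_of(seq, first, grammar):
    """FIRST(1) of a symbol sequence; EPS marks that the whole sequence can vanish."""
    out = set()
    for sym in seq:
        if sym not in grammar:          # terminal
            out.add(sym)
            return out
        out |= first[sym] - {EPS}
        if EPS not in first[sym]:
            return out
    out.add(EPS)
    return out

def first_sets(grammar):
    first = {nt: set() for nt in grammar}
    changed = True
    while changed:
        changed = False
        for nt, prods in grammar.items():
            for rhs in prods:
                new = first_of(rhs, first, grammar) - first[nt]
                if new:
                    first[nt] |= new
                    changed = True
    return first

def cover_sets(grammar):
    """Every symbol reachable (directly or indirectly) from a symbol's productions."""
    cover = {nt: {s for rhs in prods for s in rhs} for nt, prods in grammar.items()}
    changed = True
    while changed:
        changed = False
        for nt in grammar:
            for sym in list(cover[nt]):
                extra = cover.get(sym, set()) - cover[nt]
                if extra:
                    cover[nt] |= extra
                    changed = True
    return cover

def horizon_sets(grammar, first, follow, cover):
    horizon = {nt: set() for nt in grammar}
    for sym in grammar:
        for lhs, prods in grammar.items():
            if lhs in cover[sym]:       # recursive expansion, not a true reduction
                continue
            for rhs in prods:
                for i, s in enumerate(rhs):
                    if s != sym:
                        continue
                    tail = first_of(rhs[i + 1:], first, grammar)
                    horizon[sym] |= tail - {EPS}
                    if EPS in tail:     # too little data: borrow from FOLLOW(lhs)
                        horizon[sym] |= follow.get(lhs, set())
        if horizon[sym] & cover[sym]:   # overlap makes the horizon unusable
            horizon[sym] = set()
    return horizon

FIRST = first_sets(GRAMMAR)
COVER = cover_sets(GRAMMAR)
# No tail is nullable in this grammar, so FOLLOW is never consulted here.
HORIZON = horizon_sets(GRAMMAR, FIRST, {}, COVER)
```

With this grammar `HORIZON["class_mod"]` comes out as just `CLASS` and `HORIZON["func_mod"]` as just `DEF`, matching the description above.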

Armed with all four sets we are ready to resolve ambiguities. In case of a reduce-reduce conflict we check whether the horizons overlap; if they do not, we can use that data at run time. Shift-reduce resolution is a bit more elaborate: taking the production in the shift state, we process its RHS symbol after symbol, testing whether the cover set of the current symbol overlaps with the horizon of the reduction on one hand, and with the LHS of the reduction on the other (when the verification of the entire production is unsuccessful we make one final check, using the after-lookaheads instead of the cover set of the current symbol). As previously, no overlap means we resolved the conflict, this time using not only the horizon for the reduce action but also adding the non-overlapping symbol as a cut-off marker to avoid excessive reading of the input.

In our example, to resolve the reduce-reduce conflict at run time, `class_mod` and `func_mod` would run over the stream of input symbols until they find their horizons, `CLASS` or `DEF` respectively; whichever is found first, the corresponding reduction wins. However, because we have a shift production in the mix, we stop short on `LET` (this is the cut-off for both reduce actions), and if this happens the shift action wins.
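A minimal sketch of that run-time check, assuming the horizons and the cut-off were already computed by the generator. The `resolve` function and the action labels are hypothetical, chosen only for this illustration.

```python
def resolve(tokens, horizons, cutoff):
    """Scan ahead; return the first action whose horizon shows up, or "shift" on a cut-off."""
    for tok in tokens:
        if tok in cutoff:
            return "shift"
        for action, horizon in horizons.items():
            if tok in horizon:
                return action
    return None  # ran out of input without a decision

HORIZONS = {"reduce class_mod": {"CLASS"}, "reduce func_mod": {"DEF"}}
CUTOFF = {"LET"}

picked_func = resolve(["ABSTRACT", "STATIC", "DEF"], HORIZONS, CUTOFF)
picked_field = resolve(["STATIC", "LET"], HORIZONS, CUTOFF)
```

Here `picked_func` ends up as the `func_mod` reduction and `picked_field` as the shift, without forking the parse tree.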

The code is uploaded, so take a look if you are interested in implementation.


GLR is here

Two wrong approaches to optimization later, the GLR parser is ready for download.

In my case the last missing step was upgrading the LALR parser to MLR. I was following the great explanation given by David R. Tribble (of course all errors in the implementation are mine). One thing that confused me was the term “lookahead”: it seems a lot of people use it for two different things, so I introduced the terms “after lookahead” and “next lookahead” to clarify which is which. The first one is used when building a parser to denote the terminals which can show up after the entire production is consumed (so it is irrelevant how much of the production has been processed). The “next lookahead” is used when parsing: it denotes the terminals which can show up next in the input stream (for reduce states of productions those two sets are identical).

Besides that, implementing the MLR parser was easier than the LALR one: you don’t have to compute follow sets, and there is less work with the NFA and DFA.

As for optimization I went with pre-computing first sets for entire chunks of the symbols present in productions and with stamping NFA states whenever their after-lookaheads get updated. It allowed me to get to the point where lexer is again the bottleneck of entire processing.

I don’t know about you, but I think I deserve a Grand Imperial Porter. I will get another one after implementing multi-regex.


LALR(k>1) — mission accomplished

It took me too long (10 days) but I finally did it. LALR(k) is working and the only missing piece, ironically, is a parsing problem in Skila that regular LALR(k) is better suited for than forking LALR(1). Even with such poor performance on my part I am happy to have another tool in my toolbox. If you are going to implement an LALR(k) parser on your own, here is what I’ve learned.

FIRSTk sets

Make sure you are reading the math notation the right way: the algorithm given in “The Theory of Parsing, Translation, and Compiling (vol.1)” by Alfred V. Aho and Jeffrey D. Ullman is rock solid once you understand it. Compared to my old, naive LALR(1) code, the Aho&Ullman implementation is much shorter and cleaner. Just pay attention to two issues:

  • the empty set (∅) is not the same as the set with an empty element ({ε}); this makes a difference for multi-concatenation of sets (⊕k): if the first set is empty you will get an empty set as the result,
  • initialize F0 for non terminals with fewer than k terminals only if α is empty. If it is not, it has to be k terminals, and if this condition cannot be fulfilled you don’t return anything for the given production (returning the empty string, ε, in this case will render your algorithm invalid).
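The first point is easy to see in code. Below is a small sketch of the ⊕k operation on sets of tuples (the name `concat_k` and the sample sets are assumptions made for this illustration), showing how differently the empty set and the set holding only the empty string behave.

```python
def concat_k(left, right, k):
    """⊕k: concatenate every pair of strings and trim the result to k symbols."""
    return {(x + y)[:k] for x in left for y in right}

EMPTY = set()      # no strings at all
EPSILON = {()}     # exactly one string, the empty one
AB = {("a", "b"), ("c",)}

nothing = concat_k(EMPTY, AB, 2)          # the empty set swallows everything
same = concat_k(EPSILON, AB, 2)           # {ε} leaves the other operand untouched
trimmed = concat_k(AB, {("x", "y")}, 2)   # results are cut down to k symbols
```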

FOLLOWk sets

Either I cannot read (again) or the definition of FOLLOW sets given in “The Theory of Parsing” leads to a very easy implementation, but one that is not useful and, no wonder, not used anymore. In the better known book “Compilers: Principles, Techniques, and Tools” by Alfred V. Aho, et al. the definition is already changed, but no algorithm for building the sets is given. Let me try to describe my algorithm:

A: non terminal; U: symbol (terminal or non terminal); α, β: sequences of symbols, possibly empty; u: a sequence of terminals, possibly empty; ⊕k: cartesian product of two sets, where each pair of sequences is concatenated and trimmed to k symbols.
  1. for every symbol initialize its entry as an empty set,
  2. set the FOLLOW set of the start symbol to a sequence of k pseudo-terminal EOFs,
  3. build the initial FOLLOW sets from the FIRST sets by iterating over all symbols and looking for them on the RHS of every production. Say the production is A := α U β, where U is the symbol we iterate over. Compute FIRSTk(β): if it is the empty set, add non terminal A to FOLLOWk[U]; if it is not, add the set FIRSTk(β) ⊕k A (unlike reading math, I love using math notation because it makes me look smarter than I really am),
  4. at this point we have two kinds of entries: the ones with terminals only and the ones with a single non terminal at the end, in the form u A. The latter will be used as generators,
  5. iterate over all generators replacing their entries with u ⊕k FOLLOWk[A],
  6. repeat (5) until no single change is made to FOLLOWk sets,
  7. remove all generator entries.

Please note that unlike FIRSTk sets, every entry in FOLLOWk sets is k terminals long.
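One possible reading of the steps above, as a sketch rather than the actual implementation. It assumes the FIRSTk sets of all non terminals are already available (here they are written by hand for a made-up toy grammar), it treats an entry shorter than k as a generator "u A", and it keeps generators in a separate table so that step 7 becomes implicit. All names are hypothetical.

```python
EOF = "$"

def concat_k(left, right, k):
    """⊕k: pairwise concatenation trimmed to k symbols."""
    return {(x + y)[:k] for x in left for y in right}

def first_k_of(seq, first, k):
    """FIRSTk of a symbol sequence, given FIRSTk of every non terminal."""
    acc = {()}
    for sym in seq:
        acc = concat_k(acc, first.get(sym, {(sym,)}), k)
    return acc

def follow_k(grammar, start, first, k):
    symbols = set(grammar) | {s for prods in grammar.values() for rhs in prods for s in rhs}
    follow = {s: set() for s in symbols}            # step 1
    follow[start].add((EOF,) * k)                   # step 2
    generators = {s: set() for s in symbols}        # pairs (u, A) standing for "u A"
    for lhs, prods in grammar.items():              # step 3
        for rhs in prods:
            for i, sym in enumerate(rhs):
                for x in first_k_of(rhs[i + 1:], first, k):
                    if len(x) == k:
                        follow[sym].add(x)          # already k terminals long
                    else:
                        generators[sym].add((x, lhs))
    changed = True
    while changed:                                  # steps 5 and 6
        changed = False
        for sym, gens in generators.items():
            for u, a in gens:
                new = concat_k({u}, follow[a], k) - follow[sym]
                if new:
                    follow[sym] |= new
                    changed = True
    return follow                                   # step 7: generators were never mixed in

# Hypothetical toy grammar: S := A b, A := a | ε, with hand-written FIRST2 sets.
TOY = {"S": [["A", "b"]], "A": [["a"], []]}
TOY_FIRST = {"S": {("a", "b"), ("b",)}, "A": {("a",), ()}}
FOLLOW2 = follow_k(TOY, "S", TOY_FIRST, 2)
```

For this toy grammar `FOLLOW2["A"]` contains the single entry `b $`, and every entry in every set is exactly two terminals long.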

Lookaheads and DFA

Computing lookaheads is basically the same algorithm as step (3) of computing FOLLOWk sets, only now the iteration is not over all symbols, searching for them in productions, but over productions, looking for the split points between stack and input. After all the computation it is necessary to either remove dead NFA states from the DFA or mark them in some way, because otherwise you will get a lot of conflicts for your grammar. A dead NFA state is a state with a non terminal following the split point.

Final words

After finding all the bugs you will notice the hardest part is the UX of the parser: how to report the DFA in a meaningful way, how to define precedence rules efficiently and correctly. Those issues are ahead of me, but stay in touch; once I deal with them I will surely share my experience with you. And a last bit of advice: if you are not sure whether you should go with LALR(1) or LALR(k), I suggest going with the latter. It will force you to use a proper algorithm (Aho&Ullman), not (with all respect) something half-baked by yourself (which will work, yet with a high chance of being inefficient). It will force you to clean up your code related to the DFA as well. All LALR(k>1) parsers are in the same league, the only distinct parser is LALR(1) alone, so starting from it is very likely to be a waste of time. I love that feeling of peace of mind after a finished job. Now I can enjoy the upcoming weekend with Skila…


LALR(k>1) — FIRST sets

Update: Thanks to help by Hendrik Jan after 4 days of struggle I can see my mistake — problem solved.

Three days ago I started working on LALR(k>1) to save a weekend solely for Skila development. Unfortunately I underestimated the problem, and now not only do I feel dizzy but I also don’t see the end of the work. All three days went into the algorithm for computing FIRSTk sets.

Surprisingly, I didn’t find many sources explaining how to compute FIRSTk sets for k>1. There is the old “The Theory of Parsing, Translation, and Compiling (vol.1)” by Alfred V. Aho and Jeffrey D. Ullman (yes, that’s 1972), a page “Review of LR(k) and LALR(k) Parsing Theory” by Sean E. O’Connor, and a paper “An Algorithm for Constructing a Semi-LL(2) Grammar’s Parsing Table” by Yoshida Keiichi and Takeuchi Yoshiko.

I have a hard time deciphering the last one, and I cannot use the first two because the algorithm presented there (the same one) is in my opinion incorrect. If you don’t have a copy of Aho&Ullman please refer to the mentioned page by Sean E. O’Connor because the notation is the same.

The algorithm uses helper Fi sets to iteratively compute FIRSTk sets. We have:

Fi(a) = { a }

for every terminal a and index i ≥ 0. For non terminals starting point is defined as:

F0(A) = { x | ∃ A := x α, |x| ≤ k }

x denotes a sequence (possibly empty) of terminals, α denotes a sequence of symbols (terminals or non terminals) — also possibly empty.

Two things worth noting — there has to be such production A := x α defined, a derived production is not enough, and second thing — x is a greedy sequence, if it can match k terminals, it does.

At each iteration we get from the set Fi(A) a set Fi+1(A) that is no smaller, i.e. each element of Fi(A) is present in Fi+1(A). The claim is that when we reach the point where Fi(A) is equal to Fi+1(A) we stop, because Fi(A) is our desired FIRSTk(A) set.

And I don’t think this claim holds.

Since it says FIRSTk(A) = Fi(A) ⊇ Fi-1(A) ⊇ ... ⊇ F0(A), to support my doubt it is enough to find an element of F0(A) that does not belong to FIRSTk(A). Consider these productions:

A := a B
B := b

For k=2 what is the FIRSTk(A) set?

FIRSTk(A) = { a b }

Note that there is no other string; the FIRSTk(A) set contains just one element. And what does F0(A) look like?

F0(A) = { a }

This element is not part of FIRSTk(A), and as a counterexample it shows the algorithm is incorrect.

The real question is: where did I make a mistake? Because when it comes to Mr. Nobody finding errors in well-established publications by well-known authors, let’s face it, Mr. Nobody is more often wrong than right. I will be grateful for your comments.

Solution: I misinterpreted the way x’es are obtained for A’s — they can be shorter than k if and only if α is empty.
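With that correction the counterexample disappears. Below is a small fixpoint sketch in the spirit of the Fi iteration (not a transcription of the book's algorithm), using the corrected starting sets. The function names and the tuple representation of strings are assumptions made for this illustration.

```python
def concat_k(left, right, k):
    """⊕k: pairwise concatenation trimmed to k symbols."""
    return {(x + y)[:k] for x in left for y in right}

def f0(grammar, k):
    """Corrected starting sets: a prefix shorter than k counts only when α is empty."""
    start = {}
    for nt, prods in grammar.items():
        start[nt] = set()
        for rhs in prods:
            n = 0
            while n < len(rhs) and n < k and rhs[n] not in grammar:
                n += 1                      # x is greedy: take up to k leading terminals
            x, alpha = tuple(rhs[:n]), rhs[n:]
            if len(x) == k or not alpha:
                start[nt].add(x)
    return start

def first_k(grammar, k):
    f = f0(grammar, k)
    while True:
        nxt = {nt: set(entries) for nt, entries in f.items()}
        for nt, prods in grammar.items():
            for rhs in prods:
                acc = {()}
                for sym in rhs:
                    acc = concat_k(acc, f[sym] if sym in grammar else {(sym,)}, k)
                nxt[nt] |= acc
        if nxt == f:                        # Fi+1 equals Fi: done
            return f
        f = nxt

# The productions from the post: A := a B, B := b.
G = {"A": [["a", "B"]], "B": [["b"]]}
F0 = f0(G, 2)
FIRST2 = first_k(G, 2)
```

For k=2 the corrected `F0["A"]` is empty, since `a` is shorter than k and is followed by `B`, and the iteration ends with `FIRST2["A"]` holding only `a b`.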
