Tag Archives: parser-driven scanning

Context scanning — so it already has a name

The idea of context scanning I wrote yesterday about was like a virus of the mind — I just couldn’t settle down having the syntax for the generator I didn’t like.

The solution I described was pretty complex especially for forking parser — because it might happen that context switch occurs inside one of the forked parse trees. This translates to having one lexer (or lexer environment) per each parse tree. There is another issue with having conflicts on lookaheads — the same lookahead coming from one production can have lexer state switch, and coming from another might not have such switch.

It is hard to find any discussion about context scanning, because the internet is polluted with basic “how do I do X in Lex/Yacc”, and any search I tried was jammed. The only pearl I found was “Context-Sensitive Scanning in ANTLR” by invaluable Scott Stanchfield.

In short — problems. And while I am not afraid of hard work, the clock is ticking — I needed working solution fast. So I came up with recipe that simply works in my case, hoping that it might work in others too.

The context for the lexer is just a sequence of previous tokens — that’s all. It allowed me to get rid of such syntax:

/[A-Za-z_][A-Za-z_0-9]*/ 
  -> ID { new IdSymbol($text) };

and introduce more natural (because it looks more like pair):

/[A-Za-z_][A-Za-z_0-9]*/ 
  -> ID, new IdSymbol($text);

It works, because comma character switches to EXPRESSION state if and only if there were two tokens before — right arrow and identifier.

If I could only solve other problems that quickly…

Advertisement
Tagged , ,

Do lexer generator — get appetite for more

As a little break from Skila I decided to add missing part in NLT generator — a lexer. Irony, because adding it was more difficult than parser part.

First of all I was bitten by lack of tail recursion in C# — when scanning through huge comment block all I’ve got was StackOverflowException. Secondly, I couldn’t decipher scanning history of my own program — that’s bad. Once I fixed those weak spots I started adding feature after feature. Namely:

  • lexer can return stream of tokens per each match text,
  • if lexer detects Value for matched token is not set it uses Text for it (if it is not set too, the terminal code),
  • lexer generator supports pseudo-variables like “$match”, “$value”, “$text”, and “$token” for easier writing scanning rules.

Oh, wait — I didn’t say anything about lexer generator. Well, there are two more sections added to grammar file — one is for defining states for lexer, the second is for the rules.

The NLT generator allows to define string and regex based rules:

"class" -> CLASS;

/[A-Za-z_][A-Za-z_0-9]*/ 
  -> ID { new IdSymbol($text) };

It lets you define simple form rules (as above) — where you just return terminal and optionally a value as a C# expression — or complex rules where everything is expressed as block of C# statements:

"*/" {
       $value = "Unmatched */";
       $token = TokenEnum.Error;
     };

Still, this is an experimental project, but it is not far to get fully operational generator — the only part missing is translating parser rules into action table instead of creating the builder.

And yet I am tempted to wait and add two features to my agenda:

  • supporting pseudo-variables in parser rules,
  • make the parser-driven lexer.

The second idea might be an overkill, but so far I don‘t see simpler solution. Consider such rule:

/[A-Za-z_][A-Za-z_0-9]*/ 
  -> ID { new IdSymbol($text) };

This is easy to scan, because once I see open brace I switch lexer state to CODE and use different scanning rules. But as a user I don’t like the syntax of the grammar file. What I would like to see is:

/[A-Za-z_][A-Za-z_0-9]*/ 
  -> ID, new IdSymbol($text);

No braces, and it is more obvious that simple form requires expression, not a block of code (statements). Not an easy task for lexer though — I cannot switch to CODE state each time I see a comma character. I could add extra logic to scanning rules, but this work partially is already done — it is called parser.

Parser knows much more — which part of the tree it is working on and here after comma it can anticipate nothing else but an expression, because there is no such form as identifier followed by comma except terminal and value pair. Thus it might be possible to change the workflow of lexer — instead of scanning everything in one take, scanning on demand. Second change — allowing parser to change the lexer state.

It looks very promising, and I cannot stop wondering if it is the best approach to tackle such problem as described above. Anyway, the new NLT package is ready for download.

Tagged , , , ,