Do lexer generator — get appetite for more

As a little break from Skila I decided to add missing part in NLT generator — a lexer. Irony, because adding it was more difficult than parser part.

First of all I was bitten by lack of tail recursion in C# — when scanning through huge comment block all I’ve got was StackOverflowException. Secondly, I couldn’t decipher scanning history of my own program — that’s bad. Once I fixed those weak spots I started adding feature after feature. Namely:

  • lexer can return stream of tokens per each match text,
  • if lexer detects Value for matched token is not set it uses Text for it (if it is not set too, the terminal code),
  • lexer generator supports pseudo-variables like “$match”, “$value”, “$text”, and “$token” for easier writing scanning rules.

Oh, wait — I didn’t say anything about lexer generator. Well, there are two more sections added to grammar file — one is for defining states for lexer, the second is for the rules.

The NLT generator allows to define string and regex based rules:

"class" -> CLASS;

/[A-Za-z_][A-Za-z_0-9]*/ 
  -> ID { new IdSymbol($text) };

It lets you define simple form rules (as above) — where you just return terminal and optionally a value as a C# expression — or complex rules where everything is expressed as block of C# statements:

"*/" {
       $value = "Unmatched */";
       $token = TokenEnum.Error;
     };

Still, this is an experimental project, but it is not far to get fully operational generator — the only part missing is translating parser rules into action table instead of creating the builder.

And yet I am tempted to wait and add two features to my agenda:

  • supporting pseudo-variables in parser rules,
  • make the parser-driven lexer.

The second idea might be an overkill, but so far I don‘t see simpler solution. Consider such rule:

/[A-Za-z_][A-Za-z_0-9]*/ 
  -> ID { new IdSymbol($text) };

This is easy to scan, because once I see open brace I switch lexer state to CODE and use different scanning rules. But as a user I don’t like the syntax of the grammar file. What I would like to see is:

/[A-Za-z_][A-Za-z_0-9]*/ 
  -> ID, new IdSymbol($text);

No braces, and it is more obvious that simple form requires expression, not a block of code (statements). Not an easy task for lexer though — I cannot switch to CODE state each time I see a comma character. I could add extra logic to scanning rules, but this work partially is already done — it is called parser.

Parser knows much more — which part of the tree it is working on and here after comma it can anticipate nothing else but an expression, because there is no such form as identifier followed by comma except terminal and value pair. Thus it might be possible to change the workflow of lexer — instead of scanning everything in one take, scanning on demand. Second change — allowing parser to change the lexer state.

It looks very promising, and I cannot stop wondering if it is the best approach to tackle such problem as described above. Anyway, the new NLT package is ready for download.

Tagged , , , ,

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Google photo

You are commenting using your Google account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s

%d bloggers like this: