32leaves.net

Generating syntax diagrams from Scala parser combinators

Scala parser combinators are a powerful concept which allow you to write parsers with the standard Scala distribution. They’re implemented as an internal DSL which makes them feel naturally in the language. As mentioned before I’m currently writing some sort of a state machine compiler. As I plan to write some documentation for that software (which actually is kind of rare :-P) I need some way to describe the syntax. Besides EBNF, syntax diagrams are a neat way to do that. The external DSL which is used to describe the state machines is directly written using the parser combinator, so I had figure out a way to generate syntax diagrams as well as EBNF out of them.
My first approach was to pattern match the resulting Parser structure. But unfortunately the combinators themselves are mostly modeled as a function, thus do not appear in the resulting object graph. So I decided to write a Scala compiler plugin and run thru the AST. Basically all this software does is to transform trees from the AST to some intermediate expression structure to a graphing tree one. This approach suffers some limitations and I made some assumptions as well:

  • The name of the class which contains the production rules must equal the filename without “.scala”
  • Production rules must not be named: “stringWrapper”, “literal”, “regex”, “Predef”, “r” or “accept”
  • def production rules must not have parameters, otherwise they’re not considered to be production rules
  • production rules defined via val will appear twice (this is most likely a bug)

All in all this implementation works in my case, but might not work for others. This more of a quick hack than carefully engineered software, but it proofs the concept and gets the job done.

With the ability to draw syntax diagrams and having the expression tree structure at hand (which can be seen as an AST for EBNF) writing an EBNF parser (which is semi ISO/IEC 14977 : 1996(E) compatible) and feeding the diagram generator with the result was a snap.
So let’s consider the following example (which actually is the EBNF parser I wrote):

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
def start: Parser[List[ProductionRule]] = syntax_rule*
def syntax_rule = meta_identifier ~ '=' ~ definitions_list ~ ';' ^^ {
  case name ~ _ ~ expr ~ _ => ProductionRule(name, expr)
}
def meta_identifier = """[a-zA-Z0-9]+"""r
def definitions_list:Parser[Expression] = (single_definition ~ (('|' ~ single_definition)*)) ^^ {
  case t ~ o => OrExpression(t :: o.map(_ match { case _ ~ x => x }))
}
def single_definition = (primary ~ (((","?) ~ primary)*)) ^^ {
  case t ~ o => SeqExpression(t :: o.map(_ match { case _ ~ x => x }))
}
def primary: Parser[Expression] = optional_sequence | repeated_sequence | grouped_sequence | terminal_string | meta_identifier ^^ {
  case s => RuleRefExpression(s)
}
def optional_sequence = "[" ~ definitions_list ~ "]" ^^ {
  case _ ~ os ~ _ => OptionExpression(os)
}
def repeated_sequence = "{" ~ definitions_list ~ "}" ^^ {
  case _ ~ rs ~ _ => ZeroToManyExpression(rs)
}
def grouped_sequence = "(" ~ definitions_list ~ ")" ^^ {
  case _ ~ gs ~ _ => gs
}
def terminal_string = (""""[^"\\\r\n]*(?:\\.[^"\\\r\n]*)*" """r) ^^ {
  case s => LiteralExpression(s)
}

The corresponding EBNF expression generated by this software is:

1
2
3
4
5
6
7
8
9
10
start = { syntax_rule };
syntax_rule = ( '[a-zA-Z0-9]+' '=' ( single_definition { ( '|' single_definition ) } ) ';' );
meta_identifier = '[a-zA-Z0-9]+';
definitions_list = ( ( primary { ( [ ',' ] primary ) } ) { ( '|' single_definition ) } );
single_definition = ( ( optional_sequence | repeated_sequence | grouped_sequence | terminal_string | meta_identifier ) { ( [ ',' ] primary ) } );
primary = ( ( '[' definitions_list ']' ) | ( '{' definitions_list '}' ) | ( '(' definitions_list ')' ) | '"[^"\\\r\n]*(?:\\.[^"\\\r\n]*)*" ' | '[a-zA-Z0-9]+' );
optional_sequence = ( '[' ( single_definition { ( '|' single_definition ) } ) ']' );
repeated_sequence = ( '{' ( single_definition { ( '|' single_definition ) } ) '}' );
grouped_sequence = ( '(' ( single_definition { ( '|' single_definition ) } ) ')' );
terminal_string = '"[^"\\\r\n]*(?:\\.[^"\\\r\n]*)*" ';

And of course the syntax diagrams we wanted:

So far the syntax diagram generator understands two options:

  • -P:virtual:rule01,...,ruleN: declares rule01,…,ruleN to be virtual rules which causes them to be resolved wherever they’re referenced. Be aware that recursive virtual rules are only resolved to one level
  • -P:depth:N: resolve rule references until a depth of N (recursive rules are only resolved for one level)
If you’re interested in the code you can grab it from this SVN repository (username: guest, password: guest). You’ll need Scala as well as Batik to run it.
Creative Commons License
SyntaxDiagramGenerator is licensed under a Creative Commons Attribution-Share Alike 3.0 Unported License

4 Responses to “Generating syntax diagrams from Scala parser combinators”

  1. Fabian Steeg says:

    This is very cool! Is the code available somewhere? The linked SVN seems to be down.

  2. Mathias says:

    Christian,
    as an alternative to the Scala Parser Combinators you could have used parboiled (http://parboiled.org).
    Parboiled separates rule construction from parser execution, so your grammar is actually represented by a “tree” of rule objects, which is easily written out to EBNF or the like without any need for “reverse-engineering” the parser instances created by the combinator framework.
    As a nice little something on the side your DSL parser would run a lot faster, report errors better and would optionally even recover from syntax errors…
    Cheers,
    Mathias

  3. I can’t seem to see the code in the SVN repo. I wonder do you fancy putting the code somewhere easily accessible? e.g. github.com or google code maybe?

Leave a Reply

Fork me on GitHub