32leaves.net

Generating syntax diagrams from Scala parser combinators

Scala parser combinators are a powerful concept which allow you to write parsers with the standard Scala distribution. They’re implemented as an internal DSL which makes them feel naturally in the language. As mentioned before I’m currently writing some sort of a state machine compiler. As I plan to write some documentation for that software (which actually is kind of rare :-P) I need some way to describe the syntax. Besides EBNF, syntax diagrams are a neat way to do that. The external DSL which is used to describe the state machines is directly written using the parser combinator, so I had figure out a way to generate syntax diagrams as well as EBNF out of them.
My first approach was to pattern match the resulting Parser structure. But unfortunately the combinators themselves are mostly modeled as a function, thus do not appear in the resulting object graph. So I decided to write a Scala compiler plugin and run thru the AST. Basically all this software does is to transform trees from the AST to some intermediate expression structure to a graphing tree one. This approach suffers some limitations and I made some assumptions as well:

  • The name of the class which contains the production rules must equal the filename without “.scala”
  • Production rules must not be named: “stringWrapper”, “literal”, “regex”, “Predef”, “r” or “accept”
  • def production rules must not have parameters, otherwise they’re not considered to be production rules
  • production rules defined via val will appear twice (this is most likely a bug)

All in all this implementation works in my case, but might not work for others. This more of a quick hack than carefully engineered software, but it proofs the concept and gets the job done.

With the ability to draw syntax diagrams and having the expression tree structure at hand (which can be seen as an AST for EBNF) writing an EBNF parser (which is semi ISO/IEC 14977 : 1996(E) compatible) and feeding the diagram generator with the result was a snap.
So let’s consider the following example (which actually is the EBNF parser I wrote):

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
def start: Parser[List[ProductionRule]] = syntax_rule*
def syntax_rule = meta_identifier ~ '=' ~ definitions_list ~ ';' ^^ {
  case name ~ _ ~ expr ~ _ => ProductionRule(name, expr)
}
def meta_identifier = """[a-zA-Z0-9]+"""r
def definitions_list:Parser[Expression] = (single_definition ~ (('|' ~ single_definition)*)) ^^ {
  case t ~ o => OrExpression(t :: o.map(_ match { case _ ~ x => x }))
}
def single_definition = (primary ~ (((","?) ~ primary)*)) ^^ {
  case t ~ o => SeqExpression(t :: o.map(_ match { case _ ~ x => x }))
}
def primary: Parser[Expression] = optional_sequence | repeated_sequence | grouped_sequence | terminal_string | meta_identifier ^^ {
  case s => RuleRefExpression(s)
}
def optional_sequence = "[" ~ definitions_list ~ "]" ^^ {
  case _ ~ os ~ _ => OptionExpression(os)
}
def repeated_sequence = "{" ~ definitions_list ~ "}" ^^ {
  case _ ~ rs ~ _ => ZeroToManyExpression(rs)
}
def grouped_sequence = "(" ~ definitions_list ~ ")" ^^ {
  case _ ~ gs ~ _ => gs
}
def terminal_string = (""""[^"\\\r\n]*(?:\\.[^"\\\r\n]*)*" """r) ^^ {
  case s => LiteralExpression(s)
}

The corresponding EBNF expression generated by this software is:

1
2
3
4
5
6
7
8
9
10
start = { syntax_rule };
syntax_rule = ( '[a-zA-Z0-9]+' '=' ( single_definition { ( '|' single_definition ) } ) ';' );
meta_identifier = '[a-zA-Z0-9]+';
definitions_list = ( ( primary { ( [ ',' ] primary ) } ) { ( '|' single_definition ) } );
single_definition = ( ( optional_sequence | repeated_sequence | grouped_sequence | terminal_string | meta_identifier ) { ( [ ',' ] primary ) } );
primary = ( ( '[' definitions_list ']' ) | ( '{' definitions_list '}' ) | ( '(' definitions_list ')' ) | '"[^"\\\r\n]*(?:\\.[^"\\\r\n]*)*" ' | '[a-zA-Z0-9]+' );
optional_sequence = ( '[' ( single_definition { ( '|' single_definition ) } ) ']' );
repeated_sequence = ( '{' ( single_definition { ( '|' single_definition ) } ) '}' );
grouped_sequence = ( '(' ( single_definition { ( '|' single_definition ) } ) ')' );
terminal_string = '"[^"\\\r\n]*(?:\\.[^"\\\r\n]*)*" ';

And of course the syntax diagrams we wanted:

So far the syntax diagram generator understands two options:

  • -P:virtual:rule01,...,ruleN: declares rule01,…,ruleN to be virtual rules which causes them to be resolved wherever they’re referenced. Be aware that recursive virtual rules are only resolved to one level
  • -P:depth:N: resolve rule references until a depth of N (recursive rules are only resolved for one level)
If you’re interested in the code you can grab it from this SVN repository (username: guest, password: guest). You’ll need Scala as well as Batik to run it.
Creative Commons License
SyntaxDiagramGenerator is licensed under a Creative Commons Attribution-Share Alike 3.0 Unported License
Fork me on GitHub