foundation.fluent.api:rpg

Java AST driven LR0 / LR1 parser generator, generating parser based on Java AST factory methods, and returning user defined root node of the abstract syntax tree.

License	License New BSD License
GroupId	GroupId foundation.fluent.api
ArtifactId	ArtifactId rpg
Last Version	Last Version 0.14
Release Date	Release Date Aug 14, 2020
Type	Type jar
Description	Description Java AST driven LR0 / LR1 parser generator, generating parser based on Java AST factory methods, and returning user defined root node of the abstract syntax tree.
Project URL	Project URL https://github.com/c0stra/rpg/wiki/rpg/rpg

Download rpg

Filename	Size
rpg-0.14.pom
rpg-0.14.jar	16 KB
rpg-0.14-sources.jar	22 KB
rpg-0.14-javadoc.jar	107 KB
Browse

How to add to project

Apache Maven

<!-- https://jarcasting.com/artifacts/foundation.fluent.api/rpg/ -->
<dependency>
    <groupId>foundation.fluent.api</groupId>
    <artifactId>rpg</artifactId>
    <version>0.14</version>
</dependency>

Gradle Groovy

// https://jarcasting.com/artifacts/foundation.fluent.api/rpg/
implementation 'foundation.fluent.api:rpg:0.14'

Gradle Kotlin

// https://jarcasting.com/artifacts/foundation.fluent.api/rpg/
implementation ("foundation.fluent.api:rpg:0.14")

Apache Buildr

'foundation.fluent.api:rpg:jar:0.14'

Apache Ivy

<dependency org="foundation.fluent.api" name="rpg" rev="0.14">
  <artifact name="rpg" type="jar" />
</dependency>

Groovy Grape

@Grapes(
@Grab(group='foundation.fluent.api', module='rpg', version='0.14')
)

Scala SBT

libraryDependencies += "foundation.fluent.api" % "rpg" % "0.14"

Leiningen

[foundation.fluent.api/rpg "0.14"]

Dependencies

test (2)

Group / Artifact	Type	Version
org.testng : testng	jar	7.1.0
org.mockito : mockito-core	jar	2.18.3

Project Modules

There are no modules declared in this project.

RPG - Java Rapid Parser Generator

!!! Under development !!! - API of Lexer, Parser and common tools is not yet stable and may change.

Java Abstract Syntax Tree driven LR0 / LR1 rapid parser generator, generating parser based on Java AST factory methods, and returning user defined root node of the tree.

Table of content

Imagine your syntax tree factory class is at the same time definition of your parser grammar. That can be achieved by small enhancement of the factory.

Node types (classes) in the abstract syntax (parse) tree in fact correspond to non-terminal symbols of a grammar. There are also types, that represent terminal symbols of our grammar (identifiers, string literals, any other literals). In order to define grammar, we'd need to add "unused" types describing only the syntax.

To avoid unnecessary overhead in a project, module rpg-common contains plenty of ready to use classes representing terminal symbols. Common operators, keywords as well as literals.

Maven configuration

Code dependencies

First you need dependencies used in your code.

Following dependency contains annotations used to mark your grammar: @StartSymbol and @Priority.

Then it contains base classes, that the generated parser uses. So this is mandatory compile time dependency. Put it simply to your dependencies.

<dependency>
    <groupId>foundation.fluent.api</groupId>
    <artifactId>rpg</artifactId>
    <version>${rpg.version}</version>
</dependency>

Next dependency is not mandatory, but provides plenty of ready to use tools like:

Ready to use classes representing terminal symbols used in common languages, like operators, keywords, parentheses, whitespaces, etc.
AstUtils class for frequently used actions in building the whole tree. Mostly shortcuts to simply create or update lists or maps.
Ready to use sets of rules:
- Typical whitespace removal rules - ignoring whitespaces and comments.
- Meta rules for typical constructs - lists with left recursion, comma separated lists etc.

It's this dependency:

<dependency>
    <groupId>foundation.fluent.api</groupId>
    <artifactId>rpg-common</artifactId>
    <version>${rpg.version}</version>
</dependency>

Annotation processing dependencies

We need a dependency for annotation processor in order to get the parser generated. Ideally pass it to compiler plugin. But if no compiler plugin is defined, it can be picked up also from project's compile time dependencies.

<build>
    <plugins>
        <plugin>
            <groupId>org.apache.maven.plugins</groupId>
            <artifactId>maven-compiler-plugin</artifactId>
            <version>${compiler.plugin.version}</version>
            <configuration>
                <annotationProcessorPaths>
                    <annotationProcessorPath>
                        <groupId>foundation.fluent.api</groupId>
                        <artifactId>rpg-apt</artifactId>
                        <version>${rpg.version}</version>
                    </annotationProcessorPath>
                </annotationProcessorPaths>
            </configuration>
        </plugin>
    </plugins>
</build>

Grammar definition using Java code - re-use AST factory

Given the terminal classes and AST nodes declared for our language, lets define (and enrich by "syntactic" terminals) the the factory, which in turn defines our grammar rules:

package foundation.rpg.sample.language.ast;

import foundation.rpg.StartSymbol;
import foundation.rpg.common.*;

import java.util.List;

import static foundation.rpg.common.AstUtils.addTo;
import static foundation.rpg.common.AstUtils.list;

public class AstFactory {

    @StartSymbol
    Program              is (List<Statement> s)                                    { return new Program(s); }
    List<Statement>      is ()                                                     { return list(); }
    List<Statement>      is (List<Statement> l, Statement s)                       { return addTo(l, s); }
    Statement            is (@Additive Expression e, Dot d)                        { return new ExpressionStatement(e); }
    @Additive Expression is (@Atomic Expression e)                                 { return e; }
    @Additive Expression is (@Additive Expression l, Plus p, @Atomic Expression r) { return new BinaryExpression(l, r); }
    @Atomic Expression   is (Identifier i)                                         { return i; }
    @Atomic Expression   is (LPar l, Expression e, RPar r)                         { return e; }

    static void ignore(WhiteSpace w) {}
    static void ignore(Comment c) {}

}

The goal of RPG is to re-use such factory as grammar definition, and generate the LR(1) parser forsuch grammar, which in turn invokes the factory methods during parsing. BTW. the AST factory in fact looks very similar to other parser generators. Yet this is directly Java code.

Lexer definition

As any parser, we also need to have complementary lexer. It should be generated too from yet another additional information - token description (a.k.a. regular expression).

RPG lexer generator picks token description from annotations.

Annotation @Name(String) is used by the lexer to be matched as-is. That allows usage of any characters, including those, having special meaning in RPG's regular expression.
Annotation @Match(String) is used by the lexer to match token by RPG's regular expression. Keep in mind, that RPG doesn't implement none of widely used full standards of regular expressions. It implements just limited set of features.

RPG's Regular expressions

RGP's regular expressions support following:

Feature	Example	Description
Alternation	`a\|b`	Match `a` or `b`
Concatenation	`ab`	Match exact sequence `ab`
Repetition	`a*`	Matches any number of `a` including 0
Repetition	`a+`	Matches any number of `a` but at least 1
Character class	`[abc]`	Matches `a` or `b` or `c`
Character range	`[a-c]`	Matches `a` or `b` or `c`
Character class inversion	`[^abc]`	Matches characters other than `a`, `b` or `c`
Subpattern (no capturing)	`(a)`	Matches `a`, but useful e.g. for repetition: `(ab)*`
Predefined character class shortcuts
Digit	`\d`	Matches any digit
Word	`\w`	Matches any digit or letter or `_`
Whitespace	`\s`	Matches any whitespace character
Not a digit	`\D`	Matches anything else than digit
Not a word	`\W`	Matches anyhing else than digit or letter or `_`
Not whitespace	`\S`	Matches anyhing else than whitespace character
Additional usefull groups
Start of identifier	`\i`	Any character or `_` or `$`
Start of unicode identifier	`\u`	Any unicode character or `_` or `$`
Unicode word	`\x`	Any unicode character, digit or `_` or `$`

Not supported features (not listed all, but most important examples):

Feature	Example	Description
Optional occurence	`a?`	Match when `a` is present once, or not at all
Repetition with boundaries	`a{3}`, `a{3,6}`	Matches `a` min-to-max times
Possix groups
Equivalence classes
Back references
Anything else not mentioned in the table of supported features

Instructions for Lexer generator

Annotate directly the class representing terminal symbol with @Name or @Match:

@Name(".") class Dot {}
@Name("+") class Plus {}
@Name("(") class LPar {}
@Name(")") class RPar {}
@Match("\\i\\w*") class Identifier {}

Annotate the symbol anywhere in the any grammar rule. Keep in mind, it can be annotated only once not to introduce conflicts

Object is (@Match("\\d+") Integer i) { return i; }

For both cases above, the type representing the terminal symbol needs to have constructor accepting either Token describing fully the code fragment identified, including file, position, length and actual content, or String. Generated lexer will use such constructor.

Explicit instruction for lexer

If more specific creation of a token is needed, then include special methods in the "grammar" factory class, that instruct lexer how to create the instance of terminal. Such method's return type is the type representing terminal symbol, and exactly one parameter must be of type Token, annotated with the matching instruction:

Integer integer (@Match("\\d+") Token t) { return Integer.parseInt(t.toString()); }

Usage of generated parser

Resulting generated code usage example is following:

Given that the start symbol (root node) is called e.g. Program, then by default you'll get parser generated as ProgramParser with methods to parse input stream, file, reader or string.

Usage is following:

ProgramParser parser = new ProgramParser(astFactory);
Program program = parser.parse(new FileReader("program.src"));

Now Program is already our totally custom program representation created using our fully controlled AstFactory.

Name of the parser class can be provided (only simple name, not package) via the @StartSymbol annotation:

@StartSymbol(parserClassName = "MyParser")
Program is (...) { ... }

Generated Parser - Typesafe heterogeneous active stack

The LR(1) parser, generated by RPG is implemented as object oriented stack based state automata.

By object oriented state automata I mean here visitor pattern where: Token represents (wraps) a terminal symbol red from input, which is at the same time the node in the visitor pattern. State represents state of the LR(1) parsing automata, and contains whole transition table as methods. So it is the visitor in the visitor pattern. Any non-starting state is also representing item on the stack holding typesafe symbol of the grammar (node in the AST), and link to previous stacked state. Every state represents exactly known prefix of the longest possible rule to reduce including all types of symbols within the prefix. So it's typesafe. As reduction is done within a state visitor method, it benefits from the type information of all elements on the stack to be used in the rule application (AST method invocation).

Meta rules

RPG supports re-usable rules using annotation marked generic methods.

Such rules are called Meta rules and in fact you can think of them as of generic rules or rule templates, which can be re-used, and applied (expanded) by refering their marker annotation.

See example:

// Marker annotation
@MetaRule @interface SimpleList {}

// Re-usable meta rules
@SimpleList static <T> List<T> is () { return list(); }
@SimpleList static <T> List<T> is(List<T> l, T t) { return addTo(l, t); }

// Usage of the meta rules adds them automatically for given type.
static Program is(@SimpleList List<Statement> s, End e) { return new Program(s); }

Decomposition of grammar definition

There is also support for decomposition of the rules into multiple interfaces. They are re-used simply by extending the other interface.

rpg-common comes already with such re-usable sets of rules. You can re-use them by extending:

public interface MyAstFactory extends WhiteSpaceRules, ListRules {

    @StartSymbol
    Program is (@List1 List<Statement> s) { return new Program(s); }
    // ...

}

"No coding parser"

As RPG allows to use any type, including builtin or 3rd party types for representation of the AST nodes (and therefore symbols of the grammar), then it makes it very simple to generate data parsers. E.g. very simple JSON parser can be constructed with following factory / grammar definition:

@SuppressWarnings("unused")
public class JsonFactory {

    @StartSymbol
    Object              is (@Match("'([~'\\]|\\['\\rnt])*'|\"([~\"\\]|\\[\"\\rnt])*\"") String v) { return v; }
    Object              is (@Match("\\d*") Integer v)                                             { return v; }
    Object              is (@Match("\\d+\\[\\.eE]\\d+") Double v)                                 { return v; }
    Object              is (LBr o, List<Object> l, RBr c)                                         { return l; }
    Object              is (LBr o, RBr c)                                                         { return emptyList(); }
    Object              is (LCurl o, Map<String, Object> m, RCurl c)                              { return m; }
    Object              is (LCurl o, RCurl c)                                                     { return emptyMap(); }
    List<Object>        is (Object v)                                                             { return list(v); }
    List<Object>        is (List<Object> l, Comma c, Object v)                                    { return addTo(l, v); }
    Map<String, Object> is (String k, Colon c, Object v)                                          { return map(k, v); }
    Map<String, Object> is (Map<String, Object> m, Comma s, String k, Colon c, Object v)          { return putUniqueIn(m, k, v, "Duplicate key: " + k); }

    void ignore(@Match("\\s+") WhiteSpace w) { }

}

It demonstrates, that there's no need to define any AST nodes, as Java's List, Map and literals is sufficient to represent simple JSON object. The factory in turn also immediately checks correctness - not allowing duplicate entries in JSON object.

For full example see json package in rpg-sample-language module. There is also unit test for it.

Another interesting example can be no-AST parser. In this case structure generated by the parser may not need to be represented by abstract syntax tree, composed of different nodes.

Example is such regexp parser constructing directly generalized non-deterministic finite automata using Thompson's algorithm (it is used in the lexer generator):

public class RegularGNFAFactory {
    private final Thompson thompson;

    public RegularGNFAFactory(Thompson thompson) { this.thompson = thompson; }

    @StartSymbol
    @Pattern GNFA            is  ()                                                  { return thompson.empty(); }
    @Pattern GNFA            is  (@Chain Stream<GNFA> l)                             { return thompson.alternation(l); }
    @Chain Stream<GNFA>      is  (@Chain GNFA c)                                     { return of(c); }
    @Chain Stream<GNFA>      is  (@Chain Stream<GNFA> l, Pipe p, @Chain GNFA c)      { return concat(l, of(c)); }
    @Chain GNFA              is1 (@Node Stream<GNFA> l)                              { return thompson.chain(l); }
    @Node Stream<GNFA>       is1 (@Node GNFA g)                                      { return of(g); }
    @Node Stream<GNFA>       is1 (@Node Stream<GNFA> l, @Node GNFA g)                { return concat(l, of(g)); }

    @Node GNFA               is  (LPar l, @Pattern GNFA p, RPar r)                   { return p; }
    @Node GNFA               is  (@Node GNFA a, Star t)                              { return thompson.repetition(a); }
    @Node GNFA               is  (@Node GNFA a, Plus p)                              { return thompson.chain(of(a, thompson.repetition(a))); }
    @Node GNFA               is  (Character c)                                       { return thompson.transition(c); }
    @Node GNFA               is  (Dot d)                                             { return thompson.any(); }
    @Node GNFA               is  (Bs b, Character g)                                 { return thompson.group(g); }
    @Node GNFA               is  (Bs b, Bs g)                                        { return is('\\'); }
    @Node GNFA               is  (Bs b, Dot d)                                       { return is('.'); }
    @Node GNFA               is  (Bs b, Star s)                                      { return is('*'); }
    @Node GNFA               is  (Bs b, LBr s)                                       { return is('['); }
    @Node GNFA               is  (Up u)                                              { return is('^'); }
    @Node GNFA               is  (LBr l, @Chars Stream<Character> i, RBr r)          { return thompson.transitions(i); }
    @Node GNFA               is  (LBr l, Up t, @Chars Stream<Character> i, RBr r)    { return thompson.inversions(i); }
    @Chars Stream<Character> is2 (Stream<Character> s)                               { return s; }
    @Chars Stream<Character> is  (@Chars Stream<Character> s1, Stream<Character> s2) { return concat(s1, s2); }
    Stream<Character>        is1 (Character c)                                       { return of(c); }
    Stream<Character>        is1 (Dot dot)                                           { return is1('.'); }
    Stream<Character>        is1 (Star s)                                            { return is1('*'); }
    Stream<Character>        is1 (LBr s)                                             { return is1('['); }
    Stream<Character>        is1 (Pipe s)                                            { return is1('|'); }
    Stream<Character>        is1 (Bs bs, Bs b)                                       { return is1('\\'); }
    Stream<Character>        is1 (Bs bs, RBr b)                                      { return is1(']'); }
    Stream<Character>        is  (Character s, Minus m, Character e)                 { return rangeClosed(s, e).mapToObj(i -> (char) i); }

    @SymbolPart @interface Pattern {}
    @SymbolPart @interface Chain {}
    @SymbolPart @interface Node {}
    @SymbolPart @interface Chars {}
}

Most of the nodes in the output representation (not a tree but complex graph) are partial GNFA, result typically also GNFA. Naming the nodes still allows to use them as "different symbols" in grammar (syntax) but same object on implementation side.

Resources

Resources used for implementation of this parser generator:

[0] https://en.wikipedia.org/wiki/Abstract_syntax_tree
[1] https://en.wikipedia.org/wiki/LR_parser
[2] https://cs.wikipedia.org/wiki/LR_syntakticky_analyzator
[3] https://en.wikipedia.org/wiki/Visitor_pattern
[4] https://en.wikipedia.org/wiki/Regular_grammar
[5] https://en.wikipedia.org/wiki/Thompson%27s_construction
[6] https://en.wikipedia.org/wiki/Powerset_construction
[7] https://en.wikipedia.org/wiki/Generalized_nondeterministic_finite_automaton
[8] https://en.wikipedia.org/wiki/DFA_minimization
[9] ČVUT, 1999, Češka Milan, Prof. RNDr., CSc., Ježek Karel, doc. Ing., Csc., Melichar Bořivoj, prof. Ing., Dr.Sc., Richta Karel: Konstrukce překladačů
[10] 2004, David R. Tribble: Practical LR(k) Parser Construction
[11] 2017, Tomáš Vacho: Algorithms for transformation of regular expressions to finite automata
[12] http://michaluvweb.cz/2015/11/10/ka06-prevod-znka-na-dka/
[13] https://medium.com/@phanindramoganti/regex-under-the-hood-implementing-a-simple-regex-compiler-in-go-ef2af5c6079
[14] 1996, Tim A. Wagner, Susan L. Graham, General incremental lexical analysis. ACM Transactions on Programming Languages and Systems (ftp://ftp.cs.berkeley.edu/pub/sggs/lexing.ps)
[15] 1997, Tim A. Wagner, Susan L. Graham, Efficient and flexible incrementall parsing. ACM Transactions on Programming Languages and Systems (ftp://ftp.cs.berkeley.edu/pub/sggs/toplas-parsing.ps)
[16] 1998, Tim A. Wagner, Susan L. Graham, Modeling user-provided whitespace and comments in an incremental software development environment. Software - Practice and Experience (ftp://ftp.cs.berkeley.edu/pub/sggs/whitespace.ps)
[17] https://microsoft.github.io/language-server-protocol/
[18] https://en.wikipedia.org/wiki/Dangling_else

Versions

Version
0.14 Aug 14, 2020
0.13 Aug 8, 2020
0.12 Aug 3, 2020
0.11 Aug 1, 2020
0.10 Jul 30, 2020
0.9 Jun 1, 2020
0.8 May 29, 2020
0.7 May 26, 2020
0.6 May 22, 2020
0.5 May 22, 2020
0.4 May 21, 2020
0.3 May 19, 2020
0.2 May 15, 2020
0.1 May 6, 2020