metadata

Naive one-pass recursive descent, scannerless parser framework for Kotlin

License:
Categories: Data
GroupId: fun.parser
ArtifactId: parserkt-ext-metadata
Last Version: 2.6
Release Date:
Type: pom.sha512
Description: Naive one-pass recursive descent, scannerless parser framework for Kotlin
Project URL: https://github.com/ParserKt/ParserKt
Source Code Management: https://github.com/ParserKt/ParserKt


Dependencies

There are no dependencies for this project. It is a standalone project that does not depend on any other jars.

Project Modules

There are no modules declared in this project.

ParserKt


Introduction

ParserKt is a naive one-pass recursive descent, scannerless parser framework for Kotlin (mainly JVM, compatible with JS).

A parser is a kind of program that extracts data from input sequences:
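
For example, extracting a leading integer from a character sequence. This is a plain-Kotlin sketch (no ParserKt yet), and readInt is a name made up just for this illustration:

// Plain Kotlin, no ParserKt: pull a leading integer out of a character sequence.
fun readInt(s: CharSequence): Int? =
  s.takeWhile { it.isDigit() }.toString().toIntOrNull()

readInt("123abc") //123
readInt("abc")    //null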

NOTE: Using the REPL from the command line is not good practice; you can create a project with a dependency via JitPack, or use the IDEA Kotlin REPL instead.

git clone https://github.com/ParserKt/ParserKt.git && cd ParserKt
gradle shadowJar
kotlinc -cp build/libs/ParserKt-*.jar

Don't be scared! ParserKt is just a small library; even a full build won't take much time. (Or you can download pre-built jars here.)

// Pattern.read(String), CharInput.STDIN, ...
// item(x), elementIn('a'..'z'), satisfy<Int> { it % 2 == 0 }, ...
// asList(), asString(), ...
import org.parserkt.*
import org.parserkt.util.*
import org.parserkt.pat.*

Import parserkt, util, and pat, and then we can implement our own combined pattern! (For a more in-depth sample, see the link in the header.)

val digits = Repeat(asList(), elementIn('0'..'9')) //{[0-9]}
digits.read("123") //[1, 2, 3]
digits.read("12a3") //[1, 2]
digits.read("a3") //null //(= notParsed)

A parser mainly recognizes and transforms inputs into a simpler form (like an AST) for post-processing (prettify, evaluate, type-check, ...), and it can be used to create (DSL) language tools / compilers / transpilers / interpreters.

// Generic parsing for [1, a, b] where (b > a)
val ints = Seq(::IntTuple, item(1), *Contextual(item<Int>()) { i -> satisfy<Int> { it > i } }.flatten().items() )
ints.rebuild(1,2,3) //[1, 2, 3]
ints.rebuild(1,2,1) //notParsed
ints.rebuild(1,0,1) //[1, 0, 1]
// Using Convert patterns
fun digitItem(digit: SatisfyPattern<Char>) = Convert(digit, {it-'0'}, {'0'+it})
val digit = digitItem(elementIn('0'..'9'))
val digitsInt = Repeat(JoinFold(0) { this*10 + it }, digit)
digitsInt.read("233") //233

(click here if you want to try out more)

ParserKt divides a complex grammar into simple val definitions for its sub-parts. Any public pattern (like item, elementIn, Repeat) can serve as an "entry rule", which makes it very easy to debug and to reuse implemented syntax, as the sketch below shows.
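
A minimal sketch reusing only combinators already shown above; every sub-pattern is an ordinary val, so each one doubles as an "entry rule" you can test in isolation:

// Each val is a reusable sub-pattern and also a standalone entry point for debugging.
val sign = elementIn('+', '-')
val digitChar = elementIn('0'..'9')
val number = Repeat(asString(), digitChar)

number.read("42") //"42"
sign.read("+")    //'+'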

What does "one-pass" mean?

See FeedModel.kt:

interface Feed<out T> {
  val peek: T; fun consume(): T
  class End: NoSuchElementException("no more")
}

That's all a subparser can see: no mark/reset, not even a single character of lookahead.
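
As a minimal sketch (an assumed helper, not ParserKt's own CharInput), a Feed over a String makes the contract concrete: you may peek the current item or consume it, and nothing else:

// Illustration only: a one-pass Feed over a String; no mark/reset, no lookahead.
class StringFeed(private val str: String): Feed<Char> {
  private var pos = 0
  override val peek: Char get() = if (pos < str.length) str[pos] else throw Feed.End()
  override fun consume(): Char = peek.also { pos++ }  // return the current item, then advance
}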

What does "recursive descent" mean?

Take a look at these expressions: a + b, a + (b + c)

sealed class PlusAst {
  data class Plus(val left: PlusAst, val right: PlusAst): PlusAst()
  data class IntLiteral(val n: Int): PlusAst()
}

(a sealed class is just a class whose set of subclasses is known, all declared within its own scope)

This data structure implies that every operand of a + b can be an integer (like 0, 9, 16: IntLiteral(n)) or another a + b (1 + 2, 9 + 0: Plus(left, right)).

PlusAst is recursive, so it's best to implement its parser as a recursive function; that's the "recursive" part.

Plus := (Plus '+' Plus) | Int  ; left associative
Int := {[0-9]}

({a} denotes repeat, a|b denotes alternative)

Reading the input "1+2+3" with rule Plus yields Plus(Plus(Int(1), Int(2)), Int(3)).

The parser moves from reading Plus into reading its substructure Int.

Going from rule Plus down to its subrule Int: that's the "descent" part.
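
To make that concrete, here is a hand-rolled sketch in plain Kotlin (not using ParserKt) for the Plus grammar above; parsePlus descends into parseInt and loops to stay left associative:

// Sketch only: assumes well-formed input like "1+2+3" (digits separated by '+').
fun parse(s: String): PlusAst {
  var i = 0
  fun parseInt(): PlusAst {
    val start = i
    while (i < s.length && s[i].isDigit()) i++
    return PlusAst.IntLiteral(s.substring(start, i).toInt())
  }
  fun parsePlus(): PlusAst {
    var left: PlusAst = parseInt()
    while (i < s.length && s[i] == '+') { i++; left = PlusAst.Plus(left, parseInt()) }
    return left
  }
  return parsePlus()
}
// parse("1+2+3") == Plus(Plus(IntLiteral(1), IntLiteral(2)), IntLiteral(3))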

——

ParserKt cannot perform any lookahead, so it looks impossible to parse syntax like //, /*, or the 0x / 0b prefixes (and to report an error on 0-prefixed octal).

In fact, "lookahead" can be stored on the call stack of Pattern.read via the Contextual pattern, so writing a parser for such syntax is possible (but much more complicated, so it's better to create a new Pattern subclass, fall back to the tokenizer-parser LexerFeed, or use TriePattern).
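
A rough sketch of that idea, mirroring the Contextual usage from the generic-parsing example above (the exact result shape is omitted; the names here are illustrative):

// The already-consumed '0' is captured by the lambda, playing the role of one item of
// "lookahead" kept on the call stack of Pattern.read; the tail pattern then picks 'x' or 'b'.
val radixPrefix = Contextual(item('0')) { _ -> elementIn('x', 'b') }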

What is the difference between "parser combinator" and "parser compiler"?

Let me clarify first: I really don't know what "combinator" is :(

I think combinators must be related to "functional programming", or at least "declarative programming": define what it is, not how the computer actually solves it.

A parser combinator is mainly about code reuse; a parser compiler is mainly about pattern-matching algorithms.

Parser compilers and parser combinators solve the same problem in different ways. A parser compiler has better portability and (probably) better performance. A parser combinator integrates better with the host language, and it's really easy to write and debug, since it's just a hand-written program.

For example, keywords about parser compilers: LL(k), LR(k), LALR, NFA, DFA, KMP, Aho–Corasick

😨 Scared? The recursive descent method is the most popular way to handwrite a parser, used by many well-known projects like Lua, and it's also the best choice for code quality: source files generated by parser compilers are often a mess!

Features

  • Minimal boilerplate code & high reusability
  • DSL style & fully object-oriented library design
  • Rebuild parse result back to input (REPL friendly)
  • Generic parsing input (Char, String, XXXToken, ...)
  • One-pass input stream design without hasNext
  • ReaderFeed for implementing interactive shells (JVM only)
  • The framework manages parsed data in a flexible and extensible way (Tuple2, Sized-Tuple, Fold)
  • Supports complex contextual syntax structure that cannot be supported directly in parser compilers (LayoutPattern, NumUnitPattern, ...)
  • Extensible error messages using clam {"message"}, clamWhile(pat, defaultValue) {"message"}
  • Parser input stream Feed can have state storage: withState(T), stateAs<T>()
  • <500K compiled JVM library, no extra runtime needed (except kotlin-stdlib, Kotlin sequences are optional)
  • No magic: all code has been rewritten at least 9 times by the author; ugly/ambiguous/disordered code has been removed

Great thanks to Kotlin, for its strong expressiveness and amazing type inference.

Provided combinators

Modules

  • :parserkt for Feed/Input model and basic (org.parserkt.pat) / complex (org.parserkt.pat.complex) patterns
  • :parserkt-util Fundamental data types (Slice, Tuple, Trie, ...) and helper classes (Preety, AnyBy, ...) used by / prepared for ParserKt and friends
  • :parserkt-ext Extension library for real-world parsers, eliminating boilerplate code (LayoutPattern, LexicalBasics, ...)
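
A hedged sketch of a Gradle (Kotlin DSL) dependency block: the group fun.parser and version 2.6 come from the metadata above, while the artifact names are only assumed to match the module names listed here; verify against the actual repository or JitPack before use.

dependencies {
    implementation("fun.parser:parserkt:2.6")      // assumed artifact id
    implementation("fun.parser:parserkt-util:2.6") // assumed artifact id
    implementation("fun.parser:parserkt-ext:2.6")  // assumed artifact id
}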

Abbreviations

  • POPCorn (Pattern, OptionalPattern, PatternWrapper, ConstantPattern)
  • SURDIES (Seq, Until, Repeat, Decide, item, elementIn, satisfy)
  • CCDP (Convert, Contextual, Deferred, Piped)
  • SJIT (SurroundBy, JoinBy, InfixPattern, TriePattern)

More runnable REPL snippets

Note that for frequently-used pattern combinations, we have org.parserkt.pat.ext:

// Using pat.ext and LexicalBasics
import org.parserkt.pat.ext.*
val digitsInt = Repeat(asInt(), LexicalBasics.digitFor('0'..'9'))
digitsInt.read("180") //180
// Implementing complex pattern
import org.parserkt.pat.complex.*
// Patterns with constant values can still be rebuilt, even though they are ignored in the parse result
val letter = elementIn('A'..'Z', 'a'..'z', '0'..'9') or item('_')
val sep = elementIn(',',':').toConstant(',') named "sep"
val xsv = JoinBy(sep, Repeat(asString(), letter)) // try replacing letter with !sep
xsv.read("monkey,banana,melon") //Tuple2(first=[monkey, banana, melon], second=[,, ,])

xsv.concatCharJoin().read("猴:雀:瓜") //Tuple2(first=[猴, 雀, 瓜], second=::)
val goodXsv = xsv.mergeConstantJoin()
goodXsv.read("she,her,herself") //[she, her, herself]
goodXsv.rebuild("she,her,herself") //she,her,herself

ParserKt provides mergeFirst/Second, discardFirst/Second, flatten for converting Tuple2 patterns 😘

import org.parserkt.*
import org.parserkt.util.*
import org.parserkt.pat.*
import org.parserkt.pat.complex.*

val dict = TriePattern<Char, String>().apply {
  mergeStrings("hello" to "こんにちは")
  mergeStrings("world" to "世界")
}
val noun = Repeat(asList(), dict)
noun.read("helloworld") //[こんにちは, 世界]

val phrase = JoinBy(Decide(elementIn('0'..'9'), elementIn(' ', '\t', '\n', '\r')), dict)
phrase.read("hello6world hello") //Tuple2(first=[こんにちは, 世界, こんにちは], second=[Tuple2(first=0, second=6), Tuple2(first=1, second= )])

It's OK (and also easy) to read recursive structures: just define a lateinit var and use Deferred{name} to reference it:

// Back-conversions (the third argument to Convert) are optional
sealed class Sexp { data class Term(val name: String): Sexp(); data class Nest(val list: List<Sexp>): Sexp() }
lateinit var sexp: Pattern<Char, Sexp>
val str = Repeat(asString(), !elementIn(' ', '(', ')'))
val atom = Convert(str, { Sexp.Term(it) }, { it.name })

val nestItems = SurroundBy(parens.toCharPat(), JoinBy(item(' '), Deferred{sexp}).mergeConstantJoin())
val nest = Convert(nestItems, { Sexp.Nest(it) }, { it.list })
sexp = Decide(nest, atom).mergeFirst { if (it is Sexp.Nest) 0 else 1 }

sexp.read("((monkey banana) (deep parser))") //Nest(list=[Nest(list=[Term(name=monkey), Term(name=banana)]), Nest(list=[Term(name=deep), Term(name=parser)])])
sexp.rebuild("((monkey banana) (deep parser))")  //((monkey banana) (deep parser))

NOTE: when using the Until pattern, first consider whether it can be expressed as Repeat(..., !SatisfyPattern), as in the sketch below
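
For example, a small sketch using only combinators shown above:

// Instead of an Until pattern, negate the stop item and Repeat it.
val untilSpace = Repeat(asString(), !item(' '))
untilSpace.read("hello world") //"hello"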


Versions

2.6