js-tokens
The tiny, regex powered, lenient, almost spec-compliant JavaScript tokenizer that never fails.
const jsTokens = require("js-tokens");
const jsString = 'JSON.stringify({k:3.14**2}, null /*replacer*/, "\\t")';
Array.from(jsTokens(jsString), (token) => token.value).join("|");
// JSON|.|stringify|(|{|k|:|3.14|**|2|}|,| |null| |/*replacer*/|,| |"\t"|)
Installation
npm install js-tokens
import jsTokens from "js-tokens";
// or:
const jsTokens = require("js-tokens");
Usage
jsTokens(string, options?)
Option | Type | Default | Description |
---|---|---|---|
jsx | boolean | false | Enable JSX support. |
This package exports a generator function, jsTokens, that turns a string of JavaScript code into token objects.
For the empty string, the function yields nothing (which can be turned into an empty list). For any other input, the function always yields something, even for invalid JavaScript, and never throws. Concatenating the token values reproduces the input.
The package is very close to being fully spec compliant (it passes all but 3 of test262-parser-tests), but has taken a couple of shortcuts. See the following sections for limitations of some tokens.
// Loop over tokens:
for (const token of jsTokens("hello, !world")) {
console.log(token);
}
// Get all tokens as an array:
const tokens = Array.from(jsTokens("hello, !world"));
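The round-trip guarantee described above (concatenating the token values reproduces the input) can be checked in a few lines. The following is a minimal sketch, not part of the package: joinValues and significantTokens are hypothetical helper names, and the token array is hand-written in the shape js-tokens yields for "hello, !world", so that the snippet runs without the package installed. With the package available, pass jsTokens(input) instead.

```javascript
// Hypothetical helpers for illustration; they are not part of js-tokens.

// Token types that carry no syntactic meaning of their own.
const TRIVIA_TYPES = new Set([
  "WhiteSpace",
  "LineTerminatorSequence",
  "MultiLineComment",
  "SingleLineComment",
]);

// Concatenate all token values. For js-tokens output this equals the input.
function joinValues(tokens) {
  let result = "";
  for (const token of tokens) {
    result += token.value;
  }
  return result;
}

// Keep only tokens that are not whitespace, line terminators or comments.
function significantTokens(tokens) {
  return Array.from(tokens).filter((token) => !TRIVIA_TYPES.has(token.type));
}

// Hand-written stand-in for Array.from(jsTokens("hello, !world")).
const helloTokens = [
  { type: "IdentifierName", value: "hello" },
  { type: "Punctuator", value: "," },
  { type: "WhiteSpace", value: " " },
  { type: "Punctuator", value: "!" },
  { type: "IdentifierName", value: "world" },
];

console.log(joinValues(helloTokens) === "hello, !world"); // true
console.log(significantTokens(helloTokens).length); // 4
```

Both helpers accept any iterable of tokens, so they work on the generator directly as well as on an array.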
Tokens
Spec: ECMAScript Language: Lexical Grammar + Additional Syntax
export default function jsTokens(input: string): Iterable<Token>;
type Token =
| { type: "StringLiteral"; value: string; closed: boolean }
| { type: "NoSubstitutionTemplate"; value: string; closed: boolean }
| { type: "TemplateHead"; value: string }
| { type: "TemplateMiddle"; value: string }
| { type: "TemplateTail"; value: string; closed: boolean }
| { type: "RegularExpressionLiteral"; value: string; closed: boolean }
| { type: "MultiLineComment"; value: string; closed: boolean }
| { type: "SingleLineComment"; value: string }
| { type: "IdentifierName"; value: string }
| { type: "NumericLiteral"; value: string }
| { type: "Punctuator"; value: string }
| { type: "WhiteSpace"; value: string }
| { type: "LineTerminatorSequence"; value: string }
| { type: "Invalid"; value: string };
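Since the tokenizer never throws, the closed property is the main signal that something in the input was left unterminated. As a rough sketch (findUnclosed is a hypothetical helper, and the sample tokens are hand-written to match what the type above describes for the input x = "abc), unclosed tokens can be reported together with their offset in the input:

```javascript
// Hypothetical helper, not part of js-tokens.
// Returns the type and start offset (in UTF-16 code units) of every token
// whose `closed` property is exactly false.
function findUnclosed(tokens) {
  const unclosed = [];
  let offset = 0;
  for (const token of tokens) {
    if (token.closed === false) {
      unclosed.push({ type: token.type, offset });
    }
    // Token values concatenate to the input, so lengths add up to offsets.
    offset += token.value.length;
  }
  return unclosed;
}

// Hand-written stand-in for Array.from(jsTokens('x = "abc')).
const unclosedSample = [
  { type: "IdentifierName", value: "x" },
  { type: "WhiteSpace", value: " " },
  { type: "Punctuator", value: "=" },
  { type: "WhiteSpace", value: " " },
  { type: "StringLiteral", value: '"abc', closed: false },
];

console.log(findUnclosed(unclosedSample));
// [ { type: 'StringLiteral', offset: 4 } ]
```

Token types without a closed property (such as TemplateHead) are never reported, because the check compares against false rather than testing for a falsy value.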
StringLiteral
Spec: StringLiteral
If the ending " or ' is missing, the token has closed: false. JavaScript strings cannot contain (unescaped) newlines, so unclosed strings simply end at the end of the line.
Escape sequences are supported, but may be invalid. For example, "\u" is matched as a StringLiteral even though it contains an invalid escape.
Examples:
"string"
'string'
""
''
"\""
'\''
"valid: \u00a0, invalid: \u"
'valid: \u00a0, invalid: \u'
"multi-\
line"
'multi-\
line'
" unclosed
' unclosed
NoSubstitutionTemplate / TemplateHead / TemplateMiddle / TemplateTail
Spec: NoSubstitutionTemplate / TemplateHead / TemplateMiddle / TemplateTail
A template without interpolations is matched as is. For example:
- `abc`: NoSubstitutionTemplate
- `abc: NoSubstitutionTemplate with closed: false
A template with interpolations is matched as many tokens. For example, `head${1}middle${2}tail` is matched as follows (apart from the two NumericLiterals):
- `head${: TemplateHead
- }middle${: TemplateMiddle
- }tail`: TemplateTail
TemplateMiddle is optional, and TemplateTail can be unclosed. For example, `head${1}tail (note the missing ending `):
- `head${: TemplateHead
- }tail: TemplateTail with closed: false
Templates can contain unescaped newlines, so unclosed templates go on to the end of input.
Just like for StringLiteral, templates can also contain invalid escapes. `\u` is matched as a NoSubstitutionTemplate even though it contains an invalid escape. Also note that in tagged templates, invalid escapes are not syntax errors: x`\u` is syntactically valid JavaScript.
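Because the template token values include their delimiters (the backticks, ${ and }), a consumer that wants only the raw text between them has to strip those itself. A minimal sketch, assuming the token shapes listed above; templateText is a hypothetical helper name and the sample tokens are hand-written from the head/middle/tail example:

```javascript
// Hypothetical helper, not part of js-tokens.
// Returns the raw text of a template token without its delimiters,
// or null for any other token type.
function templateText(token) {
  const { type, value } = token;
  switch (type) {
    case "NoSubstitutionTemplate":
    case "TemplateTail":
      // One leading delimiter character (` or }), and a trailing `
      // only when the token is closed.
      return token.closed ? value.slice(1, -1) : value.slice(1);
    case "TemplateHead":
    case "TemplateMiddle":
      // One leading delimiter character (` or }) and a trailing ${.
      return value.slice(1, -2);
    default:
      return null;
  }
}

// Hand-written stand-in for the tokens of `head${1}middle${2}tail`.
const templateSample = [
  { type: "TemplateHead", value: "`head${" },
  { type: "NumericLiteral", value: "1" },
  { type: "TemplateMiddle", value: "}middle${" },
  { type: "NumericLiteral", value: "2" },
  { type: "TemplateTail", value: "}tail`", closed: true },
];

console.log(templateSample.map(templateText));
// [ 'head', null, 'middle', null, 'tail' ]
```

The returned text is raw: escape sequences are left as written, which matters because templates may contain invalid escapes.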
RegularExpressionLiteral
Spec: RegularExpressionLiteral
Regex literals may contain invalid regex syntax. They are still matched as regex literals.
If the ending / is missing, the token has closed: false. JavaScript regex literals cannot contain newlines (not even escaped ones), so unclosed regex literals simply end at the end of the line.
According to the specification, the flags of regular expressions are IdentifierParts (unknown and repeated regex flags become errors at a later stage).
Differentiating between regex and division in JavaScript is really tricky. js-tokens looks at the previous token to tell them apart. As long as the previous tokens are valid, it should do the right thing. For invalid code, js-tokens might be confused and start matching division as regex or vice versa.
Examples:
/a/
/a/gimsuy
/a/Inva1id
/+/
/[/]\//
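A RegularExpressionLiteral token carries the pattern and the flags in a single value. Since flags consist of IdentifierParts and therefore cannot contain a /, the last / of a closed token separates the two. The sketch below relies on that observation; splitRegexLiteral is a hypothetical helper, not something the package exports:

```javascript
// Hypothetical helper, not part of js-tokens.
// Splits the value of a RegularExpressionLiteral token into pattern and flags.
function splitRegexLiteral(token) {
  if (token.type !== "RegularExpressionLiteral") {
    throw new TypeError("Expected a RegularExpressionLiteral token");
  }
  if (!token.closed) {
    // No ending "/": everything after the opening "/" is pattern text.
    return { pattern: token.value.slice(1), flags: "" };
  }
  // Flags cannot contain "/", so the last "/" is the closing delimiter.
  const end = token.value.lastIndexOf("/");
  return {
    pattern: token.value.slice(1, end),
    flags: token.value.slice(end + 1),
  };
}

console.log(
  splitRegexLiteral({ type: "RegularExpressionLiteral", value: "/a/gimsuy", closed: true })
);
// { pattern: 'a', flags: 'gimsuy' }

console.log(
  splitRegexLiteral({ type: "RegularExpressionLiteral", value: "/[/]\\//", closed: true })
);
// { pattern: '[/]\\/', flags: '' }
```

Neither part is validated here. As noted above, the pattern may be invalid regex syntax and the flags may be unknown or repeated, so passing the result to new RegExp can still throw.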
MultiLineComment
Spec: MultiLineComment
If the ending */ is missing, the token has closed: false. Unclosed multi-line comments go on to the end of the input.
Examples:
/* comment */
/* console.log(
"commented", out + code);
*/
/**/
/* unclosed
SingleLineComment
Spec: SingleLineComment
Examples:
// comment
// console.log("commented", out + code);
//
IdentifierName
Spec: IdentifierName
Keywords, reserved words, null, true, false, variable names and property names.
Examples:
if
for
var
instanceof
package
null
true
false
Infinity
undefined
NaN
$variab1e_name
π
℮
ಠ_ಠ
\u006C\u006F\u006C\u0077\u0061\u0074
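Since keywords, literals such as null, and ordinary names all share the IdentifierName type, a consumer such as a syntax highlighter has to tell them apart itself. The following is a rough sketch under stated assumptions: classifyIdentifierName is a hypothetical helper, and the word list covers the always-reserved words of ECMAScript but leaves out contextual and strict-mode-only ones such as await, yield, let, static and package.

```javascript
// Hypothetical helper, not part of js-tokens.
// Always-reserved words only; contextual keywords are deliberately omitted.
const RESERVED_WORDS = new Set([
  "break", "case", "catch", "class", "const", "continue", "debugger",
  "default", "delete", "do", "else", "enum", "export", "extends",
  "finally", "for", "function", "if", "import", "in", "instanceof",
  "new", "return", "super", "switch", "this", "throw", "try", "typeof",
  "var", "void", "while", "with",
]);

const LITERAL_WORDS = new Set(["null", "true", "false"]);

// Classify the value of an IdentifierName token.
function classifyIdentifierName(value) {
  if (RESERVED_WORDS.has(value)) return "keyword";
  if (LITERAL_WORDS.has(value)) return "literal";
  return "identifier";
}

console.log(classifyIdentifierName("instanceof")); // keyword
console.log(classifyIdentifierName("null")); // literal
console.log(classifyIdentifierName("undefined")); // identifier
```

This looks at the value alone, so a property name such as the if in obj.if is still reported as a keyword. Handling that case requires looking at the preceding token, and names written with Unicode escapes are not normalized either.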
NumericLiteral
Spec: NumericLiteral
Examples:
0
1.5
1
12e9
0.123e-32
0xDeadbeef
0b110
12n
07
09.5
Punctuator
Spec: Punctuator + DivPunctuator + RightBracePunctuator
All possible values:
&& || ??
-- ++
. ?.
< <= > >=
!= !== == ===
+ - % & | ^ / * ** << >> >>>
= += -= %= &= |= ^= /= *= **= <<= >>= >>>=
( ) [ ] { }
! ? : ; , ~ ... =>
WhiteSpace
Spec: WhiteSpace
Unlike the specification, multiple whitespace characters in a row are matched as one token, not one token per character.
LineTerminatorSequence
Spec: LineTerminatorSequence
CR, LF and CRLF, plus \u2028 and \u2029.
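Tokens carry no position information, but line numbers can be derived from the values. Counting LineTerminatorSequence tokens alone is not enough, because multi-line comments, templates and strings with line continuations contain line terminators inside their values. A minimal sketch that counts terminators in every value (withLineNumbers is a hypothetical helper, and the sample tokens are hand-written):

```javascript
// Hypothetical helper, not part of js-tokens.
// CRLF is listed first so that it counts as a single line terminator.
const LINE_TERMINATOR = /\r\n|[\r\n\u2028\u2029]/g;

// Returns copies of the tokens with a 1-based `line` property that gives
// the line on which each token starts.
function withLineNumbers(tokens) {
  const result = [];
  let line = 1;
  for (const token of tokens) {
    result.push({ ...token, line });
    const matches = token.value.match(LINE_TERMINATOR);
    if (matches !== null) {
      line += matches.length;
    }
  }
  return result;
}

// Hand-written stand-in for the tokens of "a\r\n/* x\ny */ b".
const lineSample = [
  { type: "IdentifierName", value: "a" },
  { type: "LineTerminatorSequence", value: "\r\n" },
  { type: "MultiLineComment", value: "/* x\ny */", closed: true },
  { type: "WhiteSpace", value: " " },
  { type: "IdentifierName", value: "b" },
];

console.log(withLineNumbers(lineSample).map((token) => token.line));
// [ 1, 1, 2, 3, 3 ]
```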
Invalid
Spec: n/a
Single code points not matched in another token.
Examples:
#
@
💩
JSX Tokens
Spec: JSX Specification
export default function jsTokens(
input: string,
options: { jsx: true }
): Iterable<Token | JSXToken>;
export declare type JSXToken =
| { type: "JSXString"; value: string; closed: boolean }
| { type: "JSXText"; value: string }
| { type: "JSXIdentifier"; value: string }
| { type: "JSXPunctuator"; value: string }
| { type: "JSXInvalid"; value: string };
- The tokenizer switches between outputting runs of Token and runs of JSXToken.
- Runs of JSXToken can also contain WhiteSpace, LineTerminatorSequence, MultiLineComment and SingleLineComment.
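The switching between runs can be made explicit by grouping the output. The sketch below is an illustration only: groupRuns is a hypothetical helper, it assumes that a type name starting with JSX identifies a JSXToken, and it keeps the four shared token types in whichever run they appear. The sample tokens are hand-written to show the shape expected for x = <p>Hi</p> with { jsx: true }.

```javascript
// Hypothetical helper, not part of js-tokens.
// Token types that may appear inside both kinds of run.
const SHARED_TYPES = new Set([
  "WhiteSpace",
  "LineTerminatorSequence",
  "MultiLineComment",
  "SingleLineComment",
]);

// Groups tokens into consecutive runs of the form { jsx: boolean, tokens: [...] }.
function groupRuns(tokens) {
  const runs = [];
  let current = null;
  for (const token of tokens) {
    const isShared = SHARED_TYPES.has(token.type);
    const isJsx = token.type.startsWith("JSX");
    // Shared tokens never start a new run, except at the very beginning.
    if (current === null || (!isShared && current.jsx !== isJsx)) {
      current = { jsx: isJsx, tokens: [] };
      runs.push(current);
    }
    current.tokens.push(token);
  }
  return runs;
}

// Hand-written sample in the documented token shapes.
const jsxSample = [
  { type: "IdentifierName", value: "x" },
  { type: "WhiteSpace", value: " " },
  { type: "Punctuator", value: "=" },
  { type: "WhiteSpace", value: " " },
  { type: "JSXPunctuator", value: "<" },
  { type: "JSXIdentifier", value: "p" },
  { type: "JSXPunctuator", value: ">" },
  { type: "JSXText", value: "Hi" },
  { type: "JSXPunctuator", value: "<" },
  { type: "JSXPunctuator", value: "/" },
  { type: "JSXIdentifier", value: "p" },
  { type: "JSXPunctuator", value: ">" },
];

console.log(groupRuns(jsxSample).map((run) => [run.jsx, run.tokens.length]));
// [ [ false, 4 ], [ true, 8 ] ]
```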
JSXString
Spec: " JSXDoubleStringCharacters " + ' JSXSingleStringCharacters '
If the ending " or ' is missing, the token has closed: false. JSX strings can contain unescaped newlines, so unclosed JSX strings go on to the end of input.
Note that JSX doesn’t support escape sequences as part of the token grammar. A " or ' always closes the string, even with a backslash before.
Examples:
"string"
'string'
""
''
"\"
'\'
"multi-
line"
'multi-
line'
" unclosed
' unclosed
JSXText
Spec: JSXText
Anything but <, >, { and }.
JSXIdentifier
Spec: JSXIdentifier
Examples:
div
class
xml
x-element
x------
$htm1_element
ಠ_ಠ
JSXPunctuator
Spec: n/a
All possible values:
<
>
/
.
:
=
{
}
JSXInvalid
Spec: n/a
Single code points not matched in another token.
Examples in JSX tags:
1
`
+
,
#
@
💩
All possible values in JSX children:
>
}
Compatibility
ECMAScript
The intention is to always support the latest ECMAScript version whose feature set has been finalized.
Currently, ECMAScript 2020 is supported.
Annex B
Annex B: Additional ECMAScript Features for Web Browsers of the spec is optional if the ECMAScript host is not a web browser, and specifies some additional syntax.
- Numeric literals: js-tokens supports legacy octal and octal like numeric literals. It was easy enough, so why not.
- String literals: js-tokens supports legacy octal escapes, since it allows any invalid escapes.
- HTML-like comments: Not supported. js-tokens prefers treating 5<!--x as 5 < !(--x) rather than as 5 //x.
- Regular expression patterns: js-tokens doesn’t care what’s between the starting / and ending /, so this is supported.
TypeScript
Supporting TypeScript is not an explicit goal, but js-tokens and Babel both tokenize this TypeScript fixture and this TSX fixture the same way, with one edge case:
type A = Array<Array<string>>
type B = Array<Array<Array<string>>>
Both lines above should end with a couple of > tokens, but js-tokens instead matches the >> and >>> operators.
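A consumer that already knows it is inside a TypeScript type can work around this edge case by splitting the operator tokens after the fact. This is only a sketch: splitClosingAngles is a hypothetical helper, js-tokens gives no indication of whether a token is in a type position, and other operators such as >= and >>= are left untouched.

```javascript
// Hypothetical helper, not part of js-tokens.
// Splits a ">>" or ">>>" Punctuator into single ">" Punctuators.
// Only call this where the surrounding code is known to be a type.
function splitClosingAngles(token) {
  if (token.type === "Punctuator" && /^>{2,3}$/.test(token.value)) {
    return Array.from(token.value, () => ({ type: "Punctuator", value: ">" }));
  }
  return [token];
}

console.log(splitClosingAngles({ type: "Punctuator", value: ">>>" }).length); // 3
console.log(splitClosingAngles({ type: "Punctuator", value: ">>=" }).length); // 1
```

Concatenating the values of the returned tokens still yields the original value, so the round-trip property is preserved.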
JSX
JSX is supported: jsTokens("<p>Hello, world!</p>", { jsx: true }).
JavaScript runtimes
js-tokens should work in any JavaScript runtime that supports Unicode property escapes. For Node.js, this means Node.js 10 or later.
Known errors
Here are a couple of tricky cases:
// Case 1:
switch (x) {
case x: {}/a/g;
case x: {}<div>x</div>/g;
}
// Case 2:
label: {}/a/g;
label: {}<div>x</div>/g;
// Case 3:
(function f() {}/a/g);
(function f() {}<div>x</div>/g);
This is what they mean:
// Case 1:
switch (x) {
case x:
{
}
/a/g;
case x:
{
}
<div>x</div> / g;
}
// Case 2:
label: {
}
/a/g;
label: {
}
<div>x</div> / g;
// Case 3:
(function f() {} / a / g);
(function f() {} < div > x < /div>/g);
But js-tokens thinks they mean:
// Case 1:
switch (x) {
case x:
({} / a / g);
case x:
({} < div > x < /div>/g);
}
// Case 2:
label: ({} / a / g);
label: ({} < div > x < /div>/g);
// Case 3:
function f() {}
/a/g;
function f() {}
<div>x</div> / g;
In other words, js-tokens:
- Mis-identifies regex as division and JSX as comparison in cases 1 and 2.
- Mis-identifies division as regex and comparison as JSX in case 3.
This happens because js-tokens looks at the previous token when deciding between regex and division or JSX and comparison. In these cases, the previous token is }, which either means “end of block” (→ regex/JSX) or “end of object literal” (→ division/comparison). How does js-tokens determine if the } belongs to a block or an object literal? By looking at the token before the matching {.
In cases 1 and 2, that’s a :. A : usually means that we have an object literal or ternary:
let some = weird ? { value: {}/a/g } : {}/a/g;
But : is also used for case and labeled statements.
One idea is to look for case before the : as an exception to the rule, but it’s not so easy:
switch (x) {
case weird ? true : {}/a/g: {}/a/g
}
The first {}/a/g is a division, while the second {}/a/g is an empty block followed by a regex. Both are preceded by a colon with a case on the same line, and it does not seem like you can distinguish between the two without implementing a parser.
Labeled statements are similarly difficult, since they are so similar to object literals:
{
label: {}/a/g
}
({
key: {}/a/g
})
Finally, case 3 ((function f() {}/a/g);) is also difficult, because a ) before a { means that the { is part of a block, and blocks are usually statements:
if (x) {
}
/a/g;
function f() {}
/a/g;
But function expressions are of course not statements. It’s difficult to tell a function expression from a function statement without parsing.
Luckily, none of these edge cases are likely to occur in real code.
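To make the previous-token idea from this section concrete, here is a deliberately simplified illustration. It is not js-tokens’ actual implementation: slashStartsRegex, its braceIsBlock parameter and the keyword list are all assumptions made for this sketch, and it gets some valid code wrong (for example, it treats a / after ) as division, which is wrong for if (x) /a/.test(y)).

```javascript
// Simplified illustration only; not how js-tokens is implemented.
// Keywords after which an expression (and thus a regex) can start.
const KEYWORDS_BEFORE_EXPRESSION = new Set([
  "return", "typeof", "case", "do", "else", "in", "instanceof",
  "new", "throw", "void", "delete",
]);

// prev: the previous token that is not whitespace, a line terminator or
//       a comment, or undefined at the start of the input.
// braceIsBlock: for "}", whether the caller decided that the matching "{"
//       opened a block (true) or an object literal (false).
function slashStartsRegex(prev, braceIsBlock = false) {
  if (prev === undefined) return true;
  switch (prev.type) {
    case "IdentifierName":
      // After a plain name, "/" divides; after some keywords it cannot.
      return KEYWORDS_BEFORE_EXPRESSION.has(prev.value);
    case "Punctuator":
      if (prev.value === ")" || prev.value === "]") return false;
      if (prev.value === "++" || prev.value === "--") return false;
      if (prev.value === "}") return braceIsBlock;
      // Operators, "(", "[", "{", "," and ";" are followed by an operand.
      return true;
    default:
      // Literals and template ends are values, so "/" divides.
      return false;
  }
}

console.log(slashStartsRegex({ type: "Punctuator", value: "=" })); // true
console.log(slashStartsRegex({ type: "IdentifierName", value: "x" })); // false
console.log(slashStartsRegex({ type: "Punctuator", value: "}" }, true)); // true
console.log(slashStartsRegex({ type: "Punctuator", value: "}" }, false)); // false
```

The last two calls show where the cases above go wrong: everything hinges on the block versus object literal decision for }, and that decision is exactly what cannot be made reliably from neighboring tokens alone.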
Performance
With @babel/parser for comparison.
Lines of code | Size | js-tokens | @babel/parser |
---|---|---|---|
~100 | ~4.8 KB | ~2 ms | ~17 ms |
~1 000 | ~46 KB | ~11 ms | ~84 ms |
~10 000 | ~409 KB | ~80 ms | ~550 ms |
~100 000 | ~3.3 MB | ~430 ms | ~7.45 s |
~1 500 000 | ~77 MB | ~7 s | ~4 minutes (*) |
(*) Required increasing Node.js’ memory limit.
See benchmark.js if you want to run benchmarks yourself.
License
MIT.