The longest keyword available in TypeScript is constructor, at 11 characters

This is a quick look into something that caught my interest while reading code during implementation work.

As the title says, among the keywords currently available in TypeScript, the longest appears to be constructor, at 11 characters.
This may well be trivia.

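One caveat: in ECMAScript, constructor is not a reserved word and is treated as an ordinary identifier; the TypeScript compiler registers it as a keyword token on its own.
In other words, the scope of this post is TypeScript, not ECMAScript.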

Lexical grammar - JavaScript | MDN


What prompted this

I was reading the oxc-parser code because I was working on the OXC implementation.

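It started with the early return written at crates/oxc_parser/src/lexer/kind.rs#L447-L449.
When the condition matches, the function returns Ident, and the lexer produces a token for a user-defined identifier rather than a keyword.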

sorafujitani/oxc/crates/oxc_parser/src/lexer/kind.rs, lines 447 to 449 in 4a180d4

if len <= 1 || len >= 12 || !unsafe { s.as_bytes().get_unchecked(0) }.is_ascii_lowercase() {
    return Ident;
}

Condition	Meaning
`len <= 1`	1 character or fewer
`len >= 12`	12 characters or more
`!is_ascii_lowercase()`	The first byte is not a lowercase ASCII letter
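
The three conditions in the table can be sketched as a standalone predicate. This is a minimal sketch, not oxc's code: the function name `may_be_keyword` is hypothetical, and it uses the safe `first()` accessor instead of `get_unchecked(0)`, so an empty string is simply rejected.

```rust
/// Returns true when `s` could still be a keyword, i.e. when none of the
/// three early-return conditions applies.
/// NOTE: `may_be_keyword` is a hypothetical name, not part of oxc.
fn may_be_keyword(s: &str) -> bool {
    let len = s.len();
    // Safe stand-in for the unchecked first-byte access in the original:
    // an empty string has no first byte, so it counts as "not lowercase".
    let starts_lowercase = s
        .as_bytes()
        .first()
        .map_or(false, |b| b.is_ascii_lowercase());
    // Negation of: len <= 1 || len >= 12 || !starts_lowercase
    !(len <= 1 || len >= 12 || !starts_lowercase)
}
```

Under this sketch, an 11-character lowercase word such as constructor passes the filter and goes on to the real keyword lookup, while a 12-character word or a name starting with an uppercase letter is rejected up front.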

Why 12 characters or more?

The longest keyword registered in oxc's match_keyword_impl is constructor, at 11 characters.
A string of 12 or more characters can therefore never be a keyword, so len >= 12 returns Ident immediately.

sorafujitani/oxc/crates/oxc_parser/src/lexer/kind.rs, line 550 in 4a180d4

"constructor" => Constructor,
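
The length bound can be sanity-checked by taking the maximum length over a keyword list. The sketch below uses a hand-picked subset of JavaScript/TypeScript keywords as an assumption; it is not oxc's actual table, and `SAMPLE_KEYWORDS` and `longest_keyword_len` are hypothetical names.

```rust
/// A hand-picked subset of keywords, for illustration only.
/// This is NOT the full table from oxc's match_keyword_impl.
const SAMPLE_KEYWORDS: &[&str] = &[
    "await",
    "constructor",
    "implements",
    "instanceof",
    "interface",
    "protected",
    "satisfies",
    "undefined",
];

/// Length in bytes of the longest entry, or 0 for an empty list.
/// Keywords are ASCII, so byte length equals character count.
fn longest_keyword_len(keywords: &[&str]) -> usize {
    keywords.iter().map(|k| k.len()).max().unwrap_or(0)
}
```

Running the same function over the real keyword table would be the way to confirm that 12 is a safe cutoff after any keyword is added.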

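MDN's list of current reserved words does not appear to contain anything longer either; its longest entries, such as instanceof and implements, are 10 characters.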

Lexical grammar - JavaScript | MDN (Japanese edition)


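A keyword named synchronized (12 characters) apparently existed as a reserved word in ECMAScript 1 through 3, but it is not used in current implementations, and oxc-parser does not appear to treat it as a keyword either.
In ESM, const synchronized = "hello"; is valid as well.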

https://developer.mozilla.org/ja/docs/Web/JavaScript/Reference/Lexical_grammar
https://www.tutorialrepublic.com/javascript-reference/javascript-reserved-keywords.php

Why this condition exists

From here on this is a side note: my own reading of the intent behind the implementation.
The short answer is performance optimization.

How much the lexer has to process

For every identifier in the source code, the lexer has to decide whether it is a keyword.

An Express example:

const express = require("express");
const app = express();

app.get("/users/:id", authenticateMiddleware, async (req, res) => {
    const userRepository = new UserRepository(databaseConnection);
    const validationResult = validateRequestParams(req.params);
    res.json(await userRepository.findById(req.params.id));
});

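Many of the identifiers are long variable and function names such as authenticateMiddleware, userRepository, databaseConnection, and validateRequestParams. Because each of them is 12 characters or longer, the lexer can decide from the length alone that they are not keywords.

With the early return, long identifiers and class names that start with an uppercase letter are skipped immediately, using nothing more than an integer comparison and a byte comparison.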
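
To make this concrete, the sketch below applies the same three conditions to the identifiers that appear in the Express example and collects the ones that never reach the keyword lookup. The function and constant names are hypothetical, and the identifier list was transcribed by hand from the snippet above.

```rust
/// True when the early return fires, i.e. the name is classified as an
/// identifier without any keyword lookup. Hypothetical helper, not oxc's API.
fn skips_keyword_lookup(s: &str) -> bool {
    let starts_lowercase = s
        .as_bytes()
        .first()
        .map_or(false, |b| b.is_ascii_lowercase());
    s.len() <= 1 || s.len() >= 12 || !starts_lowercase
}

/// Unique identifier-like names from the Express snippet (hand-transcribed).
const EXPRESS_IDENTIFIERS: &[&str] = &[
    "express",
    "require",
    "app",
    "get",
    "authenticateMiddleware",
    "req",
    "res",
    "userRepository",
    "UserRepository",
    "databaseConnection",
    "validationResult",
    "validateRequestParams",
    "params",
    "json",
    "findById",
    "id",
];

/// Names that take the fast path, in their original order.
fn fast_path_hits<'a>(names: &[&'a str]) -> Vec<&'a str> {
    names
        .iter()
        .copied()
        .filter(|n| skips_keyword_lookup(n))
        .collect()
}
```

Short lowercase names such as app, req, and findById still go through the keyword lookup; the fast path only helps for the long or capitalized names.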

The #[cold] annotation

#[cold]
pub fn match_keyword(s: &str) -> Self {

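#[cold] is a hint that the function is unlikely to be called.
The compiler can place such a function away from the hot path, which improves instruction-cache efficiency at the call site.
Identifiers are far more common than keywords, so keyword matching itself is the cold path.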
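
A minimal sketch of how a cheap filter and a `#[cold]` slow path can fit together is shown below. The `Kind` enum, the two-entry keyword table, and the split between `classify` and `match_keyword` are illustrative assumptions, not oxc's actual layout.

```rust
/// Tiny stand-in for a token kind; the real enum in oxc is much larger.
#[derive(Debug, PartialEq)]
enum Kind {
    Ident,
    Const,
    Constructor,
}

/// Slow path: a full string match against the keyword table.
/// `#[cold]` hints that calls to this function are unlikely.
#[cold]
fn match_keyword(s: &str) -> Kind {
    match s {
        "const" => Kind::Const,
        "constructor" => Kind::Constructor,
        _ => Kind::Ident,
    }
}

/// Hot path: reject obvious non-keywords with a length check and a
/// first-byte check, and only then fall back to the cold lookup.
fn classify(s: &str) -> Kind {
    let starts_lowercase = s
        .as_bytes()
        .first()
        .map_or(false, |b| b.is_ascii_lowercase());
    if s.len() <= 1 || s.len() >= 12 || !starts_lowercase {
        return Kind::Ident;
    }
    match_keyword(s)
}
```

`#[cold]` is only a hint, so the compiler is free to ignore it; the sketch shows the intent rather than a guaranteed code layout.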

Code generation - The Rust Reference

doc.rust-lang.org