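What kinds of orthographic variation does the normalize endpoint handle?

Phase 1+2 (pure string processing): ASCII full-width/half-width unification (width), hiragana/katakana conversion (kana), whitespace normalization (spaces), and half-width katakana expansion (halfwidth_kana, including composition of dakuten and handakuten). Phase 3 (sudachi="apply"): absorbs SudachiDict-derived okurigana variants (行なう→行う), variant characters (卓れる→優れる), and katakana spelling variants (コンピュータ→コンピューター, デヴィルズ→デビルズ).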

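What happens if no options are specified?

Every option defaults to preserve (no conversion). The input text is returned unchanged and changes is an empty array. This is by design for backward compatibility, so existing callers are not broken.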

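What is the latency of Phase 3 (sudachi="apply")?

About +50-200 ms on a cold start (Lindera tokenizer startup plus parsing the normalization map, about 1.13 MB of JSON) and +1-5 ms when warm. With Phase 1+2 only, Lindera is not started and the request completes within 3 ms.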

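What are the source and license of the normalization map?

Phase 3 uses a derived map extracted from the normalized_form column of SudachiDict-small (Apache-2.0, WorksApplications): 88,622 entries, restricted to entries where surface != normalized_form, with ASCII surfaces excluded. It ships together with Lindera (MIT) + IPAdic v3.0.7 (BSD 3-Clause), and attribution is included in the response using the SudachiAttribution schema.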

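normalize: Japanese orthography normalization endpoint

POST /api/v1/text/normalize

Applies any combination of full-width/half-width unification, hiragana/katakana conversion, whitespace normalization, half-width katakana expansion, and Sudachi orthography normalization (okurigana variants / variant characters / katakana spelling variants). Phase 1+2 is pure string processing; Phase 3 is a Lindera + SudachiDict-derived lookup.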

Endpoint

POST https://shirabe.dev/api/v1/text/normalize
X-API-Key: shrb_xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx  (optional; anonymous Free tier of 10,000 requests/month)
Content-Type: application/json

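Options

option	value	default	behavior
`width`	half / full / preserve	preserve	Direction for unifying full-width ASCII (ＡＢＣ) and half-width ASCII (ABC)
`kana`	hiragana / katakana / preserve	preserve	Direction of hiragana ↔ katakana conversion
`spaces`	single / trim / preserve	preserve	Runs of whitespace → a single space (single), plus removal of leading/trailing whitespace (trim)
`halfwidth_kana`	expand / preserve	preserve	Half-width katakana ｱｲｳ → full-width アイウ, composing dakuten/handakuten (ｶﾞ → ガ)
`sudachi`	apply / preserve	preserve	Absorbs okurigana variants (行なう→行う), variant characters (卓れる→優れる), and katakana spelling variants (コンピュータ→コンピューター). Requires Lindera tokenization; a cold start adds +50-200 ms

Options are applied in the order width → kana → spaces → halfwidth_kana → sudachi. The string produced by Phase 1+2 is what gets passed to the tokenizer, a design intended to improve accuracy.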
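To make "pure string processing" concrete, here is a minimal local sketch that approximates three of the Phase 1+2 options with the Python standard library. It is not the service's implementation: the function names are hypothetical, the set of characters treated as whitespace is an assumption, and using Unicode NFKC for half-width katakana is a substituted technique that happens to reproduce the documented ｶﾞ → ガ composition.

```python
import re
import unicodedata

# Full-width ASCII U+FF01..U+FF5E sits at a fixed offset from U+0021..U+007E.
_FULL_TO_HALF = {cp: cp - 0xFEE0 for cp in range(0xFF01, 0xFF5F)}

# Half-width katakana block, including the half-width dakuten/handakuten marks.
_HALFWIDTH_KANA = re.compile(r"[\uFF61-\uFF9F]+")


def width_half(text: str) -> str:
    """Approximate width="half": full-width ASCII -> half-width ASCII."""
    return text.translate(_FULL_TO_HALF)


def spaces_single(text: str) -> str:
    """Approximate spaces="single" (assumption: only ASCII spaces are collapsed)."""
    return re.sub(r" {2,}", " ", text)


def halfwidth_kana_expand(text: str) -> str:
    """Approximate halfwidth_kana="expand" by applying NFKC to half-width kana runs only."""
    return _HALFWIDTH_KANA.sub(
        lambda m: unicodedata.normalize("NFKC", m.group()), text
    )


def normalize_locally(text: str) -> str:
    # Same relative order as the documented pipeline: width, spaces, halfwidth_kana.
    return halfwidth_kana_expand(spaces_single(width_half(text)))


print(normalize_locally("ＡＢＣ ｱｲｳ  ｶﾞ"))
```

Restricting NFKC to the half-width katakana block keeps the sketch from touching other compatibility characters (for example circled digits), which the documented options do not mention.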

Request

{
  "text": "ＡＢＣ１２３ コンピュータと行なう作業",
  "options": {
    "width": "half",
    "sudachi": "apply"
  }
}
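
Because every option accepts only a small fixed set of values, a client can reject typos before sending anything. The following is a sketch of such a check; build_normalize_body and ALLOWED are hypothetical names, and the allowed values are copied from the options table rather than from a published schema.

```python
import json

# Allowed values per option, copied from the options table; "preserve" is the default.
ALLOWED = {
    "width": {"half", "full", "preserve"},
    "kana": {"hiragana", "katakana", "preserve"},
    "spaces": {"single", "trim", "preserve"},
    "halfwidth_kana": {"expand", "preserve"},
    "sudachi": {"apply", "preserve"},
}


def build_normalize_body(text: str, **options: str) -> str:
    """Return the JSON request body, rejecting unknown options or values early."""
    for name, value in options.items():
        if name not in ALLOWED:
            raise ValueError(f"unknown option: {name}")
        if value not in ALLOWED[name]:
            raise ValueError(f"invalid value for {name}: {value}")
    # ensure_ascii=False keeps Japanese text readable in logs; the JSON is valid either way.
    return json.dumps({"text": text, "options": options}, ensure_ascii=False)


body = build_normalize_body(
    "ＡＢＣ１２３ コンピュータと行なう作業", width="half", sudachi="apply"
)
print(body)
```

When posting this string, encode it as UTF-8 and send it with Content-Type: application/json.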

Response

{
  "text": "ＡＢＣ１２３ コンピュータと行なう作業",
  "normalized": "ABC123 コンピューターと行う作業",
  "changes": [
    { "type": "width", "before": "ＡＢＣ１２３ ", "after": "ABC123 " },
    { "type": "sudachi", "before": "コンピュータ", "after": "コンピューター" },
    { "type": "sudachi", "before": "行なう", "after": "行う" }
  ],
  "timing": {
    "setup_ms": 187,
    "cold_start": true,
    "tokenize_ms": 4,
    "sudachi_lookup_ms": 1,
    "map_fetch_ms": 78,
    "map_parse_ms": 35,
    "map_entries": 88622
  },
  "attribution": {
    "service": "shirabe-text-api",
    "url": "https://shirabe.dev",
    "dictionary": "SudachiDict-small",
    "dictionary_license": "Apache-2.0",
    "dictionary_source": "https://github.com/WorksApplications/SudachiDict",
    "tokenizer": "Lindera + IPAdic v3.0.7",
    "tokenizer_license": "MIT (Lindera) / BSD 3-Clause (IPAdic)"
  }
}

The timing field and the Sudachi-related fields of attribution are included only when options.sudachi="apply" is specified (with Phase 1+2 only, the response carries ServiceAttribution alone).
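
Since timing and the Sudachi attribution fields are conditional, client code should read them defensively. Here is a sketch, assuming the response shape shown above; summarize_response is a hypothetical helper and the sample dictionaries are abbreviated copies of the documented example, not live output.

```python
def summarize_response(resp: dict) -> dict:
    """Collect the fields a caller typically logs, tolerating absent optional ones."""
    changes = resp.get("changes", [])
    timing = resp.get("timing")  # absent unless sudachi="apply" was requested
    by_type: dict = {}
    for change in changes:
        by_type[change["type"]] = by_type.get(change["type"], 0) + 1
    return {
        "normalized": resp["normalized"],
        "changed": bool(changes),
        "changes_by_type": by_type,
        "cold_start": timing["cold_start"] if timing else None,
        "dictionary": resp.get("attribution", {}).get("dictionary"),
    }


sample = {
    "text": "ＡＢＣ１２３ コンピュータと行なう作業",
    "normalized": "ABC123 コンピューターと行う作業",
    "changes": [
        {"type": "width", "before": "ＡＢＣ１２３ ", "after": "ABC123 "},
        {"type": "sudachi", "before": "コンピュータ", "after": "コンピューター"},
        {"type": "sudachi", "before": "行なう", "after": "行う"},
    ],
    "timing": {"setup_ms": 187, "cold_start": True},
    "attribution": {"service": "shirabe-text-api", "dictionary": "SudachiDict-small"},
}
summary = summarize_response(sample)
print(summary["changes_by_type"], summary["cold_start"])
```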

Code examples

curl (Phase 1+2 only, lightweight)

curl -X POST https://shirabe.dev/api/v1/text/normalize \
  -H "Content-Type: application/json" \
  -d '{"text": "ＡＢＣ１２３", "options": {"width": "half"}}'
# → {"normalized": "ABC123", ...}

TypeScript (including Phase 3, with Sudachi normalization applied)

const res = await fetch("https://shirabe.dev/api/v1/text/normalize", {
  method: "POST",
  headers: {
    "Content-Type": "application/json",
    "X-API-Key": process.env.SHIRABE_API_KEY!,
  },
  body: JSON.stringify({
    text: "コンピュータで行なう作業",
    options: { sudachi: "apply" },
  }),
});
const { normalized } = await res.json();
console.log(normalized);
// → コンピューターで行う作業

Python (several options applied together)

import os, requests
r = requests.post(
    "https://shirabe.dev/api/v1/text/normalize",
    json={
        "text": "ＡＢＣ ｱｲｳ  ｺﾝﾋﾟｭｰﾀ",
        "options": {
            "width": "half",
            "halfwidth_kana": "expand",
            "spaces": "single",
            "sudachi": "apply",
        },
    },
    headers={"X-API-Key": os.environ["SHIRABE_API_KEY"]},
    timeout=10,
)
print(r.json()["normalized"])
# → ABC アイウ コンピューター

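Phase 3: Sudachi normalization in detail

A derived map extracted from the normalized_form column of SudachiDict-small (Apache-2.0, WorksApplications) is served from R2. The CI build applies strict extraction conditions:

Only entries where surface ≠ normalized_form (entries that actually change)
conjugation_form ∈ {"*", "終止形-一般"} (excludes conjugation-form drift: keeps 行なう→行う while excluding つき→尽きる)
surface contains at least one Japanese character (excludes decorative conversions such as "10"→"⑩")

Result: 88,622 entries / about 1.13 MB of JSON. The map covers only the normalizations an AI agent would expect.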
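
The three extraction conditions can be read as a single predicate over dictionary rows. The sketch below shows that reading; it is not the CI build script. keep_entry is a hypothetical name, the character ranges used for "Japanese character" are an assumption, and the sample rows (including the conjugation form given for つき) are illustrative rather than taken from SudachiDict.

```python
import re

# Hiragana, katakana, and CJK unified ideographs: an assumed definition of
# "Japanese character" for this sketch.
_JAPANESE = re.compile(r"[\u3040-\u30FF\u4E00-\u9FFF]")
_ALLOWED_FORMS = {"*", "終止形-一般"}


def keep_entry(surface: str, normalized_form: str, conjugation_form: str) -> bool:
    """Apply the three documented extraction conditions to one dictionary row."""
    return (
        surface != normalized_form
        and conjugation_form in _ALLOWED_FORMS
        and _JAPANESE.search(surface) is not None
    )


# Illustrative rows: (surface, normalized_form, conjugation_form).
rows = [
    ("行なう", "行う", "終止形-一般"),  # kept
    ("つき", "尽きる", "連用形-一般"),  # dropped: conjugation form not allowed
    ("10", "⑩", "*"),  # dropped: no Japanese character in surface
    ("行う", "行う", "終止形-一般"),  # dropped: surface equals normalized_form
]
lookup = {s: n for s, n, form in rows if keep_entry(s, n, form)}
print(lookup)
```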

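Coverage (representative examples)

Category	Examples
Okurigana unification	行なう → 行う / 取りあげる → 取り上げる / 申しこむ → 申し込む
Katakana spelling variants (long vowel mark)	コンピュータ → コンピューター / サーバ → サーバー / ユーザ → ユーザー
Katakana spelling variants (バ⇄ヴ)	デヴィルズ → デビルズ / クエート → クウェート
Variant character unification	卓れる → 優れる / 飾りけ → 飾り気 / 花やぐ → 華やぐ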

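Limitations

The lookup unit is the IPAdic tokenization boundary. If IPAdic splits text into several tokens where SudachiDict has a single entry, the lookup misses
Spelling variants of proper nouns (personal names / place names) are out of scope (handled separately by name-split / name-reading)
An upgrade to SudachiDict-core / -full and JMnedict integration will be considered during the monorepo migration in 2026-06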
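
The token-boundary limitation is easiest to see with a toy per-token lookup. This sketch assumes each token surface is looked up independently, as the first limitation describes; apply_lookup is a hypothetical name and both token sequences are hand-written, not real IPAdic output.

```python
def apply_lookup(tokens: list, lookup: dict) -> str:
    """Replace each token found in the map; tokens not in the map pass through."""
    return "".join(lookup.get(token, token) for token in tokens)


lookup = {"コンピュータ": "コンピューター", "取りあげる": "取り上げる"}

# The tokenizer emits the same unit as the map key: hit.
hit = apply_lookup(["コンピュータ", "と"], lookup)

# The tokenizer splits the word (hypothetical segmentation): no single token
# matches the map key, so the text passes through unchanged.
miss = apply_lookup(["取り", "あげる"], lookup)

print(hit, miss)
```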

Rate limit

Uniform across all endpoints (see the pricing plans): Free 10,000 requests/month at 1 req/s, Starter 500,000 requests/month at 30 req/s, Pro 5,000,000 requests/month at 100 req/s, Enterprise unlimited at 500 req/s.
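
A client can stay under the per-second limit by spacing its calls. Below is a minimal single-threaded sketch; MinIntervalThrottle is a hypothetical helper, and how the service responds when the limit is exceeded is not described here, so the sketch only paces outgoing requests.

```python
import time
from typing import Callable


class MinIntervalThrottle:
    """Block so that consecutive calls are at least 1/rate seconds apart."""

    def __init__(
        self,
        rate_per_sec: float,
        clock: Callable[[], float] = time.monotonic,
        sleep: Callable[[float], None] = time.sleep,
    ) -> None:
        self._interval = 1.0 / rate_per_sec
        self._clock = clock
        self._sleep = sleep
        self._next_allowed = 0.0

    def wait(self) -> float:
        """Sleep if needed before the next request; return the seconds slept."""
        now = self._clock()
        delay = max(0.0, self._next_allowed - now)
        if delay > 0:
            self._sleep(delay)
        # Schedule the following slot relative to when this call is released.
        self._next_allowed = max(now, self._next_allowed) + self._interval
        return delay


# Free plan: 1 req/s. Call throttle.wait() immediately before each request.
throttle = MinIntervalThrottle(rate_per_sec=1.0)
```

The clock and sleep parameters are injectable so the pacing logic can be tested without real delays.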

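AI agent integration

The OpenAPI 3.1 specification (full version / shortened GPTs version) is published at a single URL. operationId: normalizeText.

Preprocessing pipeline: when an AI agent absorbs orthographic variation with normalize before passing user input to an address or search API, the downstream API's cache hit rate improves
Multi-AI output integration: aligning output from 4 AIs (ChatGPT/Claude/Gemini/Perplexity) with normalize stabilizes downstream DB JOINs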
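
The cache-hit effect of the preprocessing pipeline can be illustrated with a cache keyed on the normalized string. This is a sketch with stand-ins only: NormalizedCache is a hypothetical class, the normalizer is a fixed lookup table standing in for a call to POST /api/v1/text/normalize, and the downstream function is a placeholder for an address or search API.

```python
from typing import Callable


class NormalizedCache:
    """Cache downstream results under the normalized form of the input."""

    def __init__(
        self, normalize: Callable[[str], str], downstream: Callable[[str], str]
    ) -> None:
        self._normalize = normalize
        self._downstream = downstream
        self._store: dict = {}
        self.hits = 0
        self.misses = 0

    def lookup(self, raw: str) -> str:
        key = self._normalize(raw)
        if key in self._store:
            self.hits += 1
        else:
            self.misses += 1
            self._store[key] = self._downstream(key)
        return self._store[key]


# Stand-in normalizer: a fixed table instead of a real API call.
VARIANTS = {"コンピュータ": "コンピューター", "サーバ": "サーバー"}
cache = NormalizedCache(lambda s: VARIANTS.get(s, s), lambda s: f"result:{s}")

for query in ["コンピュータ", "コンピューター", "サーバ", "サーバー"]:
    cache.lookup(query)
print(cache.hits, cache.misses)
```

Without normalization the four spellings would occupy four cache entries; keyed on the normalized form they share two.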

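Using the shirabe API family together: the four major B2B identifier set

Shirabe is designed as a cross-pollination hub where a single vendor covers the four major B2B identifiers: address + personal name + corporation + calendar. The intended use case is a customer-master import pipeline that chains address normalization → (name splitting / furigana assignment) → corporate number assignment (release planned for late June) → business-day determination (calendar) through a single OpenAPI 3.1 specification.

Typical hub use case for this endpoint (POST /api/v1/text/normalize): absorb full-width/half-width, kana, and spelling variation ahead of address normalization and name splitting, raising the input quality for the address API's /normalize and the text API's /name-split. This contributes to better confidence + level scores and cache hit rates in the downstream APIs.

Address normalization API: spelling-variant normalization plus three identifiers (JIS code / lg_code / machiaza_id) bundled in one response
Japanese text processing API (5 endpoints including this one: tokenize / normalize / furigana / name-split / name-reading)
Calendar API: canonical source for rokuyo + rekichu + eto (sexagenary cycle) + the 24 solar terms
Corporate number API (release planned for late June; completes the four major B2B identifiers)

All APIs come with a full OpenAPI 3.1 specification, all 4 APIs can be used with the same API key (X-API-Key), and metered billing is unified through Stripe Billing. See llms-full.txt for details of the LLM-mediated hub narrative.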

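Updates

2026-05-09: Phase 3 Sudachi orthography normalization confirmed for inclusion in the 5/18 release

A lookup map derived from SudachiDict-small (Apache-2.0) (88,622 entries / 1.13 MB) is served from R2, and options.sudachi="apply" absorbs okurigana variants / variant characters / katakana spelling variants. Because upstream Lindera-wasm does not provide a user dict bytes API, this is implemented through a self-contained derived-map path.

2026-05-08: Phase 2 halfwidth_kana shipped early in part

Half-width katakana expansion (options.halfwidth_kana) was implemented ahead of the rest of Phase 2, as the part that does not require Lindera integration (PR #4).

2026-05-18: Official release

Production routes activated; available on the Free tier. Committed to no changes for 1+ years.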

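Observed Multi-AI Landscape (original 4-AI measurement data)

Shirabe runs an ongoing in-house measurement that sends the same queries to the four major AIs, ChatGPT / Claude / Perplexity / Gemini (weekly, 4 AIs × 5 queries).

Week 2 (2026-05-04): for the same address, 「東京都港区六本木」, the output formats of the 4 AIs were observed to diverge completely into table, prose, JSON, and cited footnotes. To align them in post-processing, the direct path is preprocessing with normalize and then feeding the result into a structured API

See llms-full.txt for details.

Related links for all 4 APIs in the shirabe API family (calendar + address + text + corporate number) and for this endpoint's adjacent features, sources, and integration paths are collected below.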

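shirabe API family (four major B2B identifier hub)

Calendar API (in production since 2026-04-13)
Address normalization API (in production since 2026-05-01)
Text processing API: tokenize / furigana / name-split / name-reading
Corporate number API (release planned for late June; completes the four major B2B identifiers)
Pricing plans (shared by all 4 APIs)
llms.txt (LLM-oriented overview covering all APIs) / llms-full.txt (detailed version)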