Technical Deep Dive

2026-02-25

Rebuilding the CORE Korean Dictionary Engine: Inside NKDBv2

Why and how NKDBv2 replaced JSON-first lexicon loading with a versioned, chunked binary engine for native and WASM.

Overview

CORE Korean's dictionary layer began as a straightforward data integration task: ingest a large NIKL lexicon export and make it queryable inside a web application. That approach worked initially, but it quickly became clear that just parsing JSON was not going to scale in terms of performance, maintainability, or long-term architectural clarity.

We wanted the dictionary to behave like an engine, not a blob of text.

NKDBv2 is the result of that redesign. It is a versioned, chunked binary format designed specifically for high-performance lexicon lookup across native and WebAssembly environments.

The Motivation

The dataset contains nearly fifty thousand headwords, each potentially carrying multiple parts of speech, Korean and English definitions, and structured usage examples. JSON represents this easily but is expensive to parse and traverse, especially in constrained mobile environments.

JSON also provides no built-in versioning discipline, no integrity validation, and no structured evolution path. A growing product needs a lexicon layer that is stable, verifiable, and explicitly engineered.

Format Overview

NKDBv2 is a self-contained binary format. Each file begins with a fixed header containing a magic marker, the format version, structural offsets, and a CRC32 checksum. The checksum is validated at load time to prevent silent corruption.
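To make the load-time check concrete, here is a minimal sketch of header validation with a bitwise CRC32 (IEEE polynomial). The exact field layout, offsets, and magic bytes (`NKDB`) are illustrative assumptions, not the real NKDBv2 layout; only the mechanism (parse fixed header, recompute checksum, reject on mismatch) mirrors the engine.

```rust
/// Bitwise CRC32 (IEEE, reflected), table-free; fine for a sketch.
fn crc32(data: &[u8]) -> u32 {
    let mut crc = 0xFFFF_FFFFu32;
    for &b in data {
        crc ^= b as u32;
        for _ in 0..8 {
            let mask = (crc & 1).wrapping_neg(); // all-ones if low bit set
            crc = (crc >> 1) ^ (0xEDB8_8320 & mask);
        }
    }
    !crc
}

const MAGIC: &[u8; 4] = b"NKDB"; // hypothetical magic marker

/// Parse a hypothetical 16-byte header: magic, version, payload length, CRC.
fn validate(file: &[u8]) -> Result<u16, String> {
    if file.len() < 16 || file[0..4] != MAGIC[..] {
        return Err("bad magic".into());
    }
    let version = u16::from_le_bytes([file[4], file[5]]);
    let len = u32::from_le_bytes([file[8], file[9], file[10], file[11]]) as usize;
    let stored = u32::from_le_bytes([file[12], file[13], file[14], file[15]]);
    if file.len() < 16 + len {
        return Err("truncated file".into());
    }
    // Recompute the checksum over the payload and reject silent corruption.
    if crc32(&file[16..16 + len]) != stored {
        return Err("checksum mismatch".into());
    }
    Ok(version)
}

fn main() {
    let payload = b"hello lexicon";
    let mut file = Vec::new();
    file.extend_from_slice(MAGIC);
    file.extend_from_slice(&2u16.to_le_bytes()); // version 2
    file.extend_from_slice(&[0, 0]); // padding
    file.extend_from_slice(&(payload.len() as u32).to_le_bytes());
    file.extend_from_slice(&crc32(payload).to_le_bytes());
    file.extend_from_slice(payload);
    assert_eq!(validate(&file), Ok(2));
    // Flip one payload byte: validation must fail.
    file[20] ^= 0xFF;
    assert!(validate(&file).is_err());
    println!("header ok");
}
```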

The file is organized into an index section, front-coded key blocks, and a values block backed by a string table.

Keys are sorted in UTF-8 byte order and stored with block front-coding: the first key in a block is full, then following keys encode shared-prefix length plus suffix. Prefix reconstruction is byte-based, not codepoint-based, to preserve UTF-8 correctness for Korean text.
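A byte-level sketch of that front-coding scheme, with an in-memory block layout chosen for clarity rather than matching the on-disk encoding: the first key is stored whole, each later key as a shared-prefix length plus suffix, and reconstruction replays prefixes from the block start. Because prefixes are taken on UTF-8 bytes of already-valid keys, reassembly always yields valid UTF-8, including for multi-byte Hangul.

```rust
fn shared_prefix_len(a: &[u8], b: &[u8]) -> usize {
    a.iter().zip(b).take_while(|(x, y)| x == y).count()
}

/// Encode a sorted run of keys: first key verbatim, then (lcp, suffix) pairs.
fn encode_block(keys: &[&str]) -> Vec<(usize, Vec<u8>)> {
    let mut out = Vec::new();
    let mut prev: &[u8] = &[];
    for k in keys {
        let b = k.as_bytes();
        let lcp = shared_prefix_len(prev, b);
        out.push((lcp, b[lcp..].to_vec()));
        prev = b;
    }
    out
}

/// Reconstruct the i-th key by replaying prefixes from the block start.
fn decode_key(block: &[(usize, Vec<u8>)], i: usize) -> String {
    let mut cur: Vec<u8> = Vec::new();
    for (lcp, suffix) in &block[..=i] {
        cur.truncate(*lcp);
        cur.extend_from_slice(suffix);
    }
    // Byte prefixes of valid UTF-8 keys reassemble to valid UTF-8.
    String::from_utf8(cur).expect("valid UTF-8")
}

fn main() {
    // Hangul keys sharing multi-byte prefixes, sorted in UTF-8 byte order.
    let keys = ["가게", "가격", "가구", "나무"];
    let block = encode_block(&keys);
    for (i, k) in keys.iter().enumerate() {
        assert_eq!(decode_key(&block, i), *k);
    }
    println!("front-coding roundtrip ok");
}
```

Note that "가게" and "가격" share five bytes, not one character: the shared prefix cuts through the middle of a three-byte syllable, which is exactly why reconstruction must be byte-based.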

Values are varint-encoded and string-interned. Records model multiple POS entries, KO and EN definitions, and usage structures including multi-turn dialogue.
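The varint scheme is not specified in this post; the sketch below assumes unsigned LEB128, the common choice for this kind of encoding. Small numbers cost one byte, which is what makes interned-string IDs cheap when frequent strings get small IDs.

```rust
/// Unsigned LEB128: 7 payload bits per byte, high bit = continuation.
fn write_varint(mut v: u64, out: &mut Vec<u8>) {
    loop {
        let byte = (v & 0x7F) as u8;
        v >>= 7;
        if v == 0 {
            out.push(byte);
            break;
        }
        out.push(byte | 0x80); // more bytes follow
    }
}

/// Returns (value, bytes consumed).
fn read_varint(buf: &[u8]) -> (u64, usize) {
    let mut v = 0u64;
    let mut shift = 0;
    for (i, &b) in buf.iter().enumerate() {
        v |= ((b & 0x7F) as u64) << shift;
        if b & 0x80 == 0 {
            return (v, i + 1);
        }
        shift += 7;
    }
    panic!("truncated varint");
}

fn main() {
    let mut buf = Vec::new();
    for v in [0u64, 127, 128, 16_384, 1_000_000] {
        buf.clear();
        write_varint(v, &mut buf);
        let (back, n) = read_varint(&buf);
        assert_eq!((back, n), (v, buf.len()));
    }
    // IDs below 128 cost a single byte.
    let mut one = Vec::new();
    write_varint(5, &mut one);
    assert_eq!(one.len(), 1);
    println!("varint roundtrip ok");
}
```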

Chunking Strategy

NKDBv2 partitions the lexicon into 21 buckets. Korean forms map by the choseong (initial consonant) of their first syllable, while non-Hangul entries map into LAT, DIG, SYM, or ETC.
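The choseong of a precomposed Hangul syllable falls out of Unicode arithmetic: syllables occupy U+AC00..U+D7A3 in 588-codepoint runs per initial consonant. The sketch below shows that mechanism plus the category fallback; how the engine collapses these into exactly 21 buckets is not spelled out here, so treat the mapping as illustrative.

```rust
const CHOSEONG: [&str; 19] = [
    "ㄱ", "ㄲ", "ㄴ", "ㄷ", "ㄸ", "ㄹ", "ㅁ", "ㅂ", "ㅃ", "ㅅ",
    "ㅆ", "ㅇ", "ㅈ", "ㅉ", "ㅊ", "ㅋ", "ㅌ", "ㅍ", "ㅎ",
];

fn bucket_for(key: &str) -> String {
    match key.chars().next() {
        // Precomposed Hangul: U+AC00..=U+D7A3, 588 syllables per choseong.
        Some(c) if ('\u{AC00}'..='\u{D7A3}').contains(&c) => {
            let idx = (c as u32 - 0xAC00) / 588;
            CHOSEONG[idx as usize].to_string()
        }
        Some(c) if c.is_ascii_alphabetic() => "LAT".to_string(),
        Some(c) if c.is_ascii_digit() => "DIG".to_string(),
        Some(c) if c.is_ascii_punctuation() => "SYM".to_string(),
        _ => "ETC".to_string(),
    }
}

fn main() {
    assert_eq!(bucket_for("사과"), "ㅅ"); // 사 begins with ㅅ
    assert_eq!(bucket_for("한국"), "ㅎ");
    assert_eq!(bucket_for("apple"), "LAT");
    assert_eq!(bucket_for("42"), "DIG");
    println!("bucketing ok");
}
```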

Lookup normalizes input (trim plus NFC), chooses a bucket, and loads only that chunk. This minimizes I/O and memory while preserving fast binary-search behavior within the chunk.

A manifest tracks chunk metadata including key counts, size, and checksum for integrity checks.

Lookup Algorithm

Lookup starts with normalization and bucket selection. Bytes are read through a storage abstraction, then header and CRC are validated before parsing.

Binary search runs over the sorted key space; each midpoint key is reconstructed from front-coded blocks and compared against normalized UTF-8 query bytes.
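The comparison step can be sketched as a plain binary search over UTF-8 byte order. For brevity the keys here are plain strings rather than reconstructed from front-coded blocks, and normalization is reduced to `trim` (the real path also applies NFC, which needs a Unicode library).

```rust
use std::cmp::Ordering;

/// Binary search comparing raw UTF-8 bytes of a trimmed query against
/// keys sorted in UTF-8 byte order.
fn find(keys: &[&str], query: &str) -> Option<usize> {
    let q = query.trim(); // normalization sketch; the engine also applies NFC
    let (mut lo, mut hi) = (0usize, keys.len());
    while lo < hi {
        let mid = lo + (hi - lo) / 2;
        match keys[mid].as_bytes().cmp(q.as_bytes()) {
            Ordering::Less => lo = mid + 1,
            Ordering::Greater => hi = mid,
            Ordering::Equal => return Some(mid),
        }
    }
    None
}

fn main() {
    // Sorted in UTF-8 byte order, as NKDBv2 stores keys.
    let keys = ["가게", "가격", "나무", "사과", "하늘"];
    assert_eq!(find(&keys, "사과"), Some(3));
    assert_eq!(find(&keys, " 나무 "), Some(2)); // trimmed before comparison
    assert_eq!(find(&keys, "없음"), None);
    println!("lookup ok");
}
```

Byte-order comparison is what makes the sort order and the search order agree: the same `UTF-8 byte order` invariant governs both the builder and the reader.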

On match, record offsets point into the values block, which is decoded via varints and string table IDs. On the current dataset, release builds see lookup latency in the low-microsecond range on desktop-class hardware.

Storage Abstraction and WASM Compatibility

The NKDBv2 loader depends on a minimal asynchronous byte-source trait in the storage layer. It does not assume filesystem access.

Native uses filesystem-backed sources, web uses fetch-backed sources, and tests use in-memory sources. The parser is storage-agnostic and portable by construction.
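A simplified, synchronous sketch of that abstraction (the real trait is asynchronous, and these names are assumptions): the parser sees only the trait, so swapping filesystem, fetch, or in-memory backends changes nothing in parsing code.

```rust
/// Minimal byte-source abstraction; the real engine's trait is async.
trait ByteSource {
    fn read(&self, offset: usize, len: usize) -> Result<Vec<u8>, String>;
}

/// In-memory source, as used by tests.
struct MemSource(Vec<u8>);

impl ByteSource for MemSource {
    fn read(&self, offset: usize, len: usize) -> Result<Vec<u8>, String> {
        self.0
            .get(offset..offset + len)
            .map(|s| s.to_vec())
            .ok_or_else(|| "read out of range".to_string())
    }
}

/// Parser code depends only on the trait: storage-agnostic by construction.
fn read_magic(src: &dyn ByteSource) -> Result<Vec<u8>, String> {
    src.read(0, 4)
}

fn main() {
    let src = MemSource(b"NKDBv2 payload...".to_vec());
    assert_eq!(read_magic(&src).unwrap(), b"NKDB".to_vec());
    assert!(src.read(100, 4).is_err());
    println!("storage abstraction ok");
}
```

A native build would add a file-backed implementor and a WASM build a fetch-backed one; neither touches the parser.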

Build Pipeline

The build pipeline converts NIKL CSV into NKDBv2 chunks: it parses list-like structures robustly, aggregates entries by normalized form, merges POS data, and eliminates duplicates while preserving stable order.

Each chunk builds a local string table (frequency-aware assignment for compact varints), then emits chunk files and manifest metadata.
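The frequency-aware assignment can be sketched in a few lines: count occurrences, rank by descending frequency, and hand out IDs in rank order so the most common strings get the smallest (and therefore shortest-to-varint-encode) IDs. The function name and tie-breaking rule are illustrative.

```rust
use std::collections::HashMap;

/// Assign string-table IDs by descending frequency so common strings
/// get small IDs, which encode to the fewest varint bytes.
fn build_string_table(strings: &[&str]) -> HashMap<String, u32> {
    let mut freq: HashMap<&str, u32> = HashMap::new();
    for &s in strings {
        *freq.entry(s).or_insert(0) += 1;
    }
    let mut ranked: Vec<(&str, u32)> = freq.into_iter().collect();
    // Most frequent first; ties broken lexically for deterministic output.
    ranked.sort_by(|a, b| b.1.cmp(&a.1).then(a.0.cmp(&b.0)));
    ranked
        .into_iter()
        .enumerate()
        .map(|(id, (s, _))| (s.to_string(), id as u32))
        .collect()
}

fn main() {
    let occurrences = ["noun", "verb", "noun", "noun", "adjective", "verb"];
    let table = build_string_table(&occurrences);
    assert_eq!(table["noun"], 0); // most frequent gets the smallest ID
    assert_eq!(table["verb"], 1);
    assert_eq!(table["adjective"], 2);
    println!("string table ok");
}
```

Determinism matters here: a stable tie-break keeps rebuilds byte-identical, which in turn keeps the manifest checksums stable.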

Benchmarking and CI Enforcement

Performance and correctness are tested continuously, not assumed. The benchmark suite measures build time, cold chunk load, exact lookup latency (median and p95), and concurrent throughput.

Full-dataset validation ensures key roundtrip consistency and catches regressions in encoding, decoding, or indexing logic.

Device Modeling and Simulation

Desktop numbers are not enough for a mobile-first product. Device profiles include representative hardware classes such as iPhone XR, iPad 9th generation, Raspberry Pi 4, and low-end Android devices.

Simulations include CPU slowdown, memory pressure, and I/O delay injection to project realistic behavior across weaker environments.

What This Enables

For learners, dictionary lookup feels instant and consistent. For the engineering side, NKDBv2 delivers:

Stable, versioned lexicon format

Integrity validation via CRC

Bounded load and lookup behavior

Storage portability across native and web

CI-enforced correctness and performance

Device-class behavior projections

Closing Thoughts

NKDBv2 is not about premature optimization; it is about architectural clarity. By explicitly engineering for integrity, portability, and predictable performance, the lexicon becomes a reliable platform subsystem.

That foundation supports future features such as advanced search, morphological expansion, and stronger offline-first workflows with confidence.