Modular Architecture for Artificial Lexicons: Control and Reproducibility Beyond LLMs

The Challenge of Artificial Lexicons and Current LLM Limitations

Creating artificial lexicons that are simultaneously pronounceable, typologically plausible, and semantically structured is a hard problem in computational linguistics. Such lexicons serve applications ranging from linguistic research to the generation of synthetic data for model training. However, existing approaches to constructed-language ("conlang") generation often exhibit significant shortcomings.

Specifically, many current generators lack formal phonotactic guarantees, so the words they produce may violate the sound-sequencing constraints that a language's phonology imposes. Another emerging limitation is the reliance on Large Language Model (LLM)-based pipelines, which, despite their power, can be opaque and non-reproducible. This lack of transparency and control makes it difficult to understand the generation process and to replicate its results reliably, a critical requirement in enterprise environments.

A Modular Framework for Controlled Generation

To address these issues, a new modular framework has been proposed. It offers granular control over the lexicon generation process, avoiding the opacity and reproducibility limitations of LLM-based pipelines. Because the system is modular, each phase of lexicon creation is handled by a separate component that can be replaced independently of the others.
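The idea of interchangeable components can be illustrated with a minimal Python sketch. Every name here (LexiconPipeline, toy_inventory, cv_generator, toy_meanings) is hypothetical and invented for illustration; the text does not describe the framework's actual interfaces. The sketch only shows how three swappable stages and a single seeded random generator make a run repeatable.

```python
import random
from dataclasses import dataclass
from typing import Callable, Dict, List, Sequence

# Hypothetical stage signatures; the real framework's interfaces are not given in the text.
InventorySampler = Callable[[random.Random], List[str]]
FormGenerator = Callable[[Sequence[str], int, random.Random], List[str]]
MeaningAssigner = Callable[[Sequence[str]], Dict[str, str]]


@dataclass
class LexiconPipeline:
    sample_inventory: InventorySampler
    generate_forms: FormGenerator
    assign_meanings: MeaningAssigner

    def run(self, n_forms: int, seed: int) -> Dict[str, str]:
        rng = random.Random(seed)  # a single seeded RNG drives every stage
        inventory = self.sample_inventory(rng)
        forms = self.generate_forms(inventory, n_forms, rng)
        return self.assign_meanings(forms)


def toy_inventory(rng: random.Random) -> List[str]:
    # Stand-in for sampling a real inventory: pick 4 consonants and 3 vowels.
    return rng.sample(list("ptkmnsl"), 4) + rng.sample(list("aiueo"), 3)


def cv_generator(inventory: Sequence[str], n_forms: int, rng: random.Random) -> List[str]:
    # Deterministic template grammar: every word is two CV syllables.
    vowels = [p for p in inventory if p in "aiueo"]
    consonants = [p for p in inventory if p not in "aiueo"]
    forms: List[str] = []
    while len(forms) < n_forms:
        word = "".join(rng.choice(consonants) + rng.choice(vowels) for _ in range(2))
        if word not in forms:  # keep forms unique
            forms.append(word)
    return forms


def toy_meanings(forms: Sequence[str]) -> Dict[str, str]:
    # Stand-in concept list; a real system would draw on a larger ontology.
    concepts = ["water", "fire", "stone"]
    return dict(zip(concepts, forms))


pipeline = LexiconPipeline(toy_inventory, cv_generator, toy_meanings)
lexicon = pipeline.run(n_forms=3, seed=7)
print(lexicon)
```

Swapping cv_generator for a different grammar would leave the other two stages untouched, and rerunning with the same seed reproduces the same lexicon.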

The framework first samples phoneme inventories from PHOIBLE, a large cross-linguistic database of phoneme inventories. It then generates word forms using interchangeable phonological grammars, including deterministic approaches, Optimality Theory (OT), and Maximum Entropy (MaxEnt) models. Finally, it assigns meanings to the word forms through a Swadesh–Leipzig–Jakarta ontology, ensuring explicit form–meaning alignment. This architecture lets developers select and combine the modules best suited to their needs, promoting transparency and reproducibility.
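As a rough illustration of the form-generation step, the sketch below implements a toy MaxEnt grammar in Python. The inventory, the two constraints (*CC and NoCoda), and their weights are assumptions chosen for the example, not values from the framework, and the inventory is hand-written rather than drawn from PHOIBLE. In a MaxEnt grammar each candidate form receives a harmony score (the weighted sum of its constraint violations), and its probability is proportional to exp(-harmony).

```python
import itertools
import math
import random

VOWELS = set("ai")
INVENTORY = ["p", "t", "k", "a", "i"]   # hand-written stand-in for a sampled inventory
WEIGHTS = {"*CC": 3.0, "NoCoda": 1.5}   # hypothetical constraint weights


def violations(form):
    """Count violations of two toy markedness constraints."""
    clusters = sum(
        1 for a, b in zip(form, form[1:]) if a not in VOWELS and b not in VOWELS
    )
    coda = 1 if form[-1] not in VOWELS else 0
    return {"*CC": clusters, "NoCoda": coda}


def maxent_distribution(candidates, weights):
    """Return P(form) proportional to exp(-weighted violation sum)."""
    scores = [
        math.exp(-sum(weights[name] * count for name, count in violations(f).items()))
        for f in candidates
    ]
    z = sum(scores)  # normalizing constant
    return {f: s / z for f, s in zip(candidates, scores)}


def sample_forms(inventory, length, weights, n, seed):
    """Sample n forms of a fixed length from the MaxEnt distribution."""
    candidates = ["".join(p) for p in itertools.product(inventory, repeat=length)]
    dist = maxent_distribution(candidates, weights)
    rng = random.Random(seed)  # fixed seed makes the sample repeatable
    return rng.choices(list(dist), weights=list(dist.values()), k=n)


print(sample_forms(INVENTORY, 4, WEIGHTS, n=5, seed=1))
```

Under these weights a form such as "pata" (no violations) is more probable than "ptak" (one cluster and one coda), but the latter is not ruled out. A strict-ranking OT grammar would instead pick winners by comparing violations constraint by constraint rather than by weighted sum.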

Performance Evaluation and Typological Realism

The framework was evaluated with a set of quantitative metrics: character n-gram perplexity, log-likelihood, and KL divergence, each comparing the generated lexicons against PHOIBLE data. Tests were conducted on lexicons of varying sizes, from 100 to 5,000 forms, to assess the system's scalability and robustness.
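To make these metrics concrete, here is a small Python sketch of character bigram perplexity and KL divergence. It is an assumption-laden illustration: the reference word list is a toy stand-in for PHOIBLE-derived data, add-alpha smoothing and the unigram-level KL divergence are choices made for brevity, and the text does not specify which n-gram order or smoothing the evaluation actually used.

```python
import math
from collections import Counter


def bigram_model(words, alphabet, alpha=1.0):
    """Return a function giving add-alpha smoothed P(b | a) over characters."""
    symbols = set(alphabet) | {"#"}  # '#' marks word boundaries
    counts, context_totals = Counter(), Counter()
    for w in words:
        padded = "#" + w + "#"
        for a, b in zip(padded, padded[1:]):
            counts[(a, b)] += 1
            context_totals[a] += 1

    def prob(a, b):
        return (counts[(a, b)] + alpha) / (context_totals[a] + alpha * len(symbols))

    return prob


def perplexity(words, prob):
    """Per-character perplexity of the words under a bigram model."""
    log_sum, n = 0.0, 0
    for w in words:
        padded = "#" + w + "#"
        for a, b in zip(padded, padded[1:]):
            log_sum += math.log(prob(a, b))
            n += 1
    return math.exp(-log_sum / n)


def kl_divergence(p_words, q_words, alphabet, alpha=1.0):
    """KL(P || Q) between smoothed character unigram distributions."""
    def dist(words):
        counts = Counter("".join(words))
        total = sum(counts[ch] for ch in alphabet) + alpha * len(alphabet)
        return {ch: (counts[ch] + alpha) / total for ch in alphabet}

    p, q = dist(p_words), dist(q_words)
    return sum(p[ch] * math.log(p[ch] / q[ch]) for ch in alphabet)


ALPHABET = "ptkai"
reference = ["pata", "kita", "tapi", "kapa"]  # toy stand-in for reference data
plausible = ["taka", "pika"]
implausible = ["ptkk", "aaii"]

model = bigram_model(reference, ALPHABET)
print(perplexity(plausible, model), perplexity(implausible, model))
print(kl_divergence(plausible, reference, ALPHABET))
```

Lower perplexity and lower KL divergence both indicate that a generated lexicon looks more like the reference data, which is the sense in which the grammars are compared.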

The results showed that probabilistic grammars, such as those based on Optimality Theory and MaxEnt models, consistently outperformed the deterministic and random baselines. The improvement was evident in both phonotactic coherence (generated words respect the sound-sequencing constraints) and typological realism (the lexicons more closely resemble those of natural languages). The ability to generate lexicons with a high degree of linguistic plausibility is fundamental for applications that require high-quality, faithful data.

Implications for LLM Deployment and Data Sovereignty

While the framework does not directly focus on LLM hardware or deployment, its implications are significant for organizations evaluating on-premise AI solutions. The critique leveled at "opaque, non-reproducible LLM-based pipelines" resonates with the concerns of CTOs and infrastructure architects regarding control, data sovereignty, and compliance. The opacity of black-box models can pose a significant obstacle in contexts where full auditability and predictable behavior are required.

A modular, rule-based approach, such as the one proposed, offers greater transparency and reproducibility, both crucial for air-gapped environments and for sectors with stringent regulatory requirements. The ability to define grammars and ontologies explicitly allows companies to maintain full control over generation processes, mitigating the risks associated with less transparent AI systems. For those evaluating on-premise deployments, the choice between the flexibility but potential opacity of generic LLMs and the controllability of more structured frameworks is a fundamental trade-off; analytical frameworks for assessing it are available on /llm-onpremise.