Generating Realistic Symbolic Sequences

A new study introduces a surrogate model that generates symbolic sequences which faithfully mimic the statistical characteristics of written language and genomic DNA. Natural sequences of both kinds show characteristic symbol frequency distributions and long-range correlations, and the surrogates are designed to reproduce both.

Preserving Frequencies and Correlations

The proposed model preserves both the empirical symbol frequencies of the original sequence and its long-range correlation structure, quantified by the detrended fluctuation analysis (DFA) exponent. This addresses a limitation of existing models, which typically preserve only one of the two characteristics.
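For readers unfamiliar with DFA, the following is a minimal sketch of how a DFA exponent could be estimated for a numeric series. It is not the study's code: the function name `dfa_exponent`, the choice of scales, and the use of first-order (linear) detrending are assumptions made for illustration. How symbols are converted to numbers before the analysis is not specified in the text above, so the sketch simply takes a numeric input.

```python
import numpy as np


def dfa_exponent(series, scales=None, order=1):
    """Estimate the DFA scaling exponent of a numeric series (illustrative sketch)."""
    x = np.asarray(series, dtype=float)
    # Step 1: integrate the mean-subtracted series to obtain the "profile".
    profile = np.cumsum(x - x.mean())
    n = profile.size
    if scales is None:
        # Assumed default: ~10 log-spaced window sizes between 16 and n/8
        # (requires a series of at least a few hundred points).
        scales = np.unique(
            np.logspace(np.log10(16), np.log10(n // 8), 10).astype(int)
        )
    fluctuations = []
    for s in scales:
        n_windows = n // s
        # Step 2: cut the profile into non-overlapping windows of length s.
        segments = profile[: n_windows * s].reshape(n_windows, s).T  # (s, n_windows)
        t = np.arange(s)
        # Step 3: fit and remove a polynomial trend in every window at once.
        coeffs = np.polyfit(t, segments, order)        # (order + 1, n_windows)
        trend = np.vander(t, order + 1) @ coeffs       # (s, n_windows)
        residual = segments - trend
        # Step 4: root-mean-square fluctuation at this scale.
        fluctuations.append(np.sqrt(np.mean(residual ** 2)))
    # Step 5: the exponent is the slope of log F(s) versus log s.
    slope, _ = np.polyfit(np.log(scales), np.log(fluctuations), 1)
    return slope
```

Under this sketch, uncorrelated noise should give an exponent near 0.5, while values above 0.5 indicate persistent long-range correlations.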

Mapping and Validation

The method generates surrogates of symbolic sequences by mapping fractional Gaussian noise (FGN) onto the empirical histogram through a frequency-preserving assignment. The resulting surrogates match the original in first-order statistics and long-range scaling, while short-range dependencies are randomized. The method was validated on representative texts in English and Latin, and its applicability to genomic DNA was demonstrated by reproducing both base composition and DFA scaling.
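One way to picture such a mapping is a rank-based assignment: generate correlated noise, sort its values, and hand out symbols in blocks whose sizes equal the empirical counts. The sketch below illustrates that idea only and should not be read as the study's implementation. The names `approx_fgn` and `frequency_preserving_surrogate` are hypothetical, the noise is produced by simple Fourier filtering of white noise (an approximation to FGN, not an exact generator), and the alphabetical ordering of symbols along the noise axis is an arbitrary assumption.

```python
from collections import Counter

import numpy as np


def approx_fgn(n, hurst, rng):
    """Approximate fractional Gaussian noise by Fourier filtering white noise.

    Assumption: the spectrum is shaped as S(f) ~ f^(1 - 2H), which captures
    the low-frequency behaviour of FGN but is not an exact generator.
    """
    white = rng.standard_normal(n)
    freqs = np.fft.rfftfreq(n)
    gain = np.zeros_like(freqs)
    gain[1:] = freqs[1:] ** (-(2.0 * hurst - 1.0) / 2.0)  # amplitude = sqrt(S(f))
    return np.fft.irfft(np.fft.rfft(white) * gain, n=n)


def frequency_preserving_surrogate(sequence, hurst, seed=0):
    """Map correlated noise onto the symbol histogram of `sequence` by rank."""
    rng = np.random.default_rng(seed)
    counts = Counter(sequence)
    symbols = sorted(counts)  # assumed ordering; the source does not specify one
    noise = approx_fgn(len(sequence), hurst, rng)
    # Positions of the noise values from smallest to largest.
    order = np.argsort(noise, kind="stable")
    # Pool of symbols with exactly the empirical counts, grouped by symbol.
    pool = np.repeat(symbols, [counts[s] for s in symbols])
    surrogate = np.empty(len(sequence), dtype=pool.dtype)
    # The lowest noise values receive the first symbol, and so on, so every
    # symbol appears exactly as often as in the original sequence.
    surrogate[order] = pool
    return surrogate.tolist()
```

Calling `frequency_preserving_surrogate("GATTACAGATTACA" * 50, hurst=0.8)` would return a list of bases with the same composition as the input. Because the assignment depends only on ranks, the symbol counts are preserved exactly, while the ordering inherits the slow fluctuations of the noise.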

The tool makes it possible to decouple the structural features of symbolic systems and to test hypotheses about the origin of scaling laws and memory effects across domains, from language to DNA.