Word generator (v1.2)

Abstraction Factor

Every generated word is produced according to the probability score of prefix continuation. To determine this score, the generator considers not just the concrete prefixes found in the source dictionary but also abstracted ones, in which one or more phonemes are replaced with abstract symbols such as <consonant> and <vowel>. Abstracted patterns yield lower scores than concrete ones, so that concrete prefixes are preferred. The Abstraction Factor is the multiplier applied to the score when one phoneme position is turned into an abstract symbol. The higher the Abstraction Factor, the richer the variety of words obtained, but their phonotactics may drift further from those of the original dictionary.

Subsequent phoneme multiplier ("wildcard +")

Each time the number of abstracted phonemes in the probability calculation increases, this additional multiplier is applied to the Abstraction Factor.
Example: let the (master * confidence-based) abstraction factor be 0.5 and the Subsequent phoneme multiplier be 0.7.
Then a {'a','b','c','d'} 4-gram in a generated word will be considered as probable as the {'a','b','c','d'} 4-gram in the source dictionary, plus 0.5 times as probable as {'a',<consonant>,'c','d'} and {'a',<vowel>,'c','d'}, plus 0.5 * 0.7 = 0.35 times as probable as {<vowel>,<consonant>,'c','d'}.
See also Confidence dependent multiplier
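
The arithmetic of the example above can be sketched in a few lines of Python. The function name and the extension to three or more abstracted positions are assumptions made for illustration; the text only confirms the weights for zero, one and two abstracted positions.

```python
def abstraction_weight(abstracted_count, abstraction_factor, subsequent_multiplier):
    """Weight of a pattern with `abstracted_count` abstracted positions.

    Assumption: the first abstracted position costs `abstraction_factor`,
    and every further one multiplies the weight by `subsequent_multiplier`.
    """
    if abstracted_count == 0:
        return 1.0  # a concrete N-gram keeps its full score
    return abstraction_factor * subsequent_multiplier ** (abstracted_count - 1)


# Values from the example: factor 0.5, subsequent phoneme multiplier 0.7.
for count in range(3):
    print(count, abstraction_weight(count, 0.5, 0.7))
```

With these settings a concrete pattern weighs 1, a pattern with one abstracted position weighs 0.5, and a pattern with two weighs about 0.35.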

On-the-fly pruning - Minimum required confidence

To enforce more plausible phoneme combinations when the training dictionary is sparse, low-confidence N-grams are additionally pruned. A word-in-progress is rejected as soon as one of its N-grams fails to meet the minimum confidence threshold. Note that this is not the same as avoiding lower-confidence N-grams at each step: unlike that approach, pruning does not force every 'weird' word to be completed. This can significantly increase the overall precision of word generation. High confidence thresholds should be used with care, however, as most 'high-hanging fruits' are unlikely to be hit with them.
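
A minimal Python sketch of the reject-the-whole-word idea follows. Everything in it is hypothetical: the `try_generate` function, the `(phoneme, weight, confidence)` option format, the empty-string end marker and the toy continuation table are illustrative stand-ins, not the generator's actual internals.

```python
import random

END = ""  # hypothetical end-of-word marker


def try_generate(next_options, min_confidence, rng, max_len=12):
    """Build one word; return None as soon as a low-confidence step is drawn."""
    word = ""
    while len(word) < max_len:
        options = next_options(word)  # [(phoneme, weight, confidence), ...]
        weights = [weight for _, weight, _ in options]
        phoneme, _, confidence = rng.choices(options, weights=weights)[0]
        if confidence < min_confidence:
            return None  # prune the whole word-in-progress
        if phoneme == END:
            return word
        word += phoneme
    return word


def toy_options(word):
    # Toy continuation table: "a" is well attested, "x" is rare.
    if len(word) >= 3:
        return [(END, 1.0, 1.0)]
    return [("a", 0.8, 0.9), ("x", 0.2, 0.1)]


rng = random.Random(0)
results = [try_generate(toy_options, 0.5, rng) for _ in range(200)]
kept = [w for w in results if w is not None]
print(len(kept), "of", len(results), "attempts survived pruning")
```

The alternative that the text contrasts this with would filter the options before sampling, so every attempt would still run to completion; here an attempt that draws a rare continuation is simply abandoned.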

Confidence dependent multiplier

When an abstracted N-gram from the source dictionary is used to generate words, the confidence of the abstraction is evaluated first. If, say, multiple different consonants have been seen occupying the same position in an N-gram, abstracting them to the Consonant class is treated as confident. If only a few consonants have been seen at that position, the abstraction is treated as low-confidence (in other words, an arbitrary consonant is not expected to fit there). You may select an additional multiplier, dependent on the confidence level, to be applied on top of the Abstraction Factor. You may, for example, favor more confident abstractions in generated words (the default setting) or encourage unusual words by increasing the multiplier for lower confidence levels.
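
One way to picture the confidence evaluation is to count how many distinct consonants have been seen in a given slot of otherwise identical N-grams. The Python sketch below does this for a toy set of 3-grams; the `slot_variety` helper, the "*" placeholder and the threshold of three distinct consonants are assumptions for illustration, not the tool's actual rule.

```python
from collections import defaultdict


def slot_variety(ngrams, position, phoneme_class):
    """Map each context (N-gram with one slot blanked out) to the set of
    class members that were seen in that slot."""
    seen = defaultdict(set)
    for gram in ngrams:
        if gram[position] in phoneme_class:
            context = gram[:position] + ("*",) + gram[position + 1:]
            seen[context].add(gram[position])
    return seen


CONSONANTS = {"b", "d", "k", "t", "r"}
ngrams = [("a", "b", "a"), ("a", "d", "a"), ("a", "k", "a"), ("o", "t", "o")]

variety = slot_variety(ngrams, 1, CONSONANTS)
for context, members in variety.items():
    # Hypothetical rule: three or more distinct consonants count as "high".
    level = "high" if len(members) >= 3 else "low"
    print(context, sorted(members), level)
```

In this toy data the middle slot of a-*-a has been filled by three different consonants, so abstracting it to the Consonant class looks safe, while o-*-o has only ever contained "t".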

(Quasi) Syllable limit

We use a simplified syllable count for this limit: any cluster of one or more vowels counts as one syllable. For example, "Austria" is a two quasi-syllable word.
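
The counting rule can be sketched in Python as follows. The sketch assumes single-letter vowels passed in as a string; the generator itself works on the phonemes declared in the dictionary file, which may span several characters.

```python
import re


def quasi_syllables(word, vowels="aeiou"):
    """Count maximal runs of vowels; each run is one quasi-syllable."""
    pattern = "[" + re.escape(vowels) + "]+"
    return len(re.findall(pattern, word.lower()))


print(quasi_syllables("Austria"))  # 2: the clusters "Au" and "ia"
```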


N-Gram size

We generate words by reproducing the probability distribution of phoneme sequences found in the original dictionary. The N-Gram size option defines the maximum length of the phoneme sequences considered (called N-grams, where N is the length). If N is too small, important phonotactic patterns may be missed; if N is too large, generated words differ little from the original ones and computation becomes slow. A value of 4-5 is usually the most reasonable.
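
As a rough illustration of what collecting N-grams up to a maximum length means, the Python sketch below counts every phoneme sequence of length 1 to `max_n` in a tiny word list. The `count_ngrams` function and the representation of words as lists of phonemes are assumptions; the generator's own data structures are not documented here.

```python
from collections import Counter


def count_ngrams(words, max_n):
    """Count every phoneme sequence of length 1..max_n in the given words."""
    counts = Counter()
    for phonemes in words:
        for n in range(1, max_n + 1):
            for start in range(len(phonemes) - n + 1):
                counts[tuple(phonemes[start:start + n])] += 1
    return counts


words = [["k", "a", "t", "a"], ["t", "a", "k", "a"]]
counts = count_ngrams(words, max_n=2)
print(counts[("t", "a")], counts[("k", "a")], counts[("a",)])
```

Raising `max_n` makes the table grow quickly, which is one reason large N values slow the computation down.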

Dictionary Words

You may allow generated words that exactly match words found in the original dictionary. By default, such words are filtered out.
If both the "Allow" and "Mark(#)" options are checked, such words are marked with a trailing hash character.

Precision evaluation

This mode displays the estimated precision of 'hitting' lexemes from the training dictionary, which makes experimenting with Word Generator options and parameters much more meaningful. When the probability of hitting a given word is evaluated, that word is treated as excluded from the dictionary. Results are grouped by quantiles of the most 'challenging' words. For example, the 0.05, 0.25 and 1 quantile precisions shown in the result table are computed over the 5%, 25% and 100% 'highest-hanging fruits', respectively. (The last one is also the mean precision over the whole dictionary.)
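
The quantile grouping can be sketched in Python as below. It assumes that each dictionary word already has a hit probability (estimated with that word held out) and that 'challenging' means the lowest such probability; the `quantile_precision` function and the sample numbers are invented for illustration.

```python
def quantile_precision(hit_probabilities, quantile):
    """Mean hit probability over the hardest `quantile` share of words.

    Expects a non-empty list of per-word hit probabilities.
    """
    ranked = sorted(hit_probabilities)            # hardest (lowest) first
    size = max(1, round(len(ranked) * quantile))  # keep at least one word
    return sum(ranked[:size]) / size


hits = [0.001, 0.002, 0.01, 0.05, 0.2, 0.4, 0.5, 0.6]
for q in (0.25, 1):
    print(q, quantile_precision(hits, q))
```

With quantile 1 the function averages over every word, which matches the remark that the last column is the mean precision over the whole dictionary.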

Probability Score

The Probability Score is an internal metric of how likely a word is to appear. (In the current implementation it is not necessarily equal to the probability itself.) You may use the following controls to work with the Score:
  • Pick top - run a flow that hunts for the top-scored words rather than making random 'shots'. Note that the resulting word collection will tend to show little variety in length and in overall flavor.
  • Sort by - sort the output by word Score
  • Display - display the Score next to every generated word, like word (0.0001)

Source Dictionary

This is the dictionary used to learn the phonotactics of the language in which words are to be generated.
It should be a plain text file (in ASCII or UTF-8 encoding) that contains:
  • A Vowels line containing a comma-separated enumeration of all available vowel phonemes. It's OK for a phoneme to be represented by a combination of characters.
    Example: Vowels: a, aa, e, ee, i, ii, o, oo, u, uu
  • A similar line for Consonants
  • A plain list of words, one per line
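
A file in this layout could be read with a sketch like the following. It assumes the Consonants line uses a "Consonants:" label by analogy with the Vowels example, and it splits words into phonemes by greedy longest match; both are assumptions, and `parse_dictionary` and `to_phonemes` are hypothetical helper names.

```python
def parse_dictionary(text):
    """Split dictionary text into vowels, consonants and the word list."""
    vowels, consonants, words = [], [], []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        label, sep, rest = line.partition(":")
        if sep and label.strip().lower() == "vowels":
            vowels = [p.strip() for p in rest.split(",") if p.strip()]
        elif sep and label.strip().lower() == "consonants":
            consonants = [p.strip() for p in rest.split(",") if p.strip()]
        else:
            words.append(line)
    return vowels, consonants, words


def to_phonemes(word, inventory):
    """Greedy longest-match split, so "aa" wins over two separate "a"s."""
    by_length = sorted(inventory, key=len, reverse=True)
    phonemes, i = [], 0
    while i < len(word):
        match = next((p for p in by_length if word.startswith(p, i)), None)
        if match is None:
            raise ValueError(f"no phoneme matches {word[i:]!r}")
        phonemes.append(match)
        i += len(match)
    return phonemes


sample = "Vowels: a, aa, i\nConsonants: k, t, sh\nkaati\nshika\n"
vowels, consonants, words = parse_dictionary(sample)
print([to_phonemes(w, vowels + consonants) for w in words])
```

Longest match is only one possible way to resolve multi-character phonemes; a word such as "kaati" could also be read as k-a-a-t-i if the language distinguishes two short vowels from one long one.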