Word generator (alpha v0.2)

Abstraction Factor

Every generated word is produced according to the probability score of each prefix continuation. To determine this score, the generator considers not only concrete prefixes found in the source dictionary but also abstracted ones, in which one or more phonemes are replaced with abstract symbols such as "any consonant", "any vowel" and "frequent vowel". Abstracted patterns yield lower scores than concrete ones, so that concrete prefixes are preferred. The Abstraction Factor is the multiplier applied to the score when one phoneme position is turned into an abstract symbol. The higher the Abstraction Factor, the richer the variety of words obtained, but their phonotactics may become less similar to those of the original dictionary.

Subsequent phoneme multiplier

Each time the number of abstracted phonemes in the probability calculation increases, this additional multiplier is applied to the Abstraction Factor.
Example: let the Abstraction factor be 0.5 and the Subsequent phoneme multiplier be 0.7.
Then the 4-gram {'a','b','c','d'} in a generated word is considered as probable as {'a','b','c','d'} in the source dictionary, plus 0.5 times as probable as {'a',<any consonant>,'c','d'} and {'a',<frequent vowel>,'c','d'}, plus (0.5*0.7=0.35) times as probable as {<frequent vowel>,<any consonant>,'c','d'}.
See also Confidence dependent multiplier
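
The arithmetic of the example can be sketched in a few lines of Python. This is only one reading that agrees with the numbers above (1, 0.5 and 0.35 for zero, one and two abstracted positions); the function name and the behaviour for three or more abstracted positions are assumptions, not something this page specifies.

```python
def abstraction_weight(abstracted_count,
                       abstraction_factor=0.5,
                       subsequent_multiplier=0.7):
    """Weight of a pattern with `abstracted_count` abstracted positions.

    Assumed reading: the first abstraction costs the Abstraction Factor,
    and every further one costs the Subsequent phoneme multiplier.
    """
    if abstracted_count == 0:
        return 1.0  # concrete n-gram, no penalty
    return abstraction_factor * subsequent_multiplier ** (abstracted_count - 1)


for k in range(3):
    print(k, round(abstraction_weight(k), 4))
```

With the default arguments the loop prints the weights 1.0, 0.5 and 0.35, matching the example.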

Confidence dependent multiplier

When we apply an abstracted N-gram from the source dictionary to generate words, we first evaluate the confidence of the abstraction. If, say, we have seen multiple different consonants occupying the same position in an N-gram, we treat their abstraction to the Consonant class as confident. Otherwise, if we have seen only a few consonants at this position, we treat the abstraction as low-confidence (in other words, we don't expect an arbitrary consonant to be suitable at that position). You may select an additional multiplier, dependent on the confidence level, to be applied on top of the Abstraction factor. You may, say, favor more confident abstractions in generated words (the default setting) or, if you prefer, encourage the occurrence of unusual words by increasing the multiplier for lower confidence levels.

(Quasi) Syllable limit

We use a simplified syllable count for this limit: any cluster of one or more consecutive vowels counts as one syllable. For example, "Austria" is a two quasi-syllable word.
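
A minimal Python sketch of this counting rule follows. The vowel set "aeiou" and the function name are assumptions made for illustration; the generator itself takes its vowels from the source dictionary, and the sketch works on single letters rather than multi-character phonemes.

```python
import re

VOWELS = "aeiou"  # assumption: plain Latin vowels, one letter each


def quasi_syllables(word, vowels=VOWELS):
    """Count clusters of one or more consecutive vowels."""
    pattern = "[" + re.escape(vowels) + "]+"
    return len(re.findall(pattern, word.lower()))


print(quasi_syllables("Austria"))  # clusters "Au" and "ia"
```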


N-Gram size

We generate words by reproducing the probability distribution of phoneme sequences found in the original dictionary. The N-Gram size option defines the maximum length of the phoneme sequences considered (called N-grams, where N is the length). If N is too small, we may miss important phonotactic patterns; if N is too large, generated words differ little from the original ones and computation becomes slow. A value of 4-5 is usually the most reasonable.
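
To illustrate the idea, the sketch below counts all N-grams up to a maximum length and derives the distribution of the next phoneme after a given prefix. It is a simplified illustration, not the generator's actual code: it treats every character as one phoneme, ignores abstraction, and the boundary markers "^" and "$" as well as both function names are assumptions.

```python
from collections import Counter


def count_ngrams(words, max_n=4):
    """Count every n-gram of length 1..max_n, with word boundary markers."""
    counts = Counter()
    for word in words:
        phonemes = ["^"] + list(word) + ["$"]  # assumed boundary symbols
        for n in range(1, max_n + 1):
            for i in range(len(phonemes) - n + 1):
                counts[tuple(phonemes[i:i + n])] += 1
    return counts


def continuation_probabilities(counts, prefix):
    """Relative frequency of each phoneme that follows `prefix`."""
    prefix = tuple(prefix)
    followers = {
        ngram[-1]: c
        for ngram, c in counts.items()
        if len(ngram) == len(prefix) + 1 and ngram[:-1] == prefix
    }
    total = sum(followers.values())
    return {p: c / total for p, c in followers.items()} if total else {}


counts = count_ngrams(["banana", "bandana"], max_n=3)
print(continuation_probabilities(counts, ("a", "n")))
```

In this toy dictionary "an" is followed by "a" three times and by "d" once, so the sketch reports 0.75 and 0.25.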

Dictionary Words

You may allow generated words that exactly match words found in the original dictionary. By default, such words are filtered out.
If both the "Allow" and "Mark(#)" options are checked, such words are marked with a trailing hash character.
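
The behaviour of the two options can be sketched as follows. The function and parameter names are hypothetical and only mirror the "Allow" and "Mark(#)" options described above.

```python
def apply_dictionary_filter(generated, dictionary_words, allow=False, mark=False):
    """Drop, keep, or mark generated words that already exist in the dictionary."""
    known = set(dictionary_words)
    result = []
    for word in generated:
        if word in known:
            if not allow:
                continue  # default: dictionary words are filtered out
            if mark:
                word += "#"  # trailing hash marks a dictionary word
        result.append(word)
    return result


generated = ["kata", "melu", "tano"]
dictionary = ["kata", "shaate"]
print(apply_dictionary_filter(generated, dictionary))
print(apply_dictionary_filter(generated, dictionary, allow=True, mark=True))
```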

Probability Score

The Probability Score is an internal metric of how likely a word is to appear. (In the current implementation it is not necessarily equal to the probability itself.) You may use the following controls to work with the Score:
  • Pick top - run a search that hunts for the top-scored words rather than making random 'shots'. Note that the resulting word collection tends to show little variety in length and in overall flavor.
  • Sort by - sort the output by word Score
  • Display - display the Score next to every generated word, like word (0.0001)
  • Min - filter out generated words whose Score is less than the value specified in this field
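
The Min, Sort by and Display controls amount to simple post-processing of (word, score) pairs, sketched below. The function name, the descending sort order and the four-decimal format are assumptions; the Pick top search is not modeled here.

```python
def present(scored_words, min_score=None, sort_by_score=False, display_score=False):
    """Apply the Min, Sort by and Display options to (word, score) pairs."""
    items = [
        (word, score)
        for word, score in scored_words
        if min_score is None or score >= min_score  # Min: drop lower scores
    ]
    if sort_by_score:
        # Assumed order: highest score first.
        items.sort(key=lambda pair: pair[1], reverse=True)
    return [
        f"{word} ({score:.4f})" if display_score else word
        for word, score in items
    ]


scored = [("tamo", 0.0001), ("kelu", 0.002), ("zuq", 0.00001)]
print(present(scored, min_score=0.0001, sort_by_score=True, display_score=True))
```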

Source Dictionary

The dictionary used to learn the phonotactics of the language in which words are to be generated.
This should be a plain text file (in ASCII or UTF-8 encoding) that contains:
  • A Vowels line containing a comma-separated enumeration of all available vowel phonemes. It's OK for a phoneme to be represented by a combination of characters.
    Example: Vowels: a, aa, e, ee, i, ii, o, oo, u, uu
  • A similar line for Consonants
  • A plain list of words, one per line
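
A file in this format could be read roughly as shown below. This is a sketch under assumptions: the "Consonants:" prefix is inferred from the Vowels example, the sample words are made up, and the greedy longest-match splitting of words into multi-character phonemes is one plausible choice that this page does not confirm.

```python
def parse_dictionary(text):
    """Split dictionary text into vowels, consonants and words."""
    vowels, consonants, words = [], [], []
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        head, sep, tail = line.partition(":")
        key = head.strip().lower()
        if sep and key in ("vowels", "consonants"):
            phonemes = [p.strip() for p in tail.split(",") if p.strip()]
            if key == "vowels":
                vowels = phonemes
            else:
                consonants = phonemes
        else:
            words.append(line)
    return vowels, consonants, words


def tokenize(word, phonemes):
    """Split a word into phonemes, preferring the longest match (assumed rule)."""
    ordered = sorted(phonemes, key=len, reverse=True)
    result, i = [], 0
    while i < len(word):
        for p in ordered:
            if word.startswith(p, i):
                result.append(p)
                i += len(p)
                break
        else:
            raise ValueError(f"no phoneme matches {word!r} at position {i}")
    return result


sample = """Vowels: a, aa, e
Consonants: k, t, sh
kata
shaate
"""
vowels, consonants, words = parse_dictionary(sample)
print(vowels, consonants, words)
print(tokenize("shaate", vowels + consonants))
```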