Modeling Word Forms Using Latent Underlying Morphs and Phonology

This website provides datasets and detailed results for Cotterell et al. (2015), "Modeling Word Forms Using Latent Underlying Morphs and Phonology," to appear in Transactions of the Association for Computational Linguistics. [PDF]

Primary Data

Each line's first field contains the surface phonological form of a word. This form is a string of phonemes (segments). Each phoneme is represented by a single ASCII character, using the transcription system in the section below.

The line's remaining fields specify the sequence of abstract morphemes in that word. Each abstract morpheme is consistently referenced by a distinct unanalyzed string, which may be chosen to hint at that morpheme's meaning or underlying pronunciation, or may simply be an integer.
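A minimal sketch of reading one line in this format. It assumes the fields are separated by whitespace, which this page does not state; the `Entry` type, the `parse_line` helper, and the sample lines are hypothetical. The optional trailing count applies only to the CELEX-derived files.

```python
from typing import NamedTuple, Optional, Tuple


class Entry(NamedTuple):
    surface: str                  # surface form, one ASCII character per phoneme
    morphemes: Tuple[str, ...]    # abstract morpheme identifiers, in order
    count: Optional[int] = None   # token count (CELEX-derived files only)


def parse_line(line: str, has_count: bool = False) -> Entry:
    """Split one data line into surface form, morphemes, and optional count.

    Assumption: fields are whitespace-separated. Adjust the split if the
    files turn out to use a different delimiter.
    """
    fields = line.split()
    count = int(fields.pop()) if has_count and fields else None
    if len(fields) < 2:
        raise ValueError(f"expected a surface form and at least one morpheme: {line!r}")
    return Entry(surface=fields[0], morphemes=tuple(fields[1:]), count=count)


# Hypothetical sample lines, for illustration only.
print(parse_line("dogz DOG PL"))
print(parse_line("rade RAD DAT 12", has_count=True))
```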

In the datasets derived from the CELEX database, each line ends with an additional field giving the token count of the word. In the special case of German, the token counts in the CELEX database are unfortunately only provided for surface words, some of which are ambiguous; we imputed the disambiguated counts that are provided in the German files below by using an EM procedure to fit a simple unigram model of morpheme sequences, allowing us to predict the posterior probabilities of competing analyses. The imputation method is described in detail in Dreyer (2011) Appendix D.
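The imputation idea can be sketched as follows. This is an illustration under stated assumptions, not the procedure of Dreyer (2011): it scores each analysis as a product of unigram morpheme probabilities, and the `impute_counts` function, the smoothing constant, and the toy counts are all hypothetical.

```python
from collections import defaultdict
from math import prod


def impute_counts(observed, iterations=50, smooth=1e-6):
    """Split each surface word's token count among its competing analyses.

    observed: dict mapping surface form -> (count, list of analyses),
              where each analysis is a tuple of morpheme identifiers.
    Returns a dict mapping (surface, analysis) -> expected count.
    """
    morphemes = {m for _, analyses in observed.values() for a in analyses for m in a}
    # Start from a uniform unigram distribution over morphemes.
    prob = {m: 1.0 / len(morphemes) for m in morphemes}
    expected = {}
    for _ in range(iterations):
        # E-step: posterior over each word's analyses, scaled by its count.
        morph_counts = defaultdict(float)
        for surface, (count, analyses) in observed.items():
            scores = [prod(prob[m] for m in a) for a in analyses]
            total = sum(scores)
            for a, s in zip(analyses, scores):
                expected[(surface, a)] = count * s / total
                for m in a:
                    morph_counts[m] += expected[(surface, a)]
        # M-step: re-estimate (lightly smoothed) unigram probabilities.
        norm = sum(morph_counts.values()) + smooth * len(morphemes)
        prob = {m: (morph_counts[m] + smooth) / norm for m in morphemes}
    return expected


# Toy data with made-up counts: "gehen" is ambiguous, the others are not.
toy = {
    "gehen": (30, [("GEH", "INF"), ("GEH", "3PL")]),
    "sein": (60, [("SEI", "INF")]),
    "sind": (20, [("SEI", "3PL")]),
}
for key, value in sorted(impute_counts(toy).items()):
    print(key, round(value, 2))
```

In this toy run, the unambiguous words make INF more probable than 3PL, so most of the ambiguous word's count is assigned to its INF analysis.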

As the paper notes, we used only the first two datasets to develop our method, code, features, and hyperparameters. Thus, our method was not tailored to the 7 evaluation datasets, which include 5 additional languages.

Phonological Features

These files indicate the pronunciation of each ASCII phoneme symbol by listing its phonological natural classes. Our parametric phonology model uses features that refer to these classes (see section 6 and Table 1 of the paper).

Overview of Datasets

Here we sketch the datasets. The citations below can be found in the bibliography of the paper. You can of course inspect the data files at the links above for a more detailed understanding.

Think before reading these materials: If you are planning to evaluate a new learning method on one of these datasets, the fairness of the evaluation may be compromised if the development of your method could have been affected by prior knowledge of the phenomena represented in the dataset.

Development Data: English

We created the English development dataset for this paper. It is the simplest dataset and contains 53 singular / plural noun pairs, for a total of 106 word forms. All plurals are regular, i.e., they can be fully predicted by simple rules from the singular form. This dataset uses American standard pronunciation (military: [/ˈmɪl.ɪ.tɛɹ.i/]), in contrast to the English (CELEX) evaluation set, which uses UK received pronunciation (military: [/ˈmɪl.ɪ.tɹɪ/]).

The learner typically gets 100% accuracy on the prediction task even when it observes a small subset. We recommend this dataset for initial experiments with new systems.

Our learner matched only 79% of the gold (hand-constructed) URs, however, because the EM algorithm found a different local optimum. This local optimum is linguistically less natural, a fact that our model captures by assigning it a lower likelihood, but it is nonetheless highly predictive of the observed SRs.


Note that our English development data is based on American standard pronunciation, whereas the CELEX English data is based on UK received pronunciation. This yields two distinct transcriptions of the word military: US [/ˈmɪl.ɪ.tɛɹ.i/], UK [/ˈmɪl.ɪ.tɹɪ/].

Development Data: German (CELEX)

This German development dataset covers exactly the same linguistic phenomena as the German (CELEX) evaluation dataset below, but the forms are disjoint. To be more precise, we extracted all members of the chosen paradigms (see German CELEX section below) and chose disjoint sets for the dev and training portions.


Maori

The Maori data was expanded from Blevins (1994) with additional words from a dictionary. It contains 69 word forms: the present active, gerund, and participial inflections (respectively indicated by b, g, p) for each of 23 verbs.

Maori forbids word-final consonants in the surface form. This is typically analyzed as a deletion process during the mapping from URs to SRs. Our learner is capable of recovering a deleted consonant, but of course, only when this consonant has been observed non-word-finally in another form. For instance, the UR /faof#/ surfaces as [fao], whereas the UR /faof#aNa/ surfaces as [faofaNa].
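The deletion process described above can be written as a deterministic rule. This is only an illustrative sketch: the learner in the paper uses a probabilistic phonology model, not a hand-written rule, and the vowel inventory and the `maori_surface` helper below are assumptions.

```python
VOWELS = set("aeiou")  # assumption: these ASCII symbols are the vowels


def maori_surface(ur: str) -> str:
    """Map a UR such as 'faof#aNa' to an SR by deleting a word-final consonant."""
    word = ur.replace("#", "")        # drop morpheme boundaries
    if word and word[-1] not in VOWELS:
        word = word[:-1]              # word-final consonants are forbidden
    return word


print(maori_surface("faof#"))     # fao
print(maori_surface("faof#aNa"))  # faofaNa
```

The sketch also shows why the consonant is unrecoverable from [fao] alone: a UR without the final /f/ would surface identically.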

Maori also prohibits certain sequences of three vowels, with the result that the underlying form /fiu/ surfaces as [fiua] rather than [fiuia].



Catalan

The Catalan dataset is adapted from Kenstowicz and Kisseberth (1979). It contains 72 word forms: the masculine singular, feminine singular, masculine plural and feminine plural inflections for each of 18 adjectives.

The primary phenomena include final consonant devoicing and consonant deletion to break up illicit clusters. Our learner successfully learns all phenomena when presented with enough evidence.



Tangale

The Tangale dataset is adapted from an exercise in Kenstowicz and Kisseberth (1979). It consists of 54 nouns with various nominal suffixes.

This dataset exhibits various phonological processes, including retrograde voicing assimilation that exists in an opaque relationship with vowel deletion. For example, /tugat#no/ surfaces as [tugatno]. All of these processes make this dataset the most complicated we experimented on. For this reason, our UR recovery rate is lower than on the other languages.



Indonesian

The Indonesian dataset is an expanded version of the nasal assimilation examples presented in Kager (2000) with additional examples taken from a dictionary. It contains a mix of nouns and verbs for 44 word forms in total. This is our only dataset where words have more than two morphemes.

Indonesian uses the prefix meN as a verbalizing prefix and peN as a nominalizing prefix. The only phonology in the dataset is the place assimilation of the final nasal N in these prefixes. E.g., we get peNgunaan, but pendaftaran. As the results show, our learner is successful on this dataset.
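Nasal place assimilation of this kind can be sketched as a lookup on the first segment of the stem. The place table and the `attach_prefix` helper are assumptions made for illustration. In particular, the sketch assumes that "N" also serves as the surface velar nasal, as the form peNgunaan suggests.

```python
# Assumed mapping from a stem-initial stop to the homorganic nasal.
HOMORGANIC_NASAL = {"p": "m", "b": "m", "t": "n", "d": "n", "k": "N", "g": "N"}


def attach_prefix(prefix: str, stem: str) -> str:
    """Attach a prefix ending in N, assimilating N to the stem's first segment."""
    if prefix.endswith("N") and stem:
        nasal = HOMORGANIC_NASAL.get(stem[0], "N")  # elsewhere, leave N as is
        return prefix[:-1] + nasal + stem
    return prefix + stem


print(attach_prefix("peN", "gunaan"))     # peNgunaan
print(attach_prefix("peN", "daftaran"))   # pendaftaran
```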


German (CELEX)

Our CELEX datasets are subsets of the CELEX database (Baayen et al., 1995). We selected various paradigms to focus on (listed below) and randomly selected from those to form our training sets.

The German dataset contains 1000 word forms: 500 verbs and 500 nouns. The nouns consist of 165 nominative forms, 167 dative forms and 168 genitive forms. The verbs consist of 167 first person singular forms, 166 second person plural forms, and 167 third person plural forms.

German phonology devoices word-final obstruents. Contrast the German surface forms [rat] (advice) and [rat] (wheel), both in the nominative case. To predict that the dative of advice is [rate] and the dative of wheel is [rade], we need to infer that the URs are /rat/ and /rad/ respectively. Even if we correctly learn the phonology, our method will infer the correct UR only when given enough evidence from the surface forms. In our experimental paradigm, we often withhold evidence (poke holes in the paradigm). In such a case, we may observe only the nominative, and our method may not posit the UR needed for generalization at test time. Of course, it will assign high probability to the correct UR, as d → t occurs word-finally with high probability. However, simply copying t → t also occurs with high probability. There are also a few irregulars, which the learner does not successfully learn.
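The ambiguity created by final devoicing can be made concrete with a small sketch. The obstruent pairs and both helper functions are assumptions for illustration; the paper's phonology is probabilistic rather than a hard rule.

```python
# Assumed voiced -> voiceless obstruent pairs in the ASCII transcription.
DEVOICE = {"b": "p", "d": "t", "g": "k", "v": "f", "z": "s"}
VOICE = {voiceless: voiced for voiced, voiceless in DEVOICE.items()}


def german_surface(ur: str) -> str:
    """Drop morpheme boundaries and devoice a word-final obstruent."""
    word = ur.replace("#", "")
    if word and word[-1] in DEVOICE:
        word = word[:-1] + DEVOICE[word[-1]]
    return word


def candidate_stems(sr: str) -> set:
    """Stem URs whose unsuffixed form would surface as sr."""
    candidates = {sr}
    if sr and sr[-1] in VOICE:
        candidates.add(sr[:-1] + VOICE[sr[-1]])
    return candidates


print(german_surface("rad#"), german_surface("rad#e"))  # rat rade
print(german_surface("rat#"), german_surface("rat#e"))  # rat rate
print(sorted(candidate_stems("rat")))                   # ['rad', 'rat']
```

With only the nominative [rat] observed, both candidate stems fit the data, so the dative cannot be predicted with confidence.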


Dutch (CELEX)

Our CELEX datasets are subsets of the CELEX database (Baayen et al., 1995).

The Dutch dataset contains 1000 word forms. 500 of these are the singular and plural inflections of 250 nouns. The other 500 are the inflectional paradigms for 100 verbs: first person singular, third person plural, infinitive, gerund, past participle. There are few irregulars and the learner typically tries to coerce them into a regular paradigm.


English (CELEX)

Our CELEX datasets are subsets of the CELEX database (Baayen et al., 1995).

The English dataset contains 1000 word forms. 500 of these are the singular and plural inflections of 250 nouns. The other 500 are the inflectional paradigms for 125 verbs: past, third person singular, gerund and infinitive. The inflected forms are mostly regular. This dataset uses UK received pronunciation (military: [/ˈmɪl.ɪ.tɹɪ/]), in contrast to the English development dataset, which uses American standard pronunciation (military: [/ˈmɪl.ɪ.tɛɹ.i/]).

Our learner succeeds in learning regular phonology on this dataset when shown enough evidence. We fail, however, to learn irregular patterns such as ablaut (swim/swam/swum, ring/rang/rung, etc.). These vowel changes are rare, and the features of our (current) phonology model are not rich enough to confidently predict that they will occur in a given form. Presumably the model does learn that certain vowel changes become more likely before a nasal consonant, and that deletion of the regular suffix is more likely after a nasal consonant—but not so likely as to emerge in the model's 1-best prediction.

An important instance where our learner struggles is /r/-deletion, found in this non-rhotic dialect of English. Underlying syllable-final /r/ is deleted unless it appears between two vowels. Consider the verb tour. We get the surface forms [tO] (tour), [tOz] (tours), [tOd] (toured) and [tOriN] (touring). These CELEX transcriptions do not mark the compensatory lengthening when /r/ is deleted in the first three forms. (Compare the IPA transcriptions [tɔː] (tour), [tɔːz] (tours), [tɔːd] (toured) and [tɔɹiŋ] (touring).) This makes it essentially impossible to guess that the UR is /tOr/ unless [tOriN] happens to be one of the observed surface forms (just as in the Maori case of final consonant deletion). This class of mistake accounts for a majority of our mistakes on both UR recovery (Table 2) and 1-best SR prediction (Figure 4).
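The tour example can be reproduced with a simplified rule. As an approximation of the condition stated above, the sketch keeps /r/ only when a vowel immediately follows it. The vowel set and the `english_surface` helper are assumptions, not part of the CELEX transcription documentation.

```python
VOWELS = set("aeiouIO@&")  # assumption: a partial vowel inventory for this sketch


def english_surface(ur: str) -> str:
    """Delete every /r/ that is not immediately followed by a vowel."""
    word = ur.replace("#", "")
    kept = []
    for i, segment in enumerate(word):
        before_vowel = i + 1 < len(word) and word[i + 1] in VOWELS
        if segment == "r" and not before_vowel:
            continue  # /r/ is deleted here
        kept.append(segment)
    return "".join(kept)


for ur in ("tOr#", "tOr#z", "tOr#d", "tOr#iN"):
    print(ur, "->", english_surface(ur))
```

A stem UR without /r/ would yield the same first three surface forms, which is why the fourth form is needed to recover the /r/.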

Gold-Standard URs for Stems

These reference URs were constructed manually (with the help of scripts). Although URs are unobservable latent variables, these files represent the first author's reconstruction of URs for these datasets (following the textbook analyses in the cases of the datasets drawn from textbooks). These reconstructions should be uncontroversial in all cases. Table 2 shows how often our learner made the same reconstructions.

CELEX Data Splits

As explained in section 7.2 of the paper, for each dataset and each N in {200, 400, 600, 800}, we ran 10 experiments with different training sets of size N, sampled in a particular way. We partitioned each training set of size N into two files, "train" and "dev", in order to tune hyperparameters as explained in the caption of Figure 4. We then trained on the full training set (the union of "train" and "dev") and evaluated on the remaining forms ("test"). Note that our evaluation actually considered only a subset of the "test" file, as explained in section 7.1 and the caption of Figure 4.
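The layout of one experiment can be sketched as below. Uniform random sampling stands in for the sampling scheme of section 7.2, which is not reproduced here, and the `dev_fraction` value, the seeds, and the `make_split` helper are hypothetical.

```python
import random


def make_split(forms, n, dev_fraction=0.2, seed=0):
    """Return (train, dev, test) for one experiment with training size n.

    Assumptions: uniform sampling without replacement, and a hypothetical
    dev_fraction. The paper's actual sampling procedure differs.
    """
    rng = random.Random(seed)
    shuffled = list(forms)
    rng.shuffle(shuffled)
    training, test = shuffled[:n], shuffled[n:]
    n_dev = int(n * dev_fraction)
    dev, train = training[:n_dev], training[n_dev:]
    return train, dev, test


forms = [f"form{i}" for i in range(1000)]  # placeholder for 1000 word forms
for n in (200, 400, 600, 800):
    for run in range(10):
        train, dev, test = make_split(forms, n, seed=run)
        # Tune hyperparameters on (train, dev), retrain on train + dev,
        # then evaluate on test.
```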

Output From Our Method

The file format is a TSV with the following column labels: True Surface Form · Predicted Surface Form · -log(p(true form)) · expected edit distance.
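A minimal sketch for summarizing such a file. It assumes that the first row holds the column labels and that columns appear in the order listed above. The `summarize` helper and the sample rows, including their numbers, are made up for illustration.

```python
import csv
import io


def summarize(tsv_text: str, has_header: bool = True) -> dict:
    """Compute 1-best accuracy and column means from output in this TSV format."""
    rows = [r for r in csv.reader(io.StringIO(tsv_text), delimiter="\t") if r]
    if has_header:
        rows = rows[1:]
    if not rows:
        raise ValueError("no data rows found")
    n = len(rows)
    return {
        "accuracy": sum(r[0] == r[1] for r in rows) / n,
        "mean_neg_log_prob": sum(float(r[2]) for r in rows) / n,
        "mean_expected_edit_distance": sum(float(r[3]) for r in rows) / n,
    }


# Made-up sample content, not taken from the released files.
sample = (
    "True Surface Form\tPredicted Surface Form\t-log(p(true form))\texpected edit distance\n"
    "dogz\tdogz\t0.1\t0.05\n"
    "tOriN\ttOiN\t2.3\t0.9\n"
)
print(summarize(sample))
```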

Local Optima

On our English development data, the hypothesized URs have a surprisingly low agreement with gold (manually constructed) URs, as mentioned above. The difference is that where the gold analysis uses a contextual [I]-insertion rule in noun pluralization, our method instead learns an [I]-deletion rule:

         Preferred Model  Alternative Model
Word     UR       SR      UR       SR      Remarks
dog      dog#     dog     dog#     dog
dogs     dog#z    dogz    dog#z    dogz
cat      cat#     cat     cat#     cat
cats     cat#z    cats    cat#z    cats    voicing assimilation of the plural suffix
fox      foks#    foks    foksI#   foks
foxes    foks#z   foksIz  foksI#z  foksIz  epenthetic [I] breaks up homorganic cluster /sz/
buddy    budI#    budI    budI#    bud     alternative model predicts incorrect SR
buddies  budI#z   budIz   budI#z   budIz

The difference can be seen in the rows for fox/foxes. The standard analysis says that an epenthetic [I] is inserted into foxes by a regular process. Our learner's alternative analysis says that this [I] was underlyingly present in the stem, but is deleted from fox by a regular process that deletes all word-final [I].

Both analyses achieve perfect prediction of SRs in our dataset. The learner would be able to falsify the alternative analysis only if it observed an SR such as [budI] with word-final [I], as shown at the end of the table. It cannot do this because such forms are essentially nonexistent in American English and do not appear in this dataset; e.g., the actual pronunciation of buddy is /budi/, with a different vowel. (By contrast, the UK English CELEX dataset does have a few words that end in [I], such as [&ktSw@rI] (actuary), which may explain why on that dataset, our learner found the standard analysis on all 10 splits.)
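The two analyses can be compared directly by writing each as a small deterministic rule set and applying it to the URs from the table. This is a sketch for illustration: the segment classes and the `standard` and `alternative` functions are assumptions, and the learned model is probabilistic rather than rule-based.

```python
VOICELESS = set("ptkfs")   # assumed voiceless obstruents
SIBILANTS = set("sz")      # assumed sibilants


def standard(ur: str) -> str:
    """Standard analysis: [I]-insertion after sibilants, plus voicing assimilation."""
    stem, suffix = ur.split("#")
    if suffix == "z":
        if stem[-1] in SIBILANTS:
            suffix = "Iz"          # epenthesis breaks up /sz/
        elif stem[-1] in VOICELESS:
            suffix = "s"           # the suffix assimilates in voicing
    return stem + suffix


def alternative(ur: str) -> str:
    """Alternative analysis: voicing assimilation, then word-final [I]-deletion."""
    stem, suffix = ur.split("#")
    if suffix == "z" and stem[-1] in VOICELESS:
        suffix = "s"
    word = stem + suffix
    return word[:-1] if word.endswith("I") else word


# (standard UR, alternative UR) pairs copied from the table above.
FORMS = {
    "dog": ("dog#", "dog#"), "dogs": ("dog#z", "dog#z"),
    "cat": ("cat#", "cat#"), "cats": ("cat#z", "cat#z"),
    "fox": ("foks#", "foksI#"), "foxes": ("foks#z", "foksI#z"),
    "buddy": ("budI#", "budI#"), "buddies": ("budI#z", "budI#z"),
}
for word, (std_ur, alt_ur) in FORMS.items():
    print(word, standard(std_ur), alternative(alt_ur))
```

Only the buddy row separates the two rule sets, matching the remark in the table.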

Although both analyses capture the surface generalization equally well, linguists prefer the standard analysis because it is more economical. After all, the alternative analysis must posit an extra underlying /I/ in any UR that ends with /s/ or /z/. Our model can see this disadvantage: the extra material is a priori unlikely under our 0-gram distribution Mφ, which prefers shorter URs.
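The length penalty can be illustrated with a toy 0-gram scorer: every symbol is drawn uniformly from the alphabet, and the string stops with a fixed probability after each symbol. The parametrization, the alphabet size, the stopping probability, and the `log_prob_ur` helper are all assumptions for illustration and need not match Mφ as defined in the paper.

```python
from math import log


def log_prob_ur(ur: str, alphabet_size: int = 40, p_stop: float = 0.2) -> float:
    """Log-probability of a UR under a toy 0-gram model.

    Each symbol costs log((1 - p_stop) / alphabet_size), and ending the
    string costs log(p_stop). Both parameter values are hypothetical.
    """
    per_symbol = log(1.0 - p_stop) - log(alphabet_size)
    return len(ur) * per_symbol + log(p_stop)


print(log_prob_ur("foks"))    # stem UR under the standard analysis
print(log_prob_ur("foksI"))   # stem UR under the alternative analysis
```

Every extra underlying symbol lowers the score by the same fixed amount, so a UR with an added /I/ is always less probable than the same UR without it.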

Thus, our learner should prefer the standard analysis just as linguists do. The reason that it finds the alternative analysis is that our objective function is non-convex: the EM algorithm is a local search algorithm that can get stuck in local optima. We confirmed that by manipulating the initialization of EM, we could make the learner find either analysis.