Vocabulary Selice Romani

by Viktor Elšík  

The vocabulary contains 1619 meaning-word pairs ("entries") corresponding to core LWT meanings from the recipient language Selice Romani. The corresponding text chapter was published in the book Loanwords in the World's Languages. The language page Selice Romani contains a list of all loanwords arranged by donor languoid.

Word form LWT code Meaning Core list Borrowed status Source words

Field descriptions


Alternative forms of a lexeme are separated by a comma: e.g. felhó, felhóva ‘cloud’. Optional parts of the lexeme’s form are bracketed, e.g. daj (taj) dad ‘parents’.
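
The two conventions above can be expanded mechanically. The following is a minimal Python sketch and not part of the database itself: the function name expand_word_form is hypothetical, and the code assumes that brackets are never nested and that a comma never occurs inside a single form.

```python
import re
from itertools import product


def expand_word_form(field: str) -> list[str]:
    """Expand a word form field into every concrete variant it encodes."""
    variants = []
    # Alternative forms of a lexeme are separated by commas.
    for alternative in field.split(","):
        # re.split with a capturing group keeps the bracketed text:
        # even indices are fixed text, odd indices are optional parts.
        pieces = re.split(r"\(([^()]*)\)", alternative.strip())
        choices = [
            (piece,) if index % 2 == 0 else (piece, "")
            for index, piece in enumerate(pieces)
        ]
        for combination in product(*choices):
            # Collapse the double space left when an optional word is omitted.
            variants.append(" ".join("".join(combination).split()))
    return variants


print(expand_word_form("felhó, felhóva"))
print(expand_word_form("daj (taj) dad"))
```

Each optional part doubles the number of variants for its alternative, so a form with one bracketed part yields a long and a short variant.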

Free meaning

The "meaning" of the Selice Romani lexeme is mostly not entered if it corresponds precisely to the pre-defined LWT meaning of the Meanings table. With loanwords, the field is also used to highlight meaning differences from the source form. The field is thus sometimes filled in even if there is precise correspondence between the Selice Romani meaning and the LWT meaning: for example, the noun tollo ‘pen’ is a loanword from Hungarian toll ‘feather; pen’, and so this field contains “pen (not *feather)”.

Grammatical info

With inflecting words, the field contains information on inflectional irregularities (oblique stems of nouns, comparatives of adjectives, perfective stems of verbs etc.). With nouns, the field almost always indicates their gender. With verbs, the field sometimes indicates their transitivity. With function words and adverbs, the field indicates the class of the word: pronoun (personal or reflexive), demonstrative, pro-word (interrogative or indefinite), preposition, adverb, co-verb (preverb, adverbial verb modifier), numeral and quantifier, particle etc. Occasionally, those functions of function words that are not sampled in the database are also mentioned. Especially with phrasal forms but also elsewhere, the field may contain information on syntactic construction. The field was also used to highlight the very rare mismatches between the pre-defined semantic category of the Meanings table and the grammatical word-class of the Romani word form: e.g. there is no adjective ‘stinking’ in Selice Romani, and so the verb khanden ‘to stink’ was used as an equivalent.

Comment on word form

This field contains various kinds of comments on the Selice Romani lexeme, its form and meaning, including:
– Rare form alternatives;
– Information on the syntactic construction and obligatory (or extremely frequent) collocations of the lexeme;
– Comments on “submeanings” of the lexeme and, rarely, synonyms.


Analyzability

The category ANALYZABLE PHRASAL is used for lexemes consisting of two or more words. The categories ANALYZABLE COMPOUND and ANALYZABLE DERIVED are used for compound and derived lexemes, respectively, whose morphological structure is fully transparent synchronically. The categories SEMI-ANALYZABLE and UNANALYZABLE require more detailed comments. The former category is assigned to several types of morphologically complex lexemes:

1. To synchronically non-transparent compounds: for example, the lexeme per-nang-o [ROOT/PREFIX-naked-INFL] ‘barefoot’ is, diachronically, a compound of the noun pr-o ‘foot’ and the adjective nang-o ‘naked’, but its first morpheme per- is not regularly related to the nominal root.
2. To lexemes that contain a synchronically identifiable derivational marker but which, at the same time, have no synchronic base lexeme: for example, the noun lub-ip-e [ROOT-ABSTRACT-INFL] ‘adultery’ contains the productive de-adjectival marker of abstract nominalizations but there is no base lexeme *lub(-o).
3. To lexemes that are derived from an existing lexeme by means of an obsolete or otherwise idiosyncratic derivational marker: for example, the noun šer-and [head-SUFFIX] ‘space under one’s head’ is derived from the noun šer-o ‘head’ by means of the unique marker -and.
4. To pronouns, pro-words and other function words (e.g. modals) which involve idiosyncratic ‘derivational’ morphology. Although spatial adverbs and prepositions show somewhat more regular morphological relations, they too are assigned to this category.
5. To loanwords that are morphologically analyzable in the source language and whose source base has also been borrowed, unless the imported derivational marker has become productive in Selice Romani (e.g. šetít( )šíg-o ‘darkness’ and šetít-n-o ‘dark’, from Hungarian sötít-síg and its base sötít).

The category UNANALYZABLE is assigned to the following types of lexemes:

1. To lexemes with monomorphemic inflectional stems. Importantly, pre-inflectional adaptation markers of loanwords are excluded from consideration: for example, although the inflectional stem of the verb čukl-in-en [hiccough-LOAN-INFL] ‘to hiccough’, a loanword from Hungarian csukl-ik, is bimorphemic, the stem before the adaptation suffix -in is monomorphemic.
2. To loanwords that are morphologically analyzable in the source language but whose source base has not been borrowed (e.g. zéčíg-o ‘vegetables’ < Hungarian ződsíg ‘greenness; vegetables’, cf. *zéd-n-o < Hungarian ződ ‘green’).


This field is filled in for all analyzable lexemes and for semi-analyzable lexemes of types 1 through 4 (see comments on the analyzability field). The abbreviations used are those of the Leipzig Glossing Rules plus those listed under "Abbreviations".
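
Morpheme-by-morpheme glosses such as [ROOT-ABSTRACT-INFL] can be checked against the word form they describe. The sketch below is only an illustration: the function name align_gloss is hypothetical, and it assumes that the word form is segmented with hyphens at the same morpheme boundaries as the gloss (e.g. lub-ip-e), with exactly one label per morpheme.

```python
def align_gloss(form: str, gloss: str) -> list[tuple[str, str]]:
    """Pair each morpheme of a hyphen-segmented form with its gloss label."""
    morphemes = form.split("-")
    # Glosses are quoted in square brackets in the field descriptions.
    labels = gloss.strip("[]").split("-")
    if len(morphemes) != len(labels):
        raise ValueError(
            f"{form!r} has {len(morphemes)} morphemes, "
            f"but {gloss!r} has {len(labels)} labels"
        )
    return list(zip(morphemes, labels))


for morpheme, label in align_gloss("lub-ip-e", "[ROOT-ABSTRACT-INFL]"):
    print(morpheme, label)
```

A mismatch between the number of morphemes and the number of labels raises an error instead of silently truncating, which makes segmentation slips easy to spot.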


Age

Age refers to the (diachronic) syntactic or derivational structure of the item, but not to its phonological form or to its meaning. The age of analyzable lexemes (collocations, compounds, and derivations) reflects the time of creation of such complex expressions rather than the age of their parts. The relevant units of temporal continuity of univerbal lexemes are inflectional stems. For example, although the ‘canonical’ forms of the Selice Romani verb d-en ‘to give’ (citation form: third plural present indicative) do not directly continue the ‘canonical’ forms of the Old Indo-Aryan verb dā-da-ti ‘to give’ (citation form: third singular present indicative), but rather certain inflectional forms without the initial reduplication, there is continuity of the verbs’ inflectional stems between Old Indo-Aryan and Selice Romani, viz. da > d-, and so the Selice Romani verb d-en ‘to give’ may be considered to go back, via Old Indo-Aryan, to Proto-Indo-European *deh₃- ‘to give’. Two types of age categories are used:

• First, there are genealogical age categories (1–6). Some of these (1–3 and 5) represent nodes on the tree model of the genealogical affiliation of Selice Romani, and are assigned to lexemes that can be reconstructed for these stages of the language. I should note that some lexemes assigned to the OLD INDO-ARYAN category might be actually older, i.e. PROTO-INDO-IRANIAN or even PROTO-INDO-EUROPEAN, as I have only checked for pre-Indo-Aryan etymologies of Old Indo-Aryan etymons selectively. The category LATER THAN EARLY ROMANI includes lexemes that are dialect specific within Romani, i.e. not reconstructable for Early Romani, and that, in addition, are not loanwords, based on loanwords, or calqued from a post-Early Romani L2.

• Second, there are two subtypes of contact-related age categories: those that are defined with reference to the period of contact with an L2 or a cluster of L2s (7–13) and those that are defined with reference to the beginning of such a period only (14–16). The former are assigned not only to loanwords but also to calques from a given L2. The latter subtype of contact-related age categories is only relevant for past L2s; they are assigned to lexemes that are based on loanwords from a given L2 or that contain derivational markers from that L2. Certain arbitrary decisions had to be taken with regard to the assignment of lexemes to concrete L2s (e.g. a loanword that may originate in Slovak or Czech has been assigned to the age category SLOVAK; see also the book chapter).

No.  Age category               Start      End
1    Proto-Indo-European        4500 BCE   3000 BCE
2    Proto-Indo-Iranian         2500 BCE   2000 BCE
3    Old Indo-Aryan             1900 BCE   500 BCE
4    Middle Indo-Aryan          500 BCE    700 CE
5    Early Romani               900 CE     1300 CE
6    later than Early Romani    1300 CE    2007 CE
7    West Asian L2              700 CE     1000 CE
8    Greek L2                   900 CE     1300 CE
9    South Slavic L2            1300 CE    1750 CE
10   Hungarian L2               1650 CE    2007 CE
11   Vlax presence              1850 CE    2007 CE
12   Slovak L2                  1920 CE    2007 CE
13   Czech L2                   1950 CE    2007 CE
14   West Asian L2 or later     700 CE     2007 CE
15   Greek L2 or later          900 CE     2007 CE
16   South Slavic L2 or later   1300 CE    2007 CE

The dates of the age categories are generally very approximate (see the book chapter for discussion). Note that some age categories show temporal overlap or even coincidence (e.g. EARLY ROMANI and GREEK L2).
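
The overlap and coincidence of age categories can be read directly off the table. The following Python sketch transcribes the sixteen categories and lists the pairs whose date ranges coincide exactly. It is an illustration only: the names AgeCategory, overlaps and coinciding_pairs are hypothetical, BCE years are represented as negative integers by assumption, and the dates remain the very approximate ones given above.

```python
from typing import NamedTuple


class AgeCategory(NamedTuple):
    number: int
    label: str
    start: int  # assumption: BCE years are stored as negative integers
    end: int


AGE_CATEGORIES = [
    AgeCategory(1, "Proto-Indo-European", -4500, -3000),
    AgeCategory(2, "Proto-Indo-Iranian", -2500, -2000),
    AgeCategory(3, "Old Indo-Aryan", -1900, -500),
    AgeCategory(4, "Middle Indo-Aryan", -500, 700),
    AgeCategory(5, "Early Romani", 900, 1300),
    AgeCategory(6, "later than Early Romani", 1300, 2007),
    AgeCategory(7, "West Asian L2", 700, 1000),
    AgeCategory(8, "Greek L2", 900, 1300),
    AgeCategory(9, "South Slavic L2", 1300, 1750),
    AgeCategory(10, "Hungarian L2", 1650, 2007),
    AgeCategory(11, "Vlax presence", 1850, 2007),
    AgeCategory(12, "Slovak L2", 1920, 2007),
    AgeCategory(13, "Czech L2", 1950, 2007),
    AgeCategory(14, "West Asian L2 or later", 700, 2007),
    AgeCategory(15, "Greek L2 or later", 900, 2007),
    AgeCategory(16, "South Slavic L2 or later", 1300, 2007),
]


def overlaps(a: AgeCategory, b: AgeCategory) -> bool:
    """True if the two periods share more than a single boundary year."""
    return a.start < b.end and b.start < a.end


def coinciding_pairs(categories: list[AgeCategory]) -> list[tuple[str, str]]:
    """Return label pairs whose start and end dates are identical."""
    return [
        (a.label, b.label)
        for index, a in enumerate(categories)
        for b in categories[index + 1:]
        if (a.start, a.end) == (b.start, b.end)
    ]


for pair in coinciding_pairs(AGE_CATEGORIES):
    print(pair)
```

Periods that merely touch at a boundary year, such as OLD INDO-ARYAN and MIDDLE INDO-ARYAN at 500 BCE, are deliberately not counted as overlapping in this sketch.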


This field is filled in for very few records. There is no differentiation between colloquial and formal registers in Selice Romani. However, I have used this field to indicate the following register-like distinctions:
– CRYPTOLALIC marks secret words that are used only in situations where outsiders (to the community of the speakers of Selice Romani) are not supposed to understand.
– VULGAR marks vulgar words that have more polite equivalents, e.g. khapaven ‘to drink’ (cf. general pijen). It does not mark words that merely have vulgar uses: e.g. murdajon ‘to die’ is only vulgar when used of humans, while it is the regular term when used of animals. Vulgar uses are marked as “[vulgar:]” in the Meaning field.
– OBSOLETE marks words that are not actively used by middle-aged or younger speakers. Some of these words are only remembered, but not actively used, even by older speakers.


This field has not been filled in for items with no evidence for calquing. Further details on calquing are given in the field "Created on loan basis".

Borrowed base

For some words, the field contains information on the origin and structure of those Selice Romani words that are themselves not loanwords but have been created on the basis of loanwords (marked as loan basis): for example, the noun kiráckiňa ‘queen’ is not a loanword but its derivational base, the noun királi ‘king’, is; the particle kampe ‘need, should’ is not a loanword but has developed through grammaticalization of the verb kamen ‘to want etc.’, which is a loanword, and the accusative form of the indigenous reflexive pronoun pe; etc.
For some words, the field contains information on calqued or semi-calqued form or meaning structure of Selice Romani words (marked as calque): for example, the collocation den kéčen ‘to lend’ is a semi-calque of the Hungarian collocation kölcsön ad [loan.ADV give]; the second meaning of the verb roden ‘to look for; earn’ is due to polysemy calquing of the Hungarian verb keres ‘look for; earn’; etc.

Comment on borrowed

This field contains four kinds of information:
– Evaluation of published borrowing hypotheses on words classified as ‘probably borrowed’, ‘perhaps borrowed’, or ‘little evidence for borrowing’ in the Borrowed status field.
– Explanations of ‘unexpected’ differences, with regard to the source item, in the form and/or meaning of words that are classified as clear loanwords.
– Details on unusual morphological adaptation of clear loanwords.
– Details on borrowed derivational markers in non-loanwords.


221 etymologies are based on previous etymological works or remarks on Romani (Berger 1959; Boretzky & Igla 1994; Hancock 1995; Kostić 1994; Matras 2002; Mānušs et al. 1997; Tálos 1999; Turner 1926; Tzitzilis 2001; Vekerdi 2000) and Indo-Aryan (Beníšek 2006; Kuiper 1948; Lubotsky 2001; Mayrhofer 1996; Turner 1962–6; Witzel 1999). The field contains my surname (882 records) if the etymologies are my own, which is especially the case with all loanwords from Hungarian, Slovak and Czech; but also if I have made a selection among several previously suggested etymologies.

Beníšek, Michael. 2006. “Ke kořenům slova rom.” [On the roots of the word rom.] Romano džaniben, jevend, 9–28.

Berger, Hermann. 1959. “Die Burušaski-Lehnwörter in der Zigeunersprache.” Indo-Iranian Journal 3: 17–43.

Boretzky, Norbert & Igla, Birgit. 1994. Wörterbuch Romani–Deutsch–Englisch für den südosteuropäischen Raum: mit einer Grammatik der Dialektvarianten. Wiesbaden: Harrassowitz.

Hancock, Ian. 1995. “On the migration and affiliation of the Ḍōmba: Iranian words in Rom, Lom and Dom Gypsy.” In: Matras, Yaron (ed.) Romani in contact: the history and sociology of a language. Amsterdam: Benjamins. 25–51.

Kostić, Svetislav. 1994. “Romani čhib a jazykový kontakt” [Romani čhib and language contact] Romano džaniben 1: 42–54.

Kuiper, Franciscus B. J. 1948. Proto-Munda words in Sanskrit. Amsterdam: N. V. Noord-Hollandsche Uitgevers Maatschappij.

Lubotsky, Alexander M. 2001. “The Indo-Iranian substratum.” In: Carpelan, Chr., Parpola, A. & Koskikallio, P. (eds.) Early contacts between Uralic and Indo-European: linguistic and archaeological considerations. Papers presented at an international symposium held at the Tvärminne Research Station of the University of Helsinki 8-10 January 1999. (Mémoires de la Société Finno-ougrienne 242.) Helsinki 2001. 301–317.

Matras, Yaron. 2002. Romani: a linguistic introduction. Cambridge: Cambridge University Press.

Mānušs, Leksa, Neilands, Jānis & Rudevičs, Kārlis. 1997. Čigānu–latviešu–angļu etimoloģiskā vārdnīca un latviešu–čigānu vārdnīca. [Gypsy–Latvian–English etymological dictionary and Latvian–Gypsy dictionary.] Rīgā: Zvaigzne ABC.

Mayrhofer, Manfred. 1986–2001. Etymologisches Wörterbuch des Altindoarischen. 3 volumes. Heidelberg: Carl Winter.

Tálos, Endre. 1999. “Etymologica Zingarica.” Acta Linguistica Hungarica 46: 215–268.

Turner, Ralph L. 1926. “The position of Romani in Indo-Aryan.” Journal of the Gypsy Lore Society, Third series 5: 145–189.

Turner, Ralph L. 1962–1966. A comparative dictionary of the Indo-Aryan languages. Oxford: Oxford University Press.

Tzitzilis, Christos. 2001. “Mittelgriechische Lehnwörter im Romanes.” In: Igla, Birgit & Stolz, Thomas (eds.) “Was ich noch sagen wollte...” A Multilingual Festschrift for Norbert Boretzky on the Occasion of His 65th Birthday (Sprachtypologie und Universalienforschung, Supplements, Studia typologica 2). Berlin: Akademie Verlag. 328–340.

Vekerdi, József; with the assistance of Zsuzsa Várnai. 2000. A comparative dictionary of Gypsy dialects in Hungary. Gypsy–English–Hungarian dictionary with English to Gypsy and Hungarian to Gypsy word lists. Budapest: Terebess Publications.

Witzel, Michael. 1999. “Substrate languages in Old Indo-Aryan (Rgvedic, Middle and Late Vedic).” Electronic Journal of Vedic Studies 5: 1–67.


Effect

– The borrowing effect has been categorized as COEXISTENCE if, alongside the relevant loanword, there is an older form of the same or very similar meaning (roughly: in the scope of the pre-defined LWT meaning);
– The borrowing effect has been categorized as REPLACEMENT if a form of the same or very similar meaning is reconstructable for earlier stages of the language. Most reconstructed forms are Early Romani, though a few are post-Early Romani (i.e. dialect-specific) forms that have been lost in Selice Romani but retained in closely related Romani dialects (see Custom field 3).
– The borrowing effect has been categorized as INSERTION in two – partly overlapping – kinds of situations: a) if it can be shown that the relevant meaning of the loanword has not been lexicalized in previous stages of the language: for example, the meaning ‘cousin’ can be reconstructed to have been expressed analytically (e.g. ‘mother’s brother’s daughter’) in Early Romani, as it still is in many Romani dialects; and b) if the relevant concept is likely to have been alien to the pre-borrowing Romani culture: this includes not only modern world concepts such as ‘bus’ and ecologically alien concepts such as ‘kangaroo’, but also concepts that are marginal even to the present-day culture of the speech community, such as ‘spring’, ‘east’ or ‘breakfast’.


This field is filled in for borrowed nouns (and for adverbs that are based on case forms of borrowed nouns), adjectives and verbs; it is not filled in for borrowed adverbs (with the exception of the above) and function words, which generally do not allow for morphological integration in Selice Romani.
– A loanword has been categorized as HIGHLY integrated if it shows so-called oikoclitic morphology (see book chapter for details).
– A loanword has been categorized as INTERMEDIATELY integrated if it shows so-called xenoclitic morphology (see book chapter for details).
– Two items, the address/vocative particles mama ‘mother!’ and tata ‘father!’, have been classified as UNINTEGRATED.


This field has been filled in in such a way that its contents are logically dependent on the contents of the Effect field. The category NOT PRESENT has not been used.

Effect          Environmental salience
no information  no information
replacement     present in pre-contact environment
co-existence    present in pre-contact environment
insertion       present only since contact
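
Because the salience value follows from the effect value, the dependency can be written as a simple lookup. The Python sketch below is illustrative only: the names SALIENCE_BY_EFFECT, salience_for and is_consistent are hypothetical, and treating the spelling COEXISTENCE as equivalent to "co-existence" is an assumption based on the two spellings used in these field descriptions.

```python
# Mapping transcribed from the table: effect value -> salience value.
SALIENCE_BY_EFFECT = {
    "no information": "no information",
    "replacement": "present in pre-contact environment",
    "co-existence": "present in pre-contact environment",
    "insertion": "present only since contact",
}


def salience_for(effect: str) -> str:
    """Return the environmental salience implied by an effect value."""
    key = effect.strip().lower()
    # Assumption: COEXISTENCE and co-existence name the same category.
    if key == "coexistence":
        key = "co-existence"
    return SALIENCE_BY_EFFECT[key]


def is_consistent(effect: str, salience: str) -> bool:
    """Check that a record's salience matches what its effect implies."""
    return salience_for(effect) == salience.strip().lower()


print(salience_for("INSERTION"))
print(is_consistent("replacement", "present only since contact"))
```

An unknown effect value raises a KeyError, which is a deliberate choice in this sketch so that unexpected categories are noticed, not mapped silently.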


Abbreviations

ABL2 Old Ablative (synchronically, an adverbial marker)
DIM diminutive
FACT factitive
FREQ frequentative
INFL inflection
LOC2 Old Locative (synchronically, an adverbial marker)
MULTIPL multiplicative
SOC sociative (instrumental/comitative)

Further category labels used in morpheme-by-morpheme glosses:

ABSTRACT abstract or collective de-adjectival or de-nominal nominalization
ACTION action or product de-verbal nominalization
ADDITIVE additive numeral connector
ATTENUATIVE attenuative
COMPARATIVE comparative
DIRECTIVE directive orientation, movement towards a localization
DISTAL distal deictic root
EXTRAESSIVE extraessive localization (‘outside’)
INESSIVE inessive localization (‘in’)
INFERIOR inferior localization (‘under’)
INTERROGATIVE interrogative root
LOAN loanword adaptation marker
MIDDLE middle, “mediopassive”
NOUN noun
ORDINAL ordinal
PLAIN_DEICTIC plain (non-specific) deictic root
POSTERIOR posterior localization (‘behind’)
PREFIX prefix with a hard-to-describe function
PROXIMAL proximal deictic root
REDUPLICATION reduplicating morpheme
ROOT semi-analyzable root
SPECIFIC_DEICTIC specific deictic root
STATIVE stative orientation
SUFFIX suffix with a hard-to-describe function
SUPERIOR superior localization (‘above’)
VERB verb; verb-deriving marker

Further abbreviations:

SR Selice Romani
ER Early Romani