TRAMORPH
ITC-IRST - November 1998
main.txt

1. Introduction

The morphological analyzer is a tool that decomposes each Italian word into its morphemes and gives both syntactic information and a transcription for each valid decomposition. The morpho-lexicon includes more than 100,000 entries; each morpheme belongs to a class and is associated with its meta-transcription, an intermediate representation that can evolve in different ways depending on the adjacent morphemes. The morphological engine recognizes an input word by combining morphemes according to a set of concatenation rules.

The morpho-lexicon was obtained by processing the Zingarelli dictionary and adding all possible inflections by hand. This base lexicon was then enriched with names and neologisms found in the 65,000 most frequent words of the newspaper "Il Sole 24 Ore". The most frequent Italian proper names and surnames (from the telephone directory), geographical names, and commonly used foreign words were also added to the lexicon. Each morpheme was phonetically transcribed and manually checked.

2. TRAMORPH history

2.1. The morphological tool

TRAMORPH is a morphological analyzer composed of a morpho-lexicon and a morphological engine. The morpho-lexicon includes more than 100,000 entries; each morpheme belongs to a class and is associated with its meta-transcription, an intermediate representation that can evolve in different ways depending on the adjacent morphemes. All meta-transcriptions were hand-checked.

The morphological engine recognizes or rejects an input word by combining morphemes according to a set of concatenation rules. The transcription of a recognized word is obtained by concatenating the meta-transcriptions of its morphemes and then applying a post-processing step that resolves the meta-symbols. If a word is ambiguous, all of its valid decompositions are returned, each with the corresponding transcription.
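The recognition scheme described above can be illustrated with a minimal sketch. It is not the TRAMORPH engine: the lexicon entries, the class names, the meta-symbol "C" and the phonetic symbols "k" and "tS" are all invented for illustration, and only stem + suffix combinations are handled.

```python
from typing import List, NamedTuple

class Morpheme(NamedTuple):
    text: str   # orthographic form
    cls: str    # morpheme class (hypothetical class names)
    meta: str   # meta-transcription, may contain meta-symbols

class Analysis(NamedTuple):
    morphemes: List[str]
    classes: List[str]
    transcription: str

# Toy lexicon. "C" is an invented meta-symbol whose value depends on
# the morpheme that follows the stem.
LEXICON = [
    Morpheme("amic", "noun_stem_o", "amiC"),
    Morpheme("cant", "noun_stem_o", "kant"),
    Morpheme("cant", "verb_stem_are", "kant"),
    Morpheme("o", "noun_suffix_o_sg", "o"),
    Morpheme("i", "noun_suffix_o_pl", "i"),
    Morpheme("o", "verb_suffix_are_1sg", "o"),
    Morpheme("i", "verb_suffix_are_2sg", "i"),
]

# Concatenation rules: the suffix classes allowed after each stem class.
RULES = {
    "noun_stem_o": {"noun_suffix_o_sg", "noun_suffix_o_pl"},
    "verb_stem_are": {"verb_suffix_are_1sg", "verb_suffix_are_2sg"},
}

def resolve(meta: str) -> str:
    """Post-processing: replace each meta-symbol according to its right context."""
    out = []
    for i, ch in enumerate(meta):
        if ch == "C":
            nxt = meta[i + 1] if i + 1 < len(meta) else ""
            out.append("tS" if nxt in ("e", "i") else "k")
        else:
            out.append(ch)
    return "".join(out)

def analyze(word: str) -> List[Analysis]:
    """Return every stem + suffix decomposition licensed by RULES."""
    results = []
    for stem in LEXICON:
        if stem.cls not in RULES or not word.startswith(stem.text):
            continue
        rest = word[len(stem.text):]
        for suffix in LEXICON:
            if suffix.text == rest and suffix.cls in RULES[stem.cls]:
                results.append(Analysis(
                    [stem.text, suffix.text],
                    [stem.cls, suffix.cls],
                    resolve(stem.meta + suffix.meta),
                ))
    return results

if __name__ == "__main__":
    for w in ("amico", "amici", "canto", "amicx"):
        print(w, analyze(w))
```

In this toy setup "amic" + "o" and "amic" + "i" share one stem entry but receive different transcriptions after post-processing, "canto" is returned with two decompositions (noun and verb), and "amicx" is rejected with an empty list.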
Particular care was taken in the design of the classes, in order to minimize the risk of recognizing (generating) invalid words. For instance, because of the irregular verbs of the Italian language, more than 60 verbal classes were identified, each with its own set of inflections (about 50 per class). Nouns and adjectives are divided into about 80 classes, according to their inflection set, part of speech, gender and number. Because of overgeneration problems, the rule set includes neither derivation ("gamba" -> "gambizzare") nor alteration ("casa" -> "casetta") capabilities, nor prefix handling ("ri-telefonare"). All such items must be explicitly included in the morpho-lexicon to be recognized.

2.2. The morpho-lexicon

The first morpho-lexicon was obtained by processing about 88,000 items from the Zingarelli dictionary. Each item was assigned to a class, its suffix was removed (except for function words), and its meta-transcription was automatically generated and then hand-checked. The sets of valid inflections for each class were added by hand. This process led to about 95,000 morphemes.

This base lexicon was then enriched with the words that were found among the 65,000 most frequent words of the newspaper "Il Sole 24 Ore" and were not recognized by the first lexicon. The new words were mainly neologisms, personal names and surnames, company and geographical names, acronyms, words with prefixes ("mega-centrale"), and commonly used foreign words. The most frequent Italian proper names and surnames, extracted from the telephone directory, were also added to the lexicon, which currently comprises about 100,700 morphemes.

3. Warnings

Some words can be pronounced in different ways, depending on regional inflections. The most common changes are open vs. closed vowels (e <-> E, o <-> O) and voiced vs. unvoiced fricatives (x <-> X, z <-> Z). Sometimes both pronunciations occur in the morpho-lexicon, sometimes not.
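When comparing transcriptions from the lexicon with other sources, it may help to ignore these regional alternations. The following is only a sketch: it assumes that every phonetic unit is written as a single character, which may not hold for the actual unit set, and the sample strings are invented. Folding onto the lowercase symbol is an arbitrary choice.

```python
# Symbol pairs taken from the warning above; treating each unit as a
# single character is an assumption of this sketch.
VARIANT_PAIRS = {"E": "e", "O": "o", "X": "x", "Z": "z"}
_FOLD = str.maketrans(VARIANT_PAIRS)

def fold_variants(transcription: str) -> str:
    """Map each member of a regional pair onto one canonical symbol."""
    return transcription.translate(_FOLD)

def same_up_to_variants(a: str, b: str) -> bool:
    """True if two transcriptions differ only by the regional alternations."""
    return fold_variants(a) == fold_variants(b)

if __name__ == "__main__":
    # Invented sample transcriptions, not taken from the lexicon.
    print(same_up_to_variants("pEska", "peska"))
    print(same_up_to_variants("pEska", "paska"))
```

Folding also erases the distinction between words that differ only in one of these symbols, so it is suited to tolerant comparison rather than to storing transcriptions.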
Only the units for the Italian language were used to transcribe the lexicon. Especially for foreign words, this leads to some awkward but effective transcriptions, at least for speech recognition purposes. Despite the manual check and a number of automatic and semiautomatic checks, some errors may still be present.