EDBL: a General Lexical Basis for the Automatic Processing of Basque
Abstract
EDBL (Euskararen Datu-Base Lexikala) is a general-purpose lexical database used in Basque text-processing tasks. It is a large repository of lexical knowledge (currently around 80,000 entries) that serves as a basis and support for a number of NLP tasks, providing lexical information to several language tools: morphological analysis, spell checking and correction, lemmatization and tagging, syntactic analysis, and so on. It has been designed to be neutral with respect to the different linguistic formalisms, and flexible and open enough to accept new types of information. A browser-based user interface makes it easy for the lexicographer to consult the database, correct and update entries, add new ones, and so on.
The paper presents the conceptual schema and the main features of the database, along with some problems encountered in its design and implementation in a commercial DBMS. Given the diversity of the lexical entities and the complex relationships among them, three total specializations have been defined under the main class of the hierarchy that represents the conceptual schema. The first divides all the entries in EDBL into standard and non-standard Basque entries. The second divides the units in the database into dictionary entries (classified by part of speech) and other entries (mainly non-independent morphemes and irregularly inflected forms). Finally, a third total specialization has been established between single-word entries and multiword lexical units; this allows us to describe the morphotactics of single-word entries, and the constitution and surface realization schemas of multiword lexical units.
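The three total specializations can be pictured as three independent classifications that every entry must fall under. The following Python fragment is only a minimal sketch of that idea, not the actual EDBL schema: all class and field names (`LexicalEntry`, `Standardness`, `EntryKind`, `WordShape`, `pos`) are hypothetical, and the two sample entries are illustrative rather than taken from the database.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Standardness(Enum):
    STANDARD = "standard"
    NON_STANDARD = "non-standard"


class EntryKind(Enum):
    DICTIONARY_ENTRY = "dictionary entry"  # classified by part of speech
    OTHER_ENTRY = "other entry"            # e.g. bound morpheme, irregular form


class WordShape(Enum):
    SINGLE_WORD = "single-word entry"
    MULTIWORD = "multiword lexical unit"


@dataclass(frozen=True)
class LexicalEntry:
    """Hypothetical entry: one mandatory value per specialization."""
    form: str
    standardness: Standardness
    kind: EntryKind
    shape: WordShape
    pos: Optional[str] = None  # assumed to be required for dictionary entries

    def __post_init__(self):
        # "Total" specialization: every entry is classified on each axis,
        # so none of the three fields may be left empty.
        for value in (self.standardness, self.kind, self.shape):
            if value is None:
                raise ValueError("every entry must be classified on all three axes")
        if self.kind is EntryKind.DICTIONARY_ENTRY and self.pos is None:
            raise ValueError("a dictionary entry needs a part of speech")


# Illustrative entries only (assumed classifications).
etxe = LexicalEntry("etxe", Standardness.STANDARD,
                    EntryKind.DICTIONARY_ENTRY, WordShape.SINGLE_WORD, pos="noun")
lo_egin = LexicalEntry("lo egin", Standardness.STANDARD,
                       EntryKind.DICTIONARY_ENTRY, WordShape.MULTIWORD, pos="verb")
```

Modelling the specializations as three separate fields, rather than as one deep class tree, reflects the fact that they cut across each other: a multiword unit can be standard or non-standard, and so on.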
A hierarchy of typed feature structures (FS) has been designed to map the entities and relationships in the database conceptual schema. The FSs are coded in TEI-conformant SGML, and Feature Structure Declarations (FSD) have been made for all the types of the hierarchy. Feature structures are used as a delivery format to export the lexical information from the database. The information coded in this way is subsequently used as input by the different language analysis tools.
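To give a rough idea of what exporting an entry as a feature structure might look like, the sketch below builds a small `<fs>` element in the style of the TEI feature-structure markup (`fs`, `f`, `str`, `sym`). It is an assumption-laden illustration: the function `entry_to_fs`, the type name `dictionary-entry`, and the feature names `form`, `pos` and `standardness` are hypothetical and do not come from EDBL's Feature Structure Declarations. Python's `xml.etree.ElementTree` also emits XML rather than SGML, and the result is not validated against any DTD.

```python
import xml.etree.ElementTree as ET


def entry_to_fs(fs_type, string_features, symbol_features):
    """Build an <fs> element from two dicts mapping feature name -> value."""
    fs = ET.Element("fs", {"type": fs_type})
    for name, value in string_features.items():
        f = ET.SubElement(fs, "f", {"name": name})
        ET.SubElement(f, "str").text = value       # free-text value
    for name, value in symbol_features.items():
        f = ET.SubElement(fs, "f", {"name": name})
        ET.SubElement(f, "sym", {"value": value})  # value from a closed set
    return fs


# Hypothetical entry; type and feature names are illustrative only.
fs = entry_to_fs(
    "dictionary-entry",
    {"form": "etxe"},
    {"pos": "noun", "standardness": "standard"},
)
serialized = ET.tostring(fs, encoding="unicode")
print(serialized)
```

A downstream tool would then read such a structure back and pick out the features it needs, which is what makes a declarative delivery format convenient for several analysers at once.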