Serbian Unitex Day

Tita Kiriakopoulou: “Extraction and annotation of ‘location names’” (invited lecture) [pdf]

Introduced as part of the last Message Understanding Conferences dedicated to information extraction, Named Entity extraction is a well-studied task in Natural Language Processing (NLP). The recognition and categorization of person names, location names, organisation names, etc. is regarded as a fundamental process for a wide variety of NLP applications dealing with content analysis, and many research works are devoted to it, achieving very good results. One of our objectives is the identification and automatic (or semi-automatic) annotation of location names in order to apply the most appropriate information extraction methods. The main objective, then, concerns the combination and interoperability of symbolic and statistical NLP methods (symbolic rules, machine learning, and data mining). Our work consisted of recognizing named entities, in particular locations, with Unitex, annotating them with Brat, and correcting the annotations manually. The recall and accuracy rates are very encouraging, but the question remains: what is a location name?

Cvetana Krstev: “Old or new, we repair, adjust and alter (texts)” [pdf]

In this presentation we will show how e-dictionaries and cascades of finite-state transducers, as implemented in Unitex, can be used to solve three text transformation problems: correction of texts after OCR, restoration of diacritics, and translation into a variant dialect.
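As a rough illustration of the diacritic restoration problem only, the sketch below uses a plain Python lookup table rather than Unitex e-dictionaries or transducer cascades. The tiny lexicon, the helper names and the "replace only when unambiguous" policy are all assumptions made for this example, not the method presented in the talk.

```python
from collections import defaultdict

# Illustrative lower-case lexicon; a real system would take its word forms
# from the e-dictionaries. "kuca" and "kuća" are both valid Serbian words.
LEXICON = {"šuma", "je", "velika", "kuća", "kuca"}

# Mapping used to strip Serbian Latin diacritics (lower case only).
STRIP = str.maketrans({"č": "c", "ć": "c", "š": "s", "ž": "z", "đ": "dj"})


def strip_diacritics(word: str) -> str:
    """Return the word as it would be typed without diacritics."""
    return word.translate(STRIP)


def build_index(lexicon):
    """Map every stripped form to the set of lexicon forms that produce it."""
    index = defaultdict(set)
    for form in lexicon:
        index[strip_diacritics(form)].add(form)
    return index


def restore(text: str, index) -> str:
    """Restore diacritics token by token, but only when exactly one candidate exists."""
    restored = []
    for token in text.split():
        candidates = index.get(token, set())
        if len(candidates) == 1:
            restored.append(next(iter(candidates)))
        else:
            # Unknown or ambiguous token: leave it untouched.
            restored.append(token)
    return " ".join(restored)


INDEX = build_index(LEXICON)
```

With this toy lexicon, `restore("suma je velika", INDEX)` yields "šuma je velika", while "kuca" is left unchanged because it is ambiguous between "kuca" and "kuća". Resolving such cases needs context, which is where grammar rules come in.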

Duško Vitas: “Derivation and Graphs” [pdf]

Unitex supports the production of e-morphological dictionaries through the use of inflectional classes represented by inflectional transducers. However, no such facility exists for derivational classes. In this presentation, we will not talk about derivational nests in general; rather, we will reduce our problem to the phenomenon of regular derivation, which we define as a derivational process in which the meaning of a derived word can be predicted on the basis of the meaning of the original word. In Serbian, such derivational processes are very rich; however, traditional lexicography does not record them systematically. A mechanism for modelling regular derivation can significantly reduce the number of unrecognized words in text analysis. We will present some Unitex mechanisms for processing regular derivation, such as morphological dictionary graphs, as well as some enhancements that could be incorporated in Leximirka in order to facilitate the processing of derived lemmas. Finally, we will say a few words about the reorganization of lexicographic entries in the light of the presented formalization of regular derivation.

Jelena Jaćimović: “Serbian NER – Dictionaries, Graphs and Cascades” [pdf]

Over the past decades the Named Entity Recognition (NER) task, often used as a basis for further Natural Language Processing (NLP) treatments, has been widely studied, and diverse language systems and resources have been developed. This presentation provides a concise overview of the current state of Serbian NER, in terms of both the existing lexical resources and the finite-state transducers used for the detection of various named entity classes and the description of their context.

Ranka Stanković, Miloš Utvić: “Vebran API for query expansion in Serbian Corpora” [pdf]

The upgrading of existing web interfaces for searching corpora (RudKor, SrpKor) and digital libraries (Biblisha, ROmeka) using the Vebran web API will be introduced for IMS CWB and NoSketch Engine. Query expansion options implemented in Vebran, based on morphological electronic dictionaries (developed for Unitex), the Serbian Wordnet and terminological databases, will be demonstrated.
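To make the idea of dictionary-based query expansion concrete, here is a minimal sketch that turns a lemma into a CQP-style token query covering all of its inflected forms. It does not use or reproduce the Vebran API; the `FORMS_BY_LEMMA` table and the `expand_to_cqp` function are hypothetical names invented for this illustration.

```python
import re

# Hypothetical lemma -> inflected forms table; in practice this information
# would come from a morphological electronic dictionary.
FORMS_BY_LEMMA = {
    "kuća": ["kuća", "kuće", "kući", "kuću", "kućom"],
}


def expand_to_cqp(lemma: str, forms_by_lemma: dict) -> str:
    """Build a CQP token expression matching any known form of the lemma.

    Unknown lemmas fall back to a query for the lemma itself.
    """
    forms = forms_by_lemma.get(lemma, [lemma])
    # Escape regex metacharacters, deduplicate, and sort for a stable result.
    alternatives = "|".join(re.escape(form) for form in sorted(set(forms)))
    return f'[word="{alternatives}"]'


query = expand_to_cqp("kuća", FORMS_BY_LEMMA)
```

The resulting string can be pasted into a CQP-compatible search box. Expansion through a wordnet or a terminological database would add synonyms or related terms to the same alternation.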

Ranka Stanković, Branislava Šandrih: “Extraction of Bilingual Terminology using Graphs, Dictionaries and Giza++” [pdf]

In this presentation we will show how Multi-Word Units can be obtained using syntactic graphs and morphological dictionaries developed for the Unitex distribution, in order to compile a bilingual dictionary. We will also present the extraction of domain-specific lexica from domain-specific parallel corpora with the help of additional resources.

Biljana Lazić, Mihailo Škorić: “From DELA Dictionaries to Leximirka Lexical Database” [pdf]

In this presentation, we will introduce our approach to transforming the Serbian Morphological Dictionaries (SMD) from DELA dictionaries into the Leximirka lexical database. New possibilities for dictionary improvements based on the lexical database will be presented. We will also present sets of rules developed to establish relations between lexical entries.
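The first step of any such transformation is reading the dictionary lines. The sketch below parses a DELAF-style entry of the form `inflected,lemma.CODE+marker:infl1:infl2` into a structured record. It is a simplified illustration, not the Leximirka import procedure: the `Entry` class, the `parse_line` function and the sample grammatical codes are assumptions, and escaped special characters are not handled.

```python
import re
from dataclasses import dataclass

# inflected form , lemma (may be empty) . codes joined by "+" , then ":"-prefixed
# inflectional codes. Escaped commas, dots and colons are NOT supported here.
ENTRY = re.compile(
    r"^(?P<form>[^,]+),(?P<lemma>[^.]*)\.(?P<codes>[^:]+)(?P<infl>(?::[^:]+)*)$"
)


@dataclass
class Entry:
    form: str
    lemma: str
    pos: str
    markers: list
    inflections: list


def parse_line(line: str) -> Entry:
    """Parse one simplified DELAF-style line into an Entry record."""
    match = ENTRY.match(line.strip())
    if match is None:
        raise ValueError(f"not a DELAF-style entry: {line!r}")
    pos, *markers = match["codes"].split("+")
    # An empty lemma field means the lemma is identical to the inflected form.
    lemma = match["lemma"] or match["form"]
    inflections = [code for code in match["infl"].split(":") if code]
    return Entry(match["form"], lemma, pos, markers, inflections)


# The grammatical codes in this sample line are illustrative only.
entry = parse_line("kuće,kuća.N+Conc:fs2:fp1")
```

Records of this kind could then be grouped by lemma and loaded into database tables, which is the point at which relations between lexical entries can be expressed explicitly.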

Jelena Jaćimović: “Unitex vs. TXM: What’s the Difference?” [pdf]

Advances in computational linguistics have made it possible to automatically process and analyse many language mechanisms. On this occasion, we explore the differences between two open-source, cross-platform, multilingual corpus processing environments developed for the analysis of natural language texts, namely Unitex/GramLab and TXM. Even though both tools have led to great insights and interesting findings, they embody two different concepts with some overlap, which will be briefly described.

***Written and enhanced versions of the presentations are published in an issue of Infotheka.
