ELG resources and tools

Corpora

  • SrpELTeC-gold – Named Entity Recognition training corpus for Serbian – A selection of 11 full novels and excerpts from 15 novels, drawn from the Serbian literary corpus of novels written more than a century ago. In the first stage of gold standard preparation, the texts were automatically labelled with the SrpNER system for Serbian. Contains 330,119 tokens and 7 classes: person, organization, location, event, work, demonym, role. License CC-BY-NC-SA-4.0.
  • SrpKor4Tagging – A corpus built from a mix of literary (⅓) and administrative (⅔) texts in Serbian. It is lemmatized and POS-tagged with 2 tagsets: the Universal POS tagset and the SrpLemKor tagset (based on traditional, descriptive Serbian grammar). Consists of 342,803 tokens. License CC-BY-4.0.

Models

  • SrpCNNER – Named Entity Recognizer for Serbian (7 classes)
    A Named Entity Recognizer (NER) with a Convolutional Neural Network (CNN) architecture, trained to recognize 7 named entity types. It reaches an F1 score of approximately 91% on the test dataset.
    License CC-BY-NC-SA-4.0.
  • SrpKor4Tagging-TreeTagger – TreeTagger models for tagging using Universal POS and SrpLemKor tagsets, trained using the SrpKor4Tagging annotated corpora and SrpMD4Tagging lexicons. License CC-BY-4.0.
  • SrpKor4Tagging-spaCy – spaCy POS-tagging models for tagging using Universal POS and SrpLemKor tagsets, trained using the SrpKor4Tagging annotated corpora. License CC-BY-4.0.
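
As a minimal sketch of how tagger output from such models might be consumed, the snippet below parses the tab-separated token / tag / lemma lines that TreeTagger prints when run with its token and lemma options. The sample lines and their tags are hand-written illustrations, not actual output of the SrpKor4Tagging models, and the names TaggedToken and parse_treetagger_output are hypothetical.

```python
from typing import Iterator, NamedTuple


class TaggedToken(NamedTuple):
    form: str
    tag: str
    lemma: str


def parse_treetagger_output(text: str) -> Iterator[TaggedToken]:
    """Yield one TaggedToken per well-formed 'form<TAB>tag<TAB>lemma' line."""
    for line in text.splitlines():
        if not line.strip():
            continue
        parts = line.split("\t")
        if len(parts) != 3:
            # Lines without exactly three columns (e.g. markup) are skipped here.
            continue
        yield TaggedToken(*parts)


# Hand-written sample; the tags are illustrative Universal POS values.
sample = "Ana\tPROPN\tAna\nčita\tVERB\tčitati\nknjigu\tNOUN\tknjiga\n.\tPUNCT\t.\n"
tokens = list(parse_treetagger_output(sample))
for token in tokens:
    print(f"{token.form:10} {token.tag:6} {token.lemma}")
```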

Lexicons

  • SrpMD – Serbian Morphological Dictionaries – SrpMD follows the methodology and format (known as DELAS/DELAF) developed at LADL (Laboratoire d’Automatique Documentaire et Linguistique). Contains 10,288 multiword units, 88,753 simple words and 3,753,750 word forms. License CC-BY-NC-SA-4.0.
  • SrpMD4Tagging – Serbian Morphological Dictionaries for Tagging – Derived from the Serbian Morphological Dictionaries (Krstev & Vitas) as a lookup dictionary for assigning a lemma to a given inflected form and POS tag. Two files are available, one per POS tagset: Universal Dependencies and the traditional Serbian POS tagset. Contains 935,466 tagged word forms. License CC-BY-NC-SA-4.0.
  • GeolISSTerm – Dictionary of geologic terms – An electronic dictionary built as a special-purpose taxonomy of basic geologic concepts and terms. GeolISSTerm is part of the Geologic Information System of Serbia (GeolISS), where it is used for validation, classification and specification of observed and interpreted geological attributes. Contains 2,631 bilingual terms with definitions and synonyms. License CC-BY-NC-SA-4.0.
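
To illustrate the DELAF-style entries and the form-plus-tag lemma lookup described above, here is a minimal sketch. It assumes the general DELAF line shape "form,lemma.POS+markers:inflection:inflection", with an empty lemma meaning that the lemma equals the form. The sample entries and their codes are invented for illustration, the actual file layout of SrpMD4Tagging may differ, and escaped commas or dots are not handled. The names DelafEntry and parse_delaf_line are hypothetical.

```python
from typing import List, NamedTuple


class DelafEntry(NamedTuple):
    form: str
    lemma: str
    pos: str
    markers: List[str]
    inflections: List[str]


def parse_delaf_line(line: str) -> DelafEntry:
    """Split one DELAF-style line into its parts (no escape handling)."""
    form, rest = line.strip().split(",", 1)
    lemma, codes = rest.split(".", 1)
    grammatical, *inflections = codes.split(":")
    pos, *markers = grammatical.split("+")
    # An omitted lemma means the lemma is identical to the inflected form.
    return DelafEntry(form, lemma or form, pos, markers, inflections)


# Invented sample entries; the codes are illustrative only.
lines = [
    "kuće,kuća.N:fs2q:fp1q",
    "čita,čitati.V:Pzs",
    "Beograd,.N+NProp+Top:ms1q",
]
entries = [parse_delaf_line(line) for line in lines]

# Lookup keyed by (inflected form, POS tag), returning the lemma.
lemma_lookup = {(entry.form, entry.pos): entry.lemma for entry in entries}
print(lemma_lookup[("kuće", "N")])
```

A real dictionary can list several lemmas for the same form and tag, so a production lookup would likely map each key to a list of candidates rather than a single string.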

Tools

  • Bibliša – Multilingual digital library tool – Bibliša (Biblisha) is a publicly available multilingual digital library, developed for managing, searching and browsing aligned bilingual text collections. Built on the MongoDB NoSQL database, the web tool enables the use of the rich information in the stored text collections. Search is offered at two levels, with and without login.
  • Leximirka – A lexical database and web application for developing, managing and exploring lexicographic data. It enables lexical entry control, automatic vocabulary enrichment, multiuser work, and the establishment of relations among lexical entries. A rule-based system links lexical entries automatically. Login is required for search.

ELG services

  • SrpCNNER service – Web service that annotates the provided text with SrpCNNER, using the following tagset: PERS, ROLE, LOC, DEMO, ORG, WORK and EVENT. Available online without login.
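
The sketch below shows one way a client might prepare text for such a service and read back the entities. It assumes the service follows the general ELG request and response shape: a JSON text request, and an annotations response that maps each tag to a list of character spans with start and end offsets. The sample response is hand-written rather than real service output, no endpoint is shown, and the names build_text_request and extract_entities are hypothetical.

```python
import json
from typing import List, Tuple


def build_text_request(text: str) -> str:
    """Serialize a text request body (assumed ELG-style shape)."""
    return json.dumps({"type": "text", "content": text}, ensure_ascii=False)


def extract_entities(response: dict, text: str) -> List[Tuple[int, int, str, str]]:
    """Return (start, end, tag, surface form) tuples sorted by position."""
    annotations = response.get("response", {}).get("annotations", {})
    entities = []
    for tag, spans in annotations.items():
        for span in spans:
            start, end = span["start"], span["end"]
            entities.append((start, end, tag, text[start:end]))
    return sorted(entities)


text = "Nikola Tesla je rođen u Smiljanu."

# Hand-written response in the assumed shape, for illustration only.
sample_response = {
    "response": {
        "type": "annotations",
        "annotations": {
            "LOC": [{"start": 24, "end": 32}],
            "PERS": [{"start": 0, "end": 12}],
        },
    }
}

request_body = build_text_request(text)
entities = extract_entities(sample_response, text)
for start, end, tag, surface in entities:
    print(f"{tag:5} {start:3}-{end:<3} {surface}")
```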
