ELG resources and tools
- SrpELTeC-gold – Named Entity Recognition Training Corpus for Serbian – A selection of 11 full novels and excerpts from 15 novels, drawn from a Serbian literary corpus of novels written more than a century ago, automatically labelled with the SrpNER system for Serbian in the first stage of gold standard preparation. Contains 330,119 tokens and 7 classes: person, organization, location, event, work, demonym, role. License CC-BY-NC-SA-4.0.
- SrpKor4Tagging – A corpus composed of literary (⅓) and administrative (⅔) texts in Serbian. It is lemmatized and POS-tagged with two tagsets: the Universal POS tagset and the SrpLemKor tagset (based on traditional, descriptive Serbian grammar). Consists of 342,803 tokens, license CC-BY-4.0.
- RudKorP – Serbian Public Mining Corpus – A specialized corpus in the field of mining and mineral resource exploitation, from the University of Belgrade, Faculty of Mining and Geology. Contains 2.34 million words, license CC-BY-4.0.
- INTERA Corpus – the Serbian-English part – bilingual corpus of 1 million words per language, paired at sentence level, license CC-BY-4.0.
- INTERA Corpus – the Serbian POS annotated part of the SR-EN pair – 1 million words, license CC BY 4.0.
- MULTEXT-East “1984” annotated corpus 4.0 – automatically tagged with grammatical categories, parts of speech and lemmas, then manually corrected, license MULTEXT-East CC BY-NC-SA 4.0.
- Corpus 80 jours – parallel corpus consisting of 3,700 paired segments, mainly sentences, license CC-BY-NC-SA-4.0.
- SrpCNNER – Named Entity Recognizer for Serbian (7 classes) – A Named Entity Recognizer (NER) with a Convolutional Neural Network (CNN) architecture, trained to recognize 7 named entity types, with an F1 score of approximately 91% on the test dataset.
- SrpKor4Tagging-TreeTagger – TreeTagger models for tagging using Universal POS and SrpLemKor tagsets, trained using the SrpKor4Tagging annotated corpora and SrpMD4Tagging lexicons. License CC-BY-4.0.
- SrpKor4Tagging-spaCy – spaCy POS-tagging models for tagging using Universal POS and SrpLemKor tagsets, trained using the SrpKor4Tagging annotated corpora. License CC-BY-4.0.
- SrpMD – Serbian Morphological Dictionaries – SrpMD follows the methodology and format (known as DELAS/DELAF) developed at LADL (Laboratoire d’Automatique Documentaire et Linguistique). Contains 10,288 multiword units, 88,753 simple words and 3,753,750 word forms, license CC-BY-NC-SA-4.0.
- SrpMD4Tagging – Serbian Morphological Dictionaries for Tagging – Derived from the Serbian Morphological Dictionaries (Krstev & Vitas) as a lookup dictionary for assigning a lemma to a given inflected form and POS tag. Two files are available, one for each POS tagset: Universal Dependencies and the traditional Serbian POS tagset. Contains 935,466 tagged word forms, license CC-BY-NC-SA-4.0.
- GeolISSTerm – dictionary of geologic terms – An electronic dictionary serving as a special-purpose taxonomy of basic geologic concepts and terms. GeolISSTerm is part of the Geologic Information System of Serbia (GeolISS) and is used for validation, classification and specification of observed and interpreted geological attributes. Contains 2,631 bilingual terms with definitions and synonyms, license
- Bibliša – Multilingual digital library tool – Biblisha is a publicly available multilingual digital library, developed for the management, search and browsing of aligned bilingual text collections. Built on a MongoDB NoSQL database, the web tool makes use of the rich information in the stored text collections. Search is available at two levels, with and without login.
- Leximirka – lexical database and a web application for developing, managing and exploring lexicographic data. It enables lexical entry control, automatic vocabulary enrichment, multiuser work, and establishment of relations among lexical entries. The rule-based system enables automatic linking between lexical entries. Login required for the search.
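
The SrpMD4Tagging entry above describes a lookup dictionary that assigns a lemma to a given inflected form and POS tag. A minimal Python sketch of that kind of lookup follows. It does not read the real SrpMD4Tagging files, whose format is not described here: the function names `build_lookup` and `lemmatize`, the in-memory `(form, POS, lemma)` triples, the lowercasing of forms and the fallback to the surface form are all assumptions made for illustration.

```python
from collections import defaultdict


def build_lookup(entries):
    """Map (lowercased form, POS tag) to a list of candidate lemmas."""
    lookup = defaultdict(list)
    for form, pos, lemma in entries:
        key = (form.lower(), pos)
        if lemma not in lookup[key]:
            lookup[key].append(lemma)
    return lookup


def lemmatize(lookup, form, pos):
    """Return the first known lemma for (form, pos), or the form itself if unknown."""
    candidates = lookup.get((form.lower(), pos))
    return candidates[0] if candidates else form


# Hypothetical sample entries (inflected form, Universal POS tag, lemma);
# these are illustrative and not taken from the actual dictionaries.
SAMPLE_ENTRIES = [
    ("kuće", "NOUN", "kuća"),
    ("kući", "NOUN", "kuća"),
    ("radi", "VERB", "raditi"),
    ("radi", "ADP", "radi"),
]

lookup = build_lookup(SAMPLE_ENTRIES)
print(lemmatize(lookup, "radi", "VERB"))
print(lemmatize(lookup, "Kuće", "NOUN"))
print(lemmatize(lookup, "Beograd", "PROPN"))
```

Keying on the pair (form, POS tag) rather than on the form alone is what lets the same surface form resolve to different lemmas depending on the tag a POS tagger has assigned.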