UpLORD is a two-year project (2023-2024) funded by the swissuniversities ORD programme and hosted by the University of Zurich, with the support of the Zurich University Library and the CLARIN-CH Consortium. Since 2018, a consortium of partners has been building a national ecosystem of infrastructures that covers the whole linguistic data lifecycle according to ORD requirements (FAIR principles), from data generation, processing and analysis to data sharing and archiving. This ecosystem includes the national technology platform LiRI and the national repository for publishing and archiving linguistic data (SWISSUbase) as service providers, a database of Swiss media texts, and a platform for hosting and searching large text and audio/video corpora.
The project focuses on upgrading the workflows and interoperability of existing infrastructure services, establishing working groups at the national level, documenting and promoting best practices, raising awareness of and providing training in ORD practices in the context of teaching, research and publishing, and building a robust practice of data curation. In the long term, this project will contribute significantly to developing a strong foundation for a sustainable ORD strategy for linguistic data in Switzerland.
Project management: Noah Bubenhofer (LiRI), Andrea Malits (Universitätsbibliothek Zürich) and Cristina Grisot (CLARIN-CH)
Given the requirements of Open Science and the FAIR principles on the one hand, and the demands of more challenging data sets (such as sensitive data or data with copyright issues) on the other, we identified several gaps in the current situation in Switzerland that this ORD project will address:
(1) the provision and the usability of the individual corpus platforms across Switzerland: their drawback is that they do not satisfy the Interoperability principle,
(2) the status of several linguistic corpora: most of them do not satisfy all four FAIR principles,
(3) the lack of meaningful metadata and of infrastructures to manage a diversity of annotations of linguistic data, this being linked to the Findable and the Reusable principles,
(4) the proper management of sensitive data, informed consent, copyright and intellectual property issues, which is necessary for data sets to adhere to the FAIR requirements,
(5) the metadata schema on SWISSUbase which must be adapted and amended to satisfy the Findable principle,
(6) uploading workflows on SWISSUbase,
(7) the quality control and data curation in SWISSUbase to satisfy the Reusable principle,
(8) the setting of ethical standards for best practices, good habits and frames of mind in linguistic data management and collaboration, as well as the lack of training for the target scientific communities to adopt these standards.
Regarding gaps (1) and (2), this project will advance the implementation of the Linguistic Corpus Platform LCP@LiRI at the national level, following the FAIR principles. This will significantly improve the discovery, access, integration, usability and reusability of corpora, and will simplify re-formatting, assembling, harmonizing and standardizing the data and metadata.
Regarding gap (3), the current implementation of VIAN-DH is used as a test case for modelling the complex annotation schemes that LCP@LiRI has to support. Data processed within VIAN-DH is complex interactional data consisting of verbal, paraverbal and non-verbal annotation levels. The project will collect further complex annotation models via the CLARIN-CH and NCCR partners as well as via other clients LiRI is already working with. It will then evaluate whether LCP@LiRI can also map these models. This process goes hand in hand with the development of best practices that show researchers how to deal with complex annotations and showcase the resulting opportunities.
Regarding gaps (4)-(6), the already implemented modular linguistic metadata schema of LaRS/SWISSUbase will be further specified and adapted to cover needs that are not yet addressed. To improve the usability of SWISSUbase, it should also be possible to provide a workflow via an API. In addition, simple workflows between LCP@CLARIN-CH and VIAN-DH will be implemented.
Regarding gap (7), this project will set up national working groups whose role will be to develop specific metadata that satisfy the FAIR principles for various types of linguistic data (e.g., sociolinguistic data, psycholinguistic experimental data, neurolinguistic data, conversation analysis data, lexicography data, computational data, acquisitional data, multimodal data).
Regarding gap (8), this project will develop showcases, best practices and good habits for data management according to the FAIR principles (such as planning and writing data management plans, using sustainable file formats, setting solid file-naming conventions, finding data storage and backup schemes, and creating informative metadata and documentation) and will promote ORD practice in the CLARIN-CH community, the NCCR "Evolving Language" and other linguistic communities. Special attention will be given to finding solutions that overcome disciplinary boundaries.