LiRI - Linguistic Research Infrastructure

Upgrading the linguistic ORD-ecosystem - UpLORD

Description

UpLORD is a two-year project (2023-2024) funded by the swissuniversities ORD programme and hosted by the University of Zurich, with the support of the Zurich University Library and the CLARIN-CH Consortium. Since 2018, a consortium of partners has been building a national ecosystem of infrastructures that covers the whole linguistic data lifecycle according to ORD requirements (FAIR principles: Findable, Accessible, Interoperable, Reusable), from data generation, processing and analysis to data sharing and archiving. This ecosystem includes the national technology platform LiRI and the national repository for publishing and archiving linguistic data (SWISSUbase) as service providers, a database of Swiss media texts, and a platform for hosting and searching large text and audio/video corpora.

The project focuses on upgrading the workflows and interoperability of existing infrastructure services, establishing working groups at the national level, documenting and promoting best practices, raising awareness of and providing training in ORD practices in the context of teaching, research and publishing, and building a robust practice of data curation. In the long term, this project will contribute significantly to a strong foundation for a sustainable ORD strategy for linguistic data in Switzerland.

Principal investigators

Prof. Dr. Noah Bubenhofer (LiRI)

Dr. Andrea Malits (Universitätsbibliothek Zürich)

Dr. Cristina Grisot (CLARIN-CH)

Situation of the linguistic ecosystem (early 2023)

Given the requirements of Open Science and the FAIR principles on the one hand, and the growing number of challenging data sets (for example, sensitive data or data with copyright issues) on the other, we identified several gaps in the current situation in Switzerland, which this ORD project will address:

(1) the provision and the usability of the individual corpus platforms across Switzerland: their drawback is that they do not satisfy the Interoperability principle,

(2) the status of several linguistic corpora: most of them do not satisfy all four FAIR principles,

(3) the lack of meaningful metadata and of infrastructures to manage a diversity of annotations of linguistic data, this being linked to the Findable and the Reusable principles,

(4) the proper management of sensitive data, informed consent, copyright and intellectual property issues, which are necessary so that the sets of data adhere to the FAIR requirements,

(5) the metadata schema on SWISSUbase, which must be adapted and amended to satisfy the Findable principle,

(6) uploading workflows on SWISSUbase,

(7) the quality control and data curation in SWISSUbase to satisfy the Reusable principle,

(8) the setting of ethical standards for best practices, good habits and frames of mind in linguistic data management and collaboration, as well as the lack of training for the target scientific communities to adopt these standards.

Project goals and achievements (May 2024)

Regarding gaps (1) and (2), this project will advance the implementation of the LiRI Corpus Platform at the national level following the FAIR principles. This will significantly improve the discovery, access, integration, usability and reusability of corpora, and will simplify re-formatting, assembling, harmonizing and standardizing the data and metadata.

  • The LiRI Corpus Platform (LCP) is a software system for handling and querying corpora of different kinds: text, audio and video (see the modules catchphrase, soundscript and videoscope).
  • Up to now, numerous tests with importing and querying various corpora have been carried out on the test version. The public version will be available in summer 2024, when users will be able to query corpora directly from their browser and upload their own corpora using a command-line interface.

Regarding gap (3), the current implementation of VIAN-DH is used as a test case for modelling the complex annotation schemes that the LiRI Corpus Platform has to offer. The data processed within VIAN-DH is complex interactional data consisting of verbal, paraverbal and non-verbal annotation levels. The project will collect further complex annotation models via the CLARIN-CH and NCCR partners, as well as via other clients LiRI is already working with, and will then evaluate whether LCP@LiRI can also represent them. This process goes hand in hand with the development of best practices that show researchers how to deal with complex annotations and showcase the resulting opportunities.

  • The representation of complex annotations is being addressed continuously during the LCP implementation process.
  • Up to now, data converters from (TEI-)XML, generic XML and CWB format to CoNLL-U+ have been implemented.

Regarding gaps (4)-(6), the modular linguistic metadata schema already implemented in LaRS@SWISSUbase will be specified further and adapted to needs that are not yet covered. To improve the usability of SWISSUbase, it should also be possible to provide a workflow via an API. In addition, easy workflows between LCP and VIAN-DH will be implemented.

  • Version control workflows and the launch of an API for data upload on SWISSUbase have been implemented.
  • The Data Service Unit@UZH has been established as the single point of entry for linguistic data to be deposited on SWISSUbase (swissubase.ch).
  • Webinar on the use of LaRS/SWISSUbase services (February 2023), see also: Webinar: The Language Repository of Switzerland, powered by SWISSUbase (youtube.com)

Regarding gap (7), this project will set up national CLARIN-CH working groups whose role will be to develop specific metadata for various types of linguistic data (e.g., sociolinguistics data, experimental psycholinguistics, neurolinguistics data, conversational analysis data, lexicography data, computational data, acquisitional data, multimodal data) that satisfy the FAIR principles. 

  • To develop specific metadata, a mixed approach was adopted: (1) a CLARIN-CH WG on Learner corpora, (2) individual discussions with researchers (e.g. for sign language data) and (3) the reuse of the CLARIN CMDI metadata schema.
  • Mapping of existing metadata schema on SWISSUbase to CLARIN CMDI profiles is ongoing
  • Adapting the metadata schema according to the needs and requirements of the linguistic community:
    • Challenge of balancing individual and more general needs/requirements: one additional focus that has emerged as very important is on functionality of SWISSUbase as a catalogue for data and also for services (linkage functionality only).
  • A survey to collect information about recommended/standard data formats has been prepared. After an initial pre-test with data experts working at LiRI, SWISSUbase and NCCR Evolving Language (April 2024), the survey is currently being circulated in the CLARIN-CH community (May-June 2024).

Regarding gap (8), this project will develop showcases, best practices and documentation for data management according to the FAIR principles (such as planning and writing data management plans, using sustainable file formats, setting solid file-naming conventions, finding data storage backup schemes, and creating informative metadata and documentation) and will promote ORD practices in the CLARIN-CH, the NCCR "Evolving Language" and other linguistic communities. Special attention will be given to finding solutions that overcome disciplinary boundaries.

Steering Committee / Governance

Dissemination

  1. Bubenhofer, N., Malits, A., Strebel, S., Graën, J., Buerli, S., & Grisot, C. (2023, December). Building and consolidating a FAIR-compliant ecosystem of infrastructures. In CLARIN Annual Conference Proceedings (pp. 95-99).
  2. Schaber, J., Graën, J., McDonald, D., Mustač, I., Rajović, N., Schneider, G., ... & Kontino, T. (2023, October). The LiRI Corpus Platform. In CLARIN Annual Conference Proceedings (pp. 145-149).
  3. Schaber, J., Graën, J., Mustač, I., Rajović, N., Schneider, G., Zehr, J., & Bubenhofer, N. Swissdox@LiRI – a large database of media articles made accessible to researchers. In CLARIN Annual Conference Proceedings (pp. 111-115).
  4. Poster at Open Access Week 2023 at UZH 

  5. Presentation at 2023 SWISSUbase Annual event at UZH (November 2023)

Further information

Call for participation

Are you a member of the Swiss scientific community working with language resources and you feel concerned about the topics addressed in this project?

Would you like to get involved?

Please drop an email to Cristina Grisot.

Common Language Resources and Technology Infrastructure

CLARIN – Common Language Resources and Technology Infrastructure – is a pan-European research infrastructure that aims to make all digital language resources and tools from across Europe accessible through a single sign-on online environment. Several Swiss academic institutions have expressed their intention to join CLARIN, first as an Observer and later as a Full member. To this end, they founded the CLARIN-CH consortium in 2020.