A central pillar of LiRI is offering corpus linguistic services and consultation: corpus creation, automatic annotation of corpus data, interfaces for manual annotation, management of corpora, and research tools. Over time, we realized that users particularly want richly annotated corpus data to be accessible via easy-to-use interfaces that nevertheless allow for complex linguistic queries. Similar tools do exist, but each comes with its own limitations and leaves this need unmet. With the LiRI Corpus Platform we aim to close this gap by offering three powerful access points, all supported by the same backend and governed by the same design principles.
Each interface is optimized for the set of modalities it supports:
CatchPhrase: Analyze and visualize text data.
SoundScript: Analyze and visualize text and audio data.
VideoScope: Analyze and visualize text, audio and video data.
Headed by Prof. Dr. Bubenhofer, the project received initial funding in 2021 through TPF and swissuniversities grants (specifically within the UpLORD project and CHORD Track B). This enabled the establishment of both LCP (now CatchPhrase) and VIAN-DH (now VideoScope).
The common basis of these tools is a database capable of efficiently managing richly annotated data, anchored both to character strings and to timestamps. With CatchPhrase, it is possible to search large text corpora with any level of annotation (parts of speech, syntactic structures, metadata, etc.) and obtain statistical outputs (e.g., collocation profiles, distribution of hits across metadata, correlations). VideoScope makes transcribed videos (typically conversations) searchable in the same way; it also offers automatic transcription and gesture recognition in the video data through machine learning methods. For those who choose to focus on the audio layer of a video, or who analyze, e.g., speech corpora, SoundScript is the most efficient option.
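To make the dual anchoring mentioned above concrete, here is a minimal illustrative sketch, not the actual LCP schema: an annotation unit that can be tied to character offsets (text), to timestamps (audio/video), or to both, so the same query logic can serve all three interfaces. The names `Annotation`, `char_span`, and `time_span` are hypothetical.

```python
# Minimal illustrative sketch (not the actual LCP data model): an annotation
# that can be anchored by character offsets, by timestamps, or by both.
from dataclasses import dataclass
from typing import Optional, Tuple

@dataclass
class Annotation:
    label: str                                        # e.g. a part-of-speech tag
    form: str                                         # surface string, e.g. the token itself
    char_span: Optional[Tuple[int, int]] = None       # (start, end) offsets in the transcript
    time_span: Optional[Tuple[float, float]] = None   # (start, end) in seconds of the recording

def overlaps_time(a: Annotation, start: float, end: float) -> bool:
    """Check whether a time-anchored annotation falls inside a given time window."""
    if a.time_span is None:
        return False
    s, e = a.time_span
    return s < end and e > start

# Hypothetical usage: a token aligned both to the transcript and to the recording.
token = Annotation(label="NOUN", form="Korpus", char_span=(120, 126), time_span=(42.3, 42.9))
print(overlaps_time(token, 40.0, 45.0))  # True
```

The actual platform of course handles far richer annotation layers; the sketch only illustrates why one backend can answer queries over text, audio, and video corpora alike.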
In 2023, the project again successfully applied for funding. In cooperation with TPF SARI, we are now working on further developing both tools into a common Linguistic Data Environment: an innovative, multimodal, powerful yet easy-to-use tool. This presents an opportunity to combine strongly quantitative corpus linguistics with the more qualitatively oriented EMCA (Ethnomethodological Conversation Analysis) and multimodal linguistics, extending into Digital Visual Studies and the Digital Humanities. To date, no tools allow this connection, as the respective research communities use partial solutions entrenched in their own research logics and traditions.
LCP is already part of CLARIN-CH and is distributed as a service through the swissuniversities-funded project UpLORD. Currently, corpora distributed across Switzerland that are unavailable, poorly available, not sustainably available, or available only in proprietary systems are being identified, and their holders are invited to publish them in the LCP. There is great interest in this, especially since an interface between LCP and LaRS/SWISSUbase for long-term data archiving is being developed, creating an ecosystem for linguistic data that promotes Open Research Data and long-term archiving. The development of this ecosystem is being followed closely at the EU-CLARIN level (see presentations at the CLARIN annual conference 2023).
For more on the technical background of LCP, please refer to the following paper: The LiRI Corpus Platform: CLARIN-CH conference proceedings (PDF), 2023.