A central pillar of LiRI is offering corpus linguistic services and consultation: corpus creation, automatic annotation of corpus data, interfaces for manual annotation, management of corpora, and research tools. Over time, we realized that users particularly want richly annotated corpus data to be accessible via easy-to-use interfaces that nevertheless allow for complex linguistic queries. Similar tools do exist, but each comes with its own limitations and leaves this need unmet. With the LiRI Corpus Platform we aim to close this gap by offering three powerful access points, all supported by the same backend and governed by the same design principles.
Each interface is optimized for the set of modalities it supports:
CatchPhrase: Analyze and visualize text data.
SoundScript: Analyze and visualize text and audio data.
VideoScope: Analyze and visualize text, audio and video data.
Headed by Prof. Dr. Bubenhofer, the project received initial funding in 2021 through TPF and swissuniversities grants (specifically within the UpLORD project and CHORD Track B). This enabled the establishment of both LCP (now CatchPhrase) and VIAN-DH (now VideoScope).
The common basis of these tools is a database capable of efficiently managing richly annotated data, anchored both to character strings and to timestamps. With CatchPhrase, it is possible to search large text corpora with any level of annotation (parts of speech, syntactic structures, metadata, etc.) and obtain statistical outputs (e.g., collocation profiles, distribution of hits across metadata, correlations). VideoScope makes transcribed videos (typically conversations) searchable in the same way; it also offers automatic transcription and gesture recognition in the video data through machine learning methods. For those who choose to focus on the audio layer of a video, or who analyze, e.g., speech corpora, SoundScript is the most efficient option.
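To make the dual anchoring mentioned above concrete, here is a minimal illustrative sketch, not the actual LCP schema: an annotation unit that can be tied to character offsets (text), to timestamps (audio/video), or to both, so the same query logic can serve all three interfaces. The names `Annotation`, `char_span`, and `time_span` are hypothetical.

```python
# Minimal illustrative sketch (not the actual LCP data model): an annotation
# that can be anchored by character offsets, by timestamps, or by both.
from dataclasses import dataclass
from typing import Optional, Tuple

@dataclass
class Annotation:
    label: str                                        # e.g. a part-of-speech tag
    form: str                                         # surface string, e.g. the token itself
    char_span: Optional[Tuple[int, int]] = None       # (start, end) offsets in the transcript
    time_span: Optional[Tuple[float, float]] = None   # (start, end) in seconds of the recording

def overlaps_time(a: Annotation, start: float, end: float) -> bool:
    """Check whether a time-anchored annotation falls inside a given time window."""
    if a.time_span is None:
        return False
    s, e = a.time_span
    return s < end and e > start

# Hypothetical usage: a token aligned both to the transcript and to the recording.
token = Annotation(label="NOUN", form="Korpus", char_span=(120, 126), time_span=(42.3, 42.9))
print(overlaps_time(token, 40.0, 45.0))  # True
```

The actual platform of course handles far richer annotation layers; the sketch only illustrates why one backend can answer queries over text, audio, and video corpora alike.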
In 2023, the project again successfully applied for funding. In cooperation with TPF SARI, we are now working on further developing both tools into a common Linguistic Data Environment: an innovative, multimodal, powerful yet easy-to-use tool. This presents an opportunity to combine strongly quantitative corpus linguistics with the more qualitatively oriented EMCA (Ethnomethodological Conversation Analysis) and multimodal linguistics, extending into Digital Visual Studies and the Digital Humanities. To date, no tools allow this connection, as the respective research communities use partial solutions entrenched in their own research logics and traditions.
LCP is already part of CLARIN-CH and is distributed as a service through the swissuniversities-funded project UpLORD. Currently, corpora distributed across Switzerland that are unavailable, poorly available, not sustainably available, or available only in proprietary systems are being identified, and their holders are invited to publish them in the LCP. There is great interest in this, especially since an interface between LCP and LaRS/SWISSUbase for long-term data archiving is being developed, creating an ecosystem for linguistic data that promotes Open Research Data and long-term archiving. The development of this ecosystem is being followed closely at the EU-CLARIN level (see presentations at the CLARIN annual conference 2023).
For more on the technical background of LCP, please refer to the following paper: The LiRI Corpus Platform: CLARIN-CH conference proceedings (PDF), 2023.