LiRI Service Description Data Management and Processing

Data Management

The Linguistic Research Infrastructure (LiRI), hosted at the Faculty of Arts and Social Sciences, provides technical and practical support for research projects in linguistics, the language sciences and language-related disciplines at the University of Zurich and beyond. LiRI services fall into three main categories:

Data acquisition: The LiRI lab has equipment and software for generating and collecting linguistic research data. The facilities can be used by scientists or other users interested in natural language data (i.e. audiovisual, textual and speech data), and by language researchers working on experimental data (e.g. eye-tracking, electroencephalographic, behavioral). The lab will be located at Zurich Nord, Andreasstrasse 15, 8050 (completion expected summer 2021). Some of the research equipment is suitable for use in the field and is available on loan.

Data management: LiRI can set up and manage customized virtual machine (VM) servers for processing, analysis, long-term storage and backup of linguistic research data. These VM servers are part of the ScienceCloud environment of the University of Zurich and are accessible both within the UZH network and worldwide. Additional software can be installed for specific research requirements. Server access can be customized so that, for example, data is available to a particular research group or specific users only. These data management services are not limited to researchers using the LiRI facilities to collect their data. Language scientists who wish to analyze, temporarily store or archive their self-collected data can also use the LiRI server infrastructure. From 2021 onwards, publication-ready data can be directly ingested from LiRI into the SWISSUBase infrastructure, the Swiss national center for research data. The data is then available to international research infrastructures such as CLARIN. Sensitive data, which may require no-publication or storage-only status, can also be stored at LiRI or SWISSUBase.

Collaboration and support: The Linguistic Research Infrastructure offers technical and methodological assistance to researchers on how best to collect, process and analyze the linguistic data they require. A project's infrastructure needs can be discussed in an initial consultation, which is free of charge. Depending on the complexity of the project, support can range from occasional guidance to LiRI managing all data acquisition and processing. LiRI staff members can also advise on project budgets and data management plans.

Text Data Processing

The LiRI team already provides expert support for data management and text processing, as well as for data acquisition (particularly for fieldwork projects). From October 2020 it will also be able to offer database engineering and data science support. The team includes experts in corpus linguistics, computational linguistics and data mining. They can not only provide a range of tools for natural language processing, but also have the programming skills to adapt existing tools or create new ones.

The following services already exist and can easily be tailored to customer requirements:

  • NLP corpus processing: We maintain a pipeline for natural language corpus processing, based on a UIMA Java framework, with components for automatic annotation of the data. Some of the components are generic, but most of them are language-specific. Available components include text and sentence segmentation, tokenization, part-of-speech tagging, language identification, morphological analysis, syntactic and dependency parsing, annotation of tense and time expressions, named entity recognition, and annotation of semantic classes. The main languages covered are German, French, English and Italian, including some historical varieties. The pipeline can be modified as required.
  • Data crawling: Expertise in web crawling and scraping, including the extraction of text and metadata from HTML.
  • Configuration of tools for automatic text analysis and classification (e.g. PoS annotation, semantic analysis).
  • Workflow tool for the management of linguistic annotation.
  • Access to specialized software for semiautomatic data transcription (e.g. spoken to written language).
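
To illustrate what extracting text and metadata from HTML can look like in practice, the following is a minimal sketch using only the Python standard library. It is not LiRI's actual crawling code; the class name TextAndMetaExtractor, the function extract and the sample page are hypothetical and were chosen for illustration only.

```python
from html.parser import HTMLParser


class TextAndMetaExtractor(HTMLParser):
    """Hypothetical helper: collects the page title, <meta name/content>
    pairs and the visible text of an HTML document."""

    SKIPPED = {"script", "style"}  # elements whose content is not running text

    def __init__(self):
        super().__init__()
        self.metadata = {}
        self.text_parts = []
        self._skip_depth = 0
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag in self.SKIPPED:
            self._skip_depth += 1
        elif tag == "title":
            self._in_title = True
        elif tag == "meta" and attrs.get("name") and attrs.get("content"):
            # Store metadata under a lower-cased key, e.g. "author".
            self.metadata[attrs["name"].lower()] = attrs["content"]

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1
        elif tag == "title":
            self._in_title = False

    def handle_data(self, data):
        stripped = data.strip()
        if not stripped or self._skip_depth:
            return  # ignore whitespace and script/style content
        if self._in_title:
            self.metadata["title"] = stripped
        else:
            self.text_parts.append(stripped)


def extract(html):
    """Return (metadata dict, visible text) for an HTML string."""
    parser = TextAndMetaExtractor()
    parser.feed(html)
    parser.close()
    return parser.metadata, " ".join(parser.text_parts)


# Invented sample page, used only to demonstrate the sketch.
SAMPLE = """<html><head><title>Beispielseite</title>
<meta name="author" content="A. Muster"><meta name="language" content="de">
<style>p { color: red; }</style></head>
<body><h1>Hallo Welt</h1><p>Dies ist ein <b>Test</b></p>
<script>var x = 1;</script></body></html>"""

metadata, text = extract(SAMPLE)
print(metadata)
print(text)
```

A production crawler would additionally need to handle fetching, character encodings, malformed markup and site-specific boilerplate removal, which this sketch deliberately leaves out.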

Expertise of the Team

Klaus Rothenhäusler: computational linguistics, linguistic data processing (crawling/scraping/analysis), server administration, programming (Java, UIMA, Python, C/C++, JavaScript, R, bash, XML/SGML, XHTML)

Stefan Bircher: server administration, data management, natural language processing, programming (Python, R, SQL, XML)

Johannes Graën (starting in October 2020): computational linguistics, databases, web GUIs, server administration, programming (SQL, Perl, PHP, Python, JavaScript, R, C, Java, Prolog)