Unstructured Information Management Architecture (UIMA)

3rd UIMA@GSCL Workshop

September 23, 2013

Proceedings: http://ceur-ws.org/Vol-1038/

On the Workshop

For many decades, NLP has suffered from low software engineering standards, which limited both the reusability of code and the interoperability of different modules within larger NLP systems. While this did not really hamper success in limited task areas (such as implementing a parser), it caused serious problems for the emerging field of language technology, where the focus is on building complex integrated software systems, e.g., for information extraction or machine translation. This lack of integration has led to duplicated software development, work-arounds for programs written in different (versions of) programming languages, and ad-hoc tweaking of interfaces between modules developed at different sites.

In recent years, the Unstructured Information Management Architecture (UIMA) framework has been proposed as a middleware platform that offers integration by design through common type systems and standardized communication methods for components analysing streams of unstructured information, such as natural language. The UIMA framework offers a solid processing infrastructure that allows developers to concentrate on the implementation of the actual analytics components. An increasing number of members of the NLP community have thus adopted UIMA as a platform facilitating the creation of reusable NLP components that can be assembled to address different NLP tasks depending on their order, combination and configuration.

This workshop aims at bringing together members of the NLP community (users, developers, or providers of either UIMA components or UIMA-related tools) in order to explore and discuss the opportunities and challenges in using UIMA as a platform for modern, well-engineered NLP. Comparisons of UIMA with alternative frameworks (such as GATE, LingPipe, etc.) are of interest, too.

In the context of an active NLP-oriented UIMA community, the challenge of creating reusable and interoperable components raises particular interest. From a methodological perspective, interoperability relies largely on UIMA type systems. Technically, it includes issues related to the packaging and distribution of UIMA components. Also, tools are important, for example to assemble complex processing workflows, to manage the bodies of data that are to be analysed and to visualize, explore, and further deploy the analysis results. Interoperability is also affected by legal issues, such as potentially incompatible licenses of components and tools. Further challenges are involved in embedding UIMA analysis within applications or using it in distributed computing scenarios, such as deployment of and access to required resources. Finally, the preservation of analysis results, their provenance and reproducibility are of particular interest to the scientific user community.

Workshop topics include, but are not limited to:

processing of very large data collections: scale-out, parallelization, and performance optimization
advanced applications driven by UIMA
sophisticated tools to build and manage complex processing pipelines
analysis of results: exploration, evaluation, visualization, and statistical analysis
experience reports combining UIMA-based components from different sources, as well as solutions to interoperability issues
UIMA components with a special focus on genericity and type-system independence
repositories of ready-to-use UIMA-based components
(generic) type systems for UIMA
distribution of UIMA components: documentation, licensing and packaging
developing for UIMA: simplified APIs, debugging, unit testing, and limitations of UIMA

Program

from 08:30	Registration opens
09:30	Welcome and Introduction
09:30-10:00	Storing UIMA CASes in a relational database	Georg Fette, Martin Toepfer, Frank Puppe
10:00-10:30	Aid to spatial navigation within a UIMA annotation index	Nicolas Hernandez
10:30-11:00	Coffee break
11:00-11:30	A Model-driven approach to NLP programming with UIMA	Alessandro Di Bari, Alessandro Faraotti, Carmela Gambardella, Guido Vetere
11:30-12:00	Using UIMA to Structure An Open Platform for Textual Entailment	Tae-Gil Noh, Sebastian Padó
12:00-12:30	Bluima: a UIMA-based NLP Toolkit for Neuroscience	Renaud Richardet, Jean-Cédric Chappelier, Martin Telefont
12:30-14:00	Lunch break
14:00-15:00	Keynote: Apache clinical Text Analysis and Knowledge Extraction System (cTAKES)	Pei Chen, Guergana Savova
15:00-15:30	CSE Framework: A UIMA-based Distributed System for Configuration Space Exploration	Elmer Garduno, Zi Yang, Avner Maiberg, Collin McCormack, Yan Fang, Eric Nyberg
15:30-16:00	Coffee break
16:00-16:30	Constraint-driven Evaluation in UIMA Ruta	Andreas Wittek, Martin Toepfer, Georg Fette, Peter Kluegl, Frank Puppe
16:30-17:00	Sentiment Analysis and Visualization using UIMA and Solr	Carlos Rodríguez-Penagos, David García Narbona, Guillem Massó Sanabre, Jens Grivolla, Joan Codina
17:00-17:30	Extracting hierarchical data points and tables from scanned contracts	Jan Stadermann, Stephan Symons, Ingo Thon
17:30-18:00	Wrap-up

Keynote

Apache clinical Text Analysis and Knowledge Extraction System (cTAKES)

Abstract:

The presentation will focus on the methods and software development behind the cTAKES platform. An overview of the modules will set the stage, followed by a more in-depth discussion of the methods and evaluations of selected modules. The second part of the presentation will shift to software development topics such as optimization and distributed computing, including UIMA integration, UIMA-AS, as well as our plans for UIMA-DUCC integration. A live demo of cTAKES will wrap up the talk.

About the speaker:

Pei Chen is a Vice President of the Apache Software Foundation, leading the top-level cTAKES project (ctakes.apache.org). He is also a lead application development specialist at the Informatics Program at Boston Children’s Hospital/Harvard Medical School. Mr. Chen’s interests lie in building practical applications using machine learning techniques. He has a passion for the end-user experience and has a background in Computer Science/Economics. Mr. Chen is a firm believer in the open source community, contributing to cTAKES as well as other Apache Software Foundation projects. Details at (not fully up-to-date) http://childrenshospital.org/cfapps/research/data_admin/Site3240/mainpageS3240P0.html

Guergana Savova, Ph.D. is faculty at Harvard Medical School and Children’s Hospital Boston. Her research interest is in natural language processing (NLP), especially as applied to the text generated by physicians (the clinical narrative). She focuses on higher-level semantic and discourse processing, which includes topics such as named entity recognition, event recognition, and relation detection and classification, including co-reference and temporal relations. The methods are mostly machine learning, spanning supervised, lightly supervised and completely unsupervised approaches. Her interest is also in the application of NLP methodologies to biomedical use cases. Dr. Savova has been leading the development of cTAKES and is its principal architect. She holds a Master of Science in Computer Science and a PhD in Linguistics with a minor in Cognitive Science from the University of Minnesota. Details at (not fully up-to-date) http://childrenshospital.org/cfapps/research/data_admin/Site3240/mainpageS3240P0.html

Tutorial

This is the material from the UIMA tutorial held in conjunction with the 3rd UIMA@GSCL workshop.

Apache UIMA™, Apache uimaFIT™, and DKPro Core component collection
Apache UIMA Ruta™

Slides
Examples (this link is currently broken)

Submissions

We invite submissions of full papers, limited to 8 pages of text, and position papers or papers describing ongoing work as short papers, limited to 4 pages. System demonstration papers are also welcome (4 pages). Submitted papers must be original, i.e. not published in an earlier workshop, conference, or journal. Reviewing will not be anonymous, but authors wishing to keep their anonymity may have their identity hidden on request. Each submitted paper will be reviewed by three members of the program committee.

All submissions must be in English, follow the Springer LNCS style [1], and should be created using LaTeX. All papers must be submitted as PDF via EasyChair [2].

The one-day workshop will consist of oral presentations of accepted papers, with ample time set aside for discussion. The workshop will also include a keynote on Apache cTAKES, the Apache clinical Text Analysis and Knowledge Extraction System, which is also based on the UIMA framework.

Note that at least one author of each accepted paper must register and present the contribution. Accepted contributions are planned to appear as CEUR Workshop Proceedings (CEUR-WS.org).

[1] http://www.springer.com/computer/lncs?SGWID=0-164-7-72376-0

[2] https://www.easychair.org/conferences/?conf=uimagscl2013

Important Dates

July 26: Submission deadline (extended)
August 16: Notification of acceptance
August 31: Camera-ready deadline
September 23: Workshop held in Darmstadt in conjunction with GSCL

Organizers and Contact

Peter Klügl, Universität Würzburg
Richard Eckart de Castilho, TU Darmstadt
Katrin Tomanek, Averbis GmbH

Please address any inquiries regarding the workshop to: uima.gscl2013@gmail.com

Program Committee

Sophia Ananiadou, University of Manchester
Steven Bethard, KU Leuven
Ekaterina Buyko, Nuance Deutschland
Philipp Cimiano, Universität Bielefeld
Kevin Cohen, University of Colorado
Anni R. Coden, Thomas J. Watson Research Center
Richard Eckart de Castilho, TU Darmstadt
Frank Enders, Averbis GmbH
Nicolai Erbs, TU Darmstadt
Stefan Geissler, TEMIS
Thilo Götz, IBM Deutschland
Udo Hahn, FSU Jena
Nicolas Hernandez, University of Nantes
Michael Herweg, IBM Deutschland
Nancy Ide, Vassar College
Peter Klügl, Universität Würzburg
Eric Nyberg, Carnegie Mellon University
Kai Simon, Averbis GmbH
Michael Tanenblatt, Thomas J. Watson Research Center
Martin Toepfer, Universität Würzburg
Katrin Tomanek, Averbis GmbH
Karin Verspoor, National ICT Australia
Graham Wilcock, University of Helsinki
Torsten Zesch, University of Duisburg-Essen