In the past ten years, almost all of our systems have dealt with real-world documents from the domains of medicine and biology. In both areas, there is an unprecedented demand for natural language systems which automatically summarize documents, identify relevant ones, or extract facts from them. The challenge here is two-fold. On the one hand, medical and biological background knowledge must be made available in order to (even decently) understand what the documents are about. On the other hand, biomedical users want these systems to be operational on very large volumes of documents. Hence, tiny, fragile demonstrators do not help much. Rather, natural language processors with inherent robustness must be supplied, and their degree of success (or failure) must be assessed. For medical applications, we have been working on German-language clinical reports (mostly pathology reports), while for biology applications we deal with English-language Medline abstracts.
In this task, a natural language processor takes real-world documents as input and identifies relevant facts in these documents. Relevance can be defined a priori in terms of templates which enumerate the facets (slots) of interesting knowledge the user of a system really wants. An information extraction system then identifies those parts of a document which contain explicit fillers for these template slots. The information extraction problem can be rephrased as a text mining problem where people look at documents from a knowledge discovery perspective. In this textual data mining scenario, they aim at heuristic guidance to hypothesize or possibly find 'new' (i.e., non-redundant, novel, so far unknown) knowledge that is not explicitly expressed in the documents under scrutiny but must be mined, filtered and aggregated according to given informational goals (e.g., given two biological entities, say proteins, which direct and indirect interactions might they engage in?).
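To make the template metaphor concrete, the following is a minimal sketch of slot filling over a single surface pattern. The interaction template, the pattern, and the protein names are purely illustrative and not drawn from any of the systems described here.

```python
# A minimal sketch of template-based fact extraction; the template, the
# single surface pattern, and the protein names are purely illustrative.
import re
from dataclasses import dataclass

@dataclass
class InteractionTemplate:
    agent: str       # slot: the interacting entity
    trigger: str     # slot: the verb signalling the interaction
    target: str      # slot: its interaction partner

# one illustrative surface pattern: "<Protein> activates|binds|inhibits <Protein>"
PATTERN = re.compile(r"(?P<agent>[A-Z][A-Za-z0-9-]*)\s+"
                     r"(?P<trigger>activates|binds|inhibits)\s+"
                     r"(?P<target>[A-Z][A-Za-z0-9-]*)")

def extract(sentence):
    """Instantiate one template per pattern match; unmatched text is ignored."""
    return [InteractionTemplate(m["agent"], m["trigger"], m["target"])
            for m in PATTERN.finditer(sentence)]

print(extract("Raf-1 activates MEK1, and MEK1 binds ERK2."))
```

A real extraction system replaces the single regular expression with syntactic and semantic analysis, but the template-and-filler logic remains the same.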
We have developed two knowledge extraction systems for German-language documents, one dealing with test reports and product announcements from the information technology domain (itSynDiKATe), the other dealing with clinical reports from the medical domain (medSynDiKATe).
In this task, a natural language processor takes real-world documents as input and identifies, based on a user's query, a subset of relevant documents from a large document collection.
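As a minimal illustration of the generic retrieval task, the sketch below ranks documents against a query using plain tf-idf weights and cosine similarity; it stands in for the task definition only, not for the interlingua-based indexing that MorphoSaurus (described below) performs. The toy documents are illustrative.

```python
# A minimal sketch of query-driven retrieval with tf-idf weighting and
# cosine similarity; documents and query are toy examples.
import math
from collections import Counter

docs = ["gastric mucosa shows chronic inflammation",
        "protein interaction networks in yeast",
        "biopsy of the gastric antrum"]
tokenized = [d.lower().split() for d in docs]
n = len(docs)
df = Counter(w for toks in tokenized for w in set(toks))
idf = {w: math.log(n / df[w]) for w in df}
vectors = [{w: tf * idf[w] for w, tf in Counter(toks).items()} for toks in tokenized]

def cosine(u, v):
    dot = sum(u[w] * v.get(w, 0.0) for w in u)
    norm = (math.sqrt(sum(x * x for x in u.values()))
            * math.sqrt(sum(x * x for x in v.values())))
    return dot / norm if norm else 0.0

def retrieve(query, k=2):
    q = {w: idf[w] for w in query.lower().split() if w in idf}
    scored = sorted(((cosine(q, v), d) for v, d in zip(vectors, docs)), reverse=True)
    return scored[:k]

print(retrieve("gastric biopsy"))
```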
We have developed MorphoSaurus, a cross-language document retrieval system which responds to the special needs of medical practitioners who have to deal, on the one hand, with German-language documents (in their clinical environment) and, on the other hand, with (usually) English-language literature as far as access to the latest research findings is concerned. We have extended our usage scenario to the Portuguese language as well, and quite recently applied machine learning techniques to extend the coverage of MorphoSaurus to Spanish, Swedish and French.
In this task, a natural language processor takes real-world documents as input and supplies variably condensed summaries to prospective users. Rather than providing natural language summaries (i.e., generating natural language output), we have experimented with various ways to access summarized knowledge via interactive graphical interfaces (TopoGraphic). The core summarization engine, the TOPIC system, is an experimental platform for truly abstracting from the text analysis results stored in knowledge bases rather than extracting portions of original, literal text snippets (sentences or phrases), the latter currently being the prevailing approach to automatic summarization.
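For contrast, here is a minimal sketch of the extractive baseline that TOPIC deliberately departs from: sentences are scored by the frequency of their words and the top-ranked ones are returned verbatim. The scoring scheme is a hypothetical simplification.

```python
# A minimal sketch of extraction-based summarization, the prevailing approach
# contrasted with TOPIC's abstraction over knowledge bases: score each
# sentence by the average frequency of its words and return the top k verbatim.
from collections import Counter

def extractive_summary(text, k=2):
    sentences = [s.strip() for s in text.split(".") if s.strip()]
    freq = Counter(w.lower() for s in sentences for w in s.split())
    score = lambda s: sum(freq[w.lower()] for w in s.split()) / len(s.split())
    return sorted(sentences, key=score, reverse=True)[:k]

text = ("The gastric mucosa shows chronic inflammation. "
        "The inflammation is moderate. No malignancy is seen.")
print(extractive_summary(text, k=1))
```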
The core of syntactic analysis consists of assigning part-of-speech labels to words (tagging), grouping words into phrasal structures (chunking) and, increasingly, assigning to phrases descriptions of their inherent structural relationships (parsing).
In the past two years, we developed taggers and chunkers adapted to the special needs of the biomedical domain, achieving a high level of accuracy and relying on various statistical approaches (basically, Hidden Markov Models and derivatives). Part of this work was the development of the first German-language medical text corpus, composed of 100,000 POS-annotated text tokens from a large variety of medical text genres.
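The following sketch shows the decoding step at the heart of such an HMM tagger: Viterbi search over a bigram tag model. The toy transition and emission tables are hypothetical; in practice they are estimated from an annotated corpus such as the one described above.

```python
# A minimal Viterbi decoder for a bigram HMM tagger; the toy transition and
# emission probabilities are hypothetical stand-ins for corpus estimates.
import math

tags = ["ADJ", "NN"]
trans = {("<s>", "ADJ"): 0.4, ("<s>", "NN"): 0.6,
         ("ADJ", "ADJ"): 0.2, ("ADJ", "NN"): 0.8,
         ("NN", "ADJ"): 0.7, ("NN", "NN"): 0.3}
emit = {("ADJ", "gastric"): 0.5, ("ADJ", "mucosa"): 0.01,
        ("NN", "gastric"): 0.01, ("NN", "mucosa"): 0.4}

def viterbi(words):
    # delta[i][t] = (best log-probability of reaching tag t at word i, backpointer)
    delta = [{t: (math.log(trans[("<s>", t)] * emit.get((t, words[0]), 1e-6)), None)
              for t in tags}]
    for w in words[1:]:
        delta.append({t: max(
            (delta[-1][p][0] + math.log(trans[(p, t)] * emit.get((t, w), 1e-6)), p)
            for p in tags) for t in tags})
    best = max(tags, key=lambda t: delta[-1][t][0])
    path = [best]
    for d in reversed(delta[1:]):           # follow backpointers
        path.append(d[path[-1]][1])
    return list(reversed(path))

print(viterbi(["gastric", "mucosa"]))       # expected: ['ADJ', 'NN']
```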
In recent years, we developed the ParseTalk parser, a dependency-based full-coverage parser of German which served syntactic analysis tasks in the SynDiKATe environment.
Semantic interpretation deals with the mapping from syntactic structures to propositions. This semantic level of linguistic description has many varieties and may range from simple relational information (e.g., as available from thesauri or classification systems) to representations in terms of logical calculi (e.g., some type of predicate logic such as description logics).
In the natural language systems we have been developing, we have chosen frame-style description logics as the formal representation framework for deep understanding applications (such as the TOPIC summarization system or the SynDiKATe information extraction systems). For recent activities in the field of document retrieval, however, we have dealt with much simpler relational descriptions.
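As a minimal illustration of what such a frame-style target representation looks like, the sketch below maps the sentence 'The gastric mucosa shows a chronic inflammation' to instance and role assertions. The concept and role names are illustrative; the actual systems hand such assertions to a description logic reasoner, not a Python list.

```python
# A minimal sketch of a frame-style target representation: one sentence
# rendered as instance and role assertions; names are illustrative.
abox = []

def instance_of(individual, concept):
    abox.append((individual, "instance-of", concept))

def role(subject, relation, obj):
    abox.append((subject, relation, obj))

# semantic interpretation output for one sentence
instance_of("mucosa-1", "Gastric-Mucosa")
instance_of("inflammation-1", "Chronic-Inflammation")
role("mucosa-1", "has-finding", "inflammation-1")

for assertion in abox:
    print(assertion)
```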
While we developed a parsimonious interface that relates dependency structures to domain representations, we have also spent considerable effort on challenging issues such as the role of adjectives in comparatives and subjective utterances, as well as the role of metonymies in textual discourse.
Since we have always been dealing with real-world, large-sized full-text documents (test reports, product announcements, clinical reports, etc.), we had to cope with a variety of textual phenomena. At the local level of connectivity between (more or less) adjacent sentences, we adapted the centering model to account for phenomena such as pronominal anaphora ('The gastric mucosa. ... It [= the gastric mucosa] ...'), nominal anaphora ('The gastric mucosa. ... This mucosa [= the gastric mucosa] ...'), as well as bridging anaphora ('The gastric mucosa. ... The surface mucus [of the gastric mucosa] ...'). At the global level of text organization in terms of text coherence, we also dealt with various forms of rhetorical structure relations.
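The sketch below captures the computational core of the centering model we adapted: given ranked forward-looking center (Cf) lists for adjacent utterances, it determines the backward-looking center and the transition type. The Cf lists are given by hand here; a real system derives them from the parse.

```python
# A minimal sketch of centering transitions; utterances are represented by
# already ranked forward-looking center (Cf) lists. Entity names are illustrative.
def transition(cf_prev, cf_curr, cb_prev):
    # backward-looking center Cb: highest-ranked element of Cf(prev)
    # that is realized in the current utterance
    cb = next((e for e in cf_prev if e in cf_curr), None)
    cp = cf_curr[0] if cf_curr else None   # preferred center Cp
    if cb is None:
        return cb, "NO-CB"
    same_cb = cb_prev is None or cb == cb_prev
    if cb == cp:
        return cb, "CONTINUE" if same_cb else "SMOOTH-SHIFT"
    return cb, "RETAIN" if same_cb else "ROUGH-SHIFT"

utterances = [["gastric-mucosa"],                   # "The gastric mucosa ..."
              ["gastric-mucosa", "inflammation"]]   # "It shows an inflammation."
cb = None
for prev, curr in zip(utterances, utterances[1:]):
    cb, kind = transition(prev, curr, cb)
    print(cb, kind)   # gastric-mucosa CONTINUE
```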
The representation of the background knowledge (usually rooted in knowledge of the medical or biological domain) is based on formalisms which originated from early frames and semantic networks, and have since evolved into various formalizations within the framework of description logics. Our work used this formal framework for representing taxonomies (is-a relations) and partonomies (hierarchies of part-whole relations) and empirically adequate reasoning patterns (e.g., exceptions to transitivity in part-of hierarchies).
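A minimal sketch of the two hierarchy types and the kind of guarded reasoning meant here: is-a links are chained freely, while part-of transitivity can be blocked for enumerated exception pairs. The hierarchy fragment and the blocking mechanism are illustrative simplifications of the description logic formalization.

```python
# A minimal sketch of taxonomic (is-a) and partonomic (part-of) reasoning;
# the hierarchy fragment is illustrative.
isa = {"gastric-mucosa": "mucosa", "mucosa": "tissue"}
part_of = {"gastric-mucosa": "stomach", "stomach": "digestive-system",
           "finger": "pianist", "pianist": "orchestra"}
# the textbook counterexample to blind transitivity: a finger is part of a
# pianist, the pianist part of an orchestra, but the finger is not part of
# the orchestra
blocked = {("finger", "orchestra")}

def superclasses(c):
    while c in isa:
        c = isa[c]
        yield c

def transitive_part_of(part, whole):
    current = part
    while current in part_of and (current, whole) not in blocked:
        current = part_of[current]
        if current == whole:
            return True
    return False

print(list(superclasses("gastric-mucosa")))                      # ['mucosa', 'tissue']
print(transitive_part_of("gastric-mucosa", "digestive-system"))  # True
print(transitive_part_of("finger", "orchestra"))                 # False (blocked)
```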
The actual use of knowledge representation formalisms to represent specific domain knowledge leads to the building and maintenance of knowledge bases or ontologies. We pursued two major directions. In the first approach, we reused large portions of medical knowledge from the Unified Medical Language System (UMLS), a light-weight though high-coverage collection of medical thesauri, and transformed it automatically into a large description logic knowledge base (with approximately 240,000 concepts and relations). In the second approach, we extracted new concepts as a by-product of text understanding processes. Emergent concept hypotheses were assessed for plausibility by means of a so-called quality calculus which combined the hypotheses' conceptual and linguistic evidence.
Our approach to learning conceptual (and grammatical) knowledge is based on already available seed knowledge sources (ontologies, grammars), and learning results can be interpreted as a by-product of text understanding. Whenever lexical gaps are encountered, conceptual and grammatical hypotheses are formed, evaluated on the basis of their conceptual and linguistic evidence, refined as text understanding proceeds, ranked by combined plausibility, and, finally, selected. This task was achieved by inferences of the quality calculus, a learning system fully embedded into description logics and terminological reasoning.
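The following is a purely illustrative sketch of such evidence-based hypothesis ranking: competing concept hypotheses for an unknown lexical item accumulate labelled evidence, and the combined score determines the ranking. The quality labels and weights are hypothetical; the real calculus is realized as terminological inferences inside the description logic system, not as arithmetic.

```python
# A hypothetical sketch of hypothesis ranking in the spirit of the quality
# calculus; labels and weights are invented for illustration only.
from collections import defaultdict

WEIGHTS = {"apposition-match": 2.0,      # linguistic evidence (hypothetical)
           "case-frame-fit": 1.5,        # linguistic evidence (hypothetical)
           "taxonomy-consistent": 1.0}   # conceptual evidence (hypothetical)

evidence = defaultdict(list)
evidence["Morphological-Finding"] += ["apposition-match", "taxonomy-consistent"]
evidence["Anatomical-Structure"] += ["case-frame-fit"]

ranked = sorted(evidence, key=lambda h: sum(WEIGHTS[q] for q in evidence[h]),
                reverse=True)
print(ranked)   # best-supported hypothesis first
```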
Within the context of the MorphoSaurus cross-language document retrieval system, we have quite recently worked on the extension of our lexical resources by means of cognate mapping (e.g., mapping from Portuguese to Spanish or French, or from German and English to Swedish). Based on proper language alignment, we were also able to learn term indexing patterns from English and apply these patterns to the indexing of German- and Portuguese-language documents.
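A minimal sketch of cognate mapping via orthographic rewrite rules plus string similarity is given below; the rules and the threshold are hypothetical simplifications of the machine-learned mappings used to port the MorphoSaurus lexicons.

```python
# A minimal sketch of cognate mapping; rules and threshold are hypothetical.
from difflib import SequenceMatcher

RULES = [("ç", "z"), ("ão", "ón"), ("nh", "ñ")]   # illustrative pt -> es rules

def normalize(word):
    for src, tgt in RULES:
        word = word.replace(src, tgt)
    return word

def cognate(source, target, threshold=0.8):
    return SequenceMatcher(None, normalize(source), target).ratio() >= threshold

print(cognate("coração", "corazón"))   # Portuguese/Spanish 'heart' -> True
```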
Both in the field of general-language collocation discovery and in the area of domain-specific terminology mining, we have very recently worked on novel empirical methods to separate semantically and distributionally more rigid multi-word units from the vast majority of more common, non-specific ones.
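One standard building block for such a separation is an association measure over bigram and unigram counts; the sketch below uses plain pointwise mutual information with hypothetical counts. The actual work relies on richer empirical criteria, but the contrast it draws is the same.

```python
# A minimal sketch of collocation scoring with pointwise mutual information;
# corpus size and counts are hypothetical.
import math

N = 1_000_000                      # hypothetical corpus size in tokens
unigram = {"myocardial": 120, "infarction": 150, "big": 9000, "house": 4000}
bigram = {("myocardial", "infarction"): 110, ("big", "house"): 35}

def pmi(w1, w2):
    p_xy = bigram[(w1, w2)] / N
    p_x, p_y = unigram[w1] / N, unigram[w2] / N
    return math.log2(p_xy / (p_x * p_y))

print(pmi("myocardial", "infarction"))  # high: near-fixed terminological unit
print(pmi("big", "house"))              # low: free compositional combination
```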
Given the empirical orientation of our current research, we collect and develop text corpora along three dimensions. The first branch focuses on the collection of very large plain corpora with the following considerations in mind: language and genre classification, and language alignment (both comparable and parallel corpora).
The second branch is dedicated to linguistic annotation of German and English biomedical documents at various levels (sentence boundaries, tokenization, POS-tagging, named entities, phrases). These annotations are carried out both manually and semi-automatically (via active learning).
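A minimal sketch of the uncertainty-sampling loop behind such semi-automatic annotation is given below: a classifier trained on the labelled seed repeatedly asks the annotator for the instance it is least certain about. Classifier, features, and oracle are hypothetical placeholders; scikit-learn is used only for brevity.

```python
# A minimal sketch of active learning via uncertainty sampling; data,
# classifier, and the oracle standing in for the human annotator are
# hypothetical placeholders.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X_pool = rng.normal(size=(200, 5))                       # unlabelled feature vectors
y_true = (X_pool[:, 0] + X_pool[:, 1] > 0).astype(int)   # oracle = human annotator
# seed set with both classes represented
labelled = list(np.where(y_true == 0)[0][:5]) + list(np.where(y_true == 1)[0][:5])

for _ in range(5):
    clf = LogisticRegression().fit(X_pool[labelled], y_true[labelled])
    probs = clf.predict_proba(X_pool)[:, 1]
    # query the pool instance whose prediction is closest to 0.5 ...
    query = next(i for i in np.argsort(np.abs(probs - 0.5)) if i not in labelled)
    # ... and have the 'annotator' label it
    labelled.append(query)

print(f"labelled {len(labelled)} of {len(X_pool)} instances")
```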
The third branch is devoted to biomedical knowledge bases. We have set up a description logics ontology which covers the entire anatomy and pathology portion of the Unified Medical Language System (UMLS).