The tools provided by Jena University Language & Information Engineering Lab are licensed under the terms of the Common Public License, Version 1.0 or (at your option) any subsequent version. (See http://opensource.org/licenses/cpl1.0.php)
We appreciate any feedback on our tools. If you use them in a larger context, please leave us a short note and let us know whether you found them useful. Please acknowledge our lab if you use our tools in an application of your own.
The JULIE PUBMED Reader (a UIMA Collection Reader) reads PUBMED (the major bibliographic database for the biomedical domain) abstracts in XML format. These abstracts contain meta-information such as the title, the authors, publication information, and additionally -- in case of the manually curated abstracts -- a list of keywords, MeSH headings, and chemicals.
This information is stored in the type system (see our UIMA type system), and the abstract text is set as the document text for further NLP processing.
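The split between meta-information and document text described above can be sketched in a few lines of Python. This is only an illustration, not the reader's actual (Java/UIMA) implementation: the AbstractRecord class, the parse_pubmed_article function, and the sample record are hypothetical, and the sketch assumes the usual PubMed XML element names (MedlineCitation, ArticleTitle, AbstractText, Author, DescriptorName, NameOfSubstance).

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass, field

# Hypothetical toy record that follows the PubMed XML element layout.
SAMPLE = """<PubmedArticle>
  <MedlineCitation>
    <PMID>12345</PMID>
    <Article>
      <ArticleTitle>A toy title.</ArticleTitle>
      <Abstract><AbstractText>Toy abstract text.</AbstractText></Abstract>
      <AuthorList>
        <Author><LastName>Doe</LastName><ForeName>Jane</ForeName></Author>
      </AuthorList>
    </Article>
    <ChemicalList>
      <Chemical><NameOfSubstance>Water</NameOfSubstance></Chemical>
    </ChemicalList>
    <MeshHeadingList>
      <MeshHeading><DescriptorName>Humans</DescriptorName></MeshHeading>
    </MeshHeadingList>
  </MedlineCitation>
</PubmedArticle>"""


@dataclass
class AbstractRecord:
    """Hypothetical stand-in for the annotations a reader would create."""
    pmid: str
    title: str
    text: str  # plays the role of the document text
    authors: list = field(default_factory=list)
    mesh_headings: list = field(default_factory=list)
    chemicals: list = field(default_factory=list)


def parse_pubmed_article(xml_string):
    """Extract meta-information and abstract text from one PubmedArticle."""
    root = ET.fromstring(xml_string)
    citation = root.find("MedlineCitation")
    article = citation.find("Article")
    # An abstract may consist of several AbstractText sections.
    abstract_parts = ["".join(node.itertext()).strip()
                      for node in article.iter("AbstractText")]
    authors = [
        " ".join(p for p in (a.findtext("ForeName"), a.findtext("LastName")) if p)
        for a in article.iter("Author")
    ]
    return AbstractRecord(
        pmid=citation.findtext("PMID", default=""),
        title=article.findtext("ArticleTitle", default=""),
        text=" ".join(abstract_parts),
        authors=authors,
        mesh_headings=[d.text for d in citation.iter("DescriptorName")],
        chemicals=[c.text for c in citation.iter("NameOfSubstance")],
    )


record = parse_pubmed_article(SAMPLE)
print(record.title, record.authors, record.mesh_headings)
```

In a UIMA setting the fields of such a record would become annotations of the type system, while the text field would be set as the document text of the CAS.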
The JULIE Lab MUC7 Collection Reader (a UIMA Collection Reader) reads MUC7 data that can be purchased from the Linguistic Data Consortium (LDC, http://www.ldc.upenn.edu/). The MUC7 data must first be transformed from SGML into valid XML. The reader reads sections, paragraphs, all named entities, and coreferences. The MUC7 templates, which describe events and are only available in BNF, are not processed yet. The extracted information is stored in the type system (see our UIMA type system).
Version 1.1
The JULIE Lab ACE 2005 Collection Reader reads ACE data that can be purchased from the Linguistic Data Consortium (LDC, http://www.ldc.upenn.edu/). The reader reads all named entities, relations, and events as well as coreferences. The extracted information is stored in the type system (see our UIMA type system).
Version 2.0
The JULIE Lab Wikipedia Collection Reader reads in Wikipedia articles from a database, parses the raw wikitext, composes a cleansed document text and retains the original document structure in terms of UIMA annotations (appropriate annotation types are defined in our UIMA type system, version 2.6.8 or higher). The reader uses the Java Wikipedia Library (JWPL) Parser developed by the Ubiquitous Knowledge Processing Lab (TU Darmstadt) for parsing wikitext.
Version 0.2.4
The JULIE Lab Sentence Boundary Detector (JSBD) and the JULIE Lab Token Boundary Detector (JTBD) are machine learning-based tools developed and optimized for life science documents, which contain many tricky cases that other tools, especially rule-based ones, do not handle appropriately.
JSBD and JTBD are available as UIMA components and as stand-alone tools. Note: you don't need to download the stand-alone version if you are only interested in the UIMA version of these tools.
References:
The JULIE Lab Named Entity Tagger (JNET) is a generic and configurable multi-class named entity recognizer. Its comprehensive feature set allows JNET to be employed for most domains and entity classes.
JNET is available as a UIMA component and as a stand-alone tool. Note: you don't need to download the stand-alone version if you are only interested in the UIMA version.
Here we provide UIMA wrappers for some of the OpenNLP tools. The OpenNLP Tool Suite is a collection of NLP tools based on the OpenNLP Maxent package. It includes, among others, tools for sentence detection, tokenization, part-of-speech tagging, chunking, and full parsing. For the latest news and version of the OpenNLP Tool Suite, check out OpenNLP's project page at the Apache Software Foundation. In order to use the OpenNLP tools, you need models. Here we provide models built on biomedical corpora (the GENIA corpus and the PennBioIE CYP and PennBioIE Oncology corpora).
Reference:
Dependencies: opennlp.tools.v1.3.release
The JULIE Lab Acronym Annotator (JACRO) is a UIMA Analysis Engine that annotates local acronyms (i.e., those introduced in brackets) with their respective full form found in their neighborhood. This annotator is based on the acronym resolution algorithm presented by Schwartz and Hearst (2003).
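To illustrate the idea of resolving a bracketed acronym against its neighborhood, here is a minimal Python sketch in the spirit of the Schwartz and Hearst matching procedure. It is not JACRO's code: the function names are hypothetical, the short-form validity check is simplified, and only the pattern "full form (ACRONYM)" is handled. The core step walks through the acronym from right to left and looks for each character in the preceding text, requiring the first acronym character to start a word.

```python
import re


def find_best_long_form(short_form, candidate):
    """Right-to-left character matching of short_form against candidate.

    Returns the matched long form, or None if the characters of the
    short form cannot all be found in order.
    """
    s = len(short_form) - 1
    l = len(candidate) - 1
    while s >= 0:
        ch = short_form[s].lower()
        if not ch.isalnum():  # skip punctuation inside the short form
            s -= 1
            continue
        # Move left until the character matches; the first character of
        # the short form must additionally sit at the start of a word.
        while (l >= 0 and candidate[l].lower() != ch) or (
            s == 0 and l > 0 and candidate[l - 1].isalnum()
        ):
            l -= 1
        if l < 0:
            return None
        l -= 1
        s -= 1
    start = candidate.rfind(" ", 0, l + 1) + 1
    return candidate[start:]


def is_valid_short_form(short_form):
    """Simplified check: 2 to 10 characters, a letter, alphanumeric start."""
    return (
        2 <= len(short_form) <= 10
        and any(c.isalpha() for c in short_form)
        and short_form[0].isalnum()
    )


def find_local_acronyms(sentence):
    """Return (short form, long form) pairs for bracketed acronyms."""
    pairs = []
    for match in re.finditer(r"\(([^()]+)\)", sentence):
        short_form = match.group(1).strip()
        if not is_valid_short_form(short_form):
            continue
        # Search window: at most min(|A| + 5, 2 * |A|) preceding words.
        max_words = min(len(short_form) + 5, 2 * len(short_form))
        words = sentence[: match.start()].split()
        window = " ".join(words[-max_words:])
        long_form = find_best_long_form(short_form, window)
        if long_form:
            pairs.append((short_form, long_form))
    return pairs


print(find_local_acronyms("We trained a hidden Markov model (HMM) on the corpus."))
```

The window size limit keeps the search local, which is what makes the approach suitable for acronyms defined in the same sentence rather than for global dictionary lookup.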
References:
GENO is a highly competitive system for gene name normalization.
The Lucene CAS Indexer (LuCAS) is a UIMA CAS consumer that stores CAS data in a Lucene index. The consumer transforms the annotation objects of a CAS into Lucene token streams, which are stored in a Lucene document. Token streams can be processed further by token filters. LuCAS comes with a set of its own token filters and with integrations for some Lucene token filters. Furthermore, you can deploy your own token filters. The mapping between UIMA annotations and Lucene tokens, as well as the token filtering, is configured in an XML mapping file.
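The pipeline described above (annotations become token streams, token streams pass through filters, and the result is stored in document fields) can be pictured with a small Python sketch. It is a conceptual illustration only and uses neither the UIMA nor the Lucene API: the Annotation class, the filter functions, and the MAPPING dictionary are hypothetical stand-ins, with the dictionary playing the role of the XML mapping file.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Annotation:
    """Hypothetical stand-in for a UIMA annotation (type plus offsets)."""
    type_name: str
    begin: int
    end: int


def token_stream(text, annotations, type_name):
    """Yield the covered text of all annotations of one type, in order."""
    for ann in sorted(annotations, key=lambda a: a.begin):
        if ann.type_name == type_name:
            yield text[ann.begin:ann.end]


def lowercase_filter(tokens):
    """Token filter that lowercases every token."""
    for token in tokens:
        yield token.lower()


def stopword_filter(stopwords):
    """Build a token filter that drops the given stopwords."""
    def apply(tokens):
        for token in tokens:
            if token not in stopwords:
                yield token
    return apply


# Stand-in for the mapping file: annotation type -> (field, filter chain).
MAPPING = {
    "Token": ("text", [lowercase_filter, stopword_filter({"the", "of"})]),
    "Gene": ("genes", [lowercase_filter]),
}


def build_document(text, annotations, mapping):
    """Turn annotations into filtered token lists, one per index field."""
    document = {}
    for type_name, (field_name, filters) in mapping.items():
        stream = token_stream(text, annotations, type_name)
        for token_filter in filters:  # chain the filters in order
            stream = token_filter(stream)
        document[field_name] = list(stream)
    return document


text = "The role of BRCA1"
annotations = [
    Annotation("Token", 0, 3),
    Annotation("Token", 4, 8),
    Annotation("Token", 9, 11),
    Annotation("Token", 12, 17),
    Annotation("Gene", 12, 17),
]
print(build_document(text, annotations, MAPPING))
```

Keeping the mapping as data rather than code mirrors the design choice of a separate mapping file: the same consumer can index different annotation types by changing configuration only.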
Dependencies: Lucene
Download and Documentation: LuCAS is available through the UIMA sandbox. The official documentation is available in DocBook format from the UIMA sandbox SVN repository.
More information on request.