NLP Toolsuite


Please note:

The contents of this page are still valid but rather old. We are currently moving, and will be for some time, from components downloadable from this page to more efficient publication channels: a public Nexus repository for compiled JAR packages and a GitHub profile for the open-source code.

Please find the JULIE Lab GitHub page here: https://github.com/JULIELab

All components offered on the GitHub page are also available as Maven artifacts from our Nexus repository. To make use of it, add the following repository to your pom.xml:

<repositories>
<repository>
<id>julie-nexus</id>
<name>JULIELab Public Repository</name>
<url>https://www.coling.uni-jena.de/nexus/content/groups/public-julie-components/</url>
</repository>
</repositories>

Then, you have access to all publicly available JULIE components. Please refer to the pom.xml files in the respective GitHub repositories for the current Maven coordinates.

License

The tools provided by the Jena University Language & Information Engineering Lab are licensed under the terms of the Common Public License, Version 1.0 or (at your option) any subsequent version. (See http://opensource.org/licenses/cpl1.0.php)

We appreciate any feedback on our tools. If you use the tools in a larger context, please leave us a short note and tell us whether you found them useful. Please acknowledge our lab if you use our tools in an application of yours.

General Information on downloadable tools

  • The JULIE NLP Toolsuite consists of a collection of NLP components. All our tools are written in Java and require at least Java 5.0 (1.5).
  • Our tools are UIMA components (Collection Readers, Analysis Engines, and CAS Consumers). They are available as UIMA PEAR packages, which typically include preconfigured descriptors, models, and other resources (if necessary), so you should be able to run them out of the box. The models should be understood as demo models; depending on your specific application, you should consider training a new model on your own training data (all components offer training facilities).
  • You may also adapt the component descriptor to your needs. Please refer to the component's documentation for information on the parameters that can be changed.
  • The JULIE UIMA type system is included in the packages and used by the components.
  • The source code is always included, so you can adapt the components to your needs.
  • For some of the tools, a stand-alone version is available that can be run independently of UIMA.

Important Note

  • Please note: due to type system incompatibilities, you must not combine components of a new version with components of an older version in a single pipeline (e.g. a CPE)! We therefore offer only the most recent components on our website (these are all compatible with each other). If you need an older component, please contact us.


UIMA Collection Readers

UIMA PUBMED Collection Reader

The JULIE PUBMED Reader (a UIMA Collection Reader) reads PUBMED (the major bibliographic database for the biomedical domain) abstracts in XML format. These abstracts contain meta-information such as the title, the authors, publication information, and additionally -- in case of the manually curated abstracts -- a list of keywords, MeSH headings, and chemicals.

This information is stored in the type system (see our UIMA type system); the abstract text is set as the document text for further NLP processing.
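To make the input format concrete, here is a minimal sketch of pulling the title, abstract text, and MeSH headings out of a PubMed-style XML record with the standard Java DOM API. It is not the reader's implementation. The class name, the helper names, and the toy record are invented for illustration, and the element names (ArticleTitle, AbstractText, DescriptorName) are assumed from the publicly documented PubMed XML format rather than taken from this page.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

/** Illustrative only: not the JULIE PUBMED Reader's code. */
final class PubmedRecordSketch {

    /** Hand-written toy record, not real PubMed data. */
    static final String SAMPLE =
          "<MedlineCitation>"
        + "<PMID>0</PMID>"
        + "<Article>"
        + "<ArticleTitle>A toy title</ArticleTitle>"
        + "<Abstract><AbstractText>Toy abstract text.</AbstractText></Abstract>"
        + "</Article>"
        + "<MeshHeadingList>"
        + "<MeshHeading><DescriptorName>Humans</DescriptorName></MeshHeading>"
        + "</MeshHeadingList>"
        + "</MedlineCitation>";

    /** Parses an XML string into a DOM document; wraps checked exceptions. */
    static Document parse(String xml) {
        try {
            DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
            return builder.parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse XML", e);
        }
    }

    /** Returns the trimmed text of every element with the given tag name. */
    static List<String> textsOf(Document doc, String tagName) {
        NodeList nodes = doc.getElementsByTagName(tagName);
        List<String> texts = new ArrayList<>();
        for (int i = 0; i < nodes.getLength(); i++) {
            texts.add(nodes.item(i).getTextContent().trim());
        }
        return texts;
    }

    public static void main(String[] args) {
        Document doc = parse(SAMPLE);
        // The abstract would become the document text; the rest is metadata.
        String documentText = String.join(" ", textsOf(doc, "AbstractText"));
        System.out.println("document text: " + documentText);
        System.out.println("title:         " + textsOf(doc, "ArticleTitle"));
        System.out.println("MeSH headings: " + textsOf(doc, "DescriptorName"));
    }
}
```

In the actual reader, the metadata ends up in type system annotations rather than in plain lists; the sketch only shows which parts of a record play which role.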

Documentation and Download

UIMA MUC7 Collection Reader

The JULIE Lab MUC7 Collection Reader (a UIMA Collection Reader) reads MUC7 data, which can be purchased from the Linguistic Data Consortium (LDC, http://www.ldc.upenn.edu/). The MUC7 data must first be converted from SGML into valid XML. The reader reads sections, paragraphs, all named entities, and coreferences. The MUC7 templates, which describe events and are only available in BNF, are not processed (yet). The extracted information is stored in the type system (see our UIMA type system).

Version 1.1

Documentation and Download

UIMA ACE 2005 Collection Reader

The JULIE Lab ACE 2005 Collection Reader reads ACE data, which can be purchased from the Linguistic Data Consortium (LDC, http://www.ldc.upenn.edu/). The reader reads all named entities, relations, and events as well as coreferences. The extracted information is stored in the type system (see our UIMA type system).

Version 2.0

Wikipedia Collection Reader

The JULIE Lab Wikipedia Collection Reader reads in Wikipedia articles from a database, parses the raw wikitext, composes a cleansed document text and retains the original document structure in terms of UIMA annotations (appropriate annotation types are defined in our UIMA type system, version 2.6.8 or higher). The reader uses the Java Wikipedia Library (JWPL) Parser developed by the Ubiquitous Knowledge Processing Lab (TU Darmstadt) for parsing wikitext.

Version 0.2.4

Documentation and Download

UIMA Analysis Engines

JULIE Lab Sentence Splitter and Tokenizer

The JULIE Lab Sentence Boundary Detector (JSBD) and the JULIE Lab Token Boundary Detector (JTBD) are machine learning-based tools, developed and optimized for life science documents. Such documents contain many tricky cases that many other tools, especially rule-based ones, do not handle appropriately.
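To illustrate the kind of case a simple rule gets wrong, the sketch below applies a naive splitting rule (split after ".", "!" or "?" when whitespace and an upper-case letter follow) to an invented sentence pair. The rule, the class name, and the example text are assumptions made for illustration; they are not taken from JSBD or from any particular rule-based tool.

```java
import java.util.Arrays;
import java.util.List;

/** Illustrative only: a deliberately naive rule-based sentence splitter. */
final class NaiveSplitterSketch {

    /** Splits after '.', '!' or '?' when whitespace and an upper-case letter follow. */
    static List<String> split(String text) {
        return Arrays.asList(text.split("(?<=[.!?])\\s+(?=[A-Z])"));
    }

    public static void main(String[] args) {
        // Two sentences: the first ends after "Typhimurium.", the second starts with "p53".
        String text = "Cells were infected with S. Typhimurium. p53 levels rose.";
        for (String sentence : split(text)) {
            System.out.println("[" + sentence + "]");
        }
        // prints:
        // [Cells were infected with S.]
        // [Typhimurium. p53 levels rose.]
    }
}
```

The rule makes two mistakes here: it splits inside the abbreviated species name, and it misses the real boundary because the next sentence starts with a lower-case gene name.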

JSBD and JTBD are available as UIMA components and as stand-alone tools. Note: you don't need to download the stand-alone version if you are only interested in the UIMA version of these tools.

References:

  • Katrin Tomanek, Joachim Wermter, and Udo Hahn. A reappraisal of sentence and token splitting for life science documents. In MEDINFO 2007 - Proceedings of the 12th World Congress on Medical Informatics. (to appear)

Documentation and Download

JULIE Lab Named Entity Tagger

The JULIE Lab Named Entity Tagger (JNET) is a generic and configurable multi-class named entity recognizer. Its comprehensive feature set allows JNET to be employed for most domains and entity classes.

JNET is available as UIMA component and as stand-alone tool. Note: you don't need to download the stand-alone version if you are only interested in the UIMA version.

OpenNLP UIMA Wrappers

Here we provide UIMA wrappers for some of the OpenNLP tools. The OpenNLP Tool Suite is a collection of NLP tools based on the OpenNLP Maxent package. It includes, among others, tools for sentence detection, tokenization, part-of-speech tagging, chunking, and full parsing. For the latest news and versions of the OpenNLP Tool Suite, check OpenNLP's project page at the Apache Software Foundation. To use the OpenNLP tools, you need models. We provide models built on biomedical corpora (the GENIA corpus and the PennBioIE CYP and PennBioIE Oncology corpora).

Reference:

  • Buyko et al.: Automatically Adapting an NLP Core Engine to the Biology Domain. BioLink & Bio-Ontologies SIG Meeting 2006, Fortaleza, Brazil.

Dependencies: opennlp.tools.v1.3.release

Documentation and Download

NOTE! It is advisable to use either only PennBioIE models or only GENIA models in a pipeline!

  • New versions (2.0) of the UIMA wrappers are available now! Please use the julie-morpho-syntax-types.xml type system (2.0).

JULIE Lab Acronym Annotator

The JULIE Lab Acronym Annotator (JACRO) is a UIMA Analysis Engine that annotates local acronyms (i.e., those introduced in brackets) with their respective full forms found in their neighborhood. This annotator is based on the acronym resolution algorithm presented by Schwartz and Hearst (2003).
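The core matching step of the Schwartz and Hearst approach can be sketched as follows: the characters of the short form are matched right to left against the text preceding the bracket, and the first character of the short form must match the beginning of a word. This is a simplified adaptation of the routine as we recall it from the published paper, not JACRO's code. The class and method names are hypothetical, and the paper's additional checks on candidate short and long forms are omitted.

```java
import java.util.Optional;

/** Illustrative only: simplified Schwartz-and-Hearst-style long form matching. */
final class AcronymSketch {

    /**
     * Finds the shortest suffix of {@code candidate} (the text before the
     * bracket) that contains the short form's characters in order, with the
     * first short form character at the start of a word.
     */
    static Optional<String> findBestLongForm(String shortForm, String candidate) {
        int lIndex = candidate.length() - 1;
        for (int sIndex = shortForm.length() - 1; sIndex >= 0; sIndex--) {
            char current = Character.toLowerCase(shortForm.charAt(sIndex));
            if (!Character.isLetterOrDigit(current)) {
                continue; // ignore hyphens and the like in the short form
            }
            // Move left until the character matches; for the first short form
            // character, additionally require a word-initial position.
            while ((lIndex >= 0
                        && Character.toLowerCase(candidate.charAt(lIndex)) != current)
                    || (sIndex == 0 && lIndex > 0
                        && Character.isLetterOrDigit(candidate.charAt(lIndex - 1)))) {
                lIndex--;
            }
            if (lIndex < 0) {
                return Optional.empty(); // ran out of text: no long form found
            }
            lIndex--;
        }
        // Extend to the start of the word in which the last match occurred.
        int start = candidate.lastIndexOf(' ', lIndex) + 1;
        return Optional.of(candidate.substring(start));
    }

    public static void main(String[] args) {
        // Text before "(HSF)" in an invented sentence.
        String before = "the heat shock transcription factor";
        System.out.println(findBestLongForm("HSF", before).orElse("<no match>"));
        // prints: heat shock transcription factor
    }
}
```

Note that the "h" of "shock" is skipped because it is not word-initial, so the match correctly extends back to "heat".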

References:

  • Ariel S. Schwartz and Marti A. Hearst: A simple algorithm for identifying abbreviation definitions in biomedical texts. In Proceedings of the Pacific Symposium on Biocomputing; 2003.

Documentation and Download

GeNo

GeNo is a highly competitive system for gene name normalization.

UIMA CAS Consumers

UIMA Lucene CAS Indexer (LuCAS)

The Lucene CAS Indexer (LuCAS) is a UIMA CAS Consumer that stores CAS data in a Lucene index. The consumer transforms the annotation objects of a CAS into Lucene token streams, which are stored in a Lucene document. Token streams can be processed further by token filters. LuCAS comes with a set of its own token filters and with integrations for some Lucene token filters; you can also deploy your own token filters. The mapping between UIMA annotations and Lucene tokens, as well as the token filtering, is configured in an XML mapping file.
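The idea of turning annotations into a filtered token stream can be sketched in plain Java. This is a conceptual illustration only: it uses neither the UIMA nor the Lucene API, it does not reflect how LuCAS or its mapping file work internally, and every name in it (Span, tokensOf, applyFilters, the two filters) is hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.function.UnaryOperator;

/** Conceptual sketch only: uses neither the UIMA nor the Lucene API. */
final class AnnotationTokenSketch {

    /** Hypothetical stand-in for an annotation: a typed character span. */
    static final class Span {
        final String type;
        final int begin;
        final int end;

        Span(String type, int begin, int end) {
            this.type = type;
            this.begin = begin;
            this.end = end;
        }
    }

    /** Hypothetical filter: lower-cases each token. */
    static final UnaryOperator<String> LOWERCASE = t -> t.toLowerCase(Locale.ROOT);

    /** Hypothetical filter: drops tokens shorter than two characters (null = drop). */
    static final UnaryOperator<String> DROP_SHORT = t -> t.length() < 2 ? null : t;

    /** Emits one token (the covered text) per span of the requested type. */
    static List<String> tokensOf(String text, List<Span> spans, String type) {
        List<String> tokens = new ArrayList<>();
        for (Span span : spans) {
            if (span.type.equals(type)) {
                tokens.add(text.substring(span.begin, span.end));
            }
        }
        return tokens;
    }

    /** Runs every token through the filters in order; null drops the token. */
    static List<String> applyFilters(List<String> tokens,
                                     List<UnaryOperator<String>> filters) {
        List<String> out = new ArrayList<>();
        for (String token : tokens) {
            String current = token;
            for (UnaryOperator<String> filter : filters) {
                if (current == null) {
                    break;
                }
                current = filter.apply(current);
            }
            if (current != null) {
                out.add(current);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        String text = "IL-2 activates T cells";
        List<Span> spans = Arrays.asList(
            new Span("Token", 0, 4), new Span("Token", 5, 14),
            new Span("Token", 15, 16), new Span("Token", 17, 22),
            new Span("Entity", 0, 4));
        List<String> tokens = applyFilters(
            tokensOf(text, spans, "Token"), Arrays.asList(LOWERCASE, DROP_SHORT));
        System.out.println(tokens); // prints [il-2, activates, cells]
    }
}
```

In this sketch, choosing the annotation type and the filter chain in code plays the role that the XML mapping file plays in LuCAS.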

Dependencies: Lucene

Download and Documentation: LuCAS is available through the UIMA sandbox. The official documentation is available in DocBook format from the UIMA sandbox SVN repository.

CAS2IOB Consumer

More information on request.

JULIE Lab DB Mapper

More information on request.