The FSU PRotein GEne corpus was developed at the JULIE Lab Jena under supervision of Prof. Udo Hahn.
The executing scientist was Dr. Joachim Wermter.
The main annotator was Dr. Rico Pusch, an expert in biology.
The corpus was developed in the context of the StemNet project.

The goals of the annotation project were

  •  to construct a consistent and (as far as possible) subdomain-independent/-comprehensive protein-annotated corpus
  •  to differentiate between protein families and groups, protein complexes, protein molecules, protein variants (e.g. alleles) and elliptic enumerations of proteins.

The corpus has the following annotation levels / entity types:

  • protein
  • protein_familiy_or_group
  • protein_complex
  • protein_variant
  • protein_enum

For definitions of the annotation levels, please refer to the Proteins-guidelines-final.doc file that should be found in the same archive as this readme.

To achieve broad coverage of biological subdomains, documents from several other protein/gene corpora were reannotated. For further coverage, new document sets were created. All documents are abstracts from PubMed/MEDLINE. The corpus is the union of all documents in the different subcorpora. Each subcorpus is stored in its own directory as follows:

All documents are delivered as MMAX2 annotation projects.
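
As a rough illustration of how such stand-off annotations can be read, the following Python sketch resolves MMAX2-style markable spans against a token list. It is only a sketch under assumptions: the sample XML is hand-written and not taken from the corpus, the attribute name `entity_type` is hypothetical (the attribute that actually carries the entity type should be checked in the delivered files), and the helper names `expand_span` and `read_markables` are made up for this example.

```python
import xml.etree.ElementTree as ET

# Hand-written sample data, NOT taken from the corpus.
WORDS_XML = """<words>
<word id="word_1">The</word>
<word id="word_2">IL-2</word>
<word id="word_3">receptor</word>
<word id="word_4">activates</word>
<word id="word_5">STAT5</word>
</words>"""

# "entity_type" is a hypothetical attribute name used for illustration only.
MARKABLES_XML = """<markables xmlns="www.eml.org/NameSpaces/proteins">
<markable id="markable_1" span="word_2..word_3" entity_type="protein"/>
<markable id="markable_2" span="word_5" entity_type="protein"/>
</markables>"""


def expand_span(span):
    """Expand a span such as 'word_2..word_3,word_5' into a list of word ids."""
    ids = []
    for part in span.split(","):
        if ".." in part:
            start, end = part.split("..")
            prefix, first = start.rsplit("_", 1)
            last = end.rsplit("_", 1)[1]
            ids.extend(f"{prefix}_{i}" for i in range(int(first), int(last) + 1))
        else:
            ids.append(part)
    return ids


def local_name(tag):
    """Strip an XML namespace prefix of the form '{namespace}name'."""
    return tag.rsplit("}", 1)[-1]


def read_markables(words_xml, markables_xml, type_attr="entity_type"):
    """Return (entity type, covered text) pairs for every markable."""
    words = {w.get("id"): w.text for w in ET.fromstring(words_xml).iter("word")}
    result = []
    for elem in ET.fromstring(markables_xml).iter():
        if local_name(elem.tag) != "markable":
            continue
        text = " ".join(words[i] for i in expand_span(elem.get("span")))
        result.append((elem.get(type_attr), text))
    return result


if __name__ == "__main__":
    for entity_type, text in read_markables(WORDS_XML, MARKABLES_XML):
        print(entity_type, text, sep="\t")
```

To process real files, the two XML strings would be replaced by the contents of a document's token file and of the markable file for the annotation level of interest.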

Corpus statistics:

sentences 36223
tokens 960757

Download the corpus here (10MB).