Read Me
Author: Artjom Klein klein.devel@gmail.com
This package provides two taggers:
1. DrugTagger - CRF-based, with a DrugBank presence feature (see the feature set below for details).
2. DrugNameGazetteer - gazetteer/dictionary-based. The dictionary is created from the DrugBank.ca database.
Both taggers include grounding/normalisation to DrugBank ids and standard names.
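To illustrate what dictionary-based tagging with grounding means, here is a minimal sketch in Python. It is not the package's implementation (that is the Java class DrugNameAnnieGazetteer). The dictionary entries, the function name tag_drugs and the longest-match strategy are assumptions made for this example only.

```python
import re

# Illustrative entries only; the real dictionary is built from DrugBank.
# Keys are lower-cased names, values are (DrugBank ID, standard name).
DRUG_DICTIONARY = {
    "aspirin": ("DB00945", "Acetylsalicylic acid"),
    "acetylsalicylic acid": ("DB00945", "Acetylsalicylic acid"),
    "acetaminophen": ("DB00316", "Acetaminophen"),
}

MAX_NAME_TOKENS = 3  # longest dictionary entry we try to match, in tokens


def tag_drugs(text):
    """Return (start, end, surface, drugbank_id, standard_name) tuples."""
    tokens = [(m.start(), m.end(), m.group()) for m in re.finditer(r"\w+", text)]
    annotations = []
    i = 0
    while i < len(tokens):
        found = None
        # Prefer the longest dictionary entry that starts at token i.
        for n in range(min(MAX_NAME_TOKENS, len(tokens) - i), 0, -1):
            key = " ".join(tok[2].lower() for tok in tokens[i:i + n])
            if key in DRUG_DICTIONARY:
                found = (n, key)
                break
        if found is None:
            i += 1
            continue
        n, key = found
        start, end = tokens[i][0], tokens[i + n - 1][1]
        drug_id, standard_name = DRUG_DICTIONARY[key]
        annotations.append((start, end, text[start:end], drug_id, standard_name))
        i += n
    return annotations


for annotation in tag_drugs("Patients took Aspirin with acetylsalicylic acid."):
    print(annotation)
```

Both surface forms in the example sentence are grounded to the same ID and standard name, which is the point of the normalisation step.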
Feature set:
Word, Word-1, Word+1, Word-1_Word, Word_Word+1, DrugBankPresence, POS
The DrugBankPresence feature indicates whether the word occurs as a drug name in DrugBank.
See feature vectors in data/ddi_train.iob and data/ddi_test.iob
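The feature set can be sketched as follows. This is only an illustration: the feature name strings (W=, W-1=, ...), the sentence boundary markers, the B-DRUG/O labels and the single-token DrugBank lookup are assumptions, and the actual vectors in data/ddi_train.iob may be written differently. The output follows the Mallet SimpleTagger input convention of one token per line, features separated by spaces and the label last.

```python
def token_features(words, pos_tags, drugbank_names, i):
    """Build the feature list for token i of one sentence."""
    word = words[i]
    prev_word = words[i - 1] if i > 0 else "<S>"              # hypothetical start marker
    next_word = words[i + 1] if i + 1 < len(words) else "</S>"  # hypothetical end marker
    features = [
        "W=" + word,                          # Word
        "W-1=" + prev_word,                   # Word-1
        "W+1=" + next_word,                   # Word+1
        "W-1_W=" + prev_word + "_" + word,    # Word-1_Word
        "W_W+1=" + word + "_" + next_word,    # Word_Word+1
        "POS=" + pos_tags[i],                 # POS
    ]
    # DrugBankPresence is only emitted when the word is a known drug name.
    if word.lower() in drugbank_names:
        features.append("DrugBankPresence")
    return features


def sentence_to_iob_lines(words, pos_tags, labels, drugbank_names):
    """One line per token: features followed by the label."""
    return [
        " ".join(token_features(words, pos_tags, drugbank_names, i) + [labels[i]])
        for i in range(len(words))
    ]


lines = sentence_to_iob_lines(
    ["Avoid", "aspirin", "daily"],
    ["VB", "NN", "RB"],
    ["O", "B-DRUG", "O"],
    {"aspirin"},
)
print("\n".join(lines))
```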
----------------------------------------------------------------------------------------
Tagging with the standalone jar
----------------------------------------------------------------------------------------
1. Install GATE (General Architecture for Text Engineering)
Download gate-7.1-build4485-BIN.zip from https://gate.ac.uk/download/
Unpack to any folder.
2. Download DrugExtractionStandalone.tar.gz from https://sourceforge.net/projects/drug-extraction/ (see Files) and unpack it.
3. Run the pipeline:
java -cp drug-extraction-jar-with-dependencies.jar com.micronlp.drug_extraction.Pipeline <PATH_TO_GATE_HOME> <PATH_TO_DIRECTORY_WITH_PDF_FILES> <PATH_TO_DIRECTORY_FOR_ANNOTATED_DOCUMENTS>
e.g.
java -cp target/drug-extraction-jar-with-dependencies.jar com.micronlp.drug_extraction.Pipeline ../../bin/GATE-7.1 test_data test_data/OUTPUT
Use the GATE Developer GUI to load the *.xml files from the output directory and view the annotations.
----------------------------------------------------------------------------------------
Development environment setup
----------------------------------------------------------------------------------------
1. Java 6.
Example links for downloading Java 6 for Windows (no Oracle registration required):
32bit: http://www.oldapps.com/java.php?old_java=8346
64bit: http://www.oldapps.com/java.php?old_java=8347
2. GATE (General Architecture for Text Engineering)
Download gate-7.1-build4485-BIN.zip from https://gate.ac.uk/download/
Unpack to any folder.
3. Mallet (MAchine Learning for LanguagE Toolkit)
Download mallet-2.0.7.tar.gz from http://mallet.cs.umass.edu/download.php
Unpack to the parent folder (next to the DrugExtraction project folder).
4. Download the training corpus (DrugDDI_Unified) and the test corpus (Test_Unified) from http://labda.inf.uc3m.es/DDIExtraction2011/dataset.html
Unpack them into the folder 'ddi-corpus' in the parent folder.
Final directory structure:
/<YOUR-WORKSPACE>
    /mallet-2.0.7
    /ddi-corpus
    /DrugExtraction
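The commands in the following sections use relative paths such as ../mallet-2.0.7 and ../ddi-corpus, so they depend on this layout. A small, optional sanity check could look like the sketch below. The function name missing_directories is made up for this example and is not part of the package.

```python
from pathlib import Path

# Folder names taken from the directory structure above.
EXPECTED_DIRS = ("mallet-2.0.7", "ddi-corpus", "DrugExtraction")


def missing_directories(workspace):
    """Return the expected sub-folders that do not exist under workspace."""
    root = Path(workspace)
    return [name for name in EXPECTED_DIRS if not (root / name).is_dir()]


if __name__ == "__main__":
    # Run from inside DrugExtraction: the workspace is the parent folder.
    missing = missing_directories("..")
    print("missing:", ", ".join(missing) if missing else "nothing")
```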
----------------------------------------------------------------------------------------
Prepare corpus, training and test data.
----------------------------------------------------------------------------------------
Convert original DrugDDI_Unified and Test_Unified to GATE XML format:
mvn exec:java -Dexec.mainClass=com.micronlp.drug_extraction.DDICorpusToGate -Dexec.args="../ddi-corpus/DrugDDI_Unified ../ddi-corpus/DrugDDI_Unified_GATEXML"
mvn exec:java -Dexec.mainClass=com.micronlp.drug_extraction.DDICorpusToGate -Dexec.args="../ddi-corpus/Test_Unified ../ddi-corpus/Test_Unified_GATEXML"
Create training data:
mvn exec:java -Dexec.mainClass=com.micronlp.drug_extraction.DDIGateToIob -Dexec.args="../ddi-corpus/DrugDDI_Unified_GATEXML data/ddi_train.iob"
Create test data:
mvn exec:java -Dexec.mainClass=com.micronlp.drug_extraction.DDIGateToIob -Dexec.args="../ddi-corpus/Test_Unified_GATEXML data/ddi_test.iob"
Fix (if needed): remove the consecutive blank lines in data/ddi_test.iob at line 31086 (or 31544).
Training:
java -cp "../mallet-2.0.7/class:../mallet-2.0.7/lib/mallet-deps.jar" cc.mallet.fst.SimpleTagger --train true --test lab --threads 16 --iterations 500 --model-file src/main/resources/mallet/ddi.model data/ddi_train.iob
Reported accuracy after training:
Training accuracy=0.980763482642299
Testing accuracy=0.963014047447269
Test/Application:
java -cp "../mallet-2.0.7/class:../mallet-2.0.7/lib/mallet-deps.jar" cc.mallet.fst.SimpleTagger --test lab --threads 16 --model-file src/main/resources/mallet/ddi.model data/ddi_test.iob
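The optional fix to data/ddi_test.iob mentioned above can be done by hand in an editor. As an alternative, the following sketch collapses every run of blank lines into a single one, so it does not depend on the exact line number. The function names are made up for this example. Keep a backup of the file before overwriting it.

```python
def collapse_blank_lines(lines):
    """Drop leading blank lines and collapse runs of blank lines into one."""
    cleaned = []
    previous_blank = True  # start of file counts as blank, so leading blanks are dropped
    for line in lines:
        is_blank = line.strip() == ""
        if is_blank and previous_blank:
            continue
        cleaned.append(line)
        previous_blank = is_blank
    return cleaned


def fix_iob_file(path):
    """Rewrite the IOB file at path in place with blank-line runs collapsed."""
    with open(path, encoding="utf-8") as handle:
        lines = handle.read().splitlines()
    with open(path, "w", encoding="utf-8") as handle:
        handle.write("\n".join(collapse_blank_lines(lines)) + "\n")


# Usage (assumes the path from this README):
# fix_iob_file("data/ddi_test.iob")
```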
----------------------------------------------------------------------------------------
Installation and packaging
----------------------------------------------------------------------------------------
Create index:
mvn compile exec:java -Dexec.mainClass=com.micronlp.drug_extraction.DrugBankIndexer
Test DrugTagger.java:
mvn exec:java -Dexec.mainClass=com.micronlp.drug_extraction.DrugTagger -Dexec.args="./test_data/test1.xml test1_result.xml"
mvn exec:java -Dexec.mainClass=com.micronlp.drug_extraction.DrugTagger -Dexec.args="./test_data/test2.xml test2_result.xml"
Test DrugNameAnnieGazetteer.java:
mvn exec:java -Dexec.mainClass=com.micronlp.drug_extraction.DrugNameAnnieGazetteer -Dexec.args="./test_data/test1.txt test1_result.xml"
mvn exec:java -Dexec.mainClass=com.micronlp.drug_extraction.DrugNameAnnieGazetteer -Dexec.args="./test_data/test2.txt test2_result.xml"
Package or install:
mvn clean package
mvn clean install
----------------------------------------------------------------------------------------
Generate Javadoc
----------------------------------------------------------------------------------------
Generate the Javadoc for public members:
mvn javadoc:javadoc
Generate the Javadoc for all (public and private) members:
mvn javadoc:javadoc -Dshow=private
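Copy the freshly generated Javadoc into the javadoc folder in the project root (the folder must already exist, otherwise the files end up directly in javadoc/ instead of javadoc/apidocs/):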
cp -R target/site/apidocs javadoc
Open the Javadoc in Firefox:
firefox javadoc/apidocs/index.html &
----------------------------------------------------------------------------------------
Evaluation
----------------------------------------------------------------------------------------
1. Using conlleval evaluation script.
java -cp "../mallet-2.0.7/class:../mallet-2.0.7/lib/mallet-deps.jar" cc.mallet.fst.SimpleTagger --threads 16 --model-file src/main/resources/mallet/ddi.model data/ddi_test.iob > temp.out
paste -d ' ' data/ddi_test.iob temp.out > ddi_test_result.iob
perl conlleval.pl < ddi_test_result.iob
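The paste step above joins the gold-standard file and the tagger output line by line, so both files must have the same number of lines and their blank lines (sentence boundaries) must fall in the same places. The sketch below does the same merge in Python and fails loudly when the files are misaligned. It is an illustration only: the function name is made up, and unlike paste it writes an empty line, not a single space, at sentence boundaries.

```python
def merge_gold_and_predicted(gold_lines, predicted_lines):
    """Append the predicted label to each gold line, checking alignment."""
    if len(gold_lines) != len(predicted_lines):
        raise ValueError(
            "line counts differ: %d gold vs %d predicted"
            % (len(gold_lines), len(predicted_lines))
        )
    merged = []
    for number, (gold, predicted) in enumerate(zip(gold_lines, predicted_lines), start=1):
        gold, predicted = gold.strip(), predicted.strip()
        # A blank line on one side only means the sentences are out of step.
        if bool(gold) != bool(predicted):
            raise ValueError("sentence boundaries differ at line %d" % number)
        merged.append(gold + " " + predicted if gold else "")
    return merged


gold = ["W=Avoid O", "W=aspirin B-DRUG", "", "W=Stop O"]
predicted = ["O", "B-DRUG", "", "O"]
print("\n".join(merge_gold_and_predicted(gold, predicted)))
```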
Evaluation result without DrugBank dictionary:
processed 32065 tokens with 3656 phrases; found: 3482 phrases; correct: 2768.
accuracy: 94.41%; precision: 79.49%; recall: 75.71%; FB1: 77.56
DRUG: precision: 79.49%; recall: 75.71%; FB1: 77.56 3482
Evaluation result with DrugBank dictionary:
processed 32065 tokens with 3656 phrases; found: 3251 phrases; correct: 2786.
accuracy: 95.25%; precision: 85.70%; recall: 76.20%; FB1: 80.67
DRUG: precision: 85.70%; recall: 76.20%; FB1: 80.67 3251
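The precision, recall and FB1 figures above follow directly from the phrase counts that conlleval prints. As a worked check, the sketch below recomputes them from the counts. The function name is made up for this example.

```python
def precision_recall_f1(gold_phrases, found_phrases, correct_phrases):
    """Phrase-level scores from the three counts reported by conlleval."""
    precision = correct_phrases / found_phrases if found_phrases else 0.0
    recall = correct_phrases / gold_phrases if gold_phrases else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Counts copied from the two conlleval runs above.
runs = {
    "without DrugBank dictionary": (3656, 3482, 2768),
    "with DrugBank dictionary": (3656, 3251, 2786),
}
for name, counts in runs.items():
    p, r, f = precision_recall_f1(*counts)
    print("%s: precision %.2f%%; recall %.2f%%; FB1 %.2f" % (name, 100 * p, 100 * r, 100 * f))
# without DrugBank dictionary: precision 79.49%; recall 75.71%; FB1 77.56
# with DrugBank dictionary: precision 85.70%; recall 76.20%; FB1 80.67
```

The dictionary feature removes 231 predicted phrases (3482 to 3251) while slightly increasing the number of correct ones, which is why precision rises much more than recall.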
2. Using pipeline.
export MAVEN_OPTS="-Xms1500m -Xmx4096m -ea"
Run pipeline:
mvn exec:java -Dexec.mainClass=com.micronlp.drug_extraction.Pipeline -Dexec.args="../../bin/GATE-7.1 ../ddi-corpus/Test_Unified_GATEXML ../ddi-corpus/Test_Unified_GATEXML/OUTPUT"
Note: Both DrugTagger and DrugNameGazetteer produce annotations of type 'Drug'. DrugTagger annotations are placed in the 'DrugTagger' annotation set, and DrugNameGazetteer annotations in the 'DrugNameGazetteer' annotation set.
Use GATE Developer and plugins 'Copy Anns To Another Doc' and 'Quality Assurance PR' to evaluate results.
1. Create corpus
2. Load processed(!) documents into corpus (populate from directory)
3. Set path to reference/gold standard corpus in 'Copy Anns To Another Doc'
4. Use F1.0-Lenient metric in 'Quality Assurance PR'
See the example in DrugExtractionEvaluation.xml (restore it as an application in GATE Developer).
DrugTagger approach results without DrugBankPresence feature:
not tested
DrugTagger approach results with DrugBankPresence feature:
f1.0-strict:
Corpus Statistics
Annotation Type  Match     Only in Key  Only in Response  Overlap  Rec.B/A  Prec.B/A  f1.0-strict
Drug             2,385.00  969.00       531.00            335.00   0.65     0.73      0.69
Macro Summary    0.00      0.00         0.00              0.00     0.65     0.73      0.69
Micro Summary    2,385.00  969.00       531.00            335.00   0.65     0.73      0.69
f1.0-lenient:
Corpus Statistics
Annotation Type  Match     Only in Key  Only in Response  Overlap  Rec.B/A  Prec.B/A  f1.0-lenient
Drug             2,385.00  969.00       531.00            335.00   0.74     0.84      0.78
Macro Summary    0.00      0.00         0.00              0.00     0.74     0.84      0.78
Micro Summary    2,385.00  969.00       531.00            335.00   0.74     0.84      0.78
DrugNameGazetteer approach results:
f1.0-strict:
Corpus Statistics
Annotation Type  Match     Only in Key  Only in Response  Overlap  Rec.B/A  Prec.B/A  f1.0-strict
Drug             1,918.00  1,540.00     489.00            231.00   0.52     0.73      0.61
Macro Summary    0.00      0.00         0.00              0.00     0.52     0.73      0.61
Micro Summary    1,918.00  1,540.00     489.00            231.00   0.52     0.73      0.61
f1.0-lenient:
Corpus Statistics
Annotation Type  Match     Only in Key  Only in Response  Overlap  Rec.B/A  Prec.B/A  f1.0-lenient
Drug             1,918.00  1,540.00     489.00            231.00   0.58     0.81      0.68
Macro Summary    0.00      0.00         0.00              0.00     0.58     0.81      0.68
Micro Summary    1,918.00  1,540.00     489.00            231.00   0.58     0.81      0.68
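The strict and lenient tables above share the same counts and differ only in how overlapping (partially matching) annotations are scored: strict counts them as errors, lenient counts them as correct. The sketch below recomputes the reported figures under that assumption, which reproduces all four tables. The function name is made up for this example.

```python
def gate_scores(match, only_in_key, only_in_response, overlap, lenient=False):
    """Recall, precision and F1 from the four annotation-diff counts."""
    correct = match + overlap if lenient else match
    key_total = match + only_in_key + overlap            # annotations in the gold standard
    response_total = match + only_in_response + overlap  # annotations produced by the tagger
    recall = correct / key_total if key_total else 0.0
    precision = correct / response_total if response_total else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return recall, precision, f1


# Counts copied from the 'Drug' rows above.
counts = {
    "DrugTagger": (2385, 969, 531, 335),
    "DrugNameGazetteer": (1918, 1540, 489, 231),
}
for name, row in counts.items():
    for lenient in (False, True):
        r, p, f = gate_scores(*row, lenient=lenient)
        mode = "lenient" if lenient else "strict"
        print("%s %s: recall %.2f, precision %.2f, f1 %.2f" % (name, mode, r, p, f))
# DrugTagger strict: recall 0.65, precision 0.73, f1 0.69
# DrugTagger lenient: recall 0.74, precision 0.84, f1 0.78
# DrugNameGazetteer strict: recall 0.52, precision 0.73, f1 0.61
# DrugNameGazetteer lenient: recall 0.58, precision 0.81, f1 0.68
```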