Drug Extraction Code

Drug name extraction

Status: Alpha
Brought to you by: kleindevel
File	Date	Author	Commit
data	2015-06-12	kleindevel	[r5]
javadoc	2015-06-12	kleindevel	[r5]
screens	2015-06-12	kleindevel	[r4]
src	2015-06-12	kleindevel	[r8]
test_data	2015-06-12	kleindevel	[r5]
DrugExtractionEvaluation.xml	2015-06-12	kleindevel	[r3]
README	2015-06-12	kleindevel	[r8]
conlleval.pl	2015-06-12	kleindevel	[r2]
pom.xml	2015-06-12	kleindevel	[r7]
Read Me

Author: Artjom Klein klein.devel@gmail.com

The package provides two taggers:
1. DrugTagger - CRF-based, with a DrugBank presence feature (see the feature set below for details).
2. DrugNameGazetteer - gazetteer/dictionary-based. The dictionary was created from the DrugBank.ca database.
Both taggers include grounding/normalisation to DrugBank IDs and standard names.

Feature set:
Word, Word-1, Word+1, Word-1_Word, Word_Word+1, DrugBankPresence, POS
The DrugBankPresence feature indicates whether the drug name is present in DrugBank.

See feature vectors in data/ddi_train.iob and data/ddi_test.iob
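
The feature set above can be illustrated with a minimal sketch. It is not the project's Java implementation: the function name, the feature prefixes (W=, POS=, ...), the boundary markers and the single-token dictionary lookup are all assumptions made for illustration, and the actual layout of the .iob files may differ.

```python
def token_features(tokens, pos_tags, drugbank_names):
    """Build one feature list per token, following the feature set above.

    All names and the string format are hypothetical; the real feature
    vectors are the ones stored in data/ddi_train.iob.
    """
    # Lower-cased lookup set; matching single tokens is a simplifying assumption.
    known = {name.lower() for name in drugbank_names}
    rows = []
    for i, word in enumerate(tokens):
        # Boundary markers stand in for the missing neighbours.
        prev_word = tokens[i - 1] if i > 0 else "<S>"
        next_word = tokens[i + 1] if i + 1 < len(tokens) else "</S>"
        feats = [
            "W=" + word,                            # Word
            "W-1=" + prev_word,                     # Word-1
            "W+1=" + next_word,                     # Word+1
            "W-1_W=" + prev_word + "_" + word,      # Word-1_Word
            "W_W+1=" + word + "_" + next_word,      # Word_Word+1
            "POS=" + pos_tags[i],                   # POS
        ]
        if word.lower() in known:
            feats.append("DrugBankPresence")        # DrugBankPresence
        rows.append(feats)
    return rows
```

For example, token_features(["Take", "aspirin", "daily"], ["VB", "NN", "RB"], ["Aspirin"]) marks only the second token with DrugBankPresence.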




----------------------------------------------------------------------------------------

Tagging with the standalone jar

----------------------------------------------------------------------------------------

1. Install GATE (General Architecture for Text Engineering)
Download gate-7.1-build4485-BIN.zip from https://gate.ac.uk/download/
Unpack to any folder.

2. Download DrugExtractionStandalone.tar.gz from https://sourceforge.net/projects/drug-extraction/ (see Files)

3. Run the pipeline:

java -cp drug-extraction-jar-with-dependencies.jar com.micronlp.drug_extraction.Pipeline <PATH_TO_GATE_HOME> <PATH_TO_DIRECTORY_WITH_PDF_FILES> <PATH_TO_DIRECTORY_FOR_ANNOTATED_DOCUMENTS>

e.g.

java -cp target/drug-extraction-jar-with-dependencies.jar com.micronlp.drug_extraction.Pipeline ../../bin/GATE-7.1 test_data test_data/OUTPUT


Use the GATE Developer GUI to load the *.xml files from the output directory and view the annotations.





----------------------------------------------------------------------------------------

Development environment setup

----------------------------------------------------------------------------------------

1. Java 6.
Example links for downloading Java 6 for Windows (without Oracle registration):
32bit: http://www.oldapps.com/java.php?old_java=8346
64bit: http://www.oldapps.com/java.php?old_java=8347

2. GATE (General Architecture for Text Engineering)
Download gate-7.1-build4485-BIN.zip from https://gate.ac.uk/download/
Unpack to any folder.

3. Mallet (MAchine Learning for LanguagE Toolkit)
Download mallet-2.0.7.tar.gz from http://mallet.cs.umass.edu/download.php
Unpack to the parent folder of the project.

4. Download the training (DrugDDI_Unified) and test (Test_Unified) corpora from http://labda.inf.uc3m.es/DDIExtraction2011/dataset.html
Unpack them into a folder named 'ddi-corpus' in the parent folder.

Final directory structure:
/<YOUR-WORKSPACE>
  /mallet-2.0.7
  /ddi-corpus
  /DrugExtraction

  
  
----------------------------------------------------------------------------------------

Prepare corpus, training and test data.

----------------------------------------------------------------------------------------

Convert the original DrugDDI_Unified and Test_Unified corpora to GATE XML format:

mvn exec:java -Dexec.mainClass=com.micronlp.drug_extraction.DDICorpusToGate -Dexec.args="../ddi-corpus/DrugDDI_Unified ../ddi-corpus/DrugDDI_Unified_GATEXML"
mvn exec:java -Dexec.mainClass=com.micronlp.drug_extraction.DDICorpusToGate -Dexec.args="../ddi-corpus/Test_Unified ../ddi-corpus/Test_Unified_GATEXML"


Create training data:

mvn exec:java -Dexec.mainClass=com.micronlp.drug_extraction.DDIGateToIob -Dexec.args="../ddi-corpus/DrugDDI_Unified_GATEXML data/ddi_train.iob"

Accuracy reported by Mallet SimpleTagger for the model trained on this data (see Training below):
Training accuracy=0.980763482642299
Testing accuracy=0.963014047447269

Create test data:

mvn exec:java -Dexec.mainClass=com.micronlp.drug_extraction.DDIGateToIob -Dexec.args="../ddi-corpus/Test_Unified_GATEXML data/ddi_test.iob"

Fix (if needed): remove the double newlines in data/ddi_test.iob at line 31086 (or 31544).
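
The fix above can also be scripted instead of edited by hand. The following is a minimal sketch, assuming that "double newlines" means two or more consecutive blank lines and that a single blank line is the sequence separator that should be kept. The function names are hypothetical and not part of the project.

```python
def collapse_blank_lines(lines):
    """Keep at most one blank line in a row; all other lines are unchanged."""
    out = []
    previous_blank = False
    for line in lines:
        blank = line.strip() == ""
        if blank and previous_blank:
            continue  # drop the second (third, ...) blank line of a run
        out.append(line)
        previous_blank = blank
    return out


def fix_iob_file(path):
    """Rewrite the file at `path` in place with repeated blank lines collapsed."""
    with open(path, encoding="utf-8") as handle:
        lines = handle.readlines()
    with open(path, "w", encoding="utf-8") as handle:
        handle.writelines(collapse_blank_lines(lines))
```

Calling fix_iob_file("data/ddi_test.iob") would apply it to the test data. Keep a backup first, since the file is overwritten.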



Training:

java -cp "../mallet-2.0.7/class:../mallet-2.0.7/lib/mallet-deps.jar" cc.mallet.fst.SimpleTagger --train true --test lab --threads 16 --iterations 500 --model-file src/main/resources/mallet/ddi.model  data/ddi_train.iob


Test/Application:

java -cp "../mallet-2.0.7/class:../mallet-2.0.7/lib/mallet-deps.jar" cc.mallet.fst.SimpleTagger --test lab --threads 16 --model-file src/main/resources/mallet/ddi.model  data/ddi_test.iob





----------------------------------------------------------------------------------------

Installation and packaging

----------------------------------------------------------------------------------------


Create index:
mvn compile exec:java -Dexec.mainClass=com.micronlp.drug_extraction.DrugBankIndexer



Test DrugTagger.java:
mvn exec:java -Dexec.mainClass=com.micronlp.drug_extraction.DrugTagger -Dexec.args="./test_data/test1.xml test1_result.xml"
mvn exec:java -Dexec.mainClass=com.micronlp.drug_extraction.DrugTagger -Dexec.args="./test_data/test2.xml test2_result.xml"




Test DrugNameAnnieGazetteer.java:
mvn exec:java -Dexec.mainClass=com.micronlp.drug_extraction.DrugNameAnnieGazetteer -Dexec.args="./test_data/test1.txt test1_result.xml"
mvn exec:java -Dexec.mainClass=com.micronlp.drug_extraction.DrugNameAnnieGazetteer -Dexec.args="./test_data/test2.txt test2_result.xml"





Package or install:
mvn clean package
mvn clean install


----------------------------------------------------------------------------------------

Generate Javadoc

----------------------------------------------------------------------------------------

Generate the Javadoc for public members:
mvn javadoc:javadoc

Generate the Javadoc for all (public and private) members:
mvn javadoc:javadoc -Dshow=private

Copy the freshly generated Javadoc to the javadoc folder in the project root:
cp -R target/site/apidocs javadoc

Open the Javadoc in Firefox:
firefox javadoc/apidocs/index.html &




----------------------------------------------------------------------------------------

Evaluation

----------------------------------------------------------------------------------------

1. Using conlleval evaluation script.

java -cp "../mallet-2.0.7/class:../mallet-2.0.7/lib/mallet-deps.jar" cc.mallet.fst.SimpleTagger --threads 16 --model-file src/main/resources/mallet/ddi.model  data/ddi_test.iob > temp.out

paste -d ' ' data/ddi_test.iob temp.out > ddi_test_result.iob

perl conlleval.pl < ddi_test_result.iob
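
The script scores whole phrases rather than single tokens. The sketch below shows that idea in simplified form. It is not conlleval.pl itself, and it assumes labels of the form B-DRUG / I-DRUG / O, which is inferred from the DRUG type in the results and not confirmed by the data files. The function names are hypothetical.

```python
def extract_phrases(labels):
    """Return the set of (start, end, type) spans in an IOB label sequence.

    `end` is exclusive. An I- label that does not continue a span of the
    same type is treated as the start of a new span.
    """
    spans = set()
    start, kind = None, None
    for i, label in enumerate(list(labels) + ["O"]):  # sentinel closes the last span
        tag, _, typ = label.partition("-")
        continues = tag == "I" and start is not None and typ == kind
        if not continues:
            if start is not None:
                spans.add((start, i, kind))
            start, kind = None, None
            if tag in ("B", "I"):
                start, kind = i, typ
    return spans


def count_phrases(gold, predicted):
    """Return (gold phrases, found phrases, exactly matching phrases)."""
    gold_spans = extract_phrases(gold)
    found_spans = extract_phrases(predicted)
    return len(gold_spans), len(found_spans), len(gold_spans & found_spans)
```

A predicted phrase counts as correct only if its boundaries and type both match a gold phrase exactly.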



Evaluation result without DrugBank dictionary:            
processed 32065 tokens with 3656 phrases; found: 3482 phrases; correct: 2768.
accuracy:  94.41%; precision:  79.49%; recall:  75.71%; FB1:  77.56
             DRUG: precision:  79.49%; recall:  75.71%; FB1:  77.56  3482
             
Evaluation result with DrugBank dictionary:
processed 32065 tokens with 3656 phrases; found: 3251 phrases; correct: 2786.
accuracy:  95.25%; precision:  85.70%; recall:  76.20%; FB1:  80.67
             DRUG: precision:  85.70%; recall:  76.20%; FB1:  80.67  3251
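
The precision, recall and FB1 figures above follow directly from the phrase counts in each "processed ..." line. A small sketch of the arithmetic (the function name is hypothetical; the counts are the ones reported above):

```python
def conll_scores(n_gold, n_found, n_correct):
    """Return (precision, recall, F1) in percent from phrase counts."""
    precision = n_correct / n_found if n_found else 0.0
    recall = n_correct / n_gold if n_gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return 100 * precision, 100 * recall, 100 * f1


# Phrase counts taken from the two evaluation results above.
runs = [
    ("without DrugBank dictionary", 3656, 3482, 2768),
    ("with DrugBank dictionary", 3656, 3251, 2786),
]
for name, n_gold, n_found, n_correct in runs:
    p, r, f = conll_scores(n_gold, n_found, n_correct)
    print(f"{name}: precision {p:.2f}%; recall {r:.2f}%; FB1 {f:.2f}")
```

Precision is correct/found and recall is correct/total phrases, so the dictionary feature mainly raises precision here: fewer phrases are found (3251 vs. 3482), but slightly more of them are correct (2786 vs. 2768).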



2. Using pipeline.

export MAVEN_OPTS="-Xms1500m -Xmx4096m -ea"


Run pipeline:
mvn exec:java -Dexec.mainClass=com.micronlp.drug_extraction.Pipeline -Dexec.args="../../bin/GATE-7.1 ../ddi-corpus/Test_Unified_GATEXML ../ddi-corpus/Test_Unified_GATEXML/OUTPUT"

Note: Both DrugTagger and DrugNameGazetteer annotate text with annotations of type 'Drug'. DrugTagger annotations are in the 'DrugTagger' annotation set; DrugNameGazetteer annotations are in the 'DrugNameGazetteer' annotation set.


Use GATE Developer with the 'Copy Anns To Another Doc' and 'Quality Assurance PR' plugins to evaluate the results:
1. Create a corpus.
2. Load the processed(!) documents into the corpus (populate from directory).
3. Set the path to the reference/gold standard corpus in 'Copy Anns To Another Doc'.
4. Use the F1.0-Lenient metric in 'Quality Assurance PR'.

See the example in DrugExtractionEvaluation.xml (restore it in GATE Developer as an application).





DrugTagger approach results without DrugBankPresence feature:


not tested




DrugTagger approach results with DrugBankPresence feature:


f1.0-strict:

Corpus Statistics
Annotation Type 	Match 	Only in Key 	Only in Response 	Overlap 	Rec.B/A 	Prec.B/A 	f1.0-strict
Drug 	2,385.00 	969.00 	531.00 	335.00 	0.65 	0.73 	0.69
Macro Summary 	0.00 	0.00 	0.00 	0.00 	0.65 	0.73 	0.69
Micro Summary 	2,385.00 	969.00 	531.00 	335.00 	0.65 	0.73 	0.69 


f1.0-lenient:

Corpus Statistics
Annotation Type 	Match 	Only in Key 	Only in Response 	Overlap 	Rec.B/A 	Prec.B/A 	f1.0-lenient
Drug 	2,385.00 	969.00 	531.00 	335.00 	0.74 	0.84 	0.78
Macro Summary 	0.00 	0.00 	0.00 	0.00 	0.74 	0.84 	0.78
Micro Summary 	2,385.00 	969.00 	531.00 	335.00 	0.74 	0.84 	0.78 




DrugNameGazetteer approach results:


f1.0-strict:

Corpus Statistics
Annotation Type 	Match 	Only in Key 	Only in Response 	Overlap 	Rec.B/A 	Prec.B/A 	f1.0-strict
Drug 	1,918.00 	1,540.00 	489.00 	231.00 	0.52 	0.73 	0.61
Macro Summary 	0.00 	0.00 	0.00 	0.00 	0.52 	0.73 	0.61
Micro Summary 	1,918.00 	1,540.00 	489.00 	231.00 	0.52 	0.73 	0.61 


f1.0-lenient:

Corpus Statistics
Annotation Type 	Match 	Only in Key 	Only in Response 	Overlap 	Rec.B/A 	Prec.B/A 	f1.0-lenient
Drug 	1,918.00 	1,540.00 	489.00 	231.00 	0.58 	0.81 	0.68
Macro Summary 	0.00 	0.00 	0.00 	0.00 	0.58 	0.81 	0.68
Micro Summary 	1,918.00 	1,540.00 	489.00 	231.00 	0.58 	0.81 	0.68
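
The strict and lenient scores in the tables above can be recomputed from the Match, Only in Key, Only in Response and Overlap counts. The sketch below assumes the usual GATE definitions: strict counts only exact matches as correct, lenient also counts overlapping annotations. The function name is hypothetical.

```python
def gate_scores(match, only_key, only_response, overlap, lenient=False):
    """Return (recall, precision, F1) rounded to two decimals.

    Assumption: strict treats overlaps as errors, lenient treats them as hits.
    """
    hits = match + (overlap if lenient else 0)
    key_total = match + overlap + only_key            # annotations in the gold standard
    response_total = match + overlap + only_response  # annotations produced by the tagger
    recall = hits / key_total if key_total else 0.0
    precision = hits / response_total if response_total else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return round(recall, 2), round(precision, 2), round(f1, 2)


# Counts from the tables above: (Match, Only in Key, Only in Response, Overlap).
drug_tagger = (2385, 969, 531, 335)
gazetteer = (1918, 1540, 489, 231)
for name, counts in [("DrugTagger", drug_tagger), ("DrugNameGazetteer", gazetteer)]:
    print(name, "strict:", gate_scores(*counts))
    print(name, "lenient:", gate_scores(*counts, lenient=True))
```

Under these definitions the recomputed values agree with the Rec.B/A, Prec.B/A and f1.0 columns reported above for both taggers.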