Bio_ClinicalBERT is a domain-specific language model for clinical natural language processing (NLP) that extends BioBERT with additional pre-training on clinical text. It was initialized from BioBERT-Base v1.0 and further pre-trained on all clinical notes in the MIMIC-III database (~880M words), a collection of ICU patient records. This continued pre-training targets better performance on healthcare-domain tasks such as named entity recognition (NER) and natural language inference (NLI).

Notes were split into sections with a rule-based sectioner and segmented into sentences with SciSpacy. Pre-training ran for 150,000 steps with a batch size of 32, a maximum sequence length of 128, and a masked language modeling objective with a 0.15 masking probability.

Bio_ClinicalBERT is distributed through Hugging Face's Transformers library for easy integration. It supports medical AI research and applications such as electronic health record understanding, clinical decision support, and biomedical information extraction.
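Because the model is published on the Hugging Face Hub, it can be loaded with a few lines of Transformers code. The sketch below assumes the commonly used checkpoint ID `emilyalsentzer/Bio_ClinicalBERT` and an invented example sentence; it exercises the masked language modeling objective the model was pre-trained with via the fill-mask pipeline.

```python
from transformers import AutoTokenizer, AutoModelForMaskedLM, pipeline

# Checkpoint ID assumed to be the published Hugging Face model.
model_name = "emilyalsentzer/Bio_ClinicalBERT"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForMaskedLM.from_pretrained(model_name)

# Fill-mask mirrors the masked LM objective used during pre-training.
fill = pipeline("fill-mask", model=model, tokenizer=tokenizer)
for pred in fill("The patient was admitted to the ICU with acute [MASK] failure."):
    print(f"{pred['token_str']:>12}  score={pred['score']:.3f}")
```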
Features
- Pre-trained on all MIMIC-III clinical notes (~880M words)
- Initialized from BioBERT, which was trained on PubMed and PMC data
- Optimized for clinical NLP tasks like NER and NLI
- Processes text using medical-specific sentence splitting with SciSpacy (see the preprocessing sketch after this list)
- Compatible with Hugging Face Transformers (PyTorch, TensorFlow, JAX)
- Masked language model with 0.15 masking probability
- Trained with a maximum sequence length of 128 tokens, sized for real-world clinical note segments
- Licensed under MIT, supporting open and flexible usage
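As a rough illustration of the preprocessing described above, the sketch below splits a note into sentences with a scispaCy pipeline and tokenizes each sentence under the 128-token cap used during pre-training. The checkpoint ID, the `en_core_sci_sm` model name, and the sample note are assumptions for illustration, not part of the original model card.

```python
import spacy
from transformers import AutoTokenizer

# scispaCy biomedical pipeline for sentence splitting; "en_core_sci_sm" is an
# assumed model name (install it from the scispacy release page first).
nlp = spacy.load("en_core_sci_sm")

# Checkpoint ID assumed to be the published Hugging Face model.
tokenizer = AutoTokenizer.from_pretrained("emilyalsentzer/Bio_ClinicalBERT")

note = (
    "Pt admitted with SOB and fever. "
    "CXR showed bilateral infiltrates. "
    "Started on broad-spectrum abx."
)

# Split the note into sentences, then tokenize each with the 128-token
# maximum sequence length used during pre-training.
for sent in nlp(note).sents:
    encoded = tokenizer(sent.text, truncation=True, max_length=128)
    print(len(encoded["input_ids"]), sent.text)
```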