cocoNLP is a lightweight natural-language processing toolkit for practical information extraction from raw text, especially Chinese and mixed Chinese–English content. Rather than requiring a heavy pipeline, it focuses on common extraction tasks: pulling names, places, organizations, emails, phone numbers, and dates directly out of unstructured sentences. The project combines pattern-based methods with NLP heuristics, which keeps its behavior predictable on real-world text such as chats, comments, and other user-generated content. Its API is intentionally simple, so you can drop it into scripts, ETL jobs, or dashboards without deep ML expertise. Because it favors utility over complexity, it suits prototyping data products and building lightweight text analytics where large models would be overkill. The repository also includes examples and test snippets that show expected inputs and typical outputs, which shortens the learning curve for newcomers.
Features
- Ready-made extractors for names, locations, organizations, emails, phone numbers, and dates
- Chinese and mixed-language text handling for common real-world corpora
- Lightweight API surface that integrates into scripts and services quickly
- Pattern-driven approach for predictable behavior and easy customization
- Works well in ETL and data-cleaning pipelines without GPU dependencies
- Examples and test snippets to validate usage and outputs