Showing 40 open source projects for "corpus linguistics"

  • 1

    IMS Open Corpus Workbench

    Indexing and query tools for very large text corpora

    The IMS Open Corpus Workbench is a collection of tools for managing and querying large text corpora (100 M words and more) with linguistic annotations. Its central component is the flexible and efficient query processor CQP, which can be used interactively in a terminal session, as a backend (e.g. from a Perl script), or through the Web-based GUI CQPweb.
    Downloads: 35 This Week
  • 2
    iramuteq
    IRAMUTEQ: Interface de R pour les Analyses Multidimensionnelles de Textes et de Questionnaires (R interface for multidimensional analyses of texts and questionnaires). Data-processing software for text corpora or individuals-by-characters tables. In particular, it can perform "ALCESTE"-type analyses.
    Downloads: 830 This Week
  • 3

    modnlp-plugins

    External plugins for modnlp/teccli

    This is a general project for modnlp/teccli plugins, with a focus on text visualization.
    Downloads: 0 This Week
  • 4

    Linguistic Analyzer

    The Linguistic Analyzer is a tool for corpus analysis and comparison

    The Linguistic Analyzer (Almuhalil Alloghawy) is a free tool designed by a team from Al-Imam Muhammad bin Saud Islamic University. It supports corpus analysis and comparison across several linguistic characteristics, including frequency list generation, concordances, collocation extraction, the difference between two words, and keyword identification.
    Downloads: 1 This Week
  • 5

    korpus

    Corpus Linguistics Software

    Software for corpus linguistics, including a Corpus Text Editor, web-based search, etc. This project was created for the Belarusian Corpus, but it can be used for other languages with some adaptation.
    Downloads: 0 This Week
  • 6

    Korean Analyzer Rhino

    Parsing Korean words by morpheme and part-of-speech

    RHINO parses Korean words by morpheme and part of speech. Its dictionaries are based on the Korean Modern Tagged Corpus (on the scale of 12 million phrases), which was created by the Korean government, so it covers many cases of stems and endings. Its newly developed Dynamic Dictionary Technology, in effect a programmed database, lets words react to their context. For more information, see the files in the help folder.
    Downloads: 0 This Week
  • 7

    KSUCCA Corpus

    A 50-million-token corpus of Classical Arabic.

    The King Saud University Corpus of Classical Arabic (KSUCCA) is a pioneering 50-million-token annotated corpus of Classical Arabic texts from the pre-Islamic era until the fourth Hijri century (equivalent to the period from the seventh until the early eleventh century CE), which is the period of pure Classical Arabic. The main aim of this corpus is to support the study of the distributional lexical semantics of the words of the Quran.
    Downloads: 5 This Week
  • 8

    SimpleLemmatizer

    This program is for text lemmatization

    It lemmatizes texts based on a supplied model. The base model is for Slovak texts and was created from the Slovak National Corpus, copyright Ľ. Štúr Institute of Linguistics, Slovak Academy of Sciences.
    Downloads: 0 This Week
  • 9

    Arabic Corpus

    Text categorization, Arabic language processing, language modeling

    The Arabic Corpus, compiled by Dr. Mourad Abbas ( http://sites.google.com/site/mouradabbas9/corpora ). The Khaleej-2004 corpus contains 5690 documents divided into 4 topics (categories). The Watan-2004 corpus contains 20291 documents organized into 6 topics (categories). Researchers who use these two corpora should cite the two main references: (1) For the Watan-2004 corpus: M. Abbas, K. Smaili, D. Berkani (2011), Evaluation of Topic Identification Methods on...
    Downloads: 2 This Week
  • 10

    concordia

    Powerful search library, best suited for computer-aided translation

    Concordia: Roman goddess of agreement. Concordance searcher: a tool for translators who need their translations to "agree" with one standard. Concordia is a C++ library for fast text lookup in large corpora. It uses a RAM-stored index, which takes up approximately 600 MB of memory for a corpus of 2 million sentences. It is based on the idea of a suffix array, enhanced by auxiliary data structures. The effects are stunning: Concordia is able to do simple substring...
    Downloads: 0 This Week
  • 11

    eMargin

    online collaborative annotation

    ...These annotations can be shared amongst groups, generating discussions and allowing analyses and interpretations to be combined. The initial aim of the eMargin project was to bridge the gap between two distinct approaches to textual analysis: the top-down, quantitative approach of corpus linguistics and the fine-grained, introspective approach of literary close reading in a classroom context. Our solution is a web-based annotation tool which, by moving the annotation process online, enables collaboration and discussion across multiple locations in both synchronous and asynchronous modes. eMargin was developed through two Jisc grants. ...
    Downloads: 0 This Week
  • 12
    Open data for a Khmer language corpus and lexicographic data that can be used to develop free language tools for the Khmer language, such as automatic translators, dictionaries, linguistic analysis tools, etc.
    Downloads: 63 This Week
  • 13

    rcqp

    R interface to the Corpus Query Protocol

    Implements the Corpus Query Protocol as a package for the R statistical environment. It allows users to query linguistic corpora and manipulate the data as native R objects. It is based on the CWB software.
    Downloads: 0 This Week
  • 14

    Corpus Toolkit

    A text management tool for linguistic purposes...

    Downloads: 0 This Week
  • 15

    PADIC

    A multilingual Parallel Arabic DIalectal Corpus

    ...Mourad Abbas, Computational Linguistics Department, crstdla ( https://sites.google.com/site/mouradabbas9 ). Publications: K. Meftouh, S. Harrat, S. Jamoussi, M. Abbas, K. Smaïli, Machine Translation Experiments on PADIC: A Parallel Arabic DIalect Corpus, The 29th Pacific Asia Conference on Language, Information and Computation, PACLIC 2015, Shanghai, 2015.
    Downloads: 6 This Week
  • 16

    Arabic business corpora

    Arabic business and management corpus

    ...Both plain text and tagged corpora are available to download; check the Files section for direct download of the zip files. The resource is publicly available and has been used in the Arabic Corpus Linguistics book by King Saud University.
    Downloads: 1 This Week
  • 17

    texrex

    Web corpus creation software (moved to GitHub)

    This project has moved to GitHub: https://github.com/rsling/texrex https://github.com/rsling/cow
    Downloads: 0 This Week
  • 18

    Classical Arabic Corpus

    A corpus containing more than 1 M distinct Arabic words.

    This project was developed as part of a master's thesis titled "Edit Distance Adapted to Natural Language Words". It consists of three parts: first, a corpus gathering more than one million distinct Arabic words; second, the text files of the Arabic resources; third, an index file presenting some information about these resources. Additional details about these parts are available in the README file.
    Downloads: 0 This Week
  • 19

    ICE Nigeria

    Nigerian component of the International Corpus of English

    This is the Nigerian component of the International Corpus of English, a one million word corpus of written and spoken Nigerian English for linguistic research. It can be used as a stand-alone corpus or in conjunction with other components of the International Corpus of English (such as ICE-GB, ICE-India, etc.) to compare international varieties of English. This is the first release of the complete corpus. The corpus can be downloaded in several parts. The written part can be downloaded as...
    Downloads: 8 This Week
  • 20
    The AFEWC corpus is a multilingual collection of comparable text articles in Arabic, French, and English. Each article triple is related to the same topic (aligned at the article level). The AFEWC corpus is collected from Wikipedia. The corpus is available for free for research purposes only. It is composed of 40K aligned articles, 91.3M English words, 57.8M French words, 22M Arabic words, 2.8M English unique words, 1.9M French unique words, and 1.5M Arabic unique words. Wikipedia text is...
    Downloads: 0 This Week
  • 21

    Osman Arabic Text Readability

    Open Source tool for Arabic text readability

    We present OSMAN (Open Source Metric for Measuring Arabic Narratives), a novel open source Arabic readability metric and tool. The open source Java tool allows users to calculate readability for Arabic text (with and without diacritics). The tool provides methods to split the text into words and sentences and to count syllables, Faseeh letters, and hard and complex words, in addition to adding diacritics (vocalising the text). This makes the tool useful for researchers and educators working with Arabic text....
    Downloads: 2 This Week
  • 22

    mwetoolkit

    THIS PROJECT MIGRATED TO https://gitlab.com/mwetoolkit/mwetoolkit3/

    ...These include idioms (kick the bucket), noun compounds (cable car), phrasal verbs (take off, give up), etc. Even though it focuses on multiword expressions, the framework is quite complete and can also be useful in any corpus-based study in computational linguistics. The mwetoolkit can be applied to virtually any text collection, language, and MWE type. It is a command-line tool written mostly in Python. Its development started in 2010 as a PhD thesis, but the project remains active (see the SVN logs). Up-to-date documentation and details about the tool can be found on the mwetoolkit website: http://mwetoolkit.sourceforge.net/
    Downloads: 0 This Week
  • 23

    Drug Extraction

    Drug name extraction

    Drug name recognition and normalisation/grounding to DrugBank ids and standard names. The package provides 2 taggers: (1) DrugTagger, CRF-based with a DrugBank presence feature (see the feature set for details); (2) DrugnameGazetteer, gazetteer/dictionary-based, with a dictionary created from the DrugBank.ca database. Both taggers include grounding/normalisation to DrugBank ids and standard names. Feature set: Word, Word-1, Word+1, Word-1_Word, Word_Word+1, DrugBankPresence, POS DrugBankPresence...
    Downloads: 0 This Week
  • 24

    DisMo

    A POS, disfluency and multi-word unit annotator for spoken language

    DisMo is a part-of-speech, disfluency and multi-word unit automatic annotator. It is designed to manage the complexities and phenomena specific to spoken language. It currently supports English and French, with support for more languages coming soon. It is developed and maintained by George Christodoulides (Centre Valibel, IL&C, University of Louvain, Louvain-la-Neuve, Belgium). Visit www.corpusannotation.org to find out more about DisMo and other annotation tools for language...
    Downloads: 0 This Week
  • 25
    TextTools
    TextTools is a freeware corpus linguistics tool developed in Python to aid in research. This program analyzes user-created corpora and displays information about word (token) frequency, n-grams, clusters, collocations, keyword in context (KWIC), and keyness. TextTools is designed to be user-friendly and intuitive and will run natively on Mac OS X.
    Downloads: 0 This Week
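
Several of the tools listed above report token frequencies, n-grams, and keyword-in-context (KWIC) lines. As a rough illustration of what those outputs look like, here is a minimal Python sketch. It is not taken from TextTools or any other listed project; the function names `tokenize`, `kwic`, and `ngram_counts` and the simple regex tokenizer are assumptions made for this example only.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into word tokens (a deliberately naive tokenizer)."""
    return re.findall(r"\w+", text.lower())


def kwic(text, keyword, width=3):
    """Return (left context, hit, right context) for every occurrence of keyword."""
    tokens = tokenize(text)
    target = keyword.lower()
    rows = []
    for i, tok in enumerate(tokens):
        if tok == target:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            rows.append((left, tok, right))
    return rows


def ngram_counts(text, n=2):
    """Count contiguous n-grams of tokens."""
    tokens = tokenize(text)
    return Counter(zip(*(tokens[k:] for k in range(n))))


if __name__ == "__main__":
    sample = "The cat sat on the mat. The dog sat on the log."
    for left, hit, right in kwic(sample, "sat", width=2):
        print(f"{left:>10} [{hit}] {right}")
    print(ngram_counts(sample, 2).most_common(3))
```

Real corpus tools typically add language-aware tokenization, indexing for large corpora, and statistical measures such as keyness, none of which this sketch attempts.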