Question Answering Corpus

RC-Data is a dataset generation framework created by Google DeepMind that produces large-scale reading comprehension question-answer pairs from CNN and Daily Mail news articles. The dataset was introduced in the 2015 paper “Teaching Machines to Read and Comprehend” (Hermann et al., NIPS 2015) and was among the first large corpora designed to train and evaluate machine reading comprehension models. The repository provides scripts that download archived CNN and Daily Mail articles from the Wayback Machine and automatically generate cloze-style questions, in which entities in the text are replaced with placeholders. Each data instance consists of a news article (the context), a generated question, and the corresponding answer, which makes it suitable for supervised learning. The output follows a standardized question-answer format and includes entity mappings that help models resolve named references.
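The cloze construction described above can be illustrated with a toy sketch. This is not the repository's code: the function name `make_cloze`, the `@entityN` and `@placeholder` tokens, and the example sentences are assumptions chosen for illustration, and the entity list is supplied by hand rather than detected automatically.

```python
import re

def make_cloze(context, question, answer, entities):
    """Anonymize entities and blank out the answer (hypothetical helper)."""
    # Replace longer names first so a short name never clobbers part of a longer one.
    ordered = sorted(entities, key=len, reverse=True)
    mapping = {name: f"@entity{i}" for i, name in enumerate(ordered)}

    def anonymize(text):
        for name in ordered:
            text = re.sub(r"\b" + re.escape(name) + r"\b", mapping[name], text)
        return text

    anon_answer = mapping[answer]
    # The (?!\d) guard keeps @entity1 from matching inside @entity10.
    anon_question = re.sub(
        re.escape(anon_answer) + r"(?!\d)", "@placeholder", anonymize(question)
    )
    return anonymize(context), anon_question, anon_answer, mapping

# Illustrative input only; not taken from the corpus.
context, question, answer, mapping = make_cloze(
    context="Alice Smith met Bob Jones in Paris. Bob Jones praised the city.",
    question="Alice Smith met Bob Jones in Paris",
    answer="Paris",
    entities=["Alice Smith", "Bob Jones", "Paris"],
)
print(context)
print(question)
print(answer)
```

A model trained on such instances has to pick the entity token that fills `@placeholder` using only the anonymized context.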

Features

Generates large-scale question-answer datasets from news articles
Includes data from CNN and Daily Mail corpora via the Wayback Machine
Produces questions, contexts, and answers in a standardized text format
Supports entity anonymization through mapping for model training
Offers a reproducible generation pipeline using Python scripts
Compatible with machine comprehension and NLP benchmarking tasks
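
Reading a generated record back might look like the following minimal sketch. The layout is an assumption, not something this page specifies: sections separated by blank lines in the order URL, context, question, answer, then one `token:name` entity mapping per line. The names `parse_record`, `restore`, and `SAMPLE_RECORD` are hypothetical.

```python
import re

# Assumed layout: URL, context, question, answer, entity mappings,
# each section separated by a blank line. Illustrative data only.
SAMPLE_RECORD = """http://example.com/story

@entity0 met @entity1 in @entity2 .

@entity0 met @entity1 in @placeholder

@entity2

@entity0:Alice Smith
@entity1:Bob Jones
@entity2:Paris
"""

def parse_record(text):
    """Split one record into its fields (hypothetical helper)."""
    # maxsplit=4 keeps all mapping lines together in the final section.
    sections = re.split(r"\n\s*\n", text.strip(), maxsplit=4)
    url, context, question, answer = sections[:4]
    mapping = {}
    if len(sections) > 4:
        for line in sections[4].splitlines():
            token, _, name = line.partition(":")
            mapping[token.strip()] = name.strip()
    return {"url": url, "context": context, "question": question,
            "answer": answer, "entities": mapping}

def restore(text, mapping):
    """Swap entity tokens back to their original names where known."""
    return re.sub(r"@entity\d+", lambda m: mapping.get(m.group(0), m.group(0)), text)

record = parse_record(SAMPLE_RECORD)
print(restore(record["answer"], record["entities"]))
```

Keeping the mapping separate from the anonymized text lets a model train on tokens while a person can still inspect the original names.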

License

Apache License V2.0

Additional Project Details

Operating Systems

Linux, Mac

Programming Language

Python

Related Categories

Python Libraries

Registered

2025-10-09
