HUAWEI | NOAH'S ARK LAB

Open Data Sets for DL4NLP

  • Short-Text Conversation

    This dataset consists of 4.4 million message-response pairs crawled from Weibo. It can be used to train a neural dialogue system. You can get this dataset for research purposes by clicking Noah_NRM_Data. If you have any questions about the dataset, please contact Lifeng Shang.

    Please cite the following paper if you use the data in your work.

    Neural Responding Machine for Short-Text Conversation. Lifeng Shang, Zhengdong Lu, and Hang Li. ACL 2015.

  • Generative Question Answering

    This dataset consists of 720K question-answer pairs associated with 1.1M triples in a knowledge base. The question-answer pairs were collected from two Chinese community QA sites, and the knowledge base was built by mining three Chinese encyclopedia sites. For the format of the data, please refer to the README file. You can freely download the data by following this link. Please use the data only for research purposes. If you have any questions regarding the data, please contact Xin Jiang.

    Please cite the following paper if you use the data in your work.

    Neural Generative Question Answering. Jun Yin, Xin Jiang, Zhengdong Lu, Lifeng Shang, Hang Li, and Xiaoming Li. IJCAI 2016.