This repository has been archived by the owner on May 9, 2024. It is now read-only.

KAIST Raw corpus

Sample

name: kaist
fullname: KAIST Raw corpus
lang: ko
category: formal
description: The KAIST Corpus consists of 70,000,000 words of Korean raw text
  extracted from various genres such as novels, non-literature, and articles.
license: MIT License
homepage: http://semanticweb.kaist.ac.kr
version: 1.0.0
num_docs: 11157
num_docs_before_processing: 11358
num_segments: 11157
num_sents: 1926901
num_words: 30929508
size_in_bytes: 319727995
num_bytes_before_processing: 343615648
size_in_human_bytes: 304.92 MiB
data_files_modified: '2022-02-23 09:52:18'
meta_files_modified: '2022-02-23 08:40:00'
info_updated: '2022-02-26 03:06:08'
data_files:
  train: kaist-train.parquet
meta_files:
  train: meta-kaist-train.parquet
features:
  columns:
    id: id
    text: text
  data:
    id: int
    text: str
  meta:
    id: int
    filename: str
    version: str
    title: str
    author: str
    date: str
    publisher: str
    kdc: str
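
The size and count fields above can be cross-checked against each other. The following is a minimal sketch, assuming `size_in_human_bytes` is the byte count expressed in MiB (2**20 bytes); the numbers are copied from the metadata, while `META`, `human_mib`, and `derived_stats` are hypothetical names introduced only for this illustration.

```python
# Values copied from the metadata above. The dict and helper names are
# hypothetical and are not part of the corpus tooling.
META = {
    "num_docs": 11157,
    "num_docs_before_processing": 11358,
    "num_sents": 1926901,
    "num_words": 30929508,
    "size_in_bytes": 319727995,
    "num_bytes_before_processing": 343615648,
}


def human_mib(num_bytes: int) -> str:
    """Format a byte count in MiB (assumed: 1 MiB = 2**20 bytes)."""
    return f"{num_bytes / 2**20:.2f} MiB"


def derived_stats(meta: dict) -> dict:
    """Compute a few figures implied by the raw counts."""
    return {
        # Documents removed during processing.
        "docs_dropped": meta["num_docs_before_processing"] - meta["num_docs"],
        # Fraction of the original bytes that survived processing.
        "bytes_kept_ratio": meta["size_in_bytes"]
        / meta["num_bytes_before_processing"],
        # Average document and sentence lengths.
        "words_per_doc": meta["num_words"] / meta["num_docs"],
        "words_per_sent": meta["num_words"] / meta["num_sents"],
        # Human-readable size, to compare with size_in_human_bytes.
        "size_human": human_mib(meta["size_in_bytes"]),
    }


stats = derived_stats(META)
for key, value in stats.items():
    print(f"{key}: {value}")
```

Under that MiB assumption, the formatted size agrees with the `size_in_human_bytes` field, and the difference between the two document counts gives the number of documents dropped during processing.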