Bhojpuri Language Technological Resources (BHLTR)
========================================================
Introduction
=======
The Bhojpuri (https://en.wikipedia.org/wiki/Bhojpuri_language) LT Resources (BHLTR) project was initiated by me (Atul (http://ufal.ms.mff.cuni.cz/atul-kr-ojha)) at Jawaharlal Nehru University (JNU), New Delhi (http://sanskrit.jnu.ac.in/index.jsp) during my doctoral (http://sanskrit.jnu.ac.in/rstudents/phd.jsp) research work. The BHLTR data contains monolingual, parallel (English-Bhojpuri), and POS-annotated monolingual corpora. The POS annotation follows the Bureau of Indian Standards (BIS) Part-of-Speech (POS) tagset (http://tdil-dc.in/tdildcMain/articles/134692Draft%20POS%20Tag%20standard.pdf).
Structure of the `BHLTR data` folder
=======================
bho-resources/
├─ mono-bho-corpus/
│  ├─ monolingual.bho
│  ├─ README.md
│  └─ pos-annotated/
│     └─ pos-tagged.bho
├─ parallel-corpora/
│  ├─ README.md
│  └─ eng-bho/
│     ├─ eng-bho.en
│     └─ eng-bho.bho
├─ license.md
├─ README.md
└─ README.txt
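
Reading the parallel corpus (sketch)
=======
The following is a minimal sketch of how the English-Bhojpuri files might be read as sentence pairs. It assumes that eng-bho.en and eng-bho.bho are UTF-8 plain text with one sentence per line and that line N of one file corresponds to line N of the other; this README does not state the file format, so please check the README.md inside parallel-corpora/ first. The function name read_parallel is hypothetical and not part of the BHLTR release.

```python
from itertools import zip_longest
from typing import Iterator, Tuple


def read_parallel(en_path: str, bho_path: str) -> Iterator[Tuple[str, str]]:
    """Yield (english, bhojpuri) sentence pairs from two files.

    Assumption (not confirmed by the README): both files are UTF-8 and
    line-aligned, i.e. line N of one file translates line N of the other.
    """
    with open(en_path, encoding="utf-8") as en_file, \
            open(bho_path, encoding="utf-8") as bho_file:
        pairs = zip_longest(en_file, bho_file)
        for line_no, (en, bho) in enumerate(pairs, start=1):
            # zip_longest pads the shorter file with None, so a None here
            # means the two files do not have the same number of lines.
            if en is None or bho is None:
                raise ValueError(f"files differ in length at line {line_no}")
            yield en.rstrip("\n"), bho.rstrip("\n")


# Example call, using the paths from the folder structure above:
#   root = "bho-resources/parallel-corpora/eng-bho/"
#   for en, bho in read_parallel(root + "eng-bho.en", root + "eng-bho.bho"):
#       print(en, "|||", bho)
```

The length check is deliberate: a plain zip() would silently stop at the end of the shorter file and hide a misaligned corpus.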
Acknowledgments
=======
I would like to thank my doctoral supervisor, Prof. Girish Nath Jha (https://jnu.ac.in/Faculty/gnjha/), and the Sanskrit Computational Lab, JNU, New Delhi (http://sanskrit.jnu.ac.in/index.jsp).
References
=======
<pre>
@article{ojha2019english,
title={English-Bhojpuri SMT System: Insights from the Karaka Model},
author={Ojha, Atul Kr},
journal={arXiv preprint arXiv:1905.02239},
year={2019}
}
</pre>
<pre>
@inproceedings{karakanta2019proceedings,
title={Proceedings of the 2nd Workshop on Technologies for MT of Low Resource Languages},
author={Karakanta, Alina and Ojha, Atul Kr and Liu, Chao-Hong and Washington, Jonathan and Oco, Nathaniel and Lakew, Surafel Melaku and Malykh, Valentin and Zhao, Xiaobing},
booktitle={Proceedings of the 2nd Workshop on Technologies for MT of Low Resource Languages},
year={2019}
}
</pre>
<pre>
@article{kumar2018automatic,
title={Automatic identification of closely-related Indian languages: Resources and experiments},
author={Kumar, Ritesh and Lahiri, Bornini and Alok, Deepak and Ojha, Atul Kr and Jain, Mayank and Basit, Abdul and Dawer, Yogesh},
journal={arXiv preprint arXiv:1803.09405},
year={2018}
}
</pre>
<pre>
@inproceedings{ojha2015training,
title={Training \& evaluation of POS taggers in Indo-Aryan languages: a case of Hindi, Odia and Bhojpuri},
author={Ojha, Atul Kr. and Behera, Pitambar and Singh, Srishti and Jha, Girish N},
booktitle={Proceedings of the 7th Language \& Technology Conference: Human Language Technologies as a Challenge for Computer Science and Linguistics},
pages={524--529},
year={2015}
}
</pre>
<pre>
=== Machine-readable metadata (DO NOT REMOVE!) ================================
Data available since: BHLTR v1.0
License: CC BY-NC-SA 4.0
Includes text: yes
Contributors: Ojha, Atul Kr.
Contact: shashwatup9k@gmail.com
===============================================================================
</pre>