Commit
Adding the API and all required files!
0 parents
commit 6f791cf
Showing 17 changed files with 99 additions and 0 deletions.
@@ -0,0 +1,94 @@
## Persianp Processing Toolbox

Persianp is a text processing toolbox, developed in Java, for preprocessing Persian text. It performs the following tasks:
* Character-level normalization
* Tokenization
* Lemmatization
* POS tagging
* Stopword detection
* Noun phrase chunking
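
As an illustration of the first task in the list, character-level normalization for Persian commonly includes replacing Arabic code points with their Persian counterparts. The sketch below is not Persianp's implementation (its rules are not described in this README); the class `PersianCharNormalizer` and the two mappings shown are assumptions chosen for illustration.

```java
public class PersianCharNormalizer {

    // Maps a single character to its Persian form (illustrative subset only).
    private static char mapChar(char c) {
        switch (c) {
            case '\u064A': return '\u06CC'; // ARABIC LETTER YEH -> ARABIC LETTER FARSI YEH
            case '\u0643': return '\u06A9'; // ARABIC LETTER KAF -> ARABIC LETTER KEHEH
            default:       return c;        // every other character is left unchanged
        }
    }

    // Applies the mapping to every character of the input text.
    public static String normalize(String text) {
        StringBuilder sb = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); i++) {
            sb.append(mapChar(text.charAt(i)));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // A word typed with Arabic Kaf and Arabic Yeh.
        String input = "\u0643\u062A\u0627\u0628\u064A";
        System.out.println(normalize(input));
    }
}
```

Normalizing before tokenization means that two spellings of the same word, which differ only in these code points, are treated as the same token by later steps.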

### Using Persianp from the command line
Make sure the `res` folder is in the same directory as the jar file.

```bash
$ java -cp persianp-toolbox-1.0.jar com.persianp.nlp.process.Process -input inputfile.txt -output outputfile.txt -task (tokenize|tag|lemmatize|taglemmatize) [-nostopword] [-prop propertyFile.properties]
```

At the moment, NP chunking is not supported from the command line.
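
When the toolbox is driven from another Java program, the command above can be assembled programmatically. The following is a minimal sketch: the helper class `PersianpCommand` and its parameter names are hypothetical, and only the flags and task names shown in the command above are used.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class PersianpCommand {

    // Task names accepted by -task, as listed in the command above.
    private static final List<String> TASKS =
            Arrays.asList("tokenize", "tag", "lemmatize", "taglemmatize");

    // Builds the argument list; propertyFile may be null to omit -prop.
    public static List<String> build(String jar, String input, String output,
                                     String task, boolean noStopword, String propertyFile) {
        if (!TASKS.contains(task)) {
            throw new IllegalArgumentException("Unsupported task: " + task);
        }
        List<String> cmd = new ArrayList<>(Arrays.asList(
                "java", "-cp", jar, "com.persianp.nlp.process.Process",
                "-input", input, "-output", output, "-task", task));
        if (noStopword) {
            cmd.add("-nostopword");
        }
        if (propertyFile != null) {
            cmd.add("-prop");
            cmd.add(propertyFile);
        }
        return cmd;
    }

    public static void main(String[] args) {
        List<String> cmd = build("persianp-toolbox-1.0.jar", "inputfile.txt",
                "outputfile.txt", "taglemmatize", true, null);
        // Print the command instead of running it.
        System.out.println(String.join(" ", cmd));
    }
}
```

The resulting list can be passed to `new ProcessBuilder(cmd)` to launch the tool; the `res` folder still has to sit next to the jar file.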

### Using the Persianp API
Add the API to the libraries of your program. The following example shows how to use the toolbox.

```java
import java.io.BufferedReader;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.List;
import java.util.Properties;

// Fully qualified name taken from the command line above.
import com.persianp.nlp.process.Process;
// Also import the toolbox's token class. It is assumed here to be named Token;
// its package is not shown in this README.

public class TestPersianp {

    public static void main(String[] args) {
        TestPersianp testPersianp = new TestPersianp();
        testPersianp.process();
    }

    private void process() {
        ClassLoader loader = this.getClass().getClassLoader();
        // try-with-resources closes both streams, even if processing fails.
        try (InputStream propsIn = loader.getResourceAsStream("persianp.properties");
             BufferedReader br = new BufferedReader(new InputStreamReader(
                     loader.getResourceAsStream("testText.txt"), StandardCharsets.UTF_8))) {
            Properties properties = new Properties();
            properties.load(propsIn);
            Process process = new Process(properties);
            String line;
            while ((line = br.readLine()) != null) {
                process.process(line);

                System.out.println(process.getText());
                // Other accessors for the processed line:
                // process.getTokens();
                // process.getTokensText();
                // process.getTags();
                // process.getChunkTag();
                // process.getLemmas();
                // process.getNonStopwordTokens();

                int sentenceSize = process.getSentencesSize();
                for (int j = 0; j < sentenceSize; ++j) {
                    // Per-sentence accessors:
                    // List tokensText = process.getTokensTextInSentence(j);
                    // List tags = process.getTagsInSentence(j);
                    // List lemmas = process.getLemmasInSentence(j);
                    List<Token> tokens = process.getTokensInSentence(j);
                    // Print one token per row: text, lemma, POS tag.
                    for (Token token : tokens) {
                        System.out.println(token.getText() + "\t\t\t" + token.getLemma() + "\t\t\t" + token.getTag());
                    }
                }
            }
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}
```
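
`ClassLoader.getResourceAsStream` returns `null` when a resource is not on the classpath, so a missing `persianp.properties` in the example above fails with a `NullPointerException` that does not name the missing file. A hedged sketch of a more explicit loader follows; the class `ResourceProperties` is a hypothetical helper, not part of Persianp.

```java
import java.io.FileNotFoundException;
import java.io.IOException;
import java.io.InputStream;
import java.util.Properties;

public class ResourceProperties {

    // Loads a properties file from the classpath, failing with a clear message if it is absent.
    public static Properties load(ClassLoader loader, String name) throws IOException {
        try (InputStream in = loader.getResourceAsStream(name)) {
            if (in == null) {
                throw new FileNotFoundException("Resource not found on classpath: " + name);
            }
            Properties properties = new Properties();
            properties.load(in);
            return properties;
        }
    }

    public static void main(String[] args) {
        try {
            Properties properties =
                    load(ResourceProperties.class.getClassLoader(), "persianp.properties");
            System.out.println("Loaded " + properties.size() + " properties");
        } catch (IOException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

The returned `Properties` object can then be passed to `new Process(properties)` as in the example above.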

### More Information / Citing This Toolbox
Please cite the paper below if you use the Persianp toolbox in your research. It also provides more information about the toolbox.

> Mahdi Mohseni, Javad Ghofrani, Heshaam Faili
> Persianp: A Persian Text Processing Toolbox
> International Conference on Intelligent Text Processing and Computational Linguistics
> CICLing 2016: Computational Linguistics and Intelligent Text Processing, pp. 75-87

BibTeX citation:

```
@InProceedings{Persianp2016,
  author="Mohseni, Mahdi
    and Ghofrani, Javad
    and Faili, Heshaam",
  title="Persianp: A Persian Text Processing Toolbox",
  booktitle="Computational Linguistics and Intelligent Text Processing",
  year="2018",
  publisher="Springer International Publishing",
  pages="75--87",
  isbn="978-3-319-75477-2"
}
```

Binary file not shown.