Task: write the "Getting started: preprocessing" doc page
Advice/Tips to the technical writer
Good to know:
This page appears after "1. Getting started". Users reading this page already have a global overview of Texthero and are motivated to learn more.
As this is, by ordering, the second page of the documentation (i.e. in theory the second page the user reads), we want to teach users how to profit most from the library. This includes:
Mention that the API is very complete and very easy to use, i.e. we want to teach users how to find information on their own (we achieve our goal of creating an awesome library when the user is capable of finding answers on our website without needing to search Google or open a new Stack Overflow question)
...
Concept useful to have clear in mind:
What pandas.Series.pipe is and how it works
Why text preprocessing is crucial in many NLP/text-mining areas
Dealing with regex is often painful, with Texthero this can be avoided
With the exception of tokenize, all preprocessing functions receive a TextSeries as input and return a TextSeries.
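To make the pipe concept concrete for the writer, here is a minimal sketch of how pandas.Series.pipe chains functions left-to-right; Texthero's preprocessing functions slot into this same pattern, but the two helper functions below are plain-pandas stand-ins written for this example, not Texthero's actual implementations:

```python
import pandas as pd

# pandas.Series.pipe passes the Series to a function and returns the
# result, so transformations can be chained left-to-right instead of
# being nested. Texthero's preprocessing functions follow the same
# Series-in / Series-out contract; these helpers are illustrative
# stand-ins only.

def lowercase(s: pd.Series) -> pd.Series:
    return s.str.lower()

def remove_digits(s: pd.Series) -> pd.Series:
    return s.str.replace(r"\d+", "", regex=True)

s = pd.Series(["Hello WORLD 123", "Texthero 4 Ever"])
cleaned = s.pipe(lowercase).pipe(remove_digits)
print(cleaned.tolist())  # ['hello world ', 'texthero  ever']
```

Note that each step receives and returns a Series, which is exactly the TextSeries-in / TextSeries-out contract described above.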
Things to keep in mind when writing:
In the future, Texthero will also allow preprocessing of non-Western languages
To stay in the technical discussion loop:
What if every preprocessing function required an already tokenized Series as input? I.e. the first mandatory step would be tokenize, even before remove_punctuation or anything else. This is useful when dealing with non-Western languages (see "All preprocessing functions to receive as input TokenSeries" #145).
Page
aim: learn how to preprocess a text-based dataset with Texthero
content:
clean function: the default way, an option when no customization is required; to customize, copy clean's code and edit the pipeline (see "Preprocessing: explain how to create a custom pipeline" #38)
mention/explain that different preprocessing functions exist (cite a few, link them, ...). Not sure we want to explain all of them; this is up to the writer ...
...
(at the end) --> tokenize
explain what tokenization is about and why it is needed
output is a TokenSeries (fundamentally different; show an example: every cell is now a list of tokens)
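A small sketch of the "every cell becomes a list of tokens" point the page should illustrate. Texthero's hero.tokenize does the real work; the naive whitespace split below is a hypothetical stand-in used only to show the change of shape from TextSeries to TokenSeries:

```python
import pandas as pd

# Before tokenization every cell is a string; after tokenization every
# cell is a list of tokens -- a TokenSeries in Texthero's terms.
# This naive whitespace split is NOT Texthero's tokenizer; it only
# demonstrates the structural change.

def naive_tokenize(s: pd.Series) -> pd.Series:
    return s.str.split()

s = pd.Series(["natural language processing", "text mining"])
tokens = naive_tokenize(s)
print(tokens[0])  # ['natural', 'language', 'processing']
```

Showing a before/after like this in the doc page should make the TextSeries vs. TokenSeries distinction immediately visible.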
I am structuring this part as follows (based on a review of similar contexts, including Texthero's "Getting Started" structure):
**Overview/Intro** Why pre-processing is crucial and what the benefits of a standardized/customizable pipeline are
**Clean** What it does and how
**Custom Pipeline** Why and how you should take control of the pre-processing steps
**More details** Including pre-processing API functionalities
Please let me know if something is not clear or if you have any additional suggestions.
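For the "Custom Pipeline" section, a sketch of the mechanics may help. In Texthero this would be hero.clean(s, pipeline=[...]) with functions from texthero.preprocessing; the functions and the clean helper below are plain-pandas stand-ins whose names mirror (but are not) Texthero's API, written so the example is self-contained:

```python
import pandas as pd

# A custom pipeline is just an ordered list of Series -> Series
# functions applied one after another. These stand-ins mimic the shape
# of Texthero's preprocessing functions for illustration only.

def lowercase(s: pd.Series) -> pd.Series:
    return s.str.lower()

def remove_punctuation(s: pd.Series) -> pd.Series:
    return s.str.replace(r"[^\w\s]", "", regex=True)

def remove_whitespace(s: pd.Series) -> pd.Series:
    return s.str.split().str.join(" ")

custom_pipeline = [lowercase, remove_punctuation, remove_whitespace]

def clean(s: pd.Series, pipeline) -> pd.Series:
    # Apply each step in order via Series.pipe.
    for step in pipeline:
        s = s.pipe(step)
    return s

s = pd.Series(["  Hello, World!  ", "Text   Mining..."])
print(clean(s, custom_pipeline).tolist())  # ['hello world', 'text mining']
```

The point for readers: taking control of preprocessing means reordering, dropping, or adding steps in that list, with no regex wrangling of their own.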