Add spacy.PlainTextCorpusReader.v1 #12122

Merged: 10 commits into explosion:master on Jan 26, 2023

Conversation

@danieldk (Contributor)

Description

This is a corpus reader that reads plain text corpora with the following format:

  • UTF-8 encoding
  • One line per document.
  • Blank lines are ignored.

It is useful for applications where we deal with very large corpora, such as distillation, and don't want to deal with the space overhead of serialized formats. Additionally, many large corpora already use such a text format, keeping the necessary preprocessing to a minimum.
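
The format described above can be illustrated with a minimal standard-library sketch. This is not the reader added in this PR; the function name `iter_plain_text_docs` is hypothetical, and treating whitespace-only lines as blank is an assumption made here for illustration.

```python
from pathlib import Path
from typing import Iterator, Union


def iter_plain_text_docs(path: Union[str, Path]) -> Iterator[str]:
    """Yield one document text per non-blank line of a UTF-8 file.

    Hypothetical helper for illustration only; not spaCy's implementation.
    """
    # Read lazily, line by line, so a very large corpus never has to
    # fit in memory at once.
    with open(path, encoding="utf-8") as f:
        for line in f:
            text = line.strip()  # drop the trailing newline and outer whitespace
            if text:  # assumption: whitespace-only lines count as blank
                yield text
```

Because the function is a generator, each yielded string could be handed to a tokenizer or pipeline one at a time, which is what keeps the overhead low for very large corpora.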

Types of change

New feature

Checklist

  • I confirm that I have the right to submit this contribution under the project's MIT license.
  • I ran the tests, and all new and existing tests passed.
  • My changes don't require a change to the documentation, or if they do, I've added all required information.

@danieldk added the enhancement, feat / training, and 🔜 v4.0 labels on Jan 18, 2023
Review thread on spacy/training/corpus.py (resolved)
Review thread on website/docs/api/corpus.mdx (outdated, resolved)
@danieldk (Contributor, Author)

Completely forgot to stage the tests that I wrote, added now as well.

@shadeMe (Contributor) left a comment

Minor nitpick; LGTM otherwise!

Review thread on spacy/tests/training/test_corpus.py (outdated, resolved)
@svlandeg (Member) left a comment

Apologies if this is a silly question, but why can't this just go in master instead of v4?

@danieldk (Contributor, Author)

> Apologies if this is a silly question, but why can't this just go in master instead of v4?

That's a good question. I didn't need it there, but I could just as well rebase it onto master.

@svlandeg (Member)

Then we'd just update the tags to "3.5.1" and merge this after 3.5 is tagged/out

@danieldk changed the base branch from v4 to master January 19, 2023 16:41
@danieldk changed the base branch from master to v4 January 19, 2023 16:41
danieldk and others added 8 commits January 19, 2023 17:43
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
Different OS auto delete/sharing semantics are just wonky.
@danieldk force-pushed the feature/plain-text-corpus branch from 9fdea9d to 2f9d8f7 January 19, 2023 16:43
@danieldk changed the base branch from v4 to master January 19, 2023 16:43
@danieldk (Contributor, Author)

Rebased onto master. I hope the force push to make it happen will be forgiven 🤣 .

@danieldk removed the 🔜 v4.0 label Jan 20, 2023
@adrianeboyd (Contributor) left a comment

Just a few nitpicky comments on the tests...

Three review threads on spacy/tests/training/test_corpus.py (outdated, resolved)
@adrianeboyd merged commit 8d69874 into explosion:master Jan 26, 2023
Labels: enhancement, feat / training
4 participants