Add TextDirectoryCorpus that yields one doc per file recursively read from directory #1387
Comments
I'm not sure this is something we want in core gensim... there's no end to the number of similar utility functions and their combinations we could potentially support, a Pandora's box. I summarized my thoughts on this topic in a blog post: API Bondage. @menshikh-iv @gojomo @cscorley @tmylk, your thoughts?
I think it's a useful feature. I have written similar wrappers many times. I understand your position, @piskvorky, but where is the line after which we no longer want to add new features?
I can certainly understand not wanting to bloat a code base. I've also written similar wrappers many times. I figured since there's already a ...

One additional benefit of grouping this and similar corpus classes: it would open up opportunities to provide the distribution used by the ...
Yeah, a separate dedicated subpackage (or even a repo) that focuses on various efficient, streamed (parallelized?) readers, for sundry data formats, sounds good to me. I don't mind including this "subdirs reader" as a blueprint example; it's a common use case, as you say. But I'd be -1 on adding many such readers in an ad-hoc manner, at ad-hoc locations, throughout gensim. Having a clear structure and plan behind it sounds better.
Having a dedicated subpackage opens up some interesting options. What do you think of the following proposal? It's a tad long, but I hope it's closer to a clear structure and plan.

(1) Add a new ...

(2) Combine the various text-based corpus classes into ...
So, common functionality includes taking in one or more text files as either file handles or filesystem paths, opening and reading them, performing unicode conversion and tokenization, and then yielding the words in some format. Each line may be converted into one or more documents, and lowercasing and stopword removal may be performed. Additional preprocessing may occur (as in the POS tag handling). Finally, the words may be wrapped up in some other wrapper (as in ...).

So to reorganize, my initial thought would be this: ...
An alternative might be to put tagged corpora in a ...
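To make the shared-functionality idea above concrete, here is a rough sketch of the kind of base class being described: read text from a file or path, apply configurable preprocessing, and yield one tokenized document at a time. This is illustrative only; all class and method names are hypothetical, not the actual proposal from this thread.

```python
class BaseTextCorpus(object):
    """Hypothetical shared base: streaming, preprocessing, one doc at a time."""

    def __init__(self, path, lowercase=True, stopwords=None,
                 lines_are_documents=False):
        self.path = path                        # filesystem path to a text file
        self.lowercase = lowercase
        self.stopwords = set(stopwords or [])
        self.lines_are_documents = lines_are_documents

    def getstream(self):
        """Yield raw document strings from the underlying source."""
        with open(self.path, encoding='utf8') as fin:
            if self.lines_are_documents:
                for line in fin:
                    yield line
            else:
                yield fin.read()

    def preprocess(self, text):
        """Tokenization, lowercasing, stopword removal."""
        tokens = text.split()
        if self.lowercase:
            tokens = [token.lower() for token in tokens]
        return [token for token in tokens if token not in self.stopwords]

    def get_texts(self):
        """Yield one preprocessed (tokenized) document per underlying document."""
        for doc in self.getstream():
            yield self.preprocess(doc)
```

Under this shape, a directory-based reader would only need to override `getstream()` to walk the tree and yield one file's contents per document, while tokenization and filtering stay shared.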
Awesome! Thanks for the thoughtful investigation, @macks22. Since this looks like an architectural change, let's also include the bundling of datasets and models in the discussion. We've been wanting to include some common datasets and trained models for playing around with in gensim (beyond the tiny data we have as part of unit tests now). Since it's related to how we handle corpus inputs, I think it should be discussed in the same place.
So you want to discuss how to include pre-trained models, or make them accessible for download? For the datasets, I would think you could just provide code to download them from public URLs (as scikit does). For models, I'm not sure what the best approach is. It might be worth checking out how spaCy handles that, since they do have some mechanism for loading pre-trained GloVe word vectors. Are these the sorts of concerns you are wanting to include in the discussion?
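For the "download from public URLs" idea, a fetcher along the lines of scikit-learn's `fetch_*` helpers could be as simple as the sketch below. The URL and cache location are placeholders; this is not an actual gensim API.

```python
import os
import urllib.request


def fetch_dataset(url, cache_dir=os.path.expanduser('~/gensim-data')):
    """Download `url` into `cache_dir` once and return the local file path."""
    os.makedirs(cache_dir, exist_ok=True)
    local_path = os.path.join(cache_dir, os.path.basename(url))
    if not os.path.exists(local_path):
        urllib.request.urlretrieve(url, local_path)
    return local_path


# e.g. path = fetch_dataset('https://example.com/20newsgroups.tar.gz')
```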
…ocessing pipeline that emulates Elasticsearch's analyzers API. Preprocessing consists of 0+ character filters, a tokenizer, and 0+ token filters.
…or `TextDirectoryCorpus`.
…dd `lines_are_documents` option and test coverage for it, and add test for non-trivial directory structure. Make sampling more efficient by not preprocessing discarded samples. Consolidate TextCorpus tests in `test_corpora`.
…ranch: moving new `TextCorpus` sampling method tests into `test_corpora`.
…ke modifying preprocessing steps more modular. Consolidate tests in `test_corpora`.
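The analyzer-style pipeline mentioned in the commits above (0+ character filters, then a tokenizer, then 0+ token filters) composes roughly as sketched below. This is a minimal illustration of that shape, not the implementation from those commits; the filter functions are made up for the example.

```python
import re


def lower_case(text):                  # character filter: operates on the raw string
    return text.lower()


def strip_punctuation(text):           # character filter
    return re.sub(r'[^\w\s]', ' ', text)


def whitespace_tokenize(text):         # tokenizer: string -> list of tokens
    return text.split()


def remove_short(tokens, min_len=2):   # token filter: list of tokens -> list of tokens
    return [token for token in tokens if len(token) >= min_len]


def preprocess(text,
               character_filters=(lower_case, strip_punctuation),
               tokenizer=whitespace_tokenize,
               token_filters=(remove_short,)):
    for char_filter in character_filters:
        text = char_filter(text)
    tokens = tokenizer(text)
    for token_filter in token_filters:
        tokens = token_filter(tokens)
    return tokens


print(preprocess("Hello, world! This is a tiny test."))
# ['hello', 'world', 'this', 'is', 'tiny', 'test']
```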
Description
Plain text corpora are sometimes represented using a directory structure with, for instance, a top-level directory and subdirectories that represent categories. Within each subdirectory, each file might be a document. The nesting may run deeper in such directory structures to reflect additional categorization. The popular 20 newsgroups dataset has such a structure. Here is a subset of that structure:
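An illustrative subset of the 20 newsgroups layout (the directory names are real newsgroup names; the numeric file names are placeholders):

```
20news-18828/
    alt.atheism/
        49960
        51060
    comp.graphics/
        37261
        37913
    sci.space/
        59848
        60779
```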
It would be useful to have a gensim corpus available to handle this sort of corpus structure.
Steps/Code/Corpus to Reproduce
I'm envisioning something like this:
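A sketch of how such a corpus might be used; the import path and constructor signature here are assumptions for illustration, since the class does not exist at the time of this issue.

```python
# Assumed import path for the proposed class.
from gensim.corpora.textcorpus import TextDirectoryCorpus

# Treat every file found by recursively walking the directory as one document.
corpus = TextDirectoryCorpus('20news-18828')

for tokens in corpus.get_texts():
    print(tokens[:10])  # first few tokens of each document
```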
Expected Results
Actual Results
Versions
This should be available in all versions.