Guide to Lexos Loader Modules
This is a running guide describing the features of the various loader modules in the developing Lexos API. It should be useful for testing and for writing tutorials.
The first loader module, `basic`, is (hopefully) deprecated. It loads plain text files only, from paths (local or URL) and lists of paths, as well as from directories of plain text files.

The `advanced` module is also (hopefully) deprecated. It is the same as the `basic` module but additionally accepts `.docx` and `.pdf` formats. URLs are processed separately from local files using `requests`.
The `smart` module supersedes `basic` and `advanced`. It loads plain text, `.docx`, and `.pdf` files, as well as directories or zip archives containing files in those formats. Local paths and URLs are processed together using `smart_open`. The `Loader` class can be instantiated with a path or a list of paths.
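For example, here is a minimal sketch (assuming, as in the examples at the end of this guide, that loaded texts are exposed through the `Loader.texts` attribute and that loading happens at instantiation):

```python
from lexos.io.smart import Loader

# Local paths and URLs can be mixed in a single list
loader = Loader(["sample.txt", "https://example.com/sample.pdf"])

# Preview the first 100 characters of each loaded text
for text in loader.texts:
    print(text[:100])
```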
Notes:
- PDFs can take a long time to process.
- In GitHub repos, files are not stored along the path to the folder shown in the web navigation. As a result, it is necessary to fetch them using the GitHub API, which requires the user, repository, and branch. If the loader detects a URL containing `github.com`, it will try to reconstruct this information from the supplied path when you call `loader = Loader(path_to_repo_dir)`. If it fails, you can supply the information manually by calling:
```python
# Return a list of raw download paths
# (assumes `utils` is the Lexos utils module, e.g. `from lexos import utils`)
paths = utils.get_github_raw_paths(path_to_repo_dir, user=user, repo=repo, branch=branch)
loader.load(paths)
```
Note: Good test data for datasets is not yet available.
It is increasingly common for text analysis and NLP tools to use datasets in which many texts are stored in a single file. The simplest format is a line-delimited plain text file, where each text is on a separate line. Other common formats include CSV/TSV with one text per row, JSON, JSONL (newline-delimited JSON), and Excel spreadsheets. Lexos provides a `Dataset` class to parse data in a variety of formats into a common data structure. If called in a form like `dataset = Dataset.parse_csv("source.csv")`, a list of texts can then be accessed with `dataset.texts`.
There is also a more complex `DatasetLoader` class, which can be called in a form like `dataset = DatasetLoader("source.csv")`. Internally, this tries to guess and apply the correct `Dataset` constructor method. The `DatasetLoader` can also be used to load items from directories or zip archives, although it will fail if individual items are in different formats. In this case, it is better to parse each item one by one, as in the sketch below.
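Here is one way to do that (the dispatch-by-extension logic is illustrative, not part of the Lexos API):

```python
from pathlib import Path

from lexos.io.dataset import Dataset

# Parse each file in a mixed-format directory with the matching constructor
datasets = []
for path in Path("data_dir").iterdir():
    if path.suffix == ".csv":
        datasets.append(Dataset.parse_csv(str(path)))
    elif path.suffix == ".json":
        datasets.append(Dataset.parse_json(str(path)))
    elif path.suffix == ".jsonl":
        datasets.append(Dataset.parse_jsonl(str(path)))
```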
A preliminary draft of the documentation for using the `Dataset` and `DatasetLoader` classes is provided below.
A `Dataset` object is an instance of a Pydantic model. This means that printing it gives a handy keyword representation. You can also export it in dict or JSON format with `Dataset.dict()` or `Dataset.json()`. These Pydantic helper methods have been extended with `Dataset.df()`, which exports to a Pandas dataframe.
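For example, assuming `dataset` has already been parsed:

```python
data = dataset.dict()      # plain Python dict
json_str = dataset.json()  # JSON string
df = dataset.df()          # Pandas dataframe

print(df.head())  # inspect the first few rows
```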
Datasets have three built-in properties:

- `Dataset.data`: A synonym for `Dataset.dict()`.
- `Dataset.names`: A list of "title" values in the dataset.
- `Dataset.texts`: A list of "text" values in the dataset.
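For instance, with a two-text dataset like the `parse_string` example later in this guide:

```python
print(dataset.names)    # e.g. ["Text1", "Text2"]
print(dataset.texts)    # the corresponding text strings
records = dataset.data  # the same output as dataset.dict()
```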
`Dataset` objects are iterable, meaning that you can loop through them with the following:

```python
for item in dataset:
    ...
```
Important: Iterating over a `Dataset` object yields a generator. If you want the output as a list, use `list(generator)` or call `dataset.data`. Note that, if you have many texts, or longer texts, these approaches require you to load the entire dataset into memory. This may cause problems such as exceeding your buffer capacity in a Jupyter notebook. You can avoid these problems by iterating through the generator.
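The difference looks like this (`handle()` is a hypothetical per-item callback standing in for your own code):

```python
# Materialise everything at once: simple, but loads the whole dataset into memory
items = list(dataset)

# Stream one item at a time: memory use stays flat even for large datasets
for item in dataset:
    handle(item)
```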
`Dataset` objects can be parsed from a variety of formats:
- Python dicts.
- Line-delimited texts.
- CSV and TSV formats.
- Excel files.
- JSON format.
- Line-delimited JSON format (`.jsonl`).
- Directories and zip archives containing files in the above formats.
Except in the cases of dicts and Excel files, the input can be either a string or a path/URL to a file containing the string. Dicts must be passed directly as dicts, and Excel files must be read from a path.
Each format requires a slightly different constructor method, detailed below:
Line-delimited texts are strings containing multiple texts, each on a separate line.
Datasets are constructed from line-delimited texts using `Dataset.parse_string(source)`, where `source` is a string, filepath, or URL.
Since line-delimited texts contain no metadata, a list of labels must also be supplied, and the number of labels (generally text titles) must match the number of lines in the string or file. For example:
```python
dataset = Dataset.parse_string(source, labels=["Text1", "Text2"])
```
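A complete sketch with an inline source string:

```python
from lexos.io.dataset import Dataset

source = "This is the first text.\nThis is the second text."

# The number of labels must match the number of lines in the source
dataset = Dataset.parse_string(source, labels=["Text1", "Text2"])
print(dataset.names)  # ["Text1", "Text2"]
```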
Data in CSV or TSV format must have headers, and the headers should ideally contain "title" and "text". If they do not, the `title_col` and `text_col` arguments can be supplied to indicate which columns should be converted to these header names. For instance:

```python
dataset = Dataset.parse_csv(source, title_col="label", text_col="content")
```
For TSV files, an additional `sep` argument must be supplied:

```python
dataset = Dataset.parse_csv(source, sep="\t")
```
CSV and TSV data is parsed with `pandas.read_csv()`, so any Pandas keyword arguments for this method are accepted.
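For example, an encoding override can be passed straight through (the `encoding` keyword here is just one illustration of a `pandas.read_csv()` argument):

```python
dataset = Dataset.parse_csv("source.csv", title_col="label", text_col="content", encoding="utf-8")
```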
Excel files are treated exactly like CSV and TSV files, except that they are parsed with `pandas.read_excel()` and accept keyword arguments for that method. `Dataset.parse_excel()` requires a filepath or URL as the source of the data.
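For example (assuming, per the description above, that `parse_excel()` accepts the same `title_col` and `text_col` arguments as `parse_csv()`; `sheet_name` is a pass-through `pandas.read_excel()` keyword):

```python
dataset = Dataset.parse_excel("source.xlsx", title_col="label", text_col="content", sheet_name=0)
```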
Data in dict format must either have "title" and "text" fields, or the `title_field` and `text_field` arguments must be supplied to indicate which fields should be converted to these field names. For instance:

```python
dataset = Dataset.parse_dict(source, title_field="label", text_field="content")
```
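A minimal sketch (this assumes a single-record dict; whether a list of such dicts is also accepted depends on the API):

```python
from lexos.io.dataset import Dataset

source = {"label": "Text1", "content": "This is the text."}
dataset = Dataset.parse_dict(source, title_field="label", text_field="content")
```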
Data in JSON format must either have "title" and "text" fields, or the `title_field` and `text_field` arguments must be supplied to indicate which fields should be converted to these field names. For instance:

```python
dataset = Dataset.parse_json(source, title_field="label", text_field="content")
```
JSON data is parsed with `pandas.read_json()`, so any Pandas keyword arguments for this method are accepted. Note, however, that the `lines=True` argument should not be used to load line-delimited JSON. Instead, use the `Dataset.parse_jsonl()` method described below.
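A minimal sketch with an inline JSON string (a filepath or URL would work the same way):

```python
from lexos.io.dataset import Dataset

source = '[{"label": "Text1", "content": "This is the text."}]'
dataset = Dataset.parse_json(source, title_field="label", text_field="content")
```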
Data in line-delimited JSON format must either have "title" and "text" fields, or the `title_field` and `text_field` arguments must be supplied to indicate which fields should be converted to these field names. For instance:

```python
dataset = Dataset.parse_jsonl(source, title_field="label", text_field="content")
```
Line-delimited JSON data is likewise parsed with `pandas.read_json()`, so any Pandas keyword arguments for this method are accepted.
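A minimal sketch with an inline JSONL string, one object per line:

```python
from lexos.io.dataset import Dataset

source = '{"label": "Text1", "content": "First text."}\n{"label": "Text2", "content": "Second text."}'
dataset = Dataset.parse_jsonl(source, title_field="label", text_field="content")
```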
The `DatasetLoader` class attempts to auto-detect the data format and apply the correct constructor method. Like the `Dataset` class, it is iterable, and you can access individual items with commands like `dataset[0]["title"]` to get the title of the first item in the dataset. Like the `Dataset` class, the output of a `DatasetLoader` object is a generator. An instance is constructed as follows:

```python
from lexos.io.dataset import Dataset, DatasetLoader

dataset = DatasetLoader("source.csv", title_col="label", text_col="content")
```
Notice that, although the format is detected automatically, you may still need to provide the information required by the `Dataset.parse_csv()` method for your data to be parsed correctly.
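Once constructed, items can be accessed by index or by iterating over the generator:

```python
print(dataset[0]["title"])  # title of the first item

for item in dataset:
    print(item["title"])
```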
If you know the data type of your data (say, a CSV), the simplest method of adding a dataset to the `Loader` is as follows:

```python
from lexos.io.dataset import Dataset, DatasetLoader
from lexos.io.smart import Loader

loader = Loader()
dataset = Dataset.parse_csv("source.csv")  # or `dataset = DatasetLoader("source.csv")`
loader.titles = dataset.names  # `Dataset` exposes its titles via the `names` property
loader.texts = dataset.texts
```