Guide to Lexos Loader Modules

Scott Kleinman edited this page Jun 20, 2022 · 11 revisions

This is a running guide describing the features of the various loader modules in the developing Lexos API. It should be useful for testing and for writing tutorials.

One Text per File

basic

The first loader module, which is (hopefully) deprecated.

Loads plain text files only, from paths (local or URL) and lists of paths, as well as from directories of plain text files.

advanced

This file is (hopefully) deprecated. It is the same as the basic module but also accepts .docx and .pdf formats. URLs are processed separately from local files using requests.

smart

Supersedes basic and advanced. Loads plain text, .docx, and .pdf files, as well as directories or zip archives containing files in those formats. Local paths and URLs are processed together using smart_open. The Loader class can be instantiated with a path or list of paths.

Notes:

  • PDFs can take a long time to process.
  • In GitHub repos, files are not stored along the path to the folder shown in the web navigation. As a result, it is necessary to fetch them using the GitHub API, which requires the user, repository, and branch. If the loader detects a URL containing github.com, it will try to reconstruct this information from the supplied path when you call loader = Loader(path_to_repo_dir). If it fails, you can supply the information manually by calling
# Return a list of raw download paths
paths = utils.get_github_raw_paths(path_to_repo_dir, user=user, repo=repo, branch=branch)
loader.load(paths)

Datasets (One File with Multiple Texts)

Note: Good test data for datasets is not yet available.

It is increasingly common for text analysis and NLP tools to use datasets in which many texts are stored in a single file. The simplest is a line-delimited plain text file, where each text is on a separate line. Other common formats include CSV/TSV with one text per row, JSON, JSONL (newline-delimited JSON), and Excel spreadsheets. Lexos provides a Dataset class to parse data in a variety of formats into a common data structure. If called in a form like dataset = Dataset.parse_csv("source.csv"), a list of texts can then be accessed with dataset.texts.
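As a rough illustration, a parsed dataset can be pictured as a list of title/text records (the exact internal schema is an assumption; only the texts property is documented above):

```python
# Hypothetical sketch of the normalized structure a Dataset represents:
# each text becomes a record with "title" and "text" keys.
records = [
    {"title": "Text1", "text": "The first text."},
    {"title": "Text2", "text": "The second text."},
]

# dataset.texts would then correspond to:
texts = [record["text"] for record in records]
```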

There is also a more complex DatasetLoader class, which can be called in a form like dataset = DatasetLoader("source.csv"). Internally, this tries to guess and implement the correct Dataset constructor method. The DatasetLoader can also be used to load items from directories or zip archives, although it will fail if individual items are in different formats. In this case, it is better to parse each item one by one.

A preliminary draft of the documentation for using the Dataset and DatasetLoader classes is provided below.

A Dataset object is an instance of a Pydantic model. This means that printing it gives a handy keyword representation. You can also export it in dict or JSON format with Dataset.dict() or Dataset.json(). These Pydantic helper methods have been extended to include Dataset.df(), which exports to a Pandas dataframe.

Datasets have three built-in properties:

  • Dataset.data: A synonym for Dataset.dict()
  • Dataset.names: A list of "title" values in the dataset.
  • Dataset.texts: A list of "text" values in the dataset.

Dataset objects are iterable, meaning that you can loop through them with the following:

for item in dataset:
    ...

Important: Iterating over a Dataset object yields items from a generator. If you want the output as a list, use list(generator) or call dataset.data. Note that, if you have a lot of texts, or longer texts, these options require loading the entire dataset into memory. This may cause problems such as exceeding your buffer capacity in a Jupyter notebook. You can avoid these problems by iterating through the generator.
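The difference can be sketched with a plain Python generator standing in for a real Dataset (an illustration, not the library's code):

```python
def iter_items():
    # Stand-in for iterating a Dataset: yields one record at a time.
    for i in range(1, 4):
        yield {"title": f"Text{i}", "text": f"Text number {i}."}

# Lazy iteration: only one record is in memory at a time.
titles = []
for item in iter_items():
    titles.append(item["title"])

# Eager materialization: the whole dataset is loaded into memory at once.
all_items = list(iter_items())
```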

Dataset objects can be parsed from a variety of formats:

  1. Python dicts.
  2. Line-delimited texts.
  3. CSV and TSV formats.
  4. Excel files.
  5. JSON format.
  6. Line-delimited JSON format (.jsonl)
  7. Directories and zip archives containing files in the above formats.

Except in the cases of dicts and Excel files, the input can be either a string or a path/URL to a file containing the string. Dicts must be supplied as Python objects, and Excel files must be read from a path.

Each format requires a slightly different constructor method, detailed below:

Line-Delimited Texts (Dataset.parse_string())

Line-delimited texts are strings containing multiple texts, each on a separate line.

Datasets are constructed from line-delimited texts using Dataset.parse_string(source), where source is a string, filepath, or URL.

Since line-delimited texts contain no metadata, a list of labels must also be supplied, and the number of labels (generally text titles) must match the number of lines in the string or file. For example:

dataset = Dataset.parse_string(source, labels=["Text1", "Text2"])
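Under the hood, this is roughly a matter of splitting the string into lines and pairing each line with its label. A minimal sketch of that idea (an assumption about the internals, not the actual implementation):

```python
source = "The first text.\nThe second text."
labels = ["Text1", "Text2"]

lines = source.splitlines()
# The number of labels must match the number of lines.
assert len(labels) == len(lines)

# Pair each label with its line of text.
records = [{"title": label, "text": line} for label, line in zip(labels, lines)]
```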

CSV and TSV Formats (Dataset.parse_csv())

Data in CSV or TSV format must have headers, and the headers should ideally contain "title" and "text". If they do not, the title_col and text_col arguments can be supplied to indicate which columns should be mapped to these header names. For instance:

dataset = Dataset.parse_csv(source, title_col="label", text_col="content")

For TSV files, an additional argument must be supplied:

dataset = Dataset.parse_csv(source, sep="\t")

CSV and TSV data is parsed with pandas.read_csv(), so any Pandas keyword arguments for this method are accepted.
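Since parsing is delegated to pandas.read_csv(), the effect of title_col and text_col can be sketched with plain Pandas (the renaming step is an assumption about the internals):

```python
import io

import pandas as pd

csv_data = "label,content\nText1,The first text.\nText2,The second text."

# Read the CSV, then map the custom headers onto "title" and "text".
df = pd.read_csv(io.StringIO(csv_data)).rename(
    columns={"label": "title", "content": "text"}
)
```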

Excel Files (Dataset.parse_excel())

Excel files are treated exactly like CSV and TSV files, except that they are parsed with pandas.read_excel() and accept arguments for that method. Dataset.parse_excel() requires a filepath or url as the source of the data.

Dictionaries (Dataset.parse_dict())

Data in dict format must either have "title" and "text" fields or the title_field and text_field arguments must be supplied to indicate which fields should be converted to these field names. For instance:

dataset = Dataset.parse_dict(source, title_field="label", text_field="content")
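The field renaming can be sketched in plain Python (an illustration of the idea, not the library's code):

```python
source = [
    {"label": "Text1", "content": "The first text."},
    {"label": "Text2", "content": "The second text."},
]

title_field, text_field = "label", "content"

# Map the custom field names onto "title" and "text".
records = [
    {"title": item[title_field], "text": item[text_field]} for item in source
]
```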

JSON Format (Dataset.parse_json())

Data in JSON format must either have "title" and "text" fields or the title_field and text_field arguments must be supplied to indicate which fields should be converted to these field names. For instance:

dataset = Dataset.parse_json(source, title_field="label", text_field="content")

JSON data is parsed with pandas.read_json(), so any Pandas keyword arguments for this method are accepted. Note, however, that the lines=True argument should not be used to load line-delimited JSON. Instead, the Dataset.parse_jsonl() method below should be used.
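Since parsing is delegated to pandas.read_json(), the renaming of custom fields can be sketched with plain Pandas (the rename step is an assumption about the internals):

```python
import io

import pandas as pd

json_data = '[{"label": "Text1", "content": "The first text."}]'

# Read the JSON array, then map the custom fields onto "title" and "text".
df = pd.read_json(io.StringIO(json_data)).rename(
    columns={"label": "title", "content": "text"}
)
```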

Line-Delimited JSON Format (.jsonl) (Dataset.parse_jsonl())

Data in Line-Delimited JSON format must either have "title" and "text" fields or the title_field and text_field arguments must be supplied to indicate which fields should be converted to these field names. For instance:

dataset = Dataset.parse_jsonl(source, title_field="label", text_field="content")

Line-delimited JSON data is also parsed with pandas.read_json(), so any Pandas keyword arguments for this method are accepted.
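In plain Pandas terms, line-delimited JSON is read with the lines=True flag, which is presumably what Dataset.parse_jsonl() applies internally (an assumption):

```python
import io

import pandas as pd

# One JSON object per line, rather than a single JSON array.
jsonl_data = (
    '{"title": "Text1", "text": "The first text."}\n'
    '{"title": "Text2", "text": "The second text."}'
)

# lines=True tells pandas to parse each line as a separate record.
df = pd.read_json(io.StringIO(jsonl_data), lines=True)
```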

The DatasetLoader Class

The DatasetLoader class attempts to auto-detect the data format and apply the correct constructor method. Like the Dataset class, it is iterable, and you can access individual items with expressions like dataset[0]["title"] to get the title of the first item in the dataset. Like the Dataset class, the output of a DatasetLoader object is a generator. An instance is constructed as follows:

from lexos.io.dataset import Dataset, DatasetLoader

dataset = DatasetLoader("source.csv", title_col="label", text_col="content")

Notice that, although the format is detected, you may still need to provide information required by the Dataset.parse_csv() method for your data to be parsed correctly.
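One plausible way to picture the auto-detection is a mapping from file extensions to constructor methods. This is a hypothetical sketch; the real DatasetLoader may use different heuristics (e.g., inspecting file contents):

```python
from pathlib import Path

# Hypothetical extension-to-parser mapping.
PARSERS = {
    ".csv": "parse_csv",
    ".tsv": "parse_csv",
    ".xlsx": "parse_excel",
    ".json": "parse_json",
    ".jsonl": "parse_jsonl",
    ".txt": "parse_string",
}

def detect_parser(source: str) -> str:
    """Return the name of the Dataset constructor method for a given path."""
    return PARSERS.get(Path(source).suffix.lower(), "parse_string")
```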

Methods of Combining Datasets with the Standard Loader Class

If you know the data type of your data (say, a CSV), the simplest method of adding a dataset to the Loader is as follows:

from lexos.io.dataset import Dataset, DatasetLoader
from lexos.io.smart import Loader

loader = Loader()
dataset = Dataset.parse_csv("source.csv") # or `dataset = DatasetLoader("source.csv")`
loader.titles = dataset.names
loader.texts = dataset.texts