This repository was archived by the owner on Aug 25, 2024. It is now read-only.

source: Labeled and Versioned datasets #9

@johnandersen777

Assignee: @sudharsana-kjl

DFFML is hoping to participate in Google Summer of Code (GSoC) under the Python Software Foundation umbrella. You can read all about what this means at http://python-gsoc.org/. This issue, and any others tagged gsoc and project, are not generally available bugs; they are project ideas for GSoC.

Project Idea: Labeled and Versioned Datasets.

Project description:
DFFML's initial release includes sources, which abstract the format in which data is stored away from dataset generation and from usage in models.

Add label and version information so that users can keep different datasets, and different versions of each dataset, in the same source.

Skills: Python, git
Difficulty level: Intermediate

Related Readings/Links:

class Source(abc.ABC, Entrypoint):
    '''
    Abstract base class for all sources. New sources must be derived from this
    class and implement the repos method.
    '''

    ENTRY_POINT = 'dffml.source'

    def __init__(self, src: str) -> None:
        self.src = src

    @abc.abstractmethod
    async def update(self, repo: Repo):
        '''
        Updates a repo for a source
        '''

    @abc.abstractmethod
    async def repos(self) -> AsyncIterator[Repo]:
        '''
        Returns a list of repos retrieved from self.src
        '''
        # mypy ignores AsyncIterator[Repo], therefore this is needed
        yield Repo('') # pragma: no cover

    @abc.abstractmethod
    async def repo(self, src_url: str):
        '''
        Get a repo from the source or add it if it doesn't exist
        '''

dffml/dffml/repo.py

Lines 90 to 116 in dd8007d

class Repo(object):
    '''
    Manages feature independent information and actions for a repo.
    '''

    REPO_DATA = RepoData

    def __init__(self, src_url: str, *,
                 data: Optional[Dict[str, Any]] = None,
                 extra: Optional[Dict[str, Any]] = None) -> None:
        if data is None:
            data = {}
        if extra is None:
            extra = {}
        data['src_url'] = src_url
        if 'extra' in data:
            # Prefer extra from init arguments to extra stored in data
            data['extra'].update(extra)
            extra = data['extra']
            del data['extra']
        self.data = self.REPO_DATA(**data)
        self.extra = extra

    def dict(self):
        data = self.data.dict()
        data['extra'] = self.extra
        return data

Potential mentors: @pdxjohnny

Getting Started: Source.__init__ probably needs two more arguments, label and version, which should probably have defaults (say, default and v0). Since the same backend (e.g. a CSV file or a JSON file) would store all the data, you'll have to change the existing sources so that they understand labels and versions. For CSVSource that might mean adding another column to each repo; for JSONSource that might mean that, instead of one big array, the array of repos is stored like so:

{
    "default": {
        "v0": [
            "... all the repos ..."
        ]
    }
}
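The layout above could be read and written roughly as follows. This is only a minimal sketch, not DFFML code: the helper names (json_get_repos, json_set_repos, csv_get_repos) and the CSV column names label and version are assumptions made for illustration, and the defaults default and v0 come from the suggestion above.

```python
import csv
import io
import json
from typing import Any, Dict, List

# Defaults suggested in this issue; every helper name below is hypothetical.
DEFAULT_LABEL = "default"
DEFAULT_VERSION = "v0"


def json_get_repos(text: str, label: str = DEFAULT_LABEL,
                   version: str = DEFAULT_VERSION) -> List[Dict[str, Any]]:
    """Return the repos stored under label/version, or [] if there are none."""
    if not text.strip():
        return []
    return json.loads(text).get(label, {}).get(version, [])


def json_set_repos(text: str, repos: List[Dict[str, Any]],
                   label: str = DEFAULT_LABEL,
                   version: str = DEFAULT_VERSION) -> str:
    """Replace one label/version slice and leave every other slice untouched."""
    everything = json.loads(text) if text.strip() else {}
    everything.setdefault(label, {})[version] = repos
    return json.dumps(everything, indent=4, sort_keys=True)


def csv_get_repos(text: str, label: str = DEFAULT_LABEL,
                  version: str = DEFAULT_VERSION) -> List[Dict[str, str]]:
    """Keep only the rows whose (assumed) label and version columns match.

    Rows where those columns are missing or empty count as default/v0,
    so files written before this change would still load.
    """
    rows = []
    for row in csv.DictReader(io.StringIO(text)):
        row_label = row.pop("label", None) or DEFAULT_LABEL
        row_version = row.pop("version", None) or DEFAULT_VERSION
        if (row_label, row_version) == (label, version):
            rows.append(row)
    return rows
```

Calling json_set_repos("", repos) yields exactly the nested default/v0 structure shown above, and a later call with label="training", version="v1" adds a second slice next to it without touching the first. Treating unlabeled CSV rows as default/v0 is one possible way to stay backward compatible; a real implementation might choose differently.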

What we want to see in your application: Describe how you intend to solve the problem, and give us some "stretch goals". Maybe you'll implement a source using sqlite too, or something similar. Don't forget to include some time for building appropriate tests.
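To give a feel for the sqlite stretch goal, here is a rough sketch of how label and version could become part of a table's key. Everything in it is hypothetical: the class name SQLiteRepoStore, the table layout, and the synchronous methods are illustrative assumptions, whereas a real DFFML source would derive from Source and implement the async update, repos, and repo methods.

```python
import json
import sqlite3
from typing import Any, Dict, Iterator


class SQLiteRepoStore:
    """Hypothetical store keyed by (label, version, src_url); not DFFML code."""

    def __init__(self, db: sqlite3.Connection, label: str = "default",
                 version: str = "v0") -> None:
        self.db = db
        self.label = label
        self.version = version
        # One table holds every label and version; the composite primary
        # key keeps the same src_url separate across slices.
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS repos ("
            "label TEXT NOT NULL, version TEXT NOT NULL, "
            "src_url TEXT NOT NULL, data TEXT NOT NULL, "
            "PRIMARY KEY (label, version, src_url))"
        )

    def update(self, src_url: str, data: Dict[str, Any]) -> None:
        """Insert or overwrite one repo within this store's label/version."""
        with self.db:  # commits on success, rolls back on error
            self.db.execute(
                "INSERT OR REPLACE INTO repos VALUES (?, ?, ?, ?)",
                (self.label, self.version, src_url, json.dumps(data)),
            )

    def repos(self) -> Iterator[Dict[str, Any]]:
        """Yield only the repos that belong to this store's label/version."""
        cursor = self.db.execute(
            "SELECT data FROM repos WHERE label = ? AND version = ? "
            "ORDER BY src_url",
            (self.label, self.version),
        )
        for (data,) in cursor:
            yield json.loads(data)
```

Two stores created on the same connection with different label or version values see disjoint sets of repos, which is also the kind of behavior the tests for this project would need to cover.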

Metadata

Assignees

No one assigned

    Labels

    enhancement (New feature or request), gsoc (Google Summer of Code related), project (Issues which will take a while to complete)
