source: Labeled and Versioned datasets #9
Description
Assignee: @sudharsana-kjl
DFFML is hoping to participate in Google Summer of Code (GSoC) under the Python Software Foundation umbrella. You can read all about what this means at http://python-gsoc.org/. This issue, and any others tagged `gsoc` and `project`, are not generally available bugs, but are related to project ideas for GSoC.
Project Idea: Labeled and Versioned Datasets.
Project description:
DFFML's initial release includes sources, which abstract the format in which the data is stored away from dataset generation and usage in models.
Add label and version information so that users can keep different datasets, and different versions of them, in the same source.
Skills: Python, git
Difficulty level: Intermediate
Related Readings/Links:
Lines 16 to 45 in dd8007d
class Source(abc.ABC, Entrypoint):
    '''
    Abstract base class for all sources. New sources must be derived from this
    class and implement the repos method.
    '''

    ENTRY_POINT = 'dffml.source'

    def __init__(self, src: str) -> None:
        self.src = src

    @abc.abstractmethod
    async def update(self, repo: Repo):
        '''
        Updates a repo for a source
        '''

    @abc.abstractmethod
    async def repos(self) -> AsyncIterator[Repo]:
        '''
        Returns a list of repos retrieved from self.src
        '''
        # mypy ignores AsyncIterator[Repo], therefore this is needed
        yield Repo('') # pragma: no cover

    @abc.abstractmethod
    async def repo(self, src_url: str):
        '''
        Get a repo from the source or add it if it doesn't exist
        '''
Lines 90 to 116 in dd8007d
class Repo(object):
    '''
    Manages feature independent information and actions for a repo.
    '''

    REPO_DATA = RepoData

    def __init__(self, src_url: str, *,
                 data: Optional[Dict[str, Any]] = None,
                 extra: Optional[Dict[str, Any]] = None) -> None:
        if data is None:
            data = {}
        if extra is None:
            extra = {}
        data['src_url'] = src_url
        if 'extra' in data:
            # Prefer extra from init arguments to extra stored in data
            data['extra'].update(extra)
            extra = data['extra']
            del data['extra']
        self.data = self.REPO_DATA(**data)
        self.extra = extra

    def dict(self):
        data = self.data.dict()
        data['extra'] = self.extra
        return data
Potential mentors: @pdxjohnny
Getting Started: `Source.__init__` probably needs another two arguments, `label` and `version`, which should probably have defaults (say, `default` and `v0`). Since the same backend (e.g., a CSV file or JSON file) would be used to store all the data, you'll have to change the existing sources so that they understand how to deal with this. For `CSVSource` that might mean adding another column to each repo; for `JSONSource` that might mean that, instead of one big array, the array of repos is stored like so:
{
"default": {
"v0": [
"... all the repos ..."
]
}
}
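To make the layout above more concrete, here is a minimal sketch of how a source might select and store repos by label and version. It is not DFFML code: the helper names (`select_json_repos`, `store_json_repos`, `select_csv_repos`) are hypothetical, and the CSV variant assumes two extra columns named `label` and `version`, which is only one possible reading of "adding another column". The defaults `default` and `v0` are the ones suggested above.

```python
import csv
import io
import json
from typing import Any, Dict, List

# Defaults suggested in this issue; the constant names are hypothetical.
DEFAULT_LABEL = "default"
DEFAULT_VERSION = "v0"


def select_json_repos(
    text: str, label: str = DEFAULT_LABEL, version: str = DEFAULT_VERSION
) -> List[Dict[str, Any]]:
    """Return the repos stored under label -> version in a JSON document."""
    document = json.loads(text)
    # An unknown label or version yields an empty dataset instead of an error.
    return document.get(label, {}).get(version, [])


def store_json_repos(
    text: str,
    repos: List[Dict[str, Any]],
    label: str = DEFAULT_LABEL,
    version: str = DEFAULT_VERSION,
) -> str:
    """Replace one label/version slot and leave every other slot untouched."""
    document = json.loads(text) if text else {}
    document.setdefault(label, {})[version] = repos
    return json.dumps(document, indent=2)


def select_csv_repos(
    text: str, label: str = DEFAULT_LABEL, version: str = DEFAULT_VERSION
) -> List[Dict[str, str]]:
    """Return the CSV rows whose label and version columns match.

    Files written before the columns existed have no such keys, so their
    rows are treated as belonging to the default label and version.
    """
    reader = csv.DictReader(io.StringIO(text))
    return [
        row
        for row in reader
        if row.get("label", DEFAULT_LABEL) == label
        and row.get("version", DEFAULT_VERSION) == version
    ]
```

Calling `store_json_repos("", [{"src_url": "a"}])` produces a document with the same shape as the JSON shown above, and `select_json_repos(doc, "train", "v1")` would read back a different dataset from the same file. Treating missing CSV columns as the defaults is a design choice made here so that existing files keep working; an application might reasonably choose something else.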
What we want to see in your application: Describe how you intend to solve the problem, and give us some "stretch goals"; maybe you'll implement a source using `sqlite` too, or something similar. Don't forget to include some time for building appropriate tests.