This repository has been archived by the owner on Mar 14, 2024. It is now read-only.

Sql example #248

Open · wants to merge 30 commits into main
Conversation

tmarkovich (Contributor)

Note: this PR is a work in progress. I'm putting it up now to get guidance on code organization: where should I put these functions, what should be refactored, and so on?

Types of changes

  • Docs change / refactoring / dependency upgrade
  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to change)

Motivation and Context / Related issue

This PR provides a set of scripts to process data stored in SQL tables to get them into the right format for use with PBG.
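Not code from this PR, but a minimal sketch of the SQL-to-PBG direction it describes; the `edges` table and its column names are hypothetical stand-ins. It dumps an edge table to the tab-separated edge-list format that PBG's importers can consume:

```python
import csv
import sqlite3


def export_edges_to_tsv(db_path, out_path):
    """Dump a hypothetical edges(source, relation, destination) table to a
    tab-separated file, a plain-text edge-list format PBG can import."""
    conn = sqlite3.connect(db_path)
    try:
        rows = conn.execute(
            "SELECT source, relation, destination FROM edges"
        )
        with open(out_path, "w", newline="") as f:
            # Explicit lineterminator so the output is plain LF-delimited TSV.
            writer = csv.writer(f, delimiter="\t", lineterminator="\n")
            writer.writerows(rows)
    finally:
        conn.close()
```

The real scripts in the PR operate on arbitrary SQL tables; sqlite3 is used here only so the sketch is self-contained.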

Checklist

  • The documentation is up-to-date with the changes I made.
  • I have read the CONTRIBUTING document and completed the CLA (see CONTRIBUTING).
  • All tests passed, and additional code has been covered with new tests.

@facebook-github-bot added the CLA Signed label (managed by the Facebook bot; authors need to sign the CLA before a PR can be reviewed) on Feb 2, 2022
@tmarkovich marked this pull request as draft on February 2, 2022 at 19:27
@adamlerer (Contributor) left a comment

Hey Thomas, this looks good to me! Are you now going to add a script that does this end to end (downloads data, converts to SQL, runs PBG, and processes the outputs back to SQL)?

Maybe we call it sql_end2end or something?

Review threads (outdated, resolved):
  • torchbiggraph/examples/e2e/data_prep.py (4 threads)
  • torchbiggraph/examples/e2e/sql_templates.py
@tmarkovich marked this pull request as ready for review on February 24, 2022 at 02:15
@adamlerer (Contributor) left a comment

Your SQL script takes a CSV file as input. I know that's just an example and not the "intended use", but for this setup, is it faster than our existing scripts? If so, should we replace those scripts with this code, rather than just using it as an example?

Review threads (outdated, resolved):
  • README.md (2 threads)
  • torchbiggraph/check.py
  • torchbiggraph/examples/sql_end2end/README.md
  • torchbiggraph/examples/sql_end2end/data_prep.py
Comment on lines 290 to 305
```python
def write_rels_dict(rels):
    my_rels = ""
    for _, row in rels.sort_values(by="graph_id").iterrows():
        r = "{"
        r += f"'name': '{row['id']}', 'lhs': '{row['source_type']}', 'rhs': '{row['destination_type']}', 'operator': 'translation'"
        r += "},\n"
        my_rels += r
    return my_rels


def write_entities_dict(entity2partitions):
    my_entities = "{\n"
    for name, part in entity2partitions.items():
        my_entities += '\t"{name}": {{"num_partitions": {part} }},\n'.format(name=name, part=part)
    my_entities += "}\n"
    return my_entities
```
A reviewer (Contributor) commented:

The config is an actual Python function... having Python code that generates Python code seems overly complex, no?

Either you can call train directly from within an end-to-end script, with the config, or perhaps you can write a config with json.dump or something?

Here's a schematic that might make this more clear:

```python
>>> cfg = dict(
...   entity_path="/foo/bar",
...   edge_paths=[
...     "path1",
...     "path2",
...   ],
...   relations={
...    "rel1": 3
...   }
... )
>>> import json
>>> print(json.dumps(cfg, indent=2))
{
  "entity_path": "/foo/bar",
  "edge_paths": [
    "path1",
    "path2"
  ],
  "relations": {
    "rel1": 3
  }
}
```

So you would do something like:

```python
cfg = copy.deepcopy(DEFAULT_CFG)
cfg["relations"] = relations  # cfg is a dict, so item access rather than attributes
cfg["entities"] = entities

with open(cfg_file, "w") as f:
    f.write(f"def get_torchbiggraph_config():\n    return {json.dumps(cfg, indent=4)}")
```
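A self-contained, runnable elaboration of this suggestion, with a round-trip check; the config values and file location are illustrative placeholders, not from the PR, and the real `DEFAULT_CFG` would supply PBG's actual defaults:

```python
import json
import os
import runpy
import tempfile

# Hypothetical config values standing in for DEFAULT_CFG plus overrides.
cfg = {
    "entity_path": "data/entities",
    "edge_paths": ["data/edges"],
    "relations": [
        {"name": "rel1", "lhs": "user", "rhs": "item", "operator": "translation"}
    ],
}

cfg_file = os.path.join(tempfile.mkdtemp(), "generated_config.py")
with open(cfg_file, "w") as f:
    # The JSON dump is wrapped in braces, so Python's implicit line joining
    # makes the multi-line literal a valid return expression.
    f.write("def get_torchbiggraph_config():\n    return "
            + json.dumps(cfg, indent=4) + "\n")

# PBG imports the config module and calls this function; round-trip to verify.
loaded = runpy.run_path(cfg_file)["get_torchbiggraph_config"]()
assert loaded == cfg
```

One caveat: `json.dumps` emits `true`, `false`, and `null`, which are not Python literals, so this trick only works for configs without booleans or `None` (or with a `repr`-based dump instead).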

tmarkovich (Contributor, Author) replied:

This makes sense.

Review threads (outdated, resolved):
  • torchbiggraph/examples/sql_end2end/data_prep.py (4 threads)
@tmarkovich tmarkovich requested a review from adamlerer April 1, 2022 19:50
@tmarkovich (Contributor, Author) commented on Apr 1, 2022:

In a single test it was faster, but I'm not sure I'd advocate for replacing the conversion script yet. I'd want to spend a little more time optimizing and evaluating before doing that.
