Sql example #248
Conversation
Hey Thomas, this looks good to me! Are you now going to add a script that does this end to end (downloads data, converts to SQL, runs PBG, processes the outputs back to SQL)? Maybe we call it `sql_end2end` or something?
Fixes to SQL Commands (7076f8d to fddb146)
Your SQL script takes a CSV file as input. I know that's just an example and not the "intended use", but for this setup, is it faster than our existing scripts? If so, should we replace those scripts with this code, rather than just using it as an example?
```python
def write_rels_dict(rels):
    my_rels = ""
    for _, row in rels.sort_values(by="graph_id").iterrows():
        r = "{"
        r += f"'name': '{row['id']}', 'lhs': '{row['source_type']}', 'rhs': '{row['destination_type']}', 'operator': 'translation'"
        r += "},\n"
        my_rels += r
    return my_rels


def write_entities_dict(entity2partitions):
    my_entities = "{\n"
    for name, part in entity2partitions.items():
        my_entities += '\t"{name}": {{"num_partitions": {part} }},\n'.format(name=name, part=part)
    my_entities += "}\n"
    return my_entities
```
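For reference, a quick check of what the entities helper above emits (the entity names and partition counts here are made-up example inputs):

```python
# Copy of write_entities_dict from the diff, exercised on a made-up input
# to show the string it generates.
def write_entities_dict(entity2partitions):
    my_entities = "{\n"
    for name, part in entity2partitions.items():
        my_entities += '\t"{name}": {{"num_partitions": {part} }},\n'.format(name=name, part=part)
    my_entities += "}\n"
    return my_entities

print(write_entities_dict({"user": 4, "item": 1}))
# Emits a dict literal as text, one '"name": {"num_partitions": N },' line per entity.
```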
The config is an actual Python function... having Python code that generates Python code seems overly complex, no? Either you can call `train` directly from within an end-to-end script, with the config, or perhaps you can write a config with `json.dump` or something?
Here's a schematic that might make this more clear:
```python
>>> cfg = dict(
...     entity_path="/foo/bar",
...     edge_paths=[
...         "path1",
...         "path2",
...     ],
...     relations={
...         "rel1": 3
...     }
... )
>>> import json
>>> print(json.dumps(cfg, indent=2))
{
  "entity_path": "/foo/bar",
  "edge_paths": [
    "path1",
    "path2"
  ],
  "relations": {
    "rel1": 3
  }
}
```
So you would do something like:

```python
cfg = copy.deepcopy(DEFAULT_CFG)
cfg["relations"] = relations
cfg["entities"] = entities
with open(cfg_file, "w") as f:
    f.write(f"def get_torchbiggraph_config():\n    return {json.dumps(cfg, indent=4)}")
```
This makes sense.
In a single test it was faster, but I'm not sure that I'd advocate for replacing the conversion script yet. I'd probably want to spend a little more time optimizing and evaluating before doing that.
Note, this PR is a work in progress. I'm putting it up now to get some guidance on code organization -- where should I put these functions, refactoring, etc?
Types of changes
Motivation and Context / Related issue
This PR provides a set of scripts to process data stored in SQL tables to get them into the right format for use with PBG.
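As a rough illustration of the SQL-to-PBG direction (not this PR's actual code; the table, column names, and edge values are made up), an edge table can be dumped into the tab-separated `lhs<TAB>rel<TAB>rhs` text format that PBG's edge-list importer consumes:

```python
import csv
import io
import sqlite3

# Made-up edge table in an in-memory SQLite database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE edges (source TEXT, rel TEXT, dest TEXT)")
conn.executemany(
    "INSERT INTO edges VALUES (?, ?, ?)",
    [("u1", "follows", "u2"), ("u2", "follows", "u3")],
)

# Stream the rows out as tab-separated lines, one edge per line.
buf = io.StringIO()
writer = csv.writer(buf, delimiter="\t", lineterminator="\n")
for row in conn.execute("SELECT source, rel, dest FROM edges"):
    writer.writerow(row)

print(buf.getvalue())
# u1	follows	u2
# u2	follows	u3
```

In practice the query would run against the real SQL tables and write to a file instead of an in-memory buffer.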
Checklist