added notebooks and scripts to reranking using graph features (catboo… #146

Merged: 2 commits merged into master from features/graph_features_reranking on Mar 18, 2024

Collaborator

@highly0 highly0 commented Mar 11, 2024

I'm adding scripts to run the reranking pipeline using graph features (numerical, textual, and embedding features). There are also Jupyter notebook alternatives. Note: for now, the scripts only train the model and do not rerank. To run everything (training, reranking, and feature importance), convert the .ipynb files to scripts and run them from top to bottom:

jupyter nbconvert --to script catboost_features.ipynb

Remember to change the configuration before running the .ipynb files. Alternatively, you can train the models with the scripts, then load them into the .ipynb file to gather the reranking results.

In addition to the reranking pipeline, I added graph_features_preparation.py, which prepares the dataframe with graph features and publishes it to HuggingFace (T5-large-ssm, T5-xl-ssm).

The added notebooks and scripts, and what each one does, are listed below:

  • graph_features_preparation.py: prepares the dataset with graph features and publishes it to HF
  • linear_regression.ipynb: notebook to train linear regression and rerank with the dataset above (can be run from top to bottom)
  • train_linear_regression.py: script to train linear regression using the dataset above
  • catboost_features.ipynb: notebook to train CatBoost and rerank with the dataset above (can be run from top to bottom)
  • train_catboost_regressor.py: script to train CatBoost using the dataset above
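
For readers unfamiliar with the train-then-rerank flow described above, here is a minimal, dependency-free sketch of the reranking step only. It is not the code from this PR: the row fields (`question`, `candidate`, `num_nodes`, `density`, `correct`), the helper names, and `toy_score` (a stand-in for a trained regressor's prediction) are all assumptions made for illustration.

```python
from collections import defaultdict


def rerank(rows, score_fn):
    """Group candidate rows by question and sort each group by score, best first."""
    by_question = defaultdict(list)
    for row in rows:
        by_question[row["question"]].append(row)
    return {
        question: sorted(candidates, key=score_fn, reverse=True)
        for question, candidates in by_question.items()
    }


def hits_at_1(reranked):
    """Fraction of questions whose top-ranked candidate is marked correct."""
    if not reranked:
        return 0.0
    hits = sum(1 for candidates in reranked.values() if candidates[0]["correct"])
    return hits / len(reranked)


# Toy data: field names and feature values are hypothetical, not from the PR.
rows = [
    {"question": "q1", "candidate": "A", "num_nodes": 3, "density": 0.9, "correct": True},
    {"question": "q1", "candidate": "B", "num_nodes": 5, "density": 0.4, "correct": False},
    {"question": "q2", "candidate": "C", "num_nodes": 4, "density": 0.2, "correct": False},
    {"question": "q2", "candidate": "D", "num_nodes": 2, "density": 0.7, "correct": True},
]


def toy_score(row):
    """Stand-in for a trained model's prediction on one row's graph features."""
    return 1.0 * row["density"] - 0.1 * row["num_nodes"]


reranked = rerank(rows, toy_score)
top_candidates = {q: cands[0]["candidate"] for q, cands in reranked.items()}
```

In the actual pipeline the score would presumably come from the trained linear regression or CatBoost model rather than a hand-written function; the grouping and sorting logic stays the same either way.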

@highly0 highly0 self-assigned this Mar 15, 2024
for _, row in tqdm(dataframe.iterrows()):
    # convert from json dict to networkx graph
    graph_obj = json_graph.node_link_graph(try_literal_eval(row["graph"]))
    graph_node_names = get_node_names(graph_obj)
Collaborator

is it used anywhere except graph_to_sequence? maybe call it there?

def get_distance_ans_cand(graph, ans_cand_id):
    """get avg distance from ans entity to answer candidate"""
    graph = graph.to_undirected()  # for ssp both ways
    ssp_dict = nx.shortest_path(graph, target=ans_cand_id)
Collaborator

can our graph contain intermediate nodes? or only 1-hop?
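
To make the question concrete, here is a small sketch of how an average shortest-path distance behaves when the graph does contain an intermediate node. It deliberately swaps `nx.shortest_path` for a plain breadth-first search so it runs without networkx; the function names, the `entities` argument, and the toy node IDs are assumptions for illustration, not the PR's implementation.

```python
from collections import deque


def bfs_distances(edges, source):
    """Hop counts from source to every reachable node, treating edges as undirected."""
    adjacency = {}
    for u, v in edges:
        adjacency.setdefault(u, set()).add(v)
        adjacency.setdefault(v, set()).add(u)
    distances = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbor in adjacency.get(node, ()):
            if neighbor not in distances:
                distances[neighbor] = distances[node] + 1
                queue.append(neighbor)
    return distances


def avg_distance_to_candidate(edges, entities, ans_cand_id):
    """Average hop count from each reachable entity to the answer candidate.

    Returns None when no entity is connected to the candidate.
    """
    distances = bfs_distances(edges, ans_cand_id)
    reachable = [distances[e] for e in entities if e in distances]
    return sum(reachable) / len(reachable) if reachable else None


# Toy graph (hypothetical IDs): Q1 reaches the candidate Q9 only through the
# intermediate node Q5, while Q2 is a direct neighbor of Q9.
edges = [("Q1", "Q5"), ("Q5", "Q9"), ("Q2", "Q9")]
avg = avg_distance_to_candidate(edges, ["Q1", "Q2"], "Q9")
```

With only 1-hop graphs every reachable entity would sit at distance 1 and the average would carry no information, so the feature is only meaningful if intermediate nodes can occur.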

    "updated_graph_sequence_embedding",
]
if args.sequence_type != "both":
    DROP_EM_FEAT = (
Collaborator

looks like you do the same thing as in catboost but in a different way -- it is better to use the same approach in both places for readability

@highly0 highly0 force-pushed the features/graph_features_reranking branch from 5cd76d1 to 0f484b1 March 18, 2024 11:24
@highly0 highly0 merged commit ffb1d25 into master Mar 18, 2024
1 check passed