added notebooks and scripts to reranking using graph features (catboo… #146

highly0 · 2024-03-11T15:05:41Z

I'm adding scripts to run the reranking pipeline using graph features (numerical, textual, and embedding features). There are also Jupyter notebook alternatives. Note: for now the scripts only train the model and do not rerank. To run the full pipeline as a script, convert the .ipynb files and run them from top to bottom; this does the training, the reranking, and the feature importance analysis.

jupyter nbconvert --to script catboost_features.ipynb

Remember to change the configuration before running the .ipynb files. Alternatively, you can train the models with the scripts, then load them into the .ipynb files and gather the reranking results there.

In addition to the reranking pipeline, I added graph_features_preparation.py, which prepares the dataframe with graph features and publishes it to HuggingFace (T5-large-ssm, T5-xl-ssm).

The added notebooks and scripts and their functionality are listed below:

graph_features_preparation.py: prepares the dataset with graph features and publishes it to HF
linear_regression.ipynb: notebook to train linear regression and rerank with the dataset above (can be run from top to bottom)
train_linear_regression.py: script to train linear regression using the dataset above
catboost_features.ipynb: notebook to train catboost and rerank with the dataset above (can be run from top to bottom)
train_catboost_regressor.py: script to train catboost using the dataset above
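
For readers unfamiliar with the reranking step that the notebooks perform after training, here is a minimal sketch of the general idea, not the code from this PR. It assumes the trained regressor yields one score per answer candidate and that reranking means sorting each question's candidates by that score. The field names (`question`, `candidate`, `predicted_score`, `correct`) and the Hits@1 helper are hypothetical and chosen only for illustration.

```python
from collections import defaultdict


def rerank(rows, score_key="predicted_score"):
    """Group candidate rows by question and sort each group by score, best first."""
    by_question = defaultdict(list)
    for row in rows:
        by_question[row["question"]].append(row)
    return {
        question: sorted(cands, key=lambda r: r[score_key], reverse=True)
        for question, cands in by_question.items()
    }


def hits_at_1(reranked):
    """Fraction of questions whose top-ranked candidate is labelled correct."""
    if not reranked:
        return 0.0
    hits = sum(1 for cands in reranked.values() if cands[0]["correct"])
    return hits / len(reranked)


# Toy data; in practice the scores would come from the trained regressor.
rows = [
    {"question": "q1", "candidate": "Q1", "predicted_score": 0.2, "correct": False},
    {"question": "q1", "candidate": "Q2", "predicted_score": 0.9, "correct": True},
    {"question": "q2", "candidate": "Q3", "predicted_score": 0.7, "correct": False},
    {"question": "q2", "candidate": "Q4", "predicted_score": 0.4, "correct": True},
]
reranked = rerank(rows)
print([r["candidate"] for r in reranked["q1"]])
print(hits_at_1(reranked))
```

With a model trained by one of the scripts, the `predicted_score` values would be replaced by that model's predictions on the graph-feature columns.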

DmiitriyJarosh · 2024-03-16T14:09:41Z

experiments/subgraphs_reranking/graph_features/graph_features_preparation.py

+    for _, row in tqdm(dataframe.iterrows()):
+        # convert from json dict to networkx graph
+        graph_obj = json_graph.node_link_graph(try_literal_eval(row["graph"]))
+        graph_node_names = get_node_names(graph_obj)


Is it used anywhere except graph_to_sequence? Maybe call it there?

DmiitriyJarosh · 2024-03-16T14:11:12Z

experiments/subgraphs_reranking/graph_features/graph_features_preparation.py

+def get_distance_ans_cand(graph, ans_cand_id):
+    """get avg distance from ans entity to answer candidate"""
+    graph = graph.to_undirected()  # for ssp both ways
+    ssp_dict = nx.shortest_path(graph, target=ans_cand_id)


Can our graph contain intermediate nodes, or only 1-hop paths?
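
To make the question concrete, here is a minimal standard-library sketch of how an average hop distance to the answer candidate behaves once a subgraph has an intermediate node. It is not the PR's code, which calls `nx.shortest_path` on the undirected graph. It uses a plain breadth-first search instead, which gives the same hop counts on an unweighted graph. The function name, the `source_ids` argument, and the toy node-link dicts are hypothetical, and the `links` key is assumed to match the stored graph JSON.

```python
from collections import defaultdict, deque


def avg_distance_to_candidate(node_link, ans_cand_id, source_ids):
    """Average hop count from each reachable source node to ans_cand_id.

    Edge direction is ignored, mirroring graph.to_undirected() in the diff.
    Returns None if no source node can reach the candidate.
    """
    adjacency = defaultdict(set)
    for edge in node_link["links"]:
        adjacency[edge["source"]].add(edge["target"])
        adjacency[edge["target"]].add(edge["source"])

    # Breadth-first search outward from the candidate gives hop distances.
    dist = {ans_cand_id: 0}
    queue = deque([ans_cand_id])
    while queue:
        current = queue.popleft()
        for neighbour in adjacency[current]:
            if neighbour not in dist:
                dist[neighbour] = dist[current] + 1
                queue.append(neighbour)

    reachable = [dist[s] for s in source_ids if s in dist]
    return sum(reachable) / len(reachable) if reachable else None


# Toy graphs: a direct edge A -> C, and a path A -> B -> C via intermediate B.
one_hop = {"nodes": [{"id": "A"}, {"id": "C"}],
           "links": [{"source": "A", "target": "C"}]}
two_hop = {"nodes": [{"id": "A"}, {"id": "B"}, {"id": "C"}],
           "links": [{"source": "A", "target": "B"},
                     {"source": "B", "target": "C"}]}

print(avg_distance_to_candidate(one_hop, "C", ["A"]))
print(avg_distance_to_candidate(two_hop, "C", ["A"]))
```

If every subgraph were strictly 1-hop, this value would be constant across candidates and the feature would carry no signal, so the answer to the question determines whether the feature is useful.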

DmiitriyJarosh · 2024-03-16T14:29:49Z

experiments/subgraphs_reranking/graph_features/train_linear_regression.py

+        "updated_graph_sequence_embedding",
+    ]
+    if args.sequence_type != "both":
+        DROP_EM_FEAT = (


Looks like you do the same thing as in the catboost script, but in a different way. It is better to use the same approach in both for readability.

highly0 requested review from mdsalnikov and DmiitriyJarosh March 11, 2024 15:05

highly0 self-assigned this Mar 15, 2024

DmiitriyJarosh approved these changes Mar 16, 2024

added graph features reranking for gap, g2t, determ seqs

0f484b1

highly0 force-pushed the features/graph_features_reranking branch from 5cd76d1 to 0f484b1 Compare March 18, 2024 11:24

Merge branch 'master' into features/graph_features_reranking

3c68989

highly0 merged commit ffb1d25 into master Mar 18, 2024
1 check passed
