Which scientific concepts, that have never been investigated jointly, will lead to the most impactful research?
📖 Read our paper here:
Forecasting high-impact research topics via machine learning on evolving knowledge graphs
Xuemei Gu, Mario Krenn
Note
The full dynamic knowledge graph and datasets can be downloaded at 10.5281/zenodo.10692137
The dataset for the benchmark can be downloaded at 10.5281/zenodo.14527306
create_concept
│
├── Concept_Corpus
│   ├── s0_get_preprint_metadata.ipynb: Get metadata from chemRxiv, medRxiv, bioRxiv (arXiv data from Kaggle)
│   ├── s1_make_metadate_arxivstyle.ipynb: Preprocess metadata from different sources
│   ├── s2_combine_all_preprint_metadate.ipynb: Combine metadata
│   ├── s3_get_concepts.ipynb: Use NLP techniques (for instance RAKE) to extract concepts
│   └── s4_improve_concept.ipynb: Further improve the full concept list
│
└── Domain_Concept
    ├── s0_prepare_optics_quantum_data.ipynb: Get papers for a specific domain (optics and quantum physics in our case)
    ├── s1_split_domain_papers.py: Prepare data for parallelization
    ├── s2_get_domain_concepts.py: Get domain-specific vertices in the full concept list
    ├── s3_merge_concepts.py: Post-process domain-specific concepts
    ├── s4_improve_concepts.ipynb: Further improve concept lists
    ├── s5_improve_manually_concepts.py: Manually inspect the concepts at the very end for grammar, non-conceptual phrases, verbs, ordinal numbers, conjunctions, adverbials and so on, to improve quality
    └── full_domain_concepts.txt: Final list of 37,960 concepts (the vertices of the knowledge graph)
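The concept-extraction step mentions RAKE. As a rough illustration of the idea only (not the notebook's actual code), the sketch below implements a simplified RAKE-style scorer in plain Python. The tiny `STOPWORDS` set, the function names, and the example sentence are all assumptions made for this sketch; a real pipeline would use a full stopword list and further filtering.

```python
import re
from collections import defaultdict

# Tiny illustrative stopword list (an assumption; real RAKE uses a much larger one).
STOPWORDS = {"the", "of", "in", "a", "an", "and", "for", "with", "is", "are", "on", "to", "by"}


def candidate_phrases(text):
    """Split text into candidate phrases at punctuation and stopwords."""
    phrases = []
    for fragment in re.split(r"[.,;:!?()\n]", text.lower()):
        current = []
        for word in re.findall(r"[a-z0-9\-]+", fragment):
            if word in STOPWORDS:
                if current:
                    phrases.append(tuple(current))
                current = []
            else:
                current.append(word)
        if current:
            phrases.append(tuple(current))
    return phrases


def rake_scores(text):
    """Score each phrase as the sum of degree/frequency over its words."""
    phrases = candidate_phrases(text)
    freq = defaultdict(int)
    degree = defaultdict(int)
    for phrase in phrases:
        for word in phrase:
            freq[word] += 1
            degree[word] += len(phrase)  # co-occurrence degree, including the word itself
    scores = {" ".join(p): sum(degree[w] / freq[w] for w in p) for p in phrases}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


ranked = rake_scores("The quantum entanglement of photons in a nonlinear crystal.")
for phrase, score in ranked:
    print(f"{score:.1f}  {phrase}")
```

Multi-word phrases score higher than isolated words, which is why RAKE-style extraction tends to surface compound scientific terms as concept candidates.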
create_dynamic_edges
├── _get_openalex_workdata.py: Get metadata from OpenAlex
├── _get_openalex_workdata_parallel_run1.py: Get parts of the metadata from OpenAlex (run in many parts)
├── get_concept_pairs.py: Create edges of the knowledge graph (edges carry the time and citation information)
├── merge_concept_pairs.py: Combine edge files
└── process_edge_to_pandas_frame.py: Post-processing, store the full dynamic knowledge graph
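To make the notion of edges that "carry the time and citation information" concrete, here is a minimal sketch under stated assumptions: each paper is a dictionary with the hypothetical keys `concepts`, `year`, and `citations`, and every pair of concepts that co-occurs in a paper yields one edge record. The actual scripts work on OpenAlex metadata and may store different fields.

```python
from collections import defaultdict
from itertools import combinations


def build_dynamic_edges(papers):
    """Map each co-occurring concept pair to a list of (year, citations) records.

    `papers` is an iterable of dicts with hypothetical keys
    'concepts', 'year' and 'citations' (assumed layout for this sketch).
    """
    edges = defaultdict(list)
    for paper in papers:
        # Sort so that every pair is stored under one canonical key (u < v).
        concepts = sorted(set(paper["concepts"]))
        for u, v in combinations(concepts, 2):
            edges[(u, v)].append((paper["year"], paper["citations"]))
    return dict(edges)


papers = [
    {"concepts": ["entanglement", "photon", "crystal"], "year": 2018, "citations": 12},
    {"concepts": ["photon", "entanglement"], "year": 2020, "citations": 3},
]
edges = build_dynamic_edges(papers)
print(edges[("entanglement", "photon")])
```

Keeping one record per paper, rather than a single aggregated weight, is what lets the graph be sliced at any cutoff year later on.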
.
├── prepare_unconnected_pair_solution.ipynb: Find unconnected concept pairs (for training, testing and evaluating)
├── prepare_adjacency_pagerank.py: Prepare dynamic knowledge graph and compute properties
├── prepare_node_pair_citation_data_years.ipynb: Prepare citation data for both individual concept nodes and concept pairs for specific years
│
├── create_dynamic_concepts
│   ├── get_concept_citation.py: Create dynamic concepts from the knowledge graph (concepts carry the time and citation information)
│   ├── merge_concept_citation.py: Combining dynamic concepts files
│   ├── process_concept_to_pandas_frame.py: Post-processing, store the full dynamic concepts
│   ├── merge_concept_pairs.py: Combining dynamic concepts
│   └── process_edge_to_pandas_frame.py: Post-processing, store the full dynamic concepts
│
└── prepare_eval_data
    ├── prepare_eval_feature_data.py: Prepare features of knowledge graph (for evaluation dataset)
    └── prepare_eval_feature_data_condition.py: Prepare features of knowledge graph (for evaluation dataset, conditioned on existence in the future)
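The idea of finding unconnected concept pairs can be sketched as follows. This is an illustration only: the function name, the `edge_years` layout (pair mapped to the first year it co-occurs), and the `horizon` parameter are assumptions, and the label here is simply whether the pair becomes connected within the horizon. The impact-based targets used for training are not described in this README and are omitted.

```python
from itertools import combinations


def unconnected_pairs(vertices, edge_years, cutoff_year, horizon=3):
    """Return ((u, v), label) for pairs not yet connected at `cutoff_year`.

    `edge_years` maps a sorted pair (u, v) to the first year the two
    concepts co-occur (hypothetical layout for this sketch). The label is 1
    if the pair becomes connected by cutoff_year + horizon, else 0.
    """
    samples = []
    for u, v in combinations(sorted(vertices), 2):
        first = edge_years.get((u, v))
        if first is not None and first <= cutoff_year:
            continue  # already connected at the cutoff, so not a candidate
        label = int(first is not None and first <= cutoff_year + horizon)
        samples.append(((u, v), label))
    return samples


edge_years = {("a", "b"): 2015, ("a", "c"): 2021}
samples = unconnected_pairs(["a", "b", "c"], edge_years, cutoff_year=2019)
print(samples)
```

Enumerating all pairs is quadratic in the number of vertices, so for a graph with tens of thousands of concepts one would sample pairs rather than list them exhaustively.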
.
├── train_model_2019_run.py: Training neural network from 2016 -> 2019 (evaluated from 2019 -> 2022)
├── train_model_2019_condition.py: Training neural network from 2016 -> 2019 (evaluated from 2019 -> 2022, conditioned on existence in the future)
├── train_model_2019_individual_feature.py: Training neural network from 2016 -> 2019 (evaluated from 2019 -> 2022) on individual features
└── train_model_2022_run.py: Training 2019 -> 2022 (for real future predictions of 2025)
Feature descriptions for an unconnected pair of concepts (u, v)
Feature Type | Feature Index | Feature Description |
---|---|---|
node feature | 0-5 | the number of neighbors of vertices u and v |
node feature | 6-7 | the number of new neighbors of u and v since 1 year prior |
node feature | 8-9 | the number of new neighbors of u and v since 2 years prior |
node feature | 10-11 | the rank of the number of new neighbors of u and v since 1 year prior |
node feature | 12-13 | the rank of the number of new neighbors of u and v since 2 years prior |
node feature | 14-19 | the PageRank score of vertices u and v |
node citation feature | 20-25 | yearly citations for vertices u and v |
node citation feature | 26-31 | total citations for vertices u and v |
node citation feature | 32-37 | total citations for vertices u and v |
node citation feature | 38-43 | the number of papers mentioning either concept (u or v) |
node citation feature | 44-49 | the average yearly citations for vertices u and v, calculated as the total citations received during the year divided by the number of papers mentioning the vertices from their first publications up to the respective year |
node citation feature | 50-55 | the average total citations for vertices u and v, calculated as the cumulative citations divided by the number of papers that have mentioned these vertices since their first publications |
node citation feature | 56-61 | the average total citations for vertices u and v, calculated as the cumulative three-year period citations divided by the number of papers that have mentioned these vertices since their first publications |
node citation feature | 62-63 | the number of new citations for vertices u and v |
node citation feature | 64-65 | the number of new citations for vertices u and v |
node citation feature | 66-67 | the rank of the number of new citations for vertices u and v |
node citation feature | 68-69 | the rank of the number of new citations for vertices u and v |
node citation feature | 70-71 | the number of new papers mentioning vertices u and v |
node citation feature | 72-73 | the number of new papers mentioning vertices u and v |
node citation feature | 74-75 | the rank of the number of new papers mentioning vertices u and v |
node citation feature | 76-77 | the rank of the number of new papers mentioning vertices u and v |
pair feature | 78-80 | the number of shared neighbors between vertices u and v |
pair feature | 81-83 | the geometric coefficient for the pair (u, v), calculated by `number_shared_neighbor**2 / (deg_u * deg_v)`, where `deg_u` is the degree of vertex u |
pair feature | 84-86 | the cosine coefficient for the pair (u, v), calculated by `geometric_index**0.5` |
pair feature | 87-89 | the Simpson coefficient for the pair (u, v), calculated by `number_shared_neighbor / np.min([deg_u, deg_v])` |
pair feature | 90-92 | the preferential attachment coefficient for the pair (u, v), calculated by `deg_u * deg_v` |
pair feature | 93-95 | the Sørensen–Dice coefficient for the pair (u, v), calculated by `2 * num_shared_neighbor / (deg_u + deg_v)` |
pair feature | 96-98 | the Jaccard coefficient for the pair (u, v), calculated by `num_shared_neighbor / (deg_u + deg_v - num_shared_neighbor)` |
pair citation feature | 99-101 | the ratio of the sum of citations received by concepts u and v |
pair citation feature | 102-104 | the ratio of the product of citations received by concepts u and v |
pair citation feature | 105-107 | the sum of the average citations received by concepts u and v |
pair citation feature | 108-110 | the sum of the average total citations received by concepts u and v |
pair citation feature | 111-113 | the sum of the citations received by concepts u and v |
pair citation feature | 114-116 | the sum of the average citations received by concepts u and v |
pair citation feature | 117-119 | the minimum number of citations received by either concept |
pair citation feature | 120-122 | the maximum number of citations received by either concept |
pair citation feature | 123-125 | the minimum number of total citations received by either concept |
pair citation feature | 126-128 | the maximum number of total citations received by either concept |
pair citation feature | 129-131 | the minimum number of total citations received by either concept |
pair citation feature | 132-134 | the maximum number of total citations received by either concept |
pair citation feature | 135-137 | the minimum number of papers mentioning either concept |
pair citation feature | 138-140 | the maximum number of papers mentioning either concept |
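The pair features (indices 78-98) can be illustrated for a single snapshot of the graph with the short sketch below. It uses the standard definitions of these coefficients, with the degrees of both vertices in the Simpson and preferential attachment terms. The function name, the neighbor-set inputs, and the choice to return zeros when a vertex has no neighbors are assumptions of this sketch, not taken from the repository.

```python
import math


def pair_features(neighbors_u, neighbors_v):
    """Compute neighborhood-overlap coefficients for an unconnected pair (u, v).

    `neighbors_u` and `neighbors_v` are sets of neighbor identifiers
    (hypothetical input layout for this sketch).
    """
    deg_u, deg_v = len(neighbors_u), len(neighbors_v)
    shared = len(neighbors_u & neighbors_v)
    names = ["shared", "geometric", "cosine", "simpson",
             "preferential_attachment", "sorensen_dice", "jaccard"]
    if deg_u == 0 or deg_v == 0:
        # Assumption: isolated vertices get all-zero features to avoid division by zero.
        return dict.fromkeys(names, 0.0)
    geometric = shared**2 / (deg_u * deg_v)
    return {
        "shared": float(shared),
        "geometric": geometric,
        "cosine": math.sqrt(geometric),
        "simpson": shared / min(deg_u, deg_v),
        "preferential_attachment": float(deg_u * deg_v),
        "sorensen_dice": 2 * shared / (deg_u + deg_v),
        "jaccard": shared / (deg_u + deg_v - shared),
    }


features = pair_features({"a", "b", "c", "d"}, {"c", "d", "e"})
for name, value in features.items():
    print(f"{name}: {value:.4f}")
```

Each index range in the table covers three values, so a function like this would be called once per snapshot of the graph to fill them.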
Download the data from 10.5281/zenodo.14527306 and unzip the file in the benchmark_code folder.
benchmark_code
├── loops_fcNN.py: fully connected neural network model
├── loops_transformer.py: transformer model
├── loops_tree.py: random forest model
├── loops_xgboost.py: XGBoost model
└── other python files: Post-processing, produce Figures 6-8 from the evaluation of the different models
The fpr_example folder contains three examples for producing Figure 11. They use about 10M evaluation samples (2019-2022) with raw outputs from a neural network trained on 2016-2019 data (accessible at 10.5281/zenodo.14527306).