Forecasting high-impact research topics via machine learning on evolving knowledge graphs

artificial-scientist-lab/Impact4Cast

Impact4Cast

License: MIT arXiv

Which scientific concepts, that have never been investigated jointly, will lead to the most impactful research?

📖 Read our paper here:
Forecasting high-impact research topics via machine learning on evolving knowledge graphs
Xuemei Gu, Mario Krenn

workflow

Note

Full Dynamic Knowledge Graph and Datasets can be downloaded at 10.5281/zenodo.10692137
Dataset for Benchmark can be downloaded at 10.5281/zenodo.14527306

create_concept
│ 
├── Concept_Corpus
│   ├── s0_get_preprint_metadata.ipynb: Get metadata from chemRxiv, medRxiv, bioRxiv (arXiv data from Kaggle)
│   ├── s1_make_metadate_arxivstyle.ipynb: Preprocessing metadata from different sources
│   ├── s2_combine_all_preprint_metadate.ipynb: Combining metadata
│   ├── s3_get_concepts.ipynb: Use NLP techniques (for instance RAKE) to extract concepts
│   └── s4_improve_concept.ipynb: Further improvements of full concept list
│   
└── Domain_Concept
    ├── s0_prepare_optics_quantum_data.ipynb: Get papers for specific domain (optics and quantum physics in our case).
    ├── s1_split_domain_papers.py: Prepare data for parallelization.
    ├── s2_get_domain_concepts.py: Get domain-specific vertices in full concept list.
    ├── s3_merge_concepts.py: Postprocessing domain-specific concepts
    ├── s4_improve_concepts.ipynb: Further improve concept lists
    ├── s5_improve_manually_concepts.py: Manually inspect the concepts in the very end for grammar, non-conceptual phrases, verbs, ordinal numbers, conjunctions, adverbials and so on, to improve quality
    └── full_domain_concepts.txt: Final list of 37,960 concepts (represent vertices of knowledge graph)
create_dynamic_edges
├── _get_openalex_workdata.py: Get metadata from OpenAlex
├── _get_openalex_workdata_parallel_run1.py: Get parts of the metadata from OpenAlex (run in many parts)
├── get_concept_pairs.py: Create edges of the knowledge graph (edges carry the time and citation information).
├── merge_concept_pairs.py: Combining edges files
└── process_edge_to_pandas_frame.py: Post-processing, store the full dynamic knowledge graph
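
As a rough illustration of what "edges carry the time and citation information" can mean, the sketch below turns a few toy papers into timestamped, citation-annotated concept pairs. It is not the code in `get_concept_pairs.py`: the function name `concept_pair_edges`, the paper fields (`text`, `year`, `citations`) and the plain substring matching are all assumptions made for this example.

```python
from itertools import combinations


def concept_pair_edges(papers, concepts):
    """Return one (u, v, year, citations) edge per concept pair co-occurring in a paper.

    Hypothetical helper: concepts are matched by plain substring search,
    which is an assumption and not necessarily what the repository does.
    """
    edges = []
    for paper in papers:
        text = paper["text"].lower()
        # Indices of all concepts that appear in this paper's text.
        found = sorted(i for i, concept in enumerate(concepts) if concept in text)
        # Every pair of co-occurring concepts becomes one edge of the graph.
        for u, v in combinations(found, 2):
            edges.append((u, v, paper["year"], paper["citations"]))
    return edges


# Toy data, invented for illustration only.
concepts = ["quantum memory", "frequency comb", "squeezed light"]
papers = [
    {"text": "A quantum memory for squeezed light.", "year": 2015, "citations": 12},
    {"text": "Frequency comb generation on a chip.", "year": 2017, "citations": 40},
    {"text": "Squeezed light from a frequency comb, stored in a quantum memory.",
     "year": 2020, "citations": 3},
]
edges = concept_pair_edges(papers, concepts)
```

Keeping the year and citation count on every edge makes it possible to rebuild the graph as it looked at any past year, which is what the later training and evaluation steps rely on.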

workflow

.
├── prepare_unconnected_pair_solution.ipynb: Find unconnected concept pairs (for training, testing and evaluating)
├── prepare_adjacency_pagerank.py: Prepare dynamic knowledge graph and compute properties
├── prepare_node_pair_citation_data_years.ipynb: Prepare citation data for both individual concept nodes and concept pairs for specific years
│
├──create_dynamic_concepts
│  ├── get_concept_citation.py: Create dynamic concepts from the knowledge graph (concepts carry the time and citation information). 
│  ├── merge_concept_citation.py: Combining dynamic concepts files
│  ├── process_concept_to_pandas_frame.py: Post-processing, store the full dynamic concepts
│  ├── merge_concept_pairs.py: Combining dynamic concepts
│  └── process_edge_to_pandas_frame.py: Post-processing, store the full dynamic concepts
│
└──prepare_eval_data
   ├── prepare_eval_feature_data.py: Prepare features of knowledge graph (for evaluation dataset)
   └── prepare_eval_feature_data_condition.py: Prepare features of knowledge graph (for evaluation dataset, conditioned on existence in the future)
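
A minimal sketch of the "find unconnected concept pairs" step, assuming the dynamic graph is available as `(u, v, year)` edges. The helper names `snapshot_neighbors` and `sample_unconnected_pairs`, the rejection-sampling strategy and the toy data are assumptions for illustration; the notebook may proceed differently.

```python
import random


def snapshot_neighbors(edges, num_nodes, year):
    """Neighbor sets of the graph built from all edges dated up to `year`."""
    neighbors = [set() for _ in range(num_nodes)]
    for u, v, edge_year in edges:
        if edge_year <= year and u != v:
            neighbors[u].add(v)
            neighbors[v].add(u)
    return neighbors


def sample_unconnected_pairs(neighbors, num_pairs, seed=0):
    """Draw distinct vertex pairs (u < v) that share no edge in the snapshot."""
    rng = random.Random(seed)
    n = len(neighbors)
    num_edges = sum(len(s) for s in neighbors) // 2
    # Never ask for more pairs than exist, so the loop below always terminates.
    num_pairs = min(num_pairs, n * (n - 1) // 2 - num_edges)
    pairs = set()
    while len(pairs) < num_pairs:
        u, v = rng.sample(range(n), 2)
        if v not in neighbors[u]:
            pairs.add((min(u, v), max(u, v)))
    return sorted(pairs)


# Toy dynamic graph with 4 vertices; the last edge appears only in 2021.
toy_edges = [(0, 1, 2014), (1, 2, 2016), (2, 3, 2018), (0, 3, 2021)]
neighbors_2016 = snapshot_neighbors(toy_edges, num_nodes=4, year=2016)
unconnected_2016 = sample_unconnected_pairs(neighbors_2016, num_pairs=10)
```

Rejection sampling is a reasonable choice when the graph is sparse, because almost every randomly drawn pair is unconnected.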

workflow

.
├── train_model_2019_run.py: Training neural network from 2016 -> 2019 (evaluated from 2019 -> 2022).
├── train_model_2019_condition.py: Training neural network from 2016 -> 2019 (evaluated from 2019 -> 2022, conditioned on existence in the future)
├── train_model_2019_individual_feature.py: Training neural network from 2016 -> 2019 (evaluated from 2019 -> 2022) on individual features
└── train_model_2022_run.py: Training 2019 -> 2022 (for real future predictions of 2025)
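
The three-year windows used by these scripts can be sketched as follows. The windows themselves come from the file descriptions above; everything else is a placeholder. In particular, the labeling rule (a pair is positive if it becomes connected inside the window and the connecting papers reach `min_citations`) and the threshold value are assumptions for illustration, and the training scripts define the actual target.

```python
# Windows taken from the script descriptions: (feature year, label year).
WINDOWS = {
    "train": (2016, 2019),
    "eval": (2019, 2022),
    "predict": (2022, 2025),
}


def label_pairs(pairs, edges, start_year, end_year, min_citations):
    """Return 1 for pairs connected in (start_year, end_year] with enough citations.

    Hypothetical labeling rule; `min_citations` is a placeholder threshold.
    """
    citations = {}
    for u, v, year, cites in edges:
        if start_year < year <= end_year:
            key = (min(u, v), max(u, v))
            citations[key] = citations.get(key, 0) + cites
    labels = []
    for u, v in pairs:
        key = (min(u, v), max(u, v))
        labels.append(int(key in citations and citations[key] >= min_citations))
    return labels


# Toy edges (u, v, year, citations), invented for illustration.
toy_edges = [(0, 1, 2017, 30), (0, 1, 2018, 5), (2, 3, 2018, 2), (1, 2, 2021, 50)]
toy_pairs = [(0, 1), (2, 3), (1, 2)]
train_labels = label_pairs(toy_pairs, toy_edges, *WINDOWS["train"], min_citations=10)
eval_labels = label_pairs(toy_pairs, toy_edges, *WINDOWS["eval"], min_citations=10)
```

Because features are computed at the start of a window and labels at its end, a model trained on 2016 -> 2019 can be evaluated on 2019 -> 2022 without seeing any information from the evaluation period.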
Feature descriptions for an unconnected pair of concepts (u, v)

| Feature Type | Feature Index | Feature Description |
| --- | --- | --- |
| node feature | 0-5 | the number of neighbors of vertices $u$ and $v$ in years $y$, $y-1$, $y-2$; denoted as: $N_{u,y}$, $N_{v,y}$, $N_{u,y-1}$, $N_{v,y-1}$, $N_{u,y-2}$, $N_{v,y-2}$ |
| node feature | 6-7 | the number of new neighbors since 1 year prior to $y$ for vertices $u$ and $v$; denoted as: $N_{u,y}^{\Delta}$, $N_{v,y}^{\Delta}$ |
| node feature | 8-9 | the number of new neighbors since 2 years prior to $y$ for vertices $u$ and $v$; denoted as: $N_{u,y}^{\Delta 2}$, $N_{v,y}^{\Delta 2}$ |
| node feature | 10-11 | the rank of the number of new neighbors since 1 year prior to $y$ for vertices $u$ and $v$; denoted as: $rN_{u,y}^{\Delta}$, $rN_{v,y}^{\Delta}$ |
| node feature | 12-13 | the rank of the number of new neighbors since 2 years prior to $y$ for vertices $u$ and $v$; denoted as: $rN_{u,y}^{\Delta 2}$, $rN_{v,y}^{\Delta 2}$ |
| node feature | 14-19 | the PageRank score for vertices $u$ and $v$ in years $y$, $y-1$, $y-2$; denoted as: $PR_{u,y}$, $PR_{v,y}$, $PR_{u,y-1}$, $PR_{v,y-1}$, $PR_{u,y-2}$, $PR_{v,y-2}$ |
| node citation feature | 20-25 | the yearly citations for vertices $u$ and $v$ during the years $y$, $y-1$, $y-2$; denoted as: $c_{u,y}$, $c_{v,y}$, $c_{u,y-1}$, $c_{v,y-1}$, $c_{u,y-2}$, $c_{v,y-2}$ |
| node citation feature | 26-31 | the total citations for vertices $u$ and $v$ from their first publications up to the years $y$, $y-1$, $y-2$; denoted as: $ct_{u,y}$, $ct_{v,y}$, $ct_{u,y-1}$, $ct_{v,y-1}$, $ct_{u,y-2}$, $ct_{v,y-2}$ |
| node citation feature | 32-37 | the total citations for vertices $u$ and $v$ in the three-year period ending with the year $y$, $y-1$, $y-2$; denoted as: $ct^{\Delta 3}_{u,y}$, $ct^{\Delta 3}_{v,y}$, $ct^{\Delta 3}_{u,y-1}$, $ct^{\Delta 3}_{v,y-1}$, $ct^{\Delta 3}_{u,y-2}$, $ct^{\Delta 3}_{v,y-2}$ |
| node citation feature | 38-43 | the number of papers mentioning either concept ($u$ or $v$) until the year $y$, $y-1$, $y-2$; denoted as: $pn_{u,y}$, $pn_{v,y}$, $pn_{u,y-1}$, $pn_{v,y-1}$, $pn_{u,y-2}$, $pn_{v,y-2}$ |
| node citation feature | 44-49 | the average yearly citations for vertices $u$ and $v$ during the years $y$, $y-1$, $y-2$; calculated as the total citations received during the year divided by the number of papers mentioning the vertices from their first publications up to the respective year; denoted as: $cm_{u,y}$, $cm_{v,y}$, $cm_{u,y-1}$, $cm_{v,y-1}$, $cm_{u,y-2}$, $cm_{v,y-2}$; as an example: $cm_{u,y}$ is $\frac{c_{u,y}}{pn_{u,y}}$ |
| node citation feature | 50-55 | the average total citations for vertices $u$ and $v$ from their first publications up to the years $y$, $y-1$, $y-2$; calculated by dividing the cumulative citations by the count of papers that mentioned these vertices since their first publications; denoted as: $ctm_{u,y}$, $ctm_{v,y}$, $ctm_{u,y-1}$, $ctm_{v,y-1}$, $ctm_{u,y-2}$, $ctm_{v,y-2}$; as an example: $ctm_{u,y}$ is $\frac{ct_{u,y}}{pn_{u,y}}$ |
| node citation feature | 56-61 | the average total citations for vertices $u$ and $v$ in the three-year period ending with the year $y$, $y-1$, $y-2$; calculated by dividing the cumulative three-year period citations by the count of papers that mentioned these vertices since their first publications; denoted as: $ctm^{\Delta 3}_{u,y}$, $ctm^{\Delta 3}_{v,y}$, $ctm^{\Delta 3}_{u,y-1}$, $ctm^{\Delta 3}_{v,y-1}$, $ctm^{\Delta 3}_{u,y-2}$, $ctm^{\Delta 3}_{v,y-2}$; as an example: $ctm^{\Delta 3}_{u,y}$ is $\frac{ct^{\Delta 3}_{u,y}}{pn_{u,y}}$ |
| node citation feature | 62-63 | the number of new citations for vertices $u$ and $v$ since 1 year prior to $y$; denoted as: $cnew^{\Delta 1}_{u,y}$, $cnew^{\Delta 1}_{v,y}$ |
| node citation feature | 64-65 | the number of new citations for vertices $u$ and $v$ since 2 years prior to $y$; denoted as: $cnew^{\Delta 2}_{u,y}$, $cnew^{\Delta 2}_{v,y}$ |
| node citation feature | 66-67 | the rank of the number of new citations for vertices $u$ and $v$ since 1 year prior to $y$; denoted as: $rcnew^{\Delta 1}_{u,y}$, $rcnew^{\Delta 1}_{v,y}$ |
| node citation feature | 68-69 | the rank of the number of new citations for vertices $u$ and $v$ since 2 years prior to $y$; denoted as: $rcnew^{\Delta 2}_{u,y}$, $rcnew^{\Delta 2}_{v,y}$ |
| node citation feature | 70-71 | the number of new papers mentioning vertices $u$ and $v$ since 1 year prior to $y$; denoted as: $pnew^{\Delta 1}_{u,y}$, $pnew^{\Delta 1}_{v,y}$ |
| node citation feature | 72-73 | the number of new papers mentioning vertices $u$ and $v$ since 2 years prior to $y$; denoted as: $pnew^{\Delta 2}_{u,y}$, $pnew^{\Delta 2}_{v,y}$ |
| node citation feature | 74-75 | the rank of the number of new papers mentioning vertices $u$ and $v$ since 1 year prior to $y$; denoted as: $rpnew^{\Delta 1}_{u,y}$, $rpnew^{\Delta 1}_{v,y}$ |
| node citation feature | 76-77 | the rank of the number of new papers mentioning vertices $u$ and $v$ since 2 years prior to $y$; denoted as: $rpnew^{\Delta 2}_{u,y}$, $rpnew^{\Delta 2}_{v,y}$ |
| pair feature | 78-80 | the number of shared neighbors between vertices $u$ and $v$ for the years $y$, $y-1$, $y-2$; denoted as: $ns_{y}$, $ns_{y-1}$, $ns_{y-2}$ |
| pair feature | 81-83 | the geometric coefficient for the pair ($u$ and $v$) for the years $y$, $y-1$, $y-2$; calculated by `number_shared_neighbor**2 / (deg_u * deg_v)`, where `deg_u` is the degree of vertex $u$ and `deg_v` the degree of vertex $v$; denoted as: $geo_{y}$, $geo_{y-1}$, $geo_{y-2}$ |
| pair feature | 84-86 | the cosine coefficient for the pair ($u$ and $v$) for the years $y$, $y-1$, $y-2$; calculated by `geometric_index**0.5`; denoted as: $cos_{y}$, $cos_{y-1}$, $cos_{y-2}$ |
| pair feature | 87-89 | the Simpson coefficient for the pair ($u$ and $v$) for the years $y$, $y-1$, $y-2$; calculated by `number_shared_neighbor / np.min([deg_u, deg_v])`; denoted as: $spi_{y}$, $spi_{y-1}$, $spi_{y-2}$ |
| pair feature | 90-92 | the preferential attachment coefficient for the pair ($u$ and $v$) for the years $y$, $y-1$, $y-2$; calculated by `deg_u*deg_v`; denoted as: $pre_{y}$, $pre_{y-1}$, $pre_{y-2}$ |
| pair feature | 93-95 | the Sørensen–Dice coefficient for the pair ($u$ and $v$) for the years $y$, $y-1$, $y-2$; calculated by `2*num_shared_neighbor / (deg_u + deg_v)`; denoted as: $sod_{y}$, $sod_{y-1}$, $sod_{y-2}$ |
| pair feature | 96-98 | the Jaccard coefficient for the pair ($u$ and $v$) for the years $y$, $y-1$, $y-2$; calculated by `num_shared_neighbor/(deg_u + deg_v - num_shared_neighbor)`; denoted as: $jac_{y}$, $jac_{y-1}$, $jac_{y-2}$ |
| pair citation feature | 99-101 | the ratio of the sum of citations received by concepts $u$ and $v$ in the year $y$ to the sum of the number of papers mentioning either concept, similarly for years $y-1$, $y-2$; calculated by ($c_{u,y}$ + $c_{v,y}$) / ($pn_{u,y}$ + $pn_{v,y}$) |
| pair citation feature | 102-104 | the ratio of the product of citations received by concepts $u$ and $v$ in the year $y$ to the sum of the number of papers mentioning either concept, similarly for years $y-1$, $y-2$; calculated by ($c_{u,y} \cdot c_{v,y}$) / ($pn_{u,y}$ + $pn_{v,y}$) |
| pair citation feature | 105-107 | the sum of the average citations received by concepts $u$ and $v$ in the year $y$, $y-1$, $y-2$; e.g., calculated by ($cm_{u,y}$ + $cm_{v,y}$) for year $y$ |
| pair citation feature | 108-110 | the sum of the average total citations received by concepts $u$ and $v$ from their first publication up to the year $y$, $y-1$, $y-2$; e.g., calculated by ($ctm_{u,y}$ + $ctm_{v,y}$) for year $y$ |
| pair citation feature | 111-113 | the sum of the citations received by concepts $u$ and $v$ in the three-year period ending with year $y$, $y-1$, $y-2$; e.g., calculated by ($ct^{\Delta 3}_{u,y}$ + $ct^{\Delta 3}_{v,y}$) for year $y$ |
| pair citation feature | 114-116 | the sum of the average citations received by concepts $u$ and $v$ in the three-year period ending with year $y$, $y-1$, $y-2$; e.g., calculated by ($ctm^{\Delta 3}_{u,y}$ + $ctm^{\Delta 3}_{v,y}$) for year $y$ |
| pair citation feature | 117-119 | the minimum number of citations received by either concept $u$ or $v$ in the year $y$, $y-1$, $y-2$; e.g., min($c_{u,y}$, $c_{v,y}$) |
| pair citation feature | 120-122 | the maximum number of citations received by either concept $u$ or $v$ in the year $y$, $y-1$, $y-2$; e.g., max($c_{u,y}$, $c_{v,y}$) |
| pair citation feature | 123-125 | the minimum number of total citations received by either concept $u$ or $v$ from its first publication up to the year $y$, $y-1$, $y-2$; e.g., min($ct_{u,y}$, $ct_{v,y}$) |
| pair citation feature | 126-128 | the maximum number of total citations received by either concept $u$ or $v$ from its first publication up to the year $y$, $y-1$, $y-2$; e.g., max($ct_{u,y}$, $ct_{v,y}$) |
| pair citation feature | 129-131 | the minimum number of total citations received by either concept $u$ or $v$ in the three-year period ending with year $y$, $y-1$, $y-2$; e.g., min($ct^{\Delta 3}_{u,y}$, $ct^{\Delta 3}_{v,y}$) |
| pair citation feature | 132-134 | the maximum number of total citations received by either concept $u$ or $v$ in the three-year period ending with year $y$, $y-1$, $y-2$; e.g., max($ct^{\Delta 3}_{u,y}$, $ct^{\Delta 3}_{v,y}$) |
| pair citation feature | 135-137 | the minimum number of papers mentioning either concept $u$ or $v$, for year $y$, $y-1$, $y-2$; e.g., min($pn_{u,y}$, $pn_{v,y}$), min($pn_{u,y-1}$, $pn_{v,y-1}$), min($pn_{u,y-2}$, $pn_{v,y-2}$) |
| pair citation feature | 138-140 | the maximum number of papers mentioning either concept $u$ or $v$, for year $y$, $y-1$, $y-2$; e.g., max($pn_{u,y}$, $pn_{v,y}$), max($pn_{u,y-1}$, $pn_{v,y-1}$), max($pn_{u,y-2}$, $pn_{v,y-2}$) |
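
The pair features (indices 78-98) can be sketched for a single year as below, using the standard definitions of each coefficient in terms of the two neighbor sets. The function name `pair_features`, the dictionary keys and the zero fallback for isolated vertices are assumptions made for this example, not the repository's implementation.

```python
import math


def pair_features(neighbors_u, neighbors_v):
    """Similarity coefficients of an unconnected pair (u, v) for one year.

    `neighbors_u` and `neighbors_v` are the neighbor sets of u and v in that year.
    """
    deg_u, deg_v = len(neighbors_u), len(neighbors_v)
    shared = len(neighbors_u & neighbors_v)
    if deg_u == 0 or deg_v == 0:
        # Assumption: all coefficients are set to 0 when a vertex is isolated.
        return dict.fromkeys(["ns", "geo", "cos", "spi", "pre", "sod", "jac"], 0.0)
    geo = shared**2 / (deg_u * deg_v)
    return {
        "ns": shared,                               # shared neighbors
        "geo": geo,                                 # geometric coefficient
        "cos": math.sqrt(geo),                      # cosine coefficient
        "spi": shared / min(deg_u, deg_v),          # Simpson coefficient
        "pre": deg_u * deg_v,                       # preferential attachment
        "sod": 2 * shared / (deg_u + deg_v),        # Sørensen–Dice coefficient
        "jac": shared / (deg_u + deg_v - shared),   # Jaccard coefficient
    }


# Toy neighbor sets: both vertices have degree 4 and share neighbors 3 and 4.
features = pair_features({1, 2, 3, 4}, {3, 4, 5, 6})
```

Calling the function once per snapshot (years $y$, $y-1$, $y-2$) would give the three values listed for each coefficient.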

To run the benchmark, download the data from 10.5281/zenodo.14527306 and unzip the file in the benchmark_code folder.

benchmark_code
├── loops_fcNN.py: fully connected neural network model
├── loops_transformer.py: transformer model
├── loops_tree.py: random forest model
├── loops_xgboost.py: XGBoost model
└── other python files: Post-processing, produce Figures 6-8 from the evaluation of the different models.

The fpr_example folder contains three examples of about 10M evaluation samples (2019-2022), with raw outputs from a neural network trained on 2016-2019 data (accessible at 10.5281/zenodo.14527306), for producing Figure 11.
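
Raw model outputs of this kind are commonly summarized with a ROC curve. The sketch below assumes that "fpr" in the folder name refers to the false positive rate, which the text above does not state, and that each sample comes with a raw score and a binary label. The function names `roc_curve_points` and `auc` and the toy data are invented for illustration; this is not the code that produces Figure 11.

```python
def roc_curve_points(scores, labels):
    """(false positive rate, true positive rate) points from a threshold sweep."""
    pos = sum(labels)
    neg = len(labels) - pos
    if pos == 0 or neg == 0:
        raise ValueError("need at least one positive and one negative sample")
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    points = [(0.0, 0.0)]
    tp = fp = 0
    for i, (score, label) in enumerate(ranked):
        tp += label
        fp += 1 - label
        # Emit a point only after the last sample of a group of tied scores.
        if i + 1 == len(ranked) or ranked[i + 1][0] != score:
            points.append((fp / neg, tp / pos))
    return points


def auc(points):
    """Area under the curve by the trapezoidal rule."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


# Toy scores and labels, invented for illustration.
scores = [0.9, 0.8, 0.7, 0.6, 0.55, 0.4]
labels = [1, 1, 0, 1, 0, 0]
points = roc_curve_points(scores, labels)
area = auc(points)
```

For about 10M samples, a vectorized implementation would be preferable to this pure-Python loop, which is kept simple for readability.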
