Text analysis of all 163 000+ theoretical high-energy physics papers on arXiv (with hep-th as primary or cross-listed category), from 1986 to 2023.
Exploration of the following possible tasks: 1) counting; 2) feature extraction; 3) classification; 4) question answering; 5) summarising; 6) recommending papers / research directions. The results are the following:
- interesting temporal trends appear in the popularity of title words;
- two-word combinations of title words turn out to correspond to hep-th concepts, and allow effective feature extraction and CONCEPT embedding of abstracts;
- classifiers of article categories are built as neural networks (NNs) based on either the CONCEPT or the SciBERT embedding;
- with a more sophisticated NN, the CONCEPT classifier also works for the subcategories within the hep-th category;
- effective question answering and summarization of article introductions are obtained through the high-level AI functionality of WL;
- a first basic recommendation algorithm is built, based on distance in feature space.
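As an illustration of the two-word feature extraction mentioned in the list above, here is a minimal Python sketch. It is not the notebook's code: the toy titles, the stopword list, the `min_count` threshold and all function names are assumptions made for this example. It counts adjacent word pairs in titles, keeps the frequent ones as "concepts", and then encodes a text as a 0/1 vector over those concepts.

```python
from collections import Counter

# Toy titles, invented for illustration only (not taken from the arXiv dataset).
titles = [
    "Black hole entropy in string theory",
    "Entropy of a black hole from conformal field theory",
    "String theory and conformal field theory",
]

# Hypothetical stopword list; a real analysis would need a fuller one.
STOPWORDS = {"a", "an", "the", "of", "in", "and", "from", "on", "for"}

def tokenize(text):
    """Lowercase, split on whitespace, and drop stopwords."""
    return [w for w in text.lower().split() if w not in STOPWORDS]

def bigrams(words):
    """Adjacent word pairs (after stopword removal), joined by a space."""
    return [f"{a} {b}" for a, b in zip(words, words[1:])]

def extract_concepts(titles, min_count=2):
    """Keep the two-word combinations that occur at least min_count times."""
    counts = Counter(bg for t in titles for bg in bigrams(tokenize(t)))
    return sorted(bg for bg, n in counts.items() if n >= min_count)

def concept_embedding(text, concepts):
    """Binary vector: 1 if the concept occurs in the text, else 0."""
    found = set(bigrams(tokenize(text)))
    return [1 if c in found else 0 for c in concepts]

concepts = extract_concepts(titles)
vector = concept_embedding(
    "We compute the black hole entropy in a string theory setup", concepts
)
print(concepts)
print(vector)
```

Counting the same word pairs per publication year, instead of over the whole corpus, would give the kind of temporal popularity trends mentioned in the first bullet.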
Looking ahead, relating papers to one another in feature space seems a sensible way to inspire new discoveries.
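A recommendation based on distance in feature space can be sketched as a nearest-neighbour search. The snippet below is a hedged Python illustration, not the notebook's implementation: the choice of cosine distance, the paper identifiers, the toy embedding vectors and the function names are all assumptions.

```python
import math

def cosine_distance(u, v):
    """1 - cosine similarity; defined as 1.0 when either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 1.0
    return 1.0 - dot / (norm_u * norm_v)

def recommend(query_id, embeddings, k=2):
    """Return the k papers closest to query_id, excluding the query itself."""
    query = embeddings[query_id]
    others = [pid for pid in embeddings if pid != query_id]
    others.sort(key=lambda pid: cosine_distance(query, embeddings[pid]))
    return others[:k]

# Hypothetical feature vectors (for example, 0/1 concept embeddings).
embeddings = {
    "paper_A": [1, 0, 0, 1],
    "paper_B": [1, 0, 0, 0],
    "paper_C": [0, 1, 1, 0],
    "paper_D": [1, 1, 0, 1],
}

recommended = recommend("paper_A", embeddings, k=2)
print(recommended)
```

Cosine distance is used here because it ignores the overall length of the vectors; Euclidean distance or another metric could be substituted without changing the structure of the sketch.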
All of this can be found in the notebook named arXivDataAnalysisV1.3 (to be unzipped before use).
As a partial aside, the notebook Affiliation Countries shows the computation, for each country in 2023, of the total number of papers counted over affiliated co-authors. This is done directly through the inspirehep API. The results are presented both as totals and as shares per capita (rescaled to per 10^6 inhabitants).
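The per-country counting can be illustrated with a short Python sketch. It rests on an assumed reading of "papers over affiliated co-authors": each paper contributes to a country the fraction of its co-authors affiliated there, so that every paper adds up to one. The notebook may count differently. The records, country labels and population figures below are invented placeholders, and no call to the inspirehep API is made.

```python
from collections import defaultdict

# Toy records: for each paper, the affiliation country of every co-author.
# Labels "X", "Y", "Z" are placeholders, not real data.
papers = [
    ["X", "X", "Y"],
    ["Y"],
    ["X", "Z"],
]

# Hypothetical population figures for the placeholder countries.
population = {"X": 2_000_000, "Y": 500_000, "Z": 1_000_000}

def fractional_totals(papers):
    """Assumed counting rule: each co-author carries 1/(number of co-authors)."""
    totals = defaultdict(float)
    for countries in papers:
        if not countries:
            continue  # skip papers without affiliation information
        weight = 1.0 / len(countries)
        for country in countries:
            totals[country] += weight
    return dict(totals)

def per_million(totals, population):
    """Rescale per-capita shares to papers per 10^6 inhabitants."""
    return {c: totals[c] / population[c] * 1e6 for c in totals}

totals = fractional_totals(papers)
shares = per_million(totals, population)
print(totals)
print(shares)
```

With this rule the country totals sum to the number of papers, which makes the totals and the per-capita shares directly comparable across countries.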