GitHub - Daniele-Gregori/ArXiv-Hepth-Data-Analysis: Text analysis of all 163000+ theoretical high energy physics papers on arXiv.

Text analysis of all 163,000+ theoretical high-energy physics papers on arXiv (with hep-th as primary or cross-listed category), from 1986 to 2023.

This project explores the following tasks: 1) counting; 2) feature extraction; 3) classification; 4) question answering; 5) summarising; 6) recommending papers and research directions. The main results are:

interesting temporal trends appear in the popularity of title words;

two-word combinations of title words turn out to correspond to hep-th concepts, and allow effective feature extraction and CONCEPT embedding of abstracts;

classifiers of article categories are built as neural networks (NNs) based on either the CONCEPT or the SciBERT embedding;

with a more sophisticated NN, the CONCEPT classifier also works for the subcategories within the hep-th category;

question answering and summarisation of article introductions work effectively through the high-level AI functionality of the Wolfram Language (WL);

a first basic recommendation algorithm ranks papers according to their distance in feature space.
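The counting and feature-extraction steps listed above can be illustrated with a small sketch. The notebooks themselves are written in the Wolfram Language; the Python below is only an illustration, not the notebook's code. The stop-word list, the helper names (`tokenize`, `word_counts_by_year`, `bigram_counts`) and the toy titles are all assumptions introduced here.

```python
import re
from collections import Counter, defaultdict

# Hypothetical stop-word list; the filtering used in the notebook may differ.
STOP_WORDS = {"a", "an", "and", "the", "of", "in", "on", "for", "to", "with", "from"}


def tokenize(title):
    """Lower-case a title and keep alphabetic tokens that are not stop words."""
    return [w for w in re.findall(r"[a-z]+", title.lower()) if w not in STOP_WORDS]


def word_counts_by_year(papers):
    """Map each year to a Counter of title words (the counting task)."""
    counts = defaultdict(Counter)
    for year, title in papers:
        counts[year].update(tokenize(title))
    return counts


def bigram_counts(papers):
    """Count two-word combinations that are adjacent after stop-word removal."""
    counts = Counter()
    for _, title in papers:
        words = tokenize(title)
        counts.update(zip(words, words[1:]))
    return counts


# Toy (year, title) records, invented for illustration only.
papers = [
    (1998, "Black hole entropy in string theory"),
    (1999, "String theory and conformal field theory"),
    (2020, "Black hole information and entanglement entropy"),
]

by_year = word_counts_by_year(papers)
bigrams = bigram_counts(papers)

# Frequent bigrams are candidate "concepts" for building a feature vector.
print(bigrams.most_common(2))
```

Counting per year gives the raw material for the temporal trends, while the most frequent two-word combinations (here "black hole" and "string theory") play the role of concept features.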

Looking ahead, relating papers in feature space seems a sensible way to inspire new discoveries.
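As a minimal sketch of what relating papers in feature space could look like, the Python below ranks papers by cosine distance between feature vectors. The choice of cosine distance, the function names and the toy vectors are assumptions made for illustration; the notebook's actual distance measure and embedding are not specified here.

```python
import math


def cosine_distance(u, v):
    """Return 1 - cos(angle) between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 1.0  # treat an all-zero vector as maximally distant
    return 1.0 - dot / (norm_u * norm_v)


def recommend(query_id, features, k=2):
    """Return the ids of the k papers closest to query_id in feature space."""
    query = features[query_id]
    ranked = sorted(
        (cosine_distance(query, vec), pid)
        for pid, vec in features.items()
        if pid != query_id
    )
    return [pid for _, pid in ranked[:k]]


# Hypothetical feature vectors: each entry counts one concept in the abstract.
features = {
    "paper_A": [2, 1, 0, 0],
    "paper_B": [1, 1, 0, 0],
    "paper_C": [0, 0, 3, 1],
    "paper_D": [1, 0, 0, 1],
}

nearest = recommend("paper_A", features, k=2)
print(nearest)
```

Cosine distance ignores the overall length of a vector, so a long abstract and a short one that mention the same concepts in the same proportions are treated as close.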

All of this can be found in the notebook arXivDataAnalysisV1.3 (distributed as a zip archive, to be unzipped before opening).

Then, as a partial aside, the notebook Affiliation Countries shows the computation of the total number of papers over affiliated co-authors, for each country in 2023. This is done directly through the INSPIRE-HEP (inspirehep) API. The results are presented both as totals and as shares per capita (rescaled to 10^6 inhabitants).
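The per-country arithmetic can be sketched as follows. This is an illustration under stated assumptions, not the notebook's method: it reads "papers over affiliated co-authors" as fractional counting, where each paper contributes 1/(number of co-authors) to the country of each co-author, and it uses invented population figures. The function names and input layout are hypothetical, and no call to the INSPIRE-HEP API is made.

```python
from collections import defaultdict


def fractional_counts(papers):
    """Split each paper's unit credit equally among its co-authors' countries.

    `papers` is a list of papers, each given as the list of affiliation
    countries of its co-authors (one entry per co-author).
    """
    totals = defaultdict(float)
    for author_countries in papers:
        if not author_countries:
            continue  # skip papers without affiliation data
        share = 1.0 / len(author_countries)
        for country in author_countries:
            totals[country] += share
    return dict(totals)


def per_million(totals, population):
    """Rescale country totals to papers per 10^6 inhabitants."""
    return {c: t * 1e6 / population[c] for c, t in totals.items()}


# Toy input, invented for illustration only.
papers = [
    ["Italy", "Italy"],   # two co-authors, both affiliated in Italy
    ["Italy", "Japan"],   # one co-author in each country
    ["Japan"],            # single-author paper
]
# Placeholder populations, NOT real figures.
population = {"Italy": 2_000_000, "Japan": 4_000_000}

totals = fractional_counts(papers)
shares = per_million(totals, population)
print(totals, shares)
```

With fractional counting the country totals add up to the number of papers, so a large collaboration is not counted once for every participating country.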

Repository files (35 commits):

Affiliation Countries.nb
LICENSE
README.md
arXivDataAnalysisV1.0.nb
arXivDataAnalysisV1.1.nb
arXivDataAnalysisV1.2.nb.zip
arXivDataAnalysisV1.3.nb.zip
data_arXiv_hepth_0_43000.zip
dateListPlotAll.jpg
tableTotalWords.jpg
videoWordCloudTitles_hepth_2017-2023styled.mp4
