This library searches Google Scholar profiles to fetch author data, including publications, citations, and co-authors. It offers tools for weighted citation analysis, regression modeling of citation trends, deviation analysis, and graph visualizations. With Scholar analysis, we can visualize citation connections and much more.
Important note: This library uses the Scholarly library to scrape data from Google Scholar. Therefore, data fetching is very slow, and for larger data sets (e.g. getting and plotting an author and their co-authors) the process takes a long time. For this reason, the library focuses on getting data for individual authors and their papers and analysing it, not on collecting big data.
After installing, run
pip install -r /path/to/requirements.txt
to download all requirements.
Create an Author instance
import <to_name> as gs
author = gs.Author('id') # where id is the unique id found in Google Scholar profile page
When an Author is created, the author's data is fetched and saved in the following attributes. Note that this step takes a while.
self.scholar_id: str # Unique Author ID
self.name: str # Author full name
self.affiliation: str # Author description, e.g. Machine Learning engineer
self.interests: str # Author interests from google scholar bio
self.citedby: int # Cited by
self.citedby_5y: int # Cited by in the last 5 years
self.h_index: int # h-index (e.g. h=10 means 10 publications with at least 10 citations each)
self.h_index_5y: int # 5-year h-index
self.i10_index: int # i10-index: number of publications with at least 10 citations
self.i10_index_5y: int # 5-year i10-index
self.cites_per_year: dict[int, int] # Citations per year, e.g. {2010: 1, 2011: 0, ...}
self.co_authors: list[str] # Co-Authors
self.publications: list[list[]] # List of publications, and data for each publication, such as _______
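To make the h-index and i10-index attributes concrete, here is a minimal sketch that computes both from a list of per-paper citation counts. The helper names `h_index` and `i10_index` and the sample counts are made up for illustration and are not part of this library, which reads these values from the Google Scholar profile.

```python
def h_index(citations):
    """Largest h such that h papers have at least h citations each."""
    ranked = sorted(citations, reverse=True)
    h = 0
    for rank, cites in enumerate(ranked, start=1):
        if cites >= rank:
            h = rank
        else:
            break
    return h


def i10_index(citations):
    """Number of papers with at least 10 citations."""
    return sum(1 for cites in citations if cites >= 10)


counts = [44, 31, 5, 1, 0]  # made-up per-paper citation counts
print(h_index(counts), i10_index(counts))  # prints: 3 2
```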
Pretty-prints all of the fetched author's data. Publications are shown depending on the showpublications parameter (True by default). Example usage:
Author.pprint_all_author_data(showpublications=False)
>>>
Author Name: Kostas Vandlopoulos
Autor Google Schoalr ID: XXXXXXXXXXXX
Affiliation: Data Analyst
Interests: ['Artificial Intelligence', 'Machine Learning', 'Brain Computer Interfaces']
Cited By: 81
citedby_5y: 81
h_index: 3
h_index_5y: 3
i10_index: 2
i10_index_5y: 2
cites_per_year: {2021: 4, 2022: 17, 2023: 27, 2024: 31}
co_authors: [] # May come back empty for some profiles (known issue, to be fixed)
publications: ['Predictive maintenance-bridging artificial intelligence and IoT', 2021, '2021 IEEE World AI IoT Congress (AIIoT), 0413-0419, 2021', 'jP1qgO4AAAAJ:u5HHmVD_uO8C', 44, '/scholar?hl=en&cites=1217316688053608473', '1217316688053608473', '0413-0419', 'IEEE', {2021: 3, 2022: 8, 2023: 14, 2024: 18}, ['Gerasimos G Samatas', 'Seraphim S Moumgiakmas', 'George A Papakostas']]
Saves Author's personal data in json file:
Author.save_authors_person_data_in_json('output.json')
>>>
# Example format
{
"author_name": "Vandl",
"author_google_scholar_id": "XXXXXXXXXXXX",
"affiliation": "Data Analyst, MLV Research Group",
"interests": [
"Artificial Intelligence",
"Machine Learning",
"Motor Imagery",
"Brain Computer Interfaces",
"EEG"
],
"cited_by": 81,
"cited_by_5y": 81,
"h_index": 3,
"h_index_5y": 3,
"i10_index": 2,
"i10_index_5y": 2,
"cites_per_year": {
"2021": 4,
"2022": 17,
"2023": 27,
"2024": 31
},
"co_authors": []
}
Saves author publication data in json file:
Author.save_authors_paper_data_in_json('output.json')
>>>
# Example format
[
{
"paper_title": "Predictive maintenance-bridging artificial intelligence and IoT",
"publication_year": 2021,
"journal_info": "2021 IEEE World AI IoT Congress (AIIoT), 0413-0419, 2021",
"author_pub_id": "XXXXXXX:XXXXX",
"num_of_citations": 44,
"cited_by_url": "/scholar?hl=en&cites=XXXXX",
"cites_id": "XXXXXXX",
"pages": "0413-0419",
"publisher": "IEEE",
"cites_per_year": {
"2021": 3,
"2022": 8,
"2023": 14,
"2024": 18
},
"all_authors": [
"Tim",
"Sim",
"Bartholomew"
]
},
]
checkpoint_save_author_and_coauthors_in_tree(identifier: str, full_path: str = None, clip: int=-1) -> None
param: identifier: Unique Google Scholar profile ID.
param: full_path: Directory in which the folder will be saved. Defaults to the working directory.
param: clip: Number of papers to fetch for each co-author. By default, all papers are fetched. In most cases, if an author has more than 500 publications, the code throws a MaxTriesExceeded error (request denied). The clip parameter can therefore be used to fetch only N papers for every co-author (including the original root author).
Saves co-author relationship tree in a folder. Depth of tree is hard-coded at 2, but this function can be called recursively to increase depth. Note however that more depth takes exponentially more time. Output folder example:
RootAuthorFolderID # folder
---author_name_author_data.json # json file in save_authors_person_data_in_json() format
---author_name_paper_data.json # json file in save_authors_paper_data_in_json() format
---CoAuthor1ID # folder
-----co_author1_name_author_data.json # ---
-----co_author1_name_paper_data.json
---CoAuthor2ID
-----co_author2_name_author_data.json
-----co_author2_name_paper_data.json
By default, if we consider the root of the tree (depth=0) to be the author passed as parameter, the depth=1 co-authors are saved as folders, and the depth=2 co-authors are saved inside the json files. After the function has finished successfully, we can iterate through the subfolders and call the function on each subfolder's name, increasing the depth by 1. As mentioned, this takes an enormous amount of time.
Note: When calling this function, the code will most likely break at some point because of a MaxTriesExceeded exception. When that happens, every folder fetched so far is already saved, and once the IP is changed and the function is called again, the process continues from that point, displaying a progress percentage.
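The resume behaviour described above can be pictured with a small sketch. It is not the library's implementation: `save_with_checkpoints`, `fake_fetch`, and the `author_data.json` file name are hypothetical, and the real fetch step is replaced by a stub. The idea is simply that an author whose folder already exists on disk is skipped, so a second call only fetches what is still missing.

```python
import json
import os
import tempfile


def save_with_checkpoints(author_ids, root_dir, fetch_fn):
    """Fetch and save each author once; skip IDs that were already saved."""
    done = 0
    for author_id in author_ids:
        folder = os.path.join(root_dir, author_id)
        data_file = os.path.join(folder, "author_data.json")
        if not os.path.exists(data_file):
            data = fetch_fn(author_id)  # may raise when requests are denied
            os.makedirs(folder, exist_ok=True)
            with open(data_file, "w", encoding="utf-8") as fh:
                json.dump(data, fh)
        done += 1
        print(f"progress: {100 * done / len(author_ids):.0f}%")
    return done


calls = []


def fake_fetch(author_id):
    """Stand-in for the slow scraping step; records which IDs were fetched."""
    calls.append(author_id)
    return {"author_google_scholar_id": author_id}


with tempfile.TemporaryDirectory() as root:
    save_with_checkpoints(["A", "B"], root, fake_fetch)       # first run
    save_with_checkpoints(["A", "B", "C"], root, fake_fetch)  # resumed run

print(calls)  # A and B are fetched once, C only on the second run
```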
Given a json file created with the save_authors_paper_data_in_json method, returns a nested list of basic data for each publication. Note that these are functions, not static methods of the Author class. Example usage and output:
data: list[list] = get_paper_data_from_json('test.json')
print(data)
>>>
[['Predictive maintenance-bridging artificial intelligence and IoT', 2021, '2021 IEEE World AI IoT Congress (AIIoT), 0413-0419, 2021', 'jP1qgO4AAAAJ:u5HHmVD_uO8C', 44, '/scholar?hl=en&cites=1217316688053608473', '1217316688053608473', '0413-0419', 'IEEE', {'2021': 3, '2022': 8, '2023': 14, '2024': 18}], ...]
Returns paper data for an author by fetching it from the author's ID instead of reading it from a json file. Example usage and output:
data: list[list] = get_paper_params('xxxxxxxx')
print(data)
>>>
[...] # same format as the output above
Returns Author's personal data list given an ID, without having to create Author instance. Example usage and output:
params: list = get_author_params('xxxxxxxxxxx')
print(params)
>>>
['Vandlopoulos Kostas', # Name
'Data Analyst, MLV Research Group', # Affiliation
['Artificial Intelligence','Machine Learning'], # Interests
81, # Cited by
81, # Cited by 5-year
3, # h_index
3, # h_index_5y
2, # i10_index
2, # i10_index_5y
{2021: 4, 2022: 17, 2023: 27, 2024: 31}, # Citations per year
[]] # Co-authors
Returns a dictionary with the paper name as key and a dictionary of citations per year as value. In the current version, it can only be called on a json file.
foo: dict[str, dict[str, int]] = get_citations_per_year_per_paper('path.json')
print(foo)
>>>
{'Predictive maintenance-bridging artificial intelligence and IoT': {'2021': 3, '2022': 8, '2023': 14, '2024': 18}, 'Computer vision for fire detection on UAVs—From software to hardware': {'2021': 1, '2022': 9, '2023': 9, '2024': 12}, 'Robustly effective approaches on motor imagery-based brain computer interfaces': {'2023': 4, '2024': 1, '2022': 0}, 'Benchmarking convolutional neural networks on continuous EEG signals: The case of motor imagery–based BCI': {}}
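As a rough illustration of how such a mapping relates to the paper json format shown earlier, the sketch below builds it from already-parsed json entries. The function name `citations_per_year_per_paper` and the sample data are assumptions for this example, not the library's code. Note that the year keys are strings because json object keys are always strings.

```python
import json


def citations_per_year_per_paper(papers):
    """Map each paper title to its per-year citation dictionary."""
    return {p["paper_title"]: p.get("cites_per_year", {}) for p in papers}


# Made-up sample in the save_authors_paper_data_in_json() format (fields trimmed)
sample = json.loads("""
[
  {"paper_title": "Paper A", "cites_per_year": {"2021": 3, "2022": 8}},
  {"paper_title": "Paper B", "cites_per_year": {}}
]
""")

mapping = citations_per_year_per_paper(sample)
print(mapping)
```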
param: path: Path of the tree folder, as created by the checkpoint_save_author_and_coauthors_in_tree function.
return: List of co-author relationship pairs.
Takes a folder as input and returns the co-author pairs as a list. Example usage:
A # folder
---author_name_author_data.json # json file in save_authors_person_data_in_json() format
---author_name_paper_data.json # json file in save_authors_paper_data_in_json() format
---B # folder
-----co_author1_name_author_data.json # In here, there are co-authors E and K
-----co_author1_name_paper_data.json
---C
-----co_author2_name_author_data.json # In here, there are co-authors G and K
-----co_author2_name_paper_data.json
print(get_co_author_graph_pairs(r'path\to\A')) # raw string, so that \t is not read as a tab
>>>
[[A, B], [A, C], [B,E], [B, K], [C,G], [C,K]]
# Note that K is a co-author (child) of both B and C. Therefore, when plotted later,
# this does not always result in a tree but in a graph.
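To see why the pairs above describe a graph rather than a tree, the following sketch collects the parents of every node and lists the nodes that have more than one. It uses only the standard library; `parents_by_node` is a hypothetical helper for this example, and the pairs are the ones from the example output, written as strings.

```python
from collections import defaultdict


def parents_by_node(pairs):
    """Collect, for every child in [parent, child] pairs, the set of its parents."""
    parents = defaultdict(set)
    for parent, child in pairs:
        parents[child].add(parent)
    return parents


pairs = [["A", "B"], ["A", "C"], ["B", "E"], ["B", "K"], ["C", "G"], ["C", "K"]]

# In a tree every node has at most one parent; any node listed here breaks that rule.
shared = sorted(node for node, ps in parents_by_node(pairs).items() if len(ps) > 1)
print(shared)  # prints: ['K']
```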
weight_citations_based_on_function_of_time(cites_per_year_per_paper: dict[str, dict[str, int]], function: str) -> float
Using weight_citations_based_on_function_of_time, we can "weigh" all of the citations of an author based on how recent the citations are compared to how old the paper is, using:
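A reconstruction of the formula that is consistent with the definitions listed here, assuming that $$y_p$$ (a symbol introduced for this sketch) is the publication year of paper p and that W names the weighted citation total:

$$W = \sum_{p=1}^{papers} \sum_{n=y_p}^{y_c} c(p, n)\, f(n - y_p)$$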
Where:
- papers is the number of all the author's publications.
- y_c is the current year.
- c(p, n) is the number of citations of paper p in year n (e.g., c(10, 2018) = 114 means that the 10th paper had 114 citations in 2018).
- f(x) is an increasing function of x with f(0) = 1. For example, an exponential function like $$f(x) = e^{x/10}$$ can be used. In this case, citations in the first year (x = 0) have no weight added to them, and every citation after the first year is multiplied by a weight greater than 1.
The publication year is subtracted from the year of the citations before being passed into f, so that f(0) = 1 applies to the publication year and the following years are passed into the function as 1, 2, 3, etc.
The function accepts a dictionary of multiple publications, with nested dictionaries (citations per year) as values.
The mathematical function parameter is a string containing a function of x in LaTeX format.
For example, using an exponential function like $$f(x) = e^{x/10}$$:
- f(0) = 1, meaning citations in the first year after a paper's publication are not weighted.
- f(7) ~= 2, meaning that citations 7 years after a paper's publication are counted as ~2 citations, etc.
- Using a function such that $$f(0)=1$$ and $$f\nearrow, \forall x\gt0$$ means that the return value of the function will always be greater than the input citations (or equal, if there are no citations after the first year).
- Using a different, decreasing function such as $$f(x)=e^{(-x^2)}$$ we can derive an index of how much the author's papers are cited (mostly) in their first year after publication.
Let
Then, the weighted citations for this year are:
# here a non-increasing function is used as an example
pubs = {'Publication 1': {'2000': 10, '2001': 10}} # can take many publications
weighted_citations: float = weight_citations_based_on_function_of_time(pubs, '{e}^(-(x^2))')
# f(0) = 1, f(1)~= 0.36, -> (10x1 + 10x0.36)
print(weighted_citations)
>>>
13.6
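The weighting can also be sketched in plain Python. This is an illustration under two assumptions, not the library's implementation: the function f is passed as a Python callable instead of a LaTeX string, and the earliest year in each paper's dictionary is treated as the publication year (x = 0). The name `weighted_citations` is hypothetical.

```python
import math


def weighted_citations(cites_per_year_per_paper, f):
    """Sum citations over all papers, each weighted by f(years since first year)."""
    total = 0.0
    for cites_per_year in cites_per_year_per_paper.values():
        if not cites_per_year:
            continue  # paper without citation data contributes nothing
        first_year = min(int(year) for year in cites_per_year)
        for year, cites in cites_per_year.items():
            total += cites * f(int(year) - first_year)
    return total


pubs = {"Publication 1": {"2000": 10, "2001": 10}}
w = weighted_citations(pubs, lambda x: math.exp(-x ** 2))
print(round(w, 2))  # 10 * 1 + 10 * e^(-1), prints: 13.68
```

With an increasing f such as `lambda x: math.exp(x / 10)` the same call returns a value above the raw total of 20, which matches the behaviour described for increasing functions.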
param: co_author_pairs: List of lists (pairs), as returned by the get_co_author_graph_pairs() function.
param: kwargs: Keyword arguments passed to the networkx.draw() function, which eventually plots the graph as shown below.
Example usage and output:
plot_co_author_graph(get_co_author_graph_pairs('unique_id')) # assuming unique_id is a folder that exists in the working directory
>>> # here the graph is huge so I took a small screenshot.
The function takes a nested dictionary as input, in the form
d = {paper_name: str:{year: str:citations_per_year: int}}
and returns the best-fit gamma distribution parameters as a tuple of three values.
nested_dict: dict[dict] = get_citations_per_year_per_paper(json_name='data.json')
print(get_gamma_distribution_best_fit_parameters(nested_dict))
>>>
(2.3411, 1.3653, 0.4675)
Given the parameters
plot_author_citations(cites_per_year_per_paper,show_regression, other_authors, save, output_directory, file_name) -> None:
Takes a nested dictionary in the form
d = {paper_name: str:{year: str:citations_per_year: int}}
and plots the data as dots. The x-axis represents the years since a paper's publication, while the y-axis shows the citations of that paper in year x.
cites_per_year_per_paper: Dict[str, Dict[str, int]]: Dictionary in the get_citations_per_year_per_paper() output format.
show_regression: bool = False: Uses get_gamma_distribution_best_fit_parameters() to calculate and plot the best-fit curve on the author's citation data.
other_authors: List[float] | List[List[float]] = None: List or nested list of gamma distribution parameters of other authors' data curves, found with get_gamma_distribution_best_fit_parameters(). For example, if other_authors = [3, 2, 1], the function will also plot author_plot.png.
plot_author_citations(get_citations_per_year_per_paper('data.json'), show_regression=True, other_authors=[[100, 5, 2.5],[50, 2, 1]])
>>>
# output image
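The dots in such a plot come from turning the nested dictionary into (years since publication, citations) points. A minimal sketch of that conversion follows. It assumes that the earliest year in each paper's dictionary stands in for the publication year; `to_scatter_points` and the sample data are hypothetical and not part of the library.

```python
def to_scatter_points(cites_per_year_per_paper):
    """Flatten {paper: {year: citations}} into (years_since_first_year, citations) points."""
    points = []
    for cites_per_year in cites_per_year_per_paper.values():
        if not cites_per_year:
            continue  # nothing to plot for papers without citation data
        first_year = min(int(year) for year in cites_per_year)
        # Sort by numeric year, since json year keys are strings in arbitrary order
        for year, cites in sorted(cites_per_year.items(), key=lambda kv: int(kv[0])):
            points.append((int(year) - first_year, cites))
    return points


data = {
    "Paper A": {"2021": 3, "2022": 8, "2023": 14},
    "Paper B": {"2023": 4, "2024": 1, "2022": 0},
    "Paper C": {},
}
points = to_scatter_points(data)
print(points)
```

The x and y values of these points could then be passed to any plotting routine, for example a scatter plot.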