diff --git a/README.md b/README.md
index 7625da5..4520148 100644
--- a/README.md
+++ b/README.md
@@ -1,13 +1,29 @@
 # Article Linking
 ![Python application](https://github.com/georgetown-cset/article-linking/workflows/Python%20application/badge.svg)
 
-This repository contains a description and supporting code for CSET's current method of
-cross-dataset article linking. Note that we use "article" very loosely, although in a way that to our knowledge
-is fairly consistent across corpora. Books, for example, are included.
+At CSET, we aim to produce a more comprehensive set of scholarly literature by ingesting multiple sources and then
+deduplicating articles. This repository contains CSET's current method of cross-dataset article linking. Note that we
+use "article" very loosely, although in a way that to our knowledge is fairly consistent across corpora. Books, for
+example, are included. We currently include articles from arXiv, Web of Science, Papers With Code, Semantic Scholar,
+The Lens, and OpenAlex. Some of these sources are largely duplicative (e.g. arXiv is well covered by other corpora)
+but are included to aid in linking to additional metadata (e.g. arXiv fulltext).
 
-For each article in arXiv, WOS, Papers With Code, Semantic Scholar, The Lens, and OpenAlex
-we normalized titles, abstracts, and author last names. For the purpose of matching, we filtered out
-titles, abstracts, and DOIs that occurred more than 10 times in the corpus. We then considered each group of articles
+For more information about the overall merged academic corpus, which is produced using several data pipelines including
+article linkage, see the [ETO documentation](https://eto.tech/dataset-docs/mac/).
+
+## Matching articles
+
+To match articles, we need to extract the data that we want to use in matching and put it in a consistent format. The
+SQL queries specified in the `sequences/generate_{dataset}_data.tsv` files are run in the order they appear in those
+files. For OpenAlex we exclude documents with a `type` of Dataset, Peer Review, or Grant. Additionally, we take every
+combination of the Web of Science titles, abstracts, and pubyear so that a match on any of these combinations will
+result in a match on the shared WOS id. Finally, for Semantic Scholar, we exclude any documents that have a non-null
+publication type that is one of Dataset, Editorial, LettersAndComments, News, or Review.
+
+For each article in arXiv, Web of Science, Papers With Code, Semantic Scholar, The Lens, and OpenAlex
+we [normalized](utils/clean_corpus.py) titles, abstracts, and author last names to remove whitespace, punctuation,
+and other artifacts thought to not be useful for linking. For the purpose of matching, we filtered out titles,
+abstracts, and DOIs that occurred more than 10 times in the corpus. We then considered each group of articles
 within or across datasets that shared at least one of the following (non-null) metadata fields:
 
 * Normalized title
@@ -20,26 +36,34 @@ as well as a match on one additional field above, or on
 * Publication year
 * Normalized author last names
 
-to correspond to one article in the merged dataset. We add to this set "near matches" of the concatenation
-of the normalized title and abstract within a publication year, which we identify using simhash.
+to correspond to one article in the merged dataset. We also [link](sql/all_match_pairs_with_um.sql) articles based on
+vendor-provided cross-dataset links.
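The match rule described in the README text above can be made concrete with a small sketch. This is only an illustration of the rule as stated; the pipeline itself applies it in SQL, and the field names below (`title_norm`, `abstract_norm`, `citations`, `doi`, `year`, `last_names_norm`) are placeholders rather than the actual column names.

```python
# Illustrative only: placeholder field names, not the pipeline's actual SQL columns.
STRONG_FIELDS = ("title_norm", "abstract_norm", "citations", "doi")
EXTRA_FIELDS = ("year", "last_names_norm")


def is_match(a: dict, b: dict) -> bool:
    """Two records match if they share at least one non-null strong field,
    plus at least one additional shared (non-null) field, strong or extra."""

    def shared(field: str) -> bool:
        return a.get(field) is not None and a.get(field) == b.get(field)

    strong_hits = sum(shared(f) for f in STRONG_FIELDS)
    extra_hits = sum(shared(f) for f in EXTRA_FIELDS)
    return strong_hits >= 1 and strong_hits + extra_hits >= 2
```

Groups of articles connected by such pairwise matches, directly or transitively, are then treated as one merged article.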
+
+## Generating merged articles
+
+Given a set of articles that have been matched together, we [generate](utils/create_merge_ids.py) a single "merged id"
+that is linked to all the "original" (vendor) ids of those articles. Some points from our implementation:
 
-To do this, we run the `linkage_dag.py` on airflow. The article linkage runs weekly, triggered by the `scholarly_lit_trigger` dag.
+* If articles that have been seen in a previous run and were previously assigned to different merged ids are now matched
+together, we assign them to a new merged id.
+* If a set of articles previously assigned to a given merged id _loses_ articles (either because it is now assigned to
+a different merged id, or because it has been deleted from one of the input corpora), we give this set of articles a
+new merged id.
+* If a set of articles previously assigned to a given merged id _gains_ articles without losing any old articles, we
+keep the old merged id for these articles.
 
-For an English description of what the dag does, see [the documentation](methods_documentation/overview.md).
+This implementation is meant to ensure that downstream pipelines (e.g. model inference, canonical metadata assignment)
+always reflect outputs on current metadata for a given merged article regardless of downstream pipeline implementation.
 
-### How to use the linkage tables (CSET only)
+## Automation and output tables
 
-We have three tables that are most likely to help you use article linkage.
+We automate article linkage using Apache Airflow. `linkage_dag.py` contains our current implementation.
 
-- `gcp_cset_links_v2.article_links` - For each original ID (e.g., from WoS), gives the corresponding CSET ID.
-This is a many-to-one mapping. Please update your scripts to use `gcp_cset_links_v2.article_links_with_dataset`,
-which has an additional column that contains the dataset of the `orig_id`.
+* This dag is triggered from the [Semantic Scholar ETL dag](https://github.com/georgetown-cset/semantic-scholar-etl-pipeline/blob/main/s2_dag.py) which runs once a month.
+* This dag triggers the [Org Fixes dag](https://github.com/georgetown-cset/org-fixes/blob/main/org_fixes_dag.py).
 
-- `gcp_cset_links_v2.all_metadata_with_cld2_lid` - provides CLD2 LID for the titles and abstracts of each
-current version of each article's metadata. You can also use this table to get the metadata used in the
-match for each version of the raw articles. Note that the `id` column is _not_ unique as some corpora like WOS
-have multiple versions of the metadata for different languages.
+The DAG generates two tables of analytic significance:
 
-- `gcp_cset_links_v2.article_merged_metadata` - This maps the CSET `merged_id` to a set of merged metadata.
-The merging method takes the maximum value of each metadata field across each matched article, which may not
-be suitable for your purposes.
+* `staging_literature.all_metadata_with_cld2_lid` - captures metadata for all unmerged articles in a
+standard format. It also contains [language ID predictions](utils/run_lid.py) for titles and abstracts based on CLD2.
+* `literature.sources` - contains pairs of merged ids and original (vendor) ids linked to those merged ids.
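The three merged-id rules in the README text above amount to: reuse an old merged id only when exactly one old id is represented in the match set and none of that id's previous articles have been lost. Below is a minimal sketch of that decision, assuming hypothetical lookup dictionaries; the real logic lives in `utils/create_merge_ids.py` and also handles batching and persistence.

```python
from typing import Callable, Dict, Set


def assign_merged_id(
    match_set: Set[str],
    old_id_for_article: Dict[str, str],        # orig id -> previously assigned merged id
    old_articles_for_id: Dict[str, Set[str]],  # merged id -> orig ids previously assigned to it
    make_new_id: Callable[[], str],
) -> str:
    """Return the merged id for a set of matched original (vendor) article ids."""
    # Which old merged ids appear in this match set?
    previous_ids = {old_id_for_article[a] for a in match_set if a in old_id_for_article}
    if len(previous_ids) == 1:
        (candidate,) = previous_ids
        # Keep the old id only if the set gained articles without losing any.
        if old_articles_for_id.get(candidate, set()) <= match_set:
            return candidate
    # Articles merged across different old ids, a shrunken set, or an entirely new set.
    return make_new_id()
```

Minting a new id whenever a set shrinks or straddles old ids is what keeps downstream pipelines from silently reusing a merged id whose membership has changed.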
diff --git a/methods_documentation/dag_deployment_notes.md b/methods_documentation/dag_deployment_notes.md
deleted file mode 100644
index a0bf4d9..0000000
--- a/methods_documentation/dag_deployment_notes.md
+++ /dev/null
@@ -1,10 +0,0 @@
-On godzilla-of-article-linkage, if it ever needs to be recreated, do:
-
-```
-sudo apt update
-sudo apt install python3-pip
-python3 -m pip install tqdm
-python3 -m pip install simhash
-```
-
-And make sure you've got a 2 TB disk mounted at `/mnt/disks/data`
diff --git a/methods_documentation/overview.md b/methods_documentation/overview.md
deleted file mode 100644
index fa82f34..0000000
--- a/methods_documentation/overview.md
+++ /dev/null
@@ -1,62 +0,0 @@
-## Normalizing Article Metadata
-
-We are merging five datasets, all of which are structured differently in our internal database. To
-match article metadata, we first need to extract the columns from this data that we want to use
-in matching into a consistent set of tables.
-
-To do this, we run the SQL queries specified in the `sequences/generate_{dataset}_data.tsv` sequence files
-within our airflow DAG. Mostly this is fairly straightforward, but it's worth noting that for OpenAlex we exclude
-documents with a `type` of Dataset, Peer Review, or Grant. Additionally, we take every combination of the WOS
-titles, abstracts, and pubyear so that a match on any of these combinations will result in a match on
-the shared WOS id. Finally, for Semantic Scholar, we exclude any documents that have a non-null publication type
-that is one of Dataset, Editorial, LettersAndComments, News, or Review.
-
-Having generated the metadata tables, we now need to normalize the metadata. To do this, we use
-the [clean_corpus](../utils/clean_corpus.py) script, which applies several text normalizations to the
-data. We "normalize" away even whitespace and punctuation.
-
-Having normalized our data, we now need to do within and cross-dataset matches, creating one master table
-containing all pairs of matched articles. To do this, we use the series of queries in
-`sequences/combine_metadata.tsv`.
-
-For two articles A and B to be considered a match, we require that they have a non-null match on at least one of:
-
-* Normalized title
-* Normalized abstract
-* Citations
-* DOI
-
-as well as a match on one additional field above, or on
-
-* Publication year
-* Normalized author last names
-
-We then add back in any articles that didn't match anything else, and combine the matches into tables that
-will be passed to LID and to the simhash and article id assignment code.
-
-For LID, we run the CLD2 library on all titles and abstracts using the beam script in `utils/run_lid.py`, taking
-the first language in the output. Note that this can result in the same article being assigned multiple
-languages, since some articles have multiple versions of metadata.
-
-#### Merged article ID assignment
-
-To merge articles, we first need to apply one more matching method, which is based on simhash. On each update
-of the data, we update a set of saved simhash indexes (one for each year of the data) that cover all articles
-we have seen on previous runs of the code. We update these indexes with new articles, and then find similar
-articles within the updated indexes.
-
-Next, we add all the simhash matches as match pairs, and run `utils/create_merge_ids.py`. This script identifies
-all groups of articles that have been either directly or transitively matched together. We then assign this set
-of articles (the "match set") a "carticle" ID. If a match set has exactly one old carticle id previously assigned
-to any of the articles, it keeps that carticle id even if new articles (with no existing carticle id) are added
-to the match set. Otherwise, the match set gets a new carticle id.
-
-Having obtained the carticle ids, we upload them back to BigQuery, and generate the final output tables,
-described in the README.
-
-#### Running LID
-
-In parallel with creating the matches, we run the CLD2 library on all titles and abstracts using the beam
-script in `utils/run_lid.py`. We take the first language in the output as the language of the whole text.
-Note that this can result in the same merged carticle being assigned multiple languages, since some articles
-have multiple versions of metadata.
diff --git a/sequences/generate_cnki_metadata.tsv b/sequences/generate_cnki_metadata.tsv
deleted file mode 100644
index a7bb234..0000000
--- a/sequences/generate_cnki_metadata.tsv
+++ /dev/null
@@ -1,5 +0,0 @@
-cnki_year_doi_authors
-cnki_title
-cnki_abstract
-cnki_metadata
-cnki_ids
diff --git a/utils/create_merge_ids.py b/utils/create_merge_ids.py
index fd3d7a5..ab014ad 100644
--- a/utils/create_merge_ids.py
+++ b/utils/create_merge_ids.py
@@ -173,7 +173,8 @@ def create_matches(
 def write_batch(match_batch_with_output_dir: tuple) -> None:
     """
     Write a batch of matches to disk
-    :param match_batch: tuple of (a tuple containing a list of jsons containing a merged id and orig id, and an identifier for the batch), and a directory where matches should be written
+    :param match_batch_with_output_dir: tuple of (a tuple containing a list of jsons containing a merged id and orig
+    id, and an identifier for the batch), and a directory where matches should be written
     :return: None
     """
     match_batch, output_dir = match_batch_with_output_dir
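For reference, a hypothetical call matching the argument structure described by the updated `write_batch` docstring. The import path, id values, and JSON key names are illustrative assumptions, not taken from the repository.

```python
# Hypothetical usage sketch; ids and key names are made up for illustration.
from utils.create_merge_ids import write_batch  # assumes running from the repo root

matches = [
    {"merged_id": "carticle_0000000001", "orig_id": "source_id_a"},  # assumed key names
    {"merged_id": "carticle_0000000001", "orig_id": "source_id_b"},
]
batch_id = 0
# Outer tuple is ((list of match jsons, batch identifier), output directory).
write_batch(((matches, batch_id), "/tmp/exported_matches"))
```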