paulalbert1 edited this page Nov 17, 2018 · 94 revisions

ReCiter is an algorithm and web service for making highly accurate assertions about author identity in publication metadata. It is designed with institutional users in mind so that they can maintain accurate and up-to-date author publication lists for thousands of people. ReCiter is optimized for disambiguating authorship in PubMed and, optionally, Scopus.

ReCiter is freely available and open source. The code is available here.

ReCiter rapidly and accurately identifies articles by specific authors, including those at previous affiliations. It does this by leveraging institutionally maintained identity data (e.g., departments, relationships, email addresses, and year of degree).

With the more complete and efficient searches that result from combining these types of data, you can save time and your institution can be more productive. If you run ReCiter daily, you can ensure that the desired users are the first to learn when a new publication has hit PubMed.

As described in the algorithm documentation, ReCiter is designed to mimic well-informed human judgment.

This page describes how to install and use ReCiter.

Additional information about ReCiter is available on the FAQ page.

The source code (Copyright 2017, Weill Cornell Medical College) is licensed under the Apache License, Version 2.0 (the “License”); you may not use ReCiter except in compliance with the License. You may obtain a copy of the License at http://www.apache.org/licenses/LICENSE-2.0. Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an “as is” basis, without warranties or conditions of any kind, either express or implied. See the License for the specific language governing permissions and limitations under the License.

Would ReCiter be useful at my institution?

The answer depends upon how publication data is managed at your institution. ReCiter is certainly useful at Weill Cornell. Consider the following use case: The Department of Pediatrics tells you that the institution is about to hire a new hot-shot faculty member from Harvard, Dr. Andrew Schwartz, and wants to have an updated list of publications right away.

  • You reach out to the department, Dr. Schwartz, Office of Faculty Affairs, etc. requesting a CV, but no one gets back to you. You'll have to do this on your own.
  • The PubMed interface retrieves over 3,500 results when searching for "A Schwartz" as an author.
  • We can assume that most of these candidate publications are not authored by the "A Schwartz" we are looking for! Even the ones with a Harvard affiliation may be written by another A. Schwartz.
  • You search for a LinkedIn, Google Scholar, Academia.edu, ORCID or other such profile - any one of these could be incomplete - and find a handful of publications.
  • You look through the candidate articles for clues. What's the institutional affiliation? What department is listed in the affiliation string? Is a known institutional email listed? Are there any grants indexed on which A. Schwartz was listed?
  • You're feeling especially attentive to detail, so you look to see whether certain journals or co-authors' names from a known publication (one where Dr. Schwartz's email is indexed) are shared with other candidate publications.
  • This isn't your first rodeo, so you complete this work in about 20 minutes. Good for you!

Properly populated with identity data, the ReCiter system can handle all this for you in one go, quickly and accurately. Furthermore, if it's set up to work automatically on a daily basis, you can do this every day, for thousands of people. At Weill Cornell Medicine, we use ReCiter for all full-time faculty (n=1700) and PhD/MD-PhD students (n=650). We are looking to expand its use for PhD alumni, among others.

How can my institution use ReCiter?

We support two methods of working with ReCiter:

  • stand-alone, for users only, and
  • with Eclipse IDE for those who want to help develop ReCiter or read its source code.

ReCiter is an open source application stack. Currently, institutions wishing to use ReCiter need to install a local copy. Your staff will need to install the application and database, then populate your local database with data about your authors.

What technologies are used in ReCiter?

  • ReCiter stores data about researchers and publications in DynamoDB.
  • Its main computation logic is written in Java.
  • It employs the Spring Framework, a Java-based application framework designed to manage RESTful web services and server requests.

What data sources are used by ReCiter for computation?

  • Institution-specific identity data including emails, names, known relationships, grant IDs, departmental affiliations, etc.
  • PubMed search engine, which primarily accesses the Medline database
  • Scopus (optional), a bibliographic database used to harvest affiliations.

What institutional data about authors does ReCiter use?

  • name variants (such as nicknames, name changes, and spelling irregularities)
  • current and former institutional affiliations
  • departmental or other organizational unit affiliations
  • e-mail addresses (personal, institutional, etc.)
  • years of degree (bachelor and any terminal degree)
  • grant identifiers
  • relationships (co-investigatorships, mentor/mentee, people in shared organizational unit, manager, etc.)
  • common institutional affiliations (used for everyone)
  • individual institutions (e.g., undergraduate, doctoral, residency, internship, clinical affiliation; used to limit results when someone has an especially common name)
  • institutions which frequently collaborate with your institution

What are some examples of reasoning used in ReCiter?

  • Dr. Y is more likely to have written this article. It lists his department in the author affiliation field.
  • Dr. X couldn't have written this article. It was published eight years before she got her Bachelor's degree.
  • Dr. Z is more likely to have written this article. It lists two authors who are also included as co-investigators on an active grant.
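
The three examples above can be read as evidence rules that raise or lower a candidate article's score. The sketch below is illustrative only: the `Identity` and `Article` fields, the weights, and the `score_candidate` function are hypothetical and are not taken from ReCiter's Java source.

```python
from dataclasses import dataclass, field


@dataclass
class Identity:
    """Hypothetical subset of the identity data an institution might hold."""
    department: str
    bachelor_year: int
    co_investigators: set = field(default_factory=set)


@dataclass
class Article:
    """Hypothetical subset of a candidate article's metadata."""
    year: int
    affiliation: str
    coauthors: set = field(default_factory=set)


def score_candidate(identity: Identity, article: Article) -> float:
    """Return an illustrative evidence score; the weights are made up."""
    score = 0.0
    # Department named in the affiliation string: positive evidence.
    if identity.department.lower() in article.affiliation.lower():
        score += 1.0
    # Published before the bachelor's degree: strong negative evidence.
    # (Assumption: any earlier year counts; a real rule may be more lenient.)
    if article.year < identity.bachelor_year:
        score -= 5.0
    # Two or more co-authors who are known co-investigators: positive evidence.
    if len(identity.co_investigators & article.coauthors) >= 2:
        score += 2.0
    return score
```

A higher score means the article is more likely to belong to the person. How such evidence is actually combined and thresholded is covered in the algorithm documentation, not here.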

Can I use ReCiter to identify the publications of authors at other institutions?

ReCiter depends on institutionally-maintained data to make highly accurate assertions of author identity in publication metadata. The more you know about a given person, the better ReCiter will perform.

How important is it to use Scopus?

Use of Scopus, which depends on a standard license, is optional but helpful. In one experiment, we found that using Scopus data improved algorithm recall by approximately 5% and precision by 0.2%, when compared against using PubMed data alone. Using Scopus data tends to be more useful in cases where affiliation data is not present in the PubMed record or the full name of authors isn't indexed. This is more common in older papers. For that reason, Scopus may offer less of an advantage for more recently published papers.

So long as it's okay with your license, it would be possible to use alternatives to Scopus such as Web of Science. Of course, you would need to figure out how to parse these records, among other things. Note that Scopus is only used as a complement to PubMed. In other words, ReCiter only uses identifiers (PMID, DOI) to find additional data (namely affiliation and full name) about a candidate article.
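
The complementary role of Scopus amounts to a lookup keyed on an identifier: Scopus contributes no new candidate articles, only extra fields for ones PubMed already returned. Here is a minimal sketch, assuming made-up record dictionaries rather than the real PubMed or Scopus response formats. The function name and the rule that existing PubMed values take precedence are also assumptions.

```python
def enrich_with_scopus(pubmed_records, scopus_by_pmid):
    """Fill in missing affiliation and full name from Scopus, matched by PMID.

    Both arguments use hypothetical dictionary shapes for illustration.
    """
    enriched = []
    for record in pubmed_records:
        merged = dict(record)  # copy so the PubMed input is left untouched
        extra = scopus_by_pmid.get(record["pmid"], {})
        for key in ("affiliation", "full_name"):
            # Only fill gaps; values already present in PubMed are kept.
            if not merged.get(key) and extra.get(key):
                merged[key] = extra[key]
        enriched.append(merged)
    return enriched
```

Scopus entries whose PMID is not among the PubMed candidates are ignored, which matches the point that Scopus supplements candidate articles rather than supplying new ones.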

What are the system requirements?

Running ReCiter on a server

ReCiter will run on Linux, Mac OS X, and Windows versions 7 and higher. A minimum of 4GB of RAM is required; 16GB of RAM is recommended. An Internet connection is required to download article data from scholarly databases.

Running ReCiter on a local machine

ReCiter's API may be run in a browser on any modern machine. The ReCiter server must be accessible to the local machine via a local area network or internet connection.

How accurate is ReCiter?

ReCiter's accuracy (an average of precision and recall as benchmarked against a human-defined gold standard) has been measured at over 95% for current full-time faculty at Weill Cornell Medicine. The exact accuracy for a given person depends on a variety of factors, especially:

  • How much identity data you can provide the ReCiter algorithm
  • How common a person's name is
  • How prolific an author has been

For example, ReCiter would typically perform far better for a long-time faculty member with a unique name and a lot of publications under their belt than for a student with a common name and only a couple of publications. Knowing the email address a faculty member used at a prior affiliation (your Office of Faculty Affairs has this, at least at Weill Cornell) is a huge help.

ReCiter will never be 100% accurate. We are in the process of building an interface to give users an opportunity to provide feedback. This feedback (accepts and rejects) will be fed back into ReCiter and used to further tune ReCiter's judgments. Data on ReCiter's accuracy at Weill Cornell is available here.
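
The accuracy measure defined earlier in this section, the average of precision and recall against a gold standard, can be illustrated with a short calculation. The function name and the sample identifiers are hypothetical. This is a sketch of the arithmetic, not ReCiter's evaluation code.

```python
def accuracy(suggested, gold):
    """Average of precision and recall for one person's suggested articles."""
    suggested, gold = set(suggested), set(gold)
    true_positives = len(suggested & gold)
    # Guard against empty sets to avoid division by zero.
    precision = true_positives / len(suggested) if suggested else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return (precision + recall) / 2
```

For instance, if four articles are suggested and three of them appear in a gold standard of five, precision is 3/4, recall is 3/5, and the accuracy is 0.675.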

Who has access to the data?

  • Data from PubMed about published articles is already publicly available.
  • Each institution can set its own access rules for personnel information that is used by ReCiter to perform its searches.
  • ReCiter will only run for authors whose data you have populated into ReCiter.

What about ORCID?

ORCID is a persistent digital identifier designed to distinguish one researcher from every other researcher. Users create an account at orcid.org and manually claim their publications. A handful of publishers now require that submitters include their ORCID ID, and there are efforts by institutions, especially libraries, to increase adoption. Some issues we have noticed with ORCID:

  • Fewer than 1% of articles that Weill Cornell cares about have an ORCID identifier indexed in PubMed for even one author.
  • There are a number of duplicate ORCID profiles.
  • The False Negative Problem: A new candidate publication appeared three months ago. Is it not in the person's ORCID profile because he didn't get around to adding it, or because he simply didn't author it? Our administrators like to be "in the know." Ideally, we would tell them of a new authorship the day it was indexed in PubMed, so this poses a problem.
  • Our users are notoriously overwhelmed. Absent any significant carrot or stick, getting them to maintain a profile in yet another system is an exercise in frustration. It seems trivial, but even getting them to assign the library as a delegate would likewise require a lot of persistence.
  • For reporting purposes, we attempt to track publications authored by people we can no longer contact, and whom we would have trouble encouraging to clean up their ORCID profiles. This includes publications by alumni and inactive faculty. These are individuals who would never give us proxy access to their ORCID profiles.

For these reasons, ORCID is not quite mature enough for Weill Cornell Medicine to count on as a reliable source of truth for author identity. If and when that changes and it becomes a valuable source of data, ReCiter could be modified to also include ORCID's assertions of authorship.

Who do I contact for help with ReCiter?

Please email Paul Albert at paa2013@med.cornell.edu or Michael Bales at meb7002@med.cornell.edu.

How can I contribute to the ReCiter initiative?

We welcome assistance with ReCiter development. Please email Paul Albert at paa2013@med.cornell.edu or Michael Bales at meb7002@med.cornell.edu if you would like to contribute.

How do I cite ReCiter?

The original ReCiter algorithm may be cited as follows: Johnson SB, Bales ME, Dine D, Bakken S, Albert PJ, Weng C. Automatic generation of investigator bibliographies for institutional research networking systems. Journal of Biomedical Informatics 2014;51:8–14. Available from URL: http://dx.doi.org/10.1016/j.jbi.2014.03.013
