wcmc-its / ReCiter

ReCiter: an enterprise open source author disambiguation system for academic institutions
Apache License 2.0

ReCiter

https://github.com/wcmc-its/ReCiter/blob/master/files/howreciterworks.png

Purpose

ReCiter is a highly accurate system for guessing which publications in PubMed a given person has authored. ReCiter includes a Java application, a DynamoDB-hosted database, and a set of RESTful microservices which collectively allow institutions to maintain accurate and up-to-date author publication lists for thousands of people. This software is optimized for disambiguating authorship in PubMed and, optionally, Scopus.

ReCiter accurately identifies articles by a given person, including those written at previous affiliations. It does this by leveraging institutionally maintained identity data such as departments, relationships, email addresses, and year of degree. Because combining these types of data yields more complete and efficient searches, you can save time and your institution can be more productive. If you run ReCiter daily, you can ensure that the desired users are the first to learn when a new publication has appeared in PubMed.

ReCiter is fast. It uses an advanced multi-threading strategy known as a work-stealing pool to make up to 10 retrieval requests at a time.
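To make the work-stealing idea concrete, here is a minimal sketch using the JDK's `Executors.newWorkStealingPool`. It is not ReCiter's actual retrieval code: the class name, the `retrieve` stand-in, and the sample queries are hypothetical, and only the parallelism of 10 comes from the description above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

class WorkStealingSketch {

    // Mirrors the "up to 10 retrieval requests at a time" figure quoted above.
    static final int PARALLELISM = 10;

    // Hypothetical stand-in for one retrieval request; a real call would go to a retrieval service.
    static String retrieve(String query) {
        return "results for " + query;
    }

    static List<String> retrieveAll(List<String> queries) throws Exception {
        // A work-stealing pool lets idle worker threads take queued tasks from busy ones.
        ExecutorService pool = Executors.newWorkStealingPool(PARALLELISM);
        try {
            List<Callable<String>> tasks = new ArrayList<>();
            for (String query : queries) {
                tasks.add(() -> retrieve(query));
            }
            List<String> results = new ArrayList<>();
            // invokeAll blocks until every task completes and returns futures in submission order.
            for (Future<String> future : pool.invokeAll(tasks)) {
                results.add(future.get());
            }
            return results;
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println(retrieveAll(Arrays.asList("query one", "query two", "query three")));
    }
}
```

Because the results are collected in submission order, callers can match each result to its query even though the requests run concurrently.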

ReCiter is freely available and open source under the Apache 2.0 license.

Please see the ReCiter wiki for more information.

https://github.com/wcmc-its/ReCiter/blob/master/files/ReCiter-FeatureGenerator.gif

Technical

Prerequisites

If you want to use Java 8, set <java.version>1.8</java.version> in pom.xml.

It is not necessary to install ReCiter in order to use the API.

Technological stack

Key technologies include:

You may choose to run ReCiter on either:

Architecture

https://github.com/wcmc-its/ReCiter/blob/master/files/ArchitecturalDiagram-NEW.png

Related code repositories

The ReCiter application depends on the following separate GitHub-hosted repositories:

Optionally, users can install:

Installation

ReCiter can be installed to run locally or in AWS via a CloudFormation template. The PubMed Retrieval Tool is a required dependency. The Scopus Retrieval Tool is optional, but can improve overall accuracy by several percent.

Local

  1. Clone the repository to a local folder using git clone https://github.com/wcmc-its/ReCiter.git
  2. Go to the folder where the repository was cloned, open src/main/resources/application.properties, and change the port and log location as needed
    • change aws.DynamoDb.local=false to aws.DynamoDb.local=true
    • update location of DynamoDB database, e.g., aws.DynamoDb.local.dbpath=/Users/Paul/Documents/ReCiter/dynamodb_local_latest
    • Application security is turned on by default. To turn it off, change spring.security.enabled=true to spring.security.enabled=false
    • If security is enabled, you must set the following environment variables:
      export ADMIN_API_KEY=<api-key>
      export CONSUMER_API_KEY=<api-key>
    • If you do not have a Scopus subscription, change use.scopus.articles=true to use.scopus.articles=false.
  3. Set the ports for the server and services on the command line. You must have the PubMed service, and optionally the Scopus service, set up before this step. Enter the appropriate hostnames and port numbers.
    export SERVER_PORT=5000
    export SCOPUS_SERVICE=http://localhost:5001
    export PUBMED_SERVICE=http://localhost:5002
  4. Run mvn spring-boot:run. You can pass additional options, such as the maximum and minimum Java memory, with export MAVEN_OPTS=-Xmx1024m
  5. Go to http://localhost:<port-number>/swagger-ui/index.html or http://localhost:<port-number>/swagger-ui/ (shorthand Swagger URL) to test and run any API.
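The environment variables named in the steps above can be checked before starting the application. The following is only an illustrative sketch: the class and method names are hypothetical and not part of ReCiter, and the two boolean arguments are assumed to mirror the spring.security.enabled and use.scopus.articles settings.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

class LocalSetupCheckSketch {

    // Returns the names of environment variables from the steps above that are unset or blank.
    static List<String> missingVariables(Map<String, String> env,
                                         boolean securityEnabled,
                                         boolean useScopus) {
        List<String> required = new ArrayList<>();
        required.add("SERVER_PORT");
        required.add("PUBMED_SERVICE");
        if (useScopus) {
            // Only needed when use.scopus.articles=true.
            required.add("SCOPUS_SERVICE");
        }
        if (securityEnabled) {
            // Only needed when spring.security.enabled=true.
            required.add("ADMIN_API_KEY");
            required.add("CONSUMER_API_KEY");
        }
        List<String> missing = new ArrayList<>();
        for (String name : required) {
            String value = env.get(name);
            if (value == null || value.trim().isEmpty()) {
                missing.add(name);
            }
        }
        return missing;
    }

    // Builds the Swagger URL from step 5 for a given port.
    static String swaggerUrl(String port) {
        return "http://localhost:" + port + "/swagger-ui/index.html";
    }

    public static void main(String[] args) {
        Map<String, String> env = System.getenv();
        List<String> missing = missingVariables(env, true, true);
        if (missing.isEmpty()) {
            System.out.println("Ready. Swagger UI: " + swaggerUrl(env.get("SERVER_PORT")));
        } else {
            System.out.println("Missing environment variables: " + missing);
        }
    }
}
```

Running the check with both flags set to true is the strictest case. Pass false for either flag if you have disabled security or Scopus in application.properties.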

Amazon AWS

The ReCiter CDK installs the entire infrastructure for ReCiter and its components, and it is highly configurable. Its repository contains instructions for installing ReCiter and its components.

Configuration

Functionality

How ReCiter works

The wiki article, How ReCiter works, contains a more detailed description of how the application works.

Using the APIs

The wiki article, Using the APIs, contains a full description of how to use the ReCiter APIs.

| Category | Function | Relevant API(s) |
| --- | --- | --- |
| Manage identity of target users | Add or update identity data for target user(s) from Identity table | /reciter/identity/ or /reciter/save/identities/ |
| Manage identity of target users | Retrieve identity data for target user(s) from Identity table | /reciter/find/identity/by/uid/ or /reciter/find/identity/by/uids/ or /reciter/find/all/identity |
| Gold standard | Update the GoldStandard table (includes both accepted and rejected PMIDs) for single user | /reciter/goldstandard/ |
| Gold standard | Update the GoldStandard table (includes both accepted and rejected PMIDs) for multiple users | /reciter/goldstandard/ |
| Gold standard | Read from the GoldStandard table (includes both accepted and rejected PMIDs) for target user(s) | /reciter/goldstandard/{uid} |
| Look up candidate articles | Trigger look up of candidate articles for a given user | /reciter/retrieve/articles/by/uid |
| Retrieve suggested articles | Read suggested articles from the Analysis table for target user | /reciter/article-retrieval/by/uid |
| Retrieve suggested articles | Read suggested articles and see supporting evidence from the Analysis table for target user(s) | /reciter/feature-generator/by/uid or /reciter/feature-generator/by/group |
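As a rough illustration of calling one of the endpoints listed above, the sketch below builds (but does not send) a request to /reciter/feature-generator/by/uid with the JDK's java.net.http client, which requires Java 11 or later. Only the path comes from the table. The query parameter name `uid`, the header name `api-key`, the base URL, and the sample values are assumptions, so check the Swagger UI for the actual contract.

```java
import java.net.URI;
import java.net.URLEncoder;
import java.net.http.HttpRequest;
import java.nio.charset.StandardCharsets;

class FeatureGeneratorRequestSketch {

    // Path taken from the API table above.
    static final String FEATURE_GENERATOR_PATH = "/reciter/feature-generator/by/uid";

    // ASSUMPTION: the uid is passed as a query parameter named "uid" and the API key
    // travels in a header named "api-key". Neither is confirmed by this document.
    static HttpRequest buildRequest(String baseUrl, String uid, String apiKey) {
        String query = "uid=" + URLEncoder.encode(uid, StandardCharsets.UTF_8);
        return HttpRequest.newBuilder(URI.create(baseUrl + FEATURE_GENERATOR_PATH + "?" + query))
                .header("api-key", apiKey)
                .GET()
                .build();
    }

    public static void main(String[] args) {
        // Hypothetical base URL, uid, and key, for illustration only.
        HttpRequest request = buildRequest("http://localhost:5000", "abc1234", "my-api-key");
        System.out.println(request.method() + " " + request.uri());
        // To send it against a running instance:
        // HttpClient.newHttpClient().send(request, HttpResponse.BodyHandlers.ofString());
    }
}
```

Keeping request construction separate from sending makes the URL and headers easy to inspect before anything touches a live service.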

See also

Published articles

https://github.com/wcmc-its/ReCiter/blob/master/files/plosone.png

Future work

Both the issue queue and the Roadmap include some areas where we want to improve ReCiter.

Funding acknowledgment

Various components in the ReCiter suite of applications have been funded by:

Follow up

Please submit any questions to Paul Albert. You may expect a response within one to two business days.

We use GitHub issues to track bugs and feature requests. If you find a bug, please feel free to open an issue.

Contributions welcome!