wtsi-npg / npg_seq_pipeline

Processing and analysis of data coming from Illumina sequencing machines

NPG Pipelines for Processing Illumina Sequencing Data

This software provides the Sanger NPG team's automation for analysing and internally archiving Illumina sequencing data on behalf of DNA Pipelines for their customers.

There are two main pipelines, the analysis pipeline ("central") and the archival pipeline ("post_qc_review"), and the daemons which automatically start these pipelines.

Processing is performed as appropriate for the entire run, for each lane in the sequencing flowcell, or for each tagged library (within a pool on the flowcell).

Batch Processing and Dependency Tracking with LSF or wr

With this system, all jobs for a pipeline's steps are submitted to the LSF or wr batch/job processing system when the pipeline is initialised. As a result, a submitted pipeline has no orchestration script or daemon running: managing the runtime dependencies of jobs within an instance of a pipeline is delegated to the batch/job processing system.

How is this done? The job representing the start point of a graph is submitted to LSF or wr in a suspended state and is resumed once all other jobs have been submitted, thus ensuring that execution starts only if all steps are successfully submitted. If an error occurs at any point during job submission, all submitted jobs, apart from the start job, are killed.
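The submission pattern described above can be sketched as follows. This is a minimal illustration only: FakeScheduler is a hypothetical in-memory stand-in, not a client for LSF or wr, and the function and state names are assumptions made for the example rather than anything taken from this package.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class FakeScheduler:
    """Hypothetical in-memory stand-in for a batch system (not LSF or wr)."""
    jobs: dict = field(default_factory=dict)  # job id -> state
    fail_on: Optional[str] = None             # step whose submission fails
    next_id: int = 0

    def submit(self, step, depends_on=(), suspended=False):
        if step == self.fail_on:
            raise RuntimeError(f"submission failed for step {step!r}")
        self.next_id += 1
        self.jobs[self.next_id] = "SUSPENDED" if suspended else "PENDING"
        return self.next_id

    def resume(self, job_id):
        self.jobs[job_id] = "PENDING"

    def kill(self, job_id):
        self.jobs[job_id] = "KILLED"


def submit_pipeline(scheduler, start, steps):
    """Submit every step up front; release the start job only on full success.

    steps is a list of (name, [names it depends on]) pairs, ordered so that
    each step appears after the steps it depends on.
    """
    # The start job is held, so nothing can run while submission is in progress.
    ids = {start: scheduler.submit(start, suspended=True)}
    try:
        for name, deps in steps:
            ids[name] = scheduler.submit(name, depends_on=[ids[d] for d in deps])
    except RuntimeError:
        # Kill everything submitted so far, apart from the start job.
        for name, job_id in ids.items():
            if name != start:
                scheduler.kill(job_id)
        raise
    # All steps are queued; from here on the batch system drives execution.
    scheduler.resume(ids[start])
    return ids
```

Because the start job stays suspended until the last submission succeeds, a partially submitted graph can never begin executing.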

Pipeline Creation

The steps of each pipeline and the dependencies between the steps are defined in JSON input files located in the data/config_files directory. The files follow JSON Graph Format syntax. Individual pipeline steps are defined as graph nodes, dependencies between them as directed graph edges. If step B should be executed after step A finishes, step B is considered to be dependent on step A.

The graph represented by the input file should be a directed acyclic graph (DAG). Each graph node should have an id, which should be unique, and a label, which is the name of the pipeline step.
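To make these constraints concrete, here is a small sketch that loads a JGF document, checks that node ids are unique and that the graph is acyclic, and returns the step labels in an order that respects the dependencies. The example graph, the step names and the function name are invented for illustration. The sketch also assumes that nodes and edges are JSON arrays and that an edge points from a step to the step that depends on it; check the files in data/config_files before relying on either assumption.

```python
import json
from collections import defaultdict, deque

# Invented example; the step names are not taken from the real config files.
EXAMPLE_JGF = """
{
  "graph": {
    "directed": true,
    "nodes": [
      {"id": "1", "label": "step_a"},
      {"id": "2", "label": "step_b"},
      {"id": "3", "label": "step_c"}
    ],
    "edges": [
      {"source": "1", "target": "2"},
      {"source": "1", "target": "3"},
      {"source": "2", "target": "3"}
    ]
  }
}
"""


def execution_order(jgf_text):
    """Return step labels in dependency order, or raise ValueError."""
    graph = json.loads(jgf_text)["graph"]

    labels = {}
    for node in graph["nodes"]:
        if node["id"] in labels:
            raise ValueError(f"duplicate node id {node['id']!r}")
        labels[node["id"]] = node["label"]

    # Assumption: source is the prerequisite, target is the dependent step.
    successors = defaultdict(list)
    indegree = {node_id: 0 for node_id in labels}
    for edge in graph["edges"]:
        if edge["source"] not in labels or edge["target"] not in labels:
            raise ValueError(f"edge refers to an unknown node: {edge}")
        successors[edge["source"]].append(edge["target"])
        indegree[edge["target"]] += 1

    # Kahn's algorithm: repeatedly take nodes with no unmet dependencies.
    ready = deque(n for n, degree in indegree.items() if degree == 0)
    order = []
    while ready:
        node_id = ready.popleft()
        order.append(labels[node_id])
        for nxt in successors[node_id]:
            indegree[nxt] -= 1
            if indegree[nxt] == 0:
                ready.append(nxt)

    if len(order) != len(labels):
        raise ValueError("graph contains a cycle, so it is not a DAG")
    return order
```

Calling execution_order(EXAMPLE_JGF) returns ['step_a', 'step_b', 'step_c'].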

Parallelisation of processing may be performed at different levels within the DAG: some steps are appropriate for per-run, some for per-lane and some for per-tagged-library parallelisation.
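As a rough illustration of what running a step at different levels means, the sketch below expands one step into units of work for a whole run, for each lane, or for each tagged library within a lane. The level names, the function name and the data layout are assumptions made for the example; they are not the pipeline's own representation.

```python
def expand_step(level, lanes):
    """List the units of work for one step at the given level.

    lanes maps a lane number to the tag indexes of the libraries pooled in it.
    Each unit is a tuple: () for the run, (lane,) or (lane, tag_index).
    """
    if level == "run":
        return [()]
    if level == "lane":
        return [(lane,) for lane in sorted(lanes)]
    if level == "library":
        return [(lane, tag) for lane in sorted(lanes) for tag in sorted(lanes[lane])]
    raise ValueError(f"unknown level {level!r}")


# A hypothetical two-lane run: lane 1 holds three tagged libraries, lane 2 holds two.
example_lanes = {1: [1, 2, 3], 2: [1, 2]}
jobs_per_level = {
    level: len(expand_step(level, example_lanes))
    for level in ("run", "lane", "library")
}
```

For this example the step would be scheduled as one job at run level, two jobs at lane level and five jobs at library level.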

Visualizing Input Graphs

JSON Graph Format (JGF) is relatively new, with little support for visualization. Convert JGF to GML (Graph Modeling Language) format using a simple script supplied with this package, scripts/jgf2gml. Many graph visualization tools, for example Cytoscape, support the GML format.
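For readers unfamiliar with the two formats, the following sketch shows the general shape of such a conversion. It is not the scripts/jgf2gml script and makes no claim about how that script works. It assumes that nodes and edges are JSON arrays and that labels contain no double quotes.

```python
import json


def jgf_to_gml(jgf_text):
    """Convert a JGF document (as text) to a GML string."""
    graph = json.loads(jgf_text)["graph"]

    # GML node ids are integers, so map each JGF id to its position.
    index = {node["id"]: i for i, node in enumerate(graph["nodes"])}

    lines = ["graph [", "  directed 1"]
    for node in graph["nodes"]:
        lines += [
            "  node [",
            f"    id {index[node['id']]}",
            f'    label "{node["label"]}"' if False else '    label "' + node["label"] + '"',
            "  ]",
        ]
    for edge in graph["edges"]:
        lines += [
            "  edge [",
            f"    source {index[edge['source']]}",
            f"    target {index[edge['target']]}",
            "  ]",
        ]
    lines.append("]")
    return "\n".join(lines)
```

The resulting text can be saved with a .gml extension and opened in a tool that reads GML.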

Per Sequencing-Run Pipelines

The processing is performed per sequencing run. Many different studies and sequencing assays for different "customers" may be performed on a single run. Unlike contemporary (2020s) sharable bioinformatics pipelines, the informatics logic is tied closely to the business logic. For example, which aligner is required with which reference, and whether human read separation is required, is determined per indexed library within a lane of sequencing, and the resulting work is scheduled in parallel.

The information required for the logic is obtained from the upstream "LIMS" via an MLWH (Multi-LIMS warehouse) database and from the run folder output by the sequencing instrument.

Analysis Pipeline

Processes data coming from Illumina sequencing instruments. It is labeled the "central" pipeline.

The input for an instance of the pipeline is the instrument output run folder (BCL and associated files) and LIMS information which drives appropriate processing.

The key data products are aligned or unaligned CRAM files and indexes. However, per-study (a LIMS datum) pipeline configuration allows for the creation of GATK gVCF files, or the running of an external tool/pipeline, e.g. ncov2019-artic-nf.

[Figure: "central" pipeline]

Within this DAG there are two steps which are key in producing the main data products: p4_stage1_analysis and seq_alignment.

Archival Pipeline

Archives sequencing data (CRAM files) and other related artifacts, e.g. index files and QC metrics. It is labeled the "post_qc_review" pipeline.

[Figure: "post_qc_review" pipeline]

Pipeline Script Outputs

Log file - in the run folder (as in the current pipeline). Example: /nfs/sf55/IL_seq_data/outgoing/path_to_runfolder/bin_npg_pipeline_central_25438_20180321-080455-2214166102.log

File with JSON serialization of definition objects - in the analysis directory. Example: /path_to_runfolder/bin_npg_pipeline_central_25438_20180321-080455-2214166102.log.json

File with saved commands hashed by function name, LSF job id and array index - in the analysis directory. Example: /path_to_runfolder/Data/Intensities/BAM_basecalls_20180321-075511/bin_npg_pipeline_central_25438_20180321-080455-2214166102.log.commands4jobs.json

Dependencies

This software relies heavily on the npg_tracking software to abstract information from the MLWH and the instrument runfolder, and to coordinate the state of the run.

This software integrates heavily with the npg_qc system, which calculates and records QC metrics for internal display so that operational teams can assess the sequencing and upstream processes.

For the data processing intensive steps, p4_stage1_analysis and seq_alignment, the p4 software is used to run many informatics tools in streaming data flow DAGs, which minimises disk IO.

Also, the npg_irods system is essential for the internal archival of data products.

Data Merging across Lanes of a Flowcell

If the same library is sequenced in different lanes of a flowcell, under certain conditions the pipeline will automatically merge all data for a library into a single end product. Data for spiked-in PhiX libraries and data not assigned to any tag (tag zero) are not merged. The following scenarios trigger the merge:

If the data quality in a lane is poor, the lane should be excluded from the merge. The --process_separately_lanes pipeline option is used to list such lanes. Usually this option is used when running the analysis pipeline. The pipeline caches the supplied lane numbers so that the archival pipeline can generate a list of data products that is consistent with the analysis pipeline's list. The same applies to the npg_run_is_deletable script. The cached value is retrieved only if the --process_separately_lanes argument was not set when any of these scripts is invoked.
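The precedence between the command line argument and the cached value can be sketched as below. The cache file, its JSON format, the function name, and the choice to rewrite the cache whenever the argument is supplied are all assumptions made for illustration; the text above does not say where or how the pipeline stores the value.

```python
import json
from pathlib import Path


def lanes_to_process_separately(cli_lanes, cache_file):
    """Resolve the lanes to exclude from the merge.

    cli_lanes is the list given via --process_separately_lanes, or None when
    the option was not set. cache_file is a hypothetical cache location.
    """
    cache_file = Path(cache_file)
    if cli_lanes is not None:
        # An explicit argument always wins; remember it for later scripts.
        lanes = sorted(set(cli_lanes))
        cache_file.write_text(json.dumps(lanes))
        return lanes
    if cache_file.exists():
        # Option not set: fall back to what an earlier invocation cached.
        return json.loads(cache_file.read_text())
    # Nothing supplied and nothing cached: no lanes are excluded.
    return []
```

In this sketch, an analysis run invoked with lanes 2 and 5 writes them to the cache, and a later archival run invoked without the option reads the same two lanes back, so both produce the same list of data products.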