nf-core/circdna
Pipeline for the identification of extrachromosomal circular DNA (ecDNA) from Circle-seq, WGS, and ATAC-seq data that were generated from cancer and other eukaryotic cells.
Introduction
This document describes the output produced by the pipeline. Most of the plots are taken from the MultiQC report, which summarises results at the end of the pipeline.
The directories listed below will be created in the results directory after the pipeline has finished. All paths are relative to the top-level results directory.
Pipeline overview
The pipeline is built using Nextflow and processes data using the following steps:
- FastQC - Raw read QC
- MultiQC - Aggregate report describing results and QC from the whole pipeline
- Pipeline information - Report metrics generated during the workflow execution
- TrimGalore - Read Trimming
- BWA - Read mapping to reference genome
- Samtools - Sorting, indexing, filtering & stats generation of BAM file
- Circle-Map Realign - Identifies putative circular DNA junctions
- Circle-Map Repeats - Identifies putative repetitive circular DNA
- CIRCexplorer2 - Identifies putative circular DNA junctions
- Circle_finder - Identifies putative circular DNA junctions
- AmpliconArchitect - Reconstruct the structure of focally amplified regions
- Unicycler - De novo assembly of circular DNAs
General Tools
FastQC
Output files
- fastqc/
  - *_fastqc.html: FastQC report containing quality metrics.
  - *_fastqc.zip: Zip archive containing the FastQC report, tab-delimited data file and plot images.
FastQC gives general quality metrics about your sequenced reads. It provides information about the quality score distribution across your reads, per base sequence content (%A/T/G/C), adapter contamination and overrepresented sequences. For further reading and documentation see the FastQC help pages.
NB: The FastQC plots displayed in the MultiQC report show untrimmed reads. They may contain adapter sequence and regions of potentially low quality.
TrimGalore
Output files
- trimgalore/
  - *_trimming_report.txt: TrimGalore trimming report.
  - fastqc/*_fastqc.zip: Zip archive containing the FastQC report, tab-delimited data file and plot images.
  - fastqc/*_fastqc.html: FastQC report containing quality metrics.
TrimGalore combines the trimming tool Cutadapt, which removes adapter sequences, primers and other unwanted sequences, with the quality control tool FastQC.
BWA
BWA is a software package for mapping low-divergent sequences against a large reference genome.
Such files are intermediate and not kept in the final files delivered to users.
Output files
Output directory: results/Reports/[SAMPLE]/SamToolsStats
- [SAMPLE].bam: Alignment file containing information about the read alignment to the reference genome
Samtools
samtools stats
samtools stats collects statistics from BAM files and outputs in a text format.
Plots will show:
- Alignment metrics.
Output directory: results/Reports/[SAMPLE]/SamToolsStats
- [SAMPLE].bam.samtools.stats.out: Raw statistics used by MultiQC
For further reading and documentation see the samtools manual.
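The raw statistics file can also be inspected without MultiQC. The following is a minimal sketch, assuming the usual samtools stats text layout in which summary rows start with `SN`, followed by a tab-separated key and value; the function name `parse_summary_numbers` and the example values are illustrative and not part of the pipeline.

```python
def parse_summary_numbers(stats_text):
    """Collect the 'SN' (summary numbers) rows of samtools stats output."""
    summary = {}
    for line in stats_text.splitlines():
        if not line.startswith("SN\t"):
            continue  # skip comments and the other sections
        fields = line.split("\t")
        if len(fields) < 3:
            continue
        key = fields[1].rstrip(":")
        value = fields[2]
        try:
            # Integers stay integers; everything else numeric becomes a float.
            summary[key] = int(value)
        except ValueError:
            try:
                summary[key] = float(value)
            except ValueError:
                summary[key] = value
    return summary


# Illustrative input only, not real pipeline output.
example = (
    "# This file was produced by samtools stats\n"
    "SN\traw total sequences:\t1000\n"
    "SN\treads mapped:\t950\n"
    "SN\terror rate:\t2.5e-03\t# mismatches / bases mapped (cigar)\n"
)
summary = parse_summary_numbers(example)
mapped_fraction = summary["reads mapped"] / summary["raw total sequences"]
print(f"mapped fraction: {mapped_fraction:.2f}")
```

In practice the text would be read from [SAMPLE].bam.samtools.stats.out rather than from an inline string.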
Mark Duplicates
GATK MarkDuplicates
By default, circdna will use GATK MarkDuplicates, which locates and tags duplicate reads in a BAM or SAM file, where duplicate reads are defined as originating from a single fragment of DNA.
Output directory: results/markduplicates/bam
- [SAMPLE].md.bam and [SAMPLE].md.bai: BAM file and index
For further reading and documentation see the data pre-processing for variant discovery from the GATK best practices.
Samtools view - Duplicates Filtering
By default, circdna removes all duplicates marked by GATK MarkDuplicates using samtools view.
Output directory: results/markduplicates/duplicates_removed
- [SAMPLE].md.filtered.sorted.bam and [SAMPLE].md.filtered.sorted.bai: BAM file and index
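To illustrate what duplicate filtering means at the record level: duplicate-marking tools set bit 0x400 (1024) in the SAM FLAG field, and a filter such as `samtools view -F 1024` drops every record carrying that bit. The sketch below reproduces this logic on plain SAM text. It is an assumption that the pipeline filters on exactly this flag; the helper names and reads are made up for the example.

```python
DUPLICATE_FLAG = 0x400  # 1024: "PCR or optical duplicate" bit of the SAM FLAG field


def is_duplicate(sam_line):
    """Return True if the alignment record has the duplicate bit set."""
    flag = int(sam_line.split("\t")[1])
    return bool(flag & DUPLICATE_FLAG)


def drop_duplicates(sam_lines):
    """Keep header lines and every alignment without the duplicate bit."""
    return [
        line for line in sam_lines
        if line.startswith("@") or not is_duplicate(line)
    ]


# Illustrative records: read2 has flag 1123 = 99 + 1024, so it is a duplicate.
sam_lines = [
    "@HD\tVN:1.6\tSO:coordinate",
    "read1\t99\tchr1\t100\t60\t50M\t=\t300\t250\t*\t*",
    "read2\t1123\tchr1\t100\t60\t50M\t=\t300\t250\t*\t*",
]
kept = drop_duplicates(sam_lines)
print(len(kept))
```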
MultiQC
Output files
- multiqc/
  - multiqc_report.html: a standalone HTML file that can be viewed in your web browser.
  - multiqc_data/: directory containing parsed statistics from the different tools used in the pipeline.
  - multiqc_plots/: directory containing static images from the report in various formats.
MultiQC is a visualization tool that generates a single HTML report summarising all samples in your project. Most of the pipeline QC results are visualised in the report and further statistics are available in the report data directory.
Results generated by MultiQC collate pipeline QC from supported tools e.g. FastQC. The pipeline has special steps which also allow the software versions to be reported in the MultiQC output for future traceability. For more information about how to use MultiQC reports, see http://multiqc.info.
circdna branches
Branch: circle_finder
Circle_finder
Output files
Output directory: results/circlefinder/
- [SAMPLE].microDNA-JT.txt: BED file containing information about putative circular DNA regions
Circle_finder identifies putative circular DNA junctions from paired-end sequencing data. Circle_finder uses split and discordant read information to identify junctions that could be generated through the formation of ecDNAs. For more information please see Circle_finder.
Branch: circexplorer2
CIRCexplorer2
CIRCexplorer2 identifies putative circular DNA junctions from paired-end sequencing data. CIRCexplorer2 was developed to identify circular RNAs from RNA-seq data. However, it can also be used to call putative circular DNAs from genomic data. For more information see the CIRCexplorer2 docs.
Output files
Output directory: results/circexplorer2/
- [SAMPLE].circexplorer_circdna.bed: BED file containing information about putative circular DNA regions
- [SAMPLE].CIRCexplorer2_parse.log: log file
Branch: circle_map_realign
circle_map_realign uses Circle-Map to call putative circular DNAs from mappable regions. To identify circular DNAs, it uses information about split and discordant reads and applies realignment steps to determine the exact breakpoint of each circular DNA. For more information, please see the original paper or the GitHub page.
Circle-Map Readextractor
Circle-Map Readextractor extracts read candidates for circular DNA identification.
Output files
Output directory: results/circlemap/readextractor
- [SAMPLE].qname.sorted.circular_read_candidates.bam: BAM file containing candidate reads
Circle-Map Realign
Circle-Map Realign detects putative circular DNA junctions from the read candidates extracted by Circle-Map Readextractor.
Output files
Output directory: results/circlemap/realign
- [SAMPLE]_circularDNA_coordinates.bed: BED file containing information about putative circular DNA regions
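As a starting point for downstream analysis of the BED files produced by the circular DNA callers, the following sketch reads the first three BED columns (chromosome, start, end) and derives the length of each putative circle. It assumes the standard tab-separated BED layout with 0-based, half-open coordinates. Any further columns are tool-specific and are ignored here; the function name and example rows are illustrative.

```python
import csv
import io


def read_bed_intervals(handle):
    """Return (chrom, start, end, length) for every data row of a BED file."""
    intervals = []
    for row in csv.reader(handle, delimiter="\t"):
        # Skip blank lines and the optional BED header/comment lines.
        if not row or row[0].startswith(("#", "track", "browser")):
            continue
        chrom, start, end = row[0], int(row[1]), int(row[2])
        intervals.append((chrom, start, end, end - start))
    return intervals


# Illustrative rows only; the trailing columns stand in for tool-specific values.
example = io.StringIO("chr1\t1000\t2500\t4\t6\nchr2\t500\t900\t2\t3\n")
intervals = read_bed_intervals(example)
longest = max(intervals, key=lambda interval: interval[3])
print(longest)
```

For a real run, the handle would come from opening a file such as [SAMPLE]_circularDNA_coordinates.bed.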
Branch: circle_map_repeats
Circle-Map Readextractor
Circle-Map Readextractor extracts read candidates for circular DNA identification.
Output files
Output directory: results/circlemap/readextractor
- [SAMPLE].qname.sorted.circular_read_candidates.bam: BAM file containing candidate reads
Circle-Map Repeats
Circle-Map Repeats identifies the chromosomal coordinates of repetitive circular DNAs.
Output files
Output directory: results/circlemap/repeats
- [SAMPLE]_circularDNA_repeats_coordinates.bed: BED file containing information about repetitive circular DNAs
Branch: unicycler
This branch combines Unicycler's ability to assemble circular DNAs de novo with the long-read mapping capabilities of Minimap2 to identify the origin of the circular DNAs.
Unicycler
Unicycler was originally built as an assembly pipeline for bacterial genomes. In nf-core/circdna it is used to assemble circular DNAs de novo.
Output files
Output directory: results/unicycler/
- [SAMPLE].assembly.gfa.gz: gfa file containing the de novo assembled sequences
- [SAMPLE].assembly.scaffolds.fa.gz: fasta file containing the de novo assembled sequences, with information on whether each assembled sequence forms a circular contig.
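The circularity information in the scaffolds file can be extracted with a few lines of code. The sketch below assumes that circular contigs are marked with a `circular=true` tag in the FASTA header, as in Unicycler's own FASTA output, and that the pipeline keeps this tag in the gzipped scaffolds file; both points should be checked against a real output file. The function name and sequences are illustrative.

```python
import gzip
import io


def circular_contigs(fasta_gz_bytes):
    """Return the names of contigs whose header carries 'circular=true'."""
    names = []
    with gzip.open(io.BytesIO(fasta_gz_bytes), "rt") as handle:
        for line in handle:
            # Assumption: circularity is encoded as a header tag.
            if line.startswith(">") and "circular=true" in line:
                names.append(line[1:].split()[0])
    return names


# Illustrative gzipped FASTA with one circular and one linear contig.
example = gzip.compress(
    b">1 length=5386 depth=1.00x circular=true\nACGT\n"
    b">2 length=800 depth=0.50x\nTTGA\n"
)
names = circular_contigs(example)
print(names)
```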
Minimap2
Minimap2 maps the circular DNA sequences identified by Unicycler to the given reference genome.
Output files
Output directory: results/unicycler/minimap2
- [SAMPLE].paf: paf file containing mapping information of circular DNA sequences
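To locate where an assembled circular DNA maps in the reference, the twelve mandatory PAF columns are usually sufficient. The sketch below parses one PAF line into a named record and computes the fraction of matching bases over the alignment block. The record type, function name and example line are illustrative; optional tags after column twelve are ignored.

```python
from typing import NamedTuple


class PafRecord(NamedTuple):
    query: str
    query_length: int
    query_start: int
    query_end: int
    strand: str
    target: str
    target_length: int
    target_start: int
    target_end: int
    matches: int
    block_length: int
    mapq: int


def parse_paf_line(line):
    """Parse the twelve mandatory columns of a PAF line."""
    f = line.rstrip("\n").split("\t")
    return PafRecord(
        f[0], int(f[1]), int(f[2]), int(f[3]), f[4],
        f[5], int(f[6]), int(f[7]), int(f[8]),
        int(f[9]), int(f[10]), int(f[11]),
    )


# Illustrative line only, not real pipeline output.
line = "contig1\t5000\t0\t5000\t+\tchr8\t145138636\t127735000\t127740000\t4950\t5000\t60\ttp:A:P"
record = parse_paf_line(line)
identity = record.matches / record.block_length
print(record.target, record.target_start, record.target_end, round(identity, 3))
```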
Branch: ampliconarchitect
The ampliconarchitect branch is only usable with WGS data. It uses PrepareAA to collect amplified seeds from copy number calls, which are then fed to AmpliconArchitect to characterise the amplicons in each sample.
CNVkit
CNVkit uses alignment information to make copy number calls. These copy number calls will be used by AmpliconArchitect to identify circular and other types of amplicons. The copy number calls are then connected to seeds and filtered based on the copy number threshold using scripts provided by PrepareAA.
Output files
Output directory: results/ampliconarchitect/cnvkit
- [SAMPLE]_CNV_GAIN.bed: bed file containing filtered copy number calls
- [SAMPLE]_AA_CNV_SEEDS.bed: bed file containing filtered and connected amplified regions (seeds). This is used as input for AmpliconArchitect
- [SAMPLE].cnvkit.segment.cns: cns file containing copy number calls of CNVkit segment.
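The idea of filtering copy number calls by a threshold can be sketched in a few lines. This is not the PrepareAA implementation, which also connects neighbouring regions into seeds. The four-column layout (chromosome, start, end, copy number) and the threshold of 4.5 are hypothetical choices made purely for illustration.

```python
def filter_gains(bed_text, min_copy_number):
    """Keep intervals whose copy number reaches the given threshold.

    Assumes a hypothetical layout: chrom, start, end, copy_number.
    """
    kept = []
    for line in bed_text.splitlines():
        if not line.strip():
            continue
        chrom, start, end, copy_number = line.split("\t")[:4]
        if float(copy_number) >= min_copy_number:
            kept.append((chrom, int(start), int(end), float(copy_number)))
    return kept


# Illustrative calls only; the threshold is an arbitrary example value.
calls = "chr8\t127000000\t128000000\t12.3\nchr1\t1000\t50000\t2.1\n"
gains = filter_gains(calls, min_copy_number=4.5)
print(gains)
```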
AmpliconArchitect
AmpliconArchitect uses amplicon seeds provided by CNVkit and PrepareAA to identify different types of amplicons in each sample.
Output files
Output directory: results/ampliconarchitect/ampliconarchitect
- amplicons/[SAMPLE]_[AMPLICONID]_cycles.txt: txt file describing the amplicon segments
- amplicons/[SAMPLE]_[AMPLICONID]_graph.txt: txt file describing the amplicon graph
- cnseg/[SAMPLE]_[SEGMENT]_graph.txt: txt file describing the copy number segmentation file
- summary/[SAMPLE]_summary.txt: txt file describing each amplicon with regards to breakpoints, composition, oncogene content, copy number
- sv_view/[SAMPLE]_[AMPLICONID].{png,pdf}: png or pdf file displaying the amplicon rearrangement signature
AmpliconClassifier
AmpliconClassifier classifies each amplicon by using the cycles and the graph files generated by AmpliconArchitect.
Output files
Output directory: results/ampliconarchitect/ampliconclassifier
- input/[SAMPLE].AmpliconClassifier.input: txt file containing the input used for AmpliconClassifier and AmpliconSimilarity.
- classification/[SAMPLE]_amplicon_classification_profiles.tsv: tsv file describing the amplicon class of each amplicon in the sample.
- ecDNA_counts/[SAMPLE]_ecDNA_counts.tsv: tsv file describing whether an amplicon is circular [1 = circular, 0 = non-circular].
- gene_list/[SAMPLE]_gene_list.tsv: tsv file detailing the genes on each amplicon.
- log/[SAMPLE].classifier_stdout.log: log file
- similarity/[SAMPLE]_similarity_scores.tsv: tsv file containing amplicon similarity scores calculated by AmpliconSimilarity.
- bed/[SAMPLE]_amplicon[AMPLICONID]_[CLASSIFICATION]_[ID]_intervals.bed: bed files containing information about the intervals on each amplicon. Intervals labelled unknown were not identified as being located on the respective amplicon.
AmpliconArchitect Summary
The Summary script merges the output of AmpliconArchitect and AmpliconClassifier to give full information about each amplicon in a sample. Please refer to AmpliconClassifier for more accurate ecDNA interval calling. Some intervals classified in the AmpliconArchitect and Summary output are not located on ecDNAs.
Output files
Output directory: results/ampliconarchitect/summary
- [SAMPLE].aa_results_summary.tsv: tsv file containing the merged results.
Pipeline information
Output files
- pipeline_info/
  - Reports generated by Nextflow: execution_report.html, execution_timeline.html, execution_trace.txt and pipeline_dag.dot/pipeline_dag.svg.
  - Reports generated by the pipeline: pipeline_report.html, pipeline_report.txt and software_versions.yml. The pipeline_report* files will only be present if the --email / --email_on_fail parameters are used when running the pipeline.
  - Reformatted samplesheet files used as input to the pipeline: samplesheet.valid.csv.
Nextflow provides excellent functionality for generating various reports relevant to the running and execution of the pipeline. This will allow you to troubleshoot errors with the running of the pipeline, and also provide you with other information such as launch commands, run times and resource usage.