
== BMC-BCC Pipeline ==

The pipeline processes flowcell directories as the Illumina sequencer software generates them and postprocesses the output for use in downstream biological analyses. It is intended for core facilities that own or operate Illumina sequencers and want automated, consistent processing of Illumina data. The pipeline is a collection of command-line utilities written primarily in Python and tied together with the ruffus pipelining package.

'''Release Notes 1.8''' (11/11/2020)
* Added support for NovaSeq flowcells
* Updated fastq format for NextSeq and NovaSeq flowcells
* Added generation of sample sheets for demultiplexing NextSeq and NovaSeq flowcells
* Added support for cellranger-atac and spaceranger in the 10x pipeline
* Added various flags for re-run options
* Refactored code
* Multiple projects can be loaded on the same lane

'''Release Notes 1.7''' (09/01/2017)
* Added support for the 10x pipeline. Cellranger was upgraded from 2.0.2 to 3.0.1 on 1/11/2019

* Added precheck options
** Warn when sample indices differ by only 1 nt and a 2nd index exists
** Warn when inline indices are used but the start nt and length are not given
** Check that the existing project path is writable
** Check that the genome exists
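
The index precheck can be illustrated with a Hamming-distance comparison. This is a minimal sketch, not the pipeline's own code: the function names and the one-index-string-per-sample input are assumptions, and how the pipeline combines first and second indices is not stated here.

```python
from itertools import combinations


def hamming(a, b):
    """Count mismatching positions between two equal-length index sequences."""
    if len(a) != len(b):
        raise ValueError("indices must have the same length")
    return sum(x != y for x, y in zip(a, b))


def near_identical_indices(samples, max_dist=1):
    """Return (sample_a, sample_b, distance) for every pair of samples whose
    indices differ by at most max_dist nucleotides.

    samples is a hypothetical mapping of sample name to index sequence.
    """
    flagged = []
    for (name_a, idx_a), (name_b, idx_b) in combinations(sorted(samples.items()), 2):
        dist = hamming(idx_a, idx_b)
        if dist <= max_dist:
            flagged.append((name_a, name_b, dist))
    return flagged
```

Each flagged pair would correspond to one warning, since a single sequencing error in the index read could assign a read to the wrong sample.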
  
* Reliability and performance improvements
** Corrected the unmapped read percentage for paired-end reads
** Corrected the lane barcode percentage (the percentage is now calculated against the pass filter count instead of the total count)
** Adjusted the number of threads based on the number of samples
** Updated PPPQC code to handle flowcells whose reverse read length is longer than the forward read length
** Updated sam_stat code to use "samtools flagstat" instead of an in-house script to compute SAM file statistics
** Changed the delivery URL from rowley.mit.edu to bmc-data.mit.edu
** Added project name and URL in pipeline run notification email
** Added project name in delivery email
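
The barcode percentage correction above changes only the denominator. A small sketch of the arithmetic follows; the function name and the read counts in the usage note are made up for illustration.

```python
def barcode_percentage(barcode_reads, pass_filter_reads):
    """Percentage of a lane's pass-filter reads that carry a given barcode."""
    if pass_filter_reads <= 0:
        raise ValueError("pass filter count must be positive")
    return 100.0 * barcode_reads / pass_filter_reads
```

For a hypothetical lane with 200 total reads, 160 pass-filter reads, and 40 reads carrying one barcode, the corrected value is 25% (40 of 160) rather than 20% (40 of 200).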

* Third-party software package updates
** bcl2fastq from 2.15.0 to 2.19.1
** bwa from 0.7.12 to 0.7.16a
** fastqc from 0.11.4 to 0.11.5
** samtools from 1.3 to 1.5
** bowtie2 from 2.2.6 to 2.3.2

'''Release Notes 1.6''' (03/18/2017)
* Added support for the SLURM scheduler
* Added support for CentOS
* Added delivery of BCL files through the web when requested
* Performance improvements (NextSeq fastq generation and PPPQC)

'''Release Notes 1.5.2''' (08/12/2016)
* Added support for GA/NextSeq 2nd index
* Improved the portability of the pipeline through code reorganization

'''Release Notes 1.5.1''' (06/12/2016)
* Added support for HT3DGE (high-throughput digital gene expression) projects
* Added publishing of the infosite to the Filemaker database

'''Release Notes 1.5''' (01/26/2016)
* '''Added PhiX percent perfect plot to calculate sequencing error rate for HiSeq and MiSeq'''<p>The percent perfect plot created by the PPPQC script is designed to calculate the next-generation sequencing error rate. The calculation can be applied to paired-end or single-end sequencing on NextSeq, MiSeq, or HiSeq, depending on the specific sequencing run. The script compares the sequenced spike-in PhiX reads with the PhiX reference genome sequence. To avoid potential alignment issues with poor-quality reads, the script first aligns the first 30 base pairs of each read to identify PhiX reads and to separate forward reads from reverse reads. The full-length PhiX forward and reverse reads are then retrieved and compared to the reference sequence. The percentages of reads with zero mismatches, <=1 mismatch, <=2 mismatches, <=3 mismatches, and <=4 mismatches are calculated and plotted at each nucleotide position. For NextSeq sequencing, the reads from each camera are processed separately. For HiSeq sequencing, the reads from each lane are processed separately. For paired-end reads, the reads from each mate are processed separately. Because Illumina sequencing has a very low indel error rate, reads with indels relative to the reference sequence are excluded from the current calculation.</p>
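
The per-position percentages described above can be sketched as follows. This is an illustrative sketch rather than the PPPQC script itself. It assumes each PhiX read has already been paired with the reference segment it covers (no indels), and that "<=k mismatches at a position" means the cumulative mismatch count from the start of the read through that position. The function name and the input format are hypothetical.

```python
def percent_perfect(pairs, max_mismatches=4):
    """Compute, per position, the percentage of reads with <= k mismatches so far.

    pairs: list of (read, reference) strings of equal length per pair.
    Returns {k: [percentage at position 0, position 1, ...]} for k = 0..max_mismatches.
    """
    if not pairs:
        return {k: [] for k in range(max_mismatches + 1)}
    length = max(len(read) for read, _ in pairs)
    counts = {k: [0] * length for k in range(max_mismatches + 1)}
    totals = [0] * length  # number of reads that extend to each position
    for read, ref in pairs:
        if len(read) != len(ref):
            raise ValueError("read and reference segment must have equal length")
        mismatches = 0
        for i, (base, ref_base) in enumerate(zip(read, ref)):
            if base != ref_base:
                mismatches += 1
            totals[i] += 1
            # A read with m mismatches so far counts toward every threshold k >= m.
            for k in range(mismatches, max_mismatches + 1):
                counts[k][i] += 1
    return {
        k: [100.0 * c / t for c, t in zip(counts[k], totals)]
        for k in counts
    }
```

Plotting each of the returned lists against position gives one curve per mismatch threshold; the k = 0 curve is the "percent perfect" line.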

* '''Added CNV quality control plot for ChIP, ReSeq and CGHSeq sample types'''<p>The CNV quality control plot created by the CNVQC script uses downsampled BAM files to plot DNA copy number along the reference genome. Both mappability and GC% are considered during normalization. Potential gains are marked in red and losses in green. Currently it supports the hg19 and mm9 genomes.</p>

* '''Upgraded software tools including fastqc, bwa, samtools, and bedtools'''
** fastqc upgraded from 0.11.2 to 0.11.4
** bwa upgraded from 0.7.10 to 0.7.12
** samtools upgraded from 0.1.19 to 1.3
** bedtools upgraded from 2.20.1 to 2.25.0

* '''Improved performance and robustness'''
** Added a precheck flag -c to check the filemaker database to avoid human error
** Allowed the recursive pulling of samples in a subpool when creating sample json file
** Improved the robustness of sample json file when handling mixed barcodes
** Added a second person to receive delivery email if specified
** Enabled creating tarball of the flowcell directory after pipeline run ends
** Simplified the process to create a new release
** Reworked the code for publishing project data to avoid intermittent file system errors
** Added flowcell as part of SGE job name to easily identify pipeline runs in the cluster
** Used 32 threads as default instead of 16 after new nodes were added to the rous cluster

'''Release Notes 1.4''' (01/01/2015)
* The quality scores of fastq files are now in Sanger format (previously the quality scores were in the Illumina 1.3+ format)
* Added support for NextSeq
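
For readers handling fastq files produced before release 1.4, the difference between the two quality encodings is a fixed ASCII offset: Illumina 1.3+ encodes a Phred score as the character with code score + 64, while Sanger uses score + 33. The conversion can be sketched as below. The function name is hypothetical and this is not code from the pipeline.

```python
ILLUMINA_13_OFFSET = 64  # Illumina 1.3+ encoding: chr(phred + 64)
SANGER_OFFSET = 33       # Sanger encoding: chr(phred + 33)


def illumina13_to_sanger(quality):
    """Convert a quality string from Illumina 1.3+ encoding to Sanger encoding."""
    converted = []
    for ch in quality:
        score = ord(ch) - ILLUMINA_13_OFFSET
        if score < 0:
            raise ValueError("character %r is not valid Illumina 1.3+ quality" % ch)
        converted.append(chr(score + SANGER_OFFSET))
    return "".join(converted)
```

For example, the Illumina 1.3+ character `h` (Phred 40) becomes `I` in Sanger format.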

'''Release Notes 1.3''' (07/25/2014)
* Paired-end quality control is added for samples aligned to genomes other than phiX. It summarizes basic mapping metrics from the BWA alignments to identify properly mapped reads and provides a distribution of insert lengths based on these mappings.
* RNAseq quality control is added for RNASeq data for a list of genomes other than phiX. It checks the distribution of the reads, 5' to 3' bias, strand specificity and ribosomal RNA contamination. It also checks gene expression correlation between samples when applicable.
* Software upgrade: BWA is upgraded to 0.7.10 and fastqc is upgraded to 0.11.2
* Improved the algorithms for demultiplexing and handling index mismatches
* Performance enhancement. The pipeline uses 16 threads by default instead of 8, which significantly reduces the runtime for a HiSeq run.

'''Release Notes 1.2''' (01/01/2014)
* An information site about the pipeline run is delivered to MIT users
* Sample data directory includes the flowcell code
* Bug fix for pipeline re-run. When the pipeline was re-run, data could be duplicated in the fastq files. This is now fixed.
* Performance enhancement. Data is written directly to the published directory for users, and copying is avoided whenever possible. This not only reduces disk usage, but also allows users to get their data faster.

'''Release Notes 1.0.2''' (08/19/2013)
* Switched from Bowtie to BWA for the default alignments that generate SAM files.<p>BWA version 0.7.5a is used by default for alignment. For Illumina sequence reads up to 70bp, the alignment is done by aln/samse/sampe (the BWA-backtrack algorithm). For longer reads (> 70bp), the mem subcommand (the BWA-MEM algorithm) is used.</p>
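
The read-length rule above can be sketched as a small selection function. The subcommand names (aln, samse, sampe, mem) are BWA's own; the function name, the constant, and the paired flag are assumptions made for illustration and are not taken from the pipeline code.

```python
BACKTRACK_MAX_READ_LENGTH = 70  # reads up to this length use BWA-backtrack


def bwa_subcommands(read_length, paired):
    """Return the BWA subcommands to run, in order, for a given read length."""
    if read_length <= 0:
        raise ValueError("read length must be positive")
    if read_length <= BACKTRACK_MAX_READ_LENGTH:
        # BWA-backtrack: "aln" finds suffix-array coordinates, then
        # "samse" (single-end) or "sampe" (paired-end) produces the SAM output.
        return ["aln", "sampe" if paired else "samse"]
    # BWA-MEM handles single-end and paired-end reads with one subcommand.
    return ["mem"]
```

A 50bp single-end run would therefore go through aln followed by samse, while a 100bp paired-end run would use mem alone.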

* Bug fix for large SAM/BAM files<p>When large fastq files were processed to generate a SAM file, the end of the SAM file could be corrupted under certain circumstances if the file was larger than 40GB. As a result, the SAM-to-BAM conversion could produce a core dump. This is now fixed.</p>

'''Release Notes 0.9''' (10/18/2011)<p>
Implemented all core functionality:
* setting up and converting qseq files
* qseq to fastq
* fastqc and tag count statistics on flowcell-level sequences
* splitting of barcoded samples into individual directories
* individual fastqc
* genome alignment using bowtie plus statistics 
* contamination qc checking
* tag counts
* conversion of alignments from SAM to BAM
* production of bigWig files from SAM alignments
* publishing user data to web directories
</p>