Beginner’s Tutorial¶
Prerequisites¶
The following are required before you can use CloudConductor; please make sure your system is properly set up.
- Linux OS
- Python v.2.7.*
- CloudConductor
- Google Cloud SDK
If you have any questions about installing the required tools, please refer to our Installation section, which will help you set up your system for CloudConductor.
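As a quick sanity check, you can confirm from your shell that Python and the Google Cloud SDK are on your PATH; python should report a 2.7.x release:
$ python --version
$ gcloud version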
Running CloudConductor¶
CloudConductor requires four types of configuration files: a workflow config, a resource kit config, a platform config, and a sample sheet. Each is described below.
Prepare Workflow Config¶
The workflow configuration describes your data processing steps, where the output of one tool becomes the input of the next. The following workflow example takes raw FASTQ files from an RNAseq experiment, performs QC, and aligns the reads to the human reference genome to produce aligned reads as a BAM file. You can refer to Workflow fundamentals for more details.
# Split the input so that downstream tools run once per sample
[split_samples]
module = SampleSplitter

# Run FastQC on both read files for quality control
[fastqc]
module = FastQC
docker_image = fastqc
input_from = split_samples
final_output = R1_fastqc, R2_fastqc

# Trim adapters and low-quality bases; drop reads shorter than MINLEN
[trimmomatic]
module = Trimmomatic
docker_image = trimmomatic
input_from = split_samples
final_output = trim_report
    [[args]]
        MINLEN = 25

# Align the trimmed reads to the reference genome with STAR
[star_bam]
module = Star
docker_image = star
input_from = trimmomatic
final_output = bam, transcriptome_mapped_bam, raw_read_counts, final_log
    [[args]]
        ref = star_genome_dir

# Index the resulting BAM file with samtools
[star_bam_index]
module = Samtools
docker_image = samtools
submodule = Index
input_from = star_bam
final_output = bam_idx
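Note how input_from chains the modules into a graph: star_bam consumes trimmomatic's output, and star_bam_index indexes the BAM produced by star_bam. Likewise, ref = star_genome_dir refers, by name, to a resource defined in the resource kit in the next step.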
Prepare Resource Kit Config¶
The resource kit configuration defines the resources needed to run your workflow. The resources can be paths to reference files, tool executables, docker images, etc. Following is a resource kit example containing all the resources required to produce aligned reads from raw FASTQ files for an RNAseq experiment. You can refer to Resource Kit fundamentals for more details.
# Docker images and the executables they provide
[Docker]
    [[fastqc]]
        image = quay.io/biocontainers/fastqc:0.11.7--pl5.22.0_2
        [[[fastqc]]]
            resource_type = fastqc
            path = fastqc
    [[trimmomatic]]
        image = quay.io/biocontainers/trimmomatic:0.36--5
        [[[trimmomatic]]]
            resource_type = trimmomatic
            path = trimmomatic
    [[star]]
        image = quay.io/biocontainers/star:2.6.0b--0
        [[[star]]]
            resource_type = star
            path = STAR
    [[samtools]]
        image = quay.io/biocontainers/samtools:1.8--3
        [[[samtools]]]
            resource_type = samtools
            path = samtools

# Paths to reference data and other static resources
[Path]
    [[adapters]]
        resource_type = adapters
        path = gs://davelab_data/tools/Trimmomatic_0.36/adapters/adapters.fa
    [[star_genome_dir]]
        resource_type = ref
        path = gs://davelab_data/ref/hg19/RNA/star
    [[ref]]
        resource_type = ref
        path = gs://davelab_data/ref/hg19/RNA/ensembl.hg19.release84.fa
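CloudConductor fetches these resources at runtime, so it can save a failed run to first confirm that the GCS paths under [Path] are readable with your credentials. One quick check uses gsutil from the Google Cloud SDK (substitute your own bucket paths; the ones above belong to the example):
$ gsutil ls gs://davelab_data/ref/hg19/RNA/star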
Prepare Platform Config¶
The platform configuration defines the runtime platform on which CloudConductor runs your workflow. The Platform Config sets several properties of the runtime platform, such as the zone, the service account key, the maximum number of retries for command execution, etc. Following is an example of a Platform Config to run the workflow on Google Cloud Platform. You can refer to Platform fundamentals for more details.
zone = us-east1-c                           # GCP zone in which instances are created
randomize_zone = False
service_account_key_file = var/GAP_new.json
report_topic = pipeline_reports

[task_processor]
    disk_image = davelab-image-docker
    max_reset = 3
    is_preemptible = True                   # use cheaper preemptible instances
    cmd_retries = 1                         # retries for a failed command
    apt_packages = pigz                     # extra apt packages installed on each instance
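If you want to confirm that the service account key referenced above is usable before launching a run, one way (assuming the gcloud CLI that ships with the Google Cloud SDK) is to activate it manually:
$ gcloud auth activate-service-account --key-file=var/GAP_new.json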
Prepare Sample Sheet¶
The sample sheet provides sample information to CloudConductor, such as the sample type (i.e. tumor or normal), the sequencing platform on which the samples were sequenced, the paths to the raw sample data, etc. Following is a sample sheet example. You can refer to Sample Sheet fundamentals for more details.
{
    "paired_end": true,
    "seq_platform": "Illumina",
    "samples": [
        {
            "name": "s1",
            "paths": {
                "R1": "gs://your_desired_loc/s1_1_I13_0124.fastq.gz",
                "R2": "gs://your_desired_loc/s1_2_I13_0124.fastq.gz"
            },
            "is_tumor": false,
            "lib_name": "LIB_NAME"
        },
        {
            "name": "s2",
            "paths": {
                "R1": "gs://your_desired_loc/s2_1_I13_0124.fastq.gz",
                "R2": "gs://your_desired_loc/s2_2_I13_0124.fastq.gz"
            },
            "is_tumor": false,
            "lib_name": "LIB_NAME"
        }
    ]
}
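Because the sample sheet must be valid JSON, a stray comma or quote is a common source of errors. You can lint the file with Python's built-in json.tool module before launching a run; it pretty-prints valid JSON and reports the position of any syntax error:
$ python -m json.tool sample_sheet.json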
Once you have prepared all four files, you can run CloudConductor as follows:
$ ./CloudConductor --name cc_run_1 \
--input sample_sheet.json \
--pipeline_config workflow.config \
--res_kit_config res_kit.config \
--plat_config gcp_platform.config \
--plat_name Google \
--output_dir gs://your_desired_loc/cc_run_1/ \
-vvv
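Here, --name tags the run, the four files prepared above are passed via --input, --pipeline_config, --res_kit_config, and --plat_config, final outputs are written under the bucket path given to --output_dir, and -vvv raises the logging verbosity.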