Develop your own module

There are three types of modules that can be developed in CloudConductor:

  • Tool - represents a tool that can have one or multiple functions, represented as submodules
  • Splitter - represents a tool that splits one input data entity into multiple chunks of data of the same type
  • Merger - represents a tool that merges chunks of data of the same type into one output data entity

Tool

To develop a new Tool, create a new Python module in CloudConductor’s Modules/Tools directory, named after the tool you want to develop. Then, for each task that the new tool performs, create a class that extends Modules/Module.

Let’s name our new tool NewTool and its subcommand/task Subcommand. In this case, the Python module Modules/Tools/NewTool.py should look as follows:

from Modules import Module

class Subcommand(Module):

    def __init__(self, module_id, is_docker=False):
        """
        Initialize the new Subcommand class.
        
        Args:
            module_id (string) - the unique ID generated by CloudConductor for this object
            is_docker (boolean) - the current module should return a docker specific command
        """
        super(Subcommand, self).__init__(module_id, is_docker)

        # Define list of output_keys the command will generate data for
        self.output_keys = ["output_key1", "output_key2", "output_key3"]

    def define_input(self):
        """
        Define the input of the subcommand
        """
        pass

    def define_output(self):
        """
        Define the output of the subcommand
        """
        pass

    def define_command(self):
        """
        Generate the actual command
        """
        pass

In the constructor of the new Subcommand class, you should call the constructor of the base class Module and specify the output keys that the subcommand generates.

In the define_input() method you should use the inherited method self.add_argument() to define any input key. An input key has three properties that can be set with the self.add_argument() method:

  • is_required - sets if the input_key is mandatory (False by default)
  • is_resource - sets if the input_key represents a resource to be searched in resource kit (False by default)
  • default_value - a default value for the input_key, in case it never gets set (None by default)

For example:

    def define_input(self):
        self.add_argument("R1", is_required=True)
        self.add_argument("R2")
        self.add_argument("bwa", is_required=True, is_resource=True)
        self.add_argument("samtools", is_required=True, is_resource=True)
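The three properties and their defaults can be pictured with a small stand-in. This is only a sketch of the bookkeeping described above, not CloudConductor's implementation: the class ArgumentTable, its missing_required() helper, and the "threads" key are hypothetical names introduced for illustration.

```python
class ArgumentTable(object):
    """Hypothetical stand-in that mimics the add_argument() defaults listed above."""

    def __init__(self):
        self.arguments = {}

    def add_argument(self, key, is_required=False, is_resource=False, default_value=None):
        # Record the three properties of an input key
        self.arguments[key] = {
            "is_required": is_required,
            "is_resource": is_resource,
            "value": default_value,
        }

    def missing_required(self):
        # Required keys that still have no value (an assumed check, for illustration only)
        return sorted(key for key, arg in self.arguments.items()
                      if arg["is_required"] and arg["value"] is None)


table = ArgumentTable()
table.add_argument("R1", is_required=True)       # mandatory, no value yet
table.add_argument("R2")                         # optional, value stays None
table.add_argument("threads", default_value=4)   # optional, falls back to 4
print(table.missing_required())
```

In this sketch only "R1" is reported as missing, because it is the only required key without a value.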

In the define_output() method you should use the inherited method self.add_output() to define any output key. You can use the self.get_argument() method to obtain the value of any argument. An output key has two properties that can be set with the self.add_output() method:

  • value - represents the actual value of the output key. If the value is a file, you can use the inherited method self.generate_unique_file_name() to obtain a unique file name for the generated output file
  • is_path - sets if the value is a path (i.e. file or directory).

For example:

    def define_output(self):
        bam_output = self.generate_unique_file_name(extension=".bam")
        self.add_output("bam", bam_output)

In the define_command() method you can expect that both the input and output keys are already associated with the correct values. To obtain the value of an input key, use the self.get_argument() method. To obtain the value of an output key, use the self.get_output() method. The define_command() method should return the actual command.

For example:

    def define_command(self):
        R1_fastq = self.get_argument("R1")
        R2_fastq = self.get_argument("R2")
        bwa = self.get_argument("bwa")
        samtools = self.get_argument("samtools")

        bam_output = self.get_output("bam")

        return "%s -M %s %s !LOG2! | %s view > %s !LOG2!" % (bwa, R1_fastq, R2_fastq, samtools, bam_output)

Note: When generating the command, you can use the following placeholders and CloudConductor will create a log file for you:

  • “!LOG0!” - pipes the stdout and stderr to /dev/null
  • “!LOG1!” - pipes only the stdout to a log file that will be available after the module finished running
  • “!LOG2!” - pipes only the stderr to a log file that will be available after the module finished running
  • “!LOG3!” - pipes both the stdout and the stderr to a log file that will be available after the module finished running

Example command with placeholders: “tool1 !LOG2! | tool2 !LOG2! | tool3 !LOG3!”
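To make the placeholder semantics concrete, here is a minimal sketch of how they could be expanded into shell redirections. It is not CloudConductor's actual code: the function expand_log_placeholders(), the log file name, and the choice to append (>>) rather than overwrite are all assumptions made for illustration.

```python
def expand_log_placeholders(command, log_file):
    """Hypothetical helper: replace !LOGn! placeholders with shell redirections."""
    redirections = {
        "!LOG0!": ">/dev/null 2>&1",         # discard stdout and stderr
        "!LOG1!": ">>%s" % log_file,         # stdout only
        "!LOG2!": "2>>%s" % log_file,        # stderr only
        "!LOG3!": ">>%s 2>&1" % log_file,    # stdout and stderr
    }
    for placeholder, redirection in redirections.items():
        command = command.replace(placeholder, redirection)
    return command


expanded = expand_log_placeholders("tool1 !LOG2! | tool2 !LOG2! | tool3 !LOG3!", "module.log")
print(expanded)
# tool1 2>>module.log | tool2 2>>module.log | tool3 >>module.log 2>&1
```

Using “!LOG2!” on every stage of a pipeline except the last keeps stdout flowing through the pipe while still capturing error messages.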

Splitter

There are only two differences between the way splitters and tools are created.

The first difference is that to create a splitter you need to extend the Modules/Splitter abstract class instead of Modules/Module.

The second difference is that the output of a tool is a list of output keys associated with values, while the output of a splitter is a list of splits, each split having its own list of output keys associated with values. Consequently, every output key has an additional property, split_id, which is the ID of the split it is associated with. To define a new split ID, call the self.make_split() method and then associate any output key with the newly created split ID.

For example:

    def define_output(self):
        nr_splits = self.get_argument("nr_splits")

        for split_ID in range(nr_splits):
            self.make_split(split_ID)
            self.add_output(split_id=split_ID, key="square", value=split_ID**2, is_path=False)
            self.add_output(split_id=split_ID, key="cube", value=split_ID**3, is_path=False)
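The resulting structure, a set of splits that each hold their own output keys, can be pictured with the stand-in below. It is a sketch under stated assumptions, not the real Splitter class: SplitterSketch, its nested dictionary layout, the error handling, and the default of is_path are hypothetical, and only the make_split() and add_output() call shapes are taken from the example above.

```python
class SplitterSketch(object):
    """Hypothetical stand-in for the output bookkeeping of a splitter."""

    def __init__(self):
        # split_id -> {output_key: {"value": ..., "is_path": ...}}
        self.output = {}

    def make_split(self, split_id):
        # Register a new, empty split; refuse duplicates (assumed behaviour)
        if split_id in self.output:
            raise ValueError("Split '%s' already exists" % split_id)
        self.output[split_id] = {}

    def add_output(self, split_id, key, value, is_path=True):
        # An output key can only be attached to a split that was created first
        if split_id not in self.output:
            raise KeyError("Unknown split '%s'; call make_split() first" % split_id)
        self.output[split_id][key] = {"value": value, "is_path": is_path}


splitter = SplitterSketch()
for split_id in range(3):
    splitter.make_split(split_id)
    splitter.add_output(split_id=split_id, key="square", value=split_id ** 2, is_path=False)
    splitter.add_output(split_id=split_id, key="cube", value=split_id ** 3, is_path=False)

print(sorted(splitter.output.keys()))        # [0, 1, 2]
print(splitter.output[2]["cube"]["value"])   # 8
```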

Merger

There is only one difference between the way mergers and tools are created: you need to extend the Modules/Merger abstract class instead of Modules/Module. Other than that, the logic is the same.
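As an illustration, a merger skeleton could look like the sketch below. To keep it runnable on its own, it extends MergerStub, a hypothetical stand-in for the real base class. The MergeBams class, the fixed output file name, and the assumption that a merger receives the chunk values of an input key as a list are all illustrative and not confirmed by this guide.

```python
class MergerStub(object):
    """Hypothetical stand-in for the Merger base class, so the sketch runs on its own."""

    def __init__(self, module_id, is_docker=False):
        self.module_id = module_id
        self.is_docker = is_docker
        self.arguments = {}
        self.outputs = {}

    def add_argument(self, key, is_required=False, is_resource=False, default_value=None):
        self.arguments[key] = default_value

    def get_argument(self, key):
        return self.arguments[key]

    def add_output(self, key, value, is_path=True):
        self.outputs[key] = value

    def get_output(self, key):
        return self.outputs[key]


class MergeBams(MergerStub):

    def __init__(self, module_id, is_docker=False):
        super(MergeBams, self).__init__(module_id, is_docker)
        self.output_keys = ["bam"]

    def define_input(self):
        self.add_argument("bam", is_required=True)
        self.add_argument("samtools", is_required=True, is_resource=True)

    def define_output(self):
        # A fixed name keeps the stub short; a real module would ask for a unique file name
        self.add_output("bam", "merged.bam")

    def define_command(self):
        bam_chunks = self.get_argument("bam")    # assumed: a list with one path per chunk
        samtools = self.get_argument("samtools")
        bam_output = self.get_output("bam")
        return "%s merge %s %s !LOG3!" % (samtools, bam_output, " ".join(bam_chunks))


merger = MergeBams("merge_bams_1")
merger.define_input()
# Simulate the framework filling in the input values before define_command() runs
merger.arguments["bam"] = ["chunk_0.bam", "chunk_1.bam"]
merger.arguments["samtools"] = "samtools"
merger.define_output()
print(merger.define_command())
# samtools merge merged.bam chunk_0.bam chunk_1.bam !LOG3!
```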