                Automated Design of Southern Blot Probes
                ----------------------------------------
                
                Contact: webmaster@genes2cognition.org


Contents
--------

    A. Copyright & licensing conditions
    B. Introduction
    C. Implementation
    D. Performance
    E. Available documentation
    F. Package directory structure
    G. Environment variables
    H. Program configuration files
    I. Perl modules required
    J. Other applications required



A. Copyright & licensing conditions
-----------------------------------

Scripts, software and documentation copyright 2005-2009
Genes to Cognition Programme (G2C) and Genome Research Limited (GRL).

You may distribute this file/module under the terms of the artistic
licence: http://www.perlfoundation.org/artistic_license_2_0


B. Introduction
---------------

Southern blotting is an experimental procedure where DNA, from a genomic or
other source, is digested with a restriction enzyme and then separated by
size using gel electrophoresis. The fragments are transferred from the gel
onto a membrane ('blotted') which is then incubated with a labelled
single-stranded DNA probe. Such a procedure allows one to locate a particular
sequence of DNA within a complex mixture of DNA. From a gene targeting
perspective, Southern blotting can be used to detect whether a targeting
event has successfully taken place.

Designing a 'good' Southern blot probe for a particular gene or locus
involves finding a stretch of DNA sequence at that locus, generally
500-1000bp long, that has the desirable qualities of being unique to that
locus, with little or no repetitive DNA content. Molecular biologists tend
to design their probes manually, by excising portions of genome sequence from
online genome browsers (such as Ensembl) and then pasting them into a
genome-search site enabling them to check the genome for sequence hits.
Ideally a probe sequence should return a single hit to the region it was
designed against, with little or no cross-reactivity to other parts of the
genome. 

If this is not the case, the investigator will likely shift the piece of DNA
chosen a short distance away from what they might consider as the optimal
site, and search again. Another option might be to shorten the putative probe
sequence and repeat the search, particularly if it was obvious that one end of
the initial sequence lacked the desired specificity and so gave rise to the
extraneous hits.

Each round of cutting, pasting and genome searching needed to find an
acceptable sequence for a Southern blot probe takes quite a few minutes. This
manual approach does not make effective use of a molecular biologist's time,
and is very unlikely to find an optimal probe.

The design strategy outlined is clearly very amenable to automation using
bioinformatics with the added benefit that the number of putative probes that
can be examined during the design process need not be limited to a few (as
when carried out manually) but can be increased to hundreds, or thousands,
allowing a very fine-grained analysis to be performed, significantly
increasing the chances of finding the best, or at least a near-optimal probe 
for the chosen locus.


C. Implementation
-----------------

Because the genome searches needed to design a single probe can take hours,
a single programme that completes the whole task outlined above is unlikely
to be a satisfactory solution.

Instead a more elaborate system is required utilising a database to store and
retrieve the design information for each probe, and subsequently the results
of the many genome searches carried out for putative probes. These results
can then be analysed to find the best probes.

Such a system would also allow more than one computer to be used to carry out
the searches, speeding design, and would also permit the user to modify the
selection parameters for the probe, without requiring one to re-run the
genome searches for a particular probe, should the initial constraints be
found to be too stringent at a particular locus.
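
The benefit of keeping search results in the database can be illustrated
with a minimal sketch (written in Python for brevity; the package itself is
Perl). It re-selects probes from already-stored hit counts under two
different stringencies without repeating any genome search. The record
layout, the select_probes function and the max_off_target_hits parameter
are hypothetical and are not part of the GeneTargeting API; the values are
made up.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StoredResult:
    """Hypothetical summary of one putative probe's stored search results."""
    probe_id: int
    off_target_hits: int  # hits outside the locus the probe was designed for


def select_probes(results, max_off_target_hits):
    """Return IDs of probes whose off-target hit count is within the limit."""
    return [r.probe_id for r in results
            if r.off_target_hits <= max_off_target_hits]


# Results as they might be read back from the database (made-up values).
stored = [
    StoredResult(probe_id=1, off_target_hits=3),
    StoredResult(probe_id=2, off_target_hits=1),
    StoredResult(probe_id=3, off_target_hits=5),
]

strict = select_probes(stored, max_off_target_hits=0)   # too stringent here
relaxed = select_probes(stored, max_off_target_hits=1)  # same data, no new search
print(strict, relaxed)
```

Only the selection step is repeated when the constraint is relaxed; the
stored results are reused unchanged.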

A MySQL database (12 tables) was designed for this purpose together with a
set of Perl data objects and adaptors to allow programmes to write and
retrieve from the database. These follow the design paradigm set by the
Ensembl genome analysis system, where one creates a set of (DBEntry) classes
for the 'business' objects used by the system, partnered by a set of
complementary DBSQL classes that hold the cognate SQL necessary for storing
and retrieving from the database. Changes to the database schema can then
be made without impact on the DBEntry classes. The naming conventions used by
the classes (and data types returned) generally follow those used in the
Ensembl core API, see http://www.ensembl.org/info/docs/api/core
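
The separation between 'business' objects and SQL-holding adaptors can be
sketched as follows. This is an illustration only, written in Python with an
in-memory SQLite database so that it is self-contained; the package itself
uses Perl and MySQL. The table name, column names and method names are
assumptions made for the example and do not reflect the real schema or API.

```python
import sqlite3
from dataclasses import dataclass
from typing import Optional


@dataclass
class DNAProbe:
    """'Business' object: knows nothing about SQL or the schema."""
    seq_region: str
    start: int
    end: int
    db_id: Optional[int] = None


class DNAProbeAdaptor:
    """Holds all SQL for DNAProbe; only this class changes with the schema."""

    def __init__(self, connection):
        self.connection = connection

    def store(self, probe):
        """Insert the probe and record the generated primary key on it."""
        cursor = self.connection.execute(
            "INSERT INTO dna_probe (seq_region, seq_start, seq_end) "
            "VALUES (?, ?, ?)",
            (probe.seq_region, probe.start, probe.end),
        )
        probe.db_id = cursor.lastrowid
        return probe.db_id

    def fetch_by_db_id(self, db_id):
        """Return the stored probe, or None if no such row exists."""
        row = self.connection.execute(
            "SELECT seq_region, seq_start, seq_end FROM dna_probe "
            "WHERE dna_probe_id = ?",
            (db_id,),
        ).fetchone()
        if row is None:
            return None
        return DNAProbe(seq_region=row[0], start=row[1], end=row[2],
                        db_id=db_id)


connection = sqlite3.connect(":memory:")
connection.execute(
    "CREATE TABLE dna_probe (dna_probe_id INTEGER PRIMARY KEY, "
    "seq_region TEXT, seq_start INTEGER, seq_end INTEGER)"
)
adaptor = DNAProbeAdaptor(connection)
new_id = adaptor.store(DNAProbe(seq_region="11", start=1000, end=1800))
fetched = adaptor.fetch_by_db_id(new_id)
print(fetched)
```

Renaming a column in this sketch would require changes to DNAProbeAdaptor
only, which is the property the paradigm is meant to provide.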


Southern blot design package classes
------------------------------------

    GeneTargeting::DBEntry
        ::ComponentHit
        ::Conf
            ::Exonerate             
        ::DNAProbe
        ::ExternalDB
        ::Hit
        ::Job
        ::Sequence
        ::Xref

    GeneTargeting::DBSQL
        ::BaseAdaptor
        ::ConfAdaptor
        ::DBAdaptor
        ::DNAProbeAdaptor
        ::ExternalDBAdaptor
        ::HitAdaptor
        ::JobAdaptor
        ::SequenceAdaptor
        ::SequenceHitAdaptor
        ::XrefAdaptor

    GeneTargeting::Utils            Grab-bag of general utility methods
        ::Config                    Configuration file parsing and setup
        ::Counts                    
        ::GD                        Utilities setting colours etc for GD
        ::Exonerate                 Custom exonerate parser
        ::HTMLReport                Simple module for html output
        ::Primer3                   Primer-picking wrapping code
    
Five programmes were written to perform the Southern blot probe design task
in its entirety, along with three accessory scripts:

1) create_probe_search_db_tables
2) create_probe_search
3) submit_probe_search
4) run_probe_search
5) analyse_probe_search

Accessory scripts
-----------------

6) get_probe_search_cpu_time
7) delete_job_results
8) delete_probe_search


1) create_probe_search_db_tables
Creates the tables constituting the GeneTargeting MySQL database, allowing
one to specify the server parameters on the command line.

2) create_probe_search
Given the user-specified chromosomal coordinates of the acceptable design
window for the Southern blot probe, enumerates all the possible probes in the
window, at the chosen granularity, within the size range chosen for the
probe. Creates a number of jobs of class GeneTargeting::DBEntry::Job that are
stored in the database, each of which is executed by an instance of
run_probe_search.

3) submit_probe_search
User-executed script that submits a number of jobs to the compute farm,
wrapping the underlying LSF system.

4) run_probe_search
The 'runnable' script used by the nodes of the compute farm to fetch putative
probes from the database, search them against the genome (specified in the
config file for create_probe_search) and store the results back for later
analysis.

5) analyse_probe_search
Used to analyse the results from all the genome searches, determining
which of the putative probes meet the minimum acceptable criteria for a
Southern blot probe. Results are output as static HTML, including a
graphic representation of 'unique', 'good' and 'bad' regions in the
previously specified probe design window.

Also picks primers with Primer3 for recovery of the putative probes.

6) get_probe_search_cpu_time
Calculates the total time taken for the execution of all the jobs making up
the Southern blot probe search, by parsing the output files written by LSF.

7) delete_job_results
Deletes a job's results from the database, for example when they are
erroneous because of a system failure.

8) delete_probe_search
Deletes the whole probe design from the database when it is no longer
needed, or when the coordinates were specified incorrectly.
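
The enumeration step described for create_probe_search can be sketched as
below (in Python for brevity; the package itself is Perl). This is a minimal
illustration, not the script's actual algorithm: the function name and
parameters are hypothetical, coordinates are assumed to be 1-based and
inclusive, and the granularity is assumed to apply to both the start
position and the probe length.

```python
def enumerate_probes(window_start, window_end, min_length, max_length, step):
    """Yield (start, end) for every candidate probe inside the window.

    Coordinates are 1-based and inclusive. `step` is the granularity,
    applied here to both start position and probe length (an assumption).
    """
    if step < 1 or min_length < 1 or min_length > max_length:
        raise ValueError("invalid step or length range")
    for length in range(min_length, max_length + 1, step):
        # Last start position at which a probe of this length still fits.
        last_start = window_end - length + 1
        for start in range(window_start, last_start + 1, step):
            yield (start, start + length - 1)


candidates = list(
    enumerate_probes(1, 2000, min_length=500, max_length=1000, step=250)
)
print(len(candidates))
```

A finer granularity or a wider window quickly raises the candidate count
into the hundreds or thousands, which is why the searches are split into
jobs and run on a compute farm.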


D. Performance
--------------

Results seem favourable when compared to a number of manually designed probes
(see the paper) that have been used successfully by the Genes to Cognition
research programme at the Wellcome Trust Sanger Institute.

Experimental validation has been performed on a set of the probes
automatically designed by the software (see the paper).


E. Available documentation
--------------------------

All the scripts making up the Southern blot probe design system contain
POD documentation detailing their use and command line parameters.
Should the scripts be run with an invalid parameter combination then the
documentation is automatically displayed in order to guide the user.

The configuration files for each of the scripts (present in the conf
directory of the package) are commented as to the function and meaning of
their various sections and options.

The GeneTargeting API modules utilised for Southern blot design are
documented with POD, which can be displayed with 'perldoc Module_name.pm'
or 'perldoc classname', for example 'perldoc GeneTargeting::Utils::Exonerate'.

Other documentation files include

southern_blot_design/docs/
    southern_blot_probe_design.txt          (this file)    
    example_run.txt
    example_run_output/


F. Package directory structure
------------------------------

GeneTargeting/
    conf/                           example programme config files
    docs/                           documentation
    modules/                        
        GeneTargeting/
            DBEntry/                
            DBSQL/
            Utils/
    scripts/                        deployed scripts
        run/                        runner scripts started by pipeline

G. Environment variables
------------------------

GeneTargetingConfDir - full path to the conf directory
GeneTargetingBaseDir - full path to the southern_blot_design directory


H. Program configuration files
------------------------------

Each of the five programmes requires a Windows-style .ini configuration file.
These should be stored in the directory pointed to by the environment
variable GeneTargetingConfDir.

Sections of the supplied .ini configuration files group related
configuration options, and are commented.
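
As a rough illustration of how a sectioned .ini file can be located through
GeneTargetingConfDir and read, consider the sketch below (in Python for
brevity; the package itself reads its files from Perl with
Config::IniFiles). The file naming scheme (script name plus .ini), the
[database] section and its options are assumptions made for the example;
consult the commented files in the conf directory for the real names.

```python
import configparser
import os
import tempfile
from pathlib import Path


def load_config(script_name, env_var="GeneTargetingConfDir"):
    """Read <script_name>.ini from the directory named by env_var."""
    conf_dir = os.environ.get(env_var)
    if not conf_dir:
        raise RuntimeError(f"environment variable {env_var} is not set")
    path = Path(conf_dir) / f"{script_name}.ini"  # naming is an assumption
    parser = configparser.ConfigParser()
    with open(path) as handle:
        parser.read_file(handle)
    return parser


# Demonstration with a throwaway directory and made-up section/option names.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "create_probe_search.ini").write_text(
        "; hypothetical example\n[database]\nhost = localhost\nport = 3306\n"
    )
    os.environ["GeneTargetingConfDir"] = tmp
    config = load_config("create_probe_search")
    db_host = config.get("database", "host")
    db_port = config.getint("database", "port")
print(db_host, db_port)
```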


I. Perl modules required
------------------------

    name                        version tested              
    ----                        --------------              

    Bio                         1.5.0
    Bio::Ensembl                branch 32
    Bio::Tools::Run             1.4
    Config::IniFiles            2.38
    DBI                         1.32
    GD                          2.17

Note that other modules may be required by the above modules; please see
their documentation should "Can't locate Module.pm" errors arise.


J. Other applications required
------------------------------

exonerate                       exonerate-1.0.0
primer3                         primer3_1.0.0
LSF                             5.1 from Platform Computing Corp.

    
