G2C::Informatics
Programming
A key aim is to develop software that automates steps in the gene targeting
vector design process. At present this design step is a bottleneck in going
from a gene of interest to a transgenic mouse line, so automating it should
speed up mouse knockout generation.
The bioinformatics software package being
developed is implemented in Perl, and makes use of many open source
software components (such as MySQL, BioPerl, and Ensembl) that have
been developed at the Sanger Institute and elsewhere.
Where the developed methods and software
are likely to be of interest and use to the wider scientific community,
G2C plan to publish the methods and make the software freely available. See the Software page for details.
Network Biology
Molecular interaction networks are being used to allow the
integrated analysis of the diverse biological datasets generated
by G2C. These analyses will inform the construction of static and
dynamic models at various levels of function (e.g. protein
complexes, neuronal networks) and abstraction (from coarse
heuristics to detailed biochemical models). The emphasis is on
producing models that provide insight into the underlying biology
and can be used to direct experimental work.
Current models focus on the NMDA Receptor Complex (NRC/MASC), a major
component of the postsynaptic signal transduction machinery at glutamatergic
synapses. The NRC/MASC co-ordinates a diverse set of effector pathways
underlying neuronal plasticity. Detailed annotation and analysis of the
NRC/MASC proteins revealed simple principles underlying the functional
organization of the complex. The resulting network model correctly predicts
robustness of synaptic plasticity to mutations and drug interference. Existing
models of NRC/MASC (and the synapse proteome in general) are now being refined
through the analysis of gene/protein expression, phospho-proteomics and other
large-scale datasets.
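As a rough illustration of what "robustness to mutations" can mean for an interaction network, the sketch below removes each node in turn and measures how much of the remaining network stays connected. This is a generic connectivity check, not the G2C network model; the node names and the function names (`build_graph`, `largest_component_size`, `knockout_impact`) are hypothetical.

```python
from collections import deque


def build_graph(edges):
    """Build an undirected adjacency map from (node, node) interaction pairs."""
    graph = {}
    for a, b in edges:
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set()).add(a)
    return graph


def largest_component_size(graph, removed=frozenset()):
    """Size of the largest connected component, ignoring removed nodes."""
    seen = set(removed)
    best = 0
    for start in graph:
        if start in seen:
            continue
        # Breadth-first search over everything reachable from this node.
        seen.add(start)
        queue = deque([start])
        size = 0
        while queue:
            node = queue.popleft()
            size += 1
            for neighbour in graph[node]:
                if neighbour not in seen:
                    seen.add(neighbour)
                    queue.append(neighbour)
        best = max(best, size)
    return best


def knockout_impact(graph):
    """For each node, the fraction of the other nodes that remain in the
    largest connected component once that node is removed."""
    remaining = len(graph) - 1
    impact = {}
    for node in graph:
        if remaining == 0:
            impact[node] = 0.0
        else:
            impact[node] = largest_component_size(graph, {node}) / remaining
    return impact


# Placeholder network: one hub with three partners, two of which also interact.
edges = [("hub", "a"), ("hub", "b"), ("hub", "c"), ("a", "b")]
impact = knockout_impact(build_graph(edges))
```

In this toy network, removing a peripheral node leaves the rest fully connected, while removing the hub splits it. A real analysis would use curated interactions and a functional readout rather than connectivity alone.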
Literature mining
G2C Bioinformatics have developed literature-mining
tools to aid the in-house data curators. The purpose of these tools
is to present the curator with ranked lists of results. A typical
search that a curator might wish to perform is 'from this list of proteins,
show the proteins that interact with each other, are involved
in LTP and are implicated in disease'.
Without these tools, the curators face huge volumes of text, multiple protein synonyms and
the laborious process of identifying protein interactions.
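A minimal sketch of how such a ranked list might be produced is given here, assuming abstracts are available as plain text keyed by an identifier. It is not the G2C tool itself: the synonym table, the scoring weights and the function names (`mentions`, `find_proteins`, `score_abstract`, `rank_abstracts`) are all illustrative assumptions.

```python
import re

# Illustrative synonym table: canonical protein name -> aliases to search for.
SYNONYMS = {
    "PSD-95": ["PSD-95", "PSD95", "DLG4", "SAP90"],
    "NR2B": ["NR2B", "GRIN2B", "GluN2B"],
}

# Illustrative topic terms for the "involved in LTP" part of the query.
TOPIC_TERMS = ["LTP", "long-term potentiation"]


def mentions(term, text):
    """True if term occurs in text as a whole token (case-insensitive)."""
    pattern = r"(?<![\w-])" + re.escape(term) + r"(?![\w-])"
    return re.search(pattern, text, re.IGNORECASE) is not None


def find_proteins(text, synonyms=SYNONYMS):
    """Return the canonical names of all listed proteins mentioned in text."""
    return {
        canonical
        for canonical, aliases in synonyms.items()
        if any(mentions(alias, text) for alias in aliases)
    }


def score_abstract(text, synonyms=SYNONYMS, topic_terms=TOPIC_TERMS):
    """Crude relevance score: proteins found, topic terms found, plus a bonus
    when two or more listed proteins co-occur (a possible interaction)."""
    proteins = find_proteins(text, synonyms)
    topics = {term for term in topic_terms if mentions(term, text)}
    pair_bonus = 2 if len(proteins) >= 2 else 0
    return len(proteins) + len(topics) + pair_bonus


def rank_abstracts(abstracts, synonyms=SYNONYMS, topic_terms=TOPIC_TERMS):
    """Return abstract IDs with a non-zero score, best first."""
    scored = [
        (score_abstract(text, synonyms, topic_terms), abstract_id)
        for abstract_id, text in abstracts.items()
    ]
    scored = [(score, abstract_id) for score, abstract_id in scored if score > 0]
    return [abstract_id for score, abstract_id in sorted(scored, key=lambda x: (-x[0], x[1]))]


# Made-up abstracts with placeholder identifiers.
abstracts = {
    "a1": "PSD95 binds GRIN2B and is required for LTP.",
    "a2": "DLG4 expression in cortex.",
    "a3": "Unrelated text about yeast.",
}
ranked = rank_abstracts(abstracts)
```

Co-occurrence of two protein names is only a hint that an interaction is reported, which is why the output is a ranked list for a curator to review rather than a final answer.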
Figure 1 below shows a sample abstract mined from
PubMed, with the
relevant search terms highlighted.

Figure 1: Sample abstract mined from PubMed, with search terms highlighted
Gene Prioritizing
We are developing a rule-based/machine-learning
system to prioritize lists of genes for investigation. The rules
are derived from previous lab investigations and draw in part on
highly informative gene characteristics, such as protein
interactions and disease implication.
The machine-learning component augments the rule-based system
by identifying patterns in highly ranked results and assessing the usefulness of rules inferred
from these patterns.
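The rule-based half of such a system can be sketched as a set of weighted predicates over gene characteristics. The example below is only an illustration under stated assumptions: the `Gene` fields, the rule names and the weights are hypothetical, and the machine-learning component is not shown.

```python
from dataclasses import dataclass


@dataclass
class Gene:
    symbol: str
    n_interactions: int = 0          # number of known protein interactions
    disease_implicated: bool = False


# Hypothetical rules: (name, predicate, weight). Real rules and weights would
# come from previous lab investigations.
RULES = [
    ("has_protein_interactions", lambda g: g.n_interactions > 0, 2.0),
    ("disease_implicated", lambda g: g.disease_implicated, 3.0),
]


def score(gene, rules=RULES):
    """Sum the weights of every rule the gene satisfies."""
    return sum(weight for _, predicate, weight in rules if predicate(gene))


def prioritize(genes, rules=RULES):
    """Return genes ordered by descending score, ties broken by symbol."""
    return sorted(genes, key=lambda g: (-score(g, rules), g.symbol))


# Placeholder genes, not real data.
genes = [
    Gene("GeneA"),
    Gene("GeneB", n_interactions=3, disease_implicated=True),
    Gene("GeneC", n_interactions=1),
]
ordered = [g.symbol for g in prioritize(genes)]
```

Keeping each rule as a named entry makes it straightforward to add, remove or re-weight rules, for example when a learned pattern suggests that a rule is more or less useful than assumed.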
Lab database development
The Bioinformatics team have developed G2C's Database, which collects and integrates the large volumes of information gleaned from literature mining and data curation. The database holds data on genetics, synaptic plasticity, human disease and proteomics, all organized by gene of interest.
Many diverse groups within the lab generate data. Various systems are being
developed to store these data and to allow retrieval of information that is
relevant across all areas. In particular, we are developing a system of in-house gene IDs to manage the constantly changing lists of identifiers produced by the various gene databases.
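One way to picture an in-house ID system is a registry that assigns each gene a stable internal identifier and maps any number of external identifiers onto it. The sketch below assumes this design. The class name `GeneIdRegistry`, the ID format and the source labels are hypothetical and do not describe the actual G2C schema.

```python
class GeneIdRegistry:
    """Maps (source, external_id) pairs to stable in-house gene IDs."""

    def __init__(self, prefix="G"):
        self._prefix = prefix        # hypothetical in-house ID prefix
        self._next = 1
        self._by_external = {}       # (source, external_id) -> in-house ID

    def register(self, source, external_id):
        """Return the in-house ID for an external ID, creating one if new."""
        key = (source, external_id)
        if key not in self._by_external:
            self._by_external[key] = f"{self._prefix}{self._next:08d}"
            self._next += 1
        return self._by_external[key]

    def link(self, in_house_id, source, external_id):
        """Attach a further external ID (e.g. a replacement issued by a
        database update) to an existing in-house ID."""
        if in_house_id not in self._by_external.values():
            raise KeyError(f"unknown in-house ID: {in_house_id}")
        key = (source, external_id)
        existing = self._by_external.get(key)
        if existing is not None and existing != in_house_id:
            raise ValueError(f"{key} is already linked to {existing}")
        self._by_external[key] = in_house_id

    def lookup(self, source, external_id):
        """Return the in-house ID for an external ID, or None if unknown."""
        return self._by_external.get((source, external_id))

    def external_ids(self, in_house_id):
        """Return all (source, external_id) pairs linked to an in-house ID."""
        return sorted(k for k, v in self._by_external.items() if v == in_house_id)


# Placeholder identifiers, not real database accessions.
registry = GeneIdRegistry()
gid = registry.register("ensembl", "EXAMPLE_ENSEMBL_ID")
registry.link(gid, "entrez", "EXAMPLE_ENTREZ_ID")
```

Because lab data refer only to the in-house ID, an identifier change in an external database becomes a single `link` call rather than an update to every record.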