G2C::Software

GeneNomenclatureUtils: Tools for annotating genes and comparing gene lists with community resources.

by Mike DR Croning and Seth GN Grant

Example ids

Example ids

Software Package and Collaborative Development

GeneNomenclatureUtils is hosted @ github

Introduction

Verifying, annotating, storing, and comparing lists of genes is now an essential task in modern experimental and computational biology. These lists can be derived from so-called 'omics' technologies that allow one to investigate gene (transcriptomics) or protein expression (proteomics) across the genome under defined conditions such as drug treatment, developmental stage, or clinical status. Similarly lists of candidate disease genes are being generated from the huge quantities of sequence data being produced by geneticists and clinicians employing new-sequencing technologies to identify sequence variation in large cohorts of human patient samples.

Comparing lists of genes produced by different labs from these disparate experimental sources remains an arduous, but obligatory task for the bioinformatician. This is because gene and protein identifiers are not stable, as they are continually revised and withdrawn over time by the gene nomenclature committees (such as HGNC) and database identifiers provided by other community genomic resources do also change. This 'identifier creep' is confounded by the fact that gene list comparisons often have to be made across species and model organism.

Here we provide an extensive suite of 40+ command-line driven utilities that are designed to simply this process. We provide an automated means to fetch the key nomenclature and other genomic, bibliographic and disease annotation resources from: HGNC, MGI, NCBI Entrez, OMIM, PubMed and UniProt, and store them on local disk. Required MEDLINE records are cached in a local MySQL database. This ensures a high-availability of all the resources, and ensures a consistent set of nomenclature and annotation can be used across the lifetime of an ongoing experiment or analysis.

GeneNomenclatureUtils is free software, collaboratively-developed, easy to install, and should prove useful to both bioinformaticians and experimental investigators, working with lists of genes.

Open Source License

The GeneNomenclatureUtils package is made available under the terms of the Artistic License 2.0.

Read the full license and disclaimer.

Documentation

Tool categories

Nomenclature file download

Use this script to automatically download all the nomenclature and other bioinformatic database files utilised by the GeneNomenclatureUtils package.

  • download_nomenclature_files

Gene and protein nomenclature utilities

These scripts use the Entrez, HGNC, MGI, and UniProt resources to do identifier checking, by going from gene symbol to gene ID (or vice versa), or to fetch othologous gene IDs when going from mouse <-> human genomes.

Approved names and synonyms can be added to files of gene symbols or IDs.

Protein sequences and reported mouse alleles can be fetched for mouse genes.

Entrez Gene: http://www.ncbi.nlm.nih.gov/gene
check_entrez_gene_ids
extract_entrez_gene_info_by_tax_id
extract_from_ncbi_gene2_accession
HGNC: http://www.genenames.org
add_hgnc_id_by_hgnc_symbol
MGI: http://www.informatics.jax.org
add_approved_name_by_mgi_id
add_mgi_id_by_hgnc_id
add_mgi_id_by_mgi_symbol
add_mgi_symbol_by_mgi_id
add_ortho_gene_id_by_mgi_id
add_uniprot_acc_by_mgi_id
add_synonyms_by_mgi_id
combine_by_mgi_id_column
generate_interpro_report_from_mgi_id
get_mgi_alleles_by_mgi_id
get_prot_seqs_by_mgi_id
UniProt: http://www.uniprot.org
check_uniprot_ids
parse_uniprot_accs

List comparison

Allows an arbitrary number of lists of genes or symbols to be compared eaily using a configuation file system so that 10s or 100s of lists can be managed easily.

  • list_comparator

MEDLINE cache

A system to locally cache MEDLINE records, automatically fetched from PubMed, using a MySQL database to both coordinate the requests for the MEDLINE records, and subsequently supply them.

A simple object-orientated interface to the database is provided to utilise the cache: package MedlineCacheDB;

  • medline_cache_create_db_tables
  • medline_cache_dump
  • medline_cache_pubmed_fetcher
  • medline_cache_requester

OMIM disease - http://www.omim.org

Used to add OMIM IDs and OMIM entry titles to a file of human genes specified by Entrez Gene ID. The various 'gene' and/or 'phenotype' records can be added.

  • add_omim_by_entrez_gene_id

PubMed

Uses the orthologous gene identifiers assembled in a file to combine the synonyms and human mutation terms to generate a PubMed URL search, to look for reports of association to disease of human gene mutations

  • generate_pubmed_disease_searches_from_gene_nomenclature_ids

TAB-delimited file manipulation

A comprehensive set of utilities to manipulate tab- delimited text files.

  • add_row_numbers
  • categorise_by_column
  • compare_columns_output_differences
  • count_column_occurrences
  • count_distinct_values_in_column
  • count_row_occurrences
  • count_rows_and_columns
  • dump_column_as_unique
  • extract_from_file_by_id_column
  • extract_from_file_by_column_numerical_range
  • filter_by_column
  • pad_tabs_to_column_width
  • rearrange_and_cut_columns
  • remove_trailing_whitespace
  • sort_file_by_column
  • txt_file_format_sniffer
  • write_spreadsheet

Command Line Parameters

Introduction

Most of the scripts in the GeneNomenclatureUtils package take one Unix-formatted tab-delimited text file (use txt_file_format_sniffer to check your txt file format) as input (specified by --file) and do something (look-up an equivalent ID, filter by, check, etc) the IDs or symbols specified in the column parameter(s) required by the individual scripts.

Columns are numbered from 1, that being the left-most column, and output is written to STDOUT by inserting a column in the lines of the file, as specified by --output_column, with existing columns being pushed to the right.

All of the scripts are documented with their particular mandatory and optional parameters. This is output automatically if an incorrect or insufficient set of parameters are specified, enter:

'./check_entrez_gene_id<cr>' to view the parameters of the script that checks Entrez Gene IDs, which will produce:

NAME - check_entrez_gene_ids
COMMAND LINE PARAMETERS
       Required parameters
         --file                         file to check
         --entrez_gene_id_column        column containing IDs to check (>=1)
         --tax_id                       tax_id
         --output_column                column to output check results (>=1)

       Optional parameters
         --skip_title                   skip first (title) row
         --help|h

DESCRIPTION
<SNIP>

An example command line is usually included in the documentation, that uses the demonstration data and configuration files included in the package.

The most common parameters

--file
The file to use, check, or process
--skip_title
Skip the first line (row) of the tab-delimited file, as it contains titles, rather than data
--output_column
Specify the output column, for the added data, existing columns are pushed rightword
--tax_id
Specify a taxonomy ID e.g. 9606 (human)
--column
Specify the column to check, or process
--columns
Specify a set of columns to check, or process e.g. --columns=1,2,3
--value
Specify a single value to find, count, or filter by
--help|h
Output the documentation
--dir
Specify directory to read or write from
--file_list
Specify a file to read file names from
--config_file
Specify configuration file. Details of the correct format are included in all scripts
--mode
Used to select between the behaviours of scripts with more than one running mode
--quiet
Produce less logging output

Gene nomenclature, id and symbol parameters

This parameter group is used to specify gene symbols and IDs from nomenclature committees and databases including MGI, HGNC, and Entrez Gene

--mgi_id_column
Specify column containing MGI IDs e.g. MGI:95819
--hgnc_symbol_column
Specify column containing HGNC symbols e.g. GRIN1
--hgnc_id_column
Specify column containing HGNC IDs e.g. HGNC:4584
--mgi_symbol_column
Specify column containing MGI IDs e.g. Grin1
--entrez_gene_id_column
Specify column containing Entrez Gene IDs e.g. 2902 (human)

All the above examples about are for the gene encding the NR1 subunit of the glutamate receptor, NMDA subtype

See:

PubMed

--pubmed_id_column
Specify column containing PubMed IDs (PMIDs)

Other parameters

--allow_dups
Allow duplicates
--ascending
Sort or output in an asencding fashion (in string or numeric terms)
--compare
-
--data_file
data file from which to extract
--data_file_id_column
data file column with ids
--descending
Sort or output in a descending fashion (in string or numeric terms)
--file1
First tab-delimited data file
--file1_column
Column to check or process from file1
--file2
Second tab-delimited data file
--file2_column
Column to check or process from file2
--filter_by
-
--has_title
-
--ids_file
file holding IDs
--ids_file_column
column to use from IDs file
--ignore_case
Ignore case when sorting, processing a file
--include_all
-
--lower_bound
-
--negative
-
--output_attrib
-
--output_ids
-
--output_mode
-
--output_unavailable
-
--positive
-
--suppress_title
-
--transfer_column
Specify additional columns to copy from the input file to output
--upper_bound
-
--width
-

Testing and debugging parameters

--test
Switch on testing mode
--debug
Switch on debugging mode

Example Usage

  1. Install the package as described in: GeneNomenclatureUtils/docs/package_installation.txt
  2. Review the remaining documentation
  3. cd GeneNomenclatureUtils/scripts

Example 1 - add MGI IDs to the of mouse gene symbols, in the process checking the symbols

Command
./add_mgi_id_by_mgi_symbol --file=../data/my_list_of_genes_1.txt --mgi_symbol_column=1 --output_column=2 --skip_title > ../example_output/example_1_output.txt
Comment
Note at least two 'NOT_FOUND' in the output file, for gene 'Agpat7', 'Lpaat-alpha'

Example 2 - add orthologous human HGNC IDs to the results of example 1

Command
./add_ortho_gene_id_by_mgi_id --file=../example_output/example_1_output.txt --mgi_id_column=2 --output_column=3 --skip_title --output_attrib=human_hgnc_id > ../example_output/example_2_output.txt
Comment
Inspect the output file, a number of HGNC IDs have been added

Example 3 - add Human gene symbols to the results of example 2

Command
./add_ortho_gene_id_by_mgi_id --file=../example_output/example_2_output.txt --mgi_id_column=2 --output_column=4 --skip_title --output_attrib=human_gene_symbol > ../example_output/example_3_output.txt
Comment
Inspect the output file, note mouse and human gene symbols are generally the same except human are capitalised

Example 4 - add approved names to the results of example 3

Command
./add_approved_name_by_mgi_id --file=../example_output/example_3_output.txt --mgi_id_column=2 --output_column=5 --skip_title > ../example_output/example_4_output.txt
Comment
Inspect file, not names have been added

Example 5 - sort the results of example 4 by the added human gene symbols

Command
./sort_file_by_column --file=../example_output/example_4_output.txt --column=4 --mode=string --ascending --skip_title > ../example_output/example_5_output.txt
Comment
Inspect file, note it has been sorted

Example 6 - Look for duplicates in the MGI IDs present in results of example 5.

Command
./count_distinct_values_in_column --file=../example_output/example_5_output.txt --column=2
Comment
Compare to output below:
    ./count_distinct_values_in_column
    =================================

    Assumes a column title line

    Column number: 2
    Column title : 'MGI ID'

    Distinct values: 13

    Duplicates
    MGI:1336186          - 2
    MGI:1355330          - 2
    MGI:88437            - 2
    NOT_FOUND            - 2

    Total number of duplicates: 4

Example 7 - Make a unique list of MGI IDs from the results of example 5

Command
./dump_column_as_unique --file=../example_output/example_5_output.txt --column=2 --skip_title > ../example_output/example_7_output.txt
Comment
Inspect file, note it still contains a single 'NOT_FOUND'

Example 8 - Add orthologous human Entrez Gene IDs to the results of example 7

Command
./add_ortho_gene_id_by_mgi_id --file=../example_output/example_7_output.txt --mgi_id_column=1 --output_column=2 --skip_title --output_attrib=human_entrez_gene_id
Comment
Inspect file, note the numerical IDs added

Example 9 - Using the result of Example 8 search OMIM for human disease phenotypes

Command
./add_omim_by_entrez_gene_id --file=../example_output/example_8_output.txt --entrez_gene_id_column=2 --skip_title --mode=phenotitle --output_column=3 > ../example_output/example_9_output.txt
Comment
Note at least two of the genes have OMIM entries, for enzymatic deficiencies

Example 10 - Get synonyms for the genes specified by MGI in the results of example 7

Command
./add_synonyms_by_mgi_id --file=../example_output/example_7_output.txt --mgi_id_column=1 --output_column=2 --skip_title > ../example_output/example_10_output.txt
Comment
Inspect the output file, note not all genes have synonyms

Example 11 - Check the txt file format of the first example file

Command
./txt_file_format_sniffer --file=../data/my_list_of_genes_1.txt
Comment
Depending how you obtained the package from Github the results could vary, example output below:
    ./txt_file_format_sniffer
    =========================

    File : ../data/my_list_of_genes_1.txt
    Lines: 18

    Unix/Linux  - YES: Containing: \w\n\w
    DOS/Windows - NO : Containing: \w\r\n\w|\w\n\r\w
    MAC         - NO : Containing: \w\r\w

Example 12 - Rearrange the order of the columns in the example 5 output, to place the MGI IDs first

Command
./rearrange_and_cut_columns --file=../example_output/example_5_output.txt --columns=2,1,3,4,5 > ../example_output/example_12_output.txt
Comment
Inspect file, note rearrangement

Example 13 - Find the InterPro domains (and their frequency) in the mouse genes specified in the results example 12

Command
./generate_interpro_report_from_mgi_id --file=../example_output/example_12_output.txt --mgi_id_column=1 --skip_title --output_mode=by_abundance > ../example_output/example_13_output.txt
Comment
Inspect the output file, note the most frequently occurring InterPro family/domain is IPR016040, NAD(P)-binding domain

Example 14 - Find mice alleles for the genes specified in the results of example 7

Command
./get_mgi_alleles_by_mgi_id --file=../example_output/example_7_output.txt --mgi_id_column=1 --skip_title > ../example_output/example_14_output.txt
Comment
Notice from the STDERR output that (at least) 7 of the genes have reported alleles, and from the Pdpk1 has the most

Example 15 - Convert the results of example 14 into a Microsoft Excel format spreadsheet

Command
./write_spreadsheet --file=../example_output/example_14_output.txt
Comment
Note from the output to screen that 14 rows were written to a file called example_14_output.xls

Example 16 - Generate URLS to search PubMed for human disease association of the genes from example 5

Command
./generate_pubmed_disease_searches_from_gene_nomenclature_ids --file=../example_output/example_5_output.txt --mgi_id_column=2 --hgnc_id_column=3 --mode=disease --skip_title > ../example_output/example_16_output.txt
Comment
You may see warnings from the programme, these are non-fatal, examine the output file and cut and paste the search for Decr2 into http://www.ncbi.nlm.nih.gov/sites/entrez?db=PubMed

Example 17 - Setup the MySQL database for the MEDLINE cache

Command
./medline_cache_create_db_tables --config_file=medline_cache_db.ini
Comment
Assuming you have created an empty database on your MySQL server, and set the connection parameters in GeneNomenclatureUtils/conf/medline_cache_db.ini, two tables will be created 'medline' and 'request'

Example 18 - Load some PubMed IDs requests into the MEDLINE cache db

Command
./medline_cache_requester --config_file=medline_cache_db.ini --file=../data/my_list_of_pubmed_ids.txt --pubmed_id_column=1
Comment
5 PubMed IDs should be parsed from the file:
    ./medline_cache_requester
    =========================

    IDs parsed: 5

    medline_entries_fetched : 0
    pmids_already_available : 0
    pmids_already_requested : 0
    pmids_stored            : 5
    pmids_with_error        : 0
    total_pmids_parsed      : 0

Example 19 - Fetch outstanding MEDLINE records

Command
./medline_cache_pubmed_fetcher --config_file=medline_cache_db.ini
Comment
If the fetch is successful you should get the following output:
    ./medline_cache_pubmed_fetcher
    ==============================

    Total entries to fetch: 5
    URL Request size      : 2

    ..

    pmids_cached    : 5
    pmids_remaining : 0
    pmids_requested : 5

Example 20 - Dump the fetched MEDLINE records to a text file

Command
./medline_cache_dump --config_file=medline_cache_db.ini --file=../data/my_list_of_pubmed_ids.txt --pubmed_id_column=1 > ../example_output/example_20_output.txt
Comment
Inspect the output file it should contain 5 MEDLINE records

Example 21 - Compare two lists of gene identifiers

Command
./list_comparator --config_file=example_list_config.yml --include_all > ../example_output/example_21_output.txt
Comment
Inspect the output file, once should see two files were parsed and compared, and the comparision starts thus:
      Comparison: MY-LIST-2 (11)

    UNION OF ALL SETS: 20

    Gene    MY-LIST-1       MY-LIST-2
    MGI:1915512     YES     NO
    MGI:1196345     YES     NO
    MGI:1914291     YES     NO
    MGI:107436      NO      YES
    MGI:106672      NO      YES
    MGI:108109      YES     NO

Package Structure

The GeneNomenclatureUtils package has the following structure by default.

GeneNomenclatureUtils
 
GeneNomenclatureUtils/conf
Example configuration files
GeneNomenclatureUtils/data
Example data files
GeneNomenclatureUtils/docs
Documentation
GeneNomenclatureUtils/example_output
Example output
GeneNomenclatureUtils/modules
Perl modules
GeneNomenclatureUtils/scripts
The package scripts

License

Copyright (C) 2005-2011 Genome Research Limited (GRL)

Authors: Mike Croning and Seth Grant

Refer to the Perl Artistic License 2.0 for the terms under which you may use modify, and redistribute this source. http://www.opensource.org/licenses/artistic-license-2.0.php

Comments and feedback to: mike_croning@hotmail.com

THIS PACKAGE IS PROVIDED "AS IS" AND WITHOUT ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, WITHOUT LIMITATION, THE IMPLIED WARRANTIES OF MERCHANTIBILITY AND FITNESS FOR A PARTICULAR PURPOSE.

© G2C 2014. The Genes to Cognition Programme received funding from The Wellcome Trust and the EU FP7 Framework Programmes:
EUROSPIN (FP7-HEALTH-241498), SynSys (FP7-HEALTH-242167) and GENCODYS (FP7-HEALTH-241995).

Cookies Policy | Terms and Conditions. This site is hosted by Edinburgh University and the Genes to Cognition Programme.