Proposed Standard for Genome-based Gene Identification

(Minimally edited February 7, 2006 to update to the current genome builds -dls.)

Scope:  This is a proposal for a working system that will allow us
within the stem cell gap consortium to easily make "good enough"
comparisons between our data.  Many important, fundamental issues are
therefore completely ignored by this proposal.
Nonetheless, we believe this will be a useful, working procedure that
can be quickly implemented within the consortium.

There are two related but distinct issues that arise when one thinks of
communicating gene lists as genomic coordinates.  The first is how one
identifies an experimental result with a specific gene.  This is a
fairly difficult problem if dealt with rigorously and completely, and
one that is important for the current effort, but which is not the
primary focus of this document.  Given that one has identified an
experimental result (sequence) with a gene, the next question is how one
can most unambiguously and usefully communicate the identity of that
gene.  An example task that illustrates this issue is that of comparing
gene lists.  If Baylor has a list of genes expressed in haematopoietic
stem cells identified using SSH/CCS sequence tags, and Princeton has a
list of genes expressed in haematopoietic stem cells identified using
cDNA microarrays, an obvious question is what percentage of the genes in
each list are found on the other.  The primary goal of this proposal is
to facilitate such tasks.
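
As a minimal sketch of the comparison task described above, assume each
gene has already been reduced to a (genome build, chromosome, start,
stop) tuple, and assume that two entries refer to the same gene when
they share a build and chromosome and their coordinate ranges overlap.
The overlap rule, the function names, and the sample lists are
illustrative assumptions, not part of this proposal.

```python
from typing import List, Tuple

# Hypothetical minimal record: (genome build, chromosome, start, stop).
Gene = Tuple[str, str, int, int]


def same_gene(a: Gene, b: Gene) -> bool:
    """Assumed rule: same build, same chromosome, overlapping ranges."""
    return a[0] == b[0] and a[1] == b[1] and a[2] <= b[3] and b[2] <= a[3]


def percent_found(source: List[Gene], other: List[Gene]) -> float:
    """Percentage of genes in `source` that match at least one gene in `other`."""
    if not source:
        return 0.0
    hits = sum(1 for g in source if any(same_gene(g, o) for o in other))
    return 100.0 * hits / len(source)


# Made-up lists for illustration only.
baylor = [("mm5", "chr3", 21230044, 21240000), ("mm5", "chr7", 500, 900)]
princeton = [("mm5", "chr3", 21233245, 21233366)]

print(percent_found(baylor, princeton))   # 50.0
print(percent_found(princeton, baylor))   # 100.0
```

The comparison is deliberately asymmetric, since the percentage of one
list found on the other generally differs in each direction.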

Assertion: The primary deliverables for this consortium are gene lists. 
Some will be well-known genes; others will be proposed new genes.

Assertion: The gene lists will be much more valuable if one can relate
genes between the lists and against external information sources.

Assertion: Many current, commonly used standards for gene identification are
inadequate for this task.
    
	Examples: OMIM Number
	          GeneCards ID
	          HUGO Names
	          RefSeq UIDs

	There are a number of problems with identifiers such as the above:
		1) It may be non-trivial to identify links between (putative) genes in
		   hand and the above sets.
		2) The above sets are guaranteed to be incomplete with respect to the
		   gene lists we create.
		3) Even within a single specified system, there might be more than
		   one way of identifying a gene.  Similarly, such lists might
		   inadvertently have two different entries for the same gene.
		   
    Even for comprehensive, sequence-based gene lists (e.g. DoTS), some
identifications we make will be outside such a list.
	          
Assertion: For virtually all the genes we include in gene lists, a
sequence associated with that gene will be available.  For example, for
cDNA microarrays, the sequence of the cDNAs which are spotted will be
known.  For Affymetrix arrays, the sequences of oligonucleotides are
available.  And of course, for sequence based technologies (ESTs, SAGE,
SSH/CCS, etc.) sequence is the primary datum.

Proposal: The primary identifier for all gene lists should be a genome
build, chromosome name, a set of coordinates for the match between the
query sequence and the genome, the coordinates of the boundaries of the
matched gene on the genome, an optional gene name, a confidence
indicator for the match to the gene, a description of the nature of the
boundaries of the gene, and a reference for the genomic coordinates for
the matched gene.  In the simplest case, the set of coordinates for the
query match will consist of two coordinates, the beginning and end of
the match.  In the case of query sequence matches which are
discontinuous (e.g. spliced sequences), more than one pair will be
reported for this match.
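
One way to picture the proposed identifier is as a single record.  The
sketch below is only an illustration: the class name, the field names,
and the choice of a Python dataclass are assumptions, not part of the
proposal.  The sample values are taken from the second XML example
later in this document.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class GeneIdentifier:
    """Hypothetical container for the fields listed in the proposal."""
    genome_build: str                    # e.g. "hg15"
    chromosome: str                      # e.g. "chrX_un"
    query_blocks: List[Tuple[int, int]]  # one (start, stop) pair per matched block
    gene_start: int                      # boundary of the matched gene
    gene_stop: int
    confidence: str                      # confidence indicator for the match
    boundary_type: str                   # nature of the gene boundaries
    coordinate_source: str               # reference for the gene coordinates
    gene_name: Optional[str] = None      # optional qualifier
    name_source: Optional[str] = None

    def query_span(self) -> Tuple[int, int]:
        """Outer limits of the query match across all blocks."""
        starts = [start for start, _ in self.query_blocks]
        stops = [stop for _, stop in self.query_blocks]
        return (min(starts), max(stops))


example = GeneIdentifier(
    genome_build="hg15",
    chromosome="chrX_un",
    query_blocks=[(3334433, 3334721), (3337001, 3337245)],
    gene_start=3334433,
    gene_stop=3337245,
    confidence="possible",
    boundary_type="uncharacterized",
    coordinate_source="Span of the Tag",
)
print(example.query_span())   # (3334433, 3337245)
```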

For a match to a known gene or gene prediction, the start of
transcription and end of transcription for that gene will be provided.  These
will, in general, be well beyond the match of the actual query sequence.

The coordinates will unambiguously identify a single gene in many cases,
but in some cases, there may be multiple, significantly different genes
with very similar or even identical end coordinates.  To allow
investigators to resolve such ambiguities, we provide for an optional
gene name.  Although gene names by themselves are problematic, as
qualifiers to a primary identifier based on genomic coordinates, they
increase the certainty of identification.  For these gene names to be
useful, their source needs to be identified, e.g. as a particular track
on the Santa Cruz genome browser or table in the underlying database.

For matches to previously uncharacterized regions of the genome, the
"gene boundary" coordinates will be identical to the match coordinates,
since there is no basis for reporting anything other than the boundaries
of the match itself.

An intermediate case exists in which a query sequence matches to a
region of the genome where there are no known complete gene sequences or
gene predictions, but where there exists a cluster of tag (EST, SAGE,
etc.) matches.  In that case, the gene boundary will be reported as the
outer limits of the cluster.
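
The three cases above (known gene or prediction, tag cluster,
uncharacterized region) can be sketched as a single function.  The
function name, the "cluster" label, and the decision to include the
query itself when computing the outer limits of a cluster are
assumptions made for this illustration; the coordinates are made up.

```python
from typing import List, Optional, Tuple

Interval = Tuple[int, int]


def gene_boundary(
    query_blocks: List[Interval],
    known_gene: Optional[Interval] = None,
    cluster: Optional[List[Interval]] = None,
) -> Tuple[Interval, str]:
    """Return (gene boundary, boundary type) for a query match."""
    # Case 1: known gene or gene prediction; report its transcription
    # start and end, which usually extend well beyond the query match.
    if known_gene is not None:
        return known_gene, "gene"

    span = (min(s for s, _ in query_blocks), max(e for _, e in query_blocks))

    # Case 2: no known gene, but a cluster of tag matches exists;
    # report the outer limits of the cluster (query included, by assumption).
    if cluster:
        start = min([s for s, _ in cluster] + [span[0]])
        stop = max([e for _, e in cluster] + [span[1]])
        return (start, stop), "cluster"

    # Case 3: uncharacterized region; the boundary is the match itself.
    return span, "uncharacterized"


print(gene_boundary([(100, 200)], known_gene=(50, 900)))
print(gene_boundary([(100, 200)], cluster=[(80, 150), (180, 260)]))
print(gene_boundary([(100, 200), (300, 400)]))
```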

Examples:

<gene confidence="high">
  <genome>mm4</genome>
  <chromosome>chr3</chromosome>
  <match type="query">
    <start block="1">21233245</start>
    <stop  block="1">21233366</stop>
  </match>
  <match type="gene">
    <start>21230044</start>
    <stop>21240000</stop>
    <coordinate_source>Santa Cruz Browser</coordinate_source>
  </match>
  <genename>
    <name>Dsi1</name>
    <name_source>Santa Cruz Browser RefSeq Genes Track</name_source>
  </genename>
</gene>
 
<gene confidence="possible">
  <genome>hg15</genome>
  <chromosome>chrX_un</chromosome>
  <match type="query">
    <start block="1">3334433</start>
    <stop  block="1">3334721</stop>
    <start block="2">3337001</start>
    <stop  block="2">3337245</stop>
  </match>
  <match type="uncharacterized">
    <start>3334433</start>
    <stop>3337245</stop>
    <coordinate_source>Span of the Tag</coordinate_source>
  </match>
</gene>
 
These examples are not meant to represent the structure of the XML
schemata we will use; they simply use XML syntax to illustrate the
proposal.
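
For readers who want to experiment with records in the illustrative
syntax above, the following sketch reads the first example using the
Python standard library.  Since the real schema is undecided, the
element names are simply those of the example, and the function name
and the dictionary layout are assumptions.

```python
import xml.etree.ElementTree as ET

# First example from this document, reproduced as a string.
EXAMPLE = """<gene confidence="high">
  <genome>mm4</genome>
  <chromosome>chr3</chromosome>
  <match type="query">
    <start block="1">21233245</start>
    <stop  block="1">21233366</stop>
  </match>
  <match type="gene">
    <start>21230044</start>
    <stop>21240000</stop>
    <coordinate_source>Santa Cruz Browser</coordinate_source>
  </match>
  <genename>
    <name>Dsi1</name>
    <name_source>Santa Cruz Browser RefSeq Genes Track</name_source>
  </genename>
</gene>"""


def parse_gene(xml_text: str) -> dict:
    """Parse one <gene> element into a plain dictionary (hypothetical layout)."""
    root = ET.fromstring(xml_text)
    record = {
        "confidence": root.get("confidence"),
        "genome": root.findtext("genome"),
        "chromosome": root.findtext("chromosome"),
        "name": root.findtext("genename/name"),  # None if no name is given
        "matches": {},
    }
    for match in root.findall("match"):
        # Pair each <start> with the <stop> carrying the same block attribute;
        # elements without a block attribute are paired under the key None.
        starts = {e.get("block"): int(e.text) for e in match.findall("start")}
        stops = {e.get("block"): int(e.text) for e in match.findall("stop")}
        record["matches"][match.get("type")] = [
            (starts[block], stops[block]) for block in starts
        ]
    return record


record = parse_gene(EXAMPLE)
print(record["genome"], record["chromosome"], record["name"])
print(record["matches"]["query"])
print(record["matches"]["gene"])
```

A start element with no matching stop raises a KeyError in this sketch,
which is one simple way to surface malformed records early.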

Although the genome build will be part of the identifier, the difficulty
of moving between genome builds mandates that we standardize on one
build per species.  The three species with which we are concerned are
human, mouse, and zebrafish.

For human, the current standard is UCSC version hg17 from Santa Cruz
(http://genome.ucsc.edu/index.html).

For mouse, the current standard is UCSC version mm5 from Santa Cruz
(http://genome.ucsc.edu/index.html).

For zebrafish, the current standard is Sanger zv3
(http://www.sanger.ac.uk/Projects/D_rerio/Zv3_assembly_information.shtml).

The current proposal does not specify a way of describing alternative
splicing.  A future version may add an additional annotation for expressing
concepts such as "This is a match within a known gene, but does not
match a previously known splicing pattern, and thus may represent a new
splicing form."

Changes in the standard version of the genome for each species will be
made by group consensus.

Orthologous relationships (e.g. the "same" gene in different species)
are not addressed in the current proposal.

Genes contained within currently unsequenced regions of the genome are not
addressed in the current proposal.

This proposal was initially written by David Steffen, September 8, 2003.