INTRODUCTION
In the bioinformatics community, the use of multiple
identifiers for the same gene or protein is a general problem. Ideally, there
should be an accepted identifier for every gene, protein, and possibly for
splice variants, post-translationally modified proteins, and other biomolecules.
While the major public databases list exhaustive cross-references, the
maintenance of aliases is a time-consuming effort at every site that maintains
local copies of them. Numerous smaller specialty databases use local identifiers
with a limited set of aliases, which makes their incorporation even more difficult.
The integration of public and private sequence data and associated data is a
further complication that is very often solved by introducing yet another local
unique identifier at the given site. This is a recurring problem in the design
of bioinformatics applications that retrieve data from a variety of sources at
any one time.
There have been several attempts to overcome this
problem. One such solution is the use of Life Science Identifiers (LSIDs), a
mechanism for retrieving data and metadata across
different life science databases, containing diverse information and information
types (1). LSIDs are the standard adopted by the Object
Management Group (OMG) for the identification of life science data.
An Alias server was described recently (2). The strategy involves the
calculation of 64-bit Cyclic Redundancy Check (CRC64) checksums for the protein
sequences of several completed genomes and their use as unique keys, which allows
the quick integration of identifiers from multiple sites and can quickly reveal some
identifier collisions. The Alias server provides access through a web interface
or through an API. CRC64 replaced the previously used CRC32 in release 40 of
Swiss-Prot to ensure data integrity in applications. The most comprehensive,
non-redundant protein database, the UniProt Archive (UniParc) uses UniParc
identifiers and provides cross-references to more than 60 databases (3). The
identifiers are composed of ‘UPI’ followed by ten hexadecimal digits, enabling
the entry of around 1E12 unique protein sequences. The NCBI database uses a
unique identifier (GenInfo Identifier, GI) for every protein or DNA sequence
derived from several sources: GenBank (4), PIR (5), Swiss-Prot (6), PDB (7),
and Entrez (8). NCBI also provides a collection of non-redundant databases
generated using CRC checksums for sequence comparison tools such as BLAST (9). A new
solution for cross-referencing sequence identifiers across databases was
published recently (10). MagicMatch uses the MD5 digest of the sequences for the
comparison, which allows very fast cross-referencing of two FASTA files. The
generation of fingerprints and their comparison is much faster than using other
sequence comparison tools.
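The digest-based cross-referencing described above can be sketched in a few lines. This is a minimal illustration and not MagicMatch's actual implementation: the function names, the sample identifiers, and the normalization step (upper-casing and keeping letters only) are assumptions made for the example.

```python
import hashlib
import re


def parse_fasta(text):
    """Yield (header, sequence) pairs from FASTA-formatted text."""
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif line and header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def md5_fingerprint(sequence):
    """Hex MD5 digest of the upper-cased sequence, letters only (assumed normalization)."""
    cleaned = re.sub(r"[^A-Z]", "", sequence.upper())
    return hashlib.md5(cleaned.encode("ascii")).hexdigest()


def cross_reference(fasta_a, fasta_b):
    """Pair up headers from two FASTA texts whose sequences are identical."""
    index = {}
    for header, seq in parse_fasta(fasta_a):
        index.setdefault(md5_fingerprint(seq), []).append(header)
    matches = []
    for header, seq in parse_fasta(fasta_b):
        for other in index.get(md5_fingerprint(seq), []):
            matches.append((other, header))
    return matches


# Hypothetical identifiers and toy sequences, for illustration only.
site_a = ">local_001 protease\nPQITLWQRPL\nVTVKIGGQ\n>local_002\nMKTAYIAK\n"
site_b = ">db|X1\nmktayiak\n>db|X2\nPQITLWQRPLVTVKIGGQ\n>db|X3\nAAAA\n"
for pair in cross_reference(site_a, site_b):
    print(pair)
```

Because each sequence is reduced to a fixed-length key, matching two files costs one dictionary lookup per sequence instead of a pairwise sequence comparison.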
We have developed a protein sequence identifier
(SEGUID) based on the Secure Hash Algorithm (SHA-1) digest of the primary
sequence because our bioinformatics, analytical, and high-throughput proteomics
pipelines suffered from changing and disappearing protein identifiers. A SEGUID
is stable for the lifetime of a protein and is used as the central identifier
while all other aliases are treated as dynamic properties. Everyone can derive
the same SEGUID from the sequence information, which allows easy data sharing.
The use of SEGUID ensures that proteomics data is resilient to changes in
annotation databases and the reports generated reflect the most recent
annotations collected from sequence databases. Our SEGUID website provides a
number of web applications and web services which are described in this
manuscript. The FTP site provides pre-calculated data, FASTA files, alias
tables, and sample programs describing the web services and their consumption by
other applications.
METHODS
Generation of SEGUID
A SEGUID is generated from the sequence string by the
SHA-1 hashing algorithm. The sequence string is converted to upper case and all
non-letter characters are removed. The Digest::SHA1
Perl module (http://search.cpan.org/~gaas/Digest-SHA1-2.10/) was used for
calculating the digest for a given sequence using the base64 representation
without the padding, resulting in a 27-character digest. Other programming
languages such as Java and C# provide similar facilities, except
that their base64 digests are padded with a trailing “=” character, which must be removed.
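The procedure above can be reproduced with standard library calls in most languages. The following Python sketch is an illustration and not the Perl code used for the database; the function name seguid is chosen here for convenience.

```python
import base64
import hashlib
import re


def seguid(sequence):
    """Return the 27-character SEGUID of a sequence string."""
    # Upper-case the input and keep letters only, as described in the text.
    cleaned = re.sub(r"[^A-Z]", "", sequence.upper())
    digest = hashlib.sha1(cleaned.encode("ascii")).digest()  # 20 raw bytes
    # Base64 of 20 bytes is 28 characters ending in one "="; drop the padding.
    return base64.b64encode(digest).decode("ascii").rstrip("=")


# Case, whitespace, and digits do not affect the identifier.
print(seguid("PQITLWQRPLVTVK"))
print(seguid("pqitl wqrpl\nvtvk 10") == seguid("PQITLWQRPLVTVK"))
```

Stripping the trailing "=" is what makes the output of padded base64 encoders, such as this one, agree with the unpadded 27-character form.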
Web Services
Most of the website's functionality is exposed as
SOAP web services (11). The user can retrieve alias and sequence information
from NCBI ‘GenInfo Identifiers’ (GI) and from identifiers and accession numbers used at
UniProt. By default, all sequence-associated aliases are returned; alternatively, the user can
request a single entry only. Some prediction data is also available.
The web services were implemented in C# on .NET and are hosted on a Microsoft
Internet Information Server 6.0 (http://bioinformatics.anl.gov/ws/protseq.asmx?WSDL).
DESCRIPTION
Integration of Public and Private Sequence Databases
The SEGUID database combines information from NCBI’s
non-redundant database (12), UniProt (13), KEGG (14), and holds private sequence
data obtained from TIGR or JGI. The database entries are divided into sequence
information, sequence annotation with cross-references, and taxonomy information
obtained from NCBI’s taxonomy database. The SEGUID database uses a relational
database, ORACLE10g, to store all information, including our proteomics data. The
diagram of the pertinent tables is shown in Figure 1. SEGUIDs are calculated
using the SHA-1 algorithm. SHA-1 computes a 160-bit message digest
of a message or data file of any length less than 2^64 bits (15). The
SHA-1 algorithm is based on principles similar to those used by MD4 (16). While
SHA-1 is mainly used with the Digital Signature Algorithm in e-mail, financial
transactions, and other applications which require data integrity assurance and
data origin authentication, it can also be employed to generate a hash of any
input. SHA-1 is designed to have the following properties: it is
computationally infeasible to find a message which corresponds to a given
message digest, or to find two different messages which produce the same message
digest. We are using the latter property of SHA-1 to generate a hash of any
protein sequence and use it as a unique identifier that stays with the sequence
throughout its lifetime. We have listed the checksum and digest methods commonly
used in bioinformatics (Table 1). The SEGUID database provides cross-referencing
between SEGUID identifier, MD5 digest, and CRC64 checksum for every protein
sequence. After investigating the cross-references, we found that while the
commonly used CRC64 checksum provides 1.8E19 different possibilities, there were
two cases of collision: (a) the same 44CAAD88706CC153 checksum was
identified for two distinct immunoglobulin fragments:
>gnl|sha|BpBeDdcNUYNsdk46JoJdw7Pd3BI|immunoglobulin lambda light chain variable region [Homo sapiens]
QSALTQPASVSGSPGQSITISCTGTSSDVGSYNLVSWYQQHPGKAPKLMIYEGSKRPSGV
SNRFSGSKSGNTASLTISGLQAEDEADYYCSSYAGSSTLVFGGGTKLTVL
>gnl|sha|X5XEaayob1nZLOc7eVT9qyczarY|immunoglobulin lambda light chain variable region [Homo sapiens]
QSALTQPASVSGSPGQSITISCTGTSSDVGSYNLVSWYQQHPGKAPKLMIYEGSKRPSGV
SNRFSGSKSGNTASLTISGLQAEDEADYYCCSYAGSSTWVFGGGTKLTVL
and (b) the same E66EA8F06A4EC2B7 checksum was
generated for two distinct HIV proteases:
>gnl|sha|p1T2e3f7x0zLU9B05EuacRbkwpA|protease [Human immunodeficiency virus type 1]
PQITLWQRPLVTVKIGGQLKEALLDTGADDTVLEEMNLPGKWKPKMIGGIGGFIKARQYD
QIAIEICGHKAIGTVLVGPTPVNIIGRNLLTQIGCTLNF
>gnl|sha|K8VCxuoElTp1sYcSeX58Mu2/R/0|protease [Human immunodeficiency virus type 1]
PQITLWQRPLVTVKIGGQLKEALLDTGADDTVLEEMNLPGKWKPKMIGGIGGFIKVRQYD
QIQVEICGHKAIGTVLVGPTPVNIIGRNLLTQIGCTLNF
This lookup table can be used for quick integration
with those databases that link either MD5 or CRC64 with their sequence entries.
The SEGUID database contains about 2.7 million protein sequences and 6.5 million
associated annotation entries (NCBI’s environmental protein sequences are stored
in a separate system). Private databases are easily integrated and are
distinguished by their DBNAME entry in the NRINFO table. The entries in the
ORACLE10g database are updated using transactions. Hence, the information is
always available to our bioinformatics and proteomics pipelines. The SEGUIDs are
used as primary keys in the database allowing the creation of 100% non-redundant
protein sequence databases for bioinformatics applications. Releases of this
non-redundant database are available on our FTP site as FASTA-formatted or
tab-delimited files (ftp://bioinformatics.anl.gov/seguid).
The Use of SEGUID as Sequence Identifier
While SEGUIDs are 27-character base64
identifiers intended for data exchange between computers, we explored
the possibility of using only their first few characters for everyday
lab purposes. The calculated probabilities of uniquely identifying a SEGUID using only
its first four, five, and six characters were 88.4%, 98.4%, and 99.9%,
respectively (Figure 2). While the probability is very low when using only the first
three characters (0.03%), no list of SEGUIDs sharing the same three characters
has more than thirty entries. We have implemented a utility similar to
Google’s Suggest service that dynamically retrieves a list of valid
SEGUIDs from the database once the user has entered at least the first
three characters, using Asynchronous JavaScript and XML (AJAX).
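One way to estimate such prefix statistics, and to serve prefix suggestions, is sketched below. It assumes that "probability" means the fraction of identifiers whose prefix is shared with no other identifier; the exact definition behind Figure 2 is not stated here. The function names and the shortened sample identifiers are hypothetical.

```python
from collections import Counter


def unique_prefix_fraction(seguids, length):
    """Fraction of identifiers whose first `length` characters match no other."""
    counts = Counter(s[:length] for s in seguids)
    unique = sum(1 for s in seguids if counts[s[:length]] == 1)
    return unique / len(seguids)


def suggest(seguids, prefix, min_chars=3):
    """Return the sorted identifiers starting with `prefix` (Suggest-style lookup)."""
    if len(prefix) < min_chars:
        return []  # too short to give a usefully small list
    return sorted(s for s in seguids if s.startswith(prefix))


# Shortened, made-up identifiers; real SEGUIDs are 27 characters long.
ids = ["BpBeDdc", "BpBfXyz", "X5XEaay", "p1T2e3f"]
for k in (3, 4):
    print(k, unique_prefix_fraction(ids, k))
print(suggest(ids, "BpB"))
```

For a table of millions of identifiers, a database index on the SEGUID column would replace the linear scan in suggest.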
SEGUIDs are used in our proteomics efforts as central
identifiers. They have been used internally for GELBANK, a database of annotated
two-dimensional images (17), where the annotations of the identified proteins
are automatically updated as new data becomes available. Our mass spectrometry
pipeline is based on SEGUIDs, which allows the use of private sequence
information obtained for incomplete genomes and easy integration of results once
the complete genome information becomes available. Most bioinformatics pipelines
incorporate data generated by prediction tools using primary sequence
information. Some public servers predict attributes for the same sequence over
and over, which is wasteful for computationally intensive operations. A
SEGUID-based system can instead retrieve previously calculated attributes when
they are available, an approach
we have implemented in our bioinformatics knowledgebase. We have computed
average sequence attributes using almost 500 amino acid properties available
from the AAindex database (18) for all proteins in the database, some of which
are available through a web application for searching the more than 1.2
billion entries. These files are also available from our FTP site. In addition,
other prediction results are available, such as transmembrane predictions
by the SOSUI (19) and DAS (20) algorithms.
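The idea of computing an attribute once per sequence and retrieving it by SEGUID afterwards can be sketched as a small cache. This is a simplified in-memory illustration and not the knowledgebase implementation: the class and function names are hypothetical, and the Kyte-Doolittle hydropathy scale is used only as a familiar example of a per-residue property.

```python
import base64
import hashlib
import re

# Kyte-Doolittle hydropathy values, used here as one example property.
KD = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5, "Q": -3.5,
    "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5, "L": 3.8, "K": -3.9,
    "M": 1.9, "F": 2.8, "P": -1.6, "S": -0.8, "T": -0.7, "W": -0.9,
    "Y": -1.3, "V": 4.2,
}


def normalize(sequence):
    """Upper-case the sequence and keep letters only."""
    return re.sub(r"[^A-Z]", "", sequence.upper())


def seguid(sequence):
    digest = hashlib.sha1(normalize(sequence).encode("ascii")).digest()
    return base64.b64encode(digest).decode("ascii").rstrip("=")


def mean_hydropathy(sequence):
    """Average of the per-residue property over the sequence."""
    return sum(KD[aa] for aa in sequence) / len(sequence)


class AttributeCache:
    """Compute an attribute once per distinct sequence, keyed by SEGUID."""

    def __init__(self, compute):
        self._compute = compute
        self._store = {}
        self.computed = 0  # number of times the computation actually ran

    def get(self, sequence):
        key = seguid(sequence)
        if key not in self._store:
            self._store[key] = self._compute(normalize(sequence))
            self.computed += 1
        return self._store[key]


cache = AttributeCache(mean_hydropathy)
print(cache.get("MKVLAAGIT"), cache.get("mkvl aagit"), cache.computed)
```

Both calls return the same stored value, and the computation runs only once because the two inputs normalize to the same sequence and therefore share a SEGUID.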
Species Specific FASTA File
Since the system integrates sequence, alias, and
taxonomy information from the various sources, it is simple to generate various
slices as FASTA files according to taxonomy information, database origin, or
a combination of the two. We have implemented a web application that allows the
on-line generation of such files. These files can be used for similarity search
algorithms such as BLAST and for search engines processing mass spectrometry
data such as Mascot and Sequest.
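Producing such a slice amounts to filtering the integrated records and writing them in FASTA format. The sketch below is a simplified in-memory illustration: the Entry record, its field names, and the sample entries are hypothetical, while the gnl|sha| header layout follows the examples shown earlier in this paper.

```python
from dataclasses import dataclass


@dataclass
class Entry:
    seguid: str       # placeholder identifiers are used in the demo below
    taxid: int        # NCBI taxonomy identifier
    dbname: str       # database of origin
    description: str
    sequence: str


def to_fasta(entries, taxid=None, dbname=None, width=60):
    """Return FASTA text for the entries matching the given filters."""
    records = []
    for e in entries:
        if taxid is not None and e.taxid != taxid:
            continue
        if dbname is not None and e.dbname != dbname:
            continue
        # Break the sequence into fixed-width lines.
        lines = [e.sequence[i:i + width] for i in range(0, len(e.sequence), width)]
        records.append(f">gnl|sha|{e.seguid}|{e.description}\n" + "\n".join(lines))
    return "\n".join(records) + ("\n" if records else "")


# Made-up entries with placeholder identifiers and toy sequences.
entries = [
    Entry("ID_ONE", 9606, "NCBI", "toy protein one [Homo sapiens]", "MKVLAAGIT"),
    Entry("ID_TWO", 11676, "UniProt", "toy protein two", "PQITLWQRPL"),
]
print(to_fasta(entries, taxid=9606, width=5), end="")
```

Passing both taxid and dbname restricts the output to entries that satisfy the two filters together.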
Alias Table for a Given Species
We have implemented a web application for the
on-line generation of alias tables using only a taxonomy identifier as
input. The utility combines the information available from our annotation and
taxonomy tables. The generated Excel chart can be used by the bench scientist to
quickly identify similarities and differences between annotation information
obtained from different sources.
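An alias table of this kind can be thought of as a pivot of the annotation rows: one row per SEGUID and one column per source database. The sketch below is an illustration and not the web application's code; the tuple layout, the function name, and the sample aliases are all hypothetical.

```python
import csv
import io
from collections import defaultdict


def alias_table(annotations, taxid):
    """Build a tab-delimited alias table for one taxonomy identifier.

    `annotations` is an iterable of (seguid, taxid, dbname, alias) tuples.
    """
    table = defaultdict(lambda: defaultdict(list))
    sources = set()
    for sid, tax, db, alias in annotations:
        if tax == taxid:
            table[sid][db].append(alias)
            sources.add(db)
    columns = sorted(sources)
    out = io.StringIO()
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    writer.writerow(["SEGUID"] + columns)
    for sid in sorted(table):
        # Several aliases from the same source are joined with ";".
        writer.writerow([sid] + [";".join(table[sid][db]) for db in columns])
    return out.getvalue()


# Made-up annotation rows with placeholder identifiers.
rows = [
    ("ID_ONE", 9606, "NCBI", "gi|111"),
    ("ID_ONE", 9606, "UniProt", "P00001"),
    ("ID_TWO", 9606, "NCBI", "gi|222"),
    ("ID_THREE", 11676, "NCBI", "gi|333"),
]
print(alias_table(rows, 9606), end="")
```

An empty cell in the output marks a source that has no alias for that sequence, which is the kind of difference between sources the table is meant to expose.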
Other Web Applications
Several web applications are available at the SEGUID
website, and some of their functionality is exposed as web services. The user
can obtain up-to-date annotation data in the form of a tab-delimited file or an
on-line table for a set of SEGUIDs. Using a set of SEGUIDs, the user can
generate FASTA files compatible with bioinformatics applications. Another
utility allows the generation of SEGUIDs, MD5 hashes, and CRC64 checksums for a
FASTA file (limited to 1000 sequences).
FUTURE PROSPECTS
The idea presented here can be applied not only to
protein sequences but also to DNA sequences. The use of MD5 or SHA-1
digests should be transparent (MagicMatch). Given that the possible numbers of
MD5 and SHA-1 digests are enormous, one can combine any future protein and DNA
sequences in the same database without the potential duplication of identifiers.
We have already successfully implemented this method in our bioinformatics
pipeline for completed genomes.
This work was supported by the Office of Biological
and Environmental Research through the Microbial Genome Project, U.S. Department
of Energy, under Contract No. W-31-109-ENG-38.