

INTRODUCTION

 

In the bioinformatics community, the use of multiple identifiers for the same gene or protein is a general problem. Ideally, there should be one accepted identifier for every gene, protein, and possibly for splice variants, post-translationally modified proteins, and other biomolecules. While the major public databases list exhaustive cross-references, the maintenance of aliases is a time-consuming effort at every site that keeps local copies of them. Numerous smaller specialty databases use local identifiers with a limited set of aliases, which makes their incorporation even more difficult. The integration of public and private sequences and their associated data complicates matters further, and is very often solved by introducing yet another local unique identifier at the given site. This is a recurring problem in the design of bioinformatics applications that retrieve data from a variety of sources at any one time.

There have been several attempts to overcome this problem. One such solution is the use of Life Science Identifiers (LSIDs), a mechanism for retrieving data and metadata across different life science databases that contain diverse information and information types (1). LSIDs are the standard adopted by the Object Management Group (OMG) for the identification of life science data. An Alias server was described recently (2). Its strategy involves calculating the 64-bit Cyclic Redundancy Check (CRC64) of the protein sequences of several completed genomes and using the checksums as unique keys, which allows the quick integration of identifiers from multiple sites and can quickly reveal some identifier collisions. The Alias server provides access through a web interface or through an API. CRC64 replaced the previously used CRC32 in release 40 of Swiss-Prot to ensure data integrity in applications. The most comprehensive non-redundant protein database, the UniProt Archive (UniParc), uses UniParc identifiers and provides cross-references to more than 60 databases (3). The identifiers are composed of ‘UPI’ followed by ten hexadecimal digits, which allows around 1E12 (16^10) unique protein sequences to be entered. The NCBI database uses a unique identifier (GenInfo Identifier, GI) for every protein or DNA sequence derived from several sources: GenBank (4), PIR (5), Swiss-Prot (6), PDB (7), and Entrez (8). NCBI also provides a collection of non-redundant databases, generated using CRC checksums, for sequence comparison tools such as BLAST (9). A new solution for cross-referencing sequence identifiers across databases was published recently (10). MagicMatch uses the MD5 digest of the sequences for the comparison, which allows very fast cross-referencing of two FASTA files. Generating and comparing such fingerprints is much faster than using other sequence comparison tools.

 

We have developed a protein sequence identifier (SEGUID) based on the Secure Hash Algorithm (SHA-1) digest of the primary sequence because our bioinformatics, analytical, and high-throughput proteomics pipelines suffered from changing and disappearing protein identifiers. A SEGUID is stable for the lifetime of a protein and is used as the central identifier while all other aliases are treated as dynamic properties. Everyone can derive the same SEGUID from the sequence information, which allows easy data sharing. The use of SEGUID ensures that proteomics data is resilient to changes in annotation databases and the reports generated reflect the most recent annotations collected from sequence databases. Our SEGUID website provides a number of web applications and web services which are described in this manuscript. The FTP site provides pre-calculated data, FASTA files, alias tables, and sample programs describing the web services and their consumption by other applications.

 

 

 

METHODS

 

 

Generation of SEGUID

 

A SEGUID is generated from the sequence string with the SHA-1 hashing algorithm. The sequence string is first converted to upper case and all non-letter characters are removed. The Digest::SHA1 Perl module (http://search.cpan.org/~gaas/Digest-SHA1-2.10/) was used to calculate the digest of a given sequence in its base64 representation without padding, resulting in a 27-character digest. Other programming languages such as Java and C# provide similar facilities, with the exception that the digests are padded with a trailing “=” character.
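The procedure above can be sketched in a few lines of Python. This is a minimal illustration rather than the code used on the SEGUID site: the function name is hypothetical, and treating "letters" as the ASCII range A-Z is an assumption.

```python
import base64
import hashlib
import re


def seguid(sequence: str) -> str:
    """Return the unpadded base64 SHA-1 digest of a normalised sequence."""
    # Normalise as described in the text: upper case, letters only.
    # Restricting "letters" to A-Z is an assumption of this sketch.
    cleaned = re.sub(r"[^A-Z]", "", sequence.upper())
    digest = hashlib.sha1(cleaned.encode("ascii")).digest()  # 20 bytes
    # 20 bytes encode to 28 base64 characters, the last one being "=" padding;
    # stripping it leaves the 27-character identifier.
    return base64.b64encode(digest).decode("ascii").rstrip("=")
```

Because the normalisation happens inside the function, the same identifier is obtained regardless of case, line breaks, or trailing markers such as "*" in the input.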

 

 

Web Services

 

Most of the website's functionality is exposed as SOAP web services (11). The user can retrieve alias and sequence information starting from NCBI GenInfo Identifiers (GI) or from the identifiers and accession numbers used at UniProt. By default all sequence-associated aliases are returned; alternatively, the user can request a single entry only. Some prediction data is also available. The web services were implemented in C# on .NET and are hosted on a Microsoft Internet Information Server 6.0 (http://bioinformatics.anl.gov/ws/protseq.asmx?WSDL).

 

 

 

 

DESCRIPTION

 

 

 

Integration of Public and Private Sequence Databases

 

 

The SEGUID database combines information from NCBI’s non-redundant database (12), UniProt (13), and KEGG (14), and holds private sequence data obtained from TIGR or JGI. The database entries are divided into sequence information, sequence annotation with cross-references, and taxonomy information obtained from NCBI’s taxonomy database. All information, including our proteomics data, is stored in a relational database (ORACLE10g). The diagram of the pertinent tables is shown in Figure 1.

SEGUIDs are calculated using the SHA-1 algorithm, which computes a 160-bit message digest of a message or data file of any length less than 2^64 bits (15). The SHA-1 algorithm is based on principles similar to those used by MD4 (16). While SHA-1 is mainly used with the Digital Signature Algorithm in e-mail, financial transactions, and other applications that require data integrity assurance and data origin authentication, it can also be employed to generate a hash of any input. SHA-1 is designed to have the following properties: it is computationally infeasible to find a message that corresponds to a given message digest, or to find two different messages that produce the same message digest. We use the latter property of SHA-1 to generate a hash of any protein sequence and use it as a unique identifier that stays with the sequence throughout its lifetime.

We have listed the checksum and digest methods commonly used in bioinformatics (Table 1). The SEGUID database provides cross-referencing between the SEGUID, MD5 digest, and CRC64 checksum of every protein sequence. After investigating the cross-references, we found that although the commonly used CRC64 checksum provides 1.8E19 (2^64) different values, there were two cases of collision: (a) the same checksum, 44CAAD88706CC153, was obtained for two distinct immunoglobulin fragments:

 

>gnl|sha|BpBeDdcNUYNsdk46JoJdw7Pd3BI|immunoglobulin lambda light chain variable region [Homo sapiens]
QSALTQPASVSGSPGQSITISCTGTSSDVGSYNLVSWYQQHPGKAPKLMIYEGSKRPSGV
SNRFSGSKSGNTASLTISGLQAEDEADYYCSSYAGSSTLVFGGGTKLTVL
>gnl|sha|X5XEaayob1nZLOc7eVT9qyczarY|immunoglobulin lambda light chain variable region [Homo sapiens]
QSALTQPASVSGSPGQSITISCTGTSSDVGSYNLVSWYQQHPGKAPKLMIYEGSKRPSGV
SNRFSGSKSGNTASLTISGLQAEDEADYYCCSYAGSSTWVFGGGTKLTVL

 

and (b) the same E66EA8F06A4EC2B7 checksum was generated for two distinct HIV proteases:

 

>gnl|sha|p1T2e3f7x0zLU9B05EuacRbkwpA|protease [Human immunodeficiency virus type 1]
PQITLWQRPLVTVKIGGQLKEALLDTGADDTVLEEMNLPGKWKPKMIGGIGGFIKARQYD
QIAIEICGHKAIGTVLVGPTPVNIIGRNLLTQIGCTLNF
>gnl|sha|K8VCxuoElTp1sYcSeX58Mu2/R/0|protease [Human immunodeficiency virus type 1]
PQITLWQRPLVTVKIGGQLKEALLDTGADDTVLEEMNLPGKWKPKMIGGIGGFIKVRQYD
QIQVEICGHKAIGTVLVGPTPVNIIGRNLLTQIGCTLNF

 

 

This cross-reference lookup table can be used for quick integration with databases that link either MD5 or CRC64 to their sequence entries. The SEGUID database contains about 2.7 million protein sequences and 6.5 million associated annotation entries (NCBI’s environmental protein sequences are stored in a separate system). Private databases are easily integrated and are distinguished by their DBNAME entry in the NRINFO table. The entries in the ORACLE10g database are updated using transactions, so the information is always available to our bioinformatics and proteomics pipelines. The SEGUIDs are used as primary keys in the database, allowing the creation of 100% non-redundant protein sequence databases for bioinformatics applications. Releases of this non-redundant database are available on our FTP site as FASTA-formatted or tab-delimited files (ftp://bioinformatics.anl.gov/seguid).
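To put the sizes of the identifier spaces in perspective, the standard birthday approximation can be applied to a database of 2.7 million sequences. The sketch below assumes that checksums behave like uniformly distributed random values, which is an idealisation, and the function name is hypothetical. Under that assumption, chance collisions would be very unlikely even for a 64-bit checksum, which may indicate that the CRC64 collisions observed between closely related sequences stem from the structure of the checksum rather than from the size of its output space.

```python
def expected_collisions(n_items: int, bits: int) -> float:
    """Birthday approximation: number of pairs times the per-pair collision chance."""
    pairs = n_items * (n_items - 1) / 2
    return pairs / 2**bits


N_SEQUENCES = 2_700_000  # approximate database size quoted in the text

for name, bits in (("CRC64", 64), ("MD5", 128), ("SHA-1", 160)):
    print(f"{name}: {expected_collisions(N_SEQUENCES, bits):.2e}")
```

For CRC64 the estimate is on the order of 1E-7 expected chance collisions, and it is many orders of magnitude smaller for the 128-bit and 160-bit digests.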

 

 

 

The Use of SEGUID as Sequence Identifier

 

 

While SEGUIDs are 27-character base64 identifiers intended for data exchange between computers, we explored the possibility of using only their first few characters for everyday lab purposes. The calculated probabilities of identifying a unique SEGUID from only its first four, five, and six characters were 88.4%, 98.4%, and 99.9%, respectively (Figure 2). While the probability is very low when only the first three characters are used (0.03%), the list of SEGUIDs sharing the same three characters never has more than thirty entries. We have implemented a utility, similar to Google’s Suggest service, that uses Asynchronous JavaScript and XML (Ajax) to retrieve a list of valid SEGUIDs from the database dynamically once the user has entered at least the first three characters. The annotation and sequence information is also displayed for a given SEGUID (Figure 3).
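Both the prefix statistics and the prefix lookup can be illustrated with a short Python sketch. It works on an in-memory list instead of the database, and the function names and the default limit of thirty suggestions are assumptions made for illustration.

```python
from bisect import bisect_left
from collections import Counter


def unique_prefix_fraction(seguids, k):
    """Fraction of identifiers whose first k characters are shared with no other."""
    counts = Counter(s[:k] for s in seguids)
    return sum(1 for s in seguids if counts[s[:k]] == 1) / len(seguids)


def suggest(sorted_seguids, prefix, limit=30):
    """Return up to `limit` identifiers that start with `prefix`.

    `sorted_seguids` must be sorted; all matches then form one contiguous
    run that begins at the insertion point of the prefix.
    """
    start = bisect_left(sorted_seguids, prefix)
    matches = []
    for candidate in sorted_seguids[start:start + limit]:
        if not candidate.startswith(prefix):
            break
        matches.append(candidate)
    return matches
```

Keeping the identifiers sorted means each lookup costs one binary search plus the number of matches returned, which is the same idea a database index on the identifier column would provide.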

 

SEGUIDs are used in our proteomics efforts as central identifiers. They have been used internally for GELBANK, a database of annotated two-dimensional gel images (17), where the annotations of the identified proteins are automatically updated as new data becomes available. Our mass spectrometry pipeline is based on SEGUIDs, which allows the use of private sequence information obtained for incomplete genomes and easy integration of results once the complete genome information becomes available. Most bioinformatics pipelines incorporate data generated by prediction tools from primary sequence information. Some public servers predict attributes for the same sequence over and over, which is wasteful for computationally intensive operations. A SEGUID-based system can instead retrieve previously calculated attributes when they are available, an approach we have implemented in our bioinformatics knowledgebase. We have computed average sequence attributes using almost 500 amino acid properties available from the AAindex database (18) for all proteins in the database; some of these are available through a web application that searches the more than 1.2 billion entries. These files are also available from our FTP site. In addition, other prediction results are available, such as trans-membrane predictions by the SOSUI (19) and DAS (20) algorithms.

 

 

 

Species Specific FASTA File

 

Since the system integrates sequence, alias, and taxonomy information from the various sources, it is simple to generate slices of the database as FASTA files according to taxonomy information, database origin, or a combination of the two. We have implemented a web application that allows the on-line generation of such files. These files can be used by similarity search algorithms such as BLAST and by search engines that process mass spectrometry data, such as Mascot and Sequest.
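A taxonomy-based slice can be sketched in Python as a filter followed by FASTA formatting. The record layout, the function name, and the 60-character line width are assumptions made for illustration; the header follows the gnl|sha|SEGUID|description pattern of the FASTA entries shown earlier.

```python
from typing import Iterable, NamedTuple


class Entry(NamedTuple):
    """Hypothetical in-memory record joining sequence, annotation and taxonomy."""
    seguid: str
    description: str
    taxonomy_id: int
    sequence: str


def fasta_for_taxon(entries: Iterable[Entry], taxonomy_id: int, width: int = 60) -> str:
    """Format all entries of one taxonomy identifier as FASTA text."""
    lines = []
    for entry in entries:
        if entry.taxonomy_id != taxonomy_id:
            continue
        lines.append(f">gnl|sha|{entry.seguid}|{entry.description}")
        # Wrap the sequence into fixed-width lines.
        for start in range(0, len(entry.sequence), width):
            lines.append(entry.sequence[start:start + width])
    return "\n".join(lines) + ("\n" if lines else "")
```

Filtering by database origin would follow the same pattern with an additional field on the record.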

 

 

 

Alias Table for a Given Species

 

We have implemented a web application that generates on-line alias tables using only a taxonomy identifier as input. The utility combines the information available from our annotation and taxonomy tables. The generated Excel chart can be used by the bench scientist to quickly identify similarities and differences between annotation information obtained from different sources.
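Conceptually, an alias table is a pivot of annotation rows into one line per SEGUID and one column per source database. The sketch below assumes rows of the form (SEGUID, database name, alias) that have already been selected for one taxonomy identifier; the function name, the row layout, and the tab-delimited output are illustrative assumptions and do not reproduce the Excel output of the web application.

```python
from collections import defaultdict


def alias_table(rows, databases):
    """Pivot (seguid, dbname, alias) rows into tab-delimited text.

    `databases` fixes the column order; multiple aliases from the same
    database are joined with ";" and missing ones are left empty.
    """
    table = defaultdict(lambda: defaultdict(list))
    for seguid, dbname, alias in rows:
        table[seguid][dbname].append(alias)

    lines = ["\t".join(["SEGUID", *databases])]
    for seguid in sorted(table):
        cells = [";".join(table[seguid][db]) for db in databases]
        lines.append("\t".join([seguid, *cells]))
    return "\n".join(lines)
```

Empty cells make it easy to spot sequences that one source annotates and another does not.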

 

 

Other Web Applications

 

Several web applications are available at the SEGUID website, and some of their functionality is exposed as web services. The user can obtain up-to-date annotation data for a set of SEGUIDs as a tab-delimited file or an on-line table. Using a set of SEGUIDs, the user can also generate FASTA files compatible with bioinformatics applications. Another utility generates SEGUIDs, MD5 hashes, and CRC64 checksums for a FASTA file (limited to 1000 sequences).
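A reduced version of the last utility can be sketched in Python as a FASTA reader that emits a SEGUID and an MD5 digest per record. This is an illustration under assumptions: the function names are hypothetical, the MD5 digest is shown as upper-case hexadecimal, and the CRC64 column is omitted because its exact parameters are not given here.

```python
import base64
import hashlib
import re


def read_fasta(text):
    """Yield (header, sequence) pairs from FASTA-formatted text."""
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:].strip(), []
        elif line and header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def identifiers(text, limit=1000):
    """Return (header, SEGUID, MD5) for each record, refusing oversized input."""
    rows = []
    for count, (header, sequence) in enumerate(read_fasta(text), start=1):
        if count > limit:
            raise ValueError(f"more than {limit} sequences in input")
        cleaned = re.sub(r"[^A-Z]", "", sequence.upper()).encode("ascii")
        sha1 = hashlib.sha1(cleaned).digest()
        seguid = base64.b64encode(sha1).decode("ascii").rstrip("=")
        md5 = hashlib.md5(cleaned).hexdigest().upper()
        rows.append((header, seguid, md5))
    return rows
```

Records with identical sequences but different headers receive the same pair of identifiers, which is what makes the output usable as a cross-reference table.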

 

 

 

 

FUTURE PROSPECTS

 

The idea presented here can be applied not only to protein sequences but also to DNA sequences. The use of MD5 or SHA-1 digests should be transparent, as MagicMatch demonstrates for MD5 (10). Given that the possible numbers of MD5 and SHA-1 digests are enormous, one can combine any future protein and DNA sequences in the same database without the potential duplication of identifiers. We have already implemented this method successfully in our bioinformatics pipeline for completed genomes.

 

 

ACKNOWLEDGMENTS

 

This work was supported by the Office of Biological and Environmental Research through the Microbial Genome Project, U.S. Department of Energy, under Contract No. W-31-109-ENG-38.

 

REFERENCES

 

 

  1. Clark T, Martin S, Liefeld T. (2004) Globally distributed object identification for biological knowledgebases. Brief Bioinform. 5(1):59-70.
  2. Iragne, F., Barre, A., Goffard, N. and de Daruvar, A. (2004) AliasServer: a web server to handle multiple aliases used to refer to proteins. Bioinformatics 20(14):2331–2332.
  3. Leinonen, R., Diez, F.G., Binns, D., Fleischmann, W., Lopez, R. and Apweiler, R. (2004) UniProt archive. Bioinformatics 20(17):3236-3237.
  4. Benson, D.A., Karsch-Mizrachi, I., Lipman, D.J., Ostell, J. and Wheeler, D.L. (2005) GenBank. Nucleic Acids Res., 33, D34–D38.
  5. Wu, C.H., Yeh, L.S.L., Huang, H., Arminski, L., Castro-Alvear, J., Chen, Y., Hu, Z., Kourtesis, P., Ledley, R.S., Suzek, B.E. et al. (2003) The Protein Information Resource. Nucleic Acids Res., 31, 345–347.
  6. Boeckmann, B., Bairoch, A., Apweiler, R., Blatter, M.C., Estreicher, A., Gasteiger, E., Martin, M.J., Michoud, K., O’Donovan, C., Phan, I. et al. (2003) The SWISS-PROT protein knowledgebase and its supplement TrEMBL in 2003. Nucleic Acids Res., 31, 365–370.
  7. Bourne, P.E., Addess, K.J., Bluhm, W.F., Chen, L., Deshpande, N., Feng, Z., Fleri, W., Green, R., Merino-Ott, J.C., Townsend-Merino, W. et al. (2004) The distribution and query systems of the RCSB Protein Data Bank. Nucleic Acids Res., 32, D223–D225.
  8. Maglott, D., Ostell, J., Pruitt, K.D. and Tatusova, T. (2005) Entrez Gene: gene-centered information at NCBI. Nucleic Acids Res., 33, D54–D58.
  9. Altschul, S.F., Gish, W., Miller, W., Myers, E.W. and Lipman, D.J. (1990) Basic local alignment search tool. J. Mol. Biol., 215, 403–410.
  10. Smith, M., Kunin, V., Goldovsky, L., Enright, A.J. and Ouzounis, C.A. (2005) MagicMatch – cross-referencing sequence identifiers across databases. Bioinformatics 21(16):3429–3430.
  11. Snell, J., Tidwell, D. and Kulchenko, P. (2002) Programming Web Services with SOAP, 1st edn. O'Reilly Publishers, Sebastopol, CA.
  12. Wheeler, D.L. et al. (2005) Database resources of the National Center for Biotechnology Information. Nucleic Acids Res., 33, D39-D45.
  13. Bairoch A, Apweiler R, Wu CH, Barker WC, Boeckmann B, Ferro S, Gasteiger E, Huang H, Lopez R, Magrane M, Martin MJ, Natale DA, O'Donovan C, Redaschi N, Yeh LS (2005) The Universal Protein Resource (UniProt). Nucleic Acids Res. 33:D154-159.
  14. Kanehisa, M. and Goto, S. (2000) KEGG: Kyoto Encyclopedia of Genes and Genomes. Nucleic Acids Res., 28, 27-30.
  15. Secure Hash Standard (1995) Federal Information Processing Standards Publication 180-1.
  16. Rivest, R.L. (1991) The MD4 Message Digest Algorithm. Advances in Cryptology, CRYPTO’90 Proceedings, Springer-Verlag, pp 303-311.
  17. Babnigg, G. and Giometti, C.S. (2004) GELBANK: a database of annotated two-dimensional gel electrophoresis patterns of biological systems with completed genomes. Nucleic Acids Res. 32:D582-D585.
  18. Kawashima S, Ogata H, Kanehisa M (1999) AAindex: Amino Acid Index Database. Nucleic Acids Res. 27(1):368-9.
  19. Hirokawa T, Boon-Chieng S, Mitaku S. (1998) SOSUI: classification and secondary structure prediction system for membrane proteins. Bioinformatics 14(4):378-379.
  20. Cserzo M, Wallin E, Simon I, von Heijne G, and Elofsson A. (1997) Prediction of transmembrane alpha-helices in prokaryotic membrane proteins: the dense alignment surface method. Protein Eng. 10(6):673-6.