A SEquence Globally Unique IDentifier (SEGUID) Proteome Database
There are numerous publicly available protein sequence databases containing
millions of unique entries. However, each database uses its own identifiers
for the same protein sequence. Although these databases normally list the
aliases used at other sources, bringing the data together and keeping it up to
date requires substantial effort from the end user. We propose the use of a unique
sequence identifier (SEGUID) that is derived from the primary sequence itself
and easily generated by any user. SEGUIDs are resilient to changes in public
and private databases as they remain constant throughout the lifetime of a
given protein sequence. The SEGUID Proteome Database
(http://bioinformatics.anl.gov/seguid/) provides aliases for the annotated
entries available from several public databases and can be downloaded or
generated easily at remote sites. SEGUIDs
have been used in our proteomics laboratory for years and have proved useful
for integrating mass spectrometry results, two-dimensional gel electrophoresis
data, and bioinformatics information. Since SEGUIDs are stable, predictions
based on the primary sequence information need to be calculated only once.
On-line prediction servers could quickly generate a SEGUID for a submitted
sequence and return pre-calculated prediction results if available. We have
generated around 500 different calculations for the more than 2.5 million
sequences, and the results are available on-line or from our FTP site
(ftp://bioinformatics.anl.gov/seguid).
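
The two ideas above, deriving an identifier from the sequence itself and
computing each prediction only once, can be sketched as follows. This abstract
does not spell out the algorithm, so the sketch assumes the commonly used
SEGUID definition (as in Biopython's Bio.SeqUtils.CheckSum.seguid): the SHA-1
digest of the uppercase sequence, base64-encoded, with the trailing "="
padding removed. The PredictionCache class and its predictor argument are
hypothetical names introduced only for illustration; they are not part of the
SEGUID Proteome Database.

```python
import base64
import hashlib


def seguid(sequence: str) -> str:
    """Return a SEGUID-style identifier for a protein sequence.

    Assumption: SEGUID = base64(SHA-1(uppercase sequence)) without "=" padding.
    The input is assumed to be a bare residue string (no header, no whitespace).
    """
    digest = hashlib.sha1(sequence.upper().encode("ascii")).digest()
    # A 20-byte SHA-1 digest encodes to 28 base64 characters, one of which
    # is padding, so the identifier is always 27 characters long.
    return base64.b64encode(digest).decode("ascii").rstrip("=")


class PredictionCache:
    """Hypothetical helper: run a predictor once per unique sequence."""

    def __init__(self, predictor):
        self._predictor = predictor  # any function taking a sequence string
        self._results = {}           # SEGUID -> stored prediction result
        self.computed = 0            # number of times the predictor actually ran

    def predict(self, sequence: str):
        key = seguid(sequence)
        if key not in self._results:
            # First time this sequence is seen: compute and store the result.
            self._results[key] = self._predictor(sequence.upper())
            self.computed += 1
        return key, self._results[key]


if __name__ == "__main__":
    # Sequence length stands in for a real (expensive) prediction.
    cache = PredictionCache(len)
    print(cache.predict("MKTAYIAKQR"))
    print(cache.predict("mktayiakqr"))  # same SEGUID, served from the cache
    print("predictor runs:", cache.computed)
```

Because the key depends only on the residues, the same sequence retrieved from
two databases under different accession numbers maps to the same cache entry.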