Method Summary
The high-throughput structure determination pipelines developed by structural genomics programs offer a unique opportunity for data mining. One important question is how protein properties derived from the primary sequence correlate with a protein's propensity to yield X-ray quality crystals (crystallizability) and 3D X-ray structures. A set of protein properties was computed for over 1300 proteins that expressed well but were insoluble, and for ~720 unique proteins that resulted in X-ray structures. The correlation of the protein's isoelectric point (pI) and grand average of hydropathy (GRAVY) with crystallizability was analyzed for full-length and domain constructs of protein targets. In a second step, several additional properties that can be calculated from the protein sequence were added and evaluated. Using statistical analyses, we identified a set of attributes that correlate with a protein's propensity to crystallize and implemented a Support Vector Machine (SVM) classifier based on them. We have also created applications that analyze query sequences, provide optimal boundary information for them, and visualize the data.
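As an illustration of one of the sequence-derived properties mentioned above, GRAVY is conventionally defined as the mean Kyte-Doolittle hydropathy value over all residues in a sequence. The following is a minimal sketch under that assumption; the function name `gravy` and the table name `KD_HYDROPATHY` are hypothetical, and the utilities on this page may handle non-standard residues or input formats differently.

```python
# Kyte-Doolittle hydropathy values for the 20 standard amino acids.
KD_HYDROPATHY = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def gravy(sequence: str) -> float:
    """Return the grand average of hydropathy for a protein sequence.

    Whitespace is ignored and the sequence is upper-cased. Residues outside
    the 20 standard amino acids raise ValueError (an assumption of this
    sketch, not necessarily the behavior of the tools described here).
    """
    seq = "".join(sequence.split()).upper()
    if not seq:
        raise ValueError("empty sequence")
    try:
        total = sum(KD_HYDROPATHY[aa] for aa in seq)
    except KeyError as exc:
        raise ValueError(f"unsupported residue: {exc.args[0]}") from None
    return total / len(seq)


if __name__ == "__main__":
    # Arbitrary example sequence, not taken from the data set described above.
    print(round(gravy("MKTAYIAKQR"), 3))
```

Positive GRAVY values indicate a more hydrophobic sequence overall, and negative values a more hydrophilic one.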
Additional Utilities
generate single-line sequences from MFASTA file (they can be pasted into Excel)
calculate pI/MW/GRAVY for single-line sequences (they can be pasted back into Excel)
generate 2D histogram from x:y values using various bin-sizes (the results can be visualized in the following utility)
display 2D histograms (2D histograms can be generated by the previous utility)
generate MCSG Z-scores for single-line sequences (they can be pasted back into Excel)
metrics generation (ROC, AUC, and other metrics)

Questions? (gbabnigg)