CRISP-DM Process for Data Science vs. Hard Science (Chemistry)

Phase | Data Science | Alchemy | Hard Science (Chemistry)
Business Understanding | Plan project objectives & requirements. What data is required to answer which business questions? | What constitutes gold here? Can I get gold out of it? | Define investigation objectives & requirements. Which regulation am I trying to prove was violated? What chemistry questions need to be answered for this project?
Data Understanding | Set up data mining, get familiar with the data & form a hypothesis. What insights can I learn from the data? What problems, if any, are in the data? | What raw materials am I going to collect? | Collect samples/data & form a hypothesis to analyze/test for. What soil/water samples do I need to collect? How am I going to collect those soil/water samples?
Data Preparation | Prepare data by cleaning & joining the features identified for use in the model. Is data missing & how do I address it? What features can be used & will they need transformations? | Purify raw ingredients. Get rid of junk and contaminants. | Perform acid digestion on soil/water samples to purify/concentrate soluble metals in the acid solution. Then remove contaminants by paper filtering the acid solution.
Modeling | Figure out the modeling technique(s) best suited to the data & apply the selected model(s). | Transform the raw ingredients into gold. (Magic!) | Analyze the filtered acid solution by instrument (ICP-OES, Inductively Coupled Plasma Optical Emission Spectroscopy). A plasma heats the metals in the acid solution into an excited state, in which each metal emits light at characteristic wavelengths. The measured wavelength intensities are recorded as observed label data (yhat) & read against a single-variable (linear) regression line, the standardization curve (y), that was previously modeled, evaluated, & validated on the instrument using a training set of known samples & their resulting data (x, Supervised Machine Learning).
Evaluation | Compare model(s) against the test/unknown dataset for best fit against business objectives & hypothesis. Which model provides the best score for the objectives? | The gold is taken to the appraiser & they tell you how much your gold is worth. | The wavelength intensities (yhat, observed label data) are applied inversely to the standardization curve (y, inverse linear regression) to predict the observed concentration (xhat, observed feature) for each wavelength that was previously trained, modeled, & validated on the instrument. The data for all wavelengths are processed this way & evaluated against a small number of known samples within the unknown/test data set to check for precision & bias within that batch of analyses. The final reported metal concentrations (xhat, observed features) are then checked against the questions in the Business Understanding section by comparing them to federal regulatory & permit levels.
Deployment | Deploy the best-fit model to new unknown data & continue to monitor the model during real-time use. | The whole village comes to buy your gold. | The predicted values (xhat) that exceed the federal regulatory & permit levels are then deployed as a formal scientific report to the Dept. of Justice, used to support criminal charges & as evidence in a federal trial against the alleged suspects.
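
The Modeling and Evaluation steps in the Chemistry column can be sketched in a few lines of Python. This is only an illustration, not the instrument software's actual procedure: the standard concentrations, intensities, QC recovery window, and regulatory limit below are all made-up values, and the function names are hypothetical. It fits a standardization curve (intensity vs. known concentration), inverts it to predict concentration from a measured intensity, checks a known QC sample, and compares an unknown sample to a limit.

```python
def fit_calibration(concentrations, intensities):
    """Ordinary least-squares line: intensity = slope * concentration + intercept."""
    n = len(concentrations)
    mean_x = sum(concentrations) / n
    mean_y = sum(intensities) / n
    sxx = sum((x - mean_x) ** 2 for x in concentrations)
    sxy = sum((x - mean_x) * (y - mean_y)
              for x, y in zip(concentrations, intensities))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict_concentration(intensity, slope, intercept):
    """Inverse prediction: solve the calibration line for concentration (xhat)."""
    return (intensity - intercept) / slope


def percent_recovery(measured, expected):
    """Measured concentration as a percentage of the known concentration."""
    return 100.0 * measured / expected


# Hypothetical calibration standards (x, mg/L) and made-up intensities (y, counts).
standard_conc = [0.0, 1.0, 2.0, 5.0, 10.0]
standard_intensity = [50.0, 1050.0, 2050.0, 5050.0, 10050.0]

# "Training": fit the standardization curve on the known standards.
slope, intercept = fit_calibration(standard_conc, standard_intensity)

# QC check: a known sample run inside the batch to look for bias.
qc_known = 4.0       # assumed true concentration, mg/L
qc_intensity = 4050.0  # made-up instrument reading
qc_measured = predict_concentration(qc_intensity, slope, intercept)
# The 90-110 % window is an illustrative assumption, not a quoted requirement.
qc_ok = 90.0 <= percent_recovery(qc_measured, qc_known) <= 110.0

# Unknown sample: observed intensity (yhat) -> predicted concentration (xhat).
sample_conc = predict_concentration(7550.0, slope, intercept)

# Compare against a hypothetical regulatory limit (mg/L).
HYPOTHETICAL_LIMIT_MG_L = 5.0
exceeds_limit = sample_conc > HYPOTHETICAL_LIMIT_MG_L
```

Note the direction of the two steps: the curve is fitted with concentration as the input and intensity as the output, but it is used in reverse on unknown samples, which is why the table describes Evaluation as inverse linear regression.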