Data Science Process vs Chemistry Process
CRISP-DM Process for Data Science vs. Hard Science (Chemistry)
CRISP-DM Phase | Data Science | Alchemy | Hard Science (Chemistry) |
---|---|---|---|
Business Understanding | Plan project objectives & requirements. What data is required to answer which business questions? | What constitutes gold here? Can I get gold out of it? | Plan investigation objectives & requirements. Which regulation am I trying to show was violated? What chemistry questions need to be answered for this project? |
Data Understanding | Set up data mining, get familiar with the data & form a hypothesis. What insights can I learn from the data? What problems, if any, are in the data? | What raw materials am I going to collect? | Collect samples/data and form a hypothesis to analyze/test for. What soil/water samples do I need to collect? How am I going to collect those soil/water samples? Design the analytical/data experiment and include the required quality controls. |
Data Preparation | Prepare data by cleaning & joining the features identified for use in the model. Is data missing & how do I address it? What features can be used & will they need transformations? | Purify raw ingredients. Get rid of junk and contaminants. | Perform an acid digestion on unknown soil/water samples to purify/concentrate soluble metals in the acid solution. (Transformation) Then remove contaminants through paper filtering of the acid solution. (Physically purifying data) Prepare the unknown samples for instrumental analysis with the planned quality controls. |
Modeling | Figure out the modeling technique(s) best suited for the data & apply the selected model(s). | Transform raw ingredients into gold. (Magic!) | Analyze the filtered acid solution by instrument (ICP-OES). This transforms the metals within the acid solution by using a plasma to heat the metals into an excited state that produces particular wavelengths of light. (Physically producing data) Solutions of known concentration are analyzed by the instrument to produce a multi-point regression model. (Modeling) |
Evaluation | Compare model(s) against a test/unknown dataset for best fit against business objectives & hypothesis. Which model provides the best score for the objectives? | The gold is taken to the appraiser & they tell you how much your gold is worth. | The multi-point regression is optimized to provide the best wavelength signal for what you are trying to predict. (Supervised ML model with labeled data) This can be done by changing settings within the instrument, adding additional quality controls, or changing the way the samples are prepared to provide a higher signal-to-noise ratio. (Hyper-parameter tuning) |
Deployment | Deploy the best-fit model to new unknown data & continue to monitor the model during real-time use. | The whole village comes to buy your gold. | Once optimized, the unknown samples are then analyzed by the instrument using the optimized supervised ML (SML) regression model to predict wavelength signals for each unknown sample. (Deployment) These wavelength signals are then back-calculated through inverse regression to provide a quantitative measurement/concentration of the metal analyte in question. The SML regression model is continually assessed for further optimization by use of known/labeled quality control checks. The predicted metal concentration values are then compared against the questions from the Business Understanding phase, usually by comparing them to federal regulatory & permit levels. A formal scientific report to the Dept. of Justice is developed as evidence and used to support criminal charges in a federal trial against the alleged suspects. |
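
The calibration and back-calculation described in the Modeling and Deployment rows can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the actual lab procedure: it assumes a straight-line calibration fitted by ordinary least squares, and the function names, standard concentrations, and signal values are all hypothetical numbers made up for the example.

```python
from statistics import mean


def fit_calibration(concentrations, intensities):
    """Fit intensity = slope * concentration + intercept by least squares."""
    x_bar = mean(concentrations)
    y_bar = mean(intensities)
    sxx = sum((x - x_bar) ** 2 for x in concentrations)
    sxy = sum((x - x_bar) * (y - y_bar)
              for x, y in zip(concentrations, intensities))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return slope, intercept


def predict_concentration(intensity, slope, intercept):
    """Back-calculate a concentration by inverting the fitted line."""
    return (intensity - intercept) / slope


def qc_recovery(measured, expected):
    """Percent recovery of a quality control check of known concentration."""
    return 100.0 * measured / expected


# Hypothetical multi-point calibration: known standards (mg/L) and
# made-up emission intensities (arbitrary counts) at one wavelength.
standards_mg_per_l = [0.0, 1.0, 5.0, 10.0]
signal_counts = [50.0, 1050.0, 5050.0, 10050.0]

# "Modeling": build the regression from the labeled standards.
slope, intercept = fit_calibration(standards_mg_per_l, signal_counts)

# "Deployment": convert an unknown sample's signal to a concentration.
unknown_mg_per_l = predict_concentration(3050.0, slope, intercept)

# Ongoing monitoring: a QC check prepared at a known 2.0 mg/L.
qc_mg_per_l = predict_concentration(2050.0, slope, intercept)
recovery_pct = qc_recovery(qc_mg_per_l, 2.0)

print(f"unknown: {unknown_mg_per_l:.2f} mg/L, QC recovery: {recovery_pct:.1f}%")
```

In this sketch the labeled standards play the role of training data, the unknown sample is the new data the model is deployed on, and the QC recovery is the monitoring metric. Which recovery range counts as acceptable depends on the method and lab, so no threshold is assumed here.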