This article explains the most basic and most important parts of a regression/classification pipeline (differences between the two are pointed out where relevant). Additional steps can be added depending on the domain and industry you’re working in. Model deployment and cloud integration generally follow this process, but that’s not what we’re talking about today.
One point that is not highlighted below is the data cleaning that must happen before the data is wrangled and mined. This is probably the most important part of diving into analytics: cleaning and transforming the data so that it makes sense and all anomalies are caught before it enters the pre-processing stage.
Here’s the Pipeline:
- Collect data from varied sources, and combine (concat/merge) the datasets if there is more than one.
- Read the dataset and check the features. Understand the features first: for example, how each feature relates to the target (say, “credit score”).
- Check for null values and “describe” the dataset, i.e. understand the data types and how the values are distributed in the data we have collected.
- Decide how to treat null values. This depends heavily on the business case at hand: it often happens that even columns with more than 70% nulls aren’t imputed but are still kept as valuable information by turning them into dummies (an indicator of whether a value is present).
- Check for outliers. “Outlier” is a context-dependent term: even when a variable appears to contain outliers to the naked eye, it may not contain any, because our understanding of the feature determines whether we are seeing a distant (rare/special) value or a true outlier. Imagine the classic case of house prices, where a few extremely high prices may be genuine values for luxury properties rather than errors.
- After outlier treatment (i.e. removing outliers or deciding to work with them), we move on to feature transformation if required. Some algorithms become biased towards features with much larger values than the others; this mostly affects algorithms that rely on distances or penalized coefficients (k-nearest neighbours, SVMs and regularized linear models, for example), in regression as well as classification. Hence, sometimes we do need to scale or transform features. Another reason to transform a feature is to keep extreme values in the data while reducing their influence (a log transformation, for example).
- Now, finally, we move on to model building. We start by splitting the data into train and test sets and then fitting a model on the training set. For regression, I prefer to start with a statistical linear model (to understand the features through their p-values), and for classification with a decision tree (again, to visualize the important features, which are the ones used to split nodes at each depth).
- After the initial algorithms, one can either try other algorithms (to improve accuracy/score) or try feature selection. Correlation heat-maps and VIF help remove highly correlated variables, which provide largely the same information to the model. Backward elimination drops features step by step based on the p-values obtained from the statistical model, while recursive feature elimination repeatedly drops the weakest features according to the fitted model’s coefficients or feature importances.
- The model building process is really only starting at this point, because most of the time from here on is spent comparing R-squared and RMSE values in the case of regression, and confusion matrices, sensitivity, specificity, F1 score, ROC curves and AUC in the case of classification.
- At this point, some analytics professionals also try “polynomial features”, a very powerful technique for capturing interactions within and across the features in the dataset. When you run a feature elimination algorithm on the expanded set of interaction features, you obtain a varied set of candidates from which you can select the strongest ones. The best part is that many of the selected features will be interactions, which explain much more about the data.
- Another very important technique is regularization, which manages the bias-variance trade-off by accepting a little bias in exchange for lower variance. Lasso (L1) penalizes the beta coefficients so that they shrink and can even be reduced to exactly zero (kind of like a feature elimination technique, but still very different). Ridge (L2) also shrinks the coefficients but does not remove any variable, so it is useful when we have very few features (or in a domain where all features must be presented as part of the business case), because it keeps every feature in the model.
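Several of the steps above can be sketched in a few lines with NumPy and scikit-learn. This is only a minimal illustration under stated assumptions, not a recommended recipe: the data is synthetic, the feature names (`income`, `age`, `tenure`) are hypothetical, and the penalty strengths (`alpha`) are arbitrary values that would normally be tuned. It shows a missingness indicator instead of imputation, a log transformation of a skewed feature, a train/test split, polynomial (interaction) features, and a Lasso versus Ridge comparison on R-squared and RMSE.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

rng = np.random.default_rng(0)
n = 500

# Hypothetical raw features (synthetic, for illustration only).
income = rng.lognormal(mean=10.0, sigma=1.0, size=n)  # right-skewed values
age = rng.uniform(20, 70, size=n)
tenure = rng.uniform(0, 20, size=n)
tenure[rng.random(n) < 0.3] = np.nan                  # roughly 30% missing

# Null treatment: keep an indicator of missingness instead of imputing.
tenure_missing = np.isnan(tenure).astype(float)

# Feature transformation: the log keeps extreme incomes but compresses them.
log_income = np.log1p(income)

# Synthetic target with an age x income interaction (assumed for this demo).
y = (2.0 * log_income + 0.05 * age + 0.01 * age * log_income
     - 1.5 * tenure_missing + rng.normal(0.0, 0.5, size=n))

X = np.column_stack([log_income, age, tenure_missing])
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)


def build(regressor):
    """Expand to degree-2 terms, scale them, then fit the given regressor."""
    return make_pipeline(
        PolynomialFeatures(degree=2, include_bias=False),
        StandardScaler(),
        regressor,
    )


# alpha values are arbitrary here; in practice they would be tuned.
models = {"lasso": build(Lasso(alpha=0.1)), "ridge": build(Ridge(alpha=1.0))}

results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    pred = model.predict(X_test)
    coef = model[-1].coef_  # coefficients of the final (regression) step
    results[name] = {
        "r2": r2_score(y_test, pred),
        "rmse": float(np.sqrt(mean_squared_error(y_test, pred))),
        "n_features": coef.size,
        "n_zero": int(np.sum(coef == 0.0)),
    }
    print(name, results[name])
```

Comparing the `n_zero` counts is the interesting part: Lasso can drive some of the expanded coefficients to exactly zero, whereas Ridge shrinks them but keeps all of them in the model. The scaler is placed before the regressors because both penalties are sensitive to the scale of the features.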
The base of the pipeline will remain the same, but additional methods can be added as you acquire domain knowledge in the field you work in, or want to work in. Hyper-parameter tuning is also very important, especially in classification: it lets you build various instances of a base algorithm by changing the settings that control how the algorithm learns from the data.
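As a rough sketch of what hyper-parameter tuning can look like, the example below uses scikit-learn’s `GridSearchCV` to try several instances of a decision tree and then reports the classification metrics mentioned earlier. The dataset is synthetic (`make_classification`), and the grid values, the F1 scoring choice and the 5-fold setting are assumptions made for illustration rather than recommendations.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import confusion_matrix, f1_score, roc_auc_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification data (illustrative only).
X, y = make_classification(
    n_samples=600, n_features=8, n_informative=4, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Each combination in the grid is one "instance" of the base algorithm.
param_grid = {
    "max_depth": [2, 4, 6, None],
    "min_samples_leaf": [1, 5, 20],
}
search = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_grid,
    scoring="f1",
    cv=5,
)
search.fit(X_train, y_train)

# Evaluate the best instance on the held-out test set.
best_tree = search.best_estimator_
pred = best_tree.predict(X_test)
proba = best_tree.predict_proba(X_test)[:, 1]

cm = confusion_matrix(y_test, pred)
tn, fp, fn, tp = cm.ravel()
sensitivity = tp / (tp + fn)  # true positive rate
specificity = tn / (tn + fp)  # true negative rate
f1 = f1_score(y_test, pred)
auc = roc_auc_score(y_test, proba)

print("best params:", search.best_params_)
print("confusion matrix:\n", cm)
print("sensitivity:", sensitivity, "specificity:", specificity)
print("F1:", f1, "AUC:", auc)
```

The test set is kept out of the search entirely, so the reported metrics describe how the selected tree behaves on data it was not tuned on.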