
Among all the classifiers provided by Sklearn, two stand out for their similarities: SGDClassifier and LogisticRegression. So, what differentiates the two? In this post, we will explore the key differences and compare SGD Classifier vs Logistic Regression.
Let’s start with the most important ones.
Optimization Difference
Logistic Regression uses solvers like `lbfgs`, `saga`, `newton-cg`, etc. to fit the data. Because the logistic loss is convex, these solvers converge to the global minimum, with solvers like `lbfgs` and `newton-cg` using second-order (curvature) information. These algorithms are fast and converge quickly for datasets that fit in memory. Logistic Regression requires access to the entire dataset to train the model, so if additional data arrives, we need to retrain the entire model.
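For illustration, here's a minimal sketch of fitting LogisticRegression with an explicit solver choice. The synthetic dataset is purely an assumption for the example:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Illustrative synthetic data
X, y = make_classification(n_samples=1000, n_features=20, random_state=42)

# lbfgs (the default) uses curvature information and needs
# the full dataset in memory for every iteration
clf = LogisticRegression(solver="lbfgs", max_iter=1000)
clf.fit(X, y)
print(clf.score(X, y))
```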
SGD on the other hand uses stochastic gradient descent to train the model. Instead of processing the entire dataset at once, it fits one sample at a time. This can be extended to mini-batches as well, by passing batches of samples to `partial_fit` instead of single samples. This makes SGDClassifier highly scalable.
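Here's a minimal sketch of mini-batch training with `partial_fit`. The batch size and data are illustrative, and `loss="log_loss"` assumes a recent scikit-learn version (it was called `"log"` before 1.1):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=42)
classes = np.unique(y)  # partial_fit needs the full class list on the first call

clf = SGDClassifier(loss="log_loss")
batch_size = 100
for start in range(0, len(X), batch_size):
    X_batch = X[start:start + batch_size]
    y_batch = y[start:start + batch_size]
    clf.partial_fit(X_batch, y_batch, classes=classes)
print(clf.score(X, y))
```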
Parameter Difference
A difference in parameters is the presence of `learning_rate` (along with `eta0`, the initial learning rate) in SGDClassifier. Logistic Regression doesn’t have `learning_rate` as a parameter, whereas SGD does. SGD uses stochastic gradient descent for optimization, which needs a learning rate to make updates to the weights. Logistic Regression solvers handle step sizes internally without requiring input from us.
The learning rate is a critical parameter and needs to be carefully selected, or at least tuned, to ensure we pick an optimal value. Too low and the model converges slowly. Too high and the loss oscillates because the updates overshoot the minimum.
The usage of a learning rate brings in another aspect: learning rate schedulers. Generally, it is a good idea to vary the learning rate as we go further into the training process, often decreasing it. This modification of the learning rate is handled by learning rate schedulers. SGDClassifier provides a few options via its `learning_rate` parameter: `constant`, `optimal`, `invscaling` and `adaptive`.
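A quick sketch of configuring these schedules. The `eta0` values here are illustrative, not tuned:

```python
from sklearn.linear_model import SGDClassifier

# 'constant' keeps eta = eta0 for the whole run
sgd_constant = SGDClassifier(learning_rate="constant", eta0=0.01)

# 'invscaling' decays the rate as eta = eta0 / t**power_t
sgd_invscaling = SGDClassifier(learning_rate="invscaling", eta0=0.01, power_t=0.5)

# 'adaptive' keeps eta0 until the loss stops improving, then divides the rate by 5
sgd_adaptive = SGDClassifier(learning_rate="adaptive", eta0=0.01)
```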
Regularization
Logistic Regression and SGDClassifier both support regularization but have differences in implementation. Logistic Regression uses the `C` parameter, the inverse of regularization strength: lower values imply stronger regularization.
SGD, on the other hand, uses the `alpha` parameter to directly control regularization strength. Both models have `penalty` (which accepts `l1`, `l2` and `elasticnet`) and `l1_ratio` (the elastic-net mixing ratio) parameters.
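A sketch contrasting the two parameterizations. The mapping `alpha ≈ 1 / (C * n_samples)` is a rough rule of thumb for comparable regularization, not an exact equivalence:

```python
from sklearn.linear_model import LogisticRegression, SGDClassifier

n_samples = 1000  # illustrative dataset size

# LogisticRegression: C is the *inverse* of regularization strength
log_reg = LogisticRegression(C=0.1, penalty="l2")  # small C -> strong regularization

# SGDClassifier: alpha multiplies the penalty term directly
sgd = SGDClassifier(loss="log_loss", penalty="l2",
                    alpha=1.0 / (0.1 * n_samples))  # large alpha -> strong regularization
```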
Online Training
Logistic regression doesn’t support online learning and needs access to all the data at once, which means the data should fit in memory. SGD, on the other hand, is well suited for online learning, where we can fit one sample at a time or create mini-batches and train the model in batch mode.
Naturally, it doesn’t need all the data to fit in memory at once and works well for cases where we need to regularly update the model with incoming data. SGDClassifier is particularly useful for streaming data or scenarios where memory is a constraint.
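Here's a sketch of incrementally updating a model as data streams in. The `stream_batches` generator is a hypothetical stand-in for a real data source:

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

rng = np.random.default_rng(0)

def stream_batches(n_batches=10, batch_size=50, n_features=20):
    """Hypothetical stand-in for an incoming data stream."""
    for _ in range(n_batches):
        X = rng.normal(size=(batch_size, n_features))
        y = (X[:, 0] > 0).astype(int)  # toy labelling rule for illustration
        yield X, y

clf = SGDClassifier(loss="log_loss")
classes = np.array([0, 1])  # all classes must be declared on the first call

for X_batch, y_batch in stream_batches():
    clf.partial_fit(X_batch, y_batch, classes=classes)
```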
Support for Loss Functions
SGDClassifier supports multiple losses like `hinge`, `log_loss`, `modified_huber`, etc., while Logistic Regression supports only log loss. By simply switching from `log_loss` to `hinge`, we get linear SVM-like behavior instead of the fixed log loss of Logistic Regression.
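A minimal sketch of this switch:

```python
from sklearn.linear_model import SGDClassifier

# log_loss -> probabilistic, logistic-regression-like model
logistic_like = SGDClassifier(loss="log_loss")

# hinge -> linear SVM-like model (no predict_proba available)
svm_like = SGDClassifier(loss="hinge")
```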
Conclusion: SGD Classifier vs Logistic Regression
- Online training: this is the biggest upside of using SGD Classifiers. If you need to update your model frequently, then SGD becomes the better choice
- Memory constraints: if the data is too large to fit in memory at once, then again SGDClassifier becomes the better choice due to its support for online training
- Data fits in memory: in case fitting the data in memory is not an issue, then using LogisticRegression is the better choice because its solvers efficiently converge on the global minimum using the entire dataset at once
- Learning rate sensitivity: The performance of the SGD classifier is sensitive to learning rate selection. Learning rate needs to be selected/tuned carefully to get optimal results
- Stochastic nature of SGD: since the SGD classifier fits one sample/batch at a time, the noise introduced by each update acts as a form of regularization on its own and helps with generalization
That’s all on SGD Classifier vs Logistic Regression. Hope that clears up a few questions (and brings up a few more)! If you’d like to read more, check out this post on KV-Cache.