Unlock the Power of AutoML: Finding the Best ML Model Effortlessly
Chapter 1: Understanding the Role of Automation in Data Science
Automation has long been a transformative force in various fields, shaping the way we approach tasks. In data science, automation allows us to enhance our cognitive processes, enabling computers to label and analyze data on a much larger scale than manual methods could ever achieve.
Historically, teaching machines to learn was a labor-intensive undertaking, but the advent of AutoML has reshaped this landscape. Constructing a predictive model involves several critical steps: gathering data, preparing datasets, training machine learning models, fine-tuning hyperparameters, deploying the models, and monitoring their performance. AutoML techniques aim to minimize manual intervention across these steps.
Many data scientists leverage Python, with scikit-learn being a popular choice due to its user-friendly APIs that simplify model development. However, there's a powerful tool that enhances scikit-learn's capabilities: auto-sklearn. For a deeper insight into this tool, you can refer to the paper "Efficient and Robust Automated Machine Learning" by Feurer et al., presented at NIPS 2015.
Section 1.1: The Necessity of Auto-Sklearn
While preparing and cleaning data is essential, one of the most daunting tasks for data scientists is selecting the most suitable model. Before AutoML was widely adopted, my team would spend weeks on this step alone. Traditional methods are worth learning as a beginner, but they are rarely the most efficient choice for real-world applications: continuously tweaking hyperparameters without substantial progress wastes both time and cognitive effort.
This is where auto-sklearn comes into play, automating hyperparameter tuning and model selection. You can set time limits for the training process, ensuring efficiency. Moreover, it allows for parallel Bayesian optimization on a distributed system.
To install auto-sklearn, you can use the following commands:
pip install auto-sklearn
python -c "import autosklearn; print(autosklearn.__version__)"
# 0.14.6
Refer to the auto-sklearn documentation for additional installation methods.
Subsection 1.1.1: Building an ML Model with Auto-Sklearn
For our demonstration, let's utilize the wine quality dataset from Kaggle, which consists of 11 predictor variables and one response variable, representing the quality of wine across six categories. Our goal is to create a machine learning model that predicts wine quality based on these predictors.
The following code snippet will load the dataset and prepare it for training by removing the ID column and separating the response variable:
import pandas as pd

# Load the dataset and drop the Id column, which carries no predictive signal
df = pd.read_csv('./WineQT.csv')
df.drop('Id', axis=1, inplace=True)

# Separate the response (quality) from the 11 predictors
y = df.quality.copy()
X = df.drop('quality', axis=1)
Next, we will split the dataset into training and testing sets:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=1)
So far, this process mirrors the traditional workflow using scikit-learn. Now, let's leverage the auto-sklearn API to train multiple models and identify the best one:
from autosklearn.classification import AutoSklearnClassifier

model = AutoSklearnClassifier(
    time_left_for_this_task=5*60,  # total search budget: 5 minutes
    per_run_time_limit=30          # cap each candidate model at 30 seconds
)
In this code, we specify the duration for individual runs and the overall training time. These values are illustrative; you should adjust them based on your dataset size and computational resources.
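One detail worth making explicit: nothing has been trained yet. auto-sklearn mirrors the scikit-learn API, so a single call to fit launches the search within the configured budget:

model.fit(X_train, y_train)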
To evaluate the model's performance on the test set, we use:
from sklearn.metrics import accuracy_score
y_hat = model.predict(X_test)
acc = accuracy_score(y_test, y_hat)
print(f"Model Accuracy: {acc}")
The exact score depends on your time budget and hardware, but even a short search typically yields a respectable accuracy on this dataset.
Chapter 2: The Challenges of Manual Model Building
If we were to manually build a classifier using scikit-learn, it might look like this:
from sklearn.ensemble import RandomForestClassifier

# max_features must not exceed the dataset's 11 predictors
model = RandomForestClassifier(max_features=7, n_estimators=512)
Finding optimal values for parameters like max_features and n_estimators often requires extensive trial and error. This complexity increases if you need to experiment with various algorithms, making the process tedious.
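To make the contrast concrete, here is a minimal sketch of how such a search is typically done by hand with scikit-learn's GridSearchCV (the parameter grid is illustrative, not a recommendation):

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Illustrative grid; every added value multiplies the number of model fits
param_grid = {
    "n_estimators": [128, 256, 512],
    "max_features": ["sqrt", "log2", None],
    "max_depth": [None, 8, 16],
}

search = GridSearchCV(
    RandomForestClassifier(random_state=1),
    param_grid,
    cv=5,       # 27 combinations x 5 folds = 135 fits, for a single algorithm
    n_jobs=-1,  # use all available cores
)
search.fit(X_train, y_train)
print(search.best_params_)

And this grid covers only one algorithm with three hyperparameters; repeating it for several algorithm families is exactly the tedium AutoML removes.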
In contrast, auto-sklearn evaluates approximately 15 different classifiers and numerous hyperparameter configurations within a short time frame. You can also review the performance of the models auto-sklearn evaluated by accessing the leaderboard:
model.leaderboard()
If you wish to delve deeper into the models tested, the show_models method provides more detailed information.
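A minimal sketch (note that the return type of show_models has varied between auto-sklearn releases, so printing it is the safest way to inspect the output):

# Inspect the pipelines that made it into the final ensemble
print(model.show_models())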
Section 2.1: Handling Large Datasets with AutoML
When dealing with large datasets, AutoML techniques can be resource-intensive and time-consuming. However, by leveraging domain knowledge, we can optimize the search process. For instance, we can narrow down the classifier options to just Gaussian Naive Bayes and K-Nearest Neighbors, and disable feature preprocessing to save time:
automl = AutoSklearnClassifier(
    include={
        'classifier': ["gaussian_nb", "k_nearest_neighbors"],
        'feature_preprocessor': ["no_preprocessing"]
    },
    exclude=None
)
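Fitting and inspecting the restricted search then proceeds exactly as before; a minimal usage sketch reusing the earlier train/test split:

automl.fit(X_train, y_train)
print(automl.leaderboard())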
This strategy grants us greater control over the AutoML search space. For smaller datasets, however, I recommend keeping the default settings and letting auto-sklearn explore the full space rather than constraining it with your own heuristics.
Final Thoughts on the Future of AutoML
In conclusion, AutoML libraries like auto-sklearn address a significant challenge faced by data scientists. Without such tools, valuable time and expertise can be squandered. Nonetheless, it's essential to recognize that an AutoML model may not always be superior. A skilled data scientist might still discover a more effective model through their domain expertise.
Could AutoML represent the future of data science? I believe so. Much like the initial skepticism surrounding automatic transmissions in vehicles, AutoML may not be flawless now, but it has the potential to surpass traditional methods in the near future.
Dataset Credits: The wine quality dataset is by M Yasser from Kaggle, licensed under CC0: Public Domain.
Thanks for reading! Connect with me on LinkedIn, Twitter, and Medium.