- Tuning the hyper-parameters of an estimator (sklearn)
- Optimizing hyperparameters with hyperopt
- Complete Guide to Parameter Tuning in Gradient Boosting (GBM) in Python
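The first link covers sklearn's built-in search utilities; a minimal sketch of that approach is below. The model, parameter grid, and scoring metric are illustrative assumptions, not a recommendation for any particular competition.

```python
# Minimal sklearn grid-search sketch (illustrative model and grid).
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

# Toy data standing in for a real competition dataset.
X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

param_grid = {
    "n_estimators": [100, 300],
    "max_depth": [3, 5],
    "learning_rate": [0.05, 0.1],
}

search = GridSearchCV(
    GradientBoostingClassifier(random_state=0),
    param_grid,
    scoring="roc_auc",  # use the competition metric whenever possible
    cv=5,
    n_jobs=-1,
)
search.fit(X, y)
print(search.best_params_, search.best_score_)
```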
KazAnova's Competition Pipeline (https://github.com/kaz-Anova)
- Understand the problem
- Exploratory analysis
- Define CV strategy
- Feature engineering
- Modelling
- Ensembling
Understand the problem broadly
- Type of problem
- How BIG is the data
- Hardware needed (CPUs, GPUs, RAM, Disk space)
- Software needed (Tensorflow, Keras, sklearn, LightGBM, xgboost)
- What is the metric being tested on?
- Previous code relevant?
Do some EDA
- Plot histograms of variables and check that each feature's distribution looks similar between train and test (see the sketch after this list).
- Plot features vs the target variable and vs time.
- Consider univariate predictability metrics (IR, R, AUC).
- Bin numerical features and look at correlation metrics.
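A minimal sketch of these checks, assuming a binary target; the random frames below are only stand-ins for real train/test data.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from sklearn.metrics import roc_auc_score

# Synthetic stand-ins for the real train/test frames.
rng = np.random.default_rng(0)
train = pd.DataFrame({"f1": rng.normal(size=1000), "f2": rng.exponential(size=1000)})
train["target"] = (train["f1"] + rng.normal(size=1000) > 0).astype(int)
test = pd.DataFrame({"f1": rng.normal(size=500), "f2": rng.exponential(size=500)})

for col in ["f1", "f2"]:
    # Check that the feature's distribution looks similar in train and test.
    plt.hist(train[col], bins=50, alpha=0.5, density=True, label="train")
    plt.hist(test[col], bins=50, alpha=0.5, density=True, label="test")
    plt.title(col)
    plt.legend()
    plt.show()

    # Univariate predictability: AUC of the raw feature against a binary target
    # (values near 0.5 mean the feature carries little signal on its own).
    auc = roc_auc_score(train["target"], train[col])
    print(col, "univariate AUC:", round(max(auc, 1 - auc), 3))
```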
Decide CV strategy
- This step is critical: how well it goes is a good indication of what will happen in the rest of the competition.
- People have won by just selecting the right way to validate.
- Is time important? Split by time. Time-based validation.
- Are there different entities in test than in train? Stratified validation.
- Is it completely random? Random validation (random K-fold).
- Combination of all the above.
- Use the LB to test the validation scheme (a sketch of these splitters follows).
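A sketch of the sklearn splitters that correspond to these cases; the toy arrays stand in for real competition data.

```python
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold, TimeSeriesSplit

# Toy data standing in for a real dataset (assumed sorted by time for the
# time-based splitter).
X = np.random.rand(100, 5)
y = np.random.randint(0, 2, size=100)

# Completely random test set -> plain (random) K-fold.
kf = KFold(n_splits=5, shuffle=True, random_state=0)

# Imbalanced target or different entities -> stratified K-fold.
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# Test set is later in time than train -> time-based splits, never shuffled.
tss = TimeSeriesSplit(n_splits=5)

for train_idx, valid_idx in skf.split(X, y):
    pass  # fit on train_idx, evaluate on valid_idx, mimicking how test was built
```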
Feature engineering
- Image classification: Scaling, shifting, rotations, CNNs. (Data Science Bowls)
- Sound classification: Fourier transforms, MFCC, spectrograms, scaling. (Tensorflow speech recognition)
- Text classification: TF-IDF, SVD, stemming, spell checking, stop-word removal, n-grams. (StumbleUpon Evergreen Classification)
- Time series: Lags, weighted averaging, exponential smoothing. (Walmart recruitment)
- Categorical: Target encoding (see the sketch after this list), frequency, one-hot, ordinal, label encoding. (Amazon employee)
- Numerical: Scaling, binning, derivatives, outlier removal, dimensionality reduction. (Africa soil)
- Interactions: Multiplications, divisions, group-by features, concatenations. (Homesite)
- Recommenders: Features on transactional history, item popularity, frequency of purchase. (Acquire Valued Shoppers)
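As one concrete example from the list above, here is a minimal sketch of out-of-fold target encoding for a categorical feature; `df`, `cat`, and `target` are hypothetical names, and the random data only makes it runnable. Computing the means inside folds avoids leaking the target into the encoding.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import KFold

# Random stand-in data: one categorical column and a binary target.
df = pd.DataFrame({
    "cat": np.random.choice(["a", "b", "c"], size=1000),
    "target": np.random.randint(0, 2, size=1000),
})

global_mean = df["target"].mean()
df["cat_te"] = np.nan

for train_idx, valid_idx in KFold(n_splits=5, shuffle=True, random_state=0).split(df):
    # Category means computed only on the training part of each fold.
    means = df.iloc[train_idx].groupby("cat")["target"].mean()
    df.loc[df.index[valid_idx], "cat_te"] = df.iloc[valid_idx]["cat"].map(means)

# Categories never seen in a training fold fall back to the global mean.
df["cat_te"] = df["cat_te"].fillna(global_mean)
print(df.head())
```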
Modeling
- Image classification: CNNs (ResNet, VGG, DenseNet, ...)
- Sound classification: CNNs (CRNN), LSTM
- Text classification: GBMs, Linear, DL, Naive Bayes, kNNs, LibFM, LIBFFM (see the sketch after this list)
- Time series: Autoregressive models, ARIMA, Linear, GBMs, DL, LSTM
- Categorical: GBMs, Linear, DL, LibFM, LIBFFM
- Numerical: GBMs, Linear, DL, SVMs
- Interactions: GBMs, Linear, DL
- Recommenders: CF, DL, LibFM, LIBFFM, GBMs
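As one concrete instance of the "Text classification: Linear" row, here is a minimal TF-IDF plus logistic regression baseline; the tiny toy corpus is only there to keep the sketch self-contained.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Tiny toy corpus standing in for real competition text.
texts = [
    "great product, works as advertised",
    "terrible quality, broke after a day",
    "absolutely love it, highly recommend",
    "waste of money, very disappointed",
]
labels = [1, 0, 1, 0]

model = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2), sublinear_tf=True),  # unigram + bigram features
    LogisticRegression(max_iter=1000),
)
model.fit(texts, labels)
print(model.predict(["really great, would recommend"]))
```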
Ensembling
- Throughout the modeling process, predictions on the internal validation set and on the test set are saved.
- There are different ways to combine predictions, from simple averaging to multilayer stacking.
- Small data requires simpler ensemble techniques.
- It helps to average a few low-correlated predictions with good scores.
- The stacking process essentially repeats the modeling process on top of the saved predictions (see the sketch below).
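A minimal sketch of what repeating the modeling process looks like for a two-level stack, assuming a binary classification task; the models and data are illustrative placeholders.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import cross_val_predict, train_test_split

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

base_models = [
    GradientBoostingClassifier(random_state=0),
    RandomForestClassifier(n_estimators=200, random_state=0),
    LogisticRegression(max_iter=1000),
]

# Level-1 features: out-of-fold predictions on train, full-fit predictions on test.
oof = np.column_stack([
    cross_val_predict(m, X_train, y_train, cv=5, method="predict_proba")[:, 1]
    for m in base_models
])
test_meta = np.column_stack([
    m.fit(X_train, y_train).predict_proba(X_test)[:, 1] for m in base_models
])

# Level-2 (meta) model trained on the out-of-fold predictions.
meta = LogisticRegression()
meta.fit(oof, y_train)
print("Stacked AUC:", roc_auc_score(y_test, meta.predict_proba(test_meta)[:, 1]))
```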
Tips on collaboration
- More fun.
- Learn more.
- Better score.
- You gain in at least two ways:
- You can cover more ground.
- Each person approaches the problem from a different angle, leading to more thorough solutions.
- Start collaborating after getting some experience to understand the dynamics.
- Start with people around your "rank".
- Look for people that are likely to do different things well or that specialize in certain areas.
Selecting final submissions
- Normally, select the best submission locally and the best one on the LB.
- It is good to monitor correlations between submissions: if a submission has a high score but significantly lower correlation with the others, it is worth considering as well (see the sketch below).
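A small sketch of that correlation check; the three prediction vectors below are random stand-ins for the predictions inside actual submission files.

```python
import numpy as np
import pandas as pd

# Random stand-ins: two highly correlated "strong" submissions and one diverse one.
rng = np.random.default_rng(0)
base = rng.random(10000)
preds = pd.DataFrame({
    "best_local": base + rng.normal(scale=0.05, size=10000),
    "best_lb": base + rng.normal(scale=0.05, size=10000),
    "diverse": rng.random(10000),
})

# A high-scoring submission with noticeably lower correlation to the others is
# a good candidate for the second final selection.
print(preds.corr(method="pearson").round(3))
```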
Final tips
- You never lose. You may not win prize money, but you always gain knowledge, experience, the chance to meet and collaborate with talented people in the field, and a boost to your CV.
- The Kaggle community may be the kindest, most helpful community I have ever experienced in any social context.
- After the competition, look for write-ups where people share their approaches.
- Create a notebook of useful methods and keep it updated.