Sunday, May 27, 2018

[competitive-data-science] Week 4

Hyperparameter tuning

KazAnova's Competition Pipeline (https://github.com/kaz-Anova)
  • Understand the problem
  • Exploratory analysis
  • Define CV strategy
  • Feature engineering 
  • Modelling 
  • Ensembling

Understand the problem broadly
  • Type of problem
  • How BIG is the data
  • Hardware needed (CPUs, GPUs, RAM, Disk space)
  • Software needed (Tensorflow, Keras, sklearn, LightGBM, xgboost)
  • What metric is the competition evaluated on?
  • Is code from previous competitions relevant?

Do some EDA
  • Plot histograms of variables; check that each feature's distribution looks similar between train and test (see the sketch after this list).
  • Plot features vs the target variable and vs time.
  • Consider univariate predictability metrics (IR, R, AUC).
  • Bin numerical features and compute correlation metrics.
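
A minimal sketch of the train-vs-test histogram check above, assuming a pandas/matplotlib workflow; the data and the feature name are toy stand-ins:

    import numpy as np
    import pandas as pd
    import matplotlib.pyplot as plt

    # Toy stand-ins for the real train/test tables.
    train = pd.DataFrame({"price": np.random.lognormal(3, 1.0, 1000)})
    test = pd.DataFrame({"price": np.random.lognormal(3, 1.2, 500)})

    feature = "price"  # hypothetical feature name
    plt.hist(train[feature], bins=50, alpha=0.5, density=True, label="train")
    plt.hist(test[feature], bins=50, alpha=0.5, density=True, label="test")
    plt.legend()
    plt.title(f"Train vs. test distribution of {feature}")
    plt.show()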

Decide CV strategy
  • This step is critical. How well it goes is a good indication of what will happen in the rest of the competition.
  • People have won by just selecting the right way to validate.
  • Is time important? Split by time. Time-based validation.
  • Are there different entities in test than in train? Stratified validation.
  • Is it completely random? Random validation (random K-fold).
  • Or a combination of all of the above.
  • Use the test LB to sanity-check your validation strategy (a scikit-learn sketch follows this list).
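
A minimal sketch of the three schemes above using scikit-learn's splitters, on toy stand-in data:

    import numpy as np
    from sklearn.model_selection import KFold, StratifiedKFold, TimeSeriesSplit

    # Toy stand-ins for real features and targets.
    X = np.random.rand(100, 5)
    y = np.random.randint(0, 2, size=100)

    random_cv = KFold(n_splits=5, shuffle=True, random_state=42)                # completely random
    stratified_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)  # preserves class ratios
    time_cv = TimeSeriesSplit(n_splits=5)                                       # train on past, validate on future

    for fold, (train_idx, valid_idx) in enumerate(stratified_cv.split(X, y)):
        print(f"fold {fold}: {len(train_idx)} train rows / {len(valid_idx)} valid rows")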

Feature engineering
  • Image classification: Scaling, shifting, rotations, CNNs. (Data Science Bowls)
  • Sound classification: Fourier, MFCC, specgrams, scaling. (Tensorflow speech recognition)
  • Text classification: TF-IDF, SVD, stemming, spell checking, stop-word removal, n-grams. (StumbleUpon Evergreen Classification)
  • Time series: Lags, weighted averaging, exponential smoothing. (Walmart recruitment)
  • Categorical: Target encoding (sketched after this list), frequency, one-hot, ordinal, label encoding. (Amazon employee)
  • Numerical: Scaling, binning, derivatives, outlier removal, dimensionality reduction. (Africa soil)
  • Interactions: Multiplications, divisions, group-by features, concatenations. (Homesite)
  • Recommenders: Features on transactional history, item popularity, frequency of purchase. (Acquire Valued Shoppers)
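
A sketch of out-of-fold target (mean) encoding for a categorical feature; the DataFrame and column names here are made up, and the fold-wise means keep a row's own target out of its encoding:

    import numpy as np
    import pandas as pd
    from sklearn.model_selection import KFold

    df = pd.DataFrame({
        "city": ["a", "b", "a", "c", "b", "a", "c", "b"],
        "target": [1, 0, 1, 0, 1, 0, 0, 1],
    })

    df["city_te"] = np.nan
    global_mean = df["target"].mean()
    for train_idx, valid_idx in KFold(n_splits=4, shuffle=True, random_state=0).split(df):
        # Category means computed on the other folds only (no leakage).
        means = df.iloc[train_idx].groupby("city")["target"].mean()
        df.loc[df.index[valid_idx], "city_te"] = (
            df.iloc[valid_idx]["city"].map(means).fillna(global_mean).values
        )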

Modeling
  • Image classification: CNNs (Resnet, VGG, DenseNet, ...)
  • Sound classification: CNNs (CRNN), LSTM
  • Text classification: GBMs, Linear, DL, Naive Bayes, kNNs, LibFM, LIBFFM
  • Time series: Autoregressive models, ARIMA, Linear, GBMs, DL, LSTM
  • Categorical: GBMs, Linear, DL, LibFM, LIBFFM (a LightGBM baseline sketch follows this list)
  • Numerical: GBMs, Linear, DL, SVMs
  • Interactions: GBMs, Linear, DL
  • Recommenders: CF, DL, LibFM, LIBFFM, GBMs
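
As an illustration of the GBM entries above, a minimal LightGBM baseline on toy data; the parameters are placeholders, not tuned values:

    import numpy as np
    import lightgbm as lgb
    from sklearn.model_selection import train_test_split

    # Toy stand-ins for a tabular binary-classification problem.
    X = np.random.rand(500, 10)
    y = np.random.randint(0, 2, size=500)
    X_tr, X_va, y_tr, y_va = train_test_split(X, y, test_size=0.2, random_state=42)

    model = lgb.LGBMClassifier(n_estimators=500, learning_rate=0.05)
    model.fit(
        X_tr, y_tr,
        eval_set=[(X_va, y_va)],
        callbacks=[lgb.early_stopping(50)],  # stop once the validation score plateaus
    )
    valid_pred = model.predict_proba(X_va)[:, 1]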

Ensembling
  • Throughout the pipeline, save the predictions made on internal validation and on test.
  • Different ways to combine from averaging to multilayer stacking.
  • Small data requires simpler ensemble techniques.
  • It helps to average a few low-correlated predictions with good scores.
  • Stacking essentially repeats the modeling process on top of the base models' saved predictions (sketched below).
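
A sketch of both ideas, using random arrays as stand-ins for the saved base-model predictions:

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    # Stand-ins for saved predictions: out-of-fold on train, and on test.
    oof_preds = np.random.rand(1000, 3)   # three base models, train rows
    test_preds = np.random.rand(300, 3)   # same three models, test rows
    y = np.random.randint(0, 2, size=1000)

    blend = test_preds.mean(axis=1)       # simple averaging

    meta = LogisticRegression()           # one-layer stacking
    meta.fit(oof_preds, y)                # meta-model trained on out-of-fold predictions
    stacked = meta.predict_proba(test_preds)[:, 1]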

Tips on collaboration
  • More fun.
  • Learn more.
  • Better score.
  • You gain in at least two ways:
    • You can cover more ground.
    • Each person approaches the problem from a different angle, leading to more thorough solutions.
  • Start collaborating after getting some experience to understand the dynamics.
  • Start with people around your "rank".
  • Look for people that are likely to do different things well or that specialize in certain areas. 

Selecting final submissions
  • Normally, select the submission that scores best locally and the one that scores best on the LB.
  • It is good to monitor correlations between candidate submissions: a submission with a high score but significantly lower correlation to your other picks is worth considering too (see the snippet below).
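
A quick way to check how correlated two candidate submissions are; the file and column names are hypothetical:

    import pandas as pd

    # Hypothetical submission files, each with a "prediction" column.
    sub_a = pd.read_csv("submission_a.csv")["prediction"]
    sub_b = pd.read_csv("submission_b.csv")["prediction"]
    print("Pearson correlation:", sub_a.corr(sub_b))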

Final tips
  • You never lose: you may not win prize money, but you always gain knowledge, experience, contact with talented people in the field, and a boost to your CV.
  • The Kaggle community may be the kindest, most helpful community I have ever experienced in any social context.
  • After the competition, look for posts where people share their approaches.
  • Keep a notebook of useful methods and update it over time.

Additional Links:
