Sunday, May 27, 2018

[competitive-data-science] Week 4

Hyperparameter tuning

KazAnova's Competition Pipeline (https://github.com/kaz-Anova)
  • Understand the problem
  • Exploratory analysis
  • Define CV strategy
  • Feature engineering 
  • Modelling 
  • Ensembling

Understand the problem broadly
  • Type of problem
  • How BIG is the data
  • Hardware needed (CPUs, GPUs, RAM, Disk space)
  • Software needed (Tensorflow, Keras, sklearn, LightGBM, xgboost)
  • What metric is the competition evaluated on?
  • Is code from previous competitions relevant?

Do some EDA
  • Plot histograms of variables; check that each feature's distribution looks similar between train and test (see the sketch after this list).
  • Plot features vs the target variable and vs time.
  • Consider univariate predictability metrics (IR, R, AUC).
  • Bin numerical features and compute correlation metrics.
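
A minimal sketch of the train-vs-test histogram check above, assuming a pandas/matplotlib workflow; the data and the feature name are toy stand-ins:

    import numpy as np
    import pandas as pd
    import matplotlib.pyplot as plt

    # Toy stand-ins for the real train/test tables.
    train = pd.DataFrame({"price": np.random.lognormal(3, 1.0, 1000)})
    test = pd.DataFrame({"price": np.random.lognormal(3, 1.2, 500)})

    feature = "price"  # hypothetical feature name
    plt.hist(train[feature], bins=50, alpha=0.5, density=True, label="train")
    plt.hist(test[feature], bins=50, alpha=0.5, density=True, label="test")
    plt.legend()
    plt.title(f"Train vs. test distribution of {feature}")
    plt.show()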

Decide CV strategy
  • This step is critical. How well it goes is a good indication of what will happen in the rest of the competition.
  • People have won by just selecting the right way to validate.
  • Is time important? Split by time. Time-based validation.
  • Are there different entities in test than in train? Stratified validation.
  • Is it completely random? Random validation (random K-fold).
  • Or a combination of all of the above.
  • Use the test LB to sanity-check your validation strategy (a scikit-learn sketch follows this list).
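
A minimal sketch of the three schemes above using scikit-learn's splitters, on toy stand-in data:

    import numpy as np
    from sklearn.model_selection import KFold, StratifiedKFold, TimeSeriesSplit

    # Toy stand-ins for real features and targets.
    X = np.random.rand(100, 5)
    y = np.random.randint(0, 2, size=100)

    random_cv = KFold(n_splits=5, shuffle=True, random_state=42)                # completely random
    stratified_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)  # preserves class ratios
    time_cv = TimeSeriesSplit(n_splits=5)                                       # train on past, validate on future

    for fold, (train_idx, valid_idx) in enumerate(stratified_cv.split(X, y)):
        print(f"fold {fold}: {len(train_idx)} train rows / {len(valid_idx)} valid rows")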

Feature engineering
  • Image classification: Scaling, shifting, rotations, CNNs. (Data Science Bowls)
  • Sound classification: Fourier, MFCC, specgrams, scaling. (Tensorflow speech recognition)
  • Text classification: TF-IDF, SVD, stemming, spell checking, stop-word removal, n-grams. (StumbleUpon Evergreen Classification)
  • Time series: Lags, weighted averaging, exponential smoothing. (Walmart recruitment)
  • Categorical: Target encoding (sketched after this list), frequency, one-hot, ordinal, label encoding. (Amazon employee)
  • Numerical: Scaling, binning, derivatives, outlier removal, dimensionality reduction. (Africa soil)
  • Interactions: Multiplications, divisions, group-by features, concatenations. (Homesite)
  • Recommenders: Features on transactional history, item popularity, frequency of purchase. (Acquire Valued Shoppers)
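
A sketch of out-of-fold target (mean) encoding for a categorical feature; the DataFrame and column names here are made up, and the fold-wise means keep a row's own target out of its encoding:

    import numpy as np
    import pandas as pd
    from sklearn.model_selection import KFold

    df = pd.DataFrame({
        "city": ["a", "b", "a", "c", "b", "a", "c", "b"],
        "target": [1, 0, 1, 0, 1, 0, 0, 1],
    })

    df["city_te"] = np.nan
    global_mean = df["target"].mean()
    for train_idx, valid_idx in KFold(n_splits=4, shuffle=True, random_state=0).split(df):
        # Category means computed on the other folds only (no leakage).
        means = df.iloc[train_idx].groupby("city")["target"].mean()
        df.loc[df.index[valid_idx], "city_te"] = (
            df.iloc[valid_idx]["city"].map(means).fillna(global_mean).values
        )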

Modeling
  • Image classification: CNNs (Resnet, VGG, DenseNet, ...)
  • Sound classification: CNNs (CRNN), LSTM
  • Text classification: GBMs, Linear, DL, Naive Bayes, kNNs, LibFM, LIBFFM
  • Time series: Autoregressive models, ARIMA, Linear, GBMs, DL, LSTM
  • Categorical: GBMs, Linear, DL, LibFM, LIBFFM (a LightGBM baseline sketch follows this list)
  • Numerical: GBMs, Linear, DL, SVMs
  • Interactions: GBMs, Linear, DL
  • Recommenders: CF, DL, LibFM, LIBFFM, GBMs
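
As an illustration of the GBM entries above, a minimal LightGBM baseline on toy data; the parameters are placeholders, not tuned values:

    import numpy as np
    import lightgbm as lgb
    from sklearn.model_selection import train_test_split

    # Toy stand-ins for a tabular binary-classification problem.
    X = np.random.rand(500, 10)
    y = np.random.randint(0, 2, size=500)
    X_tr, X_va, y_tr, y_va = train_test_split(X, y, test_size=0.2, random_state=42)

    model = lgb.LGBMClassifier(n_estimators=500, learning_rate=0.05)
    model.fit(
        X_tr, y_tr,
        eval_set=[(X_va, y_va)],
        callbacks=[lgb.early_stopping(50)],  # stop once the validation score plateaus
    )
    valid_pred = model.predict_proba(X_va)[:, 1]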

Ensembling
  • Throughout the pipeline, save the predictions made on internal validation and on test.
  • Different ways to combine from averaging to multilayer stacking.
  • Small data requires simpler ensemble techniques.
  • It helps to average a few low-correlated predictions with good scores.
  • Stacking essentially repeats the modeling process on top of the base models' saved predictions (sketched below).
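
A sketch of both ideas, using random arrays as stand-ins for the saved base-model predictions:

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    # Stand-ins for saved predictions: out-of-fold on train, and on test.
    oof_preds = np.random.rand(1000, 3)   # three base models, train rows
    test_preds = np.random.rand(300, 3)   # same three models, test rows
    y = np.random.randint(0, 2, size=1000)

    blend = test_preds.mean(axis=1)       # simple averaging

    meta = LogisticRegression()           # one-layer stacking
    meta.fit(oof_preds, y)                # meta-model trained on out-of-fold predictions
    stacked = meta.predict_proba(test_preds)[:, 1]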

Tips on collaboration
  • More fun.
  • Learn more.
  • Better score.
  • You gain in at least two ways:
    • You can cover more ground.
    • Each person approaches the problem from a different angle, leading to more thorough solutions.
  • Start collaborating after getting some experience to understand the dynamics.
  • Start with people around your "rank".
  • Look for people that are likely to do different things well or that specialize in certain areas. 

Selecting final submissions
  • Normally, select the submission that scores best locally and the one that scores best on the LB.
  • It is good to monitor correlations between candidate submissions: a submission with a high score but significantly lower correlation to your other picks is worth considering too (see the snippet below).
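
A quick way to check how correlated two candidate submissions are; the file and column names are hypothetical:

    import pandas as pd

    # Hypothetical submission files, each with a "prediction" column.
    sub_a = pd.read_csv("submission_a.csv")["prediction"]
    sub_b = pd.read_csv("submission_b.csv")["prediction"]
    print("Pearson correlation:", sub_a.corr(sub_b))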

Final tips
  • You never lose: you may not win prize money, but you always gain knowledge, experience, contact with talented people in the field, and a boost to your CV.
  • The Kaggle community may be the kindest, most helpful community I have ever experienced in any social context.
  • After the competition, look for posts where people share their approaches.
  • Keep a notebook of useful methods and update it over time.

Additional Links:
