LIBFFM: A Library for Field-aware Factorization Machines

Table of Contents

* What is LIBFFM
* Overfitting and Early Stopping
* Installation
* Data Format

What is LIBFFM

LIBFFM is a library for field-aware factorization machine (FFM).

The field-aware factorization machine is an effective model for CTR (click-through rate) prediction. It has been used to win top-3 positions in the following competitions:

* Criteo: https://www.kaggle.com/c/criteo-display-ad-challenge

* Avazu: https://www.kaggle.com/c/avazu-ctr-prediction

* Outbrain: https://www.kaggle.com/c/outbrain-click-prediction

* RecSys 2015: http://dl.acm.org/citation.cfm?id=2813511&dl=ACM&coll=DL&CFID=941880276&CFTOKEN=60022934

You can find more information about FFM in the following papers and slides:

* http://www.csie.ntu.edu.tw/~r01922136/slides/ffm.pdf

* http://www.csie.ntu.edu.tw/~cjlin/papers/ffm.pdf

* https://arxiv.org/abs/1701.04099
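
As a quick reference, FFM scores an instance x by summing field-aware pairwise interactions. The following is a sketch of the standard formulation; notation follows the paper above:

    \phi_{\mathrm{FFM}}(\mathbf{w}, \mathbf{x}) = \sum_{j_1=1}^{n} \sum_{j_2=j_1+1}^{n} \langle \mathbf{w}_{j_1, f_{j_2}}, \mathbf{w}_{j_2, f_{j_1}} \rangle x_{j_1} x_{j_2}

where f_j is the field of feature j and w_{j,f} in R^k is the latent vector learned for feature j when it interacts with field f. For CTR prediction, training minimizes the logistic loss of \phi_{\mathrm{FFM}} plus an L2 penalty weighted by the regularization parameter (option `-l' below).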

Overfitting and Early Stopping

FFM is prone to overfitting, and the solution we have so far is early stopping. The following run shows how FFM behaves on a sample data set:

> ffm-train -p va.ffm -l 0.00002 tr.ffm
iter   tr_logloss   va_logloss
   1      0.49738      0.48776
   2      0.47383      0.47995
   3      0.46366      0.47480
   4      0.45561      0.47231
   5      0.44810      0.47034
   6      0.44037      0.47003
   7      0.43239      0.46952
   8      0.42362      0.46999
   9      0.41394      0.47088
  10      0.40326      0.47228
  11      0.39156      0.47435
  12      0.37886      0.47683
  13      0.36522      0.47975
  14      0.35079      0.48321
  15      0.33578      0.48703
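
Here tr_logloss and va_logloss are the average logistic loss on the training and validation sets. A minimal sketch of that metric in C++, assuming labels y[i] in {-1, +1} and raw model outputs t[i] = phi(w, x_i) (this helper is illustrative, not part of the libffm API):

    #include <cmath>
    #include <vector>

    // Average logistic loss, as reported in the tr_logloss / va_logloss
    // columns. y[i] is the label in {-1, +1}; t[i] is the raw model
    // output phi(w, x_i). Illustrative helper, not part of the libffm API.
    double logloss(const std::vector<double> &y, const std::vector<double> &t)
    {
        double sum = 0;
        for (std::size_t i = 0; i < y.size(); i++)
            sum += std::log(1 + std::exp(-y[i] * t[i]));
        return sum / y.size();
    }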

We see that the best validation loss is achieved at the 7th iteration. If we keep training, overfitting begins. It is worth noting that increasing the regularization parameter does not help:

> ffm-train -p va.ffm -l 0.0002 -t 50 -s 12 tr.ffm
iter   tr_logloss   va_logloss
   1      0.50532      0.49905
   2      0.48782      0.49242
   3      0.48136      0.48748
             ...
  29      0.42183      0.47014
             ...
  48      0.37071      0.47333
  49      0.36767      0.47374
  50      0.36472      0.47404

To avoid overfitting, we recommend always providing a validation set with the option `-p'. You can use the option `--auto-stop' to stop at the iteration that reaches the best validation loss:

> ffm-train -p va.ffm -l 0.00002 --auto-stop tr.ffm
iter   tr_logloss   va_logloss
   1      0.49738      0.48776
   2      0.47383      0.47995
   3      0.46366      0.47480
   4      0.45561      0.47231
   5      0.44810      0.47034
   6      0.44037      0.47003
   7      0.43239      0.46952
   8      0.42362      0.46999
Auto-stop. Use model at 7th iteration.
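
Conceptually, `--auto-stop' tracks the best validation loss seen so far and stops as soon as it worsens, keeping the model from the best iteration. A minimal sketch of that loop (the Model type and the train_epoch / va_logloss callbacks are hypothetical placeholders, not the actual libffm internals):

    #include <functional>
    #include <limits>

    struct Model { /* model weights omitted in this sketch */ };

    // Early stopping: train until the validation loss stops improving,
    // then return the model from the best iteration. Hypothetical sketch.
    Model train_with_auto_stop(Model model,
                               const std::function<void(Model &)> &train_epoch,
                               const std::function<double(const Model &)> &va_logloss,
                               int max_iter)
    {
        Model best = model;
        double best_loss = std::numeric_limits<double>::infinity();
        for (int iter = 1; iter <= max_iter; iter++)
        {
            train_epoch(model);              // one pass over the training set
            double loss = va_logloss(model); // logloss on the validation set
            if (loss >= best_loss)           // no longer improving:
                break;                       // further epochs would overfit
            best_loss = loss;
            best = model;                    // remember the best model so far
        }
        return best;
    }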

Installation

Requirement: a C++11-compatible compiler. We also use OpenMP to provide multi-threading. If OpenMP is not available on your platform, please refer to the section `OpenMP and SSE'.
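
If your toolchain meets these requirements, building on a Unix-like system typically comes down to running GNU make in the source directory:

> make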

Data Format

The data format of LIBFFM is: