Meet the Data III

Thursday, 23 February, from 9 a.m. to 6 p.m.
Maison de la Chimie, 28 rue Saint Dominique – 75007

Topics covered: data science approaches, contributions and perspectives in the fields of “smart cities”, the collaborative economy, media and entertainment.

With talks by researchers from the Collège de France, the Ecole Polytechnique Fédérale de Lausanne, the Technion-Israel Institute of Technology, IRCAM… and by executives and R&D managers from Dassault Systèmes, European Institute for Energy Research, Facebook, Gameloft, Havas, Orange, NTT Labs, Veolia, Viaccess-Orca, Vivendi…

Event website

SAGA algorithm in the lightning library

[Cross-posted from]

We (Fabian Pedregosa and Arnaud Rachez) recently implemented the SAGA[1] algorithm in the lightning machine learning library (which, by the way, has recently moved to the new scikit-learn-contrib project). The lightning library uses the same API as scikit-learn but is particularly adapted to online learning. The performance of SAGA is similar to that of other variance-reduced stochastic algorithms such as SAG[3] or SVRG[2], but unlike SAG[3] it supports non-smooth penalty terms (such as L1 regularization). It is implemented in lightning as SAGAClassifier and SAGARegressor.

We have taken care to make this implementation as efficient as possible. As with most stochastic gradient algorithms, a naive implementation takes 3 lines of code and is straightforward to write. However, many of the tricks that make a huge difference in efficiency are time-consuming and error-prone to implement.
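To make the "naive implementation" concrete, here is a minimal pure-Python sketch of the SAGA update for a least-squares loss with an optional L1 penalty. It is not the lightning implementation and uses none of its efficiency tricks; the names `soft_threshold` and `saga_lasso`, the squared loss, and the 1/(3L) step size are assumptions chosen for illustration.

```python
import random


def soft_threshold(v, t):
    """Proximal operator of t * |v| (the L1 penalty), applied to a scalar."""
    if v > t:
        return v - t
    if v < -t:
        return v + t
    return 0.0


def saga_lasso(X, y, beta=0.0, n_epochs=50, seed=0):
    """Naive SAGA for (1/n) * sum_i 0.5 * (x_i . w - y_i)^2 + beta * ||w||_1.

    Hypothetical helper for illustration only; X is a list of rows, y a list.
    """
    rng = random.Random(seed)
    n, d = len(X), len(X[0])
    # Each squared-loss term has a gradient that is ||x_i||^2-Lipschitz.
    L = max(sum(v * v for v in row) for row in X)
    step = 1.0 / (3.0 * L)
    w = [0.0] * d
    # For a linear model the gradient of term i is g_i * x_i,
    # so storing the scalar g_i per sample is enough.
    g = [0.0] * n
    g_avg = [0.0] * d  # average of the stored gradients
    for _ in range(n_epochs * n):
        i = rng.randrange(n)
        x = X[i]
        g_new = sum(xj * wj for xj, wj in zip(x, w)) - y[i]
        diff = g_new - g[i]
        for j in range(d):
            # SAGA direction: new gradient - stored gradient + stored average.
            direction = diff * x[j] + g_avg[j]
            # Gradient step followed by the proximal step for the L1 term.
            w[j] = soft_threshold(w[j] - step * direction, step * beta)
            g_avg[j] += diff * x[j] / n
        g[i] = g_new
    return w
```

The core of the method is the three lines inside the inner loop; everything else is bookkeeping. A production implementation would additionally need to handle sparse data without touching every coordinate at each step, which is where most of the tricky work lies.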

Here is a small example, meant more as a sanity check than as a claim about relative performance. The following plot shows the suboptimality as a function of time for three similar methods: SAG, SAGA and SVRG. The dataset is RCV1 (test set, obtained from the libsvm webpage), consisting of 677,399 samples and 47,236 features. Interestingly, all methods solve this rather large-scale problem within a few seconds. Among them, SAG and SAGA perform very similarly, and SVRG seems to be somewhat faster.

A note about the benchmarks: it is difficult to compare stochastic gradient methods fairly because in the end it usually boils down to how you choose the step size. In this plot I set the step size of all methods to 1/(3L), where L is the Lipschitz constant of the gradient of the objective function, as I think this is a popular choice. I would have preferred 1/L, but SVRG did not converge with this step size. The code for the benchmarks can be found here.
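As a rough illustration of how such a step size can be obtained, the sketch below computes a standard upper bound on L for a linear model and derives 1/(3L) from it. The post does not say which loss or which bound the benchmark used, so the per-sample bound, the choice of losses, and the function names `lipschitz_constant` and `step_size` are assumptions.

```python
def lipschitz_constant(X, loss="log", alpha=0.0):
    """Upper bound on the Lipschitz constant of the per-sample gradients.

    Hypothetical helper: X is a list of rows, alpha the strength of an
    optional L2 penalty (alpha / 2) * ||w||^2.
    """
    max_sq_norm = max(sum(v * v for v in row) for row in X)
    if loss == "log":
        # The second derivative of the logistic loss is at most 1/4.
        L = 0.25 * max_sq_norm
    elif loss == "squared":
        # The second derivative of 0.5 * (z - y)^2 is exactly 1.
        L = max_sq_norm
    else:
        raise ValueError("loss must be 'log' or 'squared'")
    return L + alpha


def step_size(X, loss="log", alpha=0.0, factor=3.0):
    """Step size 1 / (factor * L); factor=3 gives the 1/(3L) rule."""
    return 1.0 / (factor * lipschitz_constant(X, loss, alpha))
```

Passing `factor=1.0` gives the more aggressive 1/L step size mentioned above.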

  1. A. Defazio, F. Bach & S. Lacoste-Julien. “SAGA: A Fast Incremental Gradient Method With Support for Non-Strongly Convex Composite Objectives” (2014). 
  2. Rie Johnson and Tong Zhang. “Accelerating stochastic gradient descent using predictive variance reduction.” Advances in Neural Information Processing Systems. 2013. 
  3. Mark Schmidt, Nicolas Le Roux, and Francis Bach. “Minimizing finite sums with the stochastic average gradient.” arXiv preprint arXiv:1309.2388 (2013). 

Come work with us! Post-doc and engineer positions available

The chair “Economie et gestion des nouvelles données” is recruiting a talented postdoc specialized in large-scale computing and data processing. The targeted applications include machine learning, imaging sciences and finance. This is a unique opportunity to join a newly created research group spanning the best Parisian labs in applied mathematics and computer science (Paris-Dauphine, INRIA, ENS Ulm, Ecole Polytechnique and ENSAE). The position involves research on large-scale data processing methods and their application to real-life problems.

The successful candidate will join the Sierra INRIA team, located at the new INRIA Paris center in downtown Paris. He/she will benefit from a very stimulating working environment and all required computing resources, will work in close interaction with the 4 research labs of the chair, and will also interact with industrial partners.

A non-exhaustive list of methods currently investigated by researchers of the group, which will play a key role in the computational framework developed by the recruited postdoc, includes:
* Large-scale non-smooth optimization methods (proximal schemes, interior points, optimization on manifolds).
* Distributed optimization methods (asynchronous stochastic gradient optimization).
* Machine learning problems (kernelized methods, Lasso, collaborative filtering, deep learning, learning for graphs, learning for time-dependent systems), with a particular focus on large-scale problems and stochastic methods.
* Asynchronous parallel optimization methods.
* Imaging problems (compressed sensing, super-resolution).
* Hyperparameter optimization.

## Candidate profile

The candidate should have a good background in computer science, experience with various programming environments (e.g. Python and/or Matlab and/or Java/Scala) and knowledge of high-performance computing methods (e.g. parallelization, GPU, cloud computing). He/she should adhere to the open source philosophy and ideally be able to interact with the relevant communities (e.g. Python, scikit-learn, Julia project, etc.). A typical background includes a PhD in computer science, applied mathematics, statistics or a related field.

## Application process

Send a resume and a motivation letter to:
Alexandre d’Aspremont, Robin Ryder, Fabian Pedregosa

For any questions please contact me at .

Slides from last meeting – 2 November 2015

The chair “Économie des nouvelles données” held a meeting on 2 November at the University Paris-Dauphine, at which its members presented work in progress and computational tools. Here are some of the slides:

“Performance and scalability for machine learning” by Arnaud Rachez

also available as PDF.

“A leader/follower based recommendation system for bonds” by Guillaume Lecué

also available as PDF.

“Profiling in Python” by Fabian Pedregosa

also available as PDF and as a demo Jupyter notebook.

“Recent developments in scikit-learn” by Fabian Pedregosa

also available as PDF.

CompSquad at scikit-learn coding sprint

Several members of this group were at the scikit-learn coding sprint last week, hosted at Criteo.

This sprint was an opportunity to interact with many people from the scikit-learn community, from new contributors to experienced ones. It is also where important discussions take place and where many proposals that have been on stand-by for months (or years!) finally get integrated into the project. My personal highlights among the improvements from this sprint are:

  • Multi-layer perceptron: it was long overdue for one of the most well-known machine learning methods to be implemented in scikit-learn.
  • Improved Gaussian Process module.
  • Isolation Forests, an outlier detection method using random forests.
  • A complete refactoring of the model selection-related modules. Furthermore, the API of cross-validation iterators has changed to allow (easier) nested cross-validation. Oh, and don’t worry, all the old modules are still in there and will be supported for a couple of years. The documentation for these changes will be up soon.
  • And many others under review…



On behalf of the interdisciplinary research University Paris Dauphine-Havas Chair «Economics and Management of New Data», Pierre-Louis Lions, Professor at the Collège de France and President of the Chair’s scientific board, André Lévy-Lang, President of Institut Louis Bachelier, and Dominique Delport, Global Managing Director of Havas Media Group and Chairman of HMG France & UK, have the great honor and pleasure of inviting you to the scientific conference «MEET THE DATA II».

We hope that you will be able to attend these presentations and interaction sessions, which will be held on June 15th (8.30am-5.00pm) at Le Palais Brongniart (the historical building of the Paris stock exchange).

RSVP to:

PyData Paris – April 2015

Last Friday was PyData Paris, in the words of the organizers, “a gathering of users and developers of data analysis tools in Python”.

The organizers did a great job putting it together, and the event started with a full room for Gael’s keynote.


My take-away message from the talks is that, in 5 years, Python has grown from a language used marginally in some research environments into one of the main languages for data science, used both in research labs and in industry.

My personal highlights were (note that there were two parallel tracks):

  • Ian Ozsvald’s talk on Cleaning Confused Collections of Characters. Ian gave a very practical talk, full of real-world examples, with many tips and many pointers to libraries. The slides have already been uploaded to his website. In particular, I discovered ftfy (“fixes text for you”).
  • Chloe-Agathe gave a short talk on DREAM challenges. In her talk she mentioned GPy. One year ago, I visited Neil Lawrence at his lab in Sheffield and at that point they were in the process of migrating their Matlab codebase into Python (the GPy project). I’m very glad to see that the project is succeeding and being used by other research institutions.
  • Serge Guelton and Pierrick Brunet presented “Pythran: Static Compilation of Parallel Scientific Kernels”. From their own documentation: “Pythran is a python to c++ compiler for a subset of the python language. It takes a python module annotated with a few interface description and turns it into a native python module with the same interface, but (hopefully) faster”. The project seems promising, although I do not have enough experience with it to judge the quality of the implementation.
  • Antoine Pitrou presented: “Numba, a JIT compiler for fast numerical code”. I must say that I’m an avid user of Numba so of course I was looking forward to this talk. One thing I didn’t know is that support for CUDA is being implemented into Numba via the @cuda.jit decorator. From their website it looks like this is only available in the Numba Pro version (not free).
  • Kirill Smelkov presented wendelin.core, an approach to perform out-of-core computations with numpy. Slides can be found here.
  • Finally, Francesc Alted gave the closing keynote on “New Trends In Storing And Analyzing Large Data Silos With Python”. Among the projects he mentioned, I found two particularly interesting: bcolz, his current main project, and DyND, a Python wrapper around a multi-dimensional array library.


[Cross-posted from]