Implement an autotuning approach to the vectorization decision

Tasks:

Generate a standalone C++ file for benchmarking a given sum factorization kernel
Implement a way to do JIT compilation of benchmark programs (codepy?)
Make the cost function evaluation asynchronous (instead of using min)
Add a pickle-cache for benchmarking results
Add an autotune cost model implementation
Design an interface through which benchmark programs communicate their measurements to the code generator
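
As a starting point for the pickle-cache and cost model tasks, here is a minimal sketch of how the two could fit together. All names (`BenchmarkCache`, `autotune_cost`, the `key` and `run_benchmark` arguments) are hypothetical and not taken from the existing code base. The sketch assumes each kernel variant can be identified by a hashable key and that the measurement (JIT compilation plus running the benchmark) is supplied as a callable returning a runtime.

```python
import os
import pickle
import tempfile


class BenchmarkCache:
    """Pickle-backed cache mapping a kernel key to its measured runtime.

    Hypothetical helper: the key format and file location are assumptions.
    """

    def __init__(self, path):
        self.path = path
        self._results = {}
        # Reuse measurements from earlier code generator runs, if any.
        if os.path.exists(path):
            with open(path, "rb") as f:
                self._results = pickle.load(f)

    def lookup(self, key, measure):
        """Return the cached measurement for key, calling measure() on a miss."""
        if key not in self._results:
            self._results[key] = measure()
            self._store()
        return self._results[key]

    def _store(self):
        # Write to a temporary file first and rename it afterwards, so an
        # interrupted run does not leave a truncated cache file behind.
        directory = os.path.dirname(os.path.abspath(self.path))
        fd, tmp = tempfile.mkstemp(dir=directory)
        with os.fdopen(fd, "wb") as f:
            pickle.dump(self._results, f)
        os.replace(tmp, self.path)


def autotune_cost(cache, key, run_benchmark):
    """Hypothetical autotune cost model: the cost is the measured runtime.

    run_benchmark is assumed to compile and execute the standalone benchmark
    for the kernel identified by key and to return the measurement.
    """
    return cache.lookup(key, run_benchmark)
```

With this shape, the benchmark only runs on a cache miss; repeated code generator runs for the same kernel read the stored measurement instead. How `key` is derived from a sum factorization kernel and its vectorization strategy is left open here.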