Machine Learning Performance Engineer
We’re looking for a performance-focused ML Engineer to help speed up large-scale model training by optimizing our internal stack and compute infrastructure. You’ll work across the full training pipeline — from GPU kernels to system-level throughput — applying profiling, CUDA-level tuning, and distributed systems techniques. The goal is to reduce training time, boost iteration speed, and use compute more efficiently.
This is a key role in a growing team building deep technical expertise in ML training systems.
software development
Amsterdam, full-time
Responsibilities
— Optimize our model training pipeline to improve both speed and reliability, enabling faster and more efficient experimentation;

— Apply GPU-level optimization techniques using tools such as JAX, Triton, and low-level CUDA to improve training performance and efficiency at scale (see the kernel sketch after this list);

— Identify and resolve performance bottlenecks across the entire ML pipeline — from data loading and preprocessing to CUDA kernels;

— Build tools and extend internal infrastructure to support scalable, reproducible, and high-performance training workflows;

— Mentor and support engineers and researchers in adopting performance best practices across the team;

— Help grow the team’s GPU and systems-level capabilities, and contribute to a culture of engineering excellence and rapid experimentation.
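To give a concrete flavor of the GPU-level work described above, here is a minimal, hypothetical Triton sketch of a fused elementwise kernel; all names and sizes are illustrative, not part of our internal stack. Fusing adjacent operations into one kernel avoids materializing the intermediate result in GPU memory, one of the most common wins when tuning a training step:

```python
# Illustrative only: a fused add + ReLU kernel in Triton. Fusion skips a
# round-trip to GPU memory for the intermediate sum.
import torch
import triton
import triton.language as tl

@triton.jit
def fused_add_relu_kernel(x_ptr, y_ptr, out_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
    pid = tl.program_id(axis=0)
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements            # guard the tail block
    x = tl.load(x_ptr + offsets, mask=mask)
    y = tl.load(y_ptr + offsets, mask=mask)
    tl.store(out_ptr + offsets, tl.maximum(x + y, 0.0), mask=mask)

def fused_add_relu(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    out = torch.empty_like(x)
    n = x.numel()
    grid = (triton.cdiv(n, 1024),)
    fused_add_relu_kernel[grid](x, y, out, n, BLOCK_SIZE=1024)
    return out
```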
Requirements
— Demonstrated experience optimizing neural network training in production or large-scale research settings, e.g. reducing training time, improving hardware utilization, or accelerating feedback cycles for ML researchers;

— Extensive practical experience with ML frameworks such as PyTorch or JAX;

— Hands-on experience training and optimizing deep learning architectures such as LSTMs and Transformer-based models, including different attention mechanisms;

— Experience working with CUDA, Triton, or other low-level GPU technologies for performance tuning;

— Proficiency in profiling and debugging training pipelines, using tools such as Nsight, cProfile, CUDA tooling, gdb, or the PyTorch profiler (see the profiling sketch after this list);

— Understanding of distributed training concepts (e.g. data/model/tensor/sequence/pipeline/context parallelism, memory and compute tradeoffs; a minimal data-parallel sketch follows this list);

— A collaborative and proactive mindset, with strong communication skills and the ability to mentor teammates and partner effectively within the team;

— Strong proficiency in Python for building infrastructure-level tooling, debugging training systems, and integrating with ML frameworks and profiling tools.
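As an example of the profiling work this role involves, a first pass with the PyTorch profiler might look like the sketch below; the model and batch are placeholders, and in practice you would profile a real training step:

```python
# Illustrative only: rank operators by GPU time to see where a training
# step actually spends its budget.
import torch
from torch.profiler import profile, ProfilerActivity

model = torch.nn.Linear(4096, 4096).cuda()
batch = torch.randn(64, 4096, device="cuda")

with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
             record_shapes=True) as prof:
    loss = model(batch).sum()
    loss.backward()

print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=10))
```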
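And for the distributed training concepts listed above, the simplest strategy, data parallelism, looks roughly like this PyTorch DDP sketch (model, optimizer, and sizes are placeholders):

```python
# Illustrative only: each rank holds a model replica; DDP all-reduces
# gradients during backward. Launch with:
#   torchrun --nproc_per_node=<gpus> train.py
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    dist.init_process_group(backend="nccl")  # torchrun supplies rank/world size
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    model = torch.nn.Linear(1024, 1024).cuda(local_rank)
    ddp_model = DDP(model, device_ids=[local_rank])
    opt = torch.optim.AdamW(ddp_model.parameters(), lr=1e-4)

    x = torch.randn(32, 1024, device=local_rank)
    loss = ddp_model(x).sum()
    loss.backward()   # gradient all-reduce overlaps with backward
    opt.step()
    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```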
What we offer
— Above-market compensation, with twice-yearly bonuses of up to 50% of annual salary;

— Sophisticated internal training and development programs;

— Comprehensive health insurance;

— Reimbursement for sports activities;

— Engaging corporate events twice a year;

— A high level of influence and ownership over the process;

— Work closely with an experienced team in a flat organizational structure.
Apply