Implement complex computational algorithms on GPU and CPU under demanding latency and throughput requirements. Refactor existing solutions to improve their scalability. Commercial experience in developing and debugging high-performance GPU and CPU applications with a strong focus on latency and throughput. Hands-on experience with third-party libraries and with designing custom CUDA kernels. Proficiency with profiling and performance analysis tools (Nsight Systems, Nsight Compute, nvprof). Solid understanding of data structures, algorithms, and object-oriented programming in C++. Proven ability to work effectively in remote or hybrid teams with variable, project-based responsibilities. Curiosity about emerging trends in GPU/HPC/ML and a proactive habit of learning and applying new techniques.