We have a current opportunity for a Machine Learning Platform Engineer on a permanent basis. The position will be based in Beijing/Shanghai. For further information about this position, please apply.
Job Description
-Responsible for the research and development of training platforms, planning and building large-scale, highly available training platforms, and improving the efficiency of training task operation and development.
-Conduct forward-looking technological research and innovation to continuously improve resource utilization and usability.
Job Requirements
-Familiar with Linux, with at least one of the following languages: Go, Python, or C++, and with the technical principles of Docker and Kubernetes (k8s)
-Familiar with distributed systems and containerization technologies
-Experience in developing machine learning platforms and task scheduling frameworks, such as Kubeflow and Volcano
-Ability to solve problems independently and a strong team spirit
-Strong sense of responsibility, good learning ability, communication skills, and self-motivation
-Good documentation habits: able to write and maintain high-quality technical documents and keep workflow and technical information up to date
Good to have:
-Experience in the operation, maintenance, and troubleshooting of large-scale training tasks, and familiarity with common faults and troubleshooting strategies during training