Senior Network Engineer
Listed on 2026-03-06
Engineering
Systems Engineer
Job Description
a. Maintain a holistic view of stability and be able to develop and implement stability solutions.
b. Pre-event:
Establish and continually optimize monitoring mechanisms for application operations and maintenance; develop and maintain corresponding monitoring platforms/tools.
c. During the event:
Establish and continuously optimize warning mechanisms for application operations and maintenance, ensuring that faults can be quickly discovered, located, and addressed.
d. Post-event:
Quickly analyze, diagnose, and locate problems, and collaborate with relevant personnel to resolve them. Establish and improve rapid service recovery mechanisms to reduce business impact, and ensure stable business operations by identifying and eliminating potential risks through stability governance projects and architectural optimizations.
a. Design, develop, and maintain reliable operations and maintenance platforms and tools (such as inspection, water level, delivery, and cost management systems) to address the delivery, performance, stability, and cost issues encountered by production systems, ensuring business availability and improving performance and efficiency.
b. Responsible for data-driven analysis of operations and maintenance quality; analyze and study daily operations and maintenance metrics, issues, and risks to establish models and provide optimization suggestions for operations and maintenance.
a. Establish standardized operations and maintenance processes and specifications (such as change standards, protection plans, and cloud product configuration standards) so that operations are consistent and compliant, thereby enhancing stability.
b. Develop and implement emergency response specifications and standards for application operations and maintenance faults.
c. Develop and implement alarm handling specifications and standards for application operations and maintenance, as well as Service Level Agreements (SLA).
a. Based on business requirements, handle budget preparation, capacity planning, and resource readiness, and coordinate with development teams to forecast and estimate consumption of resources such as storage and computing.
b. Analyze business demands and, while ensuring stability, take water levels, specifications, and billing rules into account; verify that resource estimates in technical solutions are reasonable, and collaborate with development teams to reduce resource costs.
a. Provide 24/7 emergency response, handle daily monitoring alerts and emergencies, and continuously identify and rectify existing issues.
b. Responsible for operations and maintenance support during major events (such as National Day, Spring Festival, New Year’s Day, and significant activities).
c. Develop and drill emergency plans, respond to emergencies, and handle faults.
d. Establish a problem/fault record repository, conduct targeted analysis of the repository, and enhance and optimize the emergency plan repository and standard process repository.
a. Responsible for system architecture upgrades, such as kernel upgrades, cross-data-center service migration, and containerization.
b. Responsible for the design and implementation of disaster recovery architecture, such as local disaster recovery and geographically distributed multi-active deployments.
Job Requirements
Fluent in Chinese, able to clearly articulate technical issues and solutions.
Over 3 years of experience in operations and maintenance in related fields such as applications, networks, and containerization.
Basic proficiency in architecture design, performance optimization, and stability optimization.
Capable of applying intelligent and automated operations and maintenance platforms and tools, designing and utilizing complex workflows and daily operational templates, quickly identifying,…