AI-Ops Engineer
Job in Fremont, Alameda County, California, 94537, USA
Listed on 2026-03-04
Listing for:
CYNET SYSTEMS
Full Time position
Job specializations:
- IT/Tech: Cloud Computing, Data Engineer
Job Description & How to Apply Below
Job Description:
Pay Range: $55/hr - $60/hr
Experience Requirements:
- 5+ years in IT Operations, Data Engineering, or related fields.
- Experience in Azure Data Services, ETL/ELT processes, and ITIL-based operations.
- 2+ years in AIOps implementation, monitoring, and automation.
- Bachelor's or Master's degree in Computer Science, Engineering, or a related field.
Skills:
- Client, PowerShell, or Python; alerts & logs monitoring; Confluence and SharePoint
- Basic Understanding of Azure Data Services (ADF, Synapse, Databricks).
- Experience in monitoring alerts from data pipelines (Azure Data Factory, Synapse, ADLS, MS Fabric, etc.).
- Familiarity with ETL/ELT concepts, data validation, and pipeline orchestration.
- Experience in identifying failures in ETL jobs, scheduled loads, and streaming data services.
- Hands-on experience with IT monitoring tools (e.g. Client, Azure Monitor, Dynatrace, or similar tools).
- Skilled in creating and updating runbooks and SOPs.
- Familiarity with data refresh cycles, batch vs. streaming differences.
- Familiarity with ITIL processes for incident, problem, and change management.
- Strong attention to detail, ability to follow SOPs, and effective communication for incident updates.
- Solid understanding of containerized services (Docker/Kubernetes) and DevOps pipelines (Azure DevOps, GitHub Actions), always with an eye on data layer integration.
- Proficiency in Jira, Confluence and SharePoint for status updates and documentation.
- Understanding of scripting (PowerShell, Python, or Shell) for basic automation tasks.
- Ability to interpret logs and detect anomalies proactively.
- Analytical thinking for quick problem identification and escalation.
- Exposure to CI/CD for data workflows, real-time streaming (Event Hub, Kafka).
- Understanding of data governance and compliance basics.
- Experience with anomaly detection, time-series forecasting, and log analysis.
Responsibilities:
- Monitor and support data pipelines on Azure Data Factory, Databricks, and Synapse.
- Perform incident management, root-cause analysis for L1 issues, and escalate as needed.
- Surface issues clearly and escalate to the appropriate SME teams so they can be fixed at the root, avoiding repetitive short-term fixes.
- Identify whether issues are at pipeline level, data source level, or infrastructure level and route accordingly.
- Document incident resolution patterns for reuse.
- Acknowledge incidents promptly and route them to the correct team.
- Execute daily health checks, maintain logs, and update system status in collaboration tools.
- Work strictly as per SOPs documented by the team.
- Maintain and update SOPs, runbooks, and compliance documentation.
- Update system health status every 2 hours during the shift in Confluence or SharePoint.
- Update incident status every 4 hours for P1/P2 tickets.
- Complete service tasks on time as per SLA to release queues quickly.
- Ensure compliance with enterprise data security, governance, and regulatory requirements.
- Collaborate with data engineers, analysts, DevOps/SRE teams, and business teams to ensure reliability and security.
- Implement best practices in ML operations and productionization.
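The "ability to interpret logs and detect anomalies proactively" called for above can be illustrated with a minimal sketch. This is not a tool from the listing; the window size, threshold, and hourly error counts are illustrative assumptions, and a real deployment would read counts from Azure Monitor or a log store rather than a hard-coded list.

```python
# Minimal sketch (assumed example, not from the listing): flag anomalous
# error counts in pipeline logs using a trailing mean/std-dev threshold.
from statistics import mean, stdev

def flag_anomalies(counts, window=6, threshold=3.0):
    """Return indices where a count deviates from the trailing `window`
    of observations by more than `threshold` standard deviations."""
    anomalies = []
    for i in range(window, len(counts)):
        history = counts[i - window:i]
        mu, sigma = mean(history), stdev(history)
        if sigma > 0 and abs(counts[i] - mu) > threshold * sigma:
            anomalies.append(i)
    return anomalies

# Hypothetical hourly error counts from a pipeline log; hour 7 spikes.
hourly_errors = [2, 3, 1, 2, 3, 2, 2, 40, 3, 2]
print(flag_anomalies(hourly_errors))  # [7]
```

A rolling z-score like this is deliberately simple; it catches sudden spikes but not gradual drift, which is where the time-series forecasting experience mentioned above would come in.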
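The periodic health-check and status-update duties above could be scripted along these lines. The pipeline names, status values, and summary format here are illustrative assumptions; in practice the run records would come from the Azure Data Factory or Synapse APIs, and the summary would be posted to Confluence or SharePoint.

```python
# Minimal sketch (assumed example, not from the listing): summarize pipeline
# run records into a status line for a shift health-check update, and collect
# the failed pipelines that need escalation.
from collections import Counter

def health_summary(runs):
    """runs: list of (pipeline_name, status) tuples, with status one of
    "Succeeded", "Failed", or "InProgress" (an assumed status vocabulary).
    Returns (one-line summary, list of failed pipeline names)."""
    counts = Counter(status for _, status in runs)
    failed = [name for name, status in runs if status == "Failed"]
    line = (f"OK: {counts.get('Succeeded', 0)}, "
            f"Failed: {counts.get('Failed', 0)}, "
            f"Running: {counts.get('InProgress', 0)}")
    return line, failed

# Hypothetical run records for one health-check cycle.
runs = [("ingest_sales", "Succeeded"),
        ("load_dim_customer", "Failed"),
        ("stream_events", "InProgress")]
line, failed = health_summary(runs)
print(line)    # OK: 1, Failed: 1, Running: 1
print(failed)  # ['load_dim_customer']
```

Keeping the summary to a single line makes it easy to paste into the 2-hour Confluence/SharePoint status updates the role requires, while the failed list feeds the escalation step.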