Site Reliability Engineer (Remote - Contract)

Start2Scale

Sydney, NSW

A$920-$990 p/d

Information & Communication Technology → Engineering - Software

Contract

Remote

About Start2Scale

Start2Scale is a talent acquisition advisory and recruitment consultancy built for tech companies in their startup and scaleup stages. Our founder, Ankit Sharma, brings experience supporting world-class tech companies including Atlassian, Microsoft, Google, and AWS, as well as multiple tech startups across the region.

We've been lucky enough to support numerous VC-backed startups across Australia and internationally. Our data-led approach focuses on building world-class, resource-efficient, and mission-aligned teams.

About our client

Our client is a leading tech company and AI market leader based in San Francisco, with offices in Sydney, France, NYC, and London. They are part of the next generation of innovative companies creating developer-friendly APIs that are a better option than building from scratch.

Their customers include household names like Big W, Under Armour, PetSmart, Stripe, Gymshark, and Walgreens, among thousands of others who rely on their technology to connect users with what matters most.

The Opportunity: Site Reliability Engineer - Events Platform

Join the Events Platform team, the backbone that powers data-driven decision making across the entire organization. You'll be responsible for ensuring that the systems which ingest, store, and process billions of events daily operate with world-class reliability and scale. The events platform and the 'single source of truth' initiative are the foundation upon which all product insights, analytics, and machine learning capabilities are built. If you're excited about building and operating infrastructure at massive scale with sub-second latency, this role is for you.

As a Site Reliability Engineer on this critical team, you'll own the reliability, observability, and performance of the event ingestion and processing systems that power customer telemetry. Working with both internal engineering teams and external stakeholders, you'll have broad impact across the company while ensuring infrastructure operates at world-class scale.

What you'll do:

Reliability & Performance

Define and implement service-level objectives (SLOs) in partnership with engineering teams, ensuring event platforms meet stringent uptime and latency requirements

Design and execute comprehensive load and stress testing strategies to validate throughput thresholds and identify performance bottlenecks before they impact production

Architect and implement monitoring, alerting, and dashboard systems that provide deep visibility into API and platform health

System Resilience

Drive architectural improvements that enhance system resilience through graceful degradation patterns and robust fault handling mechanisms

Evolve observability practices by implementing distributed tracing, structured logging, and comprehensive metrics across all services

Design and implement safe deployment strategies including canary releases, blue/green deployments, and automated rollback mechanisms

Incident Response & Continuous Improvement

Lead incident response efforts, conduct thorough root cause analysis, and own remediation follow-ups to prevent recurring issues

Collaborate with engineering teams to improve system reliability through code reviews, architectural guidance, and best practice evangelization

Ensure production resilience using comprehensive telemetry and alerting across high-throughput data processing systems

You might be a fit if you:

Essential Skills

SRE Fundamentals: Deep understanding of SLIs, SLOs, error budgets, and how to apply them in high-scale production environments

Observability Expertise: Hands-on experience with monitoring and observability tools such as Prometheus, Grafana, OpenTelemetry, or Datadog

Performance Testing: Proficiency with load testing frameworks like k6, Locust, or Gatling, with experience designing realistic test scenarios

Cloud Infrastructure: Strong experience with GCP, Kubernetes orchestration, and infrastructure as code using Terraform and Helm

CI/CD: Proven experience designing and implementing CI/CD pipelines with deployment automation

Incident Management: Track record of effective incident response, root cause analysis, and implementing sustainable fixes

Nice to have:

Experience with backend systems and debugging in Go (highly regarded)

Background in high-throughput data processing systems and event-driven architectures

Previous experience with data pipeline reliability patterns

Familiarity with modern AI technologies and ML pipeline infrastructure

Track record of improving developer experience and platform adoption

Experience working with APIs and services that handle billions of requests

The team's current tech stack:

Backend: Go (Primary)

Infrastructure: Google Cloud Platform - BigQuery, Pub/Sub, Kubernetes, Cloud Run, Redis, Dataflow

Observability: Prometheus, Grafana, distributed tracing

Deployment: CI/CD pipelines, Terraform, Helm

Work Arrangement:

This is a 5-month contract role offering flexible remote work within Australia. The company emphasizes impact, contribution, and output over physical location, operating as a high-trust environment where team members have autonomy to choose where and when they work most effectively.

Ready to Apply?

If you're passionate about building world-class reliability systems at massive scale, and you thrive in an ownership-driven environment based on autonomy and diversity, we'd love to hear from you.

This is an exceptional opportunity to join a market-leading AI product company that's revolutionizing how businesses connect with their users, with direct impact on infrastructure that powers data-driven decision making across the entire organization.

Start2Scale is committed to building an inclusive workplace and welcomes applications from talented people regardless of race, age, ancestry, religion, sex, gender identity, sexual orientation, marital status, color, veteran status, disability, or socioeconomic background.