Site Reliability Engineer (Remote - Contract)
Start2Scale
Posted 1 day ago
About Start2Scale
Start2Scale is a talent acquisition advisory and recruitment consultancy, built for tech companies in their startup and scaleup stages. Our founder, Ankit Sharma, brings experience supporting world-class tech companies including Atlassian, Microsoft, Google, AWS and multiple tech startups across the region.
We've been lucky enough to support numerous VC-backed startups across Australia and internationally. Our data-led approach focuses on building world-class, resource-efficient, and mission-aligned teams.
About our client
Our client is a leading tech company and AI market leader based out of San Francisco with offices in Sydney, France, NYC, and London. They are part of the next generation of innovative companies creating developer-friendly APIs that are better than building from scratch.
Their customers include household names like Big-W, Under Armour, Petsmart, Stripe, Gymshark, and Walgreens, among thousands of others who rely on their technology to connect users with what matters most.
The Opportunity: Site Reliability Engineer - Events Platform
Join the Events Platform team - the backbone that powers data-driven decision making across the entire organization. You'll be responsible for ensuring that the systems that ingest, store, and process billions of events daily operate with world-class reliability and scale. The events platform and the 'single source of truth' initiative are the foundation on which all product insights, analytics, and machine learning capabilities are built. If you're excited about building and operating infrastructure at massive scale with sub-second latency, this role is for you.
As a Site Reliability Engineer on this critical team, you'll own the reliability, observability, and performance of the event ingestion and processing systems that power customer telemetry. Working with both internal engineering teams and external stakeholders, you'll have broad impact across the company while ensuring infrastructure operates at world-class scale.
What you'll do:
Reliability & Performance
Define and implement service-level objectives (SLOs) in partnership with engineering teams, ensuring event platforms meet stringent uptime and latency requirements
Design and execute comprehensive load and stress testing strategies to validate throughput thresholds and identify performance bottlenecks before they impact production
Architect and implement monitoring, alerting, and dashboard systems that provide deep visibility into API and platform health
System Resilience
Drive architectural improvements that enhance system resilience through graceful degradation patterns and robust fault handling mechanisms
Evolve observability practices by implementing distributed tracing, structured logging, and comprehensive metrics across all services
Design and implement safe deployment strategies including canary releases, blue/green deployments, and automated rollback mechanisms
Incident Response & Continuous Improvement
Lead incident response efforts, conduct thorough root cause analysis, and own remediation follow-ups to prevent recurring issues
Collaborate with engineering teams to improve system reliability through code reviews, architectural guidance, and best practice evangelization
Ensure production resilience using comprehensive telemetry and alerting across high-throughput data processing systems
You might be a fit if you:
Essential Skills
SRE Fundamentals: Deep understanding of SLIs, SLOs, error budgets, and how to apply them in high-scale production environments
Observability Expertise: Hands-on experience with monitoring and observability tools such as Prometheus, Grafana, OpenTelemetry, or Datadog
Performance Testing: Proficiency with load testing frameworks like k6, Locust, or Gatling, with experience designing realistic test scenarios
Cloud Infrastructure: Strong experience with GCP, Kubernetes orchestration, and infrastructure as code using Terraform and Helm
CI/CD: Proven experience designing and implementing CI/CD pipelines with deployment automation
Incident Management: Track record of effective incident response, root cause analysis, and implementing sustainable fixes
Nice to have:
Experience with backend systems and debugging in Go will be HIGHLY regarded.
Background in high-throughput data processing systems and event-driven architectures
Previous experience with data pipeline reliability patterns
Familiarity with modern AI technologies and ML pipeline infrastructure
Track record of improving developer experience and platform adoption
Experience working with APIs and services that operate at a scale of billions of requests
The team's current tech stack:
Backend: Go (Primary)
Infrastructure: Google Cloud Platform - BigQuery, Pub/Sub, Kubernetes, Cloud Run, Redis, Dataflow
Observability: Prometheus, Grafana, distributed tracing
Deployment: CI/CD pipelines, Terraform, Helm
Work Arrangement:
This is a 5-month contract role offering flexible remote work within Australia. The company emphasizes impact, contribution, and output over physical location, operating as a high-trust environment where team members have autonomy to choose where and when they work most effectively.
Ready to Apply?
If you're passionate about building world-class reliability systems at massive scale, and you thrive in an ownership-driven environment based on autonomy and diversity, we'd love to hear from you.
This is an exceptional opportunity to join a market-leading AI product company that's revolutionizing how businesses connect with their users, with direct impact on infrastructure that powers data-driven decision making across the entire organization.
Start2Scale is committed to building an inclusive workplace and welcomes applications from talented people regardless of race, age, ancestry, religion, sex, gender identity, sexual orientation, marital status, color, veteran status, disability, or socioeconomic background.