Staff Software Engineer, ML Infrastructure
Tasks
- Build feedback loops between cloud inference, edge devices, and data pipelines
- Define SLOs and observability standards for ML services
- Design and evolve real-time cloud inference systems
- Drive architecture decisions for the Kubernetes-based ML platform
- Establish best practices for model lifecycle management and on-call
- Identify and remove bottlenecks in ML deployment infrastructure
- Improve throughput, latency, and cost for production CV models
- Lead incident response and postmortems for ML systems
- Lead technical reviews for system design and capacity planning
- Mentor engineers through design reviews, code reviews, pairing, and guidance
- Set technical direction for ML infrastructure
- Shape production LLM serving patterns and evaluation pipelines
- Write documentation, runbooks, and architecture decision records
Perks/Benefits
- Employee resource groups
- Free home security system
- Hybrid work model
- Inclusive, safe work environment
- Professional home monitoring
- Wellness support
Skills/Tech-stack
AWS EKS | AWS IAM | Amazon S3 | Amazon Web Services | Autoscaling | Batching | C++ | CI/CD | Caching | Capacity Planning | Distributed Systems | GPU scheduling | GPU-based inference | Go | High Throughput | High-Throughput Systems | Incident Response | Infrastructure as Code | KServe | KV cache | Kafka | Kubernetes | LLM serving | Load Balancing | Low Latency | Low-Latency Systems | Multi-tenancy | NVIDIA Triton | Observability | Postmortems | Python | Queuing | Ray | Rust | SLOs | Terraform | vLLM | Web Services
Education
N/A