Revel IT presents the below position in partnership with Level D&I Solutions.
ABOUT OUR PARTNERSHIP:
At Revel IT, one of our core values is the belief that all people have value. We also believe a key to someone’s success is helping them find the right role and culture to work in. Our commitment to leveling the playing field is significant enough that we helped create, and now partner with, a firm focused on Diversity and Inclusion: Level D&I Solutions.
We are looking for a Site Reliability Engineer who will build reliable, high-capacity, well-performing systems in support of our client’s mission. As an SRE, you will care about telemetry, cost, security, performance, and reliability in infrastructure. You will collaborate in a DevOps model with product development teams, designing, deploying, and managing automation tools that increase predictability and shorten time to market while reducing cost.
- Code: Java, PHP, NodeJS, and GoLang
- RDBMS: Oracle, PostgreSQL, MySQL
- Cache: Couchbase, Redis, ElastiCache, DynamoDB
- Containers: ECS, K8S, Docker
- Cloud: Amazon AWS
- Telemetry: New Relic, CloudWatch
- Build: Jenkins, CircleCI, GitHub Actions
- Run: PagerDuty, Exigence
Cloud Engineering:
- Hands-on design, analysis, development, and troubleshooting of highly distributed, large-scale production systems and event-driven, cloud-based services
- Ensure repeatability, traceability, and transparency of our infrastructure automation (infrastructure-as-code, monitoring-as-code)
- Participate in continual learning of the AWS ecosystem, game day scenarios, and professional conferences
- Collaborate with development teams to design solutions for enterprise applications on our software stack
- Produce base AMIs and rotate/patch all hosts every 30 days
- Actively monitor AWS Cost Explorer and use the optimizer to decrease costs while maintaining Service Level Objectives
Observability Engineering:
- Own reliability, uptime, system security, cost, operations, capacity, resiliency, and performance, including analysis of each
- Define, monitor, and report on service level indicators for application workloads
- Support on-call rotations for operational duties that have not been addressed with automation, with an eye toward correcting the issues that trigger on-call alarms
- Maintain telemetry that improves visibility into our applications’ performance and business metrics and keeps operational workload in check
- Develop, communicate, and monitor standard processes, in collaboration with other teams, to promote the long-term health and sustainability of operational development tasks
- Support healthy software development practices, including complying with agile software development methodology, building standards for code reviews, work packaging, and continuous delivery
- Partner with CyberSecurity to develop plans and automation that respond to new risks and vulnerabilities
Systems Engineering:
- Collaborate with Systems Admins to coordinate middleware, network, storage, database, Windows, Linux, and VMware maintenance
- Automate legacy on-prem system maintenance and migrate to cloud via thoughtful redesign
Resiliency Engineering:
- Collaborate with dev teams to identify failure points and blast radius of systems
- Validate effectiveness of monitoring and observability configurations
- Coordinate failure injection testing
- Observe and document steady-state production levels and growth patterns
- Plan and forecast for seasonal growth, communicate trend lines to leadership, and enhance infrastructure scaling plans to accommodate 2x planned load
- Coordinate improvements of existing software and infrastructure to meet resiliency goals
Requirements:
- Experience as a software engineer, with practical experience developing, debugging, and deploying enterprise applications
- Experience with infrastructure automation technologies (like Terraform, Puppet, Ansible)
- Expertise in container/container-fleet-orchestration technologies like ECS or Kubernetes
- Cloud and container native Linux administration/build/management skills (AWS AMIs, Packer, etc.)
- Versatility with troubleshooting diverse sets of hosting technologies strongly desired. These include web server platforms, application platforms, operating systems, network components, virtualization technologies, storage, and database platforms.
- Expertise with continuous-deployment based software development lifecycles (e.g. CI/CD)
- Cloud database operations and deployment experience (RDS MySQL/Postgres/Aurora)
- Experience with application caching strategies and high concurrency workloads
- Expertise with Lean/Agile deployment processes (Blue/Green, ZDT, Canary, load balancers/DNS strategies)
- Familiarity with telemetry SaaS systems like New Relic
- Strong problem solving, root cause analysis and systems engineering skills
- Excellent presentation and communication skills
- Ability to design and manage escalation response plans driven by monitoring, and to react, respond, remediate, and run retrospectives in culturally aligned (proactive, customer-focused, collaborative, data-driven) ways
- Demonstrated expertise building and managing highly scaled production infrastructure in the cloud (AWS required; GCP, Azure, OpenStack a plus)
- Expertise with SDLC branching, SCM, and code deployment systems (Git/Gitflow, Jenkins, CircleCI, TravisCI, etc.)
Nice to Have:
- Ability to translate between development, operations, security, product, and management dialects (a highly sought skill)
- Ability to translate knowledge and ideas into written documentation
- BS Degree in Computer Science (or related technical field and/or equivalent industry experience) preferred
ABOUT LEVEL D&I:
Level helps clients address gaps in D&I practices and recruiting. Our mission is to create a Level playing field for people of all backgrounds and promote innovation through diversity of thought. The founders of Level are two former Revel employees.