Site Reliability Engineer (SRE)

Bay Systems Berkeley, CA
  • Posted: over a month ago
  • $38 Hourly
  • Contractor
  • Benefits: Vision, Medical, Dental

Position Title: Site Reliability Engineer 1 (SRE)

Location: Lawrence Berkeley National Laboratory

Start Date: ASAP – 1-year Contract – Medical; Dental; Vision

Pay Rate: $38.00/Hour

Position Summary:

The Site Reliability Engineer 1 is an entry-level position at the National Energy Research Scientific Computing (NERSC) Center at the Lawrence Berkeley National Laboratory. This position is an integral member of a team that provides multipurpose engineering support for a high-performance computational facility and an unclassified high-speed network to advance scientific discovery. In addition to monitoring both facilities, the engineer will improve the existing tools used to ensure the reliability of the systems, manage outages to minimize downtime, ensure that the correct solutions to issues are applied, and follow through to determine the reasons for outages.

Duties / Responsibilities:

We are looking to hire individuals to join our team of Site Reliability Engineers to ensure that NERSC and ESnet are accessible, reliable, secure and available to our users on a 24x7 basis. You will be performing tasks such as the following:

  • Using your Linux system administration skills, monitor and manage the reliability of the systems under the responsibility of the Control Room Bridge.
  • Under the supervision of the Project Lead, help develop and maintain monitoring tools used to support the HPC community within NERSC, using programming languages like C, C++, Python, Java or Perl.
  • Provide input in the design of software, workflows and processes that improve the monitoring capability of the group to ensure the high availability of the HPC services provided by NERSC and ESnet.
  • Assist in the testing and implementation of new monitoring tools, workflows and new capabilities for providing high availability for the systems in production.
  • Assist in direct hardware support of our data clusters by managing component upgrades and replacements (DIMMs, hard drives, cards, cables, etc.) to ensure the efficient return of nodes to production service.
  • Maintain outage documentation through a trouble ticketing system.
  • Assist in investigating and evaluating new technologies and solutions to push the group’s capabilities forward, getting ahead of our users’ needs and keeping staff incentivized to transform, innovate and continually improve.

REQUIRED QUALIFICATIONS:

  • Bachelor’s degree in Computer Science or a similar discipline, or equivalent experience.
  • Minimum of 3 years of related experience, including 1 year as a system administrator or systems engineer in a high-volume, customer-facing environment supporting data clusters, managing the replacement of hardware, and ensuring its continuous availability to the user community. This includes assisting in the deployment of new nodes and internal switches into production, resolving ticket incidents and working with vendors on hardware warranty replacements.
  • Minimum of 1 year of experience in UNIX or Linux, networking, and IT infrastructure environments, and management experience in a distributed-computing environment.
  • Experience with, or appropriate semesters of classes in, programming languages such as C, C++, Perl, Java and Python, or a scripting language.
  • Experience in, or classes in, TCP/IP-related technologies (networking protocols and network programming, e.g. TCP/IP, UDP, ICMP; MAC addresses, IP packets, DNS, OSI layers, load balancing, etc.).
  • Experience in, or appropriate classes or certifications in, enterprise support of Unix variants (Linux/Solaris/BSD).
  • Strong understanding of monitoring implementations and administration
  • Past experience in Incident Management and good understanding of IT service management.
  • Experience in working in a 24/7 team.
  • Exposure to Oracle and high-end storage infrastructure (Hitachi/EMC Tier 1), or appropriate classes in these areas.
  • Exposure to, or classes in, configuring distributed, server-based infrastructure supporting a high volume of transactions in a mission-critical Linux environment.
  • One to two years of experience in large data communications networks and IT infrastructure supporting critical systems and applications.
  • Knowledge of network security: configuring/maintaining ACLs, knowledge of firewalls
  • Strong communication skills and ability to work effectively across multiple business and technical teams
  • Demonstrated ability to deliver results on time with high quality
  • Excellent problem solving skills.

Bay Systems

Bay Systems is an Aerospace & Defense federal contractor in the San Francisco Bay Area with an expanding client portfolio, including the Dept. of Defense, Dept. of Energy, NASA, etc. Currently, we represent one of the fastest-growing enterprises in the applied sciences and information technology field.
