The Scientific Computing Department (SCD) provides world-class, leading-edge compute and data storage infrastructure to support world-class science within STFC, across the UK and internationally.
The Research Infrastructure (RI) group within SCD defines, develops and manages the underpinning scientific computing infrastructure that supports an extensive range of national and international science projects, including the GridPP Tier1 service, the JASMIN super data cluster and STFC’s central HPC cluster, as well as computing support for STFC facilities such as the ISIS Neutron Source, the Central Laser Facility and the Diamond Light Source. The RI group currently has a vacancy for a highly motivated Linux Systems and Services Administrator to join the team that manages the computing infrastructure supporting these large science projects, as well as the general-purpose computing resources used by the SCD. The current infrastructure includes 20,000+ CPU cores, 30PB+ of online data storage, near-line tape storage, and complex virtualisation and cloud installations, all underpinned by high-performance networking.
The post is based at the STFC Rutherford Appleton Laboratory in Oxfordshire.
List of Duties/Work Programme/Responsibilities
working as part of the team responsible for installing, maintaining and supporting the High Performance Computing (HPC) and High Throughput Computing (HTC) services, as well as the general-purpose scientific computing resources managed by the RI group
ensuring that services operated by the RI group run smoothly and meet their operational commitments
investigating and resolving operational problems and incidents affecting production services, often acting as a first line of response and escalating to specialists as required
working alongside other team members, other groups in SCD, STFC and external collaborators to ensure that the services meet the scientists’ requirements
investigating, recommending and deploying new technologies, services and management tools as required to enhance the service levels provided by the RI group
serving as part of an on-call team
Technical Skills Required
experience in managing Linux machines (ideally RHEL/SL/CentOS), preferably in a production environment
experience in user support roles, preferably in a production environment
Ideally, additional experience in at least one of:
performance and exception monitoring tools such as Nagios/Icinga or Ganglia
scientific computing workflows on HPC or HTC clusters including workload schedulers such as Platform LSF or HTCondor
managing a service platform to a high level of availability
configuration and management of large numbers of Linux servers
large capacity storage solutions
Ethernet and TCP/IP networking
Personal Skills and Attributes
good communication skills, both verbal and written
a proactive attitude to problem solving, service delivery and continuous improvement
the ability to work as a team member towards the delivery of both shared team objectives and personal objectives
commitment to acquiring new skills, given the opportunities to work with a wide range of cutting-edge technologies deployed within the SCD