The Research Infrastructure (RI) group within SCD defines, develops and manages the underpinning scientific computing infrastructure that supports an extensive range of national and international science projects. These include the GridPP Tier1 service, the JASMIN super data cluster and STFC’s central HPC cluster, as well as computing support for STFC facilities such as the ISIS Neutron Source, the Central Laser Facility and the Diamond Light Source.
The RI group currently has a vacancy for a highly motivated IT professional or scientist with experience of delivering scientific computing services or e-infrastructure. The successful candidate will join the team that manages the extensive computing infrastructure supporting these large science projects, as well as the general-purpose computing resources used by SCD.
The current infrastructure includes 20,000+ CPU cores, 30 PB+ of online data storage, nearline tape storage, and complex virtualisation and cloud installations, all underpinned by high-performance networking.
The post is based at the STFC Rutherford Appleton Laboratory in Oxfordshire.
List of Duties/Work Programme/Responsibilities
- Ensuring that the services operated by the RI group run smoothly and meet their operational commitments by:
- ensuring operational problems and incidents affecting production services are identified and resolved in a timely manner, often acting as a first line of response and escalating to specialists as required
- planning for interventions, upgrades and planned downtime periods across a complex interrelated set of services
- tracking service performance against agreed SLAs/SLDs and reporting to relevant stakeholders on service performance and issues
- ensuring the team follows agreed best practices and driving a process of continuous improvement to ensure the high availability of production systems and services
- working alongside other team members, other groups in SCD, STFC and external collaborators to ensure that the services offered are able to meet the scientists’ requirements
- investigating, recommending and deploying new technologies, services and management tools as required to enhance the service levels provided by the RI group
- participating in an on-call team
Technical Skills Required
- ability to manage a large multi-user service platform to a high level of availability
- ability to support users of computing systems, responding to problems and developing solutions in a timely manner as required
- strong problem-solving and analysis skills
Ideally, additional experience in at least one of:
- performance and exception monitoring tools such as Nagios/Icinga or Ganglia
- managing scientific computing workflows on HPC or HTC clusters
- system administration procedures and practices
Personal Skills and Attributes
- good communication skills, both verbal and written
- a proactive attitude to service delivery and continuous improvement
- the ability to negotiate and reach consensus with multiple stakeholders, often with conflicting requirements
- the ability to work well as a team member or team leader, ensuring that the team meets its commitments
- the ability to plan and manage workloads towards the delivery of both common team and personal objectives
- a commitment to acquiring new skills, given the opportunities to work with a large range of cutting-edge services deployed within SCD
Any other Relevant Information
- occasional UK and overseas travel