HPCRSE@RSECon24: 3rd annual meeting of the HPC RSE community

Submitted by Dr Marion Weinzierl, Senior Research Software Engineer, ICCS, University of Cambridge

This year at RSECon24 we celebrated the 3rd meeting of the HPC RSE community, with members joining from the UK and internationally. We took this chance to discuss recent developments and topics in our space, share experiences and best practices, and hopefully foster collaboration and a community feeling.

This session has grown from a satellite event at RSECon in 2022 to the very first RSECon Birds of a Feather session in 2023 to the HPC RSE March 2024 Online Meetup and finally to this year’s event at the Community Day, which attracted over 100 participants. It’s noteworthy that other Birds of a Feather sessions were invited during the RSECon24 Community Day – we will take this as proof that HPC RSEs are trailblazers!

A photo of a lecture theatre full of people watching a presentation.

Samantha Ahern standing at a lectern, giving a presentation. The slide on the screens behind her says "what are upcoming training priorities?"

The organising team has grown as well: the original team in 2022 consisted of just Andy Turner (EPCC) and Marion Weinzierl (ICCS, University of Cambridge). In 2023 they were joined by Ed Hone (University of Exeter), and this year the team doubled in size when Nick Brown (EPCC), Eirini Zormpa (Imperial College London) and Tuomas Koskela (UCL) joined.

Technical Presentations and Discussion

The first technical presentation was given by Charles Ferembaugh from Los Alamos National Laboratory. He spoke about refactoring and porting efforts of the xRAGE Fortran code, a collaborative effort by a team of computer scientists, software engineers and domain scientists. After many failed attempts at porting the code to large GPU systems, a successful port was made using C++ and Kokkos kernels.

The second technical presentation was given by Tom Meltzer from the University of Cambridge. He spoke about how most software developers rely on proprietary tools to debug parallel software, even though this leaves them dependent on institutions paying for expensive licences. He then introduced an open-source debugger, mdb, that he is developing to address the gap in the market.

The technical presentations were followed by a panel discussion. The panel consisted of the two presenters, Charles Ferembaugh and Tom Meltzer, together with Nick Brown and Miren Radia (DiRAC/University of Cambridge). The discussion started with the observation that only a few codes can take advantage of exascale systems right now, due to technical debt accrued from a previous era of computing. Porting these codes to run on new systems takes significant refactoring or rewriting effort. Eight to ten years ago, the only viable option for achieving performance on large GPU systems was to rewrite the code from Fortran to C++. The landscape is now shifting to mixed-language solutions, which imposes additional demands on developers to gain expertise in both Fortran and C++. It seems that mixed-language solutions are here to stay: despite the advanced features Fortran offers, productivity is hampered by poor support from compiler vendors. There is, however, hope that LLVM and projects like Flang will mitigate the compiler issue.

Fortran was discussed at length, and the adoption of modern software development tools, such as the Fortran package manager fpm, was raised. Earlier in the conference, the Back to the Fortran Future satellite event had touched on the same topic. The panel was concerned about the uncertainty of long-term support for such projects and called for large institutions to support community projects instead of in-house solutions.

Portability dominated the latter part of the panel discussion. The audience noted that AMD GPUs are often readily available for purchase, but porting CUDA code to run on them is a challenge. Technologies such as Kokkos and AMReX can ameliorate the issue by abstracting the hardware-specific backend away from the user. Exascale software projects should avoid technologies that lock them into vendor-specific hardware. Directive-based approaches are another way of addressing the issue, but compiler support is unfortunately not yet good enough to make them equally portable. To port applications from one HPC system to another, projects should prioritise maintainability and users, have a robust testing system and ensure portability of build systems by using tools like Spack.

HPC Services Lightning Updates and Posters

The technical work of the first session would not be possible without HPC service providers. As such, we felt it important to give service providers an opportunity to update their existing users on their latest activities and to advertise what they have to offer to potential new users. Seven service providers took part, each giving a two-minute lightning talk and presenting a poster during the coffee break.

Community and Training Presentations and Discussion

Marcus Keil opened the second part of the session with a talk on PyProfQueue, a Python package designed to simplify profiling batch queue work on HPC systems. The aim of this package is to encourage research groups and the RSE community to regularly profile (and optimise) their code. The hope is that this will reduce wasted computing resources and help meet sustainability goals without compromising on scientific output.

Samantha Ahern then led an interactive discussion on advanced training needs for RSEs working in HPC. This resulted in a spirited exchange of ideas, with the audience expressing a clear desire for more training opportunities. During the discussion on next steps, many said that the key barrier to accessing training was their time, rather than the availability or quality of resources.

After this discussion, both speakers were joined by Eirini Zormpa and Juan Herrera for a panel discussion on HPC community and training. The panel covered topics including:

engagement with other similar groups such as HPC-SIG;
whether HPC training should be adapted to reach the wider range of RSEs using HPC for AI and machine learning;
how attendance at RSE training could be improved by better coordination of training as well as freeing up time for RSEs to engage with the training programmes.

Wrap up

The discussions could have gone on for longer, but we eventually had to bring the session to a close. However, there is clearly an appetite for continuing annual meetings at RSECon, as well as online meetings throughout the year. The organising team will tackle setting up a Special Interest Group under the RSE Society as a next step – hopefully we can report progress on this next time we meet!

The most natural place for us to communicate is the #hpc channel on the UK RSE Slack, but you can also find the organising team on WHPC, hpc.social, HPC SIG, and many more! If you want to learn more or get involved, send us a message or grab us when you see us around!