PTX Backend by WillTrojak · Pull Request #18 · PyFR/GiMMiK

WillTrojak · 2026-05-15T12:23:17Z

This adds a PTX backend to GiMMiK. The key features are:

Mild optimisation of existing CUDA algorithms.
Optional async loads for some sparse kernels.
Added dense generation for Hopper and above.

Optimisations have focused on FP64; FP32 is future work.

FreddieWitherden · 2026-05-15T18:31:49Z

I know this is an utter pain, but for FP32/FP64 can you confirm correctness for all relevant PyFR matrices at a suite of N values, for all instances where a kernel is expected to work (on A100/H100/B100)?
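A check along these lines could be scripted as below. This is only a sketch: `check_kernel`, the `kernel` callable and the tolerances are assumptions made for illustration, not part of GiMMiK or this PR. It compares a candidate kernel against NumPy's `A @ B` over a list of N values.

```python
import numpy as np

def check_kernel(kernel, A, n_values, dtype, rng=None):
    """Compare kernel(A, B) against NumPy's A @ B for each n in n_values."""
    if rng is None:
        rng = np.random.default_rng(0)
    A = np.asarray(A, dtype=dtype)
    # Assumed tolerances; adjust to suit the matrices under test
    tol = 1e-5 if np.dtype(dtype) == np.float32 else 1e-12
    for n in n_values:
        B = rng.random((A.shape[1], n)).astype(dtype)
        ref = A @ B
        # Raises AssertionError on the first mismatch
        np.testing.assert_allclose(kernel(A, B), ref, rtol=tol, atol=tol)
    return True
```

In practice `kernel` would wrap a compiled PTX kernel; passing `lambda A, B: A @ B` exercises the harness itself.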

FreddieWitherden · 2026-05-15T18:33:25Z

+                         .param .u64 _c)
+{
+% endif
+    .reg .u32 n, id, tid_x, tid_y;


Ensure we throw higher up if n is too big.

Checking here

We don't handle n being too large in any of the other backends.

In the embedded case we do: https://github.com/PyFR/GiMMiK/blob/master/gimmik/kernels/cuda/cstream.mako#L20 (the argument case doesn't, but that is not currently used for CUDA).
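One way to throw higher up is a host-side guard before the kernel is generated or launched. The sketch below is an assumption about what that could look like: `check_n` and `U32_MAX` are hypothetical names, and the limit is chosen only because the hunk above declares `n` as a `.u32` register.

```python
# Hypothetical host-side guard; the name and the limit are assumptions
U32_MAX = 2**32 - 1

def check_n(n, limit=U32_MAX):
    """Raise before launch if n cannot be held in a 32-bit register."""
    if not 0 <= n <= limit:
        raise ValueError(f'n = {n} is outside the supported range [0, {limit}]')
    return n
```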

FreddieWitherden · 2026-06-03T17:58:32Z

JSON looks solid. See if we can factor out some of the common code so that other backends (CUDA) can also use it. It also makes the code easier to evaluate standalone. I'll start trying to chunk through the kernels, but it would be great if you could give a one-sentence sketch of their general approach.

FreddieWitherden · 2026-06-03T18:01:15Z

+    }
+
+    # Map Supported CC -> Minimum PTX version
+    PTX_SM = {(8, 0): (7, 0), (9, 0): (8, 6), (10, 0): (8, 7), (10, 3): (8, 7),


Is this okay when new GPUs are released?

Yeah, when new GPUs are released, the behaviour will fall back to the default config. This won't give the best performance but it will work.
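The fallback described here can be pictured roughly as follows. This is a sketch under assumptions: the mapping holds only the entries visible in the truncated hunk above, and `config_name_for` and the `smXY` naming are hypothetical rather than the PR's actual lookup.

```python
# Only the entries visible in the hunk above; the original line is truncated
PTX_SM = {(8, 0): (7, 0), (9, 0): (8, 6), (10, 0): (8, 7), (10, 3): (8, 7)}

def config_name_for(cc, known=PTX_SM):
    """Return a tuned config name for a known CC, otherwise 'default'."""
    if cc in known:
        # Hypothetical naming scheme, e.g. (9, 0) -> 'sm90'
        return f'sm{cc[0]}{cc[1]}'
    return 'default'
```

An unlisted compute capability, such as a future GPU, therefore resolves to `'default'` rather than raising.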

FreddieWitherden · 2026-06-25T14:52:19Z

Does it make sense to move config up a level so it is configs/ptx/ rather than it being under kernels?

FreddieWitherden · 2026-07-02T08:37:05Z

+## Main loop over B-chunks (double-buffered)
+%  for bb in range(len(bchunks)):
+<%
+        buf_cur = bb % 2


Check indentation here.

FreddieWitherden · 2026-07-02T09:29:07Z

        pass

+    def _get_config(self, key):
+        if key not in self._config_cache:


FreddieWitherden · 2026-07-02T09:30:04Z

        # At single precision suffix all floating point constants by 'f'
-        if dtype == 'float':
+        # (PTX doesn't use an 'f' suffix for FP literals)
+        if dtype == 'float' and self.platform != 'ptx':


Have an attr like _needs_fp32_suffix = True|False to avoid the PTX check.
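A minimal sketch of that suggestion, with hypothetical class and method names (`BaseMatMul`, `PTXMatMul` and `_format_const` are not taken from the PR): the base class carries the flag, and the PTX backend overrides it, so the shared code never inspects the platform string.

```python
class BaseMatMul:
    # Backends whose FP32 literals need a trailing 'f' keep the default
    _needs_fp32_suffix = True

    def _format_const(self, value, dtype):
        s = repr(float(value))
        # At single precision, suffix constants with 'f' where required
        if dtype == 'float' and self._needs_fp32_suffix:
            s += 'f'
        return s


class PTXMatMul(BaseMatMul):
    # PTX does not use an 'f' suffix for FP literals
    _needs_fp32_suffix = False
```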

FreddieWitherden · 2026-07-02T10:30:36Z

+        cfg = [k for k in cfgs if self._usable_config(k, dtype, cc, smem_info)]
+
+        for k in cfg:
+            if prepared := self._get_render_args(


Probably cleaner not to use walrus here.
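Without the walrus the loop might read as below. The hunk is truncated, so the loop body is an assumption: this sketch supposes the intent is to return the first candidate that prepares successfully. `first_prepared` and `prepare` are hypothetical names.

```python
def first_prepared(candidates, prepare):
    """Return the first truthy result of prepare(k), or None if there is none."""
    for k in candidates:
        prepared = prepare(k)
        if prepared:
            return prepared
    return None
```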

FreddieWitherden · 2026-07-02T10:31:42Z

+    def _sparse_args(self, tpl, params, block, dtype, dsize, args, meta):
+        blockx = block[0]
+        args |= {'has_zero_rows': bool(self.has_zero_rows),
+                 'row_nz': [[(kx, self.A[j, kx]) for kx in range(self.k)


Messy; NumPy should help here.
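One possible NumPy formulation, assuming `A` is a 2-D array and that `row_nz` should hold `(column, value)` pairs for each row; `row_nonzeros` is a hypothetical helper name.

```python
import numpy as np

def row_nonzeros(A):
    """For each row of A, list (column, value) pairs of its non-zero entries."""
    A = np.asarray(A)
    out = []
    for row in A:
        cols = np.flatnonzero(row)      # column indices of the non-zeros
        out.append(list(zip(cols.tolist(), row[cols].tolist())))
    return out
```

Rows with no non-zero entries yield an empty list, matching the nested comprehension it would replace.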

FreddieWitherden · 2026-07-02T10:32:03Z

+        args |= {'has_zero_rows': bool(self.has_zero_rows),
+                 'row_nz': [[(kx, self.A[j, kx]) for kx in range(self.k)
+                     if self.A[j, kx] != 0] for j in range(self.m)],
+                 'preload_c': bool(params.get('preload_c', False)),


Is preload_c not always in params? Try to avoid being overly defensive.
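If every config is guaranteed to define the key (an assumption this sketch makes), direct indexing is enough, and a malformed config fails loudly instead of silently defaulting to False. `sparse_flags` is a hypothetical helper used only for illustration.

```python
def sparse_flags(params):
    # Direct indexing: a config that omits the key raises KeyError at once
    return {'preload_c': bool(params['preload_c'])}
```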

FreddieWitherden · 2026-07-02T10:33:07Z

+        if tpl.startswith('dmma-asmem'):
+            args |= {
+                'a_copy_threads': 32 * warps,
+                'block_stealing': bool(params.get('block_stealing', False)),


Same here, try to avoid being overly defensive.

FreddieWitherden · 2026-07-02T10:33:24Z

+        tpl = kernel_cfg['template']
+        nn = params['nn']
+        warps = params['warps']
+        tile = kernel_cfg['tile']


Can put some of these definitions onto the same line.

FreddieWitherden · 2026-07-02T10:34:09Z

+            'b_smem_kgroup_stride': 4 * n_per_cta * args['dwidth_i'],
+            'b_smem_ntile_stride': setup['tile_n'] * args['dwidth_i'],
+            'blockx_total': 32 * warps * msplit,
+        } | offsets


Maybe merge offsets in the return statement, i.e. return tpl, args | offsets, meta
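Roughly, under the assumption that `offsets` is a plain dict built earlier in the method (`dense_args` and the literal values are hypothetical):

```python
def dense_args(tpl, args, offsets, meta):
    # Build the kernel-specific entries first (illustrative value only)
    args = args | {'blockx_total': 32*4}
    # Merge offsets at the point of return; on a key clash offsets wins
    return tpl, args | offsets, meta
```

The `|` operator on dicts needs Python 3.9 or later, which the `args |= {...}` in the hunks above already relies on.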

FreddieWitherden · 2026-07-02T10:34:26Z

+        ptx_shape = f'm{tile_m}n{tile_n}k{tile_k}'
+
+        m_groups, k_groups = tile_m // 8, tile_k // 4
+        a_regs = m_groups * k_groups


No space around * in general.

FreddieWitherden · 2026-07-02T10:34:39Z

+            return None
+
+        if (width == 2
+                and (self.aligne is None or self.aligne % 2


Icky indentation.

FreddieWitherden · 2026-07-02T14:35:49Z

+    @staticmethod
+    def _pred_emit(instr, *preds, pred_reg=None, indent=8 * ' '):
+        # Handle whether an instruction needs a predicate or not
+        actual = [p for p in preds if p is not None]


Why would None get passed in?

Will Trojak and others added 6 commits December 2, 2025 22:13

[wip] added ptx generator for bstream

0cd7485

Addtional sparse and dense work

626c2f5

Dense and sparse optimisation

bbbb8ef

Added warp specialised dense kernel

393b409

Performance tuning and cleanup

67d1beb

Whitespace

e2a818b

WillTrojak mentioned this pull request May 15, 2026

Support for GiMMiK PTX Provider PyFR/PyFR#556

Open

FreddieWitherden reviewed May 15, 2026

Comment thread gimmik/kernels/ptx/bstream-msplit.mako Outdated

Comment thread gimmik/ptx.py Outdated (9 threads)

Comment thread gimmik/kernels/ptx/base.mako Outdated

Comment thread gimmik/kernels/ptx/bstream-msplit.mako Outdated (3 threads)

Comment thread gimmik/kernels/ptx/cstream-ksplit.mako Outdated

Comment thread gimmik/kernels/ptx/bstream.mako

Cleanups, formating and addressign comments

7d7299a

FreddieWitherden reviewed May 19, 2026

Comment thread gimmik/ptx.py Outdated (4 threads)

WillTrojak added 4 commits June 3, 2026 10:25

msplit dmma

d4e1216

Refactored arg setup

b9ac47c

Updated Blackwell profile

66e3796

updated sm100 config

010d13c

FreddieWitherden reviewed Jun 3, 2026

Comment thread gimmik/ptx.py Outdated (7 threads)

Cleanups and added default config.

1e6e55d

FreddieWitherden mentioned this pull request Jun 18, 2026

Add tuned HIP GiMMiK preload-C and width variants with non-temporal loads and stores #19

Open

WillTrojak added 2 commits June 24, 2026 02:22

FP32 configs and kernels

1acb3b5

Refactoring and cleanup

4bf8c91

FreddieWitherden reviewed Jul 2, 2026

Comment thread gimmik/base.py

        pass

    def _get_config(self, key):
        if key not in self._config_cache:

FreddieWitherden Jul 2, 2026

EAFP
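In EAFP style ("easier to ask forgiveness than permission") the cache lookup is attempted first and the miss is handled in an `except` clause, rather than testing membership up front. A sketch, assuming a hypothetical `ConfigCache` wrapper and `loader` callable that are not part of the PR:

```python
class ConfigCache:
    def __init__(self, loader):
        self._loader = loader          # hypothetical callable: key -> config
        self._config_cache = {}

    def _get_config(self, key):
        # EAFP: try the lookup and handle the miss, instead of `if key not in`
        try:
            return self._config_cache[key]
        except KeyError:
            cfg = self._config_cache[key] = self._loader(key)
            return cfg
```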
