
[RFC][Standard] Add `std.clamp*` operation

Hi,

I’d like to propose a std.clamp operation for the Standard dialect. It takes a value, a lower bound, and an upper bound, and returns the value clamped to the [lower, upper] range. This operation is found in many programming models, such as C++, OpenCL, SYCL, and CUDA. Besides, having clamp as a single operation would allow the compiler to make assumptions about the actual value range.
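
For illustration, usage could look something like the following (a hypothetical syntax sketch, not necessarily the exact form in the review):

    // Hypothetical syntax sketch; %x is clamped to the range [%lo, %hi].
    %r = std.clamp %x, %lo, %hi : f32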

Link to review: https://reviews.llvm.org/D84011

Can you please elaborate on how this is intended to be used?

Apart from the obvious foldings / canonicalizations, which passes would benefit from using this op, and in what scenarios?
Also, does this just lower to a double-select on all hardware, or do you see backend-specific variations?
Also, how does it / should it behave w.r.t. vectorization?

Thanks!

My current use of clamp is literally clamping values.
However, I’ve been thinking that it can also be used in static analysis. Consider a simple OpenCL convolution kernel: your global IDs are the coordinates of the output image, and you get the coordinates of the source image by adding an offset to the global index. The clamp operation allows the compiler to safely assume that all data accesses stay within the memory allocation.
Another use that comes to mind: when values are clamped to some range, the compiler can sometimes narrow the data type to save memory. In pseudocode like the following, for example, all the y values actually fit into fp16 or a similarly narrow type.

    #include <algorithm>  // for std::clamp (C++17)

    // f is some arbitrary per-element function.
    void apply(const float* x, float* y, int N, float (*f)(float)) {
      for (int i = 0; i < N; i++) {
        float r = f(x[i]);
        y[i] = std::clamp(r, -1.0f, 1.0f);  // every y[i] lies in [-1, 1]
      }
    }

Lowering clamp can be a bit tricky, however. For scalars it is truly just a double-select. For SIMT or SIMD targets, though, it requires some more work. OpenCL C, GLSL, and CUDA already provide this kind of functionality, which can be reused. On x86 I was thinking of using masked vector intrinsics to generate correct loads and stores.
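
For scalars, that expansion might look like the following sketch using existing std ops (value names are illustrative, and floating-point NaN subtleties are ignored here):

    // Sketch: clamp %x to [%lo, %hi] as two compare+select pairs.
    %too_lo = cmpf "olt", %x, %lo : f32
    %t      = select %too_lo, %lo, %x : f32
    %too_hi = cmpf "ogt", %t, %hi : f32
    %r      = select %too_hi, %hi, %t : f32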

Another lowering path worth considering is a combination of min and max operations, but as far as I can tell, those are not currently available in the Standard dialect.

So I don’t think I would buy a new higher-level op just for this purpose: my knee-jerk reaction is “just use double-select” :slight_smile: The main reason is that new ops may require a lot of work to properly canonicalize, fold, and compose with the rest of the infra, and to generate good LLVM IR (at least for now; see some discussion below).

If an op implements functionality that is not achievable with existing abstractions, or that improves static analysis, those are usually strong signals, but I don’t see them here yet.

Right, this is similar to some of the Halide strategies, which work well on GPUs. This is one of the reasons I am interested in more elaboration here: clamping along parallel dimensions is fine and just results in a race at read/write time to the same memory location. However, on reduction dimensions one really needs to insert the neutral element of the op, or the computation will be different. So I don’t think it is magically “safe”.

OTOH there may be other compelling uses I am not aware of.

On the performance front, GPU + predication + hardware usually do a really great job. For CPUs, as you mention, this is much trickier.

Yup, this is what vector.transfer_read / vector.transfer_write do. The transfer_read takes an additional scalar value to pad with (the neutral element, for the reduction case mentioned above). This lowers to llvm.masked_load/store and works reasonably well standalone, but does cap performance. Many canonicalizations and foldings are necessary for performance, but once they are there, the CPU sings. A clamping abstraction would need similar canonicalizations and foldings, which is what makes me hesitant.

There are affine.min and affine.max, which for all intents and purposes do the job. These have been relaxed to also work outside the affine world; here is a small snippet of IR from some stuff I am working on these days:

    scf.for %arg5 = %c0 to %1 step %c16 iter_args(%arg6 = %28) {
      %36 = affine.min #map4(%arg5)[%4]
      %37 = subview %arg0[%arg3, %arg5] [%12, %36] [1, 1]  : memref<?x?xf32> to memref<?x?xf32, #map2>
      %38 = affine.min #map4(%arg5)[%5]
      %39 = subview %arg1[%arg5, %arg4] [%38, %14] [1, 1]  : memref<?x?xf32> to memref<?x?xf32, #map2>
      %40 = cmpi "slt", %c0, %12 : index
      %41 = scf.if %40 -> (vector<6x16xf32>) {
        %133 = vector.transfer_read %37[%c0, %c0], %cst : memref<?x?xf32, #map2>, vector<16xf32>
        %134 = vector.insert %133, %cst_1 [0] : vector<16xf32> into vector<6x16xf32>
        scf.yield %134 : vector<6x16xf32>
      } else {
        %133 = vector.insert %cst_2, %cst_1 [0] : vector<16xf32> into vector<6x16xf32>
        scf.yield %133 : vector<6x16xf32>
      }
    ...

I find that with this combination of things I am able to get to peak performance without needing to introduce a clamp.

More advanced scenarios are also being developed by @aartbik with gather / scatter ops that lower to masked LLVM form. These introduce fundamentally new capabilities, so they will be easier to embrace.

In light of the above, my recommendation would be to reuse existing abstractions, or to find a motivating counter-example where the existing abstractions are insufficient (for example, it is possible that n-D clamping is much nicer to vectorize to n-D vectors).

In the absence of that, I am generally worried by new higher-level ops that overlap with existing ones, because of the amount of work necessary to canonicalize, fold, and integrate them with transformations (especially when proper OpInterfaces do not exist yet).

Do you see a way to reuse existing infra, or do you have stronger concrete examples that would make it easier to embrace a ClampOp?

Thanks!

I didn’t know I could use affine.min/max outside an affine scope. I guess that does pretty much what I need.
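
For the record, the clamp I need can then be written as a max/min pair along these lines (a sketch with illustrative names, assuming the operands are valid symbols in this context):

    // Sketch: clamp index %x to [%lo, %hi] with affine.max / affine.min.
    #id2 = affine_map<()[s0, s1] -> (s0, s1)>
    %t = affine.max #id2()[%x, %lo]   // max(%x, %lo)
    %r = affine.min #id2()[%t, %hi]   // min(%t, %hi)

So I’ll close the differential. Thanks for the advice.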

Yes, this is unfortunately confusing, but affine.min, affine.max, affine.apply, AffineExpr, and AffineMap are really standard entities that are understood by Affine dialect analyses and transforms.

The other affine abstractions may have more stringent affine behavior (e.g. not everything can be a symbol, affine scope requirements, etc.). In any case, multiple affine analyses and transforms require that a symbol follow the constraints of an affine scope.

As a concrete example, the composition of AffineMaps does not know what is allowed to be a symbol; it does not even know about SSA values (it does when used in an affine.apply or affine.min/max).

That’s actually very interesting. I’d expect more operations to define an affine scope. For example, a GPU kernel’s ND-range can be represented as a set of nested loops (maybe with a few ifs in the case of a non-uniform ND-range), so many (all?) of the affine passes could work on GPU kernels. This isn’t supported at the moment, as far as I can see.

I can see some use for an integral vector std.clamp as an intermediate step when progressively lowering towards SIMD saturation arithmetic, but even here min/max pairs work just as well.
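
For instance, a saturating i32 to i8 narrowing can already be sketched with compare+select pairs and a truncation (illustrative names; elementwise std ops on vectors):

    // Sketch: saturating truncation of vector<4xi32> to vector<4xi8>,
    // i.e. clamp to [-128, 127] and then truncate.
    %min = constant dense<-128> : vector<4xi32>
    %max = constant dense<127> : vector<4xi32>
    %gt  = cmpi "sgt", %v, %max : vector<4xi32>
    %t0  = select %gt, %max, %v : vector<4xi32>
    %lt  = cmpi "slt", %t0, %min : vector<4xi32>
    %t1  = select %lt, %min, %t0 : vector<4xi32>
    %r   = trunci %t1 : vector<4xi32> to vector<4xi8>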

GPUs have specific support for clamp (in LLVM both the NVPTX and AMDGPU targets have intrinsics).

I don’t remember the details, but without fast-math you may not always be able to easily recover a clamp intrinsic in the backend from a sequence of instructions (something to do with correctly handling NaN, Inf, and denormals). @jurahul may know more?

When fast-math is off, the clamp operation should ultimately be lowered to whatever the spec of the floating-point environment you are using prescribes (which, for example, in OpenCL is a min+max sequence). If std.clamp* is a way to get access to the underlying support on architectures with special support for the operation, then I guess it depends on what kind of behavior that underlying hardware instruction has. What is the semantics of std.clamp itself? Is std.clamp defined as min+max, or is it whatever the platform we are compiling for defines as clamp? (I believe all the ones you cited define clamp as min+max, but some other platform might define clamp differently.) If it is defined to be platform-specific, then going from instructions to clamp might be trickier.

My understanding is that clamp (and the underlying min/max) can in general have two behaviors:

  1. The IEEE 754-2008 behavior, where min/max of a non-NaN and a NaN returns the non-NaN value, and returns NaN only if both inputs are NaN. This corresponds to the DX10+ and OpenCL definitions.
  2. The NaN-propagating behavior, where the output is NaN if either input is NaN.

Some architectures may only support (2) natively, which means implementing (1) would potentially need a software workaround, and that could make any pattern matching in the backend difficult.

I think the correct behavior (1) for clamp can be implemented as a sequence of min/max, assuming those also behave as defined in (1). A correctly behaving min() could also be implemented using a select (result = a > b ? b : a), but the condition would have to be something like “(a > b || is_nan(a)) && !is_nan(b)” to get the correct behavior, which a single comparison does not generally compute directly.
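
As an illustration, one possible select-based expansion of a behavior-(1) min using std ops might be (a sketch; value names are illustrative):

    // Sketch: min(%a, %b) with behavior (1): if exactly one input is NaN,
    // return the other input; the result is NaN only if both are NaN.
    %olt  = cmpf "olt", %a, %b : f32   // ordered <: false if either is NaN
    %m    = select %olt, %a, %b : f32  // yields %b when %a is NaN
    %bnan = cmpf "uno", %b, %b : f32   // true iff %b is NaN
    %r    = select %bnan, %a, %m : f32 // yields %a when %b is NaN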

That’s right - the gpu.func op is actually missing the AffineScope trait.