Commit Graph

31 Commits

Author SHA1 Message Date
Joseph Huber
0584e6c166 [libc] Explicitly pin memory for the HSA memory transfer (#73973)
Summary:
This portion of code handles mapping the RPC client memory over to the
device. HSA copies need to be between two slices of memory that HSA has
allocated. Previously we used coarse-grained memory to act as the host
source. However, support for this varies depending on the kernel version
and should not be relied upon. This patch changes that handling
to use the `hsa_amd_memory_lock` API to explicitly pin memory to a
location sufficient for a DMA transfer to the GPU.
2023-11-30 13:46:52 -06:00
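
A minimal sketch of the pinning pattern described above, assuming an already-initialized HSA runtime, a GPU agent `dev_agent`, and a device allocation `dev_ptr` of at least `size` bytes (all hypothetical names):

```
#include <hsa/hsa.h>
#include <hsa/hsa_ext_amd.h>

// Copy `size` bytes from an arbitrary host buffer to device memory by
// explicitly pinning the host buffer first, rather than relying on
// coarse-grained host allocations being DMA-visible.
hsa_status_t copy_to_device(hsa_agent_t dev_agent, void *dev_ptr,
                            void *host_ptr, size_t size) {
  // Pin the host memory and obtain a pointer the agent can use for DMA.
  void *locked_ptr = nullptr;
  hsa_status_t err = hsa_amd_memory_lock(host_ptr, size, &dev_agent,
                                         /*num_agent=*/1, &locked_ptr);
  if (err != HSA_STATUS_SUCCESS)
    return err;

  // The copy is now between two HSA-visible locations.
  err = hsa_memory_copy(dev_ptr, locked_ptr, size);

  // Unpin regardless of whether the copy succeeded.
  hsa_amd_memory_unlock(host_ptr);
  return err;
}
```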
Joseph Huber
8341a40ec1 [libc] Update the AMDGPU implementation to use code object 5 (#72580)
Summary:
This patch includes the necessary changes to make the `libc` tests
running on AMD GPUs use the newer code object version. The 'code
object version' is AMD's internal ABI for making kernel calls. The move
from 4 to 5 changed how we handle arguments for builtins such as
obtaining the grid size or setting up the size of the private stack.

Fixes: https://github.com/llvm/llvm-project/issues/72517
2023-11-21 07:14:10 -06:00
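
For illustration, a device-side sketch of the kind of builtin affected: under code object 5 clang lowers grid-size queries to reads of the implicit kernel arguments rather than the dispatch packet, while the source stays unchanged (the kernel and its contents are hypothetical):

```
// Sketch: computing a global thread ID on AMDGPU. With code object 5 the
// grid-size builtin reads from the implicit kernel arguments; with code
// object 4 it read from the AQL dispatch packet. The source is the same.
extern "C" [[clang::amdgpu_kernel]] void write_ids(unsigned *out) {
  unsigned id = __builtin_amdgcn_workgroup_id_x() *
                    __builtin_amdgcn_workgroup_size_x() +
                __builtin_amdgcn_workitem_id_x();
  // Guard against a partial trailing workgroup using the total grid size.
  if (id < __builtin_amdgcn_grid_size_x())
    out[id] = id;
}
```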
Jon Chesterfield
f0e100a05a [amdgpu][openmp] Treat missing TIMESTAMP_FREQUENCY as non-fatal (#70987)
If you build with dynamic_hsa, the symbol is known and compilation
succeeds. If you then run with a slightly older libhsa, this argument is
not recognised and an error is returned. I'd rather the program run with
a misleading omp wtime than refuse to run at all.
2023-11-01 22:43:34 +00:00
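
A sketch of the non-fatal handling described above; the attribute is the AMD extension the patch title references, shown here as `HSA_AMD_AGENT_INFO_TIMESTAMP_FREQUENCY`, which older libhsa builds may not recognise:

```
#include <hsa/hsa.h>
#include <hsa/hsa_ext_amd.h>
#include <cstdint>

// Query the fixed-frequency clock rate, but tolerate older runtimes that
// do not recognise the attribute: fall back to a default rather than abort.
uint64_t get_timestamp_freq_or_default(hsa_agent_t agent,
                                       uint64_t fallback_hz) {
  uint64_t freq = 0;
  hsa_status_t err = hsa_agent_get_info(
      agent,
      static_cast<hsa_agent_info_t>(HSA_AMD_AGENT_INFO_TIMESTAMP_FREQUENCY),
      &freq);
  // An unrecognised attribute yields an error on older libhsa; a slightly
  // wrong omp wtime beats refusing to run at all.
  return err == HSA_STATUS_SUCCESS ? freq : fallback_hz;
}
```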
Joseph Huber
9e390a1408 [libc][Obvious] Fix missing semicolon in AMDGPU loader implementation
Summary:
Title
2023-10-30 14:58:46 -05:00
Jon Chesterfield
896749aa0d [amdgpu][openmp] Avoid writing to packet header twice (#70695)
I think it follows from the HSA spec that only a write to the first byte
is deemed significant to the GPU, in which case writing to the second
short and reading back from it later would be safe. However, the examples
for this all involve an atomic write to the first 32 bits, and it seems a
credible risk that the occasional CI errors about invalid packets have as
their root cause the firmware noticing the early write to packet->setup
and treating that as a sign that the packet is ready to go.

That was overly paranoid; however, in passing I noticed that the code in
libc is genuinely invalid. The memset writes a zero to the header byte,
changing it from type_invalid (1) to type_vendor (0), at which point the GPU is
free to read the 64 byte packet and interpret it as a vendor packet,
which is probably why libc CI periodically errors about invalid packets.

Also a drive-by change to do the atomic store on a uint32_t
consistently. I'm not sure offhand what __atomic_store_n on a uint16_t*
and an int resolves to; it seems better to be unambiguous there.
2023-10-30 18:35:52 +00:00
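
A sketch of the single-store pattern using the standard HSA packet header enums: `header | (setup << 16)` publishes the 16-bit setup field and the 16-bit header in one atomic release store, so the GPU can never observe a half-written packet:

```
#include <hsa/hsa.h>
#include <cstdint>

// Publish a kernel dispatch packet with one 32-bit atomic store covering
// both the header and the setup field. The packet type changes from
// INVALID to KERNEL_DISPATCH only at this point, never earlier.
void publish_packet(hsa_kernel_dispatch_packet_t *packet) {
  uint16_t setup = 1 << HSA_KERNEL_DISPATCH_PACKET_SETUP_DIMENSIONS;
  uint16_t header =
      (HSA_PACKET_TYPE_KERNEL_DISPATCH << HSA_PACKET_HEADER_TYPE) |
      (HSA_FENCE_SCOPE_SYSTEM << HSA_PACKET_HEADER_ACQUIRE_FENCE_SCOPE) |
      (HSA_FENCE_SCOPE_SYSTEM << HSA_PACKET_HEADER_RELEASE_FENCE_SCOPE);
  uint32_t header_word = header | (uint32_t(setup) << 16);
  __atomic_store_n(reinterpret_cast<uint32_t *>(packet), header_word,
                   __ATOMIC_RELEASE);
}
```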
alfredfo
f350532099 [libc] Fix accidental LIBC_NAMESPACE_clock_freq (#69620)
See-also: https://github.com/llvm/llvm-project/pull/69548
2023-10-19 19:39:02 +02:00
Guillaume Chatelet
b6bc9d72f6 [libc] Mass replace enclosing namespace (#67032)
This is step 4 of
https://discourse.llvm.org/t/rfc-customizable-namespace-to-allow-testing-the-libc-when-the-system-libc-is-also-llvms-libc/73079
2023-09-26 11:45:04 +02:00
Joseph Huber
59896c168a [libc] Remove the 'rpc_reset' routine from the RPC implementation (#66700)
Summary:
This patch removes the `rpc_reset` function. This was previously used to
initialize the RPC client on the device by setting up the pointers to
communicate with the server. The purpose of this was to make it easier
to initialize the device for testing. However, this prevented us from
enforcing an invariant that the buffers are all read-only from the
client side.

The expected way to initialize the server is now to copy it from the
host runtime. This will allow us to maintain the invariant that the RPC
client is in the constant address space on the GPU, potentially through
inference, improving caching behaviour.
2023-09-21 11:07:09 -05:00
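
A rough sketch of host-side initialization under this scheme (the struct and names are illustrative): build the client on the host and copy it over the device's client global, so the device-side copy can stay read-only:

```
#include <hsa/hsa.h>

// Illustrative: the loader resolves the address of the device-side global
// holding the RPC client and overwrites it with a fully-initialized host
// copy, so the device never needs to write its own client state.
struct rpc_client_t { /* pointers to inbox, outbox, packet buffer, ... */ };

hsa_status_t init_device_client(void *dev_client_addr,
                                const rpc_client_t &host_client) {
  // Both locations must be HSA-visible; the host copy lives in
  // fine-grained memory so it is a valid copy source.
  return hsa_memory_copy(dev_client_addr, &host_client,
                         sizeof(rpc_client_t));
}
```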
Joseph Huber
701e6f7630 [libc][fix] Fix buffer overrun in initialization of GPU return value
Summary:
The HSA API explicitly states that the size is a count of uint32_t's, not
a byte count. This was erroneously being used as a simple memcpy,
causing some weird behaviour. Fix this by correctly passing `1` to
initialize a single integer to zero.
2023-09-02 17:59:01 -05:00
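
The API in question is `hsa_amd_memory_fill`, whose third parameter counts `uint32_t` elements rather than bytes; a sketch of the corrected call:

```
#include <hsa/hsa_ext_amd.h>

// Zero-initialize the device-side return value. The `count` parameter of
// hsa_amd_memory_fill is a number of uint32_t elements, NOT a byte count,
// so a single integer takes a count of 1 (not sizeof(int)).
hsa_status_t zero_return_value(void *dev_ret) {
  return hsa_amd_memory_fill(dev_ret, /*value=*/0, /*count=*/1);
}
```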
Joseph Huber
7fd9f0f4e0 [libc] Remove MAX_LANE_SIZE definition from the RPC server
This `MAX_LANE_SIZE` was a hack from the days when we used a single
instance of the server and had some GPU state handle it. Now that we
have everything templated this really shouldn't be used. This patch
removes its use and replaces it with template arguments.

Reviewed By: JonChesterfield

Differential Revision: https://reviews.llvm.org/D158633
2023-08-23 12:09:30 -05:00
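
A sketch of the shape of the change, with hypothetical names: the per-architecture wavefront width becomes a template parameter instead of a shared `MAX_LANE_SIZE` constant:

```
#include <cstdint>

// Hypothetical sketch: the server is parameterized on the number of SIMD
// lanes that share one RPC packet, rather than sizing every buffer to a
// global MAX_LANE_SIZE worst case.
template <uint32_t lane_size> struct Server {
  // One slot per lane; the type's size now matches the GPU it serves.
  uint64_t slots[lane_size];
  void handle() { /* ... process up to lane_size lanes ... */ }
};

// Instantiated per target: 32 lanes for NVPTX, 64 for GCN wavefronts.
Server<32> nvptx_server;
Server<64> amdgcn_server;
```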
Joseph Huber
c381a94753 [libc] Remove test RPC opcodes from the exported header
This patch does the noisy work of moving the test opcodes from the
exported interface to an interface that is only visible in `libc`. The
benefit of this is that we both test the exported RPC registration more
directly, and we do not need to give this interface to users.

I have decided to export any opcode that is not a "core" libc feature as
having its MSB set in the opcode. We can think of these as non-libc
"extensions".

Reviewed By: JonChesterfield

Differential Revision: https://reviews.llvm.org/D154848
2023-07-21 15:36:36 -05:00
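
A sketch of the convention with illustrative opcode names: anything with the most significant bit set is a non-libc "extension" opcode rather than a core libc feature:

```
#include <cstdint>

// Illustrative opcode layout: core libc opcodes occupy the low range,
// while test / vendor extensions set the most significant bit.
enum rpc_opcode_t : uint32_t {
  RPC_NOOP = 0, // core libc opcodes...
  RPC_WRITE_TO_STREAM = 1,
  RPC_EXIT = 2,

  RPC_EXTENSION_BIT = UINT32_C(1) << 31,
  RPC_TEST_INCREMENT = RPC_EXTENSION_BIT | 0, // internal-only test opcodes
  RPC_TEST_STREAM = RPC_EXTENSION_BIT | 1,
};

// Core vs. extension is a single bit test.
constexpr bool is_extension(uint32_t op) { return op & RPC_EXTENSION_BIT; }
```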
Jon Chesterfield
095e69404a [libc][amdgpu] Accept deadstripped clock_freq global
If the clock_freq symbol isn't used and has been dead-stripped, we don't
need to abort the loader; we can instead just not set it.

Reviewed By: jhuber6

Differential Revision: https://reviews.llvm.org/D155832
2023-07-20 14:23:08 +01:00
Jon Chesterfield
d483824fc8 [libc][amdgpu] Tolerate different install directories for hsa.h
HSA headers might be under an hsa/ directory or might not.
This scheme matches the one used by the openmp amdgpu plugin.

Reviewed By: jhuber6, jplehr

Differential Revision: https://reviews.llvm.org/D155812
2023-07-20 13:43:17 +01:00
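
The scheme in question can be expressed with `__has_include`, matching what the openmp amdgpu plugin does (a sketch):

```
// Tolerate both ROCm layouts: hsa.h may live under an hsa/ subdirectory
// (newer packages) or directly on the include path (older ones).
#if defined(__has_include) && __has_include(<hsa/hsa.h>)
#include <hsa/hsa.h>
#include <hsa/hsa_ext_amd.h>
#else
#include <hsa.h>
#include <hsa_ext_amd.h>
#endif
```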
Joseph Huber
5db39796bf [libc] Support timing information in libc tests
This patch adds the necessary support to provide timing information in
`libc` tests. This is useful for determining how much time each test
takes. We can also use this as a basis for providing more fine-grained
timing when implementing things on the GPU.

The main difficulty with this is the fact that the AMDGPU fixed-frequency
clock operates at an unknown rate. We need to read this on a per-card
basis from the driver and then copy it in. NVPTX on the
other hand has a fixed clock at a resolution of 1ns. I have also
increased the resolution of the print-outs as the majority of these are
below a millisecond for me.

Reviewed By: JonChesterfield

Differential Revision: https://reviews.llvm.org/D154446
2023-07-05 14:27:08 -05:00
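
The frequency handling reduces to a scale factor: AMDGPU ticks are converted using the per-card rate read from the driver, while NVPTX ticks are already nanoseconds. A sketch of the conversion (names illustrative):

```
#include <cstdint>

// Convert raw clock ticks to nanoseconds given the card's fixed clock
// frequency in Hz. On NVPTX the clock already runs at 1 GHz, so the
// conversion is the identity; on AMDGPU `freq_hz` comes from the driver.
uint64_t ticks_to_ns(uint64_t ticks, uint64_t freq_hz) {
  // Split the division to avoid overflowing 64 bits on long measurements.
  uint64_t seconds = ticks / freq_hz;
  uint64_t rem = ticks % freq_hz;
  return seconds * 1000000000u + rem * 1000000000u / freq_hz;
}
```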
Joseph Huber
719d77ed28 [libc] Begin implementing a library for the RPC server
This patch begins providing a generic static library that wraps around
the raw `rpc.h` interface. As discussed in the corresponding RFC,
https://discourse.llvm.org/t/rfc-libc-exporting-the-rpc-interface-for-the-gpu-libc/71030,
we want to begin exporting RPC services to external users. In order to
do this we decided not to expose the `rpc.h` header, instead wrapping
around its functionality. This is done with a C interface, as we make
heavy use of callbacks, and allows us to provide a predictable interface.

Reviewed By: JonChesterfield, sivachandra

Differential Revision: https://reviews.llvm.org/D147054
2023-06-15 11:02:23 -05:00
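
A sketch of what such a C interface can look like; the names and signatures here are illustrative, not the shipped header:

```
#include <stdint.h>

// Illustrative C-style server interface: opaque handles plus callbacks,
// so no C++ types from rpc.h leak to external users.
extern "C" {
typedef struct rpc_device_s { void *handle; } rpc_device_t;
typedef struct rpc_buffer_s { uint64_t data[8]; } rpc_buffer_t;

// User-supplied callback invoked once per lane holding an active request.
typedef void (*rpc_callback_t)(rpc_buffer_t *buffer, void *user_data);

// Register a handler for one opcode, then repeatedly service the queue.
int rpc_register_callback(rpc_device_t device, uint32_t opcode,
                          rpc_callback_t callback, void *user_data);
int rpc_handle_server(rpc_device_t device);
}
```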
Joseph Huber
a621308881 [libc] Implement basic malloc and free support on the GPU
This patch adds support for the `malloc` and `free` functions. These
currently aren't implemented in-tree, so we first add the interface
files.

This patch provides the most basic support for a true `malloc` and
`free` by using the RPC interface. This is functional, but in the future
we will want to implement a more intelligent system and primarily use
the RPC interface more as a `brk()` or `sbrk()` interface only called
when absolutely necessary. We will need to design an intelligent
allocator in the future.

The semantics of these memory allocations will need to be checked. I am
somewhat iffy on the details. I've heard that HSA can allocate
asynchronously, which seems to work with my tests at least. CUDA uses an
implicit synchronization scheme, so we need to use a stream explicitly
separate from the one launching the kernel or the default stream. I will
need to test the NVPTX case.

I would appreciate if anyone more experienced with the implementation details
here could chime in for the HSA and CUDA cases.

Reviewed By: sivachandra

Differential Revision: https://reviews.llvm.org/D151735
2023-06-05 17:56:53 -05:00
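
A sketch of how a device-side `malloc` can ride the RPC layer, with made-up opcode, types, and helpers standing in for the real client primitives: the GPU ships the size to the server, the host allocates device-accessible memory, and the pointer comes back over the same port:

```
#include <cstdint>

// Illustrative stand-ins for the real client primitives; RPC_MALLOC and
// these types are invented for the sketch.
struct Buffer { uint64_t data[8]; };
struct Port {
  Buffer scratch{};
  void send(void (*fill)(Buffer *)) { fill(&scratch); /* + handshake */ }
  void recv(void (*use)(Buffer *)) { /* wait for reply */ use(&scratch); }
};
inline Port open_port(uint32_t opcode) { return Port{}; }

static uint64_t request_size; // shared state for the captureless lambdas
static void *result_ptr;

void *gpu_malloc(uint64_t size) {
  Port port = open_port(/*RPC_MALLOC=*/0x80000002u);
  request_size = size;
  // Ship the requested size to the server...
  port.send([](Buffer *buf) { buf->data[0] = request_size; });
  // ...which calls a host allocator that returns device-accessible memory
  // (e.g. hsa_amd_memory_pool_allocate or cuMemAllocManaged) and replies.
  port.recv([](Buffer *buf) {
    result_ptr = reinterpret_cast<void *>(buf->data[0]);
  });
  return result_ptr;
}
```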
Joseph Huber
182e5acb11 [libc] Check the RPC server once again after the kernel exits
We support asynchronous sends; that means the kernel can issue a send and
then exit, as we do with the `EXIT` syscall. It's therefore possible for
the kernel to exit, and for the loop to break, before we check the server
again. This can potentially cause us to ignore an `EXIT` call from the
GPU.

Reviewed By: JonChesterfield, lntue

Differential Revision: https://reviews.llvm.org/D150456
2023-05-12 12:49:19 -05:00
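
The fix amounts to one more service pass after the kernel signals completion, so an `EXIT` sent just before kernel exit is still consumed (a sketch with illustrative names):

```
// Illustrative server loop: because sends are asynchronous, the kernel
// can finish after issuing EXIT but before we service it, so check once
// more after the completion signal fires.
void run_server(bool (*kernel_finished)(), void (*handle_server)()) {
  while (!kernel_finished())
    handle_server();
  // The kernel may have exited between our last check and the signal;
  // drain any outstanding work (e.g. a pending EXIT) exactly once more.
  handle_server();
}
```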
Joseph Huber
30093d6be2 [libc][obvious] Fix undefined variable after name change
I forgot that we still used these variables in the loaders.

Differential Revision: https://reviews.llvm.org/D150362
2023-05-11 09:00:08 -05:00
Jon Chesterfield
bbeae142bf [libc][rpc] Allocate a single block of shared memory instead of three
Allows moving the pointer swap between server and client into reset.
Single allocation simplifies whatever allocates the client/server, currently
the libc loaders.

Reviewed By: jhuber6

Differential Revision: https://reviews.llvm.org/D150337
2023-05-11 03:04:56 +01:00
Jon Chesterfield
f497611f43 [libc][rpc] Allocate locks array within process
Replaces the globals currently used. Worth changing to a bitmap before
allowing a runtime number of ports >> 64. One bit per port is likely
cheap enough that sizing for the worst case is always fine; otherwise, in
the future we can change to dynamically allocating it.

Reviewed By: jhuber6

Differential Revision: https://reviews.llvm.org/D150309
2023-05-11 00:41:51 +01:00
Joseph Huber
aea866c12c [libc] Support concurrent RPC port access on the GPU
Previously we used a single port to implement the RPC. This was
sufficient for single threaded tests but can potentially cause deadlocks
when using multiple threads. The reason for this is that GPUs make no
forward progress guarantees. Therefore one group of threads waiting on
another group of threads can spin forever because there is no guarantee
that the other threads will continue executing. The typical workaround
for this is to allocate enough memory that a sufficiently large number
of work groups can make progress. As long as this number is somewhat
close to the amount of total concurrency we can obtain reliable
execution around a shared resource.

This patch enables using multiple ports by widening the arrays to a
predetermined size and indexing into them. Empty ports are currently
obtained via a trivial linear scan. This should be improved in the
future for performance reasons. Portions of D148191 were applied to
achieve parallel support.

Depends on D149581

Reviewed By: JonChesterfield

Differential Revision: https://reviews.llvm.org/D149598
2023-05-05 10:12:19 -05:00
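
A sketch of the trivial linear scan with illustrative types: each port has a lock, and a client walks the array until a try-lock succeeds, so no group of threads waits on another without bound:

```
#include <atomic>
#include <cstdint>

// Illustrative port table: enough ports that roughly every concurrently
// resident work group can hold one without blocking the others.
constexpr uint32_t NUM_PORTS = 64;
std::atomic<uint32_t> port_locks[NUM_PORTS];

// Trivial linear scan for a free port; the commit notes this should be
// improved for performance later.
uint32_t open_port() {
  for (;;) {
    for (uint32_t i = 0; i < NUM_PORTS; ++i) {
      uint32_t expected = 0;
      if (port_locks[i].compare_exchange_weak(expected, 1,
                                              std::memory_order_acquire))
        return i; // acquired port i
    }
    // All ports busy; retry. Sizing near total concurrency makes this
    // terminate in practice.
  }
}

void close_port(uint32_t i) {
  port_locks[i].store(0, std::memory_order_release);
}
```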
Joseph Huber
901266dad3 [libc] Change GPU startup and loader to use multiple kernels
The GPU has a different execution model to standard `_start`
implementations. On the GPU, all threads are active at the start of a
kernel. In order to correctly initialize and call the constructors we
want single-threaded semantics. Previously, this was done using a
makeshift global barrier with atomics. However, it should be easier to
simply put the portions of the code that must be single threaded in
separate kernels and then call those with only one thread. Generally,
mixing global state between kernel launches makes optimizations more
difficult, similar to calling a function outside of the TU, but for
testing it is better to be correct.

Depends on D149527 D148943

Reviewed By: JonChesterfield

Differential Revision: https://reviews.llvm.org/D149581
2023-05-04 19:31:41 -05:00
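
A sketch of the launch sequence from the loader's point of view, with illustrative kernel names: single-threaded setup and teardown kernels bracket the wide main launch, replacing the makeshift atomic barrier:

```
#include <cstdint>

// Illustrative loader-side sequence. launch() stands in for building and
// dispatching an AQL packet with the given grid dimensions.
void launch(const char *kernel_name, uint32_t blocks, uint32_t threads) {
  /* ... build and ring the doorbell on a dispatch packet ... */
}

void run_program(uint32_t blocks, uint32_t threads) {
  // Constructors and other init need single-threaded semantics.
  launch("_begin", /*blocks=*/1, /*threads=*/1);
  // The actual program runs at full width.
  launch("_start", blocks, threads);
  // Destructors / atexit handlers, again with one thread.
  launch("_end", /*blocks=*/1, /*threads=*/1);
}
```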
Joseph Huber
507edb52f9 [libc] Enable multiple threads to use RPC on the GPU
The execution model of the GPU expects that groups of threads will
execute in lock-step in SIMD fashion. It's both important for
performance and correctness that we treat this as the smallest possible
granularity for an RPC operation. Thus, we map multiple threads to a
single larger buffer and ship that across the wire.

This patch makes the necessary changes to support executing the RPC on
the GPU with multiple threads. This requires some workarounds to mimic
the model when handling the protocol from the CPU. I'm not completely
happy with some of the workarounds required, but I think it should work.

Uses some of the implementation details from D148191.

Reviewed By: JonChesterfield

Differential Revision: https://reviews.llvm.org/D148943
2023-05-04 19:31:41 -05:00
Joseph Huber
d0ff5e4030 [libc] Update RPC interface for system utilities on the GPU
This patch reworks the RPC interface to allow more generic memory
operations using the shared buffer. It decomposes the entire RPC
interface into opening a port and calling `send` or `recv` on it.

The `send` function sends a single packet of the length of the buffer.
The `recv` function is paired with the `send` call to then use the data.
So, any arbitrary combination of sending packets is possible. The only
restriction is that the client initiates the exchange with a `send`
while the server consumes it with a `recv`.

The operation of this is driven by two independent state machines that
track the buffer ownership during loads / stores. We keep track of two so
that we can transition between a send state and a recv state without an
extra wait. State transitions are observed via bit toggling.

This interface supports an efficient `send -> ack -> send -> ack -> send`
pattern and allows the last send to be ignored without checking the ack.

A following patch will add some more comprehensive testing to this
interface. Informally, I made an RPC call that simply incremented an
integer, and it took roughly 10 microseconds to complete.

Reviewed By: jdoerfert

Differential Revision: https://reviews.llvm.org/D148288
2023-04-19 20:02:31 -05:00
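
A sketch of the decomposed interface in use, with illustrative types in place of the real `rpc.h` ones: the client initiates with `send`, the server pairs each `send` with a `recv`, and a trailing send needs no ack check:

```
#include <cstdint>

// Minimal illustrative types; the real rpc.h types differ.
struct Buffer { uint64_t data[8]; };
struct Port {
  Buffer scratch{};
  void send(void (*fill)(Buffer *)) { fill(&scratch); /* + handshake */ }
  void recv(void (*use)(Buffer *)) { /* wait for packet */ use(&scratch); }
};

// The client initiates the exchange; the server consumes it.
void client_side(Port &port) {
  port.send([](Buffer *buf) { buf->data[0] = 42; }); // one packet out
  port.recv([](Buffer *buf) { /* read the reply */ });
  // A final send may go unacknowledged: send -> ack -> send is legal.
  port.send([](Buffer *buf) { buf->data[0] = 0; });
}

void server_side(Port &port) {
  port.recv([](Buffer *buf) { /* use the request */ });
  port.send([](Buffer *buf) { buf->data[0] = 1; }); // the "ack"
  port.recv([](Buffer *buf) { /* last packet; no reply required */ });
}
```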
Joseph Huber
bc11bb3e26 [libc] Add the '--threads' and '--blocks' option to the GPU loaders
We will want to test the GPU `libc` with multiple threads in the future.
This patch adds the `--threads` and `--blocks` option to set the `x`
dimension of the kernel. Using CUDA terminology instead of OpenCL for
familiarity.

Depends on D148288 D148342

Reviewed By: jdoerfert, sivachandra, tra

Differential Revision: https://reviews.llvm.org/D148485
2023-04-19 08:01:58 -05:00
Joseph Huber
dfc162ad3f [libc] Free the GPU memory allocated in the device loaders
Summary:
This part was ignored and we just hoped that shutting down the runtime
freed these correctly. But it's best to be explicit and free the memory
we've allocated.
2023-04-03 11:55:32 -05:00
Joseph Huber
2bef46d2ad [libc] Add a loader utility for NVPTX architectures for testing
This patch adds a loader utility targeting the CUDA driver API to launch
NVPTX images called `nvptx_loader`. This takes a GPU image on the
command line and launches the `_start` kernel with the appropriate
arguments. The `_start` kernel is provided by the already implemented
`nvptx/start.cpp`. So, an application with a `main` function can be
compiled and run as follows.

```
clang++ --target=nvptx64-nvidia-cuda main.cpp crt1.o -march=sm_70 -o image
./nvptx_loader image args to kernel
```

This implementation is not tested and does not yet support RPC. This
requires further development to work around NVIDIA-specific limitations
in atomics and linking.

Reviewed By: jdoerfert

Differential Revision: https://reviews.llvm.org/D146681
2023-03-24 20:04:42 -05:00
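
A compressed sketch of the CUDA driver API path such a loader takes (error handling elided; the real loader also sets up arguments and, eventually, RPC):

```
#include <cuda.h>

// Minimal driver API flow: load a compiled NVPTX image and launch its
// _start kernel with one thread. Error checking is elided for brevity.
void launch_image(const void *image, void *kernel_args[]) {
  cuInit(0);
  CUdevice device;
  cuDeviceGet(&device, 0);
  CUcontext context;
  cuCtxCreate(&context, 0, device);

  CUmodule module;
  cuModuleLoadData(&module, image); // the image from the command line
  CUfunction start;
  cuModuleGetFunction(&start, module, "_start");

  // One block of one thread, no shared memory, default stream.
  cuLaunchKernel(start, /*grid=*/1, 1, 1, /*block=*/1, 1, 1,
                 /*sharedMemBytes=*/0, /*hStream=*/nullptr, kernel_args,
                 /*extra=*/nullptr);
  cuCtxSynchronize();
}
```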
Joseph Huber
6bd4d717d5 [libc] Add environment variables to GPU libc test for AMDGPU
This patch performs the same operation used to copy over the `argv` array
for the `envp` array. This allows the GPU tests to use environment
variables.

Reviewed By: sivachandra

Differential Revision: https://reviews.llvm.org/D146322
2023-03-20 13:16:58 -05:00
Joseph Huber
ae30ae23aa [libc][NFC] Add some missing comments to the RPC implementation
Summary:
These comments were accidentally dropped from the committed version. Add
them back in.
2023-03-20 09:30:12 -05:00
Joseph Huber
8e4f9b1fcb [libc] Add initial support for an RPC mechanism for the GPU
This patch adds initial support for an RPC client / server architecture.
The GPU is unable to provide several system utilities on its own, so in
order to implement features like printing or memory allocation we need
to be able to communicate with the executing process. This is done via a
buffer of "sharable" memory. That is, a buffer with a unified pointer
that both the client and server can use to communicate.

The implementation here is based off of Jon Chesterfield's minimal RPC
example. We use an `inbox` and an `outbox` to communicate whether there is
an RPC request and to signify when work is done.
We use a fixed-size buffer for the communication channel. This is fixed
size so that we can ensure that there is enough space for all
compute-units on the GPU to issue work to any of the ports. Right now
the implementation is single-threaded, so there is only a single buffer
that is not shared.

This implementation is still missing several features needed to be
complete, such as multi-threaded support and asynchronous calls.

Depends on D145912

Reviewed By: sivachandra

Differential Revision: https://reviews.llvm.org/D145913
2023-03-17 12:55:31 -05:00
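
A minimal sketch of the inbox/outbox idea for the single-threaded case (illustrative; the real implementation generalizes this to many ports and fences the buffer accesses):

```
#include <atomic>
#include <cstdint>

// One mailbox pair in shared memory. The client owns `outbox`, the
// server owns `inbox`; each side signals by toggling its own bit and
// waits for the other side's bit to match.
struct Channel {
  std::atomic<uint32_t> inbox{0};  // written by the server
  std::atomic<uint32_t> outbox{0}; // written by the client
  uint64_t buffer[8];              // fixed-size shared packet
};

void client_send(Channel &c, uint64_t value) {
  uint32_t out = c.outbox.load(std::memory_order_relaxed);
  c.buffer[0] = value;
  // Publish the buffer, then flip our bit to say "work available".
  c.outbox.store(out ^ 1, std::memory_order_release);
  // Wait until the server's bit matches ours: the request was consumed.
  while (c.inbox.load(std::memory_order_acquire) != (out ^ 1)) {
  }
}

void server_handle(Channel &c) {
  uint32_t in = c.inbox.load(std::memory_order_relaxed);
  // The client's bit differing from ours means a request is pending.
  if (c.outbox.load(std::memory_order_acquire) == in)
    return; // nothing to do yet
  /* ... process c.buffer ... */
  // Flip our bit to acknowledge; the client may now reuse the buffer.
  c.inbox.store(in ^ 1, std::memory_order_release);
}
```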
Joseph Huber
67d78e3c6f [libc] Add a loader utility for AMDHSA architectures for testing
This is the first attempt to get some testing support for GPUs in LLVM's
libc. We want to be able to compile for and call generic code while on
the device. This is difficult as most GPU applications also require the
support of large runtimes that may contain their own bugs (e.g. CUDA /
HIP / OpenMP / OpenCL / SYCL). The proposed solution is to provide a
"loader" utility that allows us to execute a "main" function on the GPU.

This patch implements a simple loader utility targeting the AMDHSA
runtime called `amdhsa_loader` that takes a GPU program as its first
argument. It will then attempt to load a predetermined `_start` kernel
inside that image and launch execution. The `_start` symbol is provided
by a `start` utility function that will be linked alongside the
application. Thus, this should allow us to run arbitrary code on the
user's GPU with the following steps for testing.

```
clang++ Start.cpp --target=amdgcn-amd-amdhsa -mcpu=<arch> -ffreestanding -nogpulib -nostdinc -nostdlib -c
clang++ Main.cpp --target=amdgcn-amd-amdhsa -mcpu=<arch> -nogpulib -nostdinc -nostdlib -c
clang++ Start.o Main.o --target=amdgcn-amd-amdhsa -o image
amdhsa_loader image <args, ...>
```

We determine the `-mcpu` value using the `amdgpu-arch` utility provided
either by `clang` or `rocm`. If `amdgpu-arch` isn't found or returns an
error, we shouldn't run the tests, as the machine does not have a valid
HSA-compatible GPU. Alternatively, we could make this utility in-source
to avoid the external dependency.

This patch provides a single test for this utility that simply checks
to see if we can compile an application containing a simple `main`
function and execute it.

The proposed solution in the future is to create an alternate
implementation of the LibcTest.cpp source that can be compiled and
launched using this utility. This approach should allow us to use the
same test sources as the other applications.

This is primarily a prototype; suggestions for how to better integrate
this with the existing LibC infrastructure would be greatly appreciated.
The loader code should also be cleaned up somewhat. An implementation
for NVPTX will need to be written as well.

Reviewed By: sivachandra, JonChesterfield

Differential Revision: https://reviews.llvm.org/D139839
2023-02-13 13:49:01 -06:00
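
For reference, a sketch of the first step such a loader performs: finding an HSA GPU agent to launch on (error handling elided):

```
#include <hsa/hsa.h>

// Find the first GPU agent on the system; hsa_iterate_agents walks every
// agent and stops when the callback returns a non-success status.
static hsa_status_t pick_gpu(hsa_agent_t agent, void *data) {
  hsa_device_type_t type;
  hsa_agent_get_info(agent, HSA_AGENT_INFO_DEVICE, &type);
  if (type == HSA_DEVICE_TYPE_GPU) {
    *static_cast<hsa_agent_t *>(data) = agent;
    return HSA_STATUS_INFO_BREAK; // found one; stop iterating
  }
  return HSA_STATUS_SUCCESS;
}

hsa_agent_t get_gpu_agent() {
  hsa_init();
  hsa_agent_t gpu{};
  hsa_iterate_agents(pick_gpu, &gpu);
  return gpu;
}
```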