I believe there are a number of somewhat separate concerns here, each offering a different amount of leverage from the existing code.
Firstly: the architecture-independent LLVM IR is certainly capable of representing arbitrary-bitwidth data, and the instructions that operate on it, in an architecture-independent way. The backend infrastructure is slightly less flexible: although it can also represent arbitrary data types, portions of it require the types used in machine code to be enumerated a priori. As long as you have a small number of machine types, things should still be OK and you can leverage most of the code-generation mechanisms. Some parts of the code-generation infrastructure do require power-of-two instruction sizes, but you might be able to hack around that relatively easily.
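To make the first point concrete: the middle-end IR accepts integer types of any bit width directly. The 13-bit width below is purely illustrative, not a real target's type:

```llvm
; Arbitrary-bitwidth arithmetic is legal architecture-independent IR;
; i13 here is an arbitrary illustrative width.
define i13 @add13(i13 %a, i13 %b) {
  %sum = add i13 %a, %b
  ret i13 %sum
}
```

The optimizer handles such types fine; the trouble only starts once they must be mapped onto enumerated machine value types during instruction selection.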
Secondly: what are the semantics of memory? Load and store operations in the LLVM middle-end IR are fundamentally tied to a byte-oriented model of memory, and many concepts (such as memory alignment) are described in terms of bytes. LLVM captures some of this in a ‘Data Layout’ description that expresses several concepts only at byte granularity, and many existing passes that deal with the layout of data in memory assume memory is made of bytes. Most of this code would probably have to be updated, or avoided for your target. If you can avoid re-interpreting memory as different data types, you might escape the worst of this aspect.
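As a small illustration of the byte assumption (again with a hypothetical 13-bit type): even though the value is 13 bits wide, its in-memory footprint rounds up to whole 8-bit bytes, and the `align` attribute on memory operations counts bytes:

```llvm
; A 13-bit value still occupies a whole number of 8-bit bytes in
; memory; 'align' is measured in bytes, not bits.
define i13 @get(ptr %p) {
  %v = load i13, ptr %p, align 2   ; store size rounds up to 2 bytes
  ret i13 %v
}
```

On a machine whose natural memory unit is not 8 bits, this rounding and the byte-denominated alignment are exactly where the model starts to pinch.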
Thirdly: some parts of LLVM assume that data stored in memory (pointers in particular) is not only byte-granular but also has power-of-two alignment. So if you store pointers in data memory, you’ll probably run into this. If you have a pure Harvard architecture, you might avoid it.
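As a hypothetical example of where this surfaces, the data-layout string’s pointer specification bakes in both assumptions: the pointer size and ABI alignment are stated in bits, and the alignment is expected to be a power of two (of bytes). The 24-bit pointer width below is invented for illustration:

```llvm
; Hypothetical target description with 24-bit pointers.
; p:<size>:<abi-align> -- the alignment (here 32 bits = 4 bytes) must
; be a power of two; a "24-bit-aligned" pointer is not expressible.
target datalayout = "e-p:24:32"
```

So even if your pointers are an odd width, the infrastructure will still force them to live at power-of-two byte alignments in data memory.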
My best guess is that, while not impossible, this gets relatively little leverage from a lot of the LLVM infrastructure and will likely require some fairly invasive patches/hacks to LLVM to make it work well.