I think you would need the ESP32 to connect to another host. Doing speech-to-text, an LLM, and text-to-speech is pretty intensive, even if the host is a Raspberry Pi.
But totally possible! It's a great idea and I would love to help you build it :)
Wire some open-source pieces together and just start with a small collection of Ogg files.
Maybe it would be possible to propose, as a challenge, an ESP32 project that plays music based on what is stored on the memory card (completely offline)?
Voice recognition can already run offline, but there isn't yet anything that can find relevant music and play it offline.
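To make the "find relevant music" step a bit more concrete, here is a rough sketch of one way it could work, assuming the offline recognizer hands you a plain-text phrase and the card holds Ogg files with descriptive names. The function names (`tokens`, `best_match`) and the file names are made up for illustration, and this is plain Python for readability; on the ESP32 itself you would port the same idea to C++ or MicroPython.

```python
import re
from pathlib import PurePosixPath


def tokens(text):
    """Split text into a set of lowercase alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def best_match(transcript, filenames):
    """Return the .ogg file whose name shares the most words with the
    recognized phrase, or None if no file shares any word.

    Assumption: file names describe the track, e.g. artist-title.ogg.
    """
    query = tokens(transcript)
    best, best_score = None, 0
    for name in filenames:
        if not name.lower().endswith(".ogg"):
            continue  # ignore anything on the card that is not an Ogg file
        score = len(query & tokens(PurePosixPath(name).stem))
        if score > best_score:
            best, best_score = name, score
    return best


# Hypothetical card contents and recognized phrase
card = ["new_order-blue_monday.ogg", "daft_punk-around_the_world.ogg", "notes.txt"]
print(best_match("play blue monday", card))
```

Word overlap on file names is about the simplest thing that could work; reading the Ogg tags or doing fuzzy matching would be natural next steps once the basic loop plays something.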