I would like to be able to send and receive audio signals, receive MIDI and have a serial connection running at the same time. I know that the Teensy boards (Cortex M4) can do it. What is required to have an audio, MIDI and serial class device? Is the process of setting up the USB device the same for all Cortex M microcontrollers?
It does not depend on the microcontroller. The USB controller on the chip gives you the physical transfer capability; everything above that is your firmware. The host sends you requests and you need to reply with all sorts of descriptors. Those tell the host what kind of things your gizmo implements, with all the features and capabilities. The host then sends you a bunch of configuration messages, and after that you are ready to transfer data on your device(s).
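To make the descriptor idea concrete, here is a minimal sketch of the very first thing the host asks for, the device descriptor. The field layout is the standard 18-byte one from the USB 2.0 specification, and the class triple 0xEF/0x02/0x01 is the usual way to announce a composite device whose functions are grouped with interface association descriptors. The vendor and product IDs, the string indices and the helper `device_descriptor_bytes()` are made up for illustration, and the `packed` attribute assumes GCC or Clang on a little-endian target such as a Cortex-M. The configuration descriptor with the actual audio, MIDI and serial interfaces is much longer and is not shown.

```c
#include <assert.h>
#include <stdint.h>

/* Standard USB device descriptor layout (18 bytes on the wire). */
typedef struct __attribute__((packed)) {
    uint8_t  bLength;            /* size of this descriptor in bytes */
    uint8_t  bDescriptorType;    /* 0x01 = DEVICE */
    uint16_t bcdUSB;             /* USB spec release, BCD (0x0200 = 2.0) */
    uint8_t  bDeviceClass;
    uint8_t  bDeviceSubClass;
    uint8_t  bDeviceProtocol;
    uint8_t  bMaxPacketSize0;    /* max packet size of the control end-point */
    uint16_t idVendor;
    uint16_t idProduct;
    uint16_t bcdDevice;          /* device release number, BCD */
    uint8_t  iManufacturer;      /* string descriptor indices */
    uint8_t  iProduct;
    uint8_t  iSerialNumber;
    uint8_t  bNumConfigurations;
} usb_device_descriptor_t;

static_assert(sizeof(usb_device_descriptor_t) == 18,
              "device descriptor must be exactly 18 bytes");

static const usb_device_descriptor_t device_descriptor = {
    .bLength            = sizeof(usb_device_descriptor_t),
    .bDescriptorType    = 0x01,
    .bcdUSB             = 0x0200,
    .bDeviceClass       = 0xEF,   /* Miscellaneous: composite device ... */
    .bDeviceSubClass    = 0x02,   /* ... Common Class ... */
    .bDeviceProtocol    = 0x01,   /* ... using interface association descriptors */
    .bMaxPacketSize0    = 64,
    .idVendor           = 0xCAFE, /* placeholder, NOT a real assigned VID */
    .idProduct          = 0x4001, /* placeholder */
    .bcdDevice          = 0x0100,
    .iManufacturer      = 1,      /* assumed string indices */
    .iProduct           = 2,
    .iSerialNumber      = 3,
    .bNumConfigurations = 1,
};

/* Hypothetical helper: the raw bytes your GET_DESCRIPTOR(DEVICE) handler
   would copy into the control end-point buffer. */
const uint8_t *device_descriptor_bytes(void)
{
    return (const uint8_t *)&device_descriptor;
}
```

Your control end-point handler would send these 18 bytes when the host issues the standard GET_DESCRIPTOR request for the device descriptor; how the bytes actually get into the controller's buffer depends on the driver for your particular micro.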
The only limit on the number of devices you can implement on the micro is the number of available end-points on the USB controller. You have the control end-point, and then each transfer direction for each device needs an end-point. If you want double-buffering, on many controllers that consumes two end-points instead of one. For audio you probably want that; for serial, not necessarily; for MIDI, I don't really know.
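The end-point arithmetic can be sketched as a small budget check. Everything here is an illustrative assumption rather than data for any real chip: the per-function end-point counts are typical guesses (a serial CDC function is assumed to use one extra IN end-point for notifications), double-buffering is modelled as costing two slots per end-point as described above, and the names `usb_function_t`, `endpoint_slots_needed()` and `fits_on_controller()` are invented for this example.

```c
#include <stdbool.h>
#include <stddef.h>

/* One USB function (audio, MIDI, serial, ...) and the data end-points it
   needs, not counting the shared control end-point. */
typedef struct {
    const char *name;
    unsigned    in_eps;          /* device-to-host end-points */
    unsigned    out_eps;         /* host-to-device end-points */
    bool        double_buffered; /* assumed to cost two slots per end-point */
} usb_function_t;

/* Assumed end-point needs for the three functions in the question. */
const usb_function_t example_functions[] = {
    { "audio",  1, 1, true  },   /* one stream in each direction */
    { "midi",   1, 1, false },
    { "serial", 2, 1, false },   /* data in/out plus a notification IN */
};

/* Total end-point slots required, excluding the control end-point. */
unsigned endpoint_slots_needed(const usb_function_t *funcs, size_t count)
{
    unsigned total = 0;
    for (size_t i = 0; i < count; i++) {
        unsigned eps = funcs[i].in_eps + funcs[i].out_eps;
        total += funcs[i].double_buffered ? 2u * eps : eps;
    }
    return total;
}

/* True if the functions fit into the slots the controller offers
   besides the control end-point. */
bool fits_on_controller(const usb_function_t *funcs, size_t count,
                        unsigned available_slots)
{
    return endpoint_slots_needed(funcs, count) <= available_slots;
}
```

With these assumed numbers the three functions need 9 slots, so a controller offering 7 non-control slots would be too small while one offering 15 would do. Real controllers often count IN and OUT end-points separately and handle double-buffering differently, so check the reference manual before trusting a tally like this.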
But the important thing is that defining the USB device(s) is your task; the micro's USB controller only supplies the conduit between the host's USB drivers and your firmware. The USB controllers on the various Cortex-Mx micros vary widely, so you need to write a separate driver for each different kind.
So either you find some ready-made firmware framework for your specific micro that has all the capabilities you need, or you need to roll up your sleeves and try to read the USB standards, swear loudly, madly look for some explanations on the 'Net (there are many quite good ones), then read the standards again (now they will make sense) and start writing code.
It is a bit like wanting to see your gizmo from a web browser. The microcontroller gives you the Ethernet controller that can send and receive Ethernet frames. You need to write a driver for that, but that's not enough. You will need to at least partially implement the IP, ARP, TCP and HTTP protocols in firmware before your device can send a "Hello, World!" message that pops up in your browser.