They could be using roop or IP-Adapter with inpainting, or both. Perhaps the hardest part is matching the skin tone of the swapped face to the rest of the image. My guess is a global generation pass with something like IP-Adapter, followed by swapping the face in with roop.
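For the skin-tone matching step, one simple approach (purely my guess — I don't know what they actually use) is a Reinhard-style color transfer: shift the per-channel mean and standard deviation of the swapped face crop toward the statistics of the surrounding skin region. A minimal sketch with NumPy, where `src` is the pasted face crop and `ref` is a patch of nearby skin:

```python
import numpy as np

def match_color(src: np.ndarray, ref: np.ndarray) -> np.ndarray:
    """Shift src's per-channel mean/std toward ref's (Reinhard-style
    color transfer). A common trick for blending a swapped face into
    the surrounding skin tone; assumed here, not confirmed."""
    src_f = src.astype(np.float64)
    ref_f = ref.astype(np.float64)
    out = np.empty_like(src_f)
    for c in range(src_f.shape[-1]):
        s_mean, s_std = src_f[..., c].mean(), src_f[..., c].std()
        r_mean, r_std = ref_f[..., c].mean(), ref_f[..., c].std()
        # Guard against a flat channel (zero std) to avoid divide-by-zero.
        scale = r_std / s_std if s_std > 1e-6 else 1.0
        out[..., c] = (src_f[..., c] - s_mean) * scale + r_mean
    return np.clip(out, 0, 255).astype(np.uint8)
```

In practice this would be done on the face region only (inside the roop mask), often in LAB space rather than RGB for better perceptual results.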
I'm curious to try reproducing it in A1111 if you can find a sample set of inputs and outputs.