Generative AI Pushes the Limits of Pose Generation


AI is developing quickly in digital art, pushing the limits of creativity. Pose generation is a captivating tool that uses body language to express emotions and capture the spirit of a scenario. This blog series examines a generative AI model's ability to generate poses for different situations while evaluating its effectiveness in image analysis, camera positioning, and aesthetics.

This requires machine learning systems that understand the general kinematic relationships between the various parts of the human body. Over the past three decades, paradigms such as inverse kinematics have emerged in the older field of CGI-based synthesis to give traditional VFX practitioners virtual human beings.
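To ground that idea, here is a toy two-bone inverse kinematics solver of the kind those older CGI pipelines rely on. It is purely illustrative and is not drawn from the work discussed in this article.

```python
import math

def two_bone_ik(target_x, target_y, l1, l2):
    """Analytic 2D inverse kinematics for a two-bone chain, e.g. shoulder-elbow-wrist.

    Returns the shoulder and elbow angles (in radians) that place the end of the
    chain at the target, clamping targets that lie outside the reachable range."""
    dist = min(math.hypot(target_x, target_y), l1 + l2 - 1e-6)
    # Law of cosines gives the elbow bend needed to span the distance to the target
    cos_elbow = (dist ** 2 - l1 ** 2 - l2 ** 2) / (2 * l1 * l2)
    elbow = math.acos(max(-1.0, min(1.0, cos_elbow)))
    # Shoulder angle: direction to the target minus the offset caused by the bent elbow
    shoulder = math.atan2(target_y, target_x) - math.atan2(
        l2 * math.sin(elbow), l1 + l2 * math.cos(elbow)
    )
    return shoulder, elbow

print(two_bone_ik(1.2, 0.5, 1.0, 1.0))  # angles that pose a two-segment "arm"
```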

Previous Restrictions

In terms of generative AI, creating "fake people" who look natural and can walk around and interact with their surroundings requires machine systems that have been trained on large video datasets and can generalize across many types of human motion. We refer to these learned generalizations of movement as motion priors.

However, as users of this module will know, OpenPose frequently has trouble interpreting complex poses. The trained model may not have learned enough about the requested pose, and if the user is applying a LoRA, a DreamBooth model, or some other customization/personalization technique meant to render a specific character, the generalized knowledge in base Stable Diffusion may not be able to fill the gap.
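For readers who have not used it, a typical OpenPose-conditioned generation run through the Hugging Face diffusers library looks roughly like the sketch below. This is the standard ControlNet route rather than anything specific to the work covered here, and the input file name is a placeholder.

```python
# pip install diffusers transformers accelerate controlnet_aux
import torch
from controlnet_aux import OpenposeDetector
from diffusers import ControlNetModel, StableDiffusionControlNetPipeline
from diffusers.utils import load_image

# Extract an OpenPose skeleton from a reference photo (file name is a placeholder)
openpose = OpenposeDetector.from_pretrained("lllyasviel/ControlNet")
pose_image = openpose(load_image("reference_photo.png"))

# Condition Stable Diffusion V1.5 on the extracted pose
controlnet = ControlNetModel.from_pretrained(
    "lllyasviel/sd-controlnet-openpose", torch_dtype=torch.float16
)
pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", controlnet=controlnet, torch_dtype=torch.float16
).to("cuda")

image = pipe(
    "a dancer on a dimly lit stage", image=pose_image, num_inference_steps=25
).images[0]
image.save("posed_output.png")
```

When the requested pose is unusual or heavily occluded, this is exactly the point at which the skeleton conditioning tends to be ignored or misread.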

Some Methods

The base V1.5 Stable Diffusion model is frozen during training, preventing any loss in rendering quality and allowing Stable-Pose to function as an ancillary framework in a typical installation. The trainable ViT module that Stable-Pose adds operates on an input "skeleton" image.
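A rough PyTorch sketch of that arrangement is shown below: the base network is kept frozen and only a small ViT-style adapter receives gradient updates. The feature dimension, head count, and the Identity placeholder standing in for the pretrained U-Net are assumptions for illustration; this is not the actual Stable-Pose code.

```python
import torch
import torch.nn as nn

class PoseViTAdapter(nn.Module):
    """Hypothetical stand-in for the trainable ViT module: it patchifies the
    skeleton image and emits tokens that would condition the frozen U-Net."""
    def __init__(self, patch_size=2, depth=2, dim=320):
        super().__init__()
        self.patch_embed = nn.Conv2d(3, dim, kernel_size=patch_size, stride=patch_size)
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)

    def forward(self, skeleton):                    # skeleton: (B, 3, H, W)
        tokens = self.patch_embed(skeleton)         # (B, dim, H/p, W/p)
        tokens = tokens.flatten(2).transpose(1, 2)  # (B, N, dim)
        return self.encoder(tokens)

base_unet = nn.Identity()         # placeholder for the pretrained SD V1.5 U-Net
for p in base_unet.parameters():  # freeze the base model so it is never updated
    p.requires_grad_(False)

adapter = PoseViTAdapter()        # only the adapter's weights are trained
optimizer = torch.optim.Adam(adapter.parameters(), lr=1e-5)
```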

As seen in the schema above, the diffusion process and the denoising U-Net, to the left of the image, stay frozen. In the centre-top, as the system iterates through the observed and inferred armature joints, the extracted pose and the text component of the prompt are passed through the supplementary ViT, where masked attention is computed.
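That masked-attention step can be illustrated with the simplified sketch below, which restricts attention to the patches the skeleton actually covers. This is my own toy rendering of the idea, not the authors' exact formulation.

```python
import torch
import torch.nn.functional as F

def pose_masked_attention(q, k, v, pose_mask):
    """Toy pose-masked attention: tokens may only attend to patches covered by
    the skeleton. q, k, v are (B, N, D) ViT tokens; pose_mask is (B, N) boolean."""
    scores = q @ k.transpose(-2, -1) / q.size(-1) ** 0.5   # (B, N, N)
    scores = scores.masked_fill(~pose_mask[:, None, :], float("-inf"))
    return F.softmax(scores, dim=-1) @ v

# 64 patch tokens of dimension 32; pretend the skeleton covers the first 10 patches
B, N, D = 1, 64, 32
q = k = v = torch.randn(B, N, D)
mask = torch.zeros(B, N, dtype=torch.bool)
mask[:, :10] = True
print(pose_masked_attention(q, k, v, mask).shape)  # torch.Size([1, 64, 32])
```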

Data

Stable-Pose and the previous architectures were tested against five high-volume, human-focused datasets: the Canadian/Italian collaboration UBC Fashion; the 2022 Hong Kong-led collaboration DanceTrack; the 2016 ETH/Disney collaboration, the DAVIS dataset; the 2023 Chinese collection Human-Art; and LAION-Human, also known as Human-SD, a subset of LAION that presumably has some advantage since it likely contains data already trained into the base Stable Diffusion model.

Settings and Prior Frameworks

The researchers used the Adam optimizer for training, with a learning rate of 1×10⁻⁵. They also used a depth of 2 and a patch size of 2 for the PMSA ViT module, and the kernel sizes of the two associated sequential Gaussian filters were 13 and 23, respectively.
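Collected as a hypothetical configuration, and with torchvision's Gaussian blur standing in for whatever filtering code the authors actually use, those settings look something like this; how the smoothed pose maps are consumed downstream is an assumption on my part.

```python
import torch
from torchvision.transforms import GaussianBlur

# Hyper-parameters reported for Stable-Pose (the dict keys/names are mine)
config = {
    "optimizer": "Adam",
    "learning_rate": 1e-5,
    "vit_depth": 2,
    "vit_patch_size": 2,
    "gaussian_kernels": (13, 23),
}

# Two sequential Gaussian filters with the reported kernel sizes, applied to a
# dummy pose heat-map; their exact role in the pipeline is not spelled out here.
blur_small, blur_large = GaussianBlur(13), GaussianBlur(23)
pose_map = torch.rand(1, 1, 64, 64)
smoothed = blur_large(blur_small(pose_map))
print(smoothed.shape)  # torch.Size([1, 1, 64, 64])
```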

All methods were trained for ten epochs on the Human-Art dataset, while the competing networks Uni-ControlNet, GLIGEN, and Human-SD were trained for ten epochs on the LAION-Human dataset. Base Stable Diffusion (SD) V1.5 and T2I-Adapter were two further competing networks tested.