
Cross-Modal Contrastive Learning for Text-to-Image Generation based on Machine Learning Models

  • Publication Date:
    October 31, 2024
  • Additional Information
    • Document Number:
      20240362830
    • Appl. No:
      18/770154
    • Application Filed:
      July 11, 2024
    • Abstract:
      A computer-implemented method includes receiving, by a computing device, a particular textual description of a scene. The method also includes applying a neural network for text-to-image generation to generate an output image rendition of the scene, the neural network having been trained to cause two image renditions associated with a same textual description to attract each other and two image renditions associated with different textual descriptions to repel each other based on mutual information between a plurality of corresponding pairs, wherein the plurality of corresponding pairs comprise an image-to-image pair and a text-to-image pair. The method further includes predicting the output image rendition of the scene.
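      (Illustrative, non-authoritative code sketches of the contrastive training and generation mechanisms described here appear at the end of this record.)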
    • Claim:
      1. A computer-implemented method, comprising: receiving, by a computing device, training data comprising a plurality of textual descriptions, and one or more real image renditions associated with each of the plurality of textual descriptions; training a cross-modal contrastive generative adversarial network (GAN) for text-to-image generation based on the training data, wherein the GAN comprises: a generator comprising one or more attentional self-modulation layers to generate one or more generated image renditions associated with each of the plurality of textual descriptions, a contrastive discriminator to determine whether a given image is a real image rendition of the one or more real image renditions or a generated image rendition of the one or more generated image renditions, and wherein the training is based on a plurality of contrastive losses to capture inter-modality and intra-modality correspondences; and providing the trained GAN for text-to-image generation.
    • Claim:
      2. The computer-implemented method of claim 1, wherein the generator comprises a single-stage generator configured to generate, at a given resolution, a given generated image rendition of the one or more generated image renditions.
    • Claim:
      3. The computer-implemented method of claim 1, wherein the generator comprises a single-stage generator without object-level annotation in images.
    • Claim:
      4. The computer-implemented method of claim 1, wherein the contrastive discriminator is trained as an encoder to compute global image and region features for the plurality of contrastive losses.
    • Claim:
      5. The computer-implemented method of claim 1, wherein the plurality of contrastive losses is based on normalized temperature-scaled cross-entropy losses.
    • Claim:
      6. The computer-implemented method of claim 1, wherein the training of the GAN comprises causing two image renditions associated with a same textual description to attract each other and two image renditions associated with different textual descriptions to repel each other based on mutual information between a plurality of corresponding pairs, wherein the plurality of corresponding pairs comprise an image-to-image pair and a text-to-image pair.
    • Claim:
      7. The computer-implemented method of claim 6, wherein the text-to-image pair comprises an image and an associated textual description.
    • Claim:
      8. The computer-implemented method of claim 6, wherein the text-to-image pair comprises portions of an image and corresponding portions of an associated textual description.
    • Claim:
      9. The computer-implemented method of claim 6, wherein the mutual information is based on a contrastive loss between: (a) an image and an associated textual description, (b) a known image and a predicted image for a same associated textual description, and (c) portions of an image and corresponding portions of an associated textual description.
    • Claim:
      10. The computer-implemented method of claim 6, wherein the training of the GAN to cause two image renditions associated with a same textual description to attract each other and two image renditions associated with different textual descriptions to repel each other further comprises: determining similarity measures between pairs of image renditions, and wherein the training of the GAN comprises: causing a first similarity measure for two image renditions associated with the same textual description to be less than a first threshold value, and causing a second similarity measure for two image renditions associated with different textual descriptions to be greater than a second threshold value.
    • Claim:
      11. The computer-implemented method of claim 6, wherein the training of the GAN comprises generating one or more object level pseudo-labels for an image based on the text-to-image pair.
    • Claim:
      12. The computer-implemented method of claim 1, wherein the contrastive discriminator generates a local feature representation for an image, and wherein a dimension of the local feature representation matches a dimension for a local feature representation of an associated textual description.
    • Claim:
      13. A computer-implemented method, comprising: receiving, by a computing device, a particular textual description of a scene; applying a cross-modal contrastive generative adversarial network (GAN) for text-to-image generation to generate an output image rendition of the scene, wherein the GAN comprises: a generator comprising one or more attentional self-modulation layers to generate one or more generated image renditions associated with the particular textual description, a contrastive discriminator to determine whether a given image is a real image rendition or a generated image rendition, and the GAN having been trained based on a plurality of contrastive losses to capture inter-modality and intra-modality correspondences; and predicting the output image rendition of the scene.
    • Claim:
      14. The computer-implemented method of claim 13, further comprising: obtaining, from a deep bidirectional transformer, a global feature embedding for the particular textual description, and a local feature embedding for a portion of the particular textual description.
    • Claim:
      15. The computer-implemented method of claim 13, wherein the scene describes virtual reality or augmented reality, and wherein the predicting of the output image rendition further comprising: generating an image rendition of the scene as described, in a format suitable for virtual reality or augmented reality.
    • Claim:
      16. The computer-implemented method of claim 13, further comprising: receiving, by the computing device, an image description in audio format, and wherein the particular textual description is a transcribed version of the audio format.
    • Claim:
      17. The computer-implemented method of claim 16, further comprising: receiving, by the computing device, an image style for the image description, and wherein the predicting of the output image rendition comprises generating the output image rendition to conform to the image style.
    • Claim:
      18. The computer-implemented method of claim 13, wherein the particular textual description describes a plurality of scenes, and the predicting of the output image rendition further comprising: generating a plurality of video frames of video content corresponding to the respective plurality of scenes.
    • Claim:
      19. The computer-implemented method of claim 13, wherein the generator comprises a single-stage generator configured to generate, at a given resolution, a given generated image rendition of the one or more generated image renditions.
    • Claim:
      20. A computing device, comprising: one or more processors; and data storage, wherein the data storage has stored thereon computer-executable instructions that, when executed by the one or more processors, cause the computing device to carry out operations comprising: receiving, by the computing device, a particular textual description of a scene; applying a cross-modal contrastive generative adversarial network (GAN) for text-to-image generation to generate an output image rendition of the scene, wherein the GAN comprises: a generator comprising one or more attentional self-modulation layers to generate one or more generated image renditions associated with the particular textual description, a contrastive discriminator to determine whether a given image is a real image rendition or a generated image rendition, and the GAN having been trained based on a plurality of contrastive losses to capture inter-modality and intra-modality correspondences; and predicting the output image rendition of the scene.
    • Current International Class:
      06; 06; 06; 06; 10
    • Identifier:
      edspap.20240362830
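Claim 5 states that the contrastive losses are normalized temperature-scaled cross-entropy (NT-Xent) losses, and claims 6 and 10 describe attracting image renditions that share a textual description while repelling renditions of different descriptions. The following is a minimal sketch of an NT-Xent loss in PyTorch; the function name, temperature value, and batch layout are assumptions for illustration, not details taken from the patent.

```python
import torch
import torch.nn.functional as F

def nt_xent_loss(features_a: torch.Tensor,
                 features_b: torch.Tensor,
                 temperature: float = 0.1) -> torch.Tensor:
    """Normalized temperature-scaled cross-entropy (NT-Xent) loss.

    features_a[i] and features_b[i] embed two renditions that share the same
    textual description (e.g. a real and a generated image); every other row
    in the batch is treated as a negative to be repelled.
    """
    # Cosine similarities via L2-normalized dot products.
    a = F.normalize(features_a, dim=-1)
    b = F.normalize(features_b, dim=-1)
    logits = a @ b.t() / temperature                    # (batch, batch)
    targets = torch.arange(a.size(0), device=a.device)
    # Diagonal (matching) pairs are attracted; off-diagonal pairs are repelled.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

# Usage: embeddings of real and generated renditions of the same captions.
real = torch.randn(8, 256)
fake = torch.randn(8, 256)
loss = nt_xent_loss(real, fake)
```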
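Claim 9 bases the mutual information on contrastive losses between (a) an image and its textual description, (b) a real and a predicted image for the same description, and (c) image regions and the corresponding portions of the description. Below is a hedged sketch of how those three terms could be combined into one training objective; the feature shapes, the attention-style region-word pooling, and the equal loss weights are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def nt_xent(a, b, t=0.1):
    a, b = F.normalize(a, dim=-1), F.normalize(b, dim=-1)
    logits = a @ b.t() / t
    y = torch.arange(a.size(0), device=a.device)
    return 0.5 * (F.cross_entropy(logits, y) + F.cross_entropy(logits.t(), y))

def region_word_score(regions, words, t=0.1):
    """Pool region-word similarities into one score per (image, sentence) pair.

    regions: (B, R, d) region features; words: (B, W, d) word features.
    """
    r = F.normalize(regions, dim=-1)
    w = F.normalize(words, dim=-1)
    sim = torch.einsum('ird,jwd->ijrw', r, w)     # pairwise region-word similarities
    attn = sim.softmax(dim=2)                     # each word attends over regions
    word_scores = (attn * sim).sum(dim=2)         # (B_img, B_txt, W)
    return word_scores.mean(dim=-1) / t           # (B_img, B_txt)

def total_contrastive_loss(real_img_feat, fake_img_feat, sent_feat,
                           region_feat, word_feat):
    l_sent_img = nt_xent(fake_img_feat, sent_feat)      # (a) image <-> description
    l_img_img = nt_xent(real_img_feat, fake_img_feat)   # (b) real <-> generated image
    logits = region_word_score(region_feat, word_feat)  # (c) regions <-> words
    y = torch.arange(logits.size(0), device=logits.device)
    l_region_word = 0.5 * (F.cross_entropy(logits, y) +
                           F.cross_entropy(logits.t(), y))
    return l_sent_img + l_img_img + l_region_word
```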
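Claims 1 and 13 describe a generator built from "attentional self-modulation layers." The sketch below shows one plausible realization under stated assumptions: word embeddings are attended using the layer's pooled features as a query, concatenated with a global condition vector, and used to predict per-channel scale and shift parameters for an unaffine batch norm. The module names and dimensions are assumptions for illustration, not the patent's implementation.

```python
import torch
import torch.nn as nn

class AttentionalSelfModulation(nn.Module):
    """Sketch of a text-conditioned self-modulation layer for a GAN generator."""

    def __init__(self, channels: int, word_dim: int, cond_dim: int):
        super().__init__()
        self.norm = nn.BatchNorm2d(channels, affine=False)
        self.query = nn.Linear(channels, word_dim)          # feature -> word query space
        self.to_gamma = nn.Linear(cond_dim + word_dim, channels)
        self.to_beta = nn.Linear(cond_dim + word_dim, channels)

    def forward(self, x, words, cond):
        # x: (B, C, H, W) feature map, words: (B, W, word_dim), cond: (B, cond_dim)
        b, c, h, w = x.shape
        feat = x.mean(dim=(2, 3))                            # pooled features (B, C)
        # Attend over word embeddings with the pooled feature as the query.
        attn = torch.softmax(self.query(feat).unsqueeze(1) @ words.transpose(1, 2), dim=-1)
        context = (attn @ words).squeeze(1)                  # (B, word_dim)
        mod_in = torch.cat([cond, context], dim=-1)
        gamma = self.to_gamma(mod_in).view(b, c, 1, 1)
        beta = self.to_beta(mod_in).view(b, c, 1, 1)
        return (1 + gamma) * self.norm(x) + beta

# Usage with hypothetical sizes:
# layer = AttentionalSelfModulation(channels=256, word_dim=128, cond_dim=128)
# y = layer(torch.randn(2, 256, 16, 16), torch.randn(2, 12, 128), torch.randn(2, 128))
```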
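Claim 4 trains the contrastive discriminator as an encoder that computes global image and region features, and claim 12 requires the local (region) feature dimension to match the local feature dimension of the associated textual description. Below is a hedged sketch of such a discriminator head; the tiny backbone, projection sizes, and real/fake head are illustrative assumptions, not the patented architecture.

```python
import torch
import torch.nn as nn

class ContrastiveDiscriminator(nn.Module):
    """Sketch: a discriminator that doubles as an encoder for contrastive losses."""

    def __init__(self, text_dim: int = 256):
        super().__init__()
        # Small convolutional backbone standing in for the real feature extractor.
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 64, 4, stride=2, padding=1), nn.LeakyReLU(0.2),
            nn.Conv2d(64, 128, 4, stride=2, padding=1), nn.LeakyReLU(0.2),
            nn.Conv2d(128, 256, 4, stride=2, padding=1), nn.LeakyReLU(0.2),
        )
        # Region features are projected to the word-feature dimension (claim 12)
        # so region-word similarities are well defined.
        self.region_proj = nn.Conv2d(256, text_dim, kernel_size=1)
        self.global_proj = nn.Linear(256, text_dim)
        self.real_fake_head = nn.Linear(256, 1)   # standard adversarial output

    def forward(self, image):
        fmap = self.backbone(image)                      # (B, 256, H', W')
        regions = self.region_proj(fmap)                 # (B, text_dim, H', W')
        regions = regions.flatten(2).transpose(1, 2)     # (B, H'*W', text_dim)
        pooled = fmap.mean(dim=(2, 3))                   # (B, 256)
        global_feat = self.global_proj(pooled)           # (B, text_dim)
        real_fake_logit = self.real_fake_head(pooled)    # (B, 1)
        return real_fake_logit, global_feat, regions
```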
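Claim 14 obtains, from a deep bidirectional transformer, a global feature embedding for the whole textual description and local feature embeddings for portions of it. One common way to realize this is with a pretrained BERT encoder, taking the [CLS] hidden state as the global embedding and the per-token hidden states as the local embeddings; the use of the Hugging Face transformers package and the bert-base-uncased checkpoint here is an assumption for illustration, not the patent's stated tooling.

```python
import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")

caption = "a small red bird with a short beak perched on a branch"
inputs = tokenizer(caption, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# Global feature embedding for the whole textual description.
global_embedding = outputs.last_hidden_state[:, 0]      # [CLS] token, shape (1, 768)
# Local feature embeddings, one per token ("portion") of the description.
local_embeddings = outputs.last_hidden_state[:, 1:-1]   # word tokens, shape (1, T, 768)
```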