Image.txt: Transform Image Into Unique Paragraph

Motivation

Representing images as high-quality text has long been a popular problem. Traditional image captioning and dense captioning approaches such as Show and Tell rely on large amounts of manual annotation: annotators on platforms such as Amazon Mechanical Turk write a description for each image, following rules about the number of nouns, colors, and so on, and each image is usually described by a single short sentence. This plain annotation scheme creates a serious one-to-many problem: one image corresponds to many possible texts. Because of this information asymmetry between images and text, models trained on such data easily fall into mediocre solutions (a problem often encountered in pretraining as well). Large language models (LLMs), especially ChatGPT, have shown unparalleled reasoning ability. We were surprised to find that, given bounding box and object information, GPT-4 can naturally reason about the locations of objects and even infer the connections between them.

Demo Video

demo video

GPU Memory

[Figure: GPU memory requirements]

Main Pipeline

[Figure: main pipeline]
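The main pipeline can be sketched as a sequence of stages. Below is a minimal sketch in which hypothetical stub functions stand in for the real BLIP2, GRIT, Semantic Segment Anything, and ChatGPT calls, so only the control flow and the data each stage contributes are shown:

```python
# Minimal sketch of the Image.txt pipeline. Each stage below is a
# hypothetical stub; the real system calls BLIP2, GRIT, Semantic
# Segment Anything, and ChatGPT.

def blip2_caption(image):
    # Global one-sentence caption of the whole image.
    return "a dog sitting on a porch with a bike"

def grit_dense_caption(image):
    # Region-level captions with bounding boxes (x1, y1, x2, y2).
    return [("a black and white dog", (120, 200, 260, 380)),
            ("a red bike", (300, 210, 460, 400))]

def semantic_segment(image):
    # Semantic labels for the segmented regions.
    return ["porch", "floor", "wall", "trees"]

def build_llm_prompt(caption, dense_captions, regions):
    # In the real pipeline ChatGPT merges the three sources into one
    # coherent paragraph; here we only assemble the prompt it would see.
    lines = [f"Image caption: {caption}", "Dense captions:"]
    lines += [f"- {text} at {box}" for text, box in dense_captions]
    lines.append("Region semantics: " + ", ".join(regions))
    lines.append("Describe this image in one unique paragraph.")
    return "\n".join(lines)

image = None  # placeholder for a loaded image
prompt = build_llm_prompt(blip2_caption(image),
                          grit_dense_caption(image),
                          semantic_segment(image))
print(prompt)
```

The three visual models each see the image independently; only the language model combines their outputs, which is why no extra training is needed.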

Reasoning Details

[Figure: reasoning details]
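The kind of spatial reasoning GPT-4 performs over bounding boxes can be illustrated with a small helper. This function is hypothetical (not part of the codebase); it derives a coarse relation between two boxes from their center offsets:

```python
def relation(box_a, box_b):
    """Coarse spatial relation of box_a relative to box_b.
    Boxes are (x1, y1, x2, y2) with the origin at the top-left."""
    ax, ay = (box_a[0] + box_a[2]) / 2, (box_a[1] + box_a[3]) / 2
    bx, by = (box_b[0] + box_b[2]) / 2, (box_b[1] + box_b[3]) / 2
    horiz = "left of" if ax < bx else "right of"
    vert = "above" if ay < by else "below"
    # Report the axis with the larger center offset.
    return horiz if abs(ax - bx) >= abs(ay - by) else vert

# Example boxes (illustrative values, not from a real detector).
dog = (120, 200, 260, 380)
bike = (300, 210, 460, 400)
print(relation(dog, bike))  # the dog's center lies left of the bike's
```

In the actual system no such geometry code runs: the raw boxes are placed in the prompt and the LLM performs this reasoning in natural language.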

Visualization

The text-to-image model is ControlNet with Canny edge conditioning, from the diffusers library.
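The edge-conditioning step can be sketched as follows. This is a simplified gradient-threshold stand-in for the Canny detector, written in pure NumPy so it is self-contained; the actual code computes the edge map with OpenCV's Canny and feeds it to the diffusers ControlNet pipeline:

```python
import numpy as np

def edge_map(gray, threshold=0.2):
    """Simplified edge detector (a stand-in for OpenCV's Canny):
    central-difference gradient magnitude, then a threshold.
    `gray` is a 2-D float array with values in [0, 1]."""
    gx = np.zeros_like(gray)
    gy = np.zeros_like(gray)
    gy[1:-1, :] = gray[2:, :] - gray[:-2, :]   # vertical gradient
    gx[:, 1:-1] = gray[:, 2:] - gray[:, :-2]   # horizontal gradient
    mag = np.hypot(gx, gy)
    return (mag > threshold).astype(np.uint8) * 255

# Synthetic image: dark left half, bright right half -> vertical edge.
img = np.zeros((8, 8))
img[:, 4:] = 1.0
edges = edge_map(img)
```

The resulting 0/255 edge image plays the same role as ControlNet's Canny condition: it pins the layout of the generated image while the text prompt controls appearance.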

[Figures: visualization examples]

How to Use Image.txt in Gradio

[Figure: Gradio interface]

More examples

[Figure: input image]
BLIP2 Image Caption: A dog sitting on a porch with a bike.
[Figure: GRIT Dense Caption output]
[Figure: Semantic Segment Anything output]
The final generated paragraph with ChatGPT is:
This image depicts a black and white dog sitting on a porch beside a red bike. The dense caption mentions other objects in the scene,
such as a white car parked on the street and a red bike parked on the side of the road. The region semantic provides more specific information,
including the porch, floor, wall, and trees. The dog can be seen sitting on the floor beside the bike, and there is also a parked bicycle and tree in the background. The wall is visible on one side of the image, while the street and trees can be seen in the other direction.

Retrieval Result on COCO

Method                   | Trainable Parameters | Running Time | IR@1 | TR@1
Image-text               | 230M                 | 9h           | 43.8 | 33.2
Generated Paragraph-text | 0                    | 5m           | 49.7 | 36.1
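The zero-trainable-parameter retrieval above works by embedding the generated paragraphs with an off-the-shelf text encoder and ranking by cosine similarity against the query text. A minimal sketch, with toy 2-D vectors standing in for real text embeddings:

```python
import numpy as np

def cosine_retrieve(query_vec, gallery_vecs):
    """Return gallery indices ranked by cosine similarity to the query."""
    q = query_vec / np.linalg.norm(query_vec)
    g = gallery_vecs / np.linalg.norm(gallery_vecs, axis=1, keepdims=True)
    sims = g @ q                    # cosine similarity per gallery item
    return np.argsort(-sims)        # best match first

# Toy vectors standing in for encoded paragraphs.
gallery = np.array([[1.0, 0.0],
                    [0.0, 1.0],
                    [0.7, 0.7]])
query = np.array([0.9, 0.1])
ranking = cosine_retrieve(query, gallery)
```

Because the image side has already been converted to text, retrieval becomes text-to-text matching, which is why no parameters need to be trained.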

Others:

If you have more suggestions, or would like additional functions implemented in this codebase, feel free to drop me an email at awinyimg dot gmail dot com or open an issue.

Acknowledgement

This work is based on ChatGPT, Edit_Anything, BLIP2, GRIT, OFA, Segment-Anything, Semantic-Segment-Anything, and ControlNet.