Image.txt: Transform Image Into Unique Paragraph

Motivation

Representing images as high-quality text has long been a popular problem. Traditional image captioning and dense captioning approaches such as Show and Tell rely on large amounts of manual annotation: annotators on platforms such as Amazon Mechanical Turk write a description for each image, following rules about the number of nouns, colors, and so on, and each image is usually described by a single short sentence. This plain annotation scheme creates a serious one-to-many problem: one image corresponds to many possible texts. Because of this information asymmetry between images and text, models trained on such data easily fall into mediocre solutions (a problem often encountered in pretraining as well). Large language models (LLMs), especially ChatGPT, have shown unparalleled reasoning ability. We were surprised to find that, given bounding box and object information, GPT-4 can naturally reason about the locations of objects and even infer the connections between them.

Demo Video

demo video

GPU Memory

[Figure: GPU memory requirements]

Main Pipeline

[Figure: main pipeline]
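The main pipeline can be sketched as a sequence of stages. Below is a minimal sketch in which hypothetical stub functions stand in for the real BLIP2, GRIT, Semantic Segment Anything, and ChatGPT calls, so only the control flow and the data each stage contributes are shown:

```python
# Minimal sketch of the Image.txt pipeline. Each stage below is a
# hypothetical stub; the real system calls BLIP2, GRIT, Semantic
# Segment Anything, and ChatGPT.

def blip2_caption(image):
    # Global one-sentence caption of the whole image.
    return "a dog sitting on a porch with a bike"

def grit_dense_caption(image):
    # Region-level captions with bounding boxes (x1, y1, x2, y2).
    return [("a black and white dog", (120, 200, 260, 380)),
            ("a red bike", (300, 210, 460, 400))]

def semantic_segment(image):
    # Semantic labels for the segmented regions.
    return ["porch", "floor", "wall", "trees"]

def build_llm_prompt(caption, dense_captions, regions):
    # In the real pipeline ChatGPT merges the three sources into one
    # coherent paragraph; here we only assemble the prompt it would see.
    lines = [f"Image caption: {caption}", "Dense captions:"]
    lines += [f"- {text} at {box}" for text, box in dense_captions]
    lines.append("Region semantics: " + ", ".join(regions))
    lines.append("Describe this image in one unique paragraph.")
    return "\n".join(lines)

image = None  # placeholder for a loaded image
prompt = build_llm_prompt(blip2_caption(image),
                          grit_dense_caption(image),
                          semantic_segment(image))
print(prompt)
```

The three visual models each see the image independently; only the language model combines their outputs, which is why no extra training is needed.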

Reasoning Details

[Figure: reasoning details]
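The kind of spatial reasoning GPT-4 performs over bounding boxes can be illustrated with a small helper. This function is hypothetical (not part of the codebase); it derives a coarse relation between two boxes from their center offsets:

```python
def relation(box_a, box_b):
    """Coarse spatial relation of box_a relative to box_b.
    Boxes are (x1, y1, x2, y2) with the origin at the top-left."""
    ax, ay = (box_a[0] + box_a[2]) / 2, (box_a[1] + box_a[3]) / 2
    bx, by = (box_b[0] + box_b[2]) / 2, (box_b[1] + box_b[3]) / 2
    horiz = "left of" if ax < bx else "right of"
    vert = "above" if ay < by else "below"
    # Report the axis with the larger center offset.
    return horiz if abs(ax - bx) >= abs(ay - by) else vert

# Example boxes (illustrative values, not from a real detector).
dog = (120, 200, 260, 380)
bike = (300, 210, 460, 400)
print(relation(dog, bike))  # the dog's center lies left of the bike's
```

In the actual system no such geometry code runs: the raw boxes are placed in the prompt and the LLM performs this reasoning in natural language.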

Visualization

The text-to-image model is ControlNet with Canny edge conditioning, from the diffusers library.
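The edge-conditioning step can be sketched as follows. This is a simplified gradient-threshold stand-in for the Canny detector, written in pure NumPy so it is self-contained; the actual code computes the edge map with OpenCV's Canny and feeds it to the diffusers ControlNet pipeline:

```python
import numpy as np

def edge_map(gray, threshold=0.2):
    """Simplified edge detector (a stand-in for OpenCV's Canny):
    central-difference gradient magnitude, then a threshold.
    `gray` is a 2-D float array with values in [0, 1]."""
    gx = np.zeros_like(gray)
    gy = np.zeros_like(gray)
    gy[1:-1, :] = gray[2:, :] - gray[:-2, :]   # vertical gradient
    gx[:, 1:-1] = gray[:, 2:] - gray[:, :-2]   # horizontal gradient
    mag = np.hypot(gx, gy)
    return (mag > threshold).astype(np.uint8) * 255

# Synthetic image: dark left half, bright right half -> vertical edge.
img = np.zeros((8, 8))
img[:, 4:] = 1.0
edges = edge_map(img)
```

The resulting 0/255 edge image plays the same role as ControlNet's Canny condition: it pins the layout of the generated image while the text prompt controls appearance.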

[Figures: visualization examples]

How to Use Image.txt in Gradio

[Figure: Gradio interface]

More examples

[Figure: input image]
BLIP2 Image Caption: A dog sitting on a porch with a bike.
[Figure: GRIT Dense Caption output]
[Figure: Semantic Segment Anything output]
The final generated paragraph with ChatGPT is:
This image depicts a black and white dog sitting on a porch beside a red bike. The dense caption mentions other objects in the scene,
such as a white car parked on the street and a red bike parked on the side of the road. The region semantic provides more specific information,
including the porch, floor, wall, and trees. The dog can be seen sitting on the floor beside the bike, and there is also a parked bicycle and tree in the background. The wall is visible on one side of the image, while the street and trees can be seen in the other direction.

Retrieval Result on COCO

Method                   | Trainable Parameters | Running Time | IR@1 | TR@1
Image-text               | 230M                 | 9h           | 43.8 | 33.2
Generated Paragraph-text | 0                    | 5m           | 49.7 | 36.1
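The zero-trainable-parameter retrieval above works by embedding the generated paragraphs with an off-the-shelf text encoder and ranking by cosine similarity against the query text. A minimal sketch, with toy 2-D vectors standing in for real text embeddings:

```python
import numpy as np

def cosine_retrieve(query_vec, gallery_vecs):
    """Return gallery indices ranked by cosine similarity to the query."""
    q = query_vec / np.linalg.norm(query_vec)
    g = gallery_vecs / np.linalg.norm(gallery_vecs, axis=1, keepdims=True)
    sims = g @ q                    # cosine similarity per gallery item
    return np.argsort(-sims)        # best match first

# Toy vectors standing in for encoded paragraphs.
gallery = np.array([[1.0, 0.0],
                    [0.0, 1.0],
                    [0.7, 0.7]])
query = np.array([0.9, 0.1])
ranking = cosine_retrieve(query, gallery)
```

Because the image side has already been converted to text, retrieval becomes text-to-text matching, which is why no parameters need to be trained.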

Others:

If you have more suggestions, or would like additional functions implemented in this codebase, feel free to drop me an email at awinyimg dot gmail dot com or open an issue.

Acknowledgement

This work is based on ChatGPT, Edit_Anything, BLIP2, GRIT, OFA, Segment-Anything, Semantic-Segment-Anything, and ControlNet.