Hi, I have a question about visual grounding.
I have a 720x1280 image and I want to describe the region [0,0,512,512] (x1,y1,x2,y2), so I followed CogVLM1's suggestion and converted the coordinates this way ( https://github.com/THUDM/CogVLM?tab=readme-ov-file#cookbook ):
Format of coordination: The bounding box coordinates in the model's input and output use the format [[x1, y1, x2, y2]], with the origin at the top left corner, the x-axis to the right, and the y-axis downward. (x1, y1) and (x2, y2) are the top-left and bottom-right corners, respectively, with values as relative coordinates multiplied by 1000 (prefixed with zeros to three digits).
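Since the same format is used for the model's output, here is a minimal sketch of mapping a box in that format back to pixel coordinates. The function name `parse_cogvlm_boxes` is hypothetical (it is not part of CogVLM), and the regular expression is an assumption that only handles a single box per pair of double brackets.

```python
import re

# Assumption: one box per [[...]] group, values are 1-3 digit integers.
BOX_PATTERN = re.compile(r"\[\[(\d{1,3}),\s*(\d{1,3}),\s*(\d{1,3}),\s*(\d{1,3})\]\]")


def parse_cogvlm_boxes(text, width, height):
    """Return pixel boxes (x1, y1, x2, y2) for every [[x1,y1,x2,y2]] in text.

    width and height are the image size in pixels. Values in the text are
    relative coordinates multiplied by 1000, as described above.
    """
    boxes = []
    for match in BOX_PATTERN.finditer(text):
        x1, y1, x2, y2 = (int(value) for value in match.groups())
        boxes.append((
            round(x1 * width / 1000),
            round(y1 * height / 1000),
            round(x2 * width / 1000),
            round(y2 * height / 1000),
        ))
    return boxes
```

For example, `parse_cogvlm_boxes("[[000,000,400,712]]", 1280, 720)` treats the image as 1280 pixels wide and 720 pixels high.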
So my prompt is the following, but the model tends to give me a description of the whole image. Is my prompt right?
Tell me what you see within the designated area [[000,000,400,712]] in the picture
# This is how I get the region values
original region: [0,0,512,512]
target format: [[000,000,512/1280*1000,512/720*1000]] >> [[000,000,400,712]]
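The conversion above can be sketched as a small helper. This is only an illustration: `to_cogvlm_box` is a hypothetical name, clamping to 999 is an assumption made so every value fits in three digits, and the code assumes x is divided by the image width and y by the image height. Which of 720 and 1280 is the width is worth double-checking, since swapping them changes the result.

```python
def to_cogvlm_box(box, width, height):
    """Convert a pixel box (x1, y1, x2, y2) into a [[x1,y1,x2,y2]] string.

    Each value is the relative coordinate multiplied by 1000, zero-padded
    to three digits. x values are scaled by width, y values by height.
    """
    x1, y1, x2, y2 = box
    scaled = [
        round(x1 / width * 1000),
        round(y1 / height * 1000),
        round(x2 / width * 1000),
        round(y2 / height * 1000),
    ]
    # Assumption: clamp to 0..999 so each value stays within three digits.
    clamped = [min(999, max(0, value)) for value in scaled]
    return "[[" + ",".join(f"{value:03d}" for value in clamped) + "]]"


# Width 1280, height 720 (the reading used in the calculation above).
print(to_cogvlm_box((0, 0, 512, 512), 1280, 720))
# Width 720, height 1280 (the other possible reading of "720x1280").
print(to_cogvlm_box((0, 0, 512, 512), 720, 1280))
```

With rounding to the nearest integer, the first call gives a y2 of 711 rather than the 712 used in the prompt; the one-unit difference is unlikely to matter. The second call shows how the box changes if the image is actually 720 pixels wide and 1280 pixels high.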
example:
Tell me what you see within the designated area [[000,000,400,712]] in the picture. Describe each object in a simple sentence is enough.
image
CogVLM2's result
CogVLM2: Within the designated area, the foreground displays a green bus, parked cars, and a pedestrian crossing sign, while the background includes a blue bus stop sign, trees, and a building, all under a clear sky.<|end_of_text|>