SamGIS - Some notes about Segment Anything
From the Segment Anything paper
"SAM" is a foundation model aiming for performing "zero-shot" image segmentation:
- it's build and trained with a large image dataset with a massive amount of segmentation masks
- the SAM team propose the "promptable" segmentation task, where the goal is to return a valid segmentation mask given any segmentation prompt.
Since this model should perform "zero-shot" segmentation the model must support flexible prompts, needs to compute masks in amortized real-time to allow interactive use and must be ambiguity-aware. That's the model architecture:
- source 1: an image encoder computes an image embedding
- source 2: a fast prompt encoder embeds prompts
- output: a fast mask decoder combines these two sources to predict segmentation masks
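The split between a heavy image encoder and a lightweight prompt encoder/mask decoder is what makes interactive use (and embedding re-use) possible. A minimal sketch of that flow, with trivial NumPy stand-ins for the real networks (all function bodies here are hypothetical placeholders, not SAM's actual layers):

```python
import numpy as np

rng = np.random.default_rng(0)

def image_encoder(image):
    # heavy step: SAM uses a ViT here; a trivial channel mean stands in
    return image.mean(axis=-1)

def prompt_encoder(point):
    # fast step: embed a (row, col) click as-is
    return np.asarray(point, dtype=float)

def mask_decoder(embedding, prompt_embedding):
    # fast step: toy decoder marking pixels similar to the clicked one
    r, c = int(prompt_embedding[0]), int(prompt_embedding[1])
    return np.abs(embedding - embedding[r, c]) < 0.1

image = rng.random((64, 64, 3))
embedding = image_encoder(image)                            # computed once (expensive)
mask_a = mask_decoder(embedding, prompt_encoder((10, 10)))  # cheap per prompt
mask_b = mask_decoder(embedding, prompt_encoder((40, 40)))  # embedding re-used
```

The point of the sketch: the expensive encoder runs once per image, while each new prompt only pays for the two fast stages.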
Because segmentation masks are not abundant online, especially high-quality ones, the SAM developers opted for a "data engine": co-developing the model and the dataset annotations (from a manual stage to semi-automated to fully automated). Images in SA-1B span a geographically and economically diverse set of countries, and the authors found that SAM performs similarly across different groups of people.
Segment Anything Tasks
Task
Here the SAM team translates the idea of prompts from NLP to segmentation (selecting/de-selecting points, a box, a mask, free-form text). Just as a language model should output a coherent response to an ambiguous prompt, the promptable segmentation task should return a valid segmentation mask given any prompt.
Pre-Training
The promptable segmentation task suggests a natural pre-training algorithm that simulates a sequence of prompts (e.g., points, boxes, masks) for each training sample and compares the model’s mask predictions against the ground truth.
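One way to simulate such a prompt sequence is to sample the first point from the ground-truth mask and later points from the region where the current prediction disagrees with it. A toy sketch of that idea (the helper names and the error-region strategy are my illustration, not code from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_point(mask):
    """Pick a random pixel coordinate inside a boolean mask."""
    rows, cols = np.nonzero(mask)
    i = rng.integers(len(rows))
    return int(rows[i]), int(cols[i])

def simulate_prompt_sequence(gt_mask, predict, n_rounds=3):
    """First prompt sampled from the ground truth; later prompts from the error region."""
    prompts = [sample_point(gt_mask)]
    for _ in range(n_rounds - 1):
        pred = predict(prompts)
        error = gt_mask ^ pred          # pixels where prediction and ground truth disagree
        if not error.any():
            break
        prompts.append(sample_point(error))
    return prompts

# toy usage: a "model" that always predicts an empty mask
gt = np.zeros((8, 8), dtype=bool)
gt[2:5, 2:5] = True
prompts = simulate_prompt_sequence(gt, lambda ps: np.zeros_like(gt), n_rounds=2)
```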
Segment Anything Model
Image encoder
The algorithm uses an MAE ("Masked Autoencoders Are Scalable Vision Learners") pre-trained Vision Transformer (ViT), minimally adapted to process high-resolution inputs.
Prompt encoder
SAM supports two sets of prompts:
- sparse (points, boxes, text)
- dense (masks)
SAM represents points and boxes by positional encodings summed with learned embeddings for each prompt type. Dense prompts (i.e., masks) are embedded using convolutions and summed element-wise with the image embedding.
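The "positional encoding plus learned per-type embedding" idea for sparse prompts can be sketched in a few lines. Dimensions, frequencies, and the type names are illustrative choices, not SAM's actual hyperparameters:

```python
import numpy as np

DIM = 8
rng = np.random.default_rng(0)

# one learned embedding per prompt type (e.g., foreground vs background point)
learned_type_embeddings = {
    "fg_point": rng.standard_normal(DIM),
    "bg_point": rng.standard_normal(DIM),
}

def positional_encoding(xy, dim=DIM):
    """Sinusoidal encoding of a normalized (x, y) position."""
    x, y = xy
    freqs = 2.0 ** np.arange(dim // 4)
    return np.concatenate([
        np.sin(np.outer([x, y], freqs)).ravel(),
        np.cos(np.outer([x, y], freqs)).ravel(),
    ])

def encode_point(xy, prompt_type):
    # sparse prompt = positional encoding + learned per-type embedding
    return positional_encoding(xy) + learned_type_embeddings[prompt_type]

v = encode_point((0.5, 0.5), "fg_point")
```

Two clicks at the same location but with different labels thus map to different embeddings, which is what lets the decoder treat them differently.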
Mask decoder
The mask decoder efficiently maps the image embedding, prompt embeddings, and an output token to a mask. This design employs a modification of a Transformer decoder block followed by a dynamic mask prediction head. The decoder block uses prompt self-attention and cross-attention in two directions (prompt-to-image embedding and vice versa) to update all embeddings. After running two blocks, the procedure upsamples the image embedding, and an MLP maps the output token to a dynamic linear classifier, which then computes the mask foreground probability at each image location.
Resolving ambiguity
With a single output, an ambiguous prompt would force the model to merge valid masks; to avoid this, the model can predict more than one output mask for a single prompt. Three masks address most common cases (nested masks are often at most three deep: whole, part, and subpart). During training, only the minimum loss over the predicted masks is backpropagated. To rank masks, the model predicts a confidence score (i.e., an estimated IoU) for each mask.
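The "minimum loss over masks" rule is easy to sketch with an IoU-based loss (the real model uses a different loss; this is just to show which mask the gradient would flow through):

```python
import numpy as np

def iou(a, b):
    """Intersection over union of two boolean masks."""
    union = np.logical_or(a, b).sum()
    return np.logical_and(a, b).sum() / union if union else 0.0

def min_loss_over_masks(pred_masks, gt_mask):
    """Training-rule sketch: only the best predicted mask is penalized."""
    losses = [1.0 - iou(m, gt_mask) for m in pred_masks]
    best = int(np.argmin(losses))
    return best, losses[best]

# toy nested predictions: whole image, the correct part, a subpart
gt = np.zeros((8, 8), dtype=bool)
gt[2:6, 2:6] = True
whole = np.ones((8, 8), dtype=bool)
part = gt.copy()
subpart = np.zeros((8, 8), dtype=bool)
subpart[3:5, 3:5] = True
best, loss = min_loss_over_masks([whole, part, subpart], gt)
```

Here the "part" prediction matches the ground truth exactly, so it is the one selected and the other two heads are left free to cover other valid interpretations.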
About image embedding re-use and SamGIS
After reading this paper I understood that I could improve the SamGIS software design by storing and re-using the image embeddings.
I implemented this change in SamGIS version 1.3.0. Here are some timings from the SamGIS demo:
- first request: 5.42s
  - instantiated the FastSAM model
  - created the image from the web map (I'm using OpenStreetMap as tiles provider and Mapnik as map layer)
  - created the image embedding
- second request: 0.41s
- third to seventh requests: ~0.34s
Note that making one request immediately after another keeps request durations low, probably because the backend caches the downloaded tiles. Waiting more than 10 minutes instead seems to invalidate that cache: in my tests, contextily (the GeoPandas-family library I use as a tiles client) then added from 0.5s to 1.5s to download the tiles.
My test request payload:
{
"bbox": {
"ne": {
"lat": 46.236615111857255,
"lng": 9.519996643066408
},
"sw": {
"lat": 46.13405108959001,
"lng": 9.29821014404297
}
},
"prompt": [
{
"id": 146,
"type": "point",
"data": {
"lat": 46.18483299780137,
"lng": 9.418864745562386
},
"label": 1
}
],
"zoom": 13,
"source_type": "OpenStreetMap"
}
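For reference, a payload with this schema can be assembled programmatically before sending it to the SamGIS backend. The helper below is my own sketch (the field names match the JSON above; the builder function itself is hypothetical):

```python
import json

def build_payload(ne, sw, point, zoom=13, source_type="OpenStreetMap"):
    """Assemble a SamGIS request body matching the schema shown above."""
    return {
        "bbox": {"ne": {"lat": ne[0], "lng": ne[1]},
                 "sw": {"lat": sw[0], "lng": sw[1]}},
        "prompt": [{"id": 146, "type": "point",
                    "data": {"lat": point[0], "lng": point[1]},
                    "label": 1}],
        "zoom": zoom,
        "source_type": source_type,
    }

payload = build_payload(
    ne=(46.236615111857255, 9.519996643066408),
    sw=(46.13405108959001, 9.29821014404297),
    point=(46.18483299780137, 9.418864745562386),
)
body = json.dumps(payload)  # ready to POST to the SamGIS endpoint
```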
About Zero-Shot Text-to-Mask: LISA and SamGIS
SAM can also use simple free-form text prompts. For a practical use of this feature, you might be interested in my integration of LISA with SamGIS and its CUDA demo. I need to keep it paused because of cost, but I am requesting a free GPU from HuggingFace.
Right now there is a demo running on ZeroGPU hardware: it's a little slower than the regular CUDA demo, but it's free to use (as long as I keep paying for my HuggingFace PRO subscription).
If you like my project, please like or comment on the HuggingFace GPU resource request thread.