LISA adapted to SamGIS

Image segmentation is a crucial task in computer vision, where the goal is to extract the instance segmentation mask for a desired object within the image. I've already worked on a project, SamGIS, that focuses on this particular application of computer vision. A logical next step is adding the ability to recognize objects through text prompts. This apparently simple task is actually different from what Segment Anything (the ML backend used by SamGIS) does. In fact, "SAM" does not output descriptions or categorizations of its input images. Starting from a written prompt, on the contrary, requires understanding which classes of objects exist in the image under analysis. A visual language model (or VLM) that performs well at this task is LISA. LISA's authors built their work on top of Segment Anything and LLaVA, a large language model with multimodal capabilities (it can process both text prompts and images). By leveraging LISA's "reasoning segmentation" abilities, SamGIS can now conduct "zero-shot" analyses, meaning it can operate without specific or specialized prior training in geological, geomorphological, or photogrammetric fields.

Some input text prompts with their geojson outputs

I exported the overlay image from the Esri.WorldImagery tile provider.

Note that I added some complex text prompts like "you need to segment the houses near roads". Some results are better than others, and this may improve with more advanced LLMs or with models that have a larger number of parameters.

Example payload request (JSON):
{
    "bbox": {
        "ne": {
            "lat": 46.173968917056655,
            "lng": 10.082219839096071
        },
        "sw": {
            "lat": 46.16651671595163,
            "lng": 10.066105127334597
        }
    },
    "string_prompt": "You are a skilled gis analyst with a lot of expertise in photogrammetry, remote sensing and geomorphology field. You need to identify...",
    "zoom": 17,
    "source_type": "Esri.WorldImagery"
}
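
As a minimal sketch, a request body with the same shape as the payload above could be assembled and checked in Python before sending it. The function name `build_payload` and its validation rules are assumptions made for illustration, not part of SamGIS; the endpoint URL is not given here, so the sketch stops at the serialized JSON string.

```python
import json


def build_payload(ne, sw, prompt, zoom=17, source_type="Esri.WorldImagery"):
    """Build a request body shaped like the example payload (hypothetical helper).

    ne and sw are (lat, lng) tuples for the north-east and south-west corners.
    """
    ne_lat, ne_lng = ne
    sw_lat, sw_lng = sw
    for lat in (ne_lat, sw_lat):
        if not -90 <= lat <= 90:
            raise ValueError(f"latitude out of range: {lat}")
    for lng in (ne_lng, sw_lng):
        if not -180 <= lng <= 180:
            raise ValueError(f"longitude out of range: {lng}")
    # Simplification: boxes crossing the antimeridian are rejected here.
    if ne_lat <= sw_lat or ne_lng <= sw_lng:
        raise ValueError("ne corner must be north and east of the sw corner")
    if not prompt.strip():
        raise ValueError("string_prompt must not be empty")
    return {
        "bbox": {
            "ne": {"lat": ne_lat, "lng": ne_lng},
            "sw": {"lat": sw_lat, "lng": sw_lng},
        },
        "string_prompt": prompt,
        "zoom": zoom,
        "source_type": source_type,
    }


payload = build_payload(
    ne=(46.173968917056655, 10.082219839096071),
    sw=(46.16651671595163, 10.066105127334597),
    prompt="you need to segment the houses near roads",
)
body = json.dumps(payload)  # JSON string ready to be used as a POST body
```

Checking the corners on the client side catches swapped "ne" and "sw" values before a slow GPU request is made.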

Note in particular the source_type field: it accepts the identifying string of any tile provider listed in leaflet-extras/leaflet-providers. The most obvious use (and probably the one with the best results) is a satellite tile provider such as Esri.WorldImagery, the one used in the payload above.

Note that some of these tile providers are commercial services and/or have special requirements for use (e.g. registration, abiding by the terms of service, etc.).
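
To give an idea of how the "bbox" and "zoom" values relate to a tile provider, here is a rough sketch of the standard Web Mercator XYZ ("slippy map") tile math. It assumes the chosen provider serves tiles in that common scheme; the function names are hypothetical and this is not taken from the SamGIS code.

```python
import math


def latlng_to_tile(lat, lng, zoom):
    """Return the (x, y) index of the XYZ tile that contains a WGS84 point."""
    n = 2 ** zoom  # number of tiles per axis at this zoom level
    x = int((lng + 180.0) / 360.0 * n)
    lat_rad = math.radians(lat)
    y = int((1.0 - math.asinh(math.tan(lat_rad)) / math.pi) / 2.0 * n)
    # Clamp so that points on the map edge still map to a valid tile.
    return min(max(x, 0), n - 1), min(max(y, 0), n - 1)


def tiles_for_bbox(ne, sw, zoom):
    """List the (x, y) tiles covering a box given as (lat, lng) corners."""
    x_min, y_max = latlng_to_tile(sw[0], sw[1], zoom)  # y grows southwards
    x_max, y_min = latlng_to_tile(ne[0], ne[1], zoom)
    return [
        (x, y)
        for y in range(y_min, y_max + 1)
        for x in range(x_min, x_max + 1)
    ]


tiles = tiles_for_bbox(
    ne=(46.173968917056655, 10.082219839096071),
    sw=(46.16651671595163, 10.066105127334597),
    zoom=17,
)
print(f"{len(tiles)} tiles cover the example bounding box at zoom 17")
```

A higher zoom on the same bounding box means more tiles to download and a larger image to analyze.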

Duration of segmentation tasks

At the moment, a prompt that also asks for an explanation of the segmentation task greatly slows down the analysis. The same prompt on the same image without "descriptive" or "explanatory" questions finishes much faster. Tests with explanatory text take more than 60 seconds, while tests without it take between 3 and 8 seconds, using the HuggingFace hardware profile "Nvidia T4 Small" with 4 vCPU, 15 GB RAM and 16 GB VRAM.
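
A comparison like this can be reproduced with a small timing helper. The sketch below is only illustrative: `fake_inference` is a hypothetical stand-in that sleeps briefly, so its timings say nothing about the real model.

```python
import time


def timed(func, *args, **kwargs):
    """Call func and return (result, elapsed seconds) using a monotonic clock."""
    start = time.perf_counter()
    result = func(*args, **kwargs)
    return result, time.perf_counter() - start


def fake_inference(prompt):
    """Hypothetical stand-in for a segmentation request; replace with a real call."""
    time.sleep(0.01)
    return f"mask for: {prompt}"


prompts = {
    "plain": "you need to segment the houses near roads",
    "explanatory": "you need to segment the houses near roads. Explain why.",
}
durations = {}
for label, prompt in prompts.items():
    _, durations[label] = timed(fake_inference, prompt)
    print(f"{label}: {durations[label]:.3f} s")
```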

Software architecture

Technically and architecturally, the demo consists of a frontend page similar to the SamGIS demo. Instead of the drawing toolbar there is a text prompt for natural language requests, with some selectable examples displayed at the top of the page. The backend is a FastAPI-based API that calls a custom LISA function wrapper.
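
The backend flow (parse the payload, call the wrapper, return GeoJSON) can be sketched without any framework. Everything below is an assumption for illustration: `SegmentRequest`, `lisa_wrapper_stub` and `handle_request` are hypothetical names, and the stub simply returns the bounding box as a polygon instead of running LISA. In a FastAPI app, a function like `handle_request` would sit behind a POST route.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SegmentRequest:
    """Parsed request, mirroring the fields of the example payload."""
    ne: tuple  # (lat, lng)
    sw: tuple  # (lat, lng)
    string_prompt: str
    zoom: int
    source_type: str

    @classmethod
    def from_payload(cls, payload):
        bbox = payload["bbox"]
        return cls(
            ne=(bbox["ne"]["lat"], bbox["ne"]["lng"]),
            sw=(bbox["sw"]["lat"], bbox["sw"]["lng"]),
            string_prompt=payload["string_prompt"],
            zoom=int(payload["zoom"]),
            source_type=payload["source_type"],
        )


def lisa_wrapper_stub(request):
    """Hypothetical stand-in for the LISA wrapper: echoes the bbox as GeoJSON."""
    (north, east), (south, west) = request.ne, request.sw
    # GeoJSON positions are [lng, lat]; the ring is closed and counterclockwise.
    ring = [[west, south], [east, south], [east, north], [west, north], [west, south]]
    feature = {
        "type": "Feature",
        "properties": {"prompt": request.string_prompt},
        "geometry": {"type": "Polygon", "coordinates": [ring]},
    }
    return {"type": "FeatureCollection", "features": [feature]}


def handle_request(payload, wrapper=lisa_wrapper_stub):
    """Validate the payload, call the wrapper, return (status code, body)."""
    try:
        request = SegmentRequest.from_payload(payload)
    except (KeyError, TypeError, ValueError) as exc:
        return 422, {"detail": f"invalid payload: {exc!r}"}
    return 200, wrapper(request)


example = {
    "bbox": {
        "ne": {"lat": 46.173968917056655, "lng": 10.082219839096071},
        "sw": {"lat": 46.16651671595163, "lng": 10.066105127334597},
    },
    "string_prompt": "you need to segment the houses near roads",
    "zoom": 17,
    "source_type": "Esri.WorldImagery",
}
status, geojson = handle_request(example)
```

Passing the wrapper as a parameter keeps the request handling testable on a machine without a GPU.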

Unfortunately I have had to pause my demo due to GPU costs, but I am requesting the use of a free GPU from HuggingFace. Please feel free to reach out to me on LinkedIn for a live demonstration, more information or further clarification.
