LogBook


2025/08/08 [GADM-api]

I analyzed geometry duplication across GADM levels and found over 9,000 duplicate geometries. This duplication often occurs in small countries, such as the Vatican, or in islands. To verify whether geometries were truly duplicated, I compared their hashes, point counts, and areas — an approach that should eliminate hash collisions.

-- =======================================================================
-- QUERY 3: Check for potential MD5 collisions by comparing geometry equality
-- =======================================================================
WITH all_geometries AS (
    SELECT 
        md5_geom_binary_hash as hash,
        'adm_0' as table_name,
        geom,
        gid_0 as gid,
        country as name
    FROM adm_0
    
    UNION ALL
    
    SELECT 
        md5_geom_binary_hash as hash,
        'adm_1' as table_name,
        geom,
        gid_1 as gid,
        name_1 as name
    FROM adm_1
    
    UNION ALL
    
    SELECT 
        md5_geom_binary_hash as hash,
        'adm_2' as table_name,
        geom,
        gid_2 as gid,
        name_2 as name
    FROM adm_2
    
    UNION ALL
    
    SELECT 
        md5_geom_binary_hash as hash,
        'adm_3' as table_name,
        geom,
        gid_3 as gid,
        name_3 as name
    FROM adm_3

    UNION ALL
    
    SELECT 
        md5_geom_binary_hash as hash,
        'adm_4' as table_name,
        geom,
        gid_4 as gid,
        name_4 as name
    FROM adm_4

    UNION ALL
    
    SELECT 
        md5_geom_binary_hash as hash,
        'adm_5' as table_name,
        geom,
        gid_5 as gid,
        name_5 as name
    FROM adm_5
    -- Add other tables as needed for comprehensive check
),
hash_groups AS (
    SELECT 
        hash,
        COUNT(*) as count,
        array_agg(DISTINCT table_name) as tables,
        array_agg(name) as names
    FROM all_geometries
    GROUP BY hash
    HAVING COUNT(*) > 1
),
collision_check AS (
    SELECT 
        hg.hash,
        hg.count,
        hg.tables,
        hg.names,
        -- Check if all geometries with same hash are actually equal
        COUNT(DISTINCT ST_AsText(ag.geom)) as unique_geometries_count
    FROM hash_groups hg
    JOIN all_geometries ag ON hg.hash = ag.hash
    GROUP BY hg.hash, hg.count, hg.tables, hg.names
)
SELECT 
    hash,
    count as duplicate_count,
    tables,
    names,
    unique_geometries_count,
    CASE 
        WHEN unique_geometries_count = 1 THEN 'TRUE DUPLICATE'
        ELSE 'POTENTIAL MD5 COLLISION!'
    END as analysis
FROM collision_check
ORDER BY unique_geometries_count DESC, duplicate_count DESC;

An unexpected benefit of this situation is that I can store all geometries in one place, remove duplicates, and save space. One option is to maintain a single geometries table with a one-to-many relationship to the original GADM tables. However, this makes it harder to map back from adm_geometries to the lowest-level adm_<nr> tables.

My next experiment is to merge all adm_ tables into a single table, with metadata stored in a JSON/JSONB column.

table: adm
- lv: int
- metadata: JSON
- geometry_hash // md5 hash

table: adm_geometries
- geometry_hash: text // md5 hash
- geom: GEOMETRY(POLYGON)

That’s it for today!