I analyzed geometry duplication across GADM levels and found over 9,000 duplicate geometries. This duplication often occurs in small countries, such as the Vatican, or in islands. To verify whether geometries were truly duplicated, I compared their hashes, point counts, and areas — an approach that should eliminate hash collisions.
-- =======================================================================
-- QUERY 3: Check for potential MD5 collisions by comparing geometry equality
-- =======================================================================
WITH all_geometries AS (
SELECT
md5_geom_binary_hash as hash,
'adm_0' as table_name,
geom,
gid_0 as gid,
country as name
FROM adm_0
UNION ALL
SELECT
md5_geom_binary_hash as hash,
'adm_1' as table_name,
geom,
gid_1 as gid,
name_1 as name
FROM adm_1
UNION ALL
SELECT
md5_geom_binary_hash as hash,
'adm_2' as table_name,
geom,
gid_2 as gid,
name_2 as name
FROM adm_2
UNION ALL
SELECT
md5_geom_binary_hash as hash,
'adm_3' as table_name,
geom,
gid_3 as gid,
name_3 as name
FROM adm_3
UNION ALL
SELECT
md5_geom_binary_hash as hash,
'adm_4' as table_name,
geom,
gid_4 as gid,
name_4 as name
FROM adm_4
UNION ALL
SELECT
md5_geom_binary_hash as hash,
'adm_5' as table_name,
geom,
gid_5 as gid,
name_5 as name
FROM adm_5
-- Add other tables as needed for comprehensive check
),
hash_groups AS (
SELECT
hash,
COUNT(*) as count,
array_agg(DISTINCT table_name) as tables,
array_agg(name) as names
FROM all_geometries
GROUP BY hash
HAVING COUNT(*) > 1
),
collision_check AS (
SELECT
hg.hash,
hg.count,
hg.tables,
hg.names,
-- Check if all geometries with same hash are actually equal
COUNT(DISTINCT ST_AsText(ag.geom)) as unique_geometries_count
FROM hash_groups hg
JOIN all_geometries ag ON hg.hash = ag.hash
GROUP BY hg.hash, hg.count, hg.tables, hg.names
)
SELECT
hash,
count as duplicate_count,
tables,
names,
unique_geometries_count,
CASE
WHEN unique_geometries_count = 1 THEN 'TRUE DUPLICATE'
ELSE 'POTENTIAL MD5 COLLISION!'
END as analysis
FROM collision_check
ORDER BY unique_geometries_count DESC, duplicate_count DESC;
An unexpected benefit of this situation is that I can store all geometries in one place, remove duplicates, and save space. One option is to maintain a single geometries table with a one-to-many relationship to the original GADM tables. However, this makes it harder to map back from adm_geometries to the lowest-level adm_<nr> tables.
My next experiment is to merge all adm_ tables into a single table, with metadata stored in a JSON/JSONB column.
table: adm
- lv: int
- metadata: JSON
- geometry_hash // md5 hash
table: adm_geometries
- geometry_hash: text // md5 hash
- geom: GEOMETRY(POLYGON)
That’s it for today!