I've just got my hands on another arts and humanities data set. This one's smaller than most of the others I've been looking at, and it's been put together in an MS Access application. Fortunately, the owners are aware that that's not a maintainable approach, and want a method of publishing it on the Web. Also, rather nicely, they've been aware of a number of data issues, such as regularisation of text fields: they've partially normalised the data, and effectively have a good ontology for their data.
Sadly, it's not all rosy:
db=# select count(id) from table1; count ------- 3620 (1 row) db=# select id, count(id) as num from table1 group by id having count(id) > 1; id | num ------+----- 2409 | 2 2247 | 2 847 | 2 232 | 2 739 | 30 2408 | 2 2423 | 2 2445 | 2 [...] (186 rows)
So, that's 186 duplicate IDs, some with multiple entries. This isn't the only table, and it's referenced heavily by another table. By ID. The not-unique ID.
Time to investigate deeper.