A taxonomy of metadata matching algorithms
Automated metadata discovery algorithms fall into several distinct categories. The first group matches on names and labels:
- Exact match – data element linkages are made based on the exact name of a column in a database, the name of an XML element or a label on a screen. For example, if a database column has the name “PersonBirthDate” and a data element in a metadata registry also has the name “PersonBirthDate”, automated tools can infer that the database column has the same semantics (meaning) as the data element in the metadata registry.
- Synonym match – the discovery tool is given not just a single name but a set of synonyms for each data element.
- Pattern match – the tool is given a set of lexical patterns that it can match. For example, the tool may search for “*gender*” or “*sex*”.
- Semantic similarity – this algorithm relies on a database of conceptual nearness between words. For example, the WordNet system can rank how close words are conceptually to each other, so the terms “Person”, “Individual” and “Human” may be rated as highly similar concepts.
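The exact, synonym and pattern matches above can be sketched in a few lines of Python. This is only an illustrative sketch: the `REGISTRY` contents, the synonym sets, the glob patterns and the `match_column` function are hypothetical names invented for the example, not part of any particular metadata registry product.

```python
import fnmatch

# Hypothetical registry: data element name -> synonyms and glob patterns.
# Patterns are written in lowercase because column names are lowercased
# before pattern matching.
REGISTRY = {
    "PersonBirthDate": {
        "synonyms": {"DateOfBirth", "DOB", "BirthDate"},
        "patterns": ["*birth*date*"],
    },
    "PersonGenderCode": {
        "synonyms": {"Gender", "Sex"},
        "patterns": ["*gender*", "*sex*"],
    },
}


def match_column(column_name):
    """Return (data_element, match_kind) for a column name, or None."""
    lowered = column_name.lower()

    # 1. Exact match: the column name equals the registered name.
    for element in REGISTRY:
        if column_name == element:
            return element, "exact"

    # 2. Synonym match: the column name is one of the known synonyms
    #    (compared case-insensitively).
    for element, rules in REGISTRY.items():
        if lowered in {s.lower() for s in rules["synonyms"]}:
            return element, "synonym"

    # 3. Pattern match: the lowercased column name fits a glob pattern.
    for element, rules in REGISTRY.items():
        if any(fnmatch.fnmatchcase(lowered, p) for p in rules["patterns"]):
            return element, "pattern"

    return None
```

Trying the strictest rule first means a weaker rule only fires when nothing stronger applies. Broad patterns can still produce false positives: with the patterns assumed here, a column named “MiddlesexCounty” would match “*sex*”, so pattern matches are probably best treated as candidates for human review.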
Statistical matching uses statistics about the data source's data itself, rather than its names, to derive similarities with registered data elements.
- Distinct value analysis – by analyzing all the distinct values in a column, its similarity to a registered data element may be inferred. For example, if a column has only the two distinct values ‘male’ and ‘female’, it could be mapped to ‘PersonGenderCode’.
- Data distribution analysis – by analyzing the distribution of values within a single column and comparing this distribution with those of known data elements, a semantic linkage could be inferred.
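Both statistical techniques can be illustrated with a small Python sketch. The scoring choices here are assumptions made for the example, not methods named in the text: distinct values are compared with Jaccard similarity, and distributions are compared with total variation distance. The function names and the normalization step are likewise hypothetical.

```python
from collections import Counter


def normalize(values):
    """Lowercase and trim values, dropping missing (None) entries."""
    return [str(v).strip().lower() for v in values if v is not None]


def distinct_value_score(column_values, reference_values):
    """Jaccard similarity between a column's distinct values and the
    value set of a registered data element (1.0 = identical sets)."""
    distinct = set(normalize(column_values))
    union = distinct | reference_values
    if not union:
        return 0.0
    return len(distinct & reference_values) / len(union)


def relative_frequencies(values):
    """Map each normalized value to its share of the column."""
    counts = Counter(normalize(values))
    total = sum(counts.values())
    return {value: count / total for value, count in counts.items()}


def distribution_distance(column_values, reference_distribution):
    """Total variation distance between the column's value distribution
    and a reference distribution (0.0 = identical, 1.0 = no overlap)."""
    observed = relative_frequencies(column_values)
    if not observed:
        return 1.0  # assumption: an empty column matches nothing
    keys = set(observed) | set(reference_distribution)
    return 0.5 * sum(
        abs(observed.get(k, 0.0) - reference_distribution.get(k, 0.0))
        for k in keys
    )
```

For instance, a column containing only ‘Male’ and ‘female’ scores 1.0 against the reference set `{"male", "female"}`, which would suggest a mapping to ‘PersonGenderCode’. In practice a threshold on either score would have to be chosen per registry, and a high score is evidence for a linkage rather than proof of one.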