The unfortunately titled MarkTime's Metadata Wack Attack aims to be an introductory discussion about the subject of Metadata.
Good communication is an art form, not a science. And yet science provides cues about the makeup of good communication. I found an example of this conundrum when reading the introduction page for the government's metadata standards committee. The government’s premier experts on naming conventions define themselves in this way:
>>NCITS L8 (formerly X3L8), on Data Representation, is a technical committee of National
>>Committee on Information Technology Standards (NCITS) Accredited Standards Committee
>>X3, which is accredited by ANSI, the American National Standards Institute.
The description is technically accurate, and yet I found it nearly meaningless. The world's foremost group of scientists working on naming data named themselves 'NCITS L8'. The term NCITS L8 is not descriptive, even with the helpful information that the group was formerly called X3L8. It has a number at the end, implying it is a member of a series; I felt stupid for not also knowing NCITS L7. To put it bluntly, this is exactly the type of name expected from a government agency: cryptic and nearly impossible to understand. Thankfully, most people in the industry refer to NCITS L8 as the ‘Metadata Standards Committee’. The term ‘Metadata Standards Committee’ is descriptive, and because it is so easy to understand, it is also easier to remember.
Here is the issue for an organization: putting too much effort into naming standards is not only a waste of time, it can also produce comical, bureaucratic results. However, a lack of attention to naming standards may have even more dire consequences: incongruent names, outdated or inaccurate information, and, as it grows larger, a data system that becomes unmanageable.
The solution must always be a compromise: an organization must define standards and enforce a reasonable set of rules. Beyond these rules, a skilled communication artisan can distinguish himself or herself. It is an art, and with practice and attention, information architects can noticeably improve their skills, to the benefit of the group. An organization that fosters an environment of good communication achieves better communications.
The purpose of this article is to outline suggestions for better metadata naming standards.
The Amiga community includes some organizations that have data collection needs (customers, users, sales, etc.). Some thought about standards at the time the data is collected (right now, while all these groups are small) will result in better-formed data: data that is easier to exchange and whose meaning is well known. Collecting data without any specific reasoning, by contrast, will become an unmanageable expense over time.
Thankfully, if a company becomes that successful, they can afford to pay for their lack of preparation, by hiring an expert, so give me a call.
Another purpose of this article is just the usual rant, and Amigans are geeks who love this sort of thing.
"Humans are aware of anything that exists in the natural world through its properties. Data represents the properties of these things."
--from the introduction to: ISO/IEC 11179-1:1999(E)
The international standards community, primarily through ISO (the International Organization for Standardization) and IEC (the International Electrotechnical Commission), is developing international standards for data representation.
In this way, information may be exchanged globally, with less confusion resulting from language barriers, cultural differences, differences between industries, or data that adheres to undocumented or nonexistent standards.
The document 11179 is a work in progress that has been in development for over a decade. It represents some of the best thought in defining metadata standards.
In addition, at this time, large software corporations such as Microsoft and Oracle publish their own standards, which are commonly followed.
The purposes of these standards are:
- allow data definitions to support the needs of multiple users, departments, companies, and countries
- reduce redundancy of definitions
- recognize standard data, for example, addresses and identifiers
- give uniform guidance for the development and description of data
- allow data to be exchanged with data modeling software
Large corporations already deal with standards in some areas. Interfaces to shipping systems require address standardization. Government regulatory requirements often impose a need for standard data.
An organization will benefit by defining standards across its organization, not only to expedite data exchanges between organizations, companies and governments, but also to ease exchanges between employees and departments. A future employee will more easily understand the work of a current employee, for example, when standards are defined and applied.
The metadata standards committee gives some recommendations for defining a data element. Each definition should:
- be unique
- be stated in the singular
- state what the concept is, not only what it is not
- be stated in a descriptive way
- use only abbreviations that are commonly understood
- be expressed without relying on the definition of another data element
I recommend reading ISO/IEC 11179-4 for further discussion of these topics. But briefly, I will discuss the concept of defining data in the singular.
It is especially interesting because Oracle Corporation recommends that table names always be stated in the plural.
Who is correct?
There are good arguments for both sides. In Oracle8i: The Complete Reference (3rd edition), Koch and Loney argue in favor of the singular naming convention. It is good English. They use the examples: phone book, restaurant list, address book. We don't say, for example, ‘addresses book’. In English we most commonly use the singular.
However, if we were to label a container, for example a jar of pickles, we would write the word ‘pickles’ on the label, not the word ‘pickle’. Seemingly, by writing a singular on a container, we are implying the existence of only one of that item, which may be an inaccurate suggestion.
It is an interesting discussion, but the important thing is just to be consistent, so that you never have to recall from memory whether the table name is singular or plural. Inconsistency is the cause of a lot of coding errors and inefficiency.
The number one rule of naming conventions is that you name something only once.
For example, Oracle recommends that if a department number is deptno in one table, it should not be defined as dept_num in another table.
This may seem like a simple rule, but it is extremely difficult to follow: it takes a great deal of attention, a good memory, and an understanding of the data. For example, information coming from different data feeds may arrive in different forms.
For this purpose, a repository matching the relationships between entities is recommended for larger organizations.
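As a rough illustration of what such a repository does, here is a minimal Python sketch. It is only a sketch under stated assumptions: the feed names (payroll_feed, hr_feed), the field name department_no, and the function canonicalize are all invented for this example; only deptno and dept_num come from the discussion above. The idea is that every name a feed uses is registered against exactly one canonical name, and anything unregistered is rejected rather than silently accepted.

```python
# Hypothetical repository: each (feed, field name) pair maps to exactly one
# canonical name. The feeds and the field "department_no" are invented.
CANONICAL_NAMES = {
    ("payroll_feed", "dept_num"): "deptno",
    ("hr_feed", "department_no"): "deptno",
    ("hr_feed", "deptno"): "deptno",
}


def canonicalize(feed, record):
    """Return a copy of record with every field renamed to its canonical name."""
    renamed = {}
    for field, value in record.items():
        key = (feed, field)
        if key not in CANONICAL_NAMES:
            # An unregistered name is a naming decision nobody has made yet.
            raise KeyError(f"{feed}.{field} is not registered in the repository")
        canonical = CANONICAL_NAMES[key]
        if canonical in renamed:
            # Two source fields claim to be the same thing in one record.
            raise ValueError(f"{feed} supplies {canonical} more than once")
        renamed[canonical] = value
    return renamed


if __name__ == "__main__":
    print(canonicalize("payroll_feed", {"dept_num": 10}))
```

In a real organization the mapping would live in a shared table rather than in source code, but the point is the same: the department number is named once, and every feed is translated to that name on the way in.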
This principle of not duplicating information can be extended to different levels of abstraction.
At the most basic level of semantics, we can define an object class. For example, in the column name CAR_TOTAL_AMOUNT, the word CAR is an object class; it defines the realm in which this description exists.
The recommendation is to have one, and only one, object class defined for each concept. A CAR should not be called an AUTO somewhere else. That leads to confusion: are columns named CAR_TOTAL_AMOUNT and AUTO_TOTAL_AMOUNT the same or different? The implication is that they are different, because we do not duplicate column names, but they also appear the same, because they have the same meaning in English.
In the example, CAR_TOTAL_AMOUNT, the word TOTAL is a property identifier; there should be one, and only one, word to represent this property identifier. Again, CAR_TOTAL_AMOUNT and CAR_COMPLETE_AMOUNT would be confusing.
The last term in CAR_TOTAL_AMOUNT is AMOUNT. It is called the representation term.
If you guessed you should have only one representation term for a concept, you are correct! AMOUNT should always be AMOUNT, not, for example, COST somewhere else.
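The one-word-per-concept rule can be checked mechanically. The following Python sketch assumes a small controlled vocabulary built only from the examples above (CAR not AUTO, TOTAL not COMPLETE, AMOUNT not COST); the names PREFERRED_TERMS, find_synonyms and suggest_name are hypothetical, and a real vocabulary would be far larger.

```python
# Hypothetical controlled vocabulary: preferred term -> rejected synonyms.
PREFERRED_TERMS = {
    "CAR": {"AUTO"},
    "TOTAL": {"COMPLETE"},
    "AMOUNT": {"COST"},
}

# Inverted lookup: rejected synonym -> the preferred term to use instead.
SYNONYM_TO_PREFERRED = {
    synonym: preferred
    for preferred, synonyms in PREFERRED_TERMS.items()
    for synonym in synonyms
}


def find_synonyms(column_name):
    """Return (found, preferred) pairs for each rejected synonym in the name."""
    terms = column_name.upper().split("_")
    return [(t, SYNONYM_TO_PREFERRED[t]) for t in terms if t in SYNONYM_TO_PREFERRED]


def suggest_name(column_name):
    """Rewrite the name using the preferred term for every known synonym."""
    terms = column_name.upper().split("_")
    return "_".join(SYNONYM_TO_PREFERRED.get(t, t) for t in terms)


if __name__ == "__main__":
    for name in ("CAR_TOTAL_AMOUNT", "AUTO_TOTAL_AMOUNT", "CAR_COMPLETE_COST"):
        print(name, find_synonyms(name), suggest_name(name))
```

A check like this only catches synonyms somebody has already written down, so it supplements the attention and judgment discussed earlier rather than replacing them.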
So, in summary, everything should be consistent, not just specific column names, but also concepts. The order of terms should be consistent too. You should not have AMOUNT_TOTAL_CAR in one place and CAR_TOTAL_AMOUNT-style names elsewhere, even if each individual term is used consistently, because then the order of terms is not consistent.
The standard order is object class, then property term, then representation.
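To make the ordering rule concrete, here is a small Python sketch of a name checker. It makes simplifying assumptions that are mine, not the standard's: every name has exactly three single-word terms separated by underscores, and the vocabularies contain only the CAR, TOTAL and AMOUNT examples used above. The function name check_term_order is hypothetical.

```python
# Hypothetical vocabularies, one per role, using only the article's examples.
OBJECT_CLASSES = {"CAR"}
PROPERTY_TERMS = {"TOTAL"}
REPRESENTATION_TERMS = {"AMOUNT"}

# The order described above: object class, property term, representation.
EXPECTED_ORDER = [
    ("object class", OBJECT_CLASSES),
    ("property term", PROPERTY_TERMS),
    ("representation term", REPRESENTATION_TERMS),
]


def check_term_order(column_name):
    """Return a list of problems; an empty list means the name conforms."""
    terms = column_name.upper().split("_")
    if len(terms) != len(EXPECTED_ORDER):
        return [f"expected {len(EXPECTED_ORDER)} terms, found {len(terms)}"]
    problems = []
    for position, (term, (role, vocabulary)) in enumerate(
        zip(terms, EXPECTED_ORDER), start=1
    ):
        if term not in vocabulary:
            problems.append(f"term {position} ({term}) is not a known {role}")
    return problems


if __name__ == "__main__":
    for name in ("CAR_TOTAL_AMOUNT", "AMOUNT_TOTAL_CAR"):
        print(name, check_term_order(name) or "ok")
```

Real column names often have multi-word terms or optional qualifiers, so a production check would need a richer grammar; the sketch only shows that term order is something a program can verify once the vocabulary for each role is written down.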
I hope the background information on metadata sparked some interest in metadata concepts. The subject of metadata is somewhat nebulous; I am not aware of a comprehensive reference on the subject.
Still, a well-organized data collection is obvious when you see it, and results in a more efficient operation. Databases are inherently complex, but part of the complexity typically seen in a database is due to design choices that, while appropriate for a computer, often give misleading signals to human beings. I have a whole collection of real-world, very specific choices for metadata collection, developed over years of experience, but that is for me to know, nanny nanny boo boo; this isn't a scientific article, after all.
Few people understand metadata, but a metadata expert must understand people. By constructing a database that is consistent, simple, concise, and non-redundant, with naming conventions that are intuitive, everyone from a programmer to an end user can more effectively work with the information store.
This article was written by Robert Dupuy, aka MarkTime. It is released to the public domain. References were given, as appropriate.