Manoj Kona's Oracle SOA Blog: Canonical Data Modeling

Canonical Data Modeling

What is a Canonical Data model?

Canonical Data Model is a design pattern used to communicate between different data formats. A form of Enterprise Application Integration, it is intended to reduce costs and standardize on agreed data definitions associated with integrating business systems (Wikipedia).

The Canonical Data Model (CDM) is the definition of all entities in your enterprise and the relationships between them, designed and implemented in an application-independent way. In some cases the canonical data model is also referred to as a common data model. To be successful, any enterprise integration initiative needs a canonical model approach to harmonize data in motion.

The typical large organization has hundreds of applications developed on incompatible data models. When one application defines customer as client, another as account, and a third as consumer, the ability to share information efficiently and integrate effectively is challenged. The Canonical Model addresses the issue of data consistency and business processes by relating the common concepts in the Canonical Model to the vocabularies used by applications.

All large organizations have many applications which were developed based on different data models and data definitions, yet they all need to share data. So the basic argument for a canonical data model is “let’s use a common language when exchanging information between systems.” In its simplest form, if system A wants to send data to system B, it translates the data into an agreed-upon standard syntax (canonical or common format) without regard to the data structures, syntax or protocol of system B. When B receives the data, it translates the canonical format into its internal structures and format and everything is good.
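As a small illustration, here is what a canonical Customer message exchanged between system A and system B might look like. The namespace, the element names, the sample values, and the native field names mentioned in the comments are all hypothetical, chosen only to show the idea.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Hypothetical canonical message: neither system A nor system B stores data this way. -->
<Customer xmlns="http://www.example.com/Customer/V1/Customer">
  <!-- System A (sender) maps its own "client number" field to ID;
       system B (receiver) maps ID to its own "account id" field. -->
  <ID>C-1001</ID>
  <!-- System A calls this the client name; system B calls it the account holder. -->
  <Name>Jane Smith</Name>
</Customer>
```

System A only needs to know how to produce this format and system B only needs to know how to read it; neither has to know anything about the other.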

What are the advantages of using a CDM?

Different organizational units have different terms for the same concept, leading to miscommunication and errors of interpretation. A CDM gives people and organizational units a single language for the entities of the company, which makes conversations between different teams and sub-organizational units easier.

Once a CDM is defined, direct mappings between applications are no longer needed. Instead, each application is mapped to the CDM, making this a one-time effort per application rather than an exercise repeated every time a new interface is defined.

In a traditional project, at each phase, you would have to bring all teams on board to assure the correct mapping between applications. Using a CDM, you map once per application. Lack of a canonical/common model is a huge obstacle to any integration or interoperability we require across multiple applications.
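The saving can be made concrete with a simple count. Assume, purely for illustration, that each of N applications exchanges data with every other application in both directions. Point-to-point integration then needs one mapping per ordered pair of applications, while a CDM needs one mapping into and one mapping out of the canonical format per application:

```latex
\[
M_{\text{point-to-point}} = N(N-1), \qquad M_{\text{CDM}} = 2N
\]
```

For ten applications that is 90 point-to-point mappings against 20 with a CDM. Real landscapes are rarely fully connected, so treat the first figure as an upper bound.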

How to design your CDM?

An effective Canonical Model will need to span both the logical level (so it can be understood by the business) as well as the physical level, as ultimately it will be used to generate XML messages with their mappings to backend systems.

Depending on the industry that you are in, you might already have some reference models (for telecommunications there is SID, etc.). Here are some ideas on how to achieve a good conceptual CDM:

List all major entities that business requires (Customer, Product, Order etc.)
Define how entities relate to each other
Specify the attributes of entities.
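
As a rough illustration of those three steps, the sketch below expresses a tiny conceptual model as an XML Schema. The namespace, the attributes chosen for each entity, and the decision to relate an Order to its Customer and Products by identifier are assumptions made for the example, not part of any reference model.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Hypothetical example: namespace and attribute names are illustrative only. -->
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
           xmlns="http://www.example.com/Sales/V1"
           targetNamespace="http://www.example.com/Sales/V1"
           elementFormDefault="qualified" attributeFormDefault="unqualified">

  <!-- Step 1: each major entity becomes a named type. -->
  <!-- Step 3: the attributes of an entity become its child elements. -->
  <xs:complexType name="CustomerType">
    <xs:sequence>
      <xs:element name="ID" type="xs:string"/>
      <xs:element name="Name" type="xs:string"/>
    </xs:sequence>
  </xs:complexType>

  <xs:complexType name="ProductType">
    <xs:sequence>
      <xs:element name="ID" type="xs:string"/>
      <xs:element name="Name" type="xs:string"/>
      <xs:element name="ListPrice" type="xs:decimal"/>
    </xs:sequence>
  </xs:complexType>

  <!-- Step 2: relationships. An order belongs to one customer and has
       one or more lines, each of which refers to one product. -->
  <xs:complexType name="OrderLineType">
    <xs:sequence>
      <xs:element name="ProductID" type="xs:string"/>
      <xs:element name="Quantity" type="xs:positiveInteger"/>
    </xs:sequence>
  </xs:complexType>

  <xs:complexType name="OrderType">
    <xs:sequence>
      <xs:element name="ID" type="xs:string"/>
      <xs:element name="CustomerID" type="xs:string"/>
      <xs:element name="OrderLine" type="OrderLineType" maxOccurs="unbounded"/>
    </xs:sequence>
  </xs:complexType>

  <xs:element name="Order" type="OrderType"/>
</xs:schema>
```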

One of the most important things you need to establish is your governance model around the CDM. Since the CDM will be used all over your EAI landscape, any change to it can have collateral impacts. The governance model must ensure that:

Newer versions are backward compatible wherever possible; otherwise all the applications tied to the CDM have to go through changes, testing, and production deployment.
Changes can be propagated in steps and do not require a big bang change.
Changes are documented and communicated to the owners of the applications that use the model.

How to implement your CDM?

There are different paths to implementing a CDM. Starting from the conceptual CDM, you will need to derive the physical model. Most modeling tools are able to generate the physical implementation for you. You should ensure that the physical model is simple to use and fit for purpose. Some things you should check are:

Is it partitionable (can I use parts of it without having to take the whole thing)?
Does it contain any abstract types or other constructs that mapping tools (XSLT graphical mappers, etc.) cannot handle?
Are the types defined correctly (are all my strings defined as varchar2(255) in my DB model and is this OK, etc)?

How to use your CDM?

One of the principles of SOA is that applications should consume services in a standard way, and if you use a CDM, the services should use it on their interfaces. As a default this principle is sound, but do not try to force all applications to conform to the same standard interface. You will find situations where the service is fit for purpose but its interface is too complex for a certain application. When that happens, consider implementing service abstraction so that the application can use your service through a different interface.

One of the most important things you also need to do is to educate the enterprise on the use of the CDM and its long-term advantages. People often fail to see the long term and select a more pragmatic approach to the problems they are facing without considering its impact down the road. You need to ensure that your stakeholders understand the benefits they will get from applying strong governance to the usage of the CDM, and that the technical people know how to make use of it. If you neglect either of these educational steps, you will meet resistance whenever you propose to use the CDM, and your CDM program will probably fail.

Challenges associated with Canonical Data Modeling?

Governance: Any enterprise integration initiative needs a canonical model approach to harmonize its data in motion. Most start out with simple XML Schemas, schema editors, and source code repositories. This approach inevitably hits scalability issues: it is hard to collaboratively evolve and version the Canonical Model; adoption meets resistance because these tools require extra manual work to create a consumable schema from the model; and there is no governance of the semantics used, and therefore no effective reuse.

Reuse: For the Canonical Model to be used as the starting point for integration, it is essential that everyone involved in planning, building, and maintaining SOA based applications can easily access and use the model. Developers building interfaces that map two data sources don't want to think about the controlled vocabulary they're supposed to use; they want an easy way to find what they need, be able to use it, and move on to the next requirement.

If the work happens in a context where industry standards like SID, ACORD, or FpML can provide a starting point, the effort tends to succeed. This is true even when those standards have not previously been adopted by that enterprise.
Where such industry standards don’t exist, it’s often much harder to get enough agreement among the interested parties to get the effort off the ground.
Another dimension to canonical data modeling is the need for a federated approach. In very large organizations with multiple business domains, it sometimes turns out that it’s not possible to establish one canonical model. Instead, multiple domain models are necessary, interlinked with one another and with an enterprise-level canonical model. These domains may reflect different external ecosystems, such as securities trading participants, as opposed to customers of a wholesale bank, or international banking exchange operation.

Schema Management, Definition & Naming Conventions

The eventual output of a Canonical Model is an XML Schema. Naming and managing XSD documents and their contents is very important, much like the management of a corporate data model. Good schema design guidelines ensure schemas are created consistently, making them usable, re-usable, understandable and maintainable. The canonical documents form an important asset for an organization that adopts SOA.

Canonical Models are often large and complex, designed to meet the integration needs of an enterprise, line of business, or project. The key to their widespread usage is the ability to easily and quickly build subset XML schemas from the model for use in payloads and WSDL.

Schema locations, Directory Structures

The schemas representing canonical models or enterprise business objects will be loaded into Oracle Metadata Services (MDS) at runtime. At design time, the artifacts will reside in a file-based MDS on the local drive. We will use the Product canonical modeling exercise as an example to depict the directory structures used.

We have borrowed some of the AIA directory structure guidelines for this.

· <COMPANYNAME>-defined Enterprise Business Objects will reside in:

<%MDS_HOME%>\apps\AIAMetaData\AIAComponents\EnterpriseObjectLibrary\<CompanyName>

· Common schema definitions like header and extension will be located in:

<%MDS_HOME%>\apps\AIAMetaData\AIAComponents\EnterpriseObjectLibrary\<COMPANYNAME>\Common\V1

· Domain-related schema definitions will reside in:

<%MDS_HOME%>\apps\AIAMetaData\AIAComponents\EnterpriseObjectLibrary\<COMPANYNAME>\EBO\<Domain>

For example:

<%MDS_HOME%>\apps\AIAMetaData\AIAComponents\EnterpriseObjectLibrary\<COMPANYNAME>\EBO\Product

<%MDS_HOME%>\apps\AIAMetaData\AIAComponents\EnterpriseObjectLibrary\<COMPANYNAME>\EBO\Pricing

Schema Elements & Complex Types Naming Conventions

Following are some conventions and guidelines that can be followed during the creation of schema definitions during canonical data modeling.

· Elements and attributes should use upper camel case (UCC), for example “ProductType”. Avoid hyphens, spaces, or other separators.

· Place emphasis on Readability. There is always a line to draw between document size and readability; wherever possible, favor readability.

· Try to avoid abbreviations and acronyms for element, attribute, and type names. Exceptions should be well known within your business area, for example ID (Identifier) and SaaS (Software as a Service).

· Suffix new types with 'Type', for example “ProductType” and “ItemType”.

· Enumerations should use names, not numbers, and the values should be in upper camel case.

· Names should not include the name of the containing structure; for example, CustomerName should be Name within the sub element Customer.

· Create complexTypes or simpleTypes for types that are likely to be re-used. If the structure exists only in one place, define it in-line.

· Avoid the use of mixed content.

· Only define root level elements if the element is capable of being the root element in an XML document.

· Set elementFormDefault="qualified" in the schema element of your schema. This makes namespace qualification in the resulting XML simpler, if more verbose.

· Unless an attribute is global, try not to qualify it. Set attributeFormDefault to “unqualified”.
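
The conventions above can be seen together in one small schema. This is only a sketch: the www.example.com namespace, the status values, and the Dimensions structure are invented for illustration.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Hypothetical schema illustrating the naming conventions. -->
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
           xmlns="http://www.example.com/Product/V1/ProductComplexTypes"
           targetNamespace="http://www.example.com/Product/V1/ProductComplexTypes"
           elementFormDefault="qualified" attributeFormDefault="unqualified">

  <!-- Re-usable simple type: 'Type' suffix; enumeration values are names
       in upper camel case, not numbers. -->
  <xs:simpleType name="ProductStatusType">
    <xs:restriction base="xs:string">
      <xs:enumeration value="Active"/>
      <xs:enumeration value="Discontinued"/>
      <xs:enumeration value="ComingSoon"/>
    </xs:restriction>
  </xs:simpleType>

  <!-- Re-usable complex type. Children are Name and Status,
       not ProductName and ProductStatus. -->
  <xs:complexType name="ProductType">
    <xs:sequence>
      <xs:element name="ID" type="xs:string"/>
      <xs:element name="Name" type="xs:string"/>
      <xs:element name="Status" type="ProductStatusType"/>
      <!-- Used in one place only, so it is defined in-line. -->
      <xs:element name="Dimensions" minOccurs="0">
        <xs:complexType>
          <xs:sequence>
            <xs:element name="Length" type="xs:decimal"/>
            <xs:element name="Width" type="xs:decimal"/>
            <xs:element name="Height" type="xs:decimal"/>
          </xs:sequence>
          <xs:attribute name="Unit" type="xs:string" use="required"/>
        </xs:complexType>
      </xs:element>
    </xs:sequence>
  </xs:complexType>

  <!-- The only root-level element: Product can be the root of a document. -->
  <xs:element name="Product" type="ProductType"/>
</xs:schema>
```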

Breaking schemas into multiple files can have several advantages. You can create re-usable definitions that can be used across several projects. Smaller files also make definitions easier to read and version, because the schema is broken down into units that are simpler to manage.

Namespace Naming Standards

The purpose of a namespace is to give a unique name to an element, type or attribute. Namespaces are also a mechanism for breaking up your schemas. The XSD standard allows you to structure your schemas by breaking them into multiple files. These child schemas can then be included in (or, when they use a different namespace, imported into) a parent schema.

Placing the targetNamespace attribute at the top of your XSD schema means that all entities defined in it are part of this namespace.

· The following namespace standard is advisable:

http://www.<CompanyName>.com/<Domain>/V<VersionNumber>/<Rest Of the Path depending on complex type or element or ebo or ebm etc.>

For example:

http://www.<CompanyName>.com/Customer/V1/CustomerComplexTypes

· Define a targetNamespace in your schema. This better identifies your schema, and can make things easier to modularize and re-use. The value of targetNamespace is just a unique identifier; typically, companies use their URL followed by something to qualify it. In principle, the namespace has no meaning, but some companies use the URL where the schema is stored as the targetNamespace, and some XML parsers will use this as a hint path for the schema.

For example:

targetNamespace="http://www.<CompanyName>.com/Product/V1/ProductComplexTypes"

· Always specify a target namespace. Also, when specifying a default namespace, we recommend setting it to the same value as the target namespace. The advantage of this approach is that you only need to prefix elements, types and attributes that are defined outside the schema.

<?xml version="1.0" encoding="UTF-8"?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
           xmlns="http://www.<CompanyName>.com/Customer/V1/CustomerComplexTypes"
           targetNamespace="http://www.<CompanyName>.com/Customer/V1/CustomerComplexTypes"
           xmlns:hdr="http://www.<CompanyName>.com/Common/V1/Header"
           xmlns:ext="http://www.<CompanyName>.com/Common/V1/Extension"
           elementFormDefault="qualified" attributeFormDefault="unqualified">
  <!-- type and element definitions go here -->
</xs:schema>

· Use consistent namespace aliases:

· xml: defined in the XML standard

· xmlns: defined in the Namespaces in XML standard

· xs: http://www.w3.org/2001/XMLSchema

· xsi: http://www.w3.org/2001/XMLSchema-instance

Namespace Scheme

A namespace is a collection of names for elements, attributes and types that serve to uniquely distinguish the collection from the collection of names in another namespace. As defined in the W3C XML specification, “XML namespaces provide a simple method for qualifying element and attribute names used in Extensible Markup Language documents by associating them with namespaces identified by URI references.”

This enables interoperability and consistency in the XML artefacts for the library of reusable types and schema modules. The UN/CEFACT reusability methodology maximizes the reuse of defined named types, a combination of locally and globally declared elements, and attributes.

Declaring Namespace

Best practice dictates that every schema module have its own namespace with the exception that internal schema modules will be in the same namespace as the root schema.

Every UN/CEFACT defined or imported schema module MUST have a namespace declared, using the xsd:targetNamespace attribute.

Conventions and Recommendations

This section covers conventions and recommendations when designing your schemas.

When to Use Elements or Attributes

There is often some confusion over when to use an element or an attribute. Some people say that elements describe data and attributes describe the metadata; another way to look at it is that attributes are used for small pieces of data such as order IDs, but really it is personal taste that dictates when to use an attribute. Generally, it is best to use a child element if the information feels like data. Some of the problems with using attributes are:

· Attributes cannot contain multiple values (child elements can).

· Attributes are not easily expandable (to incorporate future changes to the schema).

· Attributes cannot describe structures (child elements can).

· Attributes can only hold simple types such as integer or string.

If you use attributes as containers for data, you end up with documents that are difficult to read and maintain. Try to use elements to describe data.
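
A short, hypothetical order document shows the difference in practice. The element names and values are invented for the example.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Hypothetical order: a small identifying value is kept as an attribute,
     everything that feels like data is a child element. -->
<Order ID="ORD1001">
  <Customer>
    <Name>Jane Smith</Name>
  </Customer>
  <!-- Child elements can repeat and can carry structure. -->
  <OrderLine>
    <ProductID>P1</ProductID>
    <Quantity>2</Quantity>
  </OrderLine>
  <OrderLine>
    <ProductID>P7</ProductID>
    <Quantity>1</Quantity>
  </OrderLine>
  <!-- An attribute-only design would need something like
       Product1="P1" Quantity1="2" Product2="P7" Quantity2="1",
       with new attribute names for every additional line. -->
</Order>
```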

Mixed Element Content

Mixed content is something you should try to avoid as much as possible. It is used heavily on the web in the form of XHTML, but it has many limitations. It is difficult to parse and it can lead to unforeseen complexity in the resulting data. XML data binding tools also have limitations in this area, making such documents difficult to manipulate.
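
To make the distinction concrete, the hypothetical fragment below shows the same product note twice: first as mixed content, then as element-only content. All element names and values are invented for the example.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<ProductNotes>
  <!-- Mixed content: character data and child elements are interleaved,
       so a consumer has to handle the text nodes between the elements. -->
  <MixedNote>Ships in <Colour>Red</Colour> from <AvailableFrom>2024-01-15</AvailableFrom> onwards.</MixedNote>

  <!-- Element-only content: every piece of data has its own named element. -->
  <Note>
    <Colour>Red</Colour>
    <AvailableFrom>2024-01-15</AvailableFrom>
  </Note>
</ProductNotes>
```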

For the Product canonical, the schema is broken out into four files.

CommonTypes: This could contain all your basic types: AddressType, PriceType, PaymentMethodType, and so forth.

ProductTypes: This could contain all your definitions for your products.

OrderTypes: This could contain all your definitions for orders.

Main Objects (Product, Marketing Schedule): This would pull all the sub schemas together into a single schema, and define your main element(s).

This all works fine without namespaces, but if different teams start working on different files, you have the possibility of name clashes, and it would not always be obvious where a definition had come from. The solution is to place the definitions for each schema file within a distinct namespace. We did this by adding the targetNamespace attribute to the schema element in each XSD file, following the namespace standard described earlier.
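
A minimal sketch of that arrangement with two of the files follows. The file names, the www.example.com namespaces, the cmn prefix, and the contents of the types are assumptions made for illustration. First, a common types file with its own target namespace:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- CommonTypes.xsd (hypothetical file name) -->
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
           xmlns="http://www.example.com/Common/V1/CommonTypes"
           targetNamespace="http://www.example.com/Common/V1/CommonTypes"
           elementFormDefault="qualified" attributeFormDefault="unqualified">
  <!-- A decimal amount that carries its currency as an attribute. -->
  <xs:complexType name="PriceType">
    <xs:simpleContent>
      <xs:extension base="xs:decimal">
        <xs:attribute name="Currency" type="xs:string" use="required"/>
      </xs:extension>
    </xs:simpleContent>
  </xs:complexType>
</xs:schema>
```

Then a product types file in a different namespace, which imports the common file and refers to its type through a prefix:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- ProductTypes.xsd (hypothetical file name) -->
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
           xmlns="http://www.example.com/Product/V1/ProductTypes"
           xmlns:cmn="http://www.example.com/Common/V1/CommonTypes"
           targetNamespace="http://www.example.com/Product/V1/ProductTypes"
           elementFormDefault="qualified" attributeFormDefault="unqualified">
  <!-- xs:import is used because the imported schema has a different namespace. -->
  <xs:import namespace="http://www.example.com/Common/V1/CommonTypes"
             schemaLocation="CommonTypes.xsd"/>
  <xs:complexType name="ProductType">
    <xs:sequence>
      <xs:element name="Name" type="xs:string"/>
      <!-- The cmn prefix makes it obvious where PriceType comes from. -->
      <xs:element name="ListPrice" type="cmn:PriceType"/>
    </xs:sequence>
  </xs:complexType>
</xs:schema>
```

Because each file owns its namespace, two teams can both define a type with the same local name without clashing.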

Versioning Standards

Try to think about versioning early in schema design. If it is important for a new version of a schema to be backward compatible, all additions to the schema should be optional. If it is important that existing products can read newer versions of a given document, consider adding any and anyAttribute entries to the end of your definitions.
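
Both ideas can be sketched in a single type definition. The LoyaltyTier element and the www.example.com namespace are hypothetical. The wildcards are restricted to other namespaces here so that they cannot be confused with the optional element that precedes them, which XML Schema 1.0 would otherwise reject as ambiguous.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Hypothetical example of a backward compatible, extensible type. -->
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
           xmlns="http://www.example.com/Customer/V1/CustomerComplexTypes"
           targetNamespace="http://www.example.com/Customer/V1/CustomerComplexTypes"
           elementFormDefault="qualified" attributeFormDefault="unqualified">
  <xs:complexType name="CustomerType">
    <xs:sequence>
      <xs:element name="ID" type="xs:string"/>
      <xs:element name="Name" type="xs:string"/>
      <!-- Added in a later revision: optional, so documents written
           against the earlier revision remain valid. -->
      <xs:element name="LoyaltyTier" type="xs:string" minOccurs="0"/>
      <!-- Extension point at the end: elements from other namespaces
           are tolerated rather than rejected. -->
      <xs:any namespace="##other" processContents="lax"
              minOccurs="0" maxOccurs="unbounded"/>
    </xs:sequence>
    <!-- Same idea for attributes. -->
    <xs:anyAttribute namespace="##other" processContents="lax"/>
  </xs:complexType>
</xs:schema>
```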

References

http://www.liquid-technologies.com/Tutorials/XmlSchemas/XsdTutorial_01.aspx

http://www.unece.org/fileadmin/DAM/cefact/xml/XML-Naming-and-Design-Rules-V2.0.pdf

http://www.codeguru.com/java/print.php/c13529/XSD-Tutorial-XML-Schemas-For-Beginners.htm

http://www.w3.org/TR/xmlschema-guide2versioning/