Canonical Data Modeling
What is a Canonical Data Model?
A Canonical Data Model is a design
pattern used to communicate between different data formats. A form of
Enterprise Application Integration, it is intended to reduce costs and
standardize on agreed data definitions associated with integrating business
systems (Wikipedia).
The Canonical Data Model (CDM) is
the definition of all entities in your enterprise and the relationships
between them, designed and implemented in an application-independent way. In some cases
the canonical data model is also referred to as a common data model. Any Enterprise Integration initiative needs a
canonical model approach to harmonize data in motion if it is to succeed.
The typical large
organization has hundreds of applications developed on incompatible data
models. When one application defines customer as client, another as
account, and a third as consumer, the ability to share information efficiently
and integrate effectively is challenged. The Canonical Model addresses
the issue of data consistency and business processes by relating the common concepts
in the Canonical Model to the vocabularies used by applications.
All large organizations have many
applications which were developed based on different data models and data
definitions, yet they all need to share data. So the basic argument for a canonical
data model is “let’s use a common language when exchanging information between
systems.” In its simplest form, if system A wants to send data to system B, it
translates the data into an agreed-upon standard syntax (canonical or common
format) without regard to the data structures, syntax or protocol of system
B. When B receives the data, it
translates the canonical format into its internal structures and format and
everything is good.
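The translation step on system A's side could be sketched as an XSLT stylesheet. This is only an illustration: the source element names (client, fullName), the id attribute, and the namespace http://www.example.com/Customer/V1 are assumptions made for the example, not part of any real system.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:cdm="http://www.example.com/Customer/V1">
  <xsl:output method="xml" indent="yes"/>

  <!-- System A calls the concept "client"; the canonical model calls it "Customer".
       Both vocabularies are hypothetical. -->
  <xsl:template match="/client">
    <cdm:Customer>
      <!-- A's id attribute becomes the canonical ID element -->
      <cdm:ID><xsl:value-of select="@id"/></cdm:ID>
      <!-- A's fullName element becomes the canonical Name element -->
      <cdm:Name><xsl:value-of select="fullName"/></cdm:Name>
    </cdm:Customer>
  </xsl:template>
</xsl:stylesheet>
```

Given an input such as a client element with id 42 and a fullName child, the stylesheet emits a canonical Customer with ID and Name children. System B would hold a similar mapping from the canonical Customer to its own structures, and neither system needs to know the other's format.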
What are the advantages of using a CDM?
Different organizational units
have different terms for the same concept, leading to miscommunication and
errors of interpretation. A CDM provides a single language in which people and
organizational units can refer to the entities of the company, facilitating
conversations between different teams and sub-organizational units.
Once a CDM is defined, direct mappings
between applications are no longer needed. Instead, each application is mapped
to the CDM, making this a one-time effort per application instead of
an exercise repeated every time a new interface is defined.
In a traditional project, at each phase, you would have
to bring all teams on board to ensure the correct mapping between applications.
Using a CDM, you map once per application. Lack of a
canonical/common model is a huge obstacle to any integration or
interoperability we require across multiple applications.
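As a rough illustration of the saving, assume n applications that each exchange data with every other application in both directions. This is a worst case chosen for the example, not a figure from any real landscape. Point-to-point integration then needs one mapping per ordered pair of applications, while a CDM needs one mapping into and one out of the canonical format per application:

```latex
\[
M_{\mathrm{point\text{-}to\text{-}point}} = n(n-1), \qquad M_{\mathrm{CDM}} = 2n
\]
\[
n = 10:\quad M_{\mathrm{point\text{-}to\text{-}point}} = 10 \cdot 9 = 90, \qquad M_{\mathrm{CDM}} = 2 \cdot 10 = 20
\]
```

Adding an eleventh application would add 20 point-to-point mappings under the same assumption, but only 2 mappings against the CDM.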
How to design your CDM?
An effective
Canonical Model will need to span both the logical level (so it can be
understood by the business) as well as the physical level, as ultimately it
will be used to generate XML messages with their mappings to backend systems.
Depending on the
industry you are in, you might already have reference models available (for
telecommunications there is SID, for example). Here are some ideas on how to achieve a
good conceptual CDM:
- List all major entities that the business
requires (Customer, Product, Order, etc.)
- Define how entities relate to each other
- Specify the attributes of entities.
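The three steps above could eventually be rendered as an XML Schema along the following lines. This is a minimal sketch: the namespace http://www.example.com/Order/V1, the attribute names and the cardinalities are assumptions chosen for illustration.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
           xmlns="http://www.example.com/Order/V1"
           targetNamespace="http://www.example.com/Order/V1"
           elementFormDefault="qualified"
           attributeFormDefault="unqualified">

  <!-- Entity: Customer, with its attributes -->
  <xs:complexType name="CustomerType">
    <xs:sequence>
      <xs:element name="ID" type="xs:string"/>
      <xs:element name="Name" type="xs:string"/>
    </xs:sequence>
  </xs:complexType>

  <!-- Entity: Product, with its attributes -->
  <xs:complexType name="ProductType">
    <xs:sequence>
      <xs:element name="ID" type="xs:string"/>
      <xs:element name="Description" type="xs:string"/>
    </xs:sequence>
  </xs:complexType>

  <!-- Entity: Order. Relationships: exactly one Customer, one or more Products -->
  <xs:complexType name="OrderType">
    <xs:sequence>
      <xs:element name="ID" type="xs:string"/>
      <xs:element name="Customer" type="CustomerType"/>
      <xs:element name="Product" type="ProductType" maxOccurs="unbounded"/>
    </xs:sequence>
  </xs:complexType>

  <!-- Order is the only element that can be the root of a document -->
  <xs:element name="Order" type="OrderType"/>
</xs:schema>
```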
One of the most
important things you need to establish is your governance model around the CDM.
Since the CDM will be used all over your EAI landscape, any change to it
can have collateral impacts. The governance model must ensure
that:
- Newer versions are backward
compatible wherever possible; otherwise all the applications tied to the CDM have to go through
changes, testing and production deployment.
- Changes can be propagated in steps and do
not require a big bang change.
- Changes are documented and communicated
to the owners of the applications that use the model.
How to implement your CDM?
There are different
paths to implementing a CDM. Starting from the conceptual model, you will need
to derive the physical model. Most modeling tools are able
to generate the physical implementation from it. You should ensure that the
physical model is simple to use and fit for purpose. Some things you should
check are:
- Is it partitionable (can I use parts of
it without having to take the whole thing)?
- Does it contain any abstract types or other
constructs that will prevent the use of mapping tools (XSLT graphical
mapping, etc.)?
- Are the types defined correctly (for example, are all my strings defined as varchar2(255) in my DB model, and is this OK)?
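The same type check applies at the XSD level. As a hedged sketch, a physical model could constrain values explicitly instead of leaving every field as an unbounded string. The type names, the namespace and the limits below (80 characters, 9999 units) are hypothetical values chosen for the example.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
           xmlns="http://www.example.com/Common/V1/CommonTypes"
           targetNamespace="http://www.example.com/Common/V1/CommonTypes"
           elementFormDefault="qualified"
           attributeFormDefault="unqualified">

  <!-- A bounded string: rejects empty names and names longer than 80 characters -->
  <xs:simpleType name="ProductNameType">
    <xs:restriction base="xs:string">
      <xs:minLength value="1"/>
      <xs:maxLength value="80"/>
    </xs:restriction>
  </xs:simpleType>

  <!-- A bounded number: whole quantities from 1 to 9999 -->
  <xs:simpleType name="QuantityType">
    <xs:restriction base="xs:positiveInteger">
      <xs:maxInclusive value="9999"/>
    </xs:restriction>
  </xs:simpleType>
</xs:schema>
```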
How to use your CDM?
One of the principles that SOA
defines is that applications should consume services in a standard way,
and if you use a CDM, the services should use it in their interfaces. By
default, this principle is sound, but do not try to force all applications to
conform to the same standard interface. You will find situations where, although
the service is fit for purpose, the interface is too complex for a certain
application. When you find those situations, you should consider
implementing service abstraction, so that the
application can use your service through a different interface.
One of the most important things you also need to do is to educate the enterprise on the use of the CDM and its long-term advantages. People often fail to see the long term and select a more pragmatic approach to solve the problems they are facing without considering its impact down the road. You need to ensure that your stakeholders understand the benefits they will get by applying strong governance to the usage of the CDM, and that the technical people know how to make use of it. If you neglect either of these educational steps, you will find resistance whenever you propose to use the CDM, and your CDM program will probably fail.
Challenges associated with Canonical Data Modeling?
Governance: Any Enterprise Integration initiative will need a canonical
model approach to harmonize its data in motion if it is to succeed. Most
start out with simple XML Schemas, schema editing tools and source code
repositories. This approach inevitably hits scalability issues: it is
hard to collaboratively evolve and version the Canonical Model; adoption of the
model meets resistance because these tools require extra manual work to
create a consumable schema from the model; and there is no governance of the
semantics used and therefore no effective reuse.
Reuse: For the Canonical Model to be used as the starting point
for integration, it is essential that everyone involved in planning, building,
and maintaining SOA based applications can easily access and use the
model. Developers building interfaces
that map two data sources don't want to think about the controlled vocabulary
they're supposed to use; they want an easy way to find what they need, be able
to use it, and move on to the next requirement.
- If the work happens in a context where industry standards like SID, ACORD, or FpML can provide a starting point, the effort tends to succeed. This is true even when those standards have not previously been adopted by that enterprise.
- Where such industry standards don’t exist, it’s often much harder to get enough agreement among the interested parties to get the effort off the ground.
- Another dimension to canonical data modeling is the need for a federated approach. In very large organizations with multiple business domains, it sometimes turns out that it’s not possible to establish one canonical model. Instead, multiple domain models are necessary, interlinked with one another and with an enterprise-level canonical model. These domains may reflect different external ecosystems, such as securities trading participants, as opposed to customers of a wholesale bank, or international banking exchange operation.
Schema Management, Definition & Naming Conventions
The eventual output of a Canonical
Model is an XML Schema. Naming and managing XSD
documents and their contents is very important, much like the management of a
corporate data model. Good schema design guidelines ensure schemas are created
consistently, making them usable, reusable, understandable and maintainable. The
canonical documents form an important asset for an organization that adopts
SOA.
Canonical Models are often
large and complex, designed to meet the integration needs of an enterprise,
line of business, or project. The key to their widespread usage is the
ability to easily and quickly build subset XML schemas from the model for use in
payloads and WSDL.
Schema locations, Directory Structures
The schemas representing canonical models or
enterprise business objects will be loaded into Oracle Metadata Services (MDS)
at runtime. At design time, the artifacts will reside in a file-based MDS on a
local drive. We will use the Product canonical modeling exercise as an example
to depict the directory structures used.
We have borrowed some of the AIA directory
structure guidelines for this.
· <COMPANYNAME> defined Enterprise Business Objects will reside in
<%MDS_HOME%>\apps\AIAMetaData\AIAComponents\EnterpriseObjectLibrary\<CompanyName>
· Common Schema definitions like header and extension will be located
<%MDS_HOME%>\apps\AIAMetaData\AIAComponents\EnterpriseObjectLibrary\<COMPANYNAME>\Common\V1
· Domain related schema definitions will reside in:
<%MDS_HOME%>\apps\AIAMetaData\AIAComponents\EnterpriseObjectLibrary\<COMPANYNAME>\EBO\<Domain>
For example:
<%MDS_HOME%>\apps\AIAMetaData\AIAComponents\EnterpriseObjectLibrary\<COMPANYNAME>\EBO\Product
<%MDS_HOME%>\apps\AIAMetaData\AIAComponents\EnterpriseObjectLibrary\<COMPANYNAME>\EBO\Pricing
The following conventions and guidelines
can be applied when creating schema definitions during canonical
data modeling.
· Elements and attributes should use upper camel case (UCC), for example
“ProductType”. Avoid hyphens, spaces, or other syntax.
· Place emphasis on readability. There is always a line to draw
between document size and readability; wherever possible, favor readability.
· Try to avoid abbreviations and acronyms for element, attribute,
and type names. Exceptions should be well known within your business area, for
example ID (Identifier) and SaaS (Software as a Service).
· Suffix new types with 'Type', for example “ProductType” and
“ItemType”.
· Enumerations should use names, not numbers, and the values should
be in upper camel case.
· Names should not include the name of the containing structure; for
example, CustomerName should be Name within the sub-element Customer.
· Create complexTypes or simpleTypes for types that are likely to be
re-used. If the structure exists only in one place, define it in-line.
· Avoid the use of mixed content.
· Only define root-level elements if the element is capable of being
the root element in an XML document.
· Set elementFormDefault="qualified" in the schema element
of your schema. This makes namespace qualification in the resulting XML
simpler, if more verbose.
· Unless an attribute is global, try not to qualify it.
Set attributeFormDefault="unqualified".
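Several of the conventions above can be seen together in a small schema. This is an illustrative sketch only: the namespace under www.example.com, the ProductStatusType values and the Dimensions structure are assumptions made up for the example.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
           xmlns="http://www.example.com/Product/V1/ProductComplexTypes"
           targetNamespace="http://www.example.com/Product/V1/ProductComplexTypes"
           elementFormDefault="qualified"
           attributeFormDefault="unqualified">

  <!-- Reusable type, suffixed with "Type"; enumeration values are names in upper camel case -->
  <xs:simpleType name="ProductStatusType">
    <xs:restriction base="xs:string">
      <xs:enumeration value="Active"/>
      <xs:enumeration value="Discontinued"/>
      <xs:enumeration value="ComingSoon"/>
    </xs:restriction>
  </xs:simpleType>

  <xs:complexType name="ProductType">
    <xs:sequence>
      <!-- "Name", not "ProductName": the containing structure already gives the context -->
      <xs:element name="ID" type="xs:string"/>
      <xs:element name="Name" type="xs:string"/>
      <xs:element name="Status" type="ProductStatusType"/>
      <!-- Used in one place only, so the structure is defined in-line -->
      <xs:element name="Dimensions">
        <xs:complexType>
          <xs:sequence>
            <xs:element name="Length" type="xs:decimal"/>
            <xs:element name="Width" type="xs:decimal"/>
          </xs:sequence>
        </xs:complexType>
      </xs:element>
    </xs:sequence>
  </xs:complexType>

  <!-- The only global element: Product can be the root of an XML document -->
  <xs:element name="Product" type="ProductType"/>
</xs:schema>
```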
Breaking schemas into
multiple files can have several advantages. You can create re-usable
definitions that can be used across several projects. Smaller files also make definitions easier
to read and version, as they break down the schema into smaller units that are
simpler to manage.
Namespace Naming Standards
The purpose of a namespace
is to give elements, types and attributes unique names. Namespaces are also a
mechanism for breaking up your schemas. The XSD standard allows you to structure
your XSD schemas by breaking them into multiple files. These child schemas can
then be included in a parent schema.
Placing the
targetNamespace attribute at the top of your XSD schema means that all entities
defined in it are part of this namespace.
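Including a child schema in a parent could look like the sketch below. The file name ProductSimpleTypes.xsd, the type ProductStatusType and the namespace are assumptions for illustration; the child file is assumed to declare the same targetNamespace as the parent, which xs:include requires (xs:import is the construct for pulling in a different namespace).

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Parent schema. The child file ProductSimpleTypes.xsd is assumed to exist alongside it,
     to use the same targetNamespace, and to define ProductStatusType. -->
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
           xmlns="http://www.example.com/Product/V1"
           targetNamespace="http://www.example.com/Product/V1"
           elementFormDefault="qualified"
           attributeFormDefault="unqualified">

  <!-- Makes the child's definitions part of this schema and this namespace -->
  <xs:include schemaLocation="ProductSimpleTypes.xsd"/>

  <xs:element name="Product">
    <xs:complexType>
      <xs:sequence>
        <xs:element name="Name" type="xs:string"/>
        <!-- Refers to a type defined in the included child schema -->
        <xs:element name="Status" type="ProductStatusType"/>
      </xs:sequence>
    </xs:complexType>
  </xs:element>
</xs:schema>
```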
· The following namespace standard is advisable:
http://www.<CompanyName>.com/<Domain>/V<VersionNumber>/<Rest of the path, depending on complex type or element or ebo or ebm etc.>
For example:
http://www.<CompanyName>.com/Customer/V1/CustomerComplexTypes
· Define a targetNamespace in your schema. This
better identifies your schema, and can make things easier to modularize and
re-use. The value of targetNamespace is just a unique
identifier; typically, companies use their URL followed by something to qualify
it. In principle, the namespace has no meaning, but some companies use
the URL where the schema is stored as the targetNamespace, because some XML
parsers will use it as a hint path for the schema.
For example:
targetNamespace="http://www.<CompanyName>.com/Product/V1/ProductComplexTypes"
· Always specify a target namespace. Also, when specifying a default
namespace, we recommend setting it to the same value as the target namespace. The advantage of
this approach is that you only prefix elements, types and attributes that are
defined externally to the schema.
<?xml version="1.0" encoding="UTF-8"?>
<!-- edited with XMLSpy v2012 sp1 (x64) (http://www.altova.com) by Manoj Kona (<COMPANYNAME>) -->
<xs:schema
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    xmlns="http://www.<CompanyName>.com/Customer/V1/CustomerComplexTypes"
    targetNamespace="http://www.<CompanyName>.com/Customer/V1/CustomerComplexTypes"
    xmlns:hdr="http://www.<CompanyName>.com/Common/V1/Header"
    xmlns:ext="http://www.<CompanyName>.com/Common/V1/Extension"
    elementFormDefault="qualified"
    attributeFormDefault="unqualified">
    <!-- type and element definitions go here -->
</xs:schema>
· Use consistent namespace aliases:
  · xml: defined in the XML standard
  · xmlns: defined in the Namespaces in XML standard
  · xs: http://www.w3.org/2001/XMLSchema
  · xsi: http://www.w3.org/2001/XMLSchema-instance
A namespace is a
collection of names for elements, attributes and types that serve to uniquely
distinguish the collection from the collection of names in another
namespace. As defined in the W3C Namespaces in XML specification, “XML namespaces
provide a simple method for qualifying element and attribute names used in
Extensible Markup Language documents by associating them with namespaces
identified by URI references.”
This enables
interoperability and consistency in the XML artifacts for the library of
reusable types and schema modules. The UN/CEFACT reusability methodology
maximizes the reuse of defined named types, a combination of locally and
globally declared elements, and attributes.
Best practice dictates
that every schema module have its own namespace with the exception that
internal schema modules will be in the same namespace as the root schema.
Every UN/CEFACT defined
or imported schema module MUST have a namespace declared, using the
xsd:targetNamespace attribute.
Conventions and Recommendations
This section covers
conventions and recommendations when designing your schemas.
There is often some
confusion over when to use an element or an attribute. Some people say that
elements describe data and attributes describe the metadata; another way to
look at it is that attributes are used for small pieces of data such as order
IDs, but really it is personal taste that dictates when to use an attribute.
Generally, it is best to use a child element if the information feels like
data. Some of the problems with using attributes are:
· Attributes cannot contain multiple values (child elements can).
· Attributes are not easily expandable (to incorporate future
changes to the schema).
· Attributes cannot describe structures (child elements can).
· Attributes can only hold simple types such as integer or string.
If you use attributes as
containers for data, you end up with documents that are difficult to read and
maintain. Try to use elements to describe data.
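The difference can be seen by writing the same order both ways. The element and attribute names below, and the Orders wrapper that holds the two variants in one document, are hypothetical.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<Orders>
  <!-- Attribute-based: compact, but each attribute holds a single simple value,
       so the list of items is squeezed into one delimited string -->
  <Order ID="1001" CustomerName="Jane Doe" Items="Widget,Gadget"/>

  <!-- Element-based: the identifier stays an attribute, while the data
       is carried by child elements that can repeat and grow structure -->
  <Order ID="1002">
    <Customer>
      <Name>Jane Doe</Name>
    </Customer>
    <Item>Widget</Item>
    <Item>Gadget</Item>
  </Order>
</Orders>
```

In the second form, adding a quantity to each Item or an address to Customer is a local change, whereas the first form would need new attributes or a new string convention.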
Mixed Element Content
Mixed content is
something you should try to avoid as much as possible. It is used heavily on
the web in the form of XHTML, but it has many limitations. It is difficult to
parse and it can lead to unforeseen complexity in the resulting data. XML data
binding tools also have limitations with mixed content, making such documents
difficult to manipulate.
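For readers unfamiliar with the term, the fragment below contrasts mixed content with an element-only alternative. All element names and values are made up for the illustration.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<Product>
  <!-- Mixed content: free text and child elements are interleaved,
       so a consumer has to parse the sentence to find the data -->
  <Description>A <Emphasis>lightweight</Emphasis> laptop with <Emphasis>16 GB</Emphasis> of memory.</Description>

  <!-- Element-only alternative: every value sits in its own named element -->
  <Specification>
    <Weight>Lightweight</Weight>
    <Memory>16 GB</Memory>
  </Specification>
</Product>
```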
For the Product canonical model, the schema is broken out into four files.
CommonTypes: This could contain all
your basic types: AddressType, PriceType, PaymentMethodType, and so forth.
ProductTypes: This could contain all
your definitions for your products.
OrderTypes: This could contain all
your definitions for orders.
Main Objects (Product, Marketing Schedule): This would pull all the sub-schemas together into a single
schema, and define your main element(s).
This all works fine
without namespaces, but if different teams start working on different files,
you have the possibility of name clashes, and it would not always be obvious
where a definition had come from. The solution is to place the definitions for
each schema file within a distinct namespace. We did this by adding the
targetNamespace attribute to the schema element of each XSD file.
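A main schema that pulls together files with distinct namespaces might look like the following sketch. The namespaces under www.example.com, the prefixes cmn and prd, the file names and ProductDetailsType are assumptions for illustration; CommonTypes.xsd and ProductTypes.xsd are assumed to declare the namespaces named in the import elements and to define the referenced types.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Main schema (hypothetical file name: Product.xsd) -->
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
           xmlns="http://www.example.com/Product/V1/Product"
           xmlns:cmn="http://www.example.com/Common/V1/CommonTypes"
           xmlns:prd="http://www.example.com/Product/V1/ProductTypes"
           targetNamespace="http://www.example.com/Product/V1/Product"
           elementFormDefault="qualified"
           attributeFormDefault="unqualified">

  <!-- Each imported file declares its own targetNamespace, so names cannot clash -->
  <xs:import namespace="http://www.example.com/Common/V1/CommonTypes"
             schemaLocation="CommonTypes.xsd"/>
  <xs:import namespace="http://www.example.com/Product/V1/ProductTypes"
             schemaLocation="ProductTypes.xsd"/>

  <xs:element name="Product">
    <xs:complexType>
      <xs:sequence>
        <!-- The prefix shows at a glance which file each definition comes from -->
        <xs:element name="Details" type="prd:ProductDetailsType"/>
        <xs:element name="ListPrice" type="cmn:PriceType"/>
      </xs:sequence>
    </xs:complexType>
  </xs:element>
</xs:schema>
```

Because the default namespace equals the target namespace here, only the externally defined types carry a prefix.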
Try to think about versioning early in schema
design. If it is important for a new version of a schema to be backward
compatible, all additions to the schema should be optional. If it is important
that existing products should be able to read newer versions of a given
document, consider adding any and anyAttribute entries to the end of your
definitions.
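Both versioning techniques can be sketched in one type definition. The namespace, the Category element and the choice of wildcard settings are assumptions for illustration, not a prescribed design.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
           xmlns="http://www.example.com/Product/V1"
           targetNamespace="http://www.example.com/Product/V1"
           elementFormDefault="qualified"
           attributeFormDefault="unqualified">

  <xs:complexType name="ProductType">
    <xs:sequence>
      <xs:element name="ID" type="xs:string"/>
      <xs:element name="Name" type="xs:string"/>
      <!-- Added in a later revision: optional, so documents written
           against the earlier revision remain valid -->
      <xs:element name="Category" type="xs:string" minOccurs="0"/>
      <!-- Extension point at the end of the definition: accepts any number of
           elements from other namespaces without validating them strictly -->
      <xs:any namespace="##other" processContents="lax"
              minOccurs="0" maxOccurs="unbounded"/>
    </xs:sequence>
    <!-- Accepts unknown attributes from other namespaces -->
    <xs:anyAttribute namespace="##other" processContents="lax"/>
  </xs:complexType>

  <xs:element name="Product" type="ProductType"/>
</xs:schema>
```

The wildcard uses namespace="##other" rather than "##any" so that it cannot match the optional Category element that precedes it, which XSD 1.0 would reject as ambiguous. The trade-off is that extensions read by older consumers have to live in a different namespace.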