Data virtualization is shaking up the integration market

Over the summer, Denodo announced an update to its software platform for integrating and providing real-time access to “heterogeneous, distributed, structured and unstructured data sources”. Version 8.0 makes hybrid and multicloud data integration possible through PaaS-style automation.

In particular, the vendor offers automated provisioning on AWS EC2 instances and a containerized version of its platform. The platform is also available on demand through the Google Cloud, AWS and Microsoft Azure marketplaces.

Optimizations and a new interface

On the data integration front, the vendor says it has improved its support for APIs and microservices, notably with the addition of GraphQL. Users can then query the platform’s virtual data model without writing a line of code. Denodo has also passed the 150-connector mark: the platform can now integrate data from Databricks Delta, Azure Synapse, Google BigQuery, Amazon S3, ADLS and Google Cloud Storage.
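To give an idea of what GraphQL access to the virtual model can look like, here is a minimal Python sketch. The endpoint URL, view name and field names are hypothetical placeholders, not taken from Denodo’s documentation.

```python
# Minimal sketch: querying a Denodo virtual view through GraphQL.
# The endpoint path, view name ("customer_sales") and fields are
# hypothetical placeholders, not taken from Denodo's documentation.
import requests

GRAPHQL_ENDPOINT = "https://denodo.example.com:9443/graphql"  # assumed URL

# The client only names the view and the fields it wants back;
# the platform resolves where the data actually lives.
query = """
{
  customer_sales(limit: 10) {
    customer_id
    region
    total_amount
  }
}
"""

response = requests.post(
    GRAPHQL_ENDPOINT,
    json={"query": query},
    auth=("user", "password"),  # placeholder credentials
    timeout=30,
)
response.raise_for_status()

for row in response.json()["data"]["customer_sales"]:
    print(row["customer_id"], row["region"], row["total_amount"])
```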

Denodo has also worked to reduce query response times. “In analytical environments, most queries involve combining one or more tables with one or more dimensions in order to then apply an aggregation calculation,” says the vendor’s documentation.

Denodo has therefore improved its query optimizer by introducing a new type of view called a “summary”. Summaries store common intermediate results that the optimizer uses to speed up queries. With this option, there is no need to create a dedicated view to cache a dataset: the optimizer automatically analyzes incoming queries to determine whether it can “use summary data”. The method also provides a form of data lineage for a view.
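The idea can be sketched as follows: the intermediate result is materialized once, and later analytical queries that keep targeting the base views are rewritten against it. This is a minimal illustration assuming an ODBC connection (hypothetical DSN) and a CREATE SUMMARY statement along these lines; the exact VQL syntax may differ, and Denodo’s VQL reference remains authoritative.

```python
# Minimal sketch of the "summary" mechanism. The DSN and the exact
# CREATE SUMMARY syntax are assumptions; check Denodo's VQL reference.
import pyodbc

conn = pyodbc.connect("DSN=denodo_vdp;UID=user;PWD=secret")  # hypothetical DSN
cur = conn.cursor()

# Materialize the expensive join + aggregation once as a summary.
cur.execute("""
    CREATE SUMMARY sales_by_region AS
    SELECT r.region_name, SUM(s.amount) AS total_amount
    FROM sales s JOIN regions r ON s.region_id = r.region_id
    GROUP BY r.region_name
""")

# Later queries keep targeting the base views; the optimizer detects
# that the stored summary can answer them and rewrites transparently,
# so no dedicated cache view has to be created or queried by hand.
cur.execute("""
    SELECT r.region_name, SUM(s.amount)
    FROM sales s JOIN regions r ON s.region_id = r.region_id
    GROUP BY r.region_name
""")
print(cur.fetchall())
```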

The heart of the product is a metadata database

At the heart of the Denodo platform is what the vendor calls a “virtual database” (quotation marks included). Virtual DataPort plays this role. The component provides a unified view of the data held in connected systems and embeds the Apache Derby database, wrappers (in some cases data extraction scripts, among other things) and a cache module. An SQL-like language, Virtual Query Language (VQL), is used to query views over structured or unstructured data and to create joins, associations, groupings and so on.

However, Denodo recommends using the administration tool that abstracts this query engine. Access can be in real time or go through a caching system, and transactional data can be updated with INSERT/UPDATE/DELETE operations.
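Because Virtual DataPort is exposed through standard JDBC/ODBC drivers, client code can treat virtual views like ordinary tables. A minimal sketch, assuming a configured ODBC DSN and hypothetical view names:

```python
# Minimal sketch: reading and writing virtual views over ODBC. The DSN,
# credentials and view names ("crm_customers", "erp_orders") are
# hypothetical placeholders.
import pyodbc

conn = pyodbc.connect("DSN=denodo_vdp;UID=user;PWD=secret")
cur = conn.cursor()

# A join across two views that may live in entirely different systems
# (say, a CRM database and an ERP): the federation is invisible here.
cur.execute("""
    SELECT c.name, COUNT(o.order_id) AS n_orders
    FROM crm_customers c JOIN erp_orders o ON c.id = o.customer_id
    GROUP BY c.name
""")
for name, n_orders in cur.fetchall():
    print(name, n_orders)

# Writable views accept standard DML, which Virtual DataPort pushes
# down to the underlying source system.
cur.execute("UPDATE crm_customers SET segment = 'premium' WHERE id = ?", 42)
conn.commit()
```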

Denodo has attached an internal, machine-learning-based data catalog to Virtual DataPort to make data easier to find and consume. This catalog was previously backed by the Apache Derby database; it is now possible to store its metadata and user-created declarative categories in an external DBMS.

For analysis purposes, data can be visualized, explored and shared among data scientists through the addition of Apache Zeppelin notebooks, which combine queries, scripts, text and graphics.

In addition, the user interface, which is accessible via SSO (with SAML, Kerberos, OAuth and two-factor authentication modules), is meant to provide better integration between the various tools.

“We sit above the customer’s data sources, offering a single point of access that decouples the IT department’s technology transitions from the business teams using BI or data science tools. It is also possible to connect data sources to the applications that consume them,” said Olivier Tijou, Regional Vice President for France, Belux, French-speaking Switzerland, Russia and Africa at Denodo.

This is said to simplify the extract, transform and load steps for analysts (and make them less costly for the business), since the data no longer needs to be replicated before analysis.

“We add ETL because there is always some data to be replicated, but data virtualization is used for most modern data sources, often in conjunction with API management systems,” says the executive. “This access to most data sources enables our customers, particularly banks, to take back control of governance and compliance, which is very complicated for them today,” he adds.

The data integration market is changing

Precisely because this type of use is increasing among Denodo customers, analysts are adjusting their perception of the Palo Alto-based vendor.

“Denodo is known for data virtualization, and over the years it has grown into a data fabric supplier,” said Noel Yuhanna, vice president and senior analyst at Forrester Research, in a report. “Denodo’s data fabric solution integrates key data management components, including data integration, ingestion, transformation, governance and security,” he added.

In its 2020 Magic Quadrant for data integration tools, Gartner sees data virtualization as a building block that vendors must offer their customers. (Gartner also notes that version 8.0 of the Denodo platform needs to iron out bugs in deployment and in integration with certain analytics tools.)

So, data virtualization or data fabric? These overlapping concepts used by analyst firms can create confusion.

“Our positioning is not always well understood and can lead to confusion,” admits Olivier Tijou. “The Gartner Magic Quadrant defines a reference architecture in which data virtualization/federation sits on top of the data storage and processing blocks. The main aim is to federate data scattered across companies’ information systems, but it is also possible to enrich that data, for example to anonymize it.”

The cloud giants are aware of this phenomenon. Some, like AWS, offer query federation through their own services (Redshift Federated Query); others, like Google Cloud, target multicloud federation (BigQuery Omni). Microsoft Azure and Snowflake are following the same trend.

“The primary goal of the cloud providers is to host customer data on their own platforms. That is not our goal,” replies Olivier Tijou. “We saw the same trend with the data lake, and that phenomenon is now fading.”

The executive points to companies’ tendency to put all their data in a single data lake, generally built on the Hadoop framework. “Companies couldn’t put all of their data there, and it takes a long time. Somewhere along the way, the elephant has taken a hit,” he observes in colorful language, a nod to Hadoop’s elephant mascot.

“This doesn’t mean that the data lake concept or the Hadoop distributions are irrelevant, but we tried to do too much with them. In a way, the cloud giants and some vendors are trying to reproduce this phenomenon.”

“[The data federation offered by the cloud providers’ services] remains limited compared to what we know how to do. Our major customers have not picked just one winner. For example, a large account in the food industry uses the services of two of the giants and needs to pool data between them. The data virtualization approach has its place in this context,” he adds.

Ultimately, Denodo’s offering is more often compared with those of players like Talend, TIBCO or Informatica, which for their part are trying to combine virtualization with more traditional replication methods.

Denodo has 800 customers worldwide. In France, the vendor has won over LVMH, Suez, Sanofi and Rexel, as well as several banks and pharmaceutical laboratories. In Europe, the European Commission is probably its most notable reference.