“At the time of the merger, Cloudera and Hortonworks were embarking on a multi-faceted roadmap: moving on-premises customers to a common platform, plus some extremely aggressive projects in the public cloud: self-service, usage-based billing, auto-scaling, a managed version, and so on. In short, everything that didn’t exist in the traditional Cloudera world,” recalls Denis Fraval, EMEA Pre-Sales Director at Cloudera.
While porting Cloudera’s entire historical portfolio to cloud-native mode required “very extensive development work, taking into account the different building blocks and technical limitations of the cloud providers”, development of the Cloudera Data Platform in the public cloud began with the data warehouse and machine learning offerings. “These were the most popular offerings, but above all the most modern, and therefore the easiest to port to cloud-native mode,” explains the director.
Recently, Cloudera added three products to the Cloudera Data Platform (CDP) public cloud: Data Engineering, Operational Database (OpDB), and Data Visualization.
Cloudera Data Engineering: a building block that customers expect
Cloudera Data Engineering (CDE) is offered as a serverless service for CDP that lets users submit Apache Spark jobs, developed in Java, Scala or Python, to a cluster with automated scaling.
The solution relies on a partitioned environment, with a dedicated VPC, a CDE service and a Kubernetes cluster on which virtual clusters with auto-scaling capacity can be created. Spark jobs run on these virtual clusters. Kubernetes orchestration is coupled with YuniKorn, an open source universal resource scheduler developed by Cloudera’s teams. In this managed version, the vendor provides access to a monitoring service and a graphical interface for controlling and, if necessary, capping resource consumption and the number of instances.
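To make the auto-scaling behavior concrete, here is a minimal toy sketch (not Cloudera’s or YuniKorn’s actual code; the class, function names and thresholds are all invented for illustration) of the kind of decision an auto-scaling virtual cluster makes when Spark jobs queue up:

```python
# Toy sketch of virtual-cluster auto-scaling: all names and the
# tasks-per-executor ratio are illustrative assumptions.
from dataclasses import dataclass


@dataclass
class VirtualCluster:
    executors: int       # currently running executor pods
    max_executors: int   # hard cap, like the limit set in the CDE console


def scale_decision(cluster: VirtualCluster, pending_tasks: int,
                   tasks_per_executor: int = 4) -> int:
    """Return the executor count to target for the pending workload."""
    needed = -(-pending_tasks // tasks_per_executor)  # ceiling division
    # Never exceed the per-cluster cap, never go below zero.
    return max(0, min(needed, cluster.max_executors))


vc = VirtualCluster(executors=2, max_executors=10)
print(scale_decision(vc, pending_tasks=30))   # -> 8 (ceil(30/4))
print(scale_decision(vc, pending_tasks=100))  # -> 10 (capped)
```

The cap mirrors the role of the graphical quota controls mentioned above: the scheduler can grow a virtual cluster with the workload, but never past the limit the administrator set.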
Workflows and Spark pipelines can be provisioned, monitored and scheduled via Apache Airflow, or via API with equivalent building blocks. The user does not manage these components; everything is done through the CDP user interface. A data engineer can get more detail on a given job by clicking on it in the interface, then open tabs to view its logs and visualize its performance and resource usage.
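CDE exposes its job management through an API as well as the UI. The following is a hedged sketch, using only the standard library, of assembling a Spark job definition with a schedule attached; the field names and the cron-style `schedule` block are illustrative assumptions, not the documented CDE schema:

```python
# Hypothetical job-definition builder for a CDE-style jobs API.
# Field names below are assumptions for illustration only.
import json
from typing import List, Optional


def build_spark_job(name: str, application_file: str,
                    args: Optional[List[str]] = None,
                    schedule: Optional[str] = None) -> dict:
    """Assemble a Spark job definition; a cron string stands in for the
    Airflow-based scheduling the article describes."""
    job = {
        "name": name,
        "type": "spark",
        "spark": {"file": application_file, "args": args or []},
    }
    if schedule:
        job["schedule"] = {"enabled": True, "cronExpression": schedule}
    return job


payload = build_spark_job("daily-etl", "s3a://bucket/jobs/etl.py",
                          args=["--date", "2020-10-01"],
                          schedule="0 2 * * *")
print(json.dumps(payload, indent=2))
```

The point is the division of labor: the engineer describes the job and its cadence declaratively, and the managed service owns the Airflow and Kubernetes machinery that executes it.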
Without this managed service, an administrator would first have to configure the associated cloud provider’s VPC, orchestration, DNS, KMS, load balancing, database and object storage before, for example, anyone could view those logs.
“The data engineering part probably represents more than half of the use cases we have on Cloudera platforms,” says Denis Fraval.
The vendor is aware of this demand and intends to keep pace with Spark, which recently moved to version 3.x. “We have a much more frequent update cadence. We will also improve the multi-tenant aspect to isolate noisy neighbors, such as machine learning experiments that can disrupt ETL jobs in production,” explains the director.
This capability would also make it possible to keep existing pipelines on Spark 2.x for “use cases that require compatibility with certain third-party tools” while starting new pipelines on Spark 3.x. “The Cloudera platform is in constant communication with third-party tools. If it has to be compatible with all of them, it is better to segment the development of the various components this way,” recommends Denis Fraval.
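The segmentation Denis Fraval describes amounts to routing each pipeline to a virtual cluster running the right Spark version. A minimal sketch, in which the cluster names and the legacy-tool list are invented for the example:

```python
# Illustrative routing of pipelines between Spark 2.x and 3.x virtual
# clusters. Cluster names and the compatibility list are assumptions.
LEGACY_TOOLS = {"old-bi-connector", "vendor-x-driver"}  # hypothetical


def pick_virtual_cluster(pipeline_tools: set) -> str:
    """New pipelines default to Spark 3.x unless a third-party
    dependency still requires Spark 2.x compatibility."""
    if pipeline_tools & LEGACY_TOOLS:
        return "vc-spark-2.4"
    return "vc-spark-3.0"


print(pick_virtual_cluster({"old-bi-connector"}))  # -> vc-spark-2.4
print(pick_virtual_cluster({"modern-tool"}))       # -> vc-spark-3.0
```

Because each virtual cluster is isolated, legacy and new pipelines can coexist on the same CDE service without forcing a single Spark version on everyone.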
Operational Database: a managed version of HBase
Operational Database (OpDB) is a managed version of the Apache HBase NoSQL database. It too comes with auto-scaling capabilities and a provisioning mode in partitioned environments, and it lets data be ingested from a source via Kafka and Spark, or via NiFi. Alongside HBase sits Apache Phoenix, the massively parallel relational engine that supports OLTP processing. It acts as an interface layer, converting SQL queries into HBase scans and producing results accessible via JDBC. Phoenix is compatible with the Hive and Impala query engines.
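To illustrate what that SQL-to-scan conversion means, here is a conceptual toy (emphatically not Phoenix’s actual code; the function and the key format are invented) showing how a predicate on the primary-key column can be rewritten as an HBase row-key scan range:

```python
# Conceptual toy: mapping a SQL WHERE clause on the row key to an
# HBase (start_row, stop_row) scan range, which is the essence of
# what Phoenix does before executing a query.
def sql_pk_range(op: str, value: str):
    """HBase scans are lexicographic and stop_row is exclusive."""
    if op == "=":
        # Point lookup: [value, value + lowest byte) matches one row.
        return value, value + "\x00"
    if op == ">=":
        return value, None   # open-ended scan starting at value
    if op == "<":
        return None, value   # scan from table start up to value
    raise ValueError(f"unsupported operator: {op}")


print(sql_pk_range("=", "user#42"))  # ('user#42', 'user#42\x00')
```

Pushing the predicate down into the scan range is what lets Phoenix answer key-based OLTP queries without touching the rest of the table.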
The big difference from Cloudera’s existing HBase distributions is that the HFiles (key-value files) are stored in Amazon S3 or Microsoft ADLS Gen2 object storage rather than HDFS. Apache HDFS is still used to write the WALs (Write-Ahead Logs).
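That storage split can be sketched as a simple routing rule; the bucket and namenode names below are placeholders, not OpDB’s real paths:

```python
# Minimal sketch of the storage split described above: durable HFiles
# go to object storage, latency-sensitive WALs stay on HDFS.
def storage_uri(file_kind: str, name: str) -> str:
    if file_kind == "hfile":
        # Object storage (abfs:// for ADLS Gen2 would work the same way).
        return f"s3a://opdb-bucket/hbase/data/{name}"
    if file_kind == "wal":
        # WALs need fast appends, so they remain on HDFS.
        return f"hdfs://namenode/hbase/WALs/{name}"
    raise ValueError(f"unknown file kind: {file_kind}")


print(storage_uri("hfile", "table1/region1"))
print(storage_uri("wal", "regionserver1"))
```

The design trade-off is the usual one: object storage is cheap and elastic for the bulk data, while the write-ahead log keeps the low-latency append path it needs for durability.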
“Officially, OpDB was supposed to arrive later, but our engineering team accelerated delivery after getting good results,” says the pre-sales director happily. “We’ll be offering Cloudera DataFlow as a service, and certainly the Solr-based Cloudera Search part after that, and then we’ll have virtually the entire Cloudera portfolio in cloud-native mode,” he adds.
Cloudera is following in the footsteps of NoSQL database vendors like DataStax and MongoDB, which have already made the move to DBaaS. It is hard to imagine, however, that it will compete head-on with these specialists. OpDB is rather another important building block that Cloudera, like the other CDP Public Cloud products mentioned, wants to govern and secure with SDX.
With these managed versions, version upgrades are delegated to the vendor’s teams. By contrast, customers who have deployed Cloudera Data Platform as PaaS on cloud infrastructure currently have to manage the components above themselves.
“Now that CDP Public Cloud has been enriched with the expected components, we are seeing a very positive response,” says Denis Fraval. According to him, this managed version of CDE complements customers’ environments for experiments or one-off tasks that are easier to handle as a managed service.
From dedicated visualization to data mining
CDP Data Visualization joins these two offerings. “This is our innovation play for BI and data science,” says the director. This new building block takes over the data visualization solution from Arcadia Data, a vendor Cloudera acquired a year ago. Initially sold as a standalone product, its building blocks were reused to create CDP Data Visualization. “This enables visual exploitation of machine learning models, but also of SQL queries,” says Denis Fraval. “We have autonomous users who can leverage their machine learning models and don’t want to depend on a third-party solution for visual data exploration,” he explains.
This data visualization layer is available in the data warehouse and machine learning offerings. It is based on Arcadia’s Data Visualization Engine and Smart Acceleration layer. It enables charts and graphics to be built and shared via drag and drop, and the tool can recommend visualizations from natural language queries. “This solution is not intended to replace the BI tools of our partners Power BI, MicroStrategy or Qlik,” cautions Denis Fraval.
Cloudera facing rising stars… and cloud giants
Cloudera holds to an open source vision and is somewhat shaken by players like Snowflake and Databricks, but also by cloud providers like AWS, Google and Microsoft, which also aim to provide complete “platforms” for BI and data science.
“This validates Cloudera’s strategy, which has always been to provide a Swiss Army knife for data processing, from ingestion all the way to delivery to the end user,” says Denis Fraval confidently.
“Some of the players you mention have expertise in one phase of data management and have only just realized that their customers want to cover the entire cycle. They are trying to extend their functional scope by adding building blocks. We are relatively far ahead on that front: they are responding to their customers’ urgent needs,” estimates the Pre-Sales Director.
“What bothers us a bit is that some of them no longer contribute these new developments back to the open source communities. That’s a shame. What’s more, they create a technical dependency that prevents reversibility. That bothers us much more, and above all it bothers our customers. For Cloudera, the principle of big data is to manipulate data in open formats and with a multicloud approach,” argues Denis Fraval.