William Vambenepe leads the product management team responsible for Big Data services on Google Cloud Platform (BigQuery, Dataflow, etc.). William was previously an Architect at Oracle and, before that, a Distinguished Technologist at HP.
He holds an engineering degree from Ecole Centrale Paris, a Diploma in Computer Science from Cambridge University, and a Master of Science in Engineering Management from Stanford University.
What do you enjoy most about your role at Google Cloud Platform?
What I most enjoy is seeing customers bloom on the platform. Here’s an analogy: when my daughter went from elementary school to middle school, she quickly became much more mature and independent. It felt like she grew two years in a few weeks. I see a similar, almost instantaneous maturing happen with many customers on Google Cloud. They first come with an infrastructure-driven approach and a mindset of scarcity. Often their very first foray into the Cloud replicates on-prem patterns. But very quickly it clicks. Why would they need a long-lived shared Hadoop cluster when Dataproc can provision one from scratch in under 90 seconds, and they pay for it by the minute? “Everybody gets a pony,” as one Spotify engineer once said, describing their use of Hadoop on Google Cloud.

Similarly, we see many Data Warehouse customers who spent years strategizing about what data to keep and what to remove to make room in their Data Warehouse. Enter BigQuery, where storage costs at most $0.02 per GB per month, and $0.01 after 90 days. Suddenly there is no need to ever delete data, at least not for cost reasons. And they can spend their mental energy thinking about how to use the data, not how to maintain systems and optimize limited storage capacity.
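As a back-of-the-envelope illustration of those storage prices, here is a minimal Python sketch using the per-GB rates quoted above; the 50 TB dataset size and the function name are made up for the example:

```python
# Hypothetical sketch: monthly cost of simply keeping data in BigQuery,
# using the prices quoted above ($0.02/GB/month for storage, dropping to
# $0.01/GB/month once a table goes 90 days without being edited).
ACTIVE_PRICE_PER_GB = 0.02      # USD per GB per month
LONG_TERM_PRICE_PER_GB = 0.01   # USD per GB per month, after 90 days

def monthly_storage_cost(size_gb, days_since_last_edit):
    """Return the monthly storage cost in USD for one table."""
    if days_since_last_edit >= 90:
        return size_gb * LONG_TERM_PRICE_PER_GB
    return size_gb * ACTIVE_PRICE_PER_GB

# A made-up 50 TB dataset (51,200 GB):
print(monthly_storage_cost(51_200, days_since_last_edit=10))   # -> 1024.0
print(monthly_storage_cost(51_200, days_since_last_edit=120))  # -> 512.0
```

At roughly $1,000 a month for 50 TB (and half that once the data goes cold), it is easy to see why deleting data to save on storage stops being worth anyone’s time.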
When the mental switch happens, we see a whole new pattern, with much more ambition and much more experimentation (the cost of experimenting is negligible, both in terms of Cloud resources consumed and, more importantly, time spent on the investigation, because you can jump straight to the core of the task with no setup time). We see customers who had internalized that batch processing was the natural order of the universe move to stream processing (why wait for your results, when Dataflow makes stream execution as easy as batch?).
The next round of discussions with these customers is not about infrastructure or prices; it’s about them sharing what they’ve achieved and what they plan to do next, for example starting to use Google’s Machine Learning services. In the span of a few months, they transition from IT as a necessary burden to IT as a power tool.
What is the most effective GCP product for managing Big Data?
Our product portfolio is designed as a set of complementary and well-integrated products. Almost no real-world task requires just one product. In this context, a service is not “more effective” than another, in the same way that a shovel is not “more effective” than an umbrella. They do different things. But if I interpret “effective” to mean which one is the most uniquely effective (most innovative, most differentiated from the competition), then I’d say it’s Google Cloud Dataflow, our fully managed (“no-ops”) service for data processing pipelines.
It distinguishes itself in two respects. The first is what it can do (the functional aspect). Google Cloud Dataflow implements the groundbreaking Dataflow Model, which gives developers a powerful model for creating data processing pipelines that can run in either batch or stream mode. It also incorporates the management of event delivery delays, so that programmers only need to declare how they want events grouped, without having to manage state themselves to account for out-of-order and late-arriving events.
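To make the out-of-order point concrete, here is a minimal pure-Python sketch (not the Dataflow/Beam API; the window size, function name, and events are all invented for illustration) of grouping events into fixed event-time windows. Because each event is placed by the timestamp it carries, not by when it arrives, a late or out-of-order event still lands in the right group:

```python
from collections import defaultdict

# Hypothetical illustration of event-time windowing: events are grouped
# by the timestamp they carry, not by arrival order, so out-of-order
# delivery does not change the result.
WINDOW_SECONDS = 60  # fixed one-minute windows

def assign_windows(events):
    """Group (event_time_seconds, value) pairs into fixed windows keyed
    by the window's start time."""
    windows = defaultdict(list)
    for event_time, value in events:
        window_start = (event_time // WINDOW_SECONDS) * WINDOW_SECONDS
        windows[window_start].append(value)
    return dict(windows)

# Events arrive out of order (note the timestamps):
arrived = [(130, "c"), (10, "a"), (70, "b"), (55, "d")]
print(assign_windows(arrived))
# -> {120: ['c'], 0: ['a', 'd'], 60: ['b']}
```

The late event `(55, "d")` still ends up in the first window alongside `(10, "a")`; in the real Dataflow Model, watermarks and triggers then decide when each window's result is emitted, all without the programmer managing that state by hand.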
Note that Google open sourced its implementation of the Dataflow Model and contributed it to Apache as Apache Beam (the name comes from concatenating the “B” of “batch” with the “eam” of “stream”). So Google Cloud Dataflow runs Beam pipelines, but the same pipelines can also run on other Apache engines such as Apache Spark and Apache Flink.
So why would you run these pipelines on Google Cloud Dataflow? That’s where the second key aspect comes in: the operational aspect. By that I mean that running pipelines on Google Cloud Dataflow frees the user from any operational concern. All they have to do is submit the pipeline they wrote. Period. No need to deploy anything, scale it, patch it, or guess the needed capacity. Dataflow automatically provisions the needed resources and auto-scales so that the pipeline execution is performant without costing any more than it needs to.
How can Google help businesses with digital transformation?
Google can help businesses not just with digital transformation, but also with the transformation to using Artificial Intelligence. And the good news is that the latter is a natural continuation of the former. Step 1 remains the digitalization of the business, where Google helps by providing fully managed data storage and processing services, so that you don’t need a full staff of tech wizards to manage the digital infrastructure of your business. Data ingestion is easy, and in many cases completely automated, e.g. if you use Google Analytics Premium and want to import the data into the Cloud. For other cases, Google has developed a rich partner ecosystem to provide the right data integration infrastructure. Once the data is in Google Cloud, all processing systems are “serverless”, meaning that customers don’t need to worry about operating servers. This makes it very easy for people to run analytics on the data, using their favorite tool, e.g. Tableau, MicroStrategy or Looker.
But, as mentioned above, in addition to opening the door to easy and powerful analytics, this digitalization of the business on Google Cloud also puts businesses directly in a position to take the next step and apply Google’s unique Machine Learning capabilities to their business challenges.