BigQuery and MotherDuck

Brian Suk
Google Cloud - Community
6 min read · Dec 14, 2023


This has nothing to do with the article, but it’s a photo I took in Frankfurt of Egyptian Geese that I like, which apparently (despite their name) are African members of the duck family. I found that interesting, and now you know too! Yes, I birdwatch.

This is one of those ideas that started off as a bar-table discussion (admittedly after a couple of cocktails), where one hare-brained idea led to another and ended with me asking myself if there’s a way to connect MotherDuck with Google BigQuery.

It turns out you can, sort of.


I’m still figuring out whether there’s a practical production use case for this, but if nothing else it serves as an interesting technical demonstration. So, here goes!

OpenCVDB and DuckDB

It started when a teammate of mine asked me if DuckDB would be good at ingesting and parsing OpenCVDB. Of course it’s a great way to get to that data quickly! For context, OpenCVDB is the Open Cloud Vulnerability Database, sponsored by Wiz (the security company). It documents security bugs in cloud services and publishes them as YAML files, which don’t necessarily lend themselves to easy querying.

I want to GROUP BY on this data!

This led me to create a quick experiment in a Google Colab notebook. In short order we can download the files, use a small Python library to convert the YAML files to JSON, load them into DuckDB, shred that JSON, and run some analytics.

Get us some sweet, sweet charts.

I’ve published the Colab notebook here, with step-by-step instructions and a walkthrough of how this works. Most of it should be explained in there, and I’d recommend taking a few minutes to give it a whirl to see how it works. DuckDB with Colab notebooks has become my new favorite thing for quick ad hoc analysis.
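If you just want the gist without opening the notebook, the whole flow boils down to something like this. It’s a minimal sketch rather than the notebook’s exact code, and the repo path, glob pattern, and platform column are illustrative assumptions:

```python
# Minimal sketch of the notebook flow. The repo path, glob pattern, and
# the `platform` column are illustrative assumptions, not necessarily
# the exact OpenCVDB layout.
import glob
import json

import duckdb
import yaml  # pip install pyyaml duckdb

# Convert each OpenCVDB YAML file into a JSON file next to it.
for path in glob.glob("open-cvdb/**/*.yaml", recursive=True):
    with open(path) as f:
        doc = yaml.safe_load(f)
    with open(path.removesuffix(".yaml") + ".json", "w") as f:
        json.dump(doc, f)

con = duckdb.connect()  # local, in-memory DuckDB

# Let DuckDB infer the schema and shred the nested JSON into a table.
con.sql("""
    CREATE TABLE vulnerabilities AS
    SELECT * FROM read_json_auto('open-cvdb/**/*.json')
""")

# And now we can finally GROUP BY on it.
con.sql("""
    SELECT platform, COUNT(*) AS vulns
    FROM vulnerabilities
    GROUP BY platform
    ORDER BY vulns DESC
""").show()
```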

Enter MotherDuck

This then got me thinking. How easy is it, really, to switch from using DuckDB in a local in-memory process to MotherDuck? Is it really as simple as putting md: in front of the connection string and doing the same thing?

It turns out, yeah, it is that simple. I’ve taken the same notebook and swapped it over to a MotherDuck connection. It looks a little more complicated because it’s a public-facing notebook and I needed a way for the user to pass their MotherDuck service token into Colab, but the md: prefix and the authorization token really are the only two things you have to add.
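Concretely, the switch is just the connection line. A minimal sketch, where my_db is a placeholder database name:

```python
import duckdb

# Local, in-memory DuckDB:
con = duckdb.connect()

# MotherDuck: same API, just an md: connection string plus your service
# token. ("my_db" is a placeholder; the token can also be supplied via
# the MOTHERDUCK_TOKEN environment variable instead of the URL.)
token = "..."  # your MotherDuck service token
con = duckdb.connect(f"md:my_db?motherduck_token={token}")
```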

Duck and MotherDuck

I’ve made a copy of the original notebook with the MotherDuck connection string, and you can find it here. I would highly recommend signing up for a MotherDuck account and giving the notebook a run, because it creates a table with the OpenCVDB data, which you’ll need if you want to try the next bit.

OpenCVDB in MotherDuck!

And the next bit is the interesting one where my Rube Goldberg machine took off.

BigDuck

This is where the bar discussion got later into the night and the ideas got a little more outside the box: can we get Google BigQuery to talk to MotherDuck, and can we roll our own BigQuery Omni out of duct tape and popsicle sticks?

Sure!

This is where BigQuery Routines come to help us out. If you’ve read any of my previous blog posts, you’ll know that I’m a huge fan of seeing how far I can stretch BigQuery Functions, both UDF and remote, to do things that I want (like here, here, and there, for example). We’re going to do that again with MotherDuck.

While this probably isn’t a good use case for a production workload (there are a million better ways to accomplish it), let’s say we want to take data in BigQuery that has cloud services as a column (like a billing table), and call the OpenCVDB data in MotherDuck to get the vulnerability count for any given service. For example, if “EC2” appears, the function returns 4, since (as of this writing) OpenCVDB has four known EC2 vulnerabilities logged. “Cloud SQL” would return 3, and so on and so forth.

We’re going to create a Cloud Function that takes the data BigQuery sends, ships it to MotherDuck for aggregation, reassembles the response, and returns it to BigQuery. The Python code for that is here. What really makes this more efficient (well, all things considered) is that DuckDB lets you query dataframes in your code directly as tables. Given that, here’s the pseudocode for the function:

- Get the input services data from BigQuery.
- Convert the input JSON list to a dataframe (df), preserving input row order by appending a row number.
- SELECT DISTINCT services FROM df, look up OpenCVDB rows WHERE OpenCVDB.service_name matches each distinct service, and count/aggregate the vulnerabilities.
- Rejoin df with the aggregate counts so we return the same number of rows as the input.
- Return the result to BigQuery.

Putting this into Cloud Functions v2 lets us expose it as a remote function, and it successfully retrieves data from MotherDuck!

BigQuery just learned how to quack.

As mentioned earlier, I’m still noodling through scenarios where this would be useful in a production sense, but I thought it would be neat to share it as a technical experiment and demonstration.

Can I try this?

Yes, please do! Run through the MotherDuck notebook mentioned earlier to hydrate your MotherDuck table, clone this repository, and you should be a terraform apply away from this working in your own project. There are two variables you need, the project ID and your MotherDuck service token (you can either enter these on the command line or create a tfvars file to automate it), and that should be about it. Some notes on the Terraform scripts:

  • Your MotherDuck service token will be saved in Google Secret Manager, so nothing within GCP will be in plaintext.
  • Everything is created with service accounts with minimal privileges.
  • The BigQuery dataset where this lives has a randomized suffix in the name so you can create and destroy it and it shouldn’t affect anything else you have.

Other than that, it should be fairly straightforward. I’ve tried to make this as streamlined as possible, but please do let me know in the comments if you run into any issues!

At the end of the Terraform run, it will give you two SQL queries to run. The first one creates the remote function that points at your Cloud Function; this was the only piece I found without a native Terraform resource, so you will have to do it in SQL. The second query lets you quickly test your new function!

Two shiny queries for you to run!
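For reference, here’s roughly what those two queries look like, with the project, dataset, connection, and endpoint as placeholders (the real values come from the Terraform output). You can paste them into the BigQuery console, or run them from Python:

```python
# Placeholder project/dataset/connection/endpoint values; substitute
# the ones printed by your Terraform run.
from google.cloud import bigquery

client = bigquery.Client(project="my-project")

# Query 1: create the remote function that points at the Cloud Function.
client.query("""
    CREATE OR REPLACE FUNCTION `my-project.my_dataset.vuln_count`(service STRING)
    RETURNS INT64
    REMOTE WITH CONNECTION `my-project.us.my-connection`
    OPTIONS (endpoint = 'https://my-function-url.cloudfunctions.net/vuln-count')
""").result()

# Query 2: a quick smoke test of the new function.
for row in client.query(
    "SELECT `my-project.my_dataset.vuln_count`('EC2') AS vulns"
).result():
    print(row.vulns)  # the current EC2 vulnerability count from MotherDuck
```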

As always, please comment with any issues you encounter, and if you have any ideas on how this might apply to a use case you’re looking to solve, I’d love to hear that too! Happy querying!
