BigQuery and MotherDuck
This is one of those ideas that started as a bar-table discussion (admittedly after a couple of cocktails). One harebrained idea led to another, and I ended up asking myself whether there’s a way to connect MotherDuck with Google BigQuery.
It turns out you can, sort of.
I’m still figuring out if there’s practical production use case applicability for this, but if anything it serves as an interesting technical demonstration. So, here goes!
OpenCVDB and DuckDB
It started when a teammate of mine asked me if DuckDB would be good at ingesting and parsing OpenCVDB. Of course it’s a great way to quickly get at that data! For context, OpenCVDB is the Open Cloud Vulnerability Database, sponsored by Wiz (the security company). It documents security bugs in cloud services and publishes them as YAML files, which don’t necessarily lend themselves to easy querying.
This led me to create a quick experiment in a Google Colab notebook. In short order, we can download the files, use a small Python library to convert the YAML files to JSON, load them into DuckDB, shred that JSON, and run some analytics.
I’ve published the Colab notebook here, with step-by-step instructions and a walkthrough of how this works. Most of it should be explained in there, and I’d recommend taking a few minutes to give it a whirl. DuckDB paired with Colab notebooks has become my new favorite tool for quick ad hoc analysis.
Enter MotherDuck
This then got me thinking. How easy is it, really, to switch from using DuckDB in a local in-memory process to MotherDuck? Is it really as simple as adding `md:` in front and doing the same thing?
It turns out, yeah, it is that simple. I’ve taken the same notebook and swapped it over to a MotherDuck connection. It looks a little more complicated, since it’s a public-facing notebook and I needed a way for the user to pass their MotherDuck service token into Colab, but the `md:` prefix and the authorization token really are the only two things you have to add.
I’ve made a copy of the original notebook with the MotherDuck connection string, and you can find it here. I’d highly recommend signing up for a MotherDuck account and giving the notebook a run, because it creates a table with the OpenCVDB data, which you’ll need if you want to try the next bit.
And the next bit is the interesting one where my Rube Goldberg machine took off.
BigDuck
This is where the bar discussion got later into the night and the ideas got a little more outside the box. Can we get Google BigQuery to talk to MotherDuck? Can we roll our own BigQuery Omni using duct tape and popsicle sticks?
Sure!
This is where BigQuery Routines come to help us out. If you’ve read any of my previous blog posts, you’ll know that I’m a huge fan of seeing how far I can stretch BigQuery Functions, both UDF and remote, to do things that I want (like here, here, and there, for example). We’re going to do that again with MotherDuck.
While this probably isn’t a good fit for a production workload (there are a million better ways to accomplish it), let’s say we have data in BigQuery with cloud services as a column (like a billing table), and we want to query the OpenCVDB data in MotherDuck to get the vulnerability count for any given service. So, for example, if “EC2” appears, it will return 4, since (as of this writing) OpenCVDB has 4 known vulnerabilities logged; “Cloud SQL” would return 3, and so on and so forth.
We’re going to create a Cloud Function that takes the data BigQuery sends, ships it to MotherDuck for aggregation, reassembles the response, and sends it back to BigQuery. The Python code for that is here. What really makes this more efficient (well, all things considered) is that DuckDB will let you query dataframes in your code directly as tables. Given that, here’s the pseudocode for the function:
- Get input from BigQuery of services data.
- Convert the input JSON list to a dataframe (df), preserving input row order,
and appending a row number.
- SELECT DISTINCT services FROM df, SELECT vulnerabilities from OpenCVDB where
OpenCVDB.service_name = df.distinct(services), and count/aggregate that.
- Rejoin df with the aggregate count so we return the same number of rows
as the input.
- Return result to BigQuery.
Putting this into Cloud Functions v2 lets us expose it as a remote function and successfully retrieve data from MotherDuck!
As mentioned earlier, I’m still noodling through scenarios where this would be useful in a production sense, but I thought it would be neat to share as a technical experiment and demonstration.
Can I try this?
Yes, please do! Run through the MotherDuck notebook mentioned earlier to hydrate your MotherDuck table, clone this repository, and you should be a `terraform apply` away from this working in your own project. There are two variables you need, the project ID and your MotherDuck service token (you can either input them on the command line or create a `tfvars` file to automate it), and that should be about it. Some notes on the Terraform scripts:
- Your MotherDuck service token will be saved in Google Secret Manager, so nothing within GCP will be in plaintext.
- Everything is created with service accounts with minimal privileges.
- The BigQuery dataset where this lives has a randomized suffix in the name so you can create and destroy it and it shouldn’t affect anything else you have.
Other than that, it should be fairly straightforward. I’ve tried to make this as streamlined as possible, but please do let me know in the comments if you run into any issues!
At the end of the Terraform run, it will give you two SQL queries to run. The first one creates the remote function call to your Cloud Function; this was the only thing I found that didn’t have a native Terraform resource, so you’ll have to do it in SQL. The second query is one you can use to quickly test your new function!
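For reference, that first piece of DDL looks roughly like the following. Every name here (project, dataset with its random suffix, connection, and Cloud Function endpoint) is a placeholder; use the values your Terraform run prints out:

```sql
-- Hypothetical names: substitute the ones from your Terraform output.
CREATE OR REPLACE FUNCTION `my-project.opencvdb_abc123.vuln_count`(service STRING)
RETURNS INT64
REMOTE WITH CONNECTION `my-project.us.motherduck-conn`
OPTIONS (
  endpoint = 'https://us-central1-my-project.cloudfunctions.net/vuln-count'
);

-- A quick smoke test of the new function:
SELECT service, `my-project.opencvdb_abc123.vuln_count`(service) AS vulns
FROM UNNEST(['EC2', 'Cloud SQL']) AS service;
```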
As always, please comment with any issues you encounter, and if you have any ideas on how this might apply to a use case you’re looking to solve, I’d love to hear that too! Happy querying!