Overcoming large-scale geospatial analysis challenges using Google BigQuery

Brian Suk
The SADA Engineering Blog
6 min read · Feb 16, 2022


Introduction

The drive for location analytics has increased the need for geospatial analytics at scale. Connected devices, digital twin tracking, and sustainability initiatives have significantly increased the volume and velocity of captured geolocation data, stressing geospatial processing systems and their underlying storage subsystems. There are many ways to analyze large geospatial datasets; unfortunately, many are slow or fail outright when the dataset is too large. Comparing technologies is also difficult because the geospatial space has no commonly accepted benchmarking framework (there is no “TPC-H for GIS,” so to speak). The Defence Science and Technology Laboratory (DSTL) in the United Kingdom commissioned and published a benchmark whitepaper, “Benchmarking Technologies with Geospatial Big Data,” comparing many commonly used technologies, including GeoSpark, Postgres, and MongoDB; it proves to be a solid starting point for creating common tests across GIS technologies.

The original DSTL benchmark used 20 n1-standard-16 Google Compute Engine (GCE) virtual machines to generate three geospatial datasets. After loading the datasets into the various databases, performance was measured by running a series of queries of differing complexity against each technology. Three of the technologies compared in the benchmark used the Hadoop Distributed File System (HDFS), with a five-node GCE environment performing the load into HDFS; the other technologies had their own loading mechanisms.

The DSTL whitepaper provides a detailed analysis of each technology and draws a few general conclusions:

  • Data ingestion can be time-consuming and computationally expensive.
  • Infrastructure (compute, storage, and monitoring) usage and management can be expensive.
  • Data indexing (and re-indexing in a live environment) also needs to be managed, as many systems rely on indexes for adequate performance.

At SADA, with our expertise in Google Cloud Platform (GCP), we wanted to test these challenging analytics requirements against the geospatial capabilities of BigQuery by replicating the data generation, data loading, and querying benchmarks as closely as we could. Not only were we successful, but BigQuery also proved to be highly responsive and an excellent option for geospatial analytics challenges at scale.

We acknowledge that the original DSTL benchmark was run in 2017 and that the tested technologies have since improved significantly. Even so, the benchmark provides a useful basis for analysis and comparison, and we welcome others executing these tests on other data technologies.

Below are the key findings and a summary of our analysis. If you would like to replicate this test, please see the technical deep dive post, which documents and explains all of the queries used.

Environment

First, it’s essential to understand BigQuery’s pricing and resourcing model (this discussion covers only analysis resources, as storage pricing is the same in either model). By default, BigQuery uses on-demand pricing: you pay per query, based on the number of bytes scanned, as described in the Google documentation. Also by default, BigQuery provides up to 2,000 slots with transient bursts, which makes benchmarking unpredictable. BigQuery Reservations provides a consistent number of slots at fixed pricing, depending on the number of reserved slots and the duration of the commitment (by the year, month, or minute). If you decide to run these tests yourself, first check whether your organization has an existing reservation you can leverage. If not, consider using flex slots (here is another great blog post giving an example of the power of flex slots). For our tests, we used 2,000 flex slots for consistency, repeatability, and cost optimization.
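For illustration, flex slots can be provisioned through the BigQuery Reservations DDL. The sketch below uses placeholder project, region, and object names, and the exact statements and option names may differ by release, so treat it as an outline of the steps (purchase a commitment, create a reservation, assign a project) rather than the exact commands we ran:

-- Purchase a 2,000-slot flex commitment (placeholder admin project and region).
CREATE CAPACITY `admin-project.region-us.benchmark-commitment`
  OPTIONS (slot_count = 2000, plan = 'FLEX');

-- Carve the purchased slots into a reservation.
CREATE RESERVATION `admin-project.region-us.benchmark-reservation`
  OPTIONS (slot_capacity = 2000);

-- Point the benchmark project's query jobs at the reservation.
CREATE ASSIGNMENT `admin-project.region-us.benchmark-reservation.benchmark-assignment`
  OPTIONS (assignee = 'projects/benchmark-project', job_type = 'QUERY');

Remember to remove the assignment, reservation, and capacity commitment once the tests finish so the flex slots stop accruing charges.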

Data Generation

Compared to the DSTL benchmark, we chose a different approach to data generation. Instead of generating the data externally and then loading it, we explored the feasibility of generating the test data directly within BigQuery. An excellent blog post, written by a Googler, discusses using the faker.js framework to generate large volumes of data by running it within a BigQuery JavaScript UDF. To align with the DSTL datasets, we needed long text and geographies, both of which are supported faker.js data types. The following table compares BigQuery’s data generation times for 10 billion rows with the benchmark’s HDFS load times.

Note that the HDFS times cover only loading the pre-generated data and do not include data generation itself, whereas the BigQuery times include data generation. HDFS was slightly faster at loading dataset 2, but BigQuery was significantly faster for the larger datasets.
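To illustrate the in-database approach, here is a minimal sketch of generating random timestamped points entirely inside BigQuery. The table name, column layout, and row counts are placeholders, and the actual generation additionally relied on faker.js inside a JavaScript UDF to produce the long text fields:

-- Sketch: generate one billion rows of random timestamped points in a single statement.
-- GENERATE_ARRAY is capped at roughly one million elements, so larger volumes are
-- produced by cross-joining generated arrays.
CREATE OR REPLACE TABLE benchmark.dataset1 AS
SELECT
  a * 1000000 + b AS id,
  TIMESTAMP_ADD(TIMESTAMP '2017-01-01 00:00:00 UTC',
                INTERVAL CAST(RAND() * 31536000 AS INT64) SECOND) AS ts,  -- random moment within a year
  ST_GEOGPOINT(RAND() * 360 - 180, RAND() * 180 - 90) AS geo              -- random longitude/latitude point
FROM UNNEST(GENERATE_ARRAY(0, 999)) AS a
CROSS JOIN UNNEST(GENERATE_ARRAY(0, 999999)) AS b;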

Data Querying

This section compares the BigQuery results with the DSTL findings reported on pages 44–56 of the whitepaper. The following chart shows each technology’s time as a multiplier, with BigQuery as the baseline of 1.0.

Below are more detailed results for each query. All times are in minutes — note how much longer the other technologies took compared to BigQuery.

Dataset 1 — Query 1

Benchmark query: select * from dataset1 where dataset1.geo is within bbox1
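A bounding-box predicate like this maps directly onto BigQuery’s geography functions. A rough sketch, with a placeholder table name, column names, and coordinates (not the exact query from our runs), could look like:

-- Sketch: rows whose geography falls inside a bounding box.
-- ST_INTERSECTSBOX takes the box as lng_min, lat_min, lng_max, lat_max;
-- ST_WITHIN against a polygon built with ST_GEOGFROMTEXT is an equivalent alternative.
SELECT *
FROM benchmark.dataset1
WHERE ST_INTERSECTSBOX(geo, -10.0, 50.0, 2.0, 59.0);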

Dataset 1 — Query 2

Benchmark query: select * from dataset1 where dataset1.geo is within bbox2

Dataset 1 — Query 3

Benchmark query: select * from dataset1 where dataset1.timestamp < time1 and dataset1.timestamp > time2

Dataset 1 — Query 4

Benchmark query: select * from dataset1 where dataset1.geo is within bbox1 and dataset1.timestamp < time1 and dataset1.timestamp > time2

Dataset 1 — Query 5

Benchmark query: select * from dataset1 where dataset1.geo is within bbox2 and dataset1.timestamp < time1 and dataset1.timestamp > time2
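Queries 3 through 5 layer a timestamp range onto the spatial filter. In BigQuery, that is simply an additional predicate on the timestamp column; a sketch with placeholder names and values:

-- Sketch: bounding-box filter combined with a timestamp range.
SELECT *
FROM benchmark.dataset1
WHERE ST_INTERSECTSBOX(geo, -10.0, 50.0, 2.0, 59.0)
  AND ts > TIMESTAMP '2017-03-01'
  AND ts < TIMESTAMP '2017-04-01';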

Dataset 3 — Query 1

Benchmark query: select * from dataset3 where dataset3.geo is within bbox2 and dataset3.timestamp < time1 and dataset3.timestamp > time2

Dataset 3 — Query 2

Benchmark query: select * from dataset3 where dataset3.geo is within 10km of any point in dataset2.geo

Dataset 3 — Query 3

Benchmark query: select * from dataset3 where dataset3.geo intersects bbox1 and dataset3.timestamp < time1 and dataset3.timestamp > time2

Dataset 3 — Query 4

Benchmark query: select * from dataset3 where dataset3.geo intersects bbox2 and dataset3.timestamp < time1 and dataset3.timestamp > time2

Dataset 1 — Query 9

Benchmark query: select * from dataset1 where dataset1.geo is within 10km of point1, order by distance to point1
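In BigQuery, “within 10 km of a point” maps naturally to ST_DWITHIN, and ordering by distance uses ST_DISTANCE. A sketch with a placeholder table and reference point:

-- Sketch: rows within 10 km of a reference point, nearest first.
SELECT *
FROM benchmark.dataset1
WHERE ST_DWITHIN(geo, ST_GEOGPOINT(-0.1276, 51.5072), 10000)  -- distance in meters
ORDER BY ST_DISTANCE(geo, ST_GEOGPOINT(-0.1276, 51.5072));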

Dataset 1 — Query 10

Benchmark query: select * from dataset1 where dataset1.geo is within 10km of point1 order by timestamp

Dataset 3 — Query 5

Benchmark query: select * from dataset3 where dataset3.geo is within 10km of point1 order by closest distance to point1

Dataset 3 — Query 6

Benchmark query: select * from dataset3 where dataset3.geo is within 10km of point1 order by timestamp

Join Queries

These queries were particularly interesting, as all of the benchmarked technologies failed to complete them (as shown on page 56 of the report). Here are the two queries:

Join query 1: select * from dataset1 where dataset1.geo is within 10km of any point in dataset2.geo

Join query 2: select * from dataset3 where dataset3.geo is within 10km of any point in dataset2.geo
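Conceptually, both translate into a spatial join in BigQuery. A simplified, untuned sketch of the first join query, with placeholder table and column names:

-- Sketch: rows in dataset1 that fall within 10 km of a geography in dataset2.
-- Results are grouped on the id key, since the join can match multiple dataset2
-- rows per dataset1 row (GEOGRAPHY columns cannot appear in SELECT DISTINCT).
SELECT d1.id, ANY_VALUE(d1.ts) AS ts, ANY_VALUE(d1.geo) AS geo
FROM benchmark.dataset1 AS d1
JOIN benchmark.dataset2 AS d2
  ON ST_DWITHIN(d1.geo, d2.geo, 10000)  -- distance in meters
GROUP BY d1.id;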

The technical deep-dive blog post discusses the performance tuning required to execute these joins in BigQuery, and the timings are as follows:

Overall, BigQuery is significantly faster than the technologies in the DSTL benchmark. Additionally, BigQuery eliminates infrastructure and management overhead, along with the need for database indexing. For more information about other BigQuery benefits, refer to Google’s BigQuery overview.

If you would like to recreate this exercise on your own in BigQuery, or would like to see the underlying engineering processes so you can recreate it in another technology, please see this external blog post, which goes through all the details of both data generation and benchmark querying.

About SADA

At SADA, we climb every mountain, clear every hurdle, and turn the improbable into possible — over and over again. Simply put, we propel your organization forward. It’s not enough to migrate to the cloud, it’s what you do once you’re there. Accelerating application development. Advancing productivity and collaboration. Using your data as a competitive edge. When it comes to Google Cloud, we’re not an add-on, we’re a must-have, driving the business performance of our clients with its power. Beyond our expertise and experience, what sets us apart is our people. It’s the spirit that carried us from scrappy origins as one of the Google Cloud launch partners to an award-winning global partner year after year. With a client list that spans healthcare, media and entertainment, retail, manufacturing, public sector and digital natives — we simply get the job done, every step of the way. Visit SADA.com to learn more.

If you’re interested in becoming part of the SADA team, please visit our careers page.
