Partitioned tables

Like 5

As you know, for cost effectiveness, some datasets are split in several tables in BigQuery. That’s particularly the case for the Google Analytics tables: ga_sessions_YYYYMMDD.

The reason why these tables are split is that we can query only a subset of the whole data. Example: query to collect one week of visits per country:

#standardSQL
-- Query 1
SELECT date AS `date`
  , geoNetwork.country AS `country`
  , SUM(totals.visits) as `visits`
FROM `XXXXXXXX.ga_sessions_*`
WHERE _TABLE_SUFFIX BETWEEN "20170731" AND "20170806"
GROUP BY 1, 2
-- This query will process 18.1 MB when run.

⚠️ Be careful, if you don’t specify the date range WHERE _TABLE_SUFFIX BETWEEN "20170731" AND "20170806", your query will become way larger:

#standardSQL
-- Query 2
SELECT date AS `date`
  , geoNetwork.country AS `country`
  , SUM(totals.visits) as `visits`
FROM `XXXXXXXX.ga_sessions_*`
GROUP BY 1, 2
-- This query will process 1.91 GB when run.

How to synchronize these tables with Domo?

Now, the question is how can we use these tables in Domo. By default, Domo connector for BigQuery will exacute a query every day, erase your previous data, and replace with the fresher one. This works, but isn’t the most efficient way.

The idea is to use an alternative way of synchronizong data with Domo: by appending new data every day to the existing table (and not erasing and replacing).

A trick

But first, let’s have a look at how wonderful BigQuery is. Let’s consider this query.

#standardSQL
-- Query 3
SELECT geoNetwork.country AS `country`
  , PARSE_DATE("%E4Y%m%d", _TABLE_SUFFIX) AS `date`
  , SUM(totals.visits) AS `visits`
FROM `XXXXXXXX.ga_sessions_*`
WHERE REGEXP_CONTAINS(_TABLE_SUFFIX, r'^\d{8}$') -- checks if the suffix is a real date
GROUP BY 1, 2
-- This query will process 1.24 GB when run.

Query 3 is exactly the same as Query 2, it’s simply not using the column date, and use the column _TABLE_SUFFIX (which apparently BigQuery doesn’t bill to use!). It also doesn’t include the _intraday_ tables.

For convenience PARSE_DATE("%E4Y%m%d", _TABLE_SUFFIX) will convert _TABLE_SUFFIX (STRING) to a DATE type.

Let’s try to use that query as a subquery (or a view):

#standardSQL
-- Query 4
WITH my_view AS (
  SELECT geoNetwork.country AS `country`
    , PARSE_DATE("%E4Y%m%d", _TABLE_SUFFIX) AS `date`
    , SUM(totals.visits) AS `visits`
  FROM `XXXXXXXX.ga_sessions_*`
  WHERE REGEXP_CONTAINS(_TABLE_SUFFIX, r'^\d{8}$')
  GROUP BY 1, 2
)
SELECT date
  , country
  , visits
FROM my_view
WHERE date BETWEEN DATE("2017-07-31") AND DATE("2017-08-06")
-- This query will process 11.8 MB when run.

Query 4 is great for 2 reasons:

  1. it’s equivalent to Query 1, but processes less data (by using _TABLE_SUFFIX instead of date)
  2. we can imagine query my_view directly from Domo, filtering on a single date (yesterday), and paying only for the amont of data processed (1 day)!

Thank you

How to do that?

Protocol

First, you have to save Query 3 as a view in BigQuery. Then, in Domo, you can create a new data source, using BigQuery API. The query to write would be:

#standardSQL
SELECT date
  , country
  , visits
FROM my_view
WHERE date = DATE_SUB(CURRENT_DATE(), INTERVAL 1 DAY) -- Yesterday

In the scheduling tab, don’t forget to change the update settings to use the APPEND update mode. Select a time when you’re sure the table daily table will be ready (after noon for Google Analytics for example). And you’re done!

⚠️ Be extremely careful with the view you created (Query 3): if you query if with no filter on the date column, you will be charged for everything (avoid SELECT * FROM my_view) ⚠️

Recipe

– ae