Connecting to Google BigQuery
Note: This topic describes how to import data using the online Sisense Web Application. To import data through the desktop ElastiCube Manager, click here.
Sisense’s Google BigQuery connector is a powerful tool allowing you to retrieve from BigQuery just the data that you require, based on timestamp, date range or other parameters defined by you, thus reducing the import times and the costs associated with querying a large number of tables in BigQuery.
Sisense’s Google BigQuery connector supports complex data types and partition types used in BigQuery. You can run queries on defined partitions, including tables partitioned by date field, sharded tables, and tables partitioned by ingestion time. Complex data types, including Nested and Nested Repeated, are ingested automatically without any need to write SQL or functions.
In this topic:
- Connecting to BigQuery
- Querying Partitions
- Querying Nested Objects
- Source Limitations
- Connecting to Google BigQuery
Note: For the list of connectors available on Linux, click here.
Sisense enables easy and quick access to databases, tables and views contained within Google BigQuery.
To connect to your Google BigQuery database, you need to provide a Project ID. The Project ID is a unique identifier for your BigQuery project. You receive the Project ID when you create a project in Google BigQuery.
After you provide the Project ID, you can sign into your BigQuery account with your Google credentials. Once Google has authenticated your account, you can select what tables are to be imported into Sisense.
Note: If you are connecting to a Google service remotely (for example, if you are on a Linux deployment), and the address of Sisense is something other than localhost, Google requires that you connect using the OAuth 2.0 protocol. See Google Authentication for more information.
To import Google BigQuery data:
- In the Data page, open an ElastiCube or click to create a new ElastiCube.
- In the Model Editor, click . The Add Data dialog box is displayed.
- In the Add Data dialog box, select Google BigQuery.
- Click Sign in with Google and enter your email, and then select Next to enter your password. If you have multiple accounts, select the account that has the Google BigQuery data you want to access and enter the password, if you're not already signed in. You must be signed out of all other Google accounts.
- Click Accept to allow your Sisense Server to access your Google BigQuery data.
- In Project ID, enter your BigQuery Project ID. This project ID is only used for authentication. Once you log in, you will have access to all the projects to which you have Google Big Query permissions.
If you have already created a project and can’t find the Project ID, click here for help.
- In Record Fields Flattening Level, enter the number of nested levels you want to flatten. For example, in a table with nested data depth of seven levels, you may choose to flatten only the first two levels. The data on the other levels will not be imported into Sisense. See Single Record Objects for more information.
- Select Record Repeated and Array Fields Unnesting if you want to unnest arrays and Nested Repeated fields, and to flatten and unnest arrays of Records into additional rows. If you do not select the checkbox, all these data types will be imported as strings.
- Select Allow Large Results if your BigQuery table is larger than 128 MB. Note that when importing query results larger than 128 MB, BigQuery creates a temporary table (click here for more information). For this table to be created, you must be granted the ‘Bigquery.tables.create’ permission in BigQuery (click here for more information). If you don’t have this permission, the query results will not be imported into Sisense. The message that Sisense issues in this case is “Response too large to return. Consider setting allowLargeResults to true in your job configuration.”
- Click Next. All tables and views associated with Google BigQuery are displayed.
- From the Tables list, select the relevant table or view you want to work with. You can click next to the relevant table or click Preview to view to see a preview of the data inside it.
Note: When selecting the tables or views, the Connector Wizard displays all schemas from all projects accessed, not only the schemas from the project whose ID you provided for authentication.
When you select the table or view, a new option is displayed at the bottom of the list, Add Import Query.
- (Optional) Click + to customize the data you want to import with SQL. See Controlling the Partitions to Import with Custom Queries for more information.
- After you have selected all the relevant tables, click Done. The tables are added to your schema.
Sisense’s Google BigQuery connector provides the ability to run queries on defined partitions:
- Sharded tables - supports the ability to view the parent tables (grouped view, as wildcard Select statement) while running queries per the partitions as filters.
- Tables partitioned by date field - querying by partitions as out-of-the-box functionality, via filters and custom queries.
- Tables partitioned by ingestion time - using BigQuery SQL functions via custom queries.
Sisense provides a unique solution to rebuilding the parent table from its constituent sharded tables (grouped view). For this, Sisense provides an external file, where you can indicate the custom prefixes or regular expressions (regex) by which to group the shared tables into one parent table.
To group the shared tables into a single parent table:
- Open the prefixes.json file (You can open it in Notepad). This file is located in the directory:
This file contains default prefix and regex parameters that group all Google Analytics daily tables of the format "ga_sessions_YYYYMMDD” into one parent table.
- To refine the grouping, modify the default parameters or add additional parameters as relevant to your data, and save the file. Make sure that the prefixes row is separated by a comma from the regex row.
- To update your data model based on the new parameters, do one of the following:
- from the Connection Wizard, jump to a previous step and return to the step you were on.
- from the data model, open the Connector Wizard to add another table.
In both cases, The Connector Wizard will display the parent table, grouped from the sharded tables based on the indicated parameters.
The Sisense BigQuery connector provides the following abilities to run queries on BigQuery's table partitions and sharded tables:
- Partitioned tables: Tables that are partitioned based on a TIMESTAMP or DATE column. The Sisense BigQuery connector enables querying the partitions via filters and custom queries, simply by the where clause sent to BigQuery.
- Tables partitioned by ingestion time: Tables partitioned based on the data's ingestion (load) date or arrival date. The Sisense BigQuery connector enables using BigQuery SQL functions via custom queries to retrieve the requested partitions. Click here for details about these two methods.
- As an alternative to partitioned tables, Google BigQuery enables sharding tables using a time-based naming approach, such as [PREFIX]_YYYYMMDD. This is referred to as date-sharded tables. See here for details. The Sisense BigQuery connector enables supports the ability to view the parent tables (grouped view, as Wildcard Select statement) while running queries per the partitions as filters.
You can import subsets of your partitioned and sharded tables into Sisense using SQL. For general information, see Controlling the Partitions to Import with Custom Queries.
Note: Starting from Sisense version 8.0.1, you can create custom SQL queries in Live data models. With Live table queries, you can now add and transform data and build custom business logic for analytics from Live models using custom SQL. Using this feature you can add a WHERE clause on your specific partition. For details, see here.
Tables partitioned based on a TIMESTAMP or DATE column do not have pseudo columns in BigQuery. To limit the number of partitions scanned when querying partitioned tables, use a predicate filter (a WHERE clause).
A common use case of leveraging partitioned tables in BigQuery is the accumulate build. Defining the accumulate build on a TIMESTAMP or DATE field enables you to scan only specific partitions, thus reducing the import times and costs involved in querying in BigQuery.
For example, the following query prunes partitions:
ON t1.id_field = t2 field2
t1.ts = CURRENT_TIMESTAMP()
When you create an ingestion-time partitioned table in BigQuery, two pseudo columns are added to the table: a _PARTITIONTIME pseudo column and a _PARTITIONDATE pseudo column. The _PARTITIONTIME pseudo column contains a date-based timestamp for data that is loaded into the table. The _PARTITIONDATE pseudo column contains a date representation.
You can use the _PARTITIONTIME and _PARTITIONDATE pseudo columns to limit the number of partitions scanned during a query. This reduces the on-demand analysis cost. For example, the following query scans only the partitions between the dates January 1, 2019 and January 2, 2019 from the partitioned table:
SELECT [COLUMN] FROM [DATASET].[TABLE] WHERE _PARTITIONDATE BETWEEN '2019-01-01' AND '2019-01-02'
See here for a detailed explanation.
BigQuery stores sharded tables in the format of table_name_SUFFIX (for example "ga_sessions_YYYYMMDD”). You can write a custom query to import only the sharded tables in the requested date range. This provides the benefit of faster and cheaper queries, since in BigQuery you pay for the amount of data you scan.
To do that, you can use wildcard tables combined with the _TABLE_SUFFIX pseudo column in the WHERE clause. The _TABLE_SUFFIX pseudo column contains the values matched by the table wildcard. The _TABLE_SUFFIX pseudo column contains the values matched by the table wildcard. For example, the following FROM clause uses the wildcard (*) to match all tables in the ga_sessions dataset that match a specific date.
SELECT date, SUM(totals.visits) AS visits, SUM(totals.pageviews) AS pageviews, FROM `aerial-citron-207113.GA360.ga_sessions_*` WHERE _TABLE SUFFIX BETWEEN '20170801' AND '20180801' Group BY date ORDER BY date ASC
See here for a detailed explanation of the example.
Also, see here for additional examples of advanced custom queries.
Note: Sisense uses the standard SQL dialect, and not legacy SQL (also known as the BigQuery SQL).
BigQuery supports nested records and arrays within tables. Nested records in BigQuery can be Single or Repeated records.
These record types are imported into Sisense as:
- Single records - as additional columns
- Repeated records - as additional columns and rows
For example: A BigQuery schema that has Single and Repeated records:
Would be imported into Sisense as:
Click here for more information.
Sisense ingests both nested records and arrays automatically without any need to write SQL and functions:
- Nested - flattens records/arrays to columns with the ability to choose how many levels to expand.
- Nested Repeated - flattens repeated records to rows and columns automatically, using the UNNEST function.
BigQuery supports Nested data as objects of Record data type.
When importing data into Sisense, you need to indicate how many levels of nested data you want to flatten (see Connecting to BigQuery). In Sisense, data on these levels will be flattened to columns using the dot operator (.).
Data on the nested levels that you chose not to flatten is not imported into Sisense.
Repeated Record Objects
BigQuery supports Nested Repeated data as arrays of Repeated Record data type.
When importing data into Sisense, Nested Repeated data type is flattened to rows and columns automatically using the UNNEST function.
Note: This feature is currently available only in ElastiCube data models. It will be available in Live data models starting from Sisense version 8.0. If you are on an earlier Sisense version and working with Live data models, you can use the UNNEST operator to retrieve nested and repeated data as flattened rows via custom query (see here for more information).
Data in Array (primitive) data type is imported into Sisense. Sisense converts this data to un-indexed strings.
- The following functions are not supported for Google BigQuery: Cot, Diff, Mod, Round
- Maximum 15 nested levels (click here for details)
- Maximum query length: 1 MB
- Maximum response size: 128 MB compressed. This limitation can be overcome by clicking Allow Large Results Set as explained in Connecting to BigQuery.
- Query execution time limit: 6 hours
- Maximum number of tables referenced per query: 1,000
- Concurrent queries - based on the account rate limits. Click here for information on BigQuery quotas.
When building an ElastiCube with this connector, you might receive an "out of memory" error. To add more memory, see Troubleshooting Out-of-Memory Issues.