Bigquery Client

Parent POM for Hotels.com open source projects

License	License Apache License, Version 2.0
GroupId	GroupId com.hotels
ArtifactId	ArtifactId circus-train-bigquery
Last Version	Last Version 5.2.0
Release Date	Release Date Nov 28, 2019
Type	Type jar
Description	Description Bigquery Client Parent POM for Hotels.com open source projects
Project Organization	Project Organization Hotels.com (Data Platform Team)
Source Code Management	Source Code Management https://github.com/HotelsDotCom/circus-train-bigquery

Download circus-train-bigquery

Filename	Size
circus-train-bigquery-5.2.0.pom
circus-train-bigquery-5.2.0.jar	11 MB
circus-train-bigquery-5.2.0-sources.jar	53 KB
circus-train-bigquery-5.2.0-javadoc.jar	274 KB
Browse

How to add to project

Apache Maven

<!-- https://jarcasting.com/artifacts/com.hotels/circus-train-bigquery/ -->
<dependency>
    <groupId>com.hotels</groupId>
    <artifactId>circus-train-bigquery</artifactId>
    <version>5.2.0</version>
</dependency>

Gradle Groovy

// https://jarcasting.com/artifacts/com.hotels/circus-train-bigquery/
implementation 'com.hotels:circus-train-bigquery:5.2.0'

Gradle Kotlin

// https://jarcasting.com/artifacts/com.hotels/circus-train-bigquery/
implementation ("com.hotels:circus-train-bigquery:5.2.0")

Apache Buildr

'com.hotels:circus-train-bigquery:jar:5.2.0'

Apache Ivy

<dependency org="com.hotels" name="circus-train-bigquery" rev="5.2.0">
  <artifact name="circus-train-bigquery" type="jar" />
</dependency>

Groovy Grape

@Grapes(
@Grab(group='com.hotels', module='circus-train-bigquery', version='5.2.0')
)

Scala SBT

libraryDependencies += "com.hotels" % "circus-train-bigquery" % "5.2.0"

Leiningen

[com.hotels/circus-train-bigquery "5.2.0"]

Dependencies

compile (3)

Group / Artifact	Type	Version
com.google.cloud : google-cloud-bigquery	jar
com.google.cloud : google-cloud-storage	jar
com.google.guava : guava	jar	20.0

provided (10)

Group / Artifact	Type	Version
com.hotels : circus-train-api	jar	15.0.0
com.hotels : circus-train-core	jar	15.0.0
com.hotels : circus-train-gcp	jar	15.0.0
org.apache.hadoop : hadoop-common	jar	2.7.1
org.apache.hive : hive-metastore	jar	2.3.3
org.apache.hive.hcatalog : hive-hcatalog-core	jar	2.3.3
org.apache.hive : hive-exec	jar	2.3.3
org.springframework.boot : spring-boot	jar
org.springframework.boot : spring-boot-autoconfigure	jar
org.springframework.boot : spring-boot-configuration-processor Optional	jar

test (3)

Group / Artifact	Type	Version
junit : junit	jar
org.mockito : mockito-core	jar
fm.last.commons : lastcommons-test	jar	7.0.2

Project Modules

There are no modules declared in this project.

Circus Train BigQuery To Hive Replication

Overview

This Circus Train plugin enables the conversion of BigQuery tables to Hive.

Start using

You can obtain Circus Train BigQuery from Maven Central:

Installation

In order to be used by Circus Train the above circus-train-bigquery jar file must be added to Circus Train's classpath. It is highly recommended that the version of this library and the version of Circus Train are identical. The recommended way to make this extension available on the classpath is to store it in a standard location and then add this to the CIRCUS_TRAIN_CLASSPATH environment variable (e.g. via a startup script):

export CIRCUS_TRAIN_CLASSPATH=$CIRCUS_TRAIN_CLASSPATH:/opt/circus-train-big-query/lib/*

Another option is to place the jar file in the Circus Train lib folder which will then automatically load it but risks interfering with any Circus Train jobs that do not require the extension's functionality.

Configuration

Add the following to the Circus Train YAML configuration in order to load the BigQuery extension via Circus Train's extension loading mechanism:

     extension-packages: com.hotels.bdp.circustrain.bigquery

Configure Circus Train as you would for a copy job from Google Cloud Configuration
Provide the Google Cloud project ID that your BigQuery instance resides in as your source-catalog hive-metastore-uris parameter using the format hive-metastore-uris: bigquery://<project-id>
To enable copying to Google Storage provide a path to your Google Credentials in the configuration under the gcp-security parameter.
Provide your BigQuery dataset as source-table database-name and your BigQuery table name as source-table table-name

Partition Generation

Circus Train BigQuery allows you to add partitions for one column to your Hive destination table upon replication from BigQuery to Hive. The partition must be a field present on your source BigQuery table. The user can configure the partition field by setting the table-replications[n].copier-option : circus-train-bigquery-partition-by property on a specific table replication within the Circus Train configuration file. The destination data will be repartitioned on the specified field.

If your destination data is partitioned you can also specify a partition filter using the table-replications[n].copier-option : circus-train-bigquery-partition-filter property. The partition filter is a BigQuery SQL query which will be used to filter the data being replicated - i.e. so one doesn't have to replicate the entire table on each run. You can think of the partition filter being executed as select * from bq_db.table where ${filter-condition}, with the ${filter-condition} being substituted by your partition filter. Only the rows returned by this query will be replicated. See partition filters in the main Circus Train documentation for more information.

Examples:

Simple Configuration

extension-packages: com.hotels.bdp.circustrain.bigquery
source-catalog:
  name: my-google-source-catalog
  hive-metastore-uris: bigquery://my-gcp-project-id
replica-catalog:
  name: my-replica-catalog
  hive-metastore-uris: thrift://internal-shared-hive-metastore-elb-123456789.us-west-2.elb.foobaz.com:9083

gcp-security:
  credential-provider: /home/hadoop/.gcp/my-gcp-project-01c26fd71db7.json 

table-replications:
- source-table:
    database-name: mysourcedb
    table-name: google_ads_data
  replica-table:
    database-name: myreplicadb
    table-name: bigquery_google_ads_data
    table-location: s3://mybucket/foo/baz/

Configuration with partition generation configured

extension-packages: com.hotels.bdp.circustrain.bigquery
source-catalog:
  ... see above ...

table-replications:
  -
    source-table:
      database-name: bdp
      table-name: bigquery_source
    replica-table:
      database-name: bdp
      table-name: hive_replica
      table-location: s3://bucket/foo/bar/
    copier-options:
      circustrain-bigquery-partition-by: date

Configuration with partition filter configured

extension-packages: com.hotels.bdp.circustrain.bigquery
source-catalog:
  ... see above ...

table-replications:
  -
    source-table:
      database-name: bdp
      table-name: bigquery_source
    replica-table:
      database-name: bdp
      table-name: hive_replica
      table-location: s3://bucket/foo/bar/
    copier-options:
      circustrain-bigquery-partition-by: date
      circustrain-bigquery-partition-filter: date BETWEEN TIMESTAMP_ADD(CURRENT_TIMESTAMP(), INTERVAL -150 DAY) AND CURRENT_TIMESTAMP()

Technical Overview

The BigQuery plugin works by extracting the BigQuery table data into Google Storage using Google APIs, and then using Circus Train listeners to convert the BigQuery metadata into a Hive table object. The data is then replicated from source to replica using the metadata from this mocked Hive table.

Contact

Mailing List

If you would like to ask any questions about or discuss Circus Train BigQuery please join the main Circus Train mailing list at

https://groups.google.com/forum/#!forum/circus-train-user

Credits

The Circus Train BigQuery logo is licensed under the Creative Commons Attribution-Share Alike 4.0 International license. It includes an adaption of the Google BigQuery logo that is similarly licensed under the CC BY-SA 4.0 International license. The Circus Train logo uses the Ewert font by Johan Kallas under the SIL Open Font License (OFL).

Legal

This project is available under the Apache 2.0 License.

Hotels.com

Hotels.com open source contributions

Versions

Version
5.2.0 Nov 28, 2019
5.1.1 Sep 23, 2019
5.1.0 May 10, 2019
5.0.4 Dec 17, 2018
5.0.3 Dec 5, 2018
5.0.2 Nov 22, 2018
5.0.1 Nov 19, 2018
5.0.0 Nov 16, 2018
4.0.0 Nov 9, 2018
3.0.0 Sep 7, 2018
2.0.2 Jul 19, 2018
2.0.1 Jul 18, 2018
2.0.0 Jul 13, 2018
1.0.2 May 2, 2018
1.0.1 Feb 28, 2018