Data sources
As mentioned in Connecting to remote data sources, you need to provide the Import tool with information about the data source in order for the tool to be able to load the tables from your remote data source.
The information required depends on the remote data source you want to connect to. In general it covers the same aspects, but on a detailed level it can differ from one data source to another. This page outlines where to find the required information in your remote data source.
Some data sources require specific roles/permissions to access the data. Exactly which permissions are required depends on the data source and how the data is organized. Therefore, it is recommended to consult the database administrator to set up the minimal required permissions for each case.
|
As the sources are third parties, this guidance is based on the information available at the time of writing and may not always reflect the current state of the third party data source. |
PostgreSQL, MySQL, SQL Server, and Oracle
All four data sources follow common patterns for RDBMS of host, port, and database schema. They all use Username and Password for authentication.
BigQuery
The following information is required to connect to BigQuery:
-
Project ID
-
Dataset name
-
Service Account key JSON
The Project ID and dataset name are available in the BigQuery Studio in the Google Cloud console. Browse to the BigQuery dataset you want to connect to and inspect the details to find the project ID and dataset name. Note that the dataset is sometimes prefixed with the project ID namespace, but not always.
The service account key JSON requires that you create a service account and then create a JSON key for the account. See Google Cloud Create service accounts for information on how to set up a service account.
You need to add the following roles to the Permissions of the service account:
-
BigQuery Data Viewer -
BigQuery Read Session User -
BigQuery Job User
Once the account is created, you need to create a private key for the account. This is accessed from the Keys tab of the service account, use Add key and select key type JSON and download the key.
Databricks
The following information is required to connect to Databricks:
-
Server hostname
-
HTTP Path
-
Catalog
-
Schema
-
Personal Access Token
The server hostname and the HTTP Path can be found under SQL Warehouses → Connection details. See the Databricks documentation for more information. Note that the instructions in the link are for AWS, but the same information applies for Azure and GCP. Refer to Azure, and GCP.
The catalog is found in the left-hand menu in the Databricks workspace, browse for the catalog you want to use. If you expand the selected catalog with the down arrow, you can find the schema.
The personal access token can be created from the User settings menu. Navigate to Developer and use Manage under Access tokens to generate a new token.
Redshift
The following information is required to connect to Redshift:
-
Server endpoint
-
Port
-
Database
-
Schema
-
IAM Credentials or
-
Password
The server endpoint is found in the dashboard. In addition to the actual endpoint, it contains the port and database. Copy the endpoint up until the colon, then copy the port number after the colon, and the database is found after the slash. The port and database should not contain any punctuation (i.e. no colon or slash).
The credentials can be either IAM credentials or username/password.
Snowflake
The following information is required to connect to Snowflake:
-
Host
-
Port
-
Database
-
Schema
-
Warehouse
-
Role
-
Username
-
Private Key
-
Private Key Passphrase
The host can be found in the account details of the Snowflake account. Open your profile in Snowflake, and navigate to Account to access the account details.
The database name can be found in Catalog in the left-hand menu in the Snowflake workspace and the schema is listed under each database name.
Warehouse can be found in Compute in the left-hand menu. If none is specified, a default is used.
Role is optional and if not specified, default permissions should apply. These are:
GRANT USAGE ON WAREHOUSE $WAREHOUSE_NAME TO ROLE $ROLE;
GRANT USAGE ON DATABASE $DATABASE TO ROLE $ROLE;
GRANT USAGE ON SCHEMA $DATABASE.$SCHEMA TO ROLE $ROLE;
GRANT SELECT ON ALL TABLES IN SCHEMA $DATABASE.$SCHEMA TO ROLE $ROLE;
However, note that the necessary permissions depend on how the data is organized in Snowflake. It is recommended to consult the DB administrator to set up the minimal required permissions for each case.
The credentials are the username and private key of a user in Snowflake.
Google Cloud Storage
The following information is required to connect to Google Cloud Storage:
-
Bucket name
-
Bucket path
-
HMAC Access Key
-
HMAC Secret
In the Google Cloud Console, under Buckets, navigate to a specific resource in a bucket to see its bucket path. The bucket path can optionally contain glob patterns to reference multiple fles, but must always reference to parquet or csv files. Each file must be given a temporary table name, which is used to reference the file in the mapping UI.
For Google service accounts with access to the cloud bucket, see the Google documentation on how to create an HMAC key.