Connecting to AWS S3 Data with Athena
The Sisense Athena connector is currently in beta.
The Sisense Athena connector allows you to quickly connect to your Amazon S3 data to query and mashup data from Amazon S3. Amazon Athena allows you to tap into all your data in S3 without the need to set up complex processes to extract, transform, and load the data (ETL). Amazon Athena is an interactive query service that makes it easy to analyze data in Amazon S3 using standard SQL, which makes it easy for anyone with SQL skills to quickly analyze large-scale datasets. The data formats that Athena supports include CSV, JSON, Parquet, Avro, and ORC.
The connector supports the ability to preview and explore all catalogs and schema/databases that are listed under their catalogs.
The connector supports most field types, including array types. Sisense converts unrecognized types of data to strings, and issues a notification message about this.
The connector also supports accumulative builds by table, by field, and by a specific field that you set from the UI. This allows you to easily build only new data according to a specified index in your table without manually changing configuration files.
The connector allows you to use partitions to save time and money on the scanned data, see here.
You can connect to Amazon Athena either by providing your IAM Credentials or by authenticating via a connection string, where you can connect via your IAM Profile or Instance Profile (EC2).
For a list of performance-tuning tips for Amazon Athena, click here.
Importing Partitioned Data
If your Amazon S3 data is partitioned, the Sisense Athena connector allows you to use partitions on the scanned data. Since you pay for the queries that you run in Athena, this ability saves you time and money.
For example, see below the difference in the amount of data scanned with and without partitions:
Amount of Scanned Data
SELECT sum(l_tax)FROM lineitem_partionedWHERE l_shipdate = CAST('1992-02-02' as date)
SELECT sum(l_tax)FROM lineitem_partioned WHERE l_shipdate = CAST('1992-02-02' as date) AND dt = '1992-02-02'
To import data from Amazon S3:
In the Data page, open an ElastiCube or create a new ElastiCube.
In the Model Editor, click ; the Add Data dialog box is displayed.
Click Amazon Athena. The Athena Connect area is displayed.
If connecting using your IAM Credentials, select Basic.
- In Region, enter the region endpoint of your Athena instance. To find the region code from a region name, click here . Amazon Athena is supported in the regions listed here.
- In S3 Output Location , indicate the location where the query output files are downloaded. Athena will cache all query results in this location. For details, click here.
- In Access Key, provide your AWS access key. To find your AWS access key and secret key, click here.
- In Secret Key , provide your AWS secret key.
- (Optionally) In Additional Parameters, enter the appropriate configuration options by appending key-value pairs to the string (e.g., timeout=10;). This allows you to use all the parameters that are supported by the driver.
If connecting using other different authentication methods, click Advanced. Sisense uses connection strings to connect to Athena and import data into Sisense. Connection strings contain authentication parameters that the data source uses to verify your identity and what information you can export to Sisense.
Provide the connection string in the following structure:
jdbc:awsathena:AwsRegion=<region>;User= <AWS access key>;Password=<AWS secret key>;S3OutputLocation=<location>;
For details on building the connection string, click here.
If you are connecting to your EC2 instance, click EC2. EC2 is an Amazon service where you can install Sisense. This enables Sisense to run faster.
- In Region, indicate the region endpoint of your Athena instance. To find the region code from a region name, click here.
Amazon Athena is supported in the regions listed here.
- In S3 Output Location, indicate the location to which the query output files are downloaded. Athena will cache all query results in this location. For details, click here.
- In Additional Parameters, enter the appropriate configuration options, such as your credentials, by appending key-value pairs to the string (e.g., timeout=10;).
Click Next. A list of tables in the database are displayed. All tables and views associated with the database will appear in a new window.
From the Tables list, select the relevant table or view you want to work with. You can click next to the relevant table or click Preview to see a preview of the data inside it.
(Optional) Click + to customize the data you want to import with SQL. See Importing Data with Custom Queries for more information.
After you have selected all the relevant tables, click Done. The tables are added to your data model.
AWS places a limitation for query timeout which is 30 minutes. If your build takes longer, you can contact AWS Support to request a limit increase.
The number of DML queries you can submit simultaneously is 20 by default. DML queries include SELECT queries. You can contact AWS Support to request a limit increase.
The AWS authentication method Active Directory Federation Services using SAML 2.0 is not supported by the Sisense Athena connector.
While adding new tables, the Athena beta connector displays all the tables that exist in the data source, regardless of the user permissions/specified schema. As the connector may retrieve a long list of items, this can slow down the process of adding the tables.
If your builds are failing, there are several things you should check:
Port 444 is not open for outbound communication
Is your Athena in the same location as defined in your Athena settings
Is your data over 50GB
What is the timeout limit in AWS? If you build takes longer than 30 minutes, you can contact AWS Support to request a limit increase.