HDInsight allows users to run data transformation and interactive queries on different types of data. Interactive Hive (Hive on LLAP) is still in preview and as a result not available in Europe, old good Hive has to be tweaked to satisfy interactive queries.
In HDInsight data is stored separately and there can be many HDInsight clusters on the same storage accounts. One use case is to have a separate cluster for Data Scientists (interactive queries, data exploration) and another one for ETL jobs. The clusters can run independently, but preferably on different time slots (Blob Storage throughput limitations).
USE CASES
Interactive
- Common patterns: short running jobs over refined data
- Main problems: how to use BI tools, how to utilize produced datasets
- Optimization: as in ETL + use different setup than ETL cluster (see below)
ETL
- Common patterns: fire and forget, long running jobs, full table scans
- Main problems: how to schedule, make operationale, cost based optimizations
- Optimization: tweak joins, mappers, reducers, memory settings, etc. all you can find
CLUSTER SETTINGS
(You can change those settings in Ambari UI/Hive/Configuration or change corresponding configuration files)
Interactive
- hive.execution.engine = tez;
- hive.server2.tez.initialize.default.sessions = true
- hive.server2.tez.default.queues = default
- hive.server2.tez.sessions.per.default.queue = max concurrent queries
- hive.prewarm.enabled = true
- hive.prewarm.numcontainers 1-5
- tez.am.session.min.held-containers 1-5
ETL
- hive.execution.engine = tez; (Rebuild index requriesmr)
- hive.server2.tez.initialize.default.sessions = false
- hive.server2.tez.default.queues = default
- hive.server2.tez.sessions.per.default.queue = 1
- hive.prewarm.enabled = false
- hive.prewarm.numcontainers = 0
- tez.am.session.min.held-containers = 0
JOIN STRATEGY
Shuffle (default)
- Slow, but works everytime Map (broadcast)
- Very fast, but limited
- One table has to fit in memory
- Understand Optimize Auto Join Conversion
- Default size for in-memory table is 10MB
- For star schemas consider loading dimension tables into memory
- In case of many joins, n-1 tables should fit in memory Bucket
- Very efficient, but difficult to setup
- Requries both tables to be bucketed and sorted on the same column
HIVE EXECUTION
(WebHCat vs HiveServer2)
- HiveServer2 starts queries faster
- WebHCar has more detailed history
- HiveServer2 is synchronous while WebHCat is asynchronous
- In Visual Studio HiveServer2 is called “Interactive”while WebHCat is called “Batch”
EXECUTION ENGINE OPTIMIZATION
Type | Recommended | HDI Default |
Joins | Bucket join/Sort Merge join | Shuffle |
hive.auto.convert.join.noconditionaltask.size | 1/3 of -Xmx value | Auto-Tuned |
tez.grouping.min-size | Decrease for better latency; Increase for more throughput | 16777216 |
tez.grouping.max-size | Decrease for better latency; Increase for more throughput | 1073741824 |
hive.exec.reducers.bytes.per.reducer | Decrease if reducers are the bottleneck | 256MB |
hive.cbo.enable | true but need to rewrite tables | True |
tez.am.resource.memory.mb | 4GB upper bound for most | Auto-Tuned |
tez.session.am.dag.submit.timeout.secs | 300+ | 300 |
tez.am.container.idle.release-timeout-min.millis | 20000+ | 10000 |
tez.am.container.idle.release-timeout-max.millis | 40000+ | 20000 |
How to know what settings are used during TEZ query execution? Ambari/Tez view/«Query application ID»/ App Configuration/
More info: HDInsight performance Hive optimization Hive optimization #2 Hive join strategies Hive joins