Hi guys,
Recently I noticed that after a DELETE on a clustered Hive table, a SELECT through the Hive Warehouse Connector (HWC) does not load all the data.
It seems that some bucket files are ignored after the DELETE.
In Beeline, all the data is still returned correctly.
I'm using HDP 3.1.5 with Spark2 2.3.2, Hive 3.1.0 and hive-warehouse-connector-assembly-1.0.0.3.1.5.0-152.jar.
See this example:
Hive >>
CREATE TABLE mdm_dev.test(
`id_pessoa` string,
`num_cpf_cnpj` string,
`nom_pessoa` string,
`dat_carga` string,
`ind_origem_criacao` string,
`ind_tipo_pessoa` string,
`id_carga` int)
CLUSTERED BY (
num_cpf_cnpj)
INTO 64 BUCKETS
STORED AS ORC
TBLPROPERTIES ('transactional'='true');
INSERT INTO mdm_dev.test (SELECT * FROM mdm_dev.original);
ANALYZE TABLE mdm_dev.test COMPUTE STATISTICS;
ANALYZE TABLE mdm_dev.test COMPUTE STATISTICS FOR COLUMNS;
SELECT COUNT(*) FROM mdm_dev.test;
180978025
Spark >>
import com.hortonworks.hwc.HiveWarehouseSession
val hive = HiveWarehouseSession.session(spark).build()
hive.executeQuery(" SELECT * FROM mdm_dev.test ").count
180978025
Up to this point everything is OK. Now, let's DELETE some data.
Hive >>
SELECT id_carga, COUNT(*) FROM mdm_dev.test GROUP BY id_carga;
42306 | 953595
...
DELETE FROM mdm_dev.test WHERE id_carga = 42306;
SELECT COUNT(*) FROM mdm_dev.test;
180024430
Spark >>
import com.hortonworks.hwc.HiveWarehouseSession
val hive = HiveWarehouseSession.session(spark).build()
hive.executeQuery(" SELECT * FROM mdm_dev.test ").count
37577751
A lot of data is missing from the DataFrame =/
Looking for the missing rows, I noticed that some bucket files were not read.
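For anyone hitting this: a way to narrow it down from the Hive side (a sketch, assuming default ACID compaction settings on HDP 3.x) is to check the table's compaction state and force a major compaction, which rewrites the base, delta and delete_delta files into a fresh base for every bucket:

```sql
-- Inspect pending/running/finished compactions for the warehouse
SHOW COMPACTIONS;

-- Force a major compaction of the affected table so every bucket
-- gets a single, fully merged base file
ALTER TABLE mdm_dev.test COMPACT 'major';
```

Compaction runs asynchronously in the metastore, so re-check SHOW COMPACTIONS until the request is finished before re-running the Spark count.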
Within the Spark session, after an UPDATE/DELETE, issue the command hive.executeUpdate("ALTER TABLE <tbl_nm> COMPACT 'major'") and wait a couple of seconds; Spark will then fetch all records.
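That workaround can be sketched entirely inside the Spark session, reusing only the HWC calls that already appear in this issue (the table and filter are the ones from the example above; note that compaction is asynchronous, so the final count is only correct once it has finished):

```scala
import com.hortonworks.hwc.HiveWarehouseSession

val hive = HiveWarehouseSession.session(spark).build()

// Delete through HWC (or run the DELETE in Beeline, as in the example)
hive.executeUpdate("DELETE FROM mdm_dev.test WHERE id_carga = 42306")

// Trigger a major compaction so HWC reads a fresh base for every bucket
hive.executeUpdate("ALTER TABLE mdm_dev.test COMPACT 'major'")

// After the compaction has finished, the count matches Beeline again
hive.executeQuery("SELECT * FROM mdm_dev.test").count
```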