
Not all data is loaded from clustered Hive tables after a delete #281

Open
davyam opened this issue Mar 5, 2020 · 3 comments

Comments


davyam commented Mar 5, 2020

Hi guys,

Recently I noticed that after a DELETE on a clustered Hive table, a SELECT through the Hive Warehouse Connector (HWC) does not load all the data.

It seems that some bucket files are ignored after the DELETE.

In Beeline, all the data is still returned correctly.

I'm using HDP 3.1.5 with Spark2 2.3.2, Hive 3.1.0 and hive-warehouse-connector-assembly-1.0.0.3.1.5.0-152.jar.

See this example:

Hive >>

CREATE TABLE mdm_dev.test(                                            
  `id_pessoa` string,                                               
  `num_cpf_cnpj` string,                                                      
  `nom_pessoa` string,                                                        
  `dat_carga` string,                                                         
  `ind_origem_criacao` string,                                                
  `ind_tipo_pessoa` string, 
  `id_carga` int)                                                  
CLUSTERED BY (                                                                
  num_cpf_cnpj)                                                               
INTO 64 BUCKETS
STORED AS ORC                
TBLPROPERTIES ('transactional'='true');

INSERT INTO mdm_dev.test (SELECT * FROM mdm_dev.original);

ANALYZE TABLE mdm_dev.test COMPUTE STATISTICS;

ANALYZE TABLE mdm_dev.test COMPUTE STATISTICS FOR COLUMNS;

SELECT COUNT(*) FROM mdm_dev.test; 
180978025

Spark >>

import com.hortonworks.hwc.HiveWarehouseSession

val hive = HiveWarehouseSession.session(spark).build()

hive.executeQuery(" SELECT * FROM mdm_dev.test ").count
180978025

Up to this point everything is OK. Now, let's DELETE some data.

Hive >>

SELECT id_carga, COUNT(*) FROM mdm_dev.test GROUP BY id_carga;
42306 | 953595
...

DELETE FROM mdm_dev.test WHERE id_carga = 42306;

SELECT COUNT(*) FROM mdm_dev.test;
180024430

Spark >>

import com.hortonworks.hwc.HiveWarehouseSession

val hive = HiveWarehouseSession.session(spark).build()

hive.executeQuery(" SELECT * FROM mdm_dev.test ").count
37577751

A lot of data is missing from the DataFrame =/

Looking for the missing rows, I noticed that some bucket files had not been read.

@guerinclement

I got the same behavior after an UPDATE statement when the table had been created with a "CREATE TABLE ... AS SELECT ..." statement.

@Anaya0123

Thanks for sharing, this will be really helpful for Hadoop training.

@tamimasyed

Within the Spark session, after an UPDATE/DELETE, issue the command hive.executeUpdate("alter table <tbl_nm> compact 'major'") and wait a couple of seconds; Spark will then fetch all records.
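Applied to the table from the example earlier in this thread, the workaround would look roughly like this (a sketch only, not verified against a cluster; it assumes the same HWC `executeUpdate`/`executeQuery` calls already shown above, and that the major compaction has finished before re-reading):

```scala
import com.hortonworks.hwc.HiveWarehouseSession

val hive = HiveWarehouseSession.session(spark).build()

// Trigger a major compaction so the delete deltas are merged
// into new base files for every bucket.
hive.executeUpdate("ALTER TABLE mdm_dev.test COMPACT 'major'")

// Once the compaction has completed, the count should match Beeline again.
hive.executeQuery("SELECT * FROM mdm_dev.test").count
```

Note that `COMPACT 'major'` only queues the compaction; on a busy cluster it can take longer than a couple of seconds, so you may need to check SHOW COMPACTIONS before re-reading.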
