
Problems Caused by String and Decimal Types in Spark

Background

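From Spark 2 through Spark 3, when a String is compared with a Decimal, Spark implicitly casts both sides to Double. The rewritten filter then wraps the column in a cast, so it cannot be pushed down to the data source as a data filter. Related community tickets: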

[SPARK-17913][SQL] compare atomic and string type column may return confusing result
[SPARK-22469][SQL] Accuracy problem in comparison with string and numeric
SPARK-29274: Should not coerce decimal type to double type when it’s join column
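
The accuracy concern behind these tickets can be illustrated without Spark. The sketch below is plain Scala on `java.math.BigDecimal`, not Spark code; the object name, the helper names, and the two sample values are made up for illustration. It shows that two different decimal values can collapse to the same Double, so an equality check performed on doubles may report a match that an exact decimal comparison would reject.

```scala
import java.math.{BigDecimal => JBigDecimal}

object DoubleComparisonSketch {
  // Compare two decimal strings after converting both to Double,
  // which is what a comparison coerced to double effectively does.
  def equalAsDouble(a: String, b: String): Boolean =
    new JBigDecimal(a).doubleValue == new JBigDecimal(b).doubleValue

  // Compare the same strings as exact decimals.
  def equalAsDecimal(a: String, b: String): Boolean =
    new JBigDecimal(a).compareTo(new JBigDecimal(b)) == 0

  def main(args: Array[String]): Unit = {
    // Hypothetical values that differ only in the 19th fractional digit.
    val a = "1.0000000000000000001"
    val b = "1.0000000000000000002"
    println(equalAsDouble(a, b))  // true: both round to the double 1.0
    println(equalAsDecimal(a, b)) // false: the exact values differ
  }
}
```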

Test Queries

Query 1
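// withTable is a helper from Spark's SQL test utilities; it drops t1 when the block ends.
withTable("t1") {
  // salary takes the values 0.10, 1.10, ..., 99.10
  sql("CREATE TABLE t1 USING PARQUET " +
    "SELECT cast(id + 0.1 as decimal(13,2)) as salary FROM range(0, 100)")
  // Comparing the decimal column with a string literal triggers the implicit cast to double.
  sql("select * from t1 where salary = '12.1' ").collect()
}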

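In this query the filter casts the Decimal column to Double, so the equality predicate cannot be pushed down. PushedFilters contains only IsNotNull(salary):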

== Physical Plan ==
*(1) Project [salary#276]
+- *(1) Filter (isnotnull(salary#276) AND (cast(salary#276 as double) = 12.1))
   +- *(1) ColumnarToRow
      +- FileScan parquet default.t1[salary#276] Batched: true, DataFilters: [isnotnull(salary#276), (cast(salary#276 as double) = 12.1)], Format: Parquet, Location: InMemoryFileIndex[file:/Users/wakun/ws/ebay/spark3/spark-warehouse/org.apache.spark.sql.execution…, PartitionFilters: [], PushedFilters: [IsNotNull(salary)], ReadSchema: struct<salary:decimal(13,2)>, UsedIndexes: []
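
Why a cast around the column blocks pushdown can be sketched with a toy model. Everything below (`Expr`, `Column`, `Cast`, `translate`, `pushed`) is hypothetical and only imitates the general idea that a source-level filter is produced when a predicate has the shape "column compared with literal". It is not Spark's Catalyst code and makes no claim about Spark's actual rules beyond what the plan above shows.

```scala
object PushdownSketch {
  // Toy expression tree; these are NOT Spark's Catalyst classes.
  sealed trait Expr
  final case class Column(name: String) extends Expr
  final case class Literal(value: Any) extends Expr
  final case class Cast(child: Expr, toType: String) extends Expr
  final case class EqualTo(left: Expr, right: Expr) extends Expr
  final case class IsNotNull(child: Expr) extends Expr

  // Produce a source-level filter only when the predicate refers to a bare column.
  def translate(e: Expr): Option[String] = e match {
    case IsNotNull(Column(n))           => Some(s"IsNotNull($n)")
    case EqualTo(Column(n), Literal(v)) => Some(s"EqualTo($n,$v)")
    case _                              => None // e.g. the column is wrapped in a Cast
  }

  // Each conjunct is translated on its own; untranslatable ones are simply skipped.
  def pushed(conjuncts: Seq[Expr]): Seq[String] =
    conjuncts.flatMap(e => translate(e).toList)

  def main(args: Array[String]): Unit = {
    val query1 = Seq(
      IsNotNull(Column("salary")),
      EqualTo(Cast(Column("salary"), "double"), Literal(12.1)))
    println(pushed(query1).mkString("[", ", ", "]")) // [IsNotNull(salary)]

    val typedLiteral = Seq(
      IsNotNull(Column("salary")),
      EqualTo(Column("salary"), Literal(new java.math.BigDecimal("12.10"))))
    println(pushed(typedLiteral).mkString("[", ", ", "]")) // [IsNotNull(salary), EqualTo(salary,12.10)]
  }
}
```

In this model the only way to get the equality pushed is to keep the column bare and put the conversion on the literal side.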

Query 2
sql("select * from t1 where salary = cast('12.1' as decimal) ").collect()

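This form is wrong. With no precision or scale given, cast('12.1' as decimal) produces a decimal(10,0), so 12.1 is rounded to 12. The filter therefore compares salary with 12.00, and the row whose salary is 12.10 is not returned: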

== Physical Plan ==
*(1) Project [salary#276]
+- *(1) Filter (isnotnull(salary#276) AND (salary#276 = 12.00))
   +- *(1) ColumnarToRow
      +- FileScan parquet default.t1[salary#276] Batched: true, DataFilters: [isnotnull(salary#276), (salary#276 = 12.00)], Format: Parquet, Location: InMemoryFileIndex[file:/Users/wakun/ws/ebay/spark3/spark-warehouse/org.apache.spark.sql.execution…, PartitionFilters: [], PushedFilters: [IsNotNull(salary), EqualTo(salary,12.00)], ReadSchema: struct<salary:decimal(13,2)>, UsedIndexes: []
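
The effect of the missing scale can be reproduced with plain decimal arithmetic. The following is a minimal sketch on `java.math.BigDecimal`, not Spark itself. `castToDecimal` and `matching` are hypothetical helpers, HALF_UP rounding is an assumption made for the illustration, and precision overflow is ignored. It rebuilds the 100 salary values of the test table and counts how many rows each literal matches.

```scala
import java.math.{BigDecimal => JBigDecimal, RoundingMode}

object DecimalCastSketch {
  // Imitates cast(<string> as decimal(p, scale)) by rounding to the target scale.
  // Assumption: HALF_UP rounding; the precision p is not checked here.
  def castToDecimal(s: String, scale: Int): JBigDecimal =
    new JBigDecimal(s).setScale(scale, RoundingMode.HALF_UP)

  // The test table: id + 0.1 for id in [0, 100), held with scale 2.
  val salaries: Seq[JBigDecimal] =
    (0 until 100).map { id =>
      JBigDecimal.valueOf(id.toLong).add(new JBigDecimal("0.1")).setScale(2, RoundingMode.HALF_UP)
    }

  // Rows whose salary is numerically equal to the literal.
  def matching(literal: JBigDecimal): Seq[JBigDecimal] =
    salaries.filter(_.compareTo(literal) == 0)

  def main(args: Array[String]): Unit = {
    val scaleZero = castToDecimal("12.1", 0) // like decimal(10,0): 12
    val scaleTwo  = castToDecimal("12.1", 2) // like decimal(13,2): 12.10
    println(scaleZero)                 // 12
    println(matching(scaleZero).size)  // 0
    println(scaleTwo)                  // 12.10
    println(matching(scaleTwo).size)   // 1
  }
}
```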

Query 3
sql("select * from t1 where salary = cast('12.1' as decimal(13,2)) ").collect()

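This is the correct form. Casting the string to decimal(13,2), the column's own type, keeps the literal at 12.10, no cast is applied to the column, and EqualTo(salary,12.10) appears in PushedFilters: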

== Physical Plan ==
*(1) Project [salary#276]
+- *(1) Filter (isnotnull(salary#276) AND (salary#276 = 12.10))
   +- *(1) ColumnarToRow
      +- FileScan parquet default.t1[salary#276] Batched: true, DataFilters: [isnotnull(salary#276), (salary#276 = 12.10)], Format: Parquet, Location: InMemoryFileIndex[file:/Users/wakun/ws/ebay/spark3/spark-warehouse/org.apache.spark.sql.execution…, PartitionFilters: [], PushedFilters: [IsNotNull(salary), EqualTo(salary,12.10)], ReadSchema: struct<salary:decimal(13,2)>, UsedIndexes: []
