Hive DML: aggregation, grouping, built-in functions, and static/dynamic partitioned tables

2023-11-16

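DML: query-related statements

desc xxx
desc formatted xxxx
select * from xxxx          -- you can also list specific columns here; at work you normally name the columns
select * from xxxx where xx=xx
select * from xxxx where sal between 800 and 16000;          -- limit N can be added to cap the number of rows returned
select * from xxxx where sal in ("xxx","zzz");
in matches any value in the list; the values can be strings, such as names. not in matches values that are not in the list. where comm is not null keeps only rows that have a value. != means "not equal".

When you process logs later on, many of them will be malformed, so you have to account for these different cases.
These basic queries do not launch a MapReduce job.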

Hive is a data warehouse built on top of Hadoop.
sql ==> Hive ==> MapReduce

Aggregate functions: max min sum avg count

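Grouping: every column that appears in select must appear either in the group by clause or inside an aggregate function.
Average salary per department: select deptno, avg(sal) from ruoze_emp group by deptno;
Highest salary per department and job: select deptno, job, max(sal) from ruoze_emp group by deptno, job;
Departments whose average salary is greater than 2000:
select deptno,avg(sal) from ruoze_emp group by deptno having avg(sal)>2000;
where filters individual rows before grouping; having filters the groups after grouping.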
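
The difference between where and having is easiest to see by running both on the same data. The following is a minimal sketch, not Hive itself: it uses Python's built-in sqlite3 module as a stand-in SQL engine, and the sample rows are hypothetical.

```python
import sqlite3

# Hypothetical sample rows standing in for ruoze_emp (deptno, job, sal).
rows = [
    (10, "CLERK", 1300.0),
    (10, "MANAGER", 2450.0),
    (10, "PRESIDENT", 5000.0),
    (20, "CLERK", 800.0),
    (20, "ANALYST", 3000.0),
    (30, "CLERK", 950.0),
    (30, "SALESMAN", 1250.0),
]

conn = sqlite3.connect(":memory:")
conn.execute("create table ruoze_emp (deptno int, job text, sal real)")
conn.executemany("insert into ruoze_emp values (?, ?, ?)", rows)

# having: aggregate first, then keep only the groups that pass the test.
having_result = conn.execute(
    "select deptno, avg(sal) from ruoze_emp "
    "group by deptno having avg(sal) > 2000 order by deptno"
).fetchall()

# where: drop individual rows first, then aggregate what is left.
where_result = conn.execute(
    "select deptno, avg(sal) from ruoze_emp "
    "where sal > 2000 group by deptno order by deptno"
).fetchall()

print(having_result)
print(where_result)
```

With having, only department 10 survives because only its average exceeds 2000. With where, department 20 also appears, because its single remaining row (3000) is averaged on its own.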

case when then works like if-else:
if a condition holds, return the corresponding value.

select ename, sal,
case
when sal>1 and sal<=1000 then 'LOWER'
when sal>1000 and sal<=2000 then 'MIDDLE'
when sal>2000 and sal<=4000 then 'HIGH'
else 'HIGHEST' end
from ruoze_emp;

This is often used when producing reports.
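
To make the branch order explicit, here is a small Python sketch of the same banding logic. The function name salary_band is hypothetical; it only mirrors the CASE expression above and shows which inputs end up in the else branch.

```python
def salary_band(sal):
    """Mirror the CASE expression: the first matching branch wins."""
    if sal is not None and 1 < sal <= 1000:
        return "LOWER"
    if sal is not None and 1000 < sal <= 2000:
        return "MIDDLE"
    if sal is not None and 2000 < sal <= 4000:
        return "HIGH"
    # else branch: reached for sal > 4000, but also for sal <= 1 or a NULL sal
    return "HIGHEST"


bands = {sal: salary_band(sal) for sal in (800, 1000, 1500, 3000, 5000)}
print(bands)
```

Note that a salary of 1 or less, or a missing salary, also falls through to 'HIGHEST', which may not be what a report intends.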

Using union all

select count(1) from ruoze_emp where deptno=10
union all
select count(1) from ruoze_emp where deptno=20;

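This is used a lot in data skew scenarios:
a = a1 union all a2
Table a is skewed. Split it into a1 (the skewed part) and a2 (the non-skewed part).
Compute the result for a1 and for a2 separately, store each in a temporary table, then union all the two to get the final result.
In short: separate the normal data from the abnormal data, process each on its own, and combine the results.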
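
The split-then-combine idea can be sketched in plain Python. This is only a model of the data flow under assumed inputs: the records and the set of hot keys are hypothetical, and in Hive each part would be its own query writing to a temporary table.

```python
from collections import Counter

# Hypothetical records keyed by deptno; key 10 is the skewed ("hot") key.
records = [10] * 8 + [20, 20, 30]
hot_keys = {10}

# a1: the skewed part, a2: the normal part.
a1 = [k for k in records if k in hot_keys]
a2 = [k for k in records if k not in hot_keys]

# Aggregate each part on its own.
a1_counts = Counter(a1)
a2_counts = Counter(a2)

# union all: concatenate the two result sets without de-duplicating.
final = sorted(list(a1_counts.items()) + list(a2_counts.items()))
print(final)
```

Because a key lands in exactly one of the two parts, concatenating the partial results gives the same answer as aggregating the whole table at once.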

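Finding Hive functions:
On the official Hive wiki: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF
Hive home page > Hive wiki > Operators and UDFs lists the functions and operators.
In the Hive CLI (here, in an Xshell session):

>show functions;          lists all built-in functions that Hive supports
To see how a specific function is used:
>desc function extended max;          put the function name at the end

Date and time functions
Try the date functions in a select and read the official documentation (hive3 video, around the 40-minute mark).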

Hive function cast: type conversion

cast(value as TYPE)
select comm, cast(comm as int) from ruoze;

If the conversion fails, the result is null.
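
The "null instead of an error" behaviour can be modelled in a few lines of Python. This is a rough sketch, not Hive's implementation; cast_as_int is a hypothetical helper and None stands in for NULL.

```python
def cast_as_int(value):
    """Rough model of cast(value as int): None (NULL) when conversion fails."""
    if value is None:
        return None
    try:
        return int(float(value))  # a double such as 300.0 becomes 300
    except (TypeError, ValueError, OverflowError):
        return None


print(cast_as_int(300.0))
print(cast_as_int("abc"))
```

The practical consequence is that a bad value does not stop the query; it silently becomes null, so it is worth checking for unexpected nulls after a cast.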

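substr
select substr("adfghhs",2,3) from ruoze;
Starting at the second character, take three characters.

substring is a synonym for substr.

concat joins two or more strings (or values) into one.
select concat("ruoze","jeson") from ruoze;
concat_ws
select concat_ws(".","192","162","135") from ruoze;
This joins the values with a dot as the separator. It is used very often at work.

length returns the length of a value.
select length("192.168.0.20.0") from ruoze;
select ename, length(ename) from ruoze;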
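
For readers used to 0-based indexing, the following Python sketch models the three string functions on the same literals. The helper names are hypothetical stand-ins, and substr is modelled only for a positive start position.

```python
def substr(s, start, length):
    """Model of substr(s, start, length) for a positive, 1-based start."""
    return s[start - 1:start - 1 + length]


def concat_ws(sep, *parts):
    """Model of concat_ws: join the parts with the separator."""
    return sep.join(parts)


piece = substr("adfghhs", 2, 3)
ip = concat_ws(".", "192", "162", "135")
n = len("192.168.0.20.0")
print(piece, ip, n)
```

The point to remember is that position 2 means the second character, so the result starts at "d", not at "f".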

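explode turns the elements of an array into separate rows.

1,doudou,chemistry:physics:math:chinese
2,dasheng,chemistry:math:biology:physiology:health
3,rachel,chemistry:chinese:english:pe:biology

create table ruoze_student(
id int,
name string,
subjects array<string>          -- array type
)row format delimited fields terminated by ','
COLLECTION ITEMS TERMINATED BY ':';
The last clause sets the separator between the items of a collection; in our data the array elements are separated by ':'.

collection
n. gathering, accumulation; [taxation] levy; collected items; fundraising

Load the data, which contains collection values:

load data local inpath '/home/hadoop/data/student.txt' into table ruoze_student;

We want the distinct set of subjects:

select distinct s.sub from (select explode(subjects) as sub from ruoze_student) s;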
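
As a rough illustration of what explode followed by distinct computes, here is a Python sketch over rows in the same layout as the sample file. The subject names are illustrative, and the parsing only mimics the two delimiters declared in the table definition.

```python
raw_lines = [
    "1,doudou,chemistry:physics:math:chinese",
    "2,dasheng,chemistry:math:biology:physiology:health",
    "3,rachel,chemistry:chinese:english:pe:biology",
]


def parse(line):
    # fields are separated by ',' and collection items by ':'
    sid, name, subjects = line.split(",")
    return int(sid), name, subjects.split(":")


rows = [parse(line) for line in raw_lines]

# explode: one output row per array element
exploded = [sub for _, _, subjects in rows for sub in subjects]

# distinct: drop duplicates (sorted only to make the output deterministic)
distinct_subjects = sorted(set(exploded))
print(len(exploded), distinct_subjects)
```

Three input rows become fourteen exploded rows, and de-duplication reduces them to nine subjects.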

Interview question: implement wordcount with Hive.

First create a table:

create table ruoze_wc(
sentence string
);

Load the data:

load data local inpath '/home/hadoop/data/student.txt' into table ruoze_wc;

We run the wordcount on this table.
Step 1, split each line into an array: select split(sentence,",") from ruoze_wc;
Step 2, expand each array into rows: select explode(split(sentence,",")) from ruoze_wc;
Step 3, group and count:

select word, count(1) as c
from
(
select explode(split(sentence,",")) as word from ruoze_wc
) t          -- the subquery needs an alias (t) even though it is never referenced
group by word
order by c desc;          -- sort by count, descending
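
The three steps map directly onto a few lines of Python, which can help when checking the query's result by hand. This is a sketch with hypothetical input sentences, not the contents of the real file.

```python
from collections import Counter

# Hypothetical rows of ruoze_wc, one sentence per row.
sentences = ["hello,world,hello", "hive,hello"]

# Steps 1 and 2: split each sentence on "," and flatten into one word per row.
words = [w for s in sentences for w in s.split(",")]

# Step 3: group by word, count, and order by the count descending.
counts = Counter(words).most_common()
print(counts)
```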

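Partitioned table: a table split into partitions by the value of some column.

What is the point of partitioning?

Suppose we want the data from 22:00 to 23:59 on 2018-10-21:
starttime > 201810212200 and starttime < 201810212359

access.log is one very large table that holds every day's data.

Without partitions, the query reads the table and scans all of it, which performs poorly.
So in practice the table is partitioned.
The data then lives under a path such as
/user/hive/warehouse/access/d=20181021          (d is the daily partition)
which saves a lot of IO.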
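
The IO saving comes from reading one directory instead of all of them. The following toy model illustrates that, assuming a hypothetical set of daily directories; the dictionary stands in for HDFS and the row values are placeholders.

```python
# Hypothetical layout: partition directory -> rows stored under it.
warehouse = {
    "/user/hive/warehouse/access/d=20181020": ["row-a", "row-b"],
    "/user/hive/warehouse/access/d=20181021": ["row-c"],
    "/user/hive/warehouse/access/d=20181022": ["row-d", "row-e", "row-f"],
}


def scan(day=None):
    """Return (rows, directories_read); with a day filter only one directory is read."""
    if day is None:
        dirs = list(warehouse)
    else:
        dirs = [p for p in warehouse if p.endswith(f"/d={day}")]
    rows = [r for p in dirs for r in warehouse[p]]
    return rows, len(dirs)


full_rows, full_dirs = scan()
pruned_rows, pruned_dirs = scan("20181021")
print(full_dirs, pruned_dirs)
```

With a year of daily data the unfiltered scan would touch hundreds of directories, while the filtered one still touches a single directory.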

Creating a partitioned table

create table order_partition(
order_Number string,
event_time string
)PARTITIONED BY(event_month string)          -- partition column
row format delimited fields terminated by '\t';
Load the data:
load data local inpath '/home/hadoop/data/order.txt' into table order_partition PARTITION (event_month='2014-05');          -- when loading into a partitioned table you must specify the partition

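Note:

The load may fail at this point. Check the log:
Switch to the root user, then
cd /tmp/hadoop
ls
tail -200f hive.log
The log shows where it failed.
Fix: changing the MySQL settings does not affect tables that already exist, so you need to convert the character set of the existing metastore tables.
First shut down Hive.
Switch to the MySQL database.
mysql> paste the following statements:

use ruoze_d5;
alter table PARTITIONS convert to character set latin1;
alter table PARTITION_KEYS convert to character set latin1;

Restart Hive.

hive>load data local inpath '/home/hadoop/data/order.txt' into table order_partition PARTITION (event_month='2014-05');
Load again; this time the data goes in.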

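Then look on HDFS:
hadoop fs -ls /user/hive/warehouse          the listing now looks different
A partition directory is named <partition column>=<partition value>; you should know this directory structure.

When querying a partitioned table in Hive, include the partition column in the where clause; otherwise it is still a full scan.
select * from order_partition where event_month='2014-05';

Now an experiment: create a partition directory directly on HDFS: hadoop fs -mkdir -p /user/hive/warehouse/order_partition/event_month=2014-06
The partition value differs from the existing one. Move the file from the earlier event_month='2014-05' directory into it.
Then go back to Hive and query: select * from order_partition where event_month='2014-06';
Nothing is found, because the metastore has no record of this partition.
hive> msck repair table order_partition;
Query again and the data is there.
However, avoid this command: it checks all partitions and is very slow, so do not use it in production.
In production, always register the partition explicitly with the following command:

alter table order_partition add partition(event_month='2014-07');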
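
Why a directory alone is not enough can be shown with a toy model: queries consult the metastore's list of partitions, not the file system. The Table class below is hypothetical and heavily simplified; it only contrasts registering one partition with re-checking every directory.

```python
class Table:
    def __init__(self):
        self.metastore = set()   # partitions Hive knows about
        self.hdfs = {}           # partition directory name -> rows on disk

    def add_partition(self, name):
        # like "alter table ... add partition(...)": registers one partition
        self.metastore.add(name)
        self.hdfs.setdefault(name, [])

    def msck_repair(self):
        # like "msck repair table ...": walks every directory on disk
        self.metastore.update(self.hdfs)

    def select(self, name):
        # a query only reads partitions registered in the metastore
        return self.hdfs[name] if name in self.metastore else []


t = Table()
# Directory created and filled outside Hive (hadoop fs -mkdir, then a file move).
t.hdfs["event_month=2014-06"] = ["order-1"]

before = t.select("event_month=2014-06")
t.add_partition("event_month=2014-06")
after = t.select("event_month=2014-06")
print(before, after)
```

Registering the single partition touches one entry, whereas the repair walks every directory, which is the cost the warning above refers to.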

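To list the partitions of a table:
show partitions order_partition;

Now create another table with multi-level partitions, as used in production:

create table order_mulit_partition(
orderNumber string,
event_time string
)PARTITIONED BY(event_month string, step string)          -- multi-level partitioning means several partition columns; this one has two levels
row format delimited fields terminated by '\t';

How do we load data into it?
load data local inpath '/home/hadoop/data/order.txt' into table order_mulit_partition PARTITION (event_month='2014-05',step='1');          -- the partition spec must match the partition columns declared above

Single-level / multi-level partitions ==> static partitioning: when you load data you must spell out every partition column value.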
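
With several partition columns, each level becomes one nested directory, in the order the columns were declared. The sketch below builds such a path; partition_path is a hypothetical helper, and the warehouse root is assumed to be the default /user/hive/warehouse.

```python
def partition_path(table_dir, spec):
    """Build a partition directory; spec is an ordered list of (column, value)."""
    return "/".join([table_dir] + [f"{col}={val}" for col, val in spec])


path = partition_path(
    "/user/hive/warehouse/order_mulit_partition",  # assumed default location
    [("event_month", "2014-05"), ("step", "1")],
)
print(path)
```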

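show create table ruoze_emp;
prints the create statement for the table:

CREATE TABLE `ruoze_emp`(`empno` int, `ename` string, `job` string, `mgr` int, `hiredate` string, `sal` double, `comm` double, `deptno` int)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t';

Based on that, we create a partitioned table:

CREATE TABLE `ruoze_emp_partition`(`empno` int, `ename` string, `job` string, `mgr` int, `hiredate` string, `sal` double, `comm` double)
partitioned by(`deptno` int)          -- deptno is the partition column, so it must not also appear in the regular column list
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t';

What is the requirement?
Use the department number as the partition column and write the whole table into the partitions.

The previous (static) approach:

insert into table ruoze_emp_partition PARTITION(deptno=10) select empno,ename,job,mgr,hiredate,sal,comm from ruoze_emp where deptno=10;
You cannot write * here; the columns must be listed one by one, because the partition column is not part of the select list.

Suppose there are 1000 distinct deptno values, that is, 1000 partitions. The approach above does not scale, and that is the drawback of static partitioning.

insert overwrite table ruoze_emp_partition PARTITION(deptno)          -- only the column name, no value
select empno,ename,job,mgr,hiredate,sal,comm,deptno from ruoze_emp;
The partition column must come last in the select list; with two partition columns, they must appear last in the matching order.

This raises an error at first.
>set hive.exec.dynamic.partition.mode=nonstrict;          the error message tells you to run this statement
To apply it globally, configure it in hive-site.xml.
Run the insert again and it succeeds.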
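
What dynamic partitioning does with the trailing column can be sketched in Python: each row is routed to a partition chosen by its last value. The rows below are hypothetical and shortened to three columns for readability.

```python
from collections import defaultdict

# Hypothetical rows in select order, with deptno (the partition column) last.
rows = [
    (7369, "SMITH", 20),
    (7499, "ALLEN", 30),
    (7782, "CLARK", 10),
    (7788, "SCOTT", 20),
]

partitions = defaultdict(list)
for *data_cols, deptno in rows:
    # The trailing column picks the partition; it is not stored in the data files.
    partitions[f"deptno={deptno}"].append(tuple(data_cols))

print(sorted(partitions))
```

One statement therefore fills every partition that occurs in the data, which is what removes the need for one insert per deptno.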

Original link: https://hbdhgg.com/3/173669.html
