Tips

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
distinct works for all columns;
limit 5 offset 6 (jump start from line 7 )
-- this is the comments
/* Common comment format for
different language */
order by c1 desc, c2; --asc is the default
order by 1,2; -- this also works
the A and a is the same within the dictionary, so it works in DBMS, the admin can change it
<> != !> ;between x and y; is null;
where A or B and C; -- the and will be do first, the sql running like (A and C), then (or B)
where (A or B) and C;
where vendid IN ("A","B") -- same logic like or, and the IN clause can include the "Select xxxx" and run faster
where NOT A; -- like <> and we can use the NOT IN (Select )
LIKE "Fish%" -- start from Fish % can not match the NUL, % can be 0 wildcard. in access, _ _ match with the single byte
-- % make the SQL query works slow, limit the useage of the %
Select RTRIM(A) + '(' + RTRIM(B) + ')' -- use the RTRIM to delete the right side spaces of the values
trim(), ltrim(), rtrim()
Select RTRIM(A) + '(' + RTRIM(B) + ')'
AS C -- C could be the alias of the new column in SQL, if the column name is illegal, use the alias
select Now() -- return you the current date and time

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
The distinct function in different DBMS, should be care about. We should consider the portable.
-- 常用的文本处理函数
-- 函  数 说  明
LEFT() -- (或使用子字符串函数) 返回字符串左边的字符
LENGTH() -- (也使用DATALENGTH()或LEN()) 返回字符串的长度
LOWER() -- (Access使用LCASE()) 将字符串转换为小写
LTRIM() -- 去掉字符串左边的空格
RIGHT() -- (或使用子字符串函数) 返回字符串右边的字符
RTRIM() -- 去掉字符串右边的空格
SOUNDEX() -- 返回字符串的SOUNDEX值
UPPER() -- (Access使用UCASE()) 将字符串转换为大写”
soundex(A) = soundex('Michael Green') -- SQL can return the similar prounance results, like Michelle Green.
AVG() -- ignore the row which the target column value is NULL
select avg(distinct A) -- just calulate the specific products, unralated with count of products
COUNT(*) count all rows includes NULL row
COUNT(colum) count all rows excludes NULL row
对非数值数据使用MAX()
MAX()一般用来找出最大的数值或日期值,但许多(并非所有)DBMS允许将它用来返回任意列中的最大值,包括返回文本列中的最大值。在用于文本数据时,MAX()返回按该列排序后的最后一行。”
MAX() ingore NULL row
GROUP BY
- GROUP BY子句可以包含任意数目的列,因而可以对分组进行嵌套,更细致地进行数据分组。
- 如果在GROUP BY子句中嵌套了分组,数据将在最后指定的分组上进行汇总。换句话说,在建立分组时,指定的所有列都一起计算(所以不能从个别的列取回数据)。
- GROUP BY子句中列出的每一列都必须是检索列或有效的表达式(但不能是聚集函数)。如果在SELECT中使用表达式,则必须在GROUP BY子句中指定相同的表达式。不能使用别名。
- 除聚集计算语句外,SELECT语句中的每一列都必须在GROUP BY子句中给出。”
HAVING
-- “目前为止所学过的所有类型的WHERE 子句都可以用HAVING来替代。唯一的差别是,WHERE过滤行,而HAVING过滤分组。”
-- having support all where operator
HAVINGWHERE的差别
-- 这里有另一种理解方法,WHERE在数据分组前进行过滤,HAVING在数据分组后进行过滤。这是一个重要的区别,WHERE排除的行不包括在分组中。这可能会改变计算值,从而影响HAVING子句中基于这些值过滤掉的分组。”

Query

1
2
3
4
5
6
7
8
9
10
11
12
13
show full processlist;
kill *****;
desc tablename
-- ssss
source ~/.bash_profile
mysql -u root -p
123456
use database
source sql.file path
group by column , no quote
mysql -u USERNAME -pPASSWORD -h REMOTE_SERVER_IP DB_NAME
1
2
-- String to Date in MySQL
SET expired_date = STR_TO_DATE(@expired_date, '%m/%d/%Y');
1
2
3
4
5
6
7
-- Load csv in S3 to remote MySQL Server.
-- Tranfer data from one remote MySQL Server to S3 and to another MySQL Server.
sqoop import --connect
sqoop export --connect
1
2
3
4
5
6
LOAD DATA local INFILE '<file>.csv'
INTO TABLE <TABLE>
FIELDS TERMINATED BY ','
ENCLOSED BY '"'
LINES TERMINATED BY '\n'
IGNORE 1 ROWS;
1
# Insert data from SQL Dump file

mysql导入数据load data infile用法

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
-- Load/Insert csv data in S3 to MySQL Server
-- Copy the schema of create table.
ssh -i ~/wang.pem <EMR>
-- Mount the new drive in EMR, the files over 5gb
cd /mnt
mkdir xx
cd xx/
-- aws copy csv into EMR
aws s3 cp
--
mysqlimport -h smb-dwh-dev.cluster-c9teojsqesw4.us-west-2.rds.amazonaws.com -P 3306 -u adhoc_user -p'4dh0cu3er' --local --ignore-lines=1 --fields-terminated-by=',' adhoc_sandbox sas_billingplan.csv
mysqlimport -h smb-dwh-dev.cluster-c9teojsqesw4.us-west-2.rds.amazonaws.com -P 3306 -u adhoc_user -p'4dh0cu3er' --local --ignore-lines=1 --fields-terminated-by=',' adhoc_sandbox sas_billing.csv
mysqlimport -h smb-dwh-dev.cluster-c9teojsqesw4.us-west-2.rds.amazonaws.com -P 3306 -u adhoc_user -p'4dh0cu3er' --local --ignore-lines=1 --fields-terminated-by=',' adhoc_sandbox sas_leasingcontracs.csv
-- Loading the sas_leasingcontracs from csv in s3 to hive table
DROP TABLE sas_leasingcontracs;
CREATE EXTERNAL TABLE sas_leasingcontracs
(
Active_Contract_Status STRING,
COMPANY_CODE STRING,
CONTRACT_NUMBER STRING,
CUSTOMER_SEGMENT STRING,
FINANCIAL_PRODUCT STRING,
GOVT_COMM_MUNI_INDICATOR STRING,
Lease_Expiry_Date STRING,
Sales_Office STRING,
Lease_Start_Date STRING,
LEASE_STATUS_CODE STRING,
Original_Term_Duration STRING,
Prime_Period_Contract_Status STRING,
Remaining_Term STRING,
SLP_Contract_Status STRING,Sales_Group STRING,
Customer_Number STRING,
Original_Residual_Value STRING,
Term_Duration STRING,
Terminated_Contract_Status STRING,
Termination_Date STRING,
Termination_Reason_Code STRING,
Financed_Cost STRING,
Lease_Program_Offer STRING
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n'
STORED AS INPUTFORMAT
'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
's3://pbdl-bda-xbec/staging/sharedworkspace/congyang_pipeline/sas_leasingcontracs/'
tblproperties ("skip.header.line.count"="1")

MySQL DATE_FORMAT() Function

https://www.w3schools.com/sql/func\_mysql\_date\_format.asp

1
2
3
4
5
6
7
-- Create Table from query
CREATE TABLE adhoc_sandbox.smb_plan AS
SELECT bpn, created, company_name
FROM cc_dwh.aria_subscriptions
ALTER TABLE adhoc_sandbox.smb_plan
ADD PRIMARY KEY (bpn);

SQL 通配符

在搜索数据库中的数据时,SQL 通配符可以替代一个或多个字符。
SQL 通配符必须与 LIKE 运算符一起使用。
在 SQL 中,可使用以下通配符:%

1
2
SELECT * FROM Persons
WHERE City LIKE 'Ne%'

If the value appears more than once.

1
2
3
4
5
6
7
8
9
select *
from sfdc_contacts
where okta_id
in(
select okta_id
from sfdc_contacts
group by okta_id
having COUNT(okta_id) >1
)
1
SELECT * FROM automoto WHERE (title LIKE '%$keyword%')
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
做项目需要用到联合唯一
例如,user表中有user_id,user_name两个字段,如果不希望有两条一摸一样的的user_id和user_name,需要给user表添加两个字段的联合唯一索引:
alter table user add unique index(user_id,user_name);
这样当向表中添加相同记录的时候,会返回1062的添加失败信息。
但是有一种情况是表中已经有n个重复的记录,这时候我们才想起来要添加唯一索引,再执行上面的操作时,数据库会告诉你已经有重复的记录了,建立索引失败,这时候,我们可以用下面的操作:
alter ignore table user add unique index(user_id,user_name);
它会删除重复的记录(别怕,会保留一条),然后建立唯一索引,高效而且人性化。
MySql 删除重复记录
数据库
MySql
1、 查找表中多余的重复记录,重复记录是根据单个字段(peopleId)来判断
select * from people
where peopleId in (select peopleId from people group by peopleId having count(peopleId) > 1)
2、 删除表中多余的重复记录,重复记录是根据单个字段(peopleId)来判断,只留有 rowid 最小的记录
delete from people
where peopleId in (select peopleId from people group by peopleId having count(peopleId) > 1)
and rowid not in (select min(rowid) from people group by peopleId having count(peopleId )>1)
3、查找表中多余的重复记录(多个字段)
select * from vitae a
where (a.peopleId,a.seq) in (select peopleId,seq from vitae group by peopleId,seq having count(*) > 1)
4、删除表中多余的重复记录(多个字段),只留有 rowid 最小的记录
delete from vitae a
where (a.peopleId,a.seq) in (select peopleId,seq from vitae group by peopleId,seq having count(*) > 1)
and rowid not in (select min(rowid) from vitae group by peopleId,seq having count(*)>1)
5、查找表中多余的重复记录(多个字段),不包含 rowid 最小的记录
select * from vitae a
where (a.peopleId,a.seq) in (select peopleId,seq from vitae group by peopleId,seq having count(*) > 1)
and rowid not in (select min(rowid) from vitae group by peopleId,seq having count(*)>1)
6、 比方说,在 A 表中存在一个字段 “name”,而且不同记录之间的“name” 值有可能会相同,现在就是需要查询出在该表中的各记录之间,“name”值存在重复的项;
Select Name,Count(*) From A Group By Name Having Count(*) > 1
如果还查性别也相同大则如下:
Select Name,sex,Count(*) From A Group By Name,sex Having Count(*) > 1
有两个意义上的重复记录,一是完全重复的记录,也即所有字段均重复的记录,二是部分关键字段重复的记录,比如 Name 字段重复,而其他字段不一定重复或都重复可以忽略。
1)对于第一种重复,比较容易解决,使用
select distinct * from tableName
就可以得到无重复记录的结果集。
如果该表需要删除重复的记录(重复记录保留 1 条),可以按以下方法删除
select distinct * into #Tmp from tableName
drop table tableName
select * into tableName from #Tmp
drop table #Tmp
发生这种重复的原因是表设计不周产生的,增加唯一索引列即可解决。
2)这类重复问题通常要求保留重复记录中的第一条记录,操作方法如下
假设有重复的字段为 Name,Address,要求得到这两个字段唯一的结果集
select identity(int,1,1) as autoID, * into #Tmp from tableName
select min(autoID) as autoID into #Tmp2 from #Tmp group by Name,autoID
select * from #Tmp where autoID in(select autoID from #tmp2)
最后一个 select 即得到了 Name,Address 不重复的结果集(但多了一个 autoID 字段,实际写时可以写在 select 子句中省去此列)
7、 查询重复
select * from tablename where id in (
select id from tablename
group by id
having count(id) > 1
)