V
V
1000 Data Scientist Interview Questions v18
 
This is not for me, it’s for you, for everyone who wants jump into data science ocean,
I spent over 200 hours on this project, this is not a perfect one, but already works for me.
Hope it can help you a lot.
拿面试看简历,拿Onsite看实力,拿Offer看沟通。
V
V
Tech Interview
 
V
V
SQL(G)
 
Tips: The interviewer would ask,”why this?, why that?”
Tips: Be familiar with non platform specific methods, e.g. not the oracle function like rownum.
Tips: You may being graded by someone who will not play database and don’t know if queries works. So make it simple and 0 bug
Tips: No trial and error either. They may pull questions from a handbook?
SQL Tryit Editor v1.5 (SQL online test from w3schools)
V
*
Q: How to CREATE a table & SELECT INTO?
 
SQL CREATE TABLE Example
The following example creates a table called "Persons" that contains five columns:
PersonID, LastName, FirstName, Address, and City:
CREATE TABLE Persons (
    PersonID int,
    LastName varchar(255),
);
Create Table Using Another Table
CREATE TABLE new_table_name AS
    SELECT column1, column2,...
    FROM existing_table_name
    WHERE ....;
SELECT INTO
SELECT * INTO CustomersBackup2017
FROM Customers;
SELECT * INTO CustomersBackup2017 IN 'Backup.mdb'
FROM Customers;
WHERE Country = 'Germany';
Tip: SELECT INTO can also be used to create a new, empty table using the schema of another.
SELECT INTO 语句从一个表中选取数据,然后把数据插入另一个表中。
SELECT INTO 语句常用于创建表的备份复件或者用于对记录进行存档。
Just add a WHERE clause that causes the query to return no data:
SELECT * INTO newtable
FROM oldtable
WHERE 1 = 0;
V
*
Q: Define a SQL query? What is the difference between SELECT and UPDATE Query? How do you use SQL in SAS, Python, R languages?
 
Queries take the form of a command language that lets you select, insert, update, find out the location of data, and so forth.
The UPDATE query is used to update existing records in a table.
The SELECT query is used to select data from a database. The result is stored in a result table, called the result-set.
sqldf from within R,
pandasql from within Python
SELECT
WHERE
GROUP BY
ORDER BY
V
*
Q: What is the difference between INNER JOIN and OUTER JOIN?
 
Here are the different types of the JOINs in SQL:
(INNER) JOIN: Returns records that have matching values in both tables
LEFT (OUTER) JOIN: Return all records from the left table, and the matched records from the right table
RIGHT (OUTER) JOIN: Return all records from the right table, and the matched records from the left table
FULL (OUTER) JOIN: Return all records when there is a match in either left or right table
Pasted_Graphic_43
V
*
Q: How would you insert a row into a data table?
 
3 types, whole row, part of row, insert query
“INSERT INTO Customers (a,b,c)
VALUES('1000000006',
'Toy Land',
'123 Any Street',
'New York',
'NY',
'11111',
'USA',
NULL); ”
or Select 1,2,3, from table;
It’s a little dangers, but you can add the columns before the value part. Benefit: no worry about the columns structure change.
Tips: only insert + select can insert multi rows.
V
*
Q: How to optimize the MySQL
 
1. Use explain to check the query plan,
bVY97y
1. type列,连接类型。一个好的sql语句至少要达到range级别。杜绝出现all级别
2. key列,使用到的索引名。如果没有选择索引,值是NULL。可以采取强制索引方式
3. key_len列,索引长度
4. rows列,扫描行数。该值是个预估值
5. extra列,详细说明。注意常见的不太友好的值有:Using filesort, Using temporary
2. Limit the IN in the query.
MySQL对于IN做了相应的优化,即将IN中的常量全部存储在一个数组里面,而且这个数组是排好序的。但是如果数值较多,产生的消耗也是比较大的。再例如:select id from t where num in(1,2,3) 对于连续的数值,能用 between 就不要用 in 了;再或者使用连接来替换。
3. Give the specific select conditions
SELECT *增加很多不必要的消耗(cpu、io、内存、网络带宽);增加了使用覆盖索引的可能性;当表结构发生改变时,前断也需要更新。所以要求直接在select后面接上字段名。
四、当只需要一条数据的时候,使用limit 1
这是为了使EXPLAIN中type列达到const类型
五、如果排序字段没有用到索引,就尽量少排序
六、如果限制条件中其他字段没有索引,尽量少用or
or两边的字段中,如果有一个不是索引字段,而其他条件也不是索引字段,会造成该查询不走索引的情况。很多时候使用 union all 或者是union(必要的时候)的方式来代替“or”会得到更好的效果
7. Use union all instead of union
union和union all的差异主要是前者需要将结果集合并后再进行唯一性过滤操作,这就会涉及到排序,增加大量的CPU运算,加大资源消耗及延迟。当然,union all的前提条件是两个结果集没有重复数据。
八、不使用ORDER BY RAND()
select id from `dynamic` order by rand() limit 1000;
上面的sql语句,可优化为
select id from `dynamic` t1 join (select rand() * (select max(id) from `dynamic`) as nid) t2 on t1.id > t2.nid limit 1000;
九、区分in和exists, not in和not exists
select * from 表A where id in (select id from 表B)
上面sql语句相当于
select * from 表A where exists(select * from 表B where 表B.id=表A.id)
区分in和exists主要是造成了驱动顺序的改变(这是性能变化的关键),如果是exists,那么以外层表为驱动表,先被访问,如果是IN,那么先执行子查询。所以IN适合于外表大而内表小的情况;EXISTS适合于外表小而内表大的情况。
关于not in和not exists,推荐使用not exists,不仅仅是效率问题,not in可能存在逻辑问题。如何高效的写出一个替代not exists的sql语句?
原sql语句
select colname … from A表 where a.id not in (select b.id from B表)
高效的sql语句
select colname … from A表 Left join B表 on where a.id = b.id where b.id is null
取出的结果集如下图表示,A表不在B表中的数据
bVZab5
十、使用合理的分页方式以提高分页的效率
select id,name from product limit 866613, 20
使用上述sql语句做分页的时候,可能有人会发现,随着表数据量的增加,直接使用limit分页查询会越来越慢。
优化的方法如下:可以取前一页的最大行数的id,然后根据这个最大的id来限制下一页的起点。比如此列中,上一页最大的id是866612。sql可以采用如下的写法:
select id,name from product where id> 866612 limit 20
十一、分段查询
在一些用户选择页面中,可能一些用户选择的时间范围过大,造成查询缓慢。主要的原因是扫描行数过多。这个时候可以通过程序,分段进行查询,循环遍历,将结果合并处理进行展示。
如下图这个sql语句,扫描的行数成百万级以上的时候就可以使用分段查询
bVZacz
十二、避免在 where 子句中对字段进行 null 值判断
对于null的判断会导致引擎放弃使用索引而进行全表扫描。
十三、不建议使用%前缀模糊查询
例如LIKE “%name”或者LIKE “%name%”,这种查询会导致索引失效而进行全表扫描。但是可以使用LIKE “name%”。
那如何查询%name%?
如下图所示,虽然给secret字段添加了索引,但在explain结果果并没有使用
bVZac6
那么如何解决这个问题呢,答案:使用全文索引
在我们查询中经常会用到select id,fnum,fdst from dynamic_201606 where user_name like '%zhangsan%'; 。这样的语句,普通索引是无法满足查询需求的。庆幸的是在MySQL中,有全文索引来帮助我们。
创建全文索引的sql语法是:
ALTER TABLE `dynamic_201606` ADD FULLTEXT INDEX `idx_user_name` (`user_name`);
使用全文索引的sql语句是:
select id,fnum,fdst from dynamic_201606 where match(user_name) against('zhangsan' in boolean mode);
注意:在需要创建全文索引之前,请联系DBA确定能否创建。同时需要注意的是查询语句的写法与普通索引的区别
十四、避免在where子句中对字段进行表达式操作
比如
select user_id,user_project from user_base where age*2=36;
中对字段就行了算术运算,这会造成引擎放弃使用索引,建议改成
select user_id,user_project from user_base where age=36/2;
十五、避免隐式类型转换
where 子句中出现 column 字段的类型和传入的参数类型不一致的时候发生的类型转换,建议先确定where中的参数类型
bVZadW
十六、对于联合索引来说,要遵守最左前缀法则
举列来说索引含有字段id,name,school,可以直接用id字段,也可以id,name这样的顺序,但是name;school都无法使用这个索引。所以在创建联合索引的时候一定要注意索引字段顺序,常用的查询字段放在最前面
十七、必要时可以使用force index来强制查询走某个索引
有的时候MySQL优化器采取它认为合适的索引来检索sql语句,但是可能它所采用的索引并不是我们想要的。这时就可以采用force index来强制优化器使用我们制定的索引。
十八、注意范围查询语句
对于联合索引来说,如果存在范围查询,比如between,>,<等条件时,会造成后面的索引字段失效。
19. Optimize the Join
bVZafW
• LEFT JOIN A表为驱动表
INNER JOIN MySQL会自动找出那个数据少的表作用驱动表
• RIGHT JOIN B表为驱动表
注意:MySQL中没有full join,可以用以下方式来解决
select * from A left join B on B.name = A.name
where B.name is null
union all
select * from B;
尽量使用inner join,避免left join
参与联合查询的表至少为2张表,一般都存在大小之分。如果连接方式是inner join,在没有其他过滤条件的情况下MySQL会自动选择小表作为驱动表,但是left join在驱动表的选择上遵循的是左边驱动右边的原则,即left join左边的表名为驱动表
合理利用索引
被驱动表的索引字段作为on的限制字段。
利用小表去驱动大表
bVZagY
从原理图能够直观的看出如果能够减少驱动表的话,减少嵌套循环中的循环次数,以减少 IO总量及CPU运算的次数。
巧用STRAIGHT_JOIN
inner join是由mysql选择驱动表,但是有些特殊情况需要选择另个表作为驱动表,比如有group by、order by等「Using filesort」、「Using temporary」时。STRAIGHT_JOIN来强制连接顺序,在STRAIGHT_JOIN左边的表名就是驱动表,右边则是被驱动表。在使用STRAIGHT_JOIN有个前提条件是该查询是内连接,也就是inner join。其他链接不推荐使用STRAIGHT_JOIN,否则可能造成查询结果不准确。
bVZahc
这个方式有时可能减少3倍的时间。
V
*
? Q: What is SELF JOIN and why is it required?
 
Self Join is the act of joining one table with itself.
FOR: Convert a hierarchical structure into a flat structure
In our employee table example above, we have kept the manager ID of each employee in the same row as that of the employee.
This is an example of how a hierarchy (in this case employee-manager hierarchy) is stored in the RDBMS table. Now, suppose if we need to print out the names of the manager of each employee right beside the employee, we can use self join. See the example below:
Pasted_Graphic_35
The only reason we have performed a left outer join here (instead of INNER JOIN) is we have one employee in this table without a manager (employee ID = 1). If we perform inner join, this employee will not show-up.
V
*
Q: What is the difference between JOIN and UNION?
 
JOIN = “lookup” records on other table based on the given conditions between two tables.
For example, if we have the department ID of each employee, then we can use this department ID of the employee table to join with the department ID of department table to lookup department names.
UNION operation allows us to add 2 similar datasets to create resulting dataset that contains all the data from the source datasets.
Union does not require any condition for joining.
For example, if you have 2 employee tables with same structure, you can UNION them to create one result set that will contain all the employees from both of the tables.
SELECT * FROM EMP1
UNION
SELECT * FROM EMP2;
V
*
Q: What is the difference between UNION and UNION ALL?
 
Both unify for add two structurally similar datasets, but
UNION returns only the unique records
UNION ALL will return all the rows
The main use of UNION and UNION ALL is join two tables. The main difference between them is that UNION removes duplicate records and UNION ALL keeps the duplicate records.
By ‘duplicate records’, all the values of two or more rows are the same.
V
*
Q: What is the difference between WHERE clause and HAVING clause?
 
SELECT/UPDATA/DELETE + WHERE (static non-aggregated)
Then
GROUP BY + HAVING (an aggregated operation)
WHERE and HAVING both filters out records based on one or more conditions.
V
*
Q: What is the difference among UNION, MINUS and INTERSECT?
 
UNION combines the results from 2 tables and eliminates duplicate records from the result set.
MINUS operator when used between 2 tables, gives us all the rows from the first table except the rows which are present in the second table.
INTERSECT operator returns us only the matching or common rows between 2 result sets.
Pasted_Graphic_39
V
*
Q: What is the LIKE condition operator?
 
Pasted_Graphic_53
V
*
Q: How the HAVING COUNT(*) work?
 
When the group by happened, the count(*) already happened and being a hidden column
Pasted_Graphic_54
V
*
Q: How can you INSERT "Olsen" as the "LastName" in the "Persons" table?
 
INSERT INTO Table (Column) VALUES (‘value’)
V
*
Q: Case How to generate row number in SQL itself (without ROWNUM).
 
Generating a row number – that is a running sequence of numbers for each row is not easy using plain SQL. In fact, the method I am going to show below is not very generic either. This method only works if there is at least one unique column in the table. This method will also work if there is no single unique column, but collection of columns that is unique. Anyway, here is the query:
Pasted_Graphic_3
Pasted_Graphic_4
The column that is used in the row number generation logic is called “sort key”. Here sort key is “file_name” column. For this technique to work, the sort key needs to be unique. We have chosen the column “file_name” because this column happened to be unique in our Employee table. If it was not unique but some other collection of columns was, then we could have used those columns as our sort key (by concatenating those columns to form a single sort key).
Also notice how the rows are sorted in the result set. We have done an explicit sorting on the row_num column, which gives us all the row numbers in the sorted order. But notice that name column is also sorted (which is probably the reason why this column is referred as sort-key). If you want to change the order of the sorting from ascending to descending, you will need to change “>=” sign to “<=” in the query.
As I said before, this method is not very generic. This is why many databases already implement other methods to achieve this. For example, in Oracle database, every SQL result set contains a hidden column called ROWNUM. We can just explicitly select ROWNUM to get sequence numbers.
V
*
Q: How to update primary key?
 
I will use a single "alter table" statement to update the primary key.
alter table xx drop primary key, add primary key(k1, k2, k3);
V
*
Q: What is a “star schema”? “snowflake schema”?
 
The star schema architecture is the simplest data warehouse schema. It is called a star schema because the diagram resembles a star, with points radiating from a center. The center of the star consists of fact table and the points of the star are the dimension tables.
snowflake schema is a logical arrangement of tables in a multidimensional database such as the one of our project entity relationship diagram resembles a snowflake shape. The snowflake schema is represented by centralized fact tables which are connected to multiple dimensions
V
*
Q: Name the differences between view and materialized view
?
Views:
• A view takes the output of a query and makes it appear like a virtual table. we can use a view in most places where a table can be used.
• All operations performed on a view will affect data in the base table and so are subject to the integrity constraints and triggers of the base table.
• A View can be used to simplify SQL statements for the user or to isolate an application from any future change to the base table definition. A View can also be used to improve security by restricting access to a predetermined set of rows or columns.
• In addition to operating on base tables, one View can be based on another. A view can also JOIN a view with a table (GROUP BY or UNION).
Views are built using the attributes of one or more tables. View with single tables can be updated but those with multiple tables cannot be updated.
Materialized Views:
Materialized view is an aggregate table.
• Materialized views are schema objects that can be used to summarize, pre-compute, replicate, and distribute data. E.g. to construct a data warehouse.
• A materialized view provides indirect access to table data by storing the results of a query in a separate schema object.
• The existence of a materialized view is transparent to SQL, but when used for query rewrites, it will improve the performance of SQL execution.
V
>
Q: How would you optimize a SQL query, Improve SQL Query Performance?
!
The SQL Language seems easy to learn – the commands follow a simple syntax and don’t describe the specific algorithms used to retrieve data. The simplicity might, however, be deceptive.
Not all the database functions operate with the same efficiency. Two very similar queries can vary significantly in terms of the computation time. There are many ways can help improve the query performance.
What I can think about now is:
Understand the Data
Understand the data, its type and how queries are being performed to retrieve the data before making any decision to create an index. If you understand the behavior of data thoroughly, it will help you to decide which column should have either a clustered index or non-clustered index.
Revisit Your Schema Definitions
Last but not least, revisit your schema definitions; keep on eye out that appropriate FOREIGN KEY, NOT NULL and CEHCK constraints are in place or not. Availability of the right constraint on the right place always helps to improve the query performance, like FOREIGN KEY constraint helps to simplify joins by converting some outer or semi-joins to inner joins and CHECK constraint also helps a bit by removing unnecessary or redundant predicates.
Create a Highly Selective Index
Selectivity define the percentage of qualifying rows in the table (qualifying number of rows/total number of rows). If the ratio of the qualifying number of rows to the total number of rows is low, the index is highly selective and is most useful. A non-clustered index is most useful if the ratio is around 5% or less, which means if the index can eliminate 95% of the rows from consideration. If index is returning more than 5% of the rows in a table, it probably will not be used
Drop Unused Indexes
Dropping unused indexes can help to speed up data modifications without affecting data retrieval.
Avoid Multiple Joins in a Single Query
Try to avoid writing a SQL query using multiple joins that includes outer joins, cross apply, outer apply and other complex sub queries. It reduces the choices for Optimizer to decide the join order and join type. Sometime, Optimizer is forced to use nested loop joins, irrespective of the performance consequences for queries with excessively complex cross apply or sub queries.
1. Learn How to Create Indexes Properly
Learning how to index properly is the best thing you can do to improve the performance of your SQL queries. Indexes allow for quicker access to the database in typical situations. Database novices often find indexes mysterious or difficult. They either index nothing or they try to index everything. Of course, neither of these approaches is right. With no indexes at all, your queries are likely to be slow. If you index everything, your updates and insert triggers will be inefficient.
If you’re not sure that you know enough about indexes, take @MarkusWinand’s three minute-test. Markus’ website “Use the Index, Luke” is the best starting point to learn about indexes. He also published an excellent book on the subject.
2. Only Retrieve the Data we Really Need
A common way of retrieving the desired columns is to use the * symbol even though not all the columns are really needed:
SELECT * FROM users WHERE age > 20;
If the table is small, retrieving additional columns won’t make much of a difference. For larger datasets, however, specifying the columns might save a lot of computation time:
SELECT id, last_name, sex, age FROM users WHERE age > 20;
Keep in mind, however, that many popular ORM will not let you to create a query which selects only a subset of table columns.
Similarly, if you only need a limited number of rows you should use the LIMIT clause (or your database’s equivalent). Take a look at the following code:
SELECT name, price FROM products;
For instance, if you only want to display the first 10 records out of 50,000 on your website, it is advisable to inform the database about it. This way, the database will stop the search after finding 10 rows rather than scan the whole table:
SELECT name, price FROM products LIMIT 10;
The LIMIT statement is available in MySQL and Postgresql, but other databases have ways to achieve a similar effect.
These above examples illustrate the general idea – you should always think whether you need all the rows returned by an SQL statement. If you don’t, there is always some room for improvement.
4. Consider Getting Rid of Correlated Subqueries
A correlated subquery is a subquery which depends on the outer query. It uses the data obtained from the outer query in its WHERE clause. Suppose you want to list all users who have made a donation. You could retrieve the data with the following code:
SELECT user_id, last_name FROM users WHERE EXISTS (SELECT * FROM donationuser WHERE donationuser.user_id = users.user_id);
In the above case, the subquery runs once for each row of the main query, thus causing possible inefficiency. Instead, we can apply a join:
SELECT DISTINCT users.user_id FROM users INNER JOIN donationuser ON users.user_id = donationuser.user_id;
If there are millions of users in the database, the statement with the correlated subquery will most likely be less efficient than the INNER JOIN because it needs to run millions of times. But if you were to look for donations made by a single user, the correlated subquery might not be a bad idea. As a rule of thumb, if you look for many or most of the rows, try to avoid using correlated subqueries. Keep in mind, however, that using correlated subqueries might be inevitable in some cases.
5. Avoid Wildcard Characters at the Beginning of a LIKE Pattern
Whenever possible, avoid using the LIKE pattern in the following way:
SELECT * FROM users WHERE name LIKE '%bar%';
The use of the % wildcard at the beginning of the LIKE pattern will prevent the database from using a suitable index if such exists. Since the system doesn’t know what the beginning of the name column is, it will have to perform a full table scan anyway. In many cases, this may slow the query execution. If the query can be rewritten in the following way:
SELECT * FRO users WHERE name LIKE 'bar%';
then the performance may be enhanced. You should always consider whether a wildcard character at the beginning is really essential.
but remember to evaluate each situation individually to see which method works best -- there are no ironclad solutions
*
Understand the Data
Understand the data, its type and how queries are being performed to retrieve the data before making any decision to create an index. If you understand the behavior of data thoroughly, it will help you to decide which column should have either a clustered index or non-clustered index. If a clustered index is not on a unique column then SQL Server will maintain uniqueness by adding a unique identifier to every duplicate key, which leads to overhead. To avoid this type of overhead choose the column correctly or make the appropriate changes.
Create a Highly Selective Index
Selectivity define the percentage of qualifying rows in the table (qualifying number of rows/total number of rows). If the ratio of the qualifying number of rows to the total number of rows is low, the index is highly selective and is most useful. A non-clustered index is most useful if the ratio is around 5% or less, which means if the index can eliminate 95% of the rows from consideration. If index is returning more than 5% of the rows in a table, it probably will not be used; either a different index will be chosen or created or the table will be scanned.
Position a Column in an Index
Order or position of a column in an index also plays a vital role to improve SQL query performance. An index can help to improve the SQL query performance if the criteria of the query matches the columns that are left most in the index key. As a best practice, most selective columns should be placed leftmost in the key of a non-clustered index.
Drop Unused Indexes
Dropping unused indexes can help to speed up data modifications without affecting data retrieval. Also, you need to define a strategy for batch processes that run infrequently and use certain indexes. In such cases, creating indexes in advance of batch processes and then dropping them when the batch processes are done helps to reduce the overhead on the database.
Statistic Creation and Updates
You need to take care of statistic creation and regular updates for computed columns and multi-columns referred in the query; the query optimizer uses information about the distribution of values in one or more columns of a table statistics to estimate the cardinality, or number of rows, in the query result. These cardinality estimates enable the query optimizer to create a high-quality query plan.
Revisit Your Schema Definitions
Last but not least, revisit your schema definitions; keep on eye out that appropriate FOREIGN KEY, NOT NULL and CEHCK constraints are in place or not. Availability of the right constraint on the right place always helps to improve the query performance, like FOREIGN KEY constraint helps to simplify joins by converting some outer or semi-joins to inner joins and CHECK constraint also helps a bit by removing unnecessary or redundant predicates.
Avoid Multiple Joins in a Single Query
Try to avoid writing a SQL query using multiple joins that includes outer joins, cross apply, outer apply and other complex sub queries. It reduces the choices for Optimizer to decide the join order and join type. Sometime, Optimizer is forced to use nested loop joins, irrespective of the performance consequences for queries with excessively complex cross apply or sub queries.
Eliminate Cursors from the Query
Try to remove cursors from the query and use set-based query; set-based query is more efficient than cursor-based. If there is a need to use cursor than avoid dynamic cursors as it tends to limit the choice of plans available to the query optimizer. For example, dynamic cursor limits the optimizer to using nested loop joins.
Avoid Use of Non-correlated Scalar Sub Query
You can re-write your query to remove non-correlated scalar sub query as a separate query instead of part of the main query and store the output in a variable, which can be referred to in the main query or later part of the batch. This will give better options to Optimizer, which may help to return accurate cardinality estimates along with a better plan.
void Multi-statement Table Valued Functions (TVFs)
Multi-statement TVFs are more costly than inline TFVs. SQL Server expands inline TFVs into the main query like it expands views but evaluates multi-statement TVFs in a separate context from the main query and materializes the results of multi-statement into temporary work tables. The separate context and work table make multi-statement TVFs costly.
Creation and Use of Indexes
We are aware of the fact that Index can magically reduce the data retrieval time but have a reverse effect on DML operations, which may degrade query performance. With this fact, Indexing is a challenging task, but could help to improve SQL query performance and give you best query response time.
 
V
*
Q: What is the CASE in MySQL?
 
Pasted_Graphic_55
V
>
Leetcode(G)
 
V
*
175. Case Combine Two Tables
 
SELECT
FROM A
LEFT JOIN B
Basic LJ: ON =
LJ+using: USING()
Natural LJ: NATURAL LEFT JOIN B
Which way is faster, it depends, most time the results are random incident.
V
*
182. Case Duplicate Emails [find out the duplicate email]
?
select Email
from Person
group by Email
having count(*)>1;
—————
SELECT distinct p1.Email from Person p1
INNER JOIN Person p2
ON p1.Email = p2.Email
WHERE p1.Id <> p2.Id;
V
*
197. Case Rising Temperature
 
SELECT t1.Id
FROM Weather t1
INNER JOIN Weather t2
ON TO_DAYS(t1.Date) = TO_DAYS(t2.Date) + 1. //To_days is function in MySQL
WHERE t1.Temperature > t2.Temperature
V
*
181. Case Employees Earning More Than Their Managers
?
SELECT employer.Name
FROM Employee AS employer
JOIN Employee AS manager
ON (employer.ManagerId = manager.Id )
WHERE employer.Salary > manager.Salary ;
Select e1.Name as Employee
From employee e1, employee e2
where e1.managerid = e2.id and e1.salary > e2.salary
select a.Name
from Employee a inner join Employee b on a.ManagerId=b.Id
where a.Salary>b.Salary
Sub-query solution:
select *
from Employee e
where e.Salary >
(select m.Salary from Employee m
where e.ManagerId = m.Id)
V
*
196. Case Delete Duplicate Emails
 
Interesting delete use:
delete p1
FROM Person p1, Person p2
WHERE p1.Email = p2.Email AND p1.Id > p2.Id
V
*
176. Case Second Highest Salary
 
Very smart subquery
Select max(salary) from employees
Where salary not in
(select max(salary) from employees)
SELECT max(price) FROM [Products]
where price < ( select max(price) from products where price < (select max(price) from products))
With the help of MySQL function
select (
select distinct Salary
from Employee
order by Salary Desc
limit 1 offset 1
)as second
*
Middle
 
V
*
Q: Case query examples in SQL
?
Using DISTINCT with SELECT
select distinct EmployeeID
from Employees
where EmployeeID > 4
order by EmployeeID (order by at the end)
finds the total of each sales order
SELECT OrderID, sum(Quantity) as sales. (AS can be deleted)
from Orderdetails
group by OrderID
order by sales DESC
average price and the sum of year-to-date sales, grouped by product ID and special offer ID.
SELECT OrderID,ProductID, avg(Quantity) as "avg_sales", sum(quantity) as sum$
from Orderdetails
Where OrderQty > 10 (where can not include aggregate function in operation)
group by OrderID,ProductID (order meaningless) or (ProductID + orderID)
order by orderID DESC
//GROUP BY ALL in SQL Server once, now been removed.
Finding rows that contain a value as a part of a string
~WHERE Name LIKE ('%Frame%');
Finding rows that meet any of three conditions
WHERE ProductID = 2
OR ProductID = 4
OR Name = 'Spokes' ;
Finding rows that are in a list of values
WHERE Name IN ('Blade', 'Crown Race', 'Spokes');
Finding rows that have a value between two values
WHERE ProductID BETWEEN 725 AND 734;
Checking for NULL values
WHERE Weight IS NULL
Finding total sales and discounts for each product
select distinct(p.ProductName) as proName,
p.ProductID * Price as wholesale,
SupplierID * CategoryID as discount
from Products as p
inner join OrderDetails as o
on p.ProductID = o.ProductID
order by ProductName DESC; /ASC
Calculate total revenue for each product
SELECT 'Total income is', ((OrderQty * UnitPrice) * (1.0 - UnitPriceDiscount)), ' for ',
p.Name AS ProductName
FROM Product AS p
INNER JOIN SalesOrderDetail AS sod
ON p.ProductID = sod.ProductID
ORDER BY ProductName ASC;
V
*
Q: Case Delete all duplicates from a table in one query.
 
SELECT DISTINCT * INTO TableA_Verify FROM TableA
DROP TABLE TableA
RENAME TABLE TableA_Verify TO TableA;
V
*
Q: Case How to select first(top) 5 records from a table?
?
The problem here is how do you define which record is first and which is second. Which record is retrieved first from the database is not deterministic.
Here is the solution:
Generic solution,
I believe a generic solution can be devised for this problem if and only if there exists at least one distinct column in the table.
Screenshot_2017-04-26_16.27.10
In this example, if the chosen column was not distinct, we would have got more than 5 records returned in our output.
In Oracle,
SELECT *
FROM EMP
WHERE ROWNUM <= 5;
In SQL Server,
SELECT TOP 5 * FROM EMP;
In MySQL,
SELECT * FROM EMP
LIMIT 5
V
*
Q: Case Query a list of CITY names from STATION with even ID numbers only. You may print the results in any order, but must exclude duplicates from your answer.
 
SELECT DISTINCT CITY FROM STATION
WHERE ID % 2 = 0
V
*
Q: Case Let N be the number of CITY entries in STATION, and let N’ be the number of distinct CITY names in STATION; query the value of N-N’ from STATION. In other words, find the difference between the total number of CITY entries in the table and the number of distinct CITY entries in the table.
 
SELECT (COUNT(CITY)- COUNT(DISTINCT CITY)) FROM STATION
(Works in MySql and Sql Server)
V
*
Q: Case how many members ever moved from Microsoft to Google? (both member #1 and member #2 count)?
 
table: company
Member_id, Company, Year_Start
1, Microsoft, 2000
1, Google, 2006
1, Facebook, 2012
2, Microsoft, 2001
2, Oracle, 2004
2, Google, 2007
...
-- how many members ever moved from Microsoft to Google? (both member #1 and member #2 count)
SELECT DISTINCT COUNT c1.Member_id
FROM company c1 JOIN company c2
ON c1.Member_id = c2.Member_id
AND c1.Company = 'Microsoft'
AND c2.Company = 'Google'
AND c1.Year_Start < c2.Year_Start.
-- how many members moved directly from Microsoft to Google? (Member 2 does not count since Microsoft -> Oracle -> Google)
SELECT DISTINCT COUNT c1.Member_id
FROM company c1 JOIN company c2. visit 1point3acres.com for more.
ON c1.Member_id = c2.Member_id
AND c1.Company = 'Microsoft' . more info on 1point3acres.com
AND c2.Company = 'Google'
AND c1.Year_Start < c2.Year_Start
AND NOT EXIST(
SELECT c3.Member_id
FROM company c3
WHERE c3.Member_id = c1.Member_id
AND c3.Year_Start BETWEEN c1.Year_Start AND c2.Year_Start
)
加粗的地方,问了是否要等号的情况,怎么取舍。
V
*
Q: Case You have a table called Order with Cust_ID, Order_Date, Order_ID, Tran_Amt. Select the top 100 customers with high spend over a year long period.
!
SELECT Cust_ID FROM ( SELECT Cust_ID, SUM (Tran_Amt) FROM Order WHERE Order_Date between DATE_SUB( CURDATE() , INTERVAL 1 YEAR) AND CURDATE() GROUP BY (Cust_ID) ORDER BY SUM (Tran_Amt) DESC LIMIT 100) ; PS - I will suggest to replace the function CURDATE() and DATE_SUB( CURDATE() , INTERVAL 1 YEAR) with a date in number format for query performance. I don't think that we will executing the query more often.
Select * FROM( Select Cust_id, Year, Totals, RANK() OVER (PARTITION by a.Year ORDER by a.Totals DESC) AS TopLimits FROM ( Select DISTINCT Cust_Id, DatePart(yy,Order_date) AS Year, SUM(Tran_Amt) OVER (Partition by a.Cust_id, DATEPART(yy,a.Order_date)) Totals FROM dbo.[Order] a ) a) b WHERE b.TopLimits <= 100 This is performs the top 100 for all the years
SELECT TOP 100 Cust_ID, SUM(Tran_Amt) AS Total FROM Order WHERE Order_Date >= DATEADD(MONTH, DATEDIFF(MONTH, 0, GETDATE()) - 12, Order_Date) ORDER BY Total DESC
V
*
Q: Case If there are stores with store id and Total_sales and order date, he wanted to compare last weeks revenue to this week by store Id (sort of)
?
Our target is to find out the difference between two weeks sales, so I will use join function.
Select s1.store_id,
From store as s1, store as s2
V
*
Q: Case Find out the most common elements in a column by SQL.
!
select city, count(*) as most from customers
group by city
having count(*) =(
select max(most) from
(select city,count(*) as most
from customers
group by city
)
)
V
*
Q: Case find the user_id out which only in table a not the table b
 
SELECT user_id
FROM A left join B
ON A.id = B.id
WHERE B.HATE IS NULL;
*
Q: Case CostumerID TITLE DATE, find out the most popular movies the ID first watched
?
V
*
Q: Case CostumerID TITLE DATE, find out the ID who watch movie in Jan and Feb (need the data function)
?
1.
Select distinct customerID from table
Where customerID IN (select customerID from table where month(data) = 1)
And customerID IN (select customerID from table where month(data) = 2)
2.
Select m1.Title from Movie m1
group by m1.Title
order by count(distinct(m1.customerID) desc
limit 1
V
*
Q: Case Uber Data Analysis SQL 2hours Test
 
Given the below subset of Uber’s schema, write executable SQL queries to answer the questions below.
Please answer in a single query and assume read-only access to the database (i.e. do not use CREATE TABLE).
Assume a PostgreSQL database, server timezone is UTC.
Table Name: trips
Column Name: Datatype:
id                    integer
client_id           integer (Foreign keyed to users.user_id)
driver_id          integer (Foreign keyed to users.user_id)
city_id             integer
client_rating     integer
driver_rating     integer
status              Enum(‘completed’, ‘cancelled_by_driver’, ‘cancelled_by_client’)
actual_eta        integer
request_at       timestamp with timezone
distance           float
fare                 float
Table Name: cities
Column Name: Datatype:
city_id             integer
name              character varying (varchar)
created_at       timestamp with timezone
launched_at     timestamp with timezone
status             Enum(‘active’, ‘launching’, ‘closed’)
country          character varying (varchar)
Table Name: users
Column Name: Datatype:
user_id             integer
first_name        character varying (varchar)
last_name        character varying (varchar)
email               character varying (varchar)
signup_city_id   integer
banned            Boolean
role                 Enum(‘client’, ‘driver’, ‘partner’)
created_at       timestamp with timezone
updated_at      timestamp with timezone
referral_code   character varying (varchar)
referred_by     integer (Foreign keyed to users.user_id)
deleted_at      timestamp with timezone
QUESTIONS
1) In October 2014, how many $15 dollar trips were completed in the United States by day?
What percentage of trips were completed by client who signed up during October 2014?
2) Identify the 15 most frequent last names for Uber ‘clients’?
3) For city_ids 22, 23, and 35, list the top three clients by average fare per trip (min. 25 completed trips) for the first half of 2012. January 1st - June 30th. Assume the minimum fare is $12 in all markets.
4) Identify the client and driver/partner with the most referrals in 2013. The query’s output should display contact information along with the signup_date of their last referral.
5) What is the average client_rating for completed trips in London for early morning (2am - 5am) trips?
6) Provide the distribution of drivers by the number of completed trips in 2015. Group these drivers in the follow three buckets: less than 100 trips, between 100 and 300 trips, more than 300 trips.
7) Of referred clients, what percentage of them completed a trip?
V
*
Q: Case 1get the all product_id counts
?
Select product_id, count(*) as count from Order
Group by product_id
?Having product_id in (select o1.product_id from order o1 join order o2)
V
*
Q: Case 2get the table ordered on 02-jul-11 not 01-jul-11
?
Select distinct product_id from order
Where order_day = ’02-jul-11’
Except
Select distinct product_id from order where order_day = ’02-jul-11’
V
*
Q: Case 3Get the highest sold products (qty*price) on both days
?
Select data, product_id, sum(qty*price) as sold_amout from order
Group by data, product_id
Having sum(qty*price) DESC
Limit 1
V
*
Q: Case 4Get the all products total sales on data 01,02
?
Select product_id, sum(qty*price) as
V
*
Q: Case 5get all products vis, was ordered more than once
?
Select data, product_id from order
where count(qty)>1
Group by date
V
*
Q: Case 6get the data into single unit level records
?
Select order_id, item, qty from neworder
(O1,a1,5 -> o1,a1,1 o1,a1,1…)
V
*
Q: Case 7 four tables, will give the top product sales in each of product group, + …
?
Select product_group, product_id from table1
product_group by category
As a
Join
Select
*
Q: Case 8 four tables, give all products that have glance views but no sales.
?
*
Q: Case 9 five tables, give the sales of Electronics(a group) as a percentage of books
?
V
*
Q: Case 10 phone log table, please provide an SQL query to display the source_phone_number and a flag where the flag needs to be set to Y if first called number and last called number are the same and N if the first called number and last called number are different
?
Select source number, is_match (y/n)
*
Q: Case Sql query the most the 2nd common elements in a column
 
>
NextMove
 
*
Q: 1和2都是同样4组数据:transaction master data, wire transfer master data, branch master data, 第四个忘记了,用不着
1. 是算destiny country是Canada,固定时间的,所有transaction总和 (第一,第二组inner join一下,加上条件,算个sum)
2. 是算固定时间下,ATM only的branch,列出branch id和transaction amount (要去branch master data里去look up一下,对应“ATM only”的编号)
 
*
Q: MySQL subquery, MAX, Join, Left Join
 
*
Q: How to adding column in a table without taking table offline?
 
*
Q: delete/truncate difference?
 
*
Q: Case Top earning person in each department
 
*
Q: Simple order by statement to present the data in a ascending order. implement the ascending order on the same set of data without using…
 
*
Q: Given a raw data table, how would you write the SQL to perform the ETL to get data into a desired format?
 
*
Q: Create DDL (table and foreign keys) for several tables in a provided ERD. ERD contains at least one many to many relationship.
 
*
Q: Round2 more and more details on partitioning? partition keys? partition pairing? partitioning with merge concept? diff btw primary keys and partition keys? upsert concept..whats required for upsert? how does it work?
 
*
Q: FB query a scenario and create a view
 
*
Q: FB related to joins and fetching highest salary and stuff,The SQL questions consisted of Group by,find the highest salary per department.
 
V
*
Q: (Oracle) What are the differences among ROWNUM, RANK and DENSE_RANK?
ROW_NUMBER assigns contiguous, unique numbers from 1.. N to a result set.
 
RANK does not assign unique numbers—nor does it assign contiguous numbers. If two records tie for second place, no record will be assigned the 3rd rank as no one came in third, according to RANK. See below:
Screenshot_2017-04-26_16.03.29_1
DENSE_RANK, like RANK, does not assign unique numbers, but it does assign contiguous numbers. Even though two records tied for second place, there is a third-place record. See below:
Screenshot_2017-04-26_16.03.41_1
*
Q: What is upsert?
 
V
*
Q: (Oracle) What is the difference between ROWNUM pseudo column and ROW_NUMBER() function?
 
ROWNUM is a pseudo column present in Oracle database returned result set prior to ORDER BY being evaluated. So ORDER BY ROWNUM does not work.
ROW_NUMBER() is an analytical function which is used in conjunction to OVER() clause wherein we can specify ORDER BY and also PARTITION BY columns.
Suppose if you want to generate the row numbers in the order of ascending employee salaries for example, ROWNUM will not work. But you may use ROW_NUMBER() OVER() like shown below:
Screenshot_2017-04-26_16.12.27
*
Q: Case Facebook Data Scientist 1
Table: friending (date | time | action | actor_id | target_id), action = {'send_request','accept_request'}
Q(1):SQL to generate friend request acceptance rate
Q(2):generate the friend request acceptance rate for people who accept within 24 hours:.
 
V
*
Q: Case Facebook Data Scientist 2
table: content_actions {user_id|content_id|conent_type|target_id} content_type = {"comment", "post"}
Q: 1.What is the total number of comments and total number of posts?
     2.what is the distribution of comments?
3. How to get the nick name of each facebook user suach david - dave , and if we already have the data how can we use it?
 
补充内容 (2017-3-26 07:43):
?我又看一遍才猛然发现当时写的有很大问题啊 蠢哭了怪不得悲剧了
select content_type, count(*) from content_id group by content_type;
select temp.cnt, count(temp.post) from
(select targer_id as post, ...
补充内容 (2017-3-26 07:44):
select temp.cnt, count(temp.post) from
(select targer_id as post, count(content_id) as cnt from content_actions
where targer_id is not null group by targer_id) temp
group by temp.cnt;