Top Handy SQL Features for Data Scientists
Whenever we hear "data," the first thing that comes to mind is SQL! SQL comes with easy and quick to learn features to organize and retrieve data, as well as perform actions on it in order to gain useful insights.
By Saurabh Hooda, Hackr.io
SQL is to data science as nature is to life. Without preserving nature, there is no possibility of life. Without preserving the knowledge of SQL, there would be no data science.
Even though NoSQL databases and programming languages like R and Python do the job extremely well, whenever we hear "data," the first thing that comes to mind is SQL! Python does have excellent libraries that help data scientists – check the top Data Science Python Libraries here – but SQL is still the preferred mode of handling data.
Why?
SQL (Structured Query Language) has been around for a long time and is a very comfortable programming language for many developers. SQL comes with easy and quick to learn features to organize and retrieve data, as well as perform actions on it in order to gain useful insights. You can read some of the commonly asked SQL interview questions here to get a basic idea of what you should know about SQL. It is also efficient and close to the data.
If you have no idea about databases and haven’t worked with any you should start with the basics. Start by reading about DBMS (Database Management Systems) normalization, which will help you understand why a structure is important for easy analysis of data.
If you are into web development, you know that data is fetched from the backend (database) and presented to the front-end (UI). In the same way, data is entered by the user into the database. Visit Squareboat to know how we can help you with web development.
That said, let us now focus on the top handy features of SQL that should be on a Data Scientist’s fingertips.
1. Select statement
As a data scientist, you will be selecting (reading) a lot of data from different tables to get patterns, statistics and other important information. The basic select query
select * from <table name>;
could return a lot of records. In the real world, databases have millions of records, and if there are multiple columns in a table you will be overwhelmed with the magnitude of results. You can select only the columns you need.
select <column1>, <column2> from <table name>;
For example, if you want to know the name and age of all registered students, you can query the database as –
select name, age from student;
2. Grouping and sorting functions (Ranking and top n analysis)
Most of the time, you would want to work on a subset of data that you are interested in. For example, if you want to know the number of teenagers who use your product, you can do that easily with a where clause.
select name, age from student where age between 13 and 19;
In the same way, you can group the students based on their age. The below query will get you the number of students (count) from each branch/department.
select count(student_id), deptt from student group by deptt;
Let us say you want to know which department has the most/least students. You would simply use the order by clause to do that.
select count(student_id), deptt from student group by deptt order by count(student_id) desc;
Now, if we only want to know about those departments where at least 10 students use our product, what should we do?
We can use ‘having’ clause, which is similar to ‘where’ clause but works on a group of data.
select count(student_id), deptt from student group by deptt having count(student_id) >10 order by count(student_id) desc;
Getting such limited number of records from a huge data set is called top-n analysis. Some real-life examples could be – top 2 students who scored the highest marks, top 10 Facebook users who spend most time browsing, top 3 products that were sold by a company, top 5 employees of the month, least 3 performers of the month and so on. For example –
select rownum as toppers, first_name, last_name, total_marks from (select first_name, last_name from student order by total_marks) where rownum<3;
The first select statement (inline view) gets all records ordered by total_marks, while the 2nd one gets only the first two (n) top records.
Very easily, we have obtained data to a minute filtration level, without scanning or searching extensively for it. We can do a lot of such filtering to further satisfy more and more conditions.
3. String functions (text mining)
There are many useful string functions in SQL just like we have in programming languages. You can avoid writing lines of code when such features are available in the DBMS itself, because it is faster. Some common string functions are –
1. upper and lower
To convert the entire string to upper or lower case, we can use this function. This can come in handy when we want something to be printed in a respective case. For example, if we want a student's first name to be prints in all caps, we can use upper function –
select UPPER(first_name) from student;
2. replace
If we have to replace one or more characters with another, we can use this function. For example, let us say we need to get the list of mobile numbers without dashes in between. The value database is 1-832-234-1098 and we want to replace all the hyphens with spaces.
select first_name, REPLACE(phone_number, ‘-‘, ‘ ‘) as number from student;
This will present the value as 1 832 234 1098.
3. concat
Joins two or more different columns or strings to give result as one. For example, if you want to display full name as first name plus last name, concat can be used.
select CONCAT(first_name, ‘ ‘, last_name) as fullName from student;
4. substring
Just like in a programming language, SQL substring gets certain characters from a string. For example, if the exam table has hall_ticket_no (example 00019812345) wherein we only want the last few characters, we can write the query as –
select substring(hall_ticket_no,7,11) as sequence_num from exam;
There are two more variations of substring, left and right, that get the left few or right few characters accordingly. In our above example, we can use the RIGHT function as well.
5. len
Gets the number of characters in a string. For example –
select student_profile from student where len(student_profile) < 25;
This will show only those profiles which have shorter descriptions.
6. ltrim, rtrim
ltrim and rtrim remove the leading and trailing spaces respectively from a string. Sometimes database entries are with a lot of spaces at the beginning or end – it may be that the user entered it, or a program padded the spaces; we don’t know. If you wish to trim those and present the data, use these functions. Example –
select ltrim(first_name) from student;
To remove leading and trailing spaces, you can use –
select trim(first_name) from student;
4. Date functions
Handling dates is slightly complex but SQL does this with ease.
There are many functions like
- DATEADD – add one year to an existing date
- TO_DATE – converts a string into a date
- DATEDIFF – find the difference between the 2 given dates
- DATEPART – get a particular part of the date; for example, year, month or date
- DAY – get the day of the month for the given date
- CURRENT_TIMESTAMP – get the date and time (timestamp)
All the above are quite useful to analyze various types of data. For example, you can use DATEDIFF to get data from a particular range of time, or use the DAY function to find out on what days most students take leave and so on.
5. Aggregations (Statistical functions)
Aggregate functions allow us to find the sum (SUM), average(AVG), minimum(MIN), maximum(MAX) and count(COUNT) values from a set of data. We use these functions with group by and having clauses. For example, if we want to know what is the average percentage of marks that the students of each department have collectively, we can use the AVG function.
select AVG(total_marks) from students group by deptt;
In the same way, to get the number of students in a particular deptt, we can use count.
select count(*) from students group by deptt;
6. Joins
Often you will want to collate data from multiple tables and get only those columns that are matching some condition or pattern. For getting data from two or more tables, we use SQL joins.
7. Regular expressions
It is very useful to know about regex. For example, if you want to validate a phone number, credit card or any numeric values that match a certain sequence, use regex.
For example, if you want to get phone numbers that start with a number (and not a bracket or any other character), you can use –
select * from contact where phone like '[0-9]%';
You can do the same using any programming language like JavaScript, but it is simpler and less time consuming than executing code. Regular expressions can also be used to find a specific pattern in any columns of a table. For example, if you want to get the name of students that contains ‘ya’, you can do so by using the LIKE clause.
select first_name, deptt from student where first_name like ‘%ya%’;
8. Loading and copying data into database
If you have loads of data in Excel or .csv format and you want to copy all of it into a DBMS, SQL can do that for you.
You can use COPY FROM to copy data from a file to the database. Same way to copy data from database to file, use COPY INTO command.
9. Data bucketing
As we saw earlier, using group by helps us to get sets of data so that we can find trends and deduce business opportunities by analyzing it. Bucketing is the term used to find those groupings (most of the time with timestamps and numbers) and to generate histograms. This reduces human observation errors. For example, using the truncate function, we can get the nearest round off value for a decimal.
select truncate (average, 0) from students; -- if the actual value was 78.333, the result will be 78.
In the same way, date_trunc is used to group dates together. This could be useful in tracking user activities over a period of time. For example, the activities of the students in an online learning portal over a specific date interval, for example, weekdays and weekends.
Read more about Data bucketing here.
10. Sequencing
Using SQL sequences, we can generate a numeric sequence on demand in ascending or descending order. Sequences can be created using create sequence <sequence_name> by giving necessary details. Sequences are not associated with any tables, hence it is useful to retrieve table values using a sequence directly, if there is one. Calling ‘next value’ function retrieves the next row instead of selecting from the table.
The final word
If you haven’t yet had hands-on experience with SQL, you should try these tutorials. As a data scientist, you may not be using all the powerful features of SQL. However, once you start playing with the data, you would definitely enjoy digging more. The above list should help you get through the features you will be expected to use as a data scientist. Do visit this link to read about all the important data science interview questions and to start building your dream career.
Bio: Saurabh Hooda has worked globally for telecom and finance giants in various capacities. After working for a decade in Infosys and Sapient, he started his first startup, Leno, to solve a hyperlocal book-sharing problem. He is interested in product marketing, and analytics. His latest venture Hackr.io recommends the best Data Science tutorial and online programming courses for every programming language. All the tutorials are submitted and voted by the programming community.
Related:
- Is SQL needed to be a data scientist?
- 7 Steps to Mastering SQL for Data Science — 2019 Edition
- Python Data Science for Beginners