5 More Command Line Tools for Data Science
Use these tools to access APIs, manipulate CSV files, download datasets, and more from your terminal.
Image by Author
1. csvkit
csvkit is the king of tabular data. It is a collection of tools for converting CSV files, manipulating the data, and performing data analysis.
You can install csvkit using pip:
$ pip install csvkit
Example 1
In this example, we will use csvcut to select only two columns and use csvlook to display the results in tabular format.
csvcut -c sepal_length,species iris.csv | csvlook --max-rows 5
Note: you can limit the number of rows displayed with the --max-rows argument.
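If you are not sure which columns a file contains, csvcut can also list the header names. A quick sketch, assuming iris.csv has the standard header row:

csvcut -n iris.csv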
Example 2
We will convert a CSV file into a JSON file using csvjson.
csvjson iris.csv > iris.json
Note: csvkit also provides Excel-to-CSV and JSON-to-CSV tools.
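For instance, here is a minimal sketch of both conversions using csvkit's in2csv tool (data.xlsx and data.json are placeholder file names):

in2csv data.xlsx > data.csv
in2csv data.json > data.csv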
Example 3
We can also perform data analysis on a CSV file using a SQL query. csvsql requires a SQL query and the CSV file path. You can display the results or save them to a new CSV file.
csvsql --query "select * from iris where species like 'Iris-setosa'" iris.csv | csvlook --max-rows 5
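To save the filtered rows instead of displaying them, you can redirect the output to a new file (setosa.csv is just a placeholder name):

csvsql --query "select * from iris where species like 'Iris-setosa'" iris.csv > setosa.csv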
2. IPython
IPython is an interactive Python shell that brings some of the functionality of a Jupyter notebook into your terminal. It allows you to test ideas faster without creating a Python file.
Install IPython using pip:
$ pip install ipython
Note: IPython also comes with Anaconda and Jupyter Notebook, so in most cases you don't have to install it separately.
After installing, just type ipython in the terminal and start performing data analysis just like you do in Jupyter notebooks. It is easy and fast.
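Here is a minimal session sketch, assuming pandas is installed and iris.csv is in the current directory:

$ ipython
In [1]: import pandas as pd
In [2]: df = pd.read_csv("iris.csv")
In [3]: df.groupby("species")["sepal_length"].mean()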
3. cURL
cURL stands for client URL, and it is a CLI tool for transferring data to and from a server using URLs. You can use it to limit the transfer rate, log errors, display progress, and test endpoints.
In the example, we are downloading a machine learning dataset from the UCI Machine Learning Repository and saving it as a CSV file.
curl -o blood.csv https://archive.ics.uci.edu/ml/machine-learning-databases/blood-transfusion/transfusion.data
Output:
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 12843  100 12843    0     0   7772      0  0:00:01  0:00:01 --:--:--  7769
You can use cURL to access APIs with tokens, push files, and automate data pipelines.
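As a hedged sketch of calling a token-protected API, the command below passes an authorization header. The endpoint is hypothetical, and API_TOKEN is an environment variable you would set yourself:

curl -H "Authorization: Bearer $API_TOKEN" https://api.example.com/v1/datasets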
4. awk
awk is a terminal scripting language that we can use to manipulate data and perform data analysis. It requires no compiling. We can use variables, numeric functions, string functions, and logical operators to write any type of script.
In the example, we are displaying the first and last columns of the CSV file and showing the last 10 rows. The $1 in the script means the first column. You can change it to $3 to display the third column. $NF represents the last column.
awk -F "," '{print $1 " | " $NF}' iris.csv | tail
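As a small sketch of awk's variables and numeric functions, this script averages the first column, skipping the header row (it assumes iris.csv has a header):

awk -F "," 'NR > 1 {sum += $1; n++} END {print "mean:", sum / n}' iris.csv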
5. Kaggle
The Kaggle API allows you to download all kinds of datasets from the Kaggle website. Furthermore, you can update your public datasets, submit files to competitions, and run and manage Jupyter Notebooks. It is a super command line tool.
Install the Kaggle API using pip:
$ pip install kaggle
After that, go to the Kaggle website and get your credentials. You can follow this guide to set up your username and private key.
export KAGGLE_USERNAME=kingabzpro
export KAGGLE_KEY=xxxxxxxxxxxxxx
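As a quick sanity check that the credentials work, you can list datasets matching a search term:

$ kaggle datasets list -s "employment trends"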
Example 1
After setting up authentication, you can search for and download any dataset. In our case, we are using the Survey on Employment Trends dataset.
Image from Survey on Employment Trends
You can either run the download command with the -d argument followed by USERNAME/DATASET:
$ kaggle datasets download -d revathyta/survey-on-employment-trends
Or, you can simply get the API command by clicking on the three dots on the dataset page and selecting the "Copy API command" option.
Image from Survey on Employment Trends
It will download the dataset as a zip file. You can then run the unzip command to extract the data.
Downloading survey-on-employment-trends.zip to C:\Users\abida
0%| | 0.00/6.22k [00:00<?, ?B/s]
100%|██████████████████████████████████████████████████████████████████████████████████████████████████| 6.22k/6.22k [00:00<?, ?B/s]
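For example, extracting the zip file named in the output above:

$ unzip survey-on-employment-trends.zip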
Example 2
To create and share your dataset on Kaggle, you first need to initialize a metadata file by providing the path of the dataset.
$ kaggle datasets init -p /work/Kaggle/World-Vaccine-Progress
After that, create the dataset and push the files to the Kaggle server.
$ kaggle datasets create -p /work/Kaggle/World-Vaccine-Progress
You can also update your dataset by using the version command. It requires a file path and a message, just like Git.
$ kaggle datasets version -p /work/Kaggle/World-Vaccine-Progress -m "second version"
You can also check out my project, Vaccine Update Dashboard, which uses the Kaggle API to update the dataset regularly.
Conclusion
There are so many amazing CLI tools that I use; they have improved my productivity and helped me automate most of my work. You can even create your own CLI tool in Python using click or argparse.
In this article, we have learned about CLI tools to download the dataset, manipulate it, perform analysis, run scripts, and generate reports.
I am a fan of the Kaggle API and csvkit. I use them regularly to automate my notebooks and analysis. If you want to learn how to use command line tools in your data science workflow, read the Data Science at the Command Line book online for free.
Abid Ali Awan (@1abidaliawan) is a certified data scientist professional who loves building machine learning models. Currently, he is focusing on content creation and writing technical blogs on machine learning and data science technologies. Abid holds a Master's degree in Technology Management and a bachelor's degree in Telecommunication Engineering. His vision is to build an AI product using a graph neural network for students struggling with mental illness.