- Dealing with Data Leakage - Oct 8, 2021.
Target leakage and data leakage represent challenging problems in machine learning. Be prepared to recognize and avoid these potentially messy problems.
Cross-validation, Data Science, Datasets, Machine Learning, Modeling, Training Data
- Use These Unique Data Sets to Sharpen Your Data Science Skills - Sep 29, 2021.
Want to get your hands on some real-world data sets right now? Kick off your bootcamp prep with this list of hot-button data sets curated to help you hone different data science skills.
Data Science Skills, Datasets
- Don’t Touch a Dataset Without Asking These 10 Questions - Sep 20, 2021.
Selecting the right dataset is critical for the success of your AI project.
Datasets, Distribution, Outliers, Privacy, Standardization
- 3 Data Acquisition, Annotation, and Augmentation Tools - Aug 27, 2021.
Check out these 3 projects found around GitHub that can help with your data acquisition, annotation, and augmentation tasks.
Computer Vision, Data Annotation, Data Labeling, Datasets, GitHub, NLP, Synthetic Data
- Open Source Datasets for Computer Vision - Aug 18, 2021.
Access to high-quality, noise-free, large-scale datasets is crucial for training complex deep neural network models for computer vision applications. Many open-source datasets are developed for use in image classification, pose estimation, image captioning, autonomous driving, and object segmentation. These datasets must be paired with the appropriate hardware and benchmarking strategies to optimize performance.
Computer Vision, Datasets, Open Source
- The Data Matters: Choosing the right data to analyze can make or break your analysis - Jun 15, 2021.
We started Nomad Data to help data scientists and business analysts quickly find the right commercial datasets to match their specific use case. We catalog use cases of data and use machine learning and AI to match analysis goals with datasets.
Consumer Analytics, Datasets, Geospatial
- Great New Resource for Natural Language Processing Research and Applications - May 27, 2021.
The NLP Index is a brand new resource for NLP code discovery, combining and indexing more than 3,000 paper and code pairs at launch. If you are interested in NLP research and locating the code and papers needed to understand and implement the latest research, you should check it out.
Datasets, NLP, Research
- Awesome list of datasets in 100+ categories - May 20, 2021.
With an estimated 44 zettabytes of data in existence in our digital world today and approximately 2.5 quintillion bytes of new data generated daily, there is a lot of data out there you could tap into for your data science projects. It's pretty hard to curate through such a massive universe of data, but this collection is a great start. Here, you can find data from cancer genomes to UFO reports, as well as years of air quality data to 200,000 jokes. Dive into this ocean of data to explore as you learn how to apply data science techniques or leverage your expertise to discover something new.
Big Data, Data Science, Datasets
- Introducing The NLP Index - Apr 29, 2021.
The NLP Index is a brand new resource for NLP code discovery, combining and indexing more than 3,000 paper and code pairs at launch. If you are interested in NLP research and locating the code and papers needed to understand and implement the latest research, you should check it out.
Datasets, NLP, Research
- 8 Places for Data Professionals to Find Datasets - Dec 17, 2020.
Here is a curated list of sites and resources invaluable for data professionals to acquire practice datasets.
Data Science, Datasets, Google, Government, Kaggle, Reddit, UCI
- Top Google AI, Machine Learning Tools for Everyone - Aug 18, 2020.
Google is much more than a search company. Learn about all the tools they are developing to help turn your ideas into reality through Google AI.
AI, AutoML, Bias, Data Science Platforms, Datasets, Google, Google Cloud, Google Colab, Machine Learning, TensorFlow
- The List of Top 10 Lists in Data Science - Aug 14, 2020.
The list of Top 10 lists that Data Scientists -- from enthusiasts to those who want to jump start a career -- must know to smoothly navigate a path through this field.
Algorithms, Data Science, Data Science Skills, Datasets, Influencers, LinkedIn, Python, Top 10
- New Poll: What was the largest dataset you analyzed / data mined? - Jun 9, 2020.
Take part in KDnuggets' latest survey to have your voice heard, and let the community know the largest dataset size you have worked with.
Big Data, Datasets, Largest, Poll
- Dataset Splitting Best Practices in Python - May 26, 2020.
If you are splitting your dataset into training and testing data, you need to keep some things in mind. This discussion of 3 best practices for doing so includes a demonstration of how to implement each consideration in Python.
Datasets, Python, scikit-learn, Training Data, Validation
- 3 Best Sites to Find Datasets for your Data Science Projects - Apr 9, 2020.
When first learning data science, you will inevitably find yourself looking for more datasets to practice with. Here, we recommend the 3 best sites to find datasets to spark your next data science project.
Coronavirus, Data, Data Science, Datasets, Kaggle
- 10 Must-read Machine Learning Articles (March 2020) - Apr 9, 2020.
This list will feature some of the recent work and discoveries happening in machine learning, as well as guides and resources for both beginner and intermediate data scientists.
AI, API, Cloud, Data Analytics, Datasets, fast.ai, Machine Learning, Neural Networks, Social Media
- 21 Machine Learning Projects – Datasets Included - Mar 9, 2020.
Upgrading your machine learning, AI, and Data Science skills requires practice. To practice, you need to develop models with a large amount of data. Finding good datasets to work with can be challenging, so this article discusses more than 20 great datasets along with machine learning project ideas for you to tackle today.
Chatbot, Datasets, Google Trends, Machine Learning, Project, Uber
- The Big Bad NLP Database: Access Nearly 300 Datasets - Feb 28, 2020.
Check out this database of nearly 300 freely-accessible NLP datasets, curated from around the internet.
Datasets, NLP, Text Mining
- Passive Data Collection and Actionable Results: What to Know - Feb 21, 2020.
There are plenty of ways to get actionable results by using passive data. However, such an outcome will not happen without careful forethought. Data analysts must consider several crucial specifics, including what questions they want and expect the information to answer, and how they'll apply the findings to aid the business.
Analytics, Customer Analytics, Data Curation, Datasets
- Google Dataset Search Provides Access to 25 Million Datasets - Jan 29, 2020.
Google's dataset search is out of beta, and provides centralized access to 25 million datasets.
Data Science, Datasets, Google, Search
- The 5 Most Useful Techniques to Handle Imbalanced Datasets - Jan 22, 2020.
This post is about explaining the various techniques you can use to handle imbalanced datasets.
Balancing Classes, Datasets, Metrics, Python, Sampling, Unbalanced
- What is Data Catalog and Why You Should Care? - Dec 23, 2019.
Learn why data catalogs could be just the thing you need to meet the challenges of data and metadata management and collaboration.
Compliance, Consistency, Data Catalog, Data Governance, Datasets, Metadata, Reddit
- Data Sources 101 - Oct 28, 2019.
Data collection is one of the first steps of the data lifecycle — you need to get all the data you require in the first place. To collect the right data, you need to know where to find it and determine the effort involved in collecting it. This article answers the most basic question: where does all the data you need (or might need) come from?
Big Data, Data Science, Datasets, Unstructured data
- Know Your Data: Part 2 - Oct 8, 2019.
To build an effective learning model, you must understand the quality issues that exist in data and how to detect and deal with them. In general, data quality issues fall into four major categories.
Beginners, Data Preparation, Data Preprocessing, Datasets
- Know Your Data: Part 1 - Sep 30, 2019.
This article will introduce the different types of data sets, data objects, and attributes.
Beginners, Datasets
- Version Control for Data Science: Tracking Machine Learning Models and Datasets - Sep 13, 2019.
I am a Git god, why do I need another version control system for Machine Learning Projects?
Data Science, Datasets, Machine Learning, Modeling, Version Control
- How to Automate Tasks on GitHub With Machine Learning for Fun and Profit - May 3, 2019.
Check out this tutorial on how to build a GitHub App that predicts and applies issue labels using TensorFlow and public datasets.
Datasets, GitHub, Python, TensorFlow
- Synthetic Data Generation: A must-have skill for new data scientists - Dec 27, 2018.
A brief rundown of methods/packages/ideas to generate synthetic data for self-driven data science projects and deep diving into machine learning methods.
Classification, Clustering, Datasets, Machine Learning, Python, Synthetic Data
- Handling Imbalanced Datasets in Deep Learning - Dec 4, 2018.
It’s important to understand why we should do it so that we can be sure it’s a valuable investment. Class balancing techniques are only really necessary when we actually care about the minority classes.
Balancing Classes, Datasets, Deep Learning, Keras, Python
- Machine Learning Classification: A Dataset-based Pictorial - Nov 5, 2018.
In order to relate machine learning classification to the practical, let's see how this concept plays out, step by step (and with images), specifically in direct relation to a dataset.
Datasets, Machine Learning, Supervised Learning
- Semantic Interoperability: Are you training your AI by mixing data sources that look the same but aren’t? - Oct 9, 2018.
Semantic interoperability is a challenge in AI systems, especially since data has become increasingly more complex. The other issue is that semantic interoperability may be compromised when people use the same system differently.
AI, Datasets, Healthcare, Semantic Analysis
- Introducing VisualData: A Search Engine for Computer Vision Datasets - Sep 26, 2018.
Instead of building your own dataset, there already exists a rich collection of computer vision datasets contributed by academic researchers, hobbyists and companies.
Computer Vision, Datasets
- How (dis)similar are my train and test data? - Jun 7, 2018.
This article examines a scenario where your machine learning model can fail.
Data Science, Datasets, Feature Selection, Machine Learning, Training Data
- Training Sets, Test Sets, and 10-fold Cross-validation - Jan 9, 2018.
More generally, in evaluating any data mining algorithm, if our test set is a subset of our training data the results will be optimistic and often overly optimistic. So that doesn’t seem like a great idea.
Cross-validation, Data Mining, Datasets, Machine Learning
- 70 Amazing Free Data Sources You Should Know - Dec 20, 2017.
70 free data sources for 2017 on government, crime, health, financial and economic data, marketing and social media, journalism and media, real estate, company directory and review, and more to start working on your data projects.
Big Data, Business, Crime, Datasets, Finance, Government, Health, Journalism, Octoparse, Social Media
- How (and Why) to Create a Good Validation Set - Nov 24, 2017.
The definitions of training, validation, and test sets can be fairly nuanced, and the terms are sometimes inconsistently used. In the deep learning community, “test-time inference” is often used to refer to evaluating on data in production, which is not the technical definition of a test set.
Cross-validation, Datasets, Rachel Thomas, Training Data, Validation
- Building a Wikipedia Text Corpus for Natural Language Processing - Nov 23, 2017.
Wikipedia is a rich source of well-organized textual data, and a vast collection of knowledge. What we will do here is build a corpus from the set of English Wikipedia articles, which is freely and conveniently available online.
Datasets, Natural Language Processing, NLP, Text Mining, Wikidata, Wikipedia
- The new Enigma Public – the platform connecting people to data - Sep 11, 2017.
Public data has tremendous potential, and different people can use it to solve a variety of problems. Enigma relaunches Enigma Public, the platform connecting people to data.
Datasets, Government, Healthcare, Social Good
- More Data or Better Algorithms: The Sweet Spot - Jan 17, 2017.
We examine the sweet spot for data-driven Machine Learning companies, where it is not too easy and not too hard to collect the needed data.
Algorithms, Big Data, Data, Datasets, Machine Learning
- Data Sources for Cool Data Science Projects - Dec 20, 2016.
One of the biggest obstacles to successful projects has been getting access to interesting data. Here are some more cool public data sources you can use for your next project.
Data Incubator, Datasets, Elections, Healthcare, Michael Li
- Largest Dataset Analyzed Poll shows surprising stability, more junior Data Scientists - Nov 8, 2016.
The majority (57%) of respondents only worked with Gigabyte range data. More junior Data Scientists enter the market, but Petabyte Big Data Scientists still stand apart.
Asia, Big Data, Datasets, Europe, Largest, Poll, USA
- What is Academic Torrents and Where is Data Sharing Going? - Oct 26, 2016.
Learn more about Academic Torrents, a platform for researchers to share data consisting of a site where users can search for datasets, and a BitTorrent backbone which makes sharing data scalable and fast.
Datasets, Reproducibility, Research
- Data Science Basics: 3 Insights for Beginners - Sep 22, 2016.
For data science beginners, 3 elementary issues are given overview treatment: supervised vs. unsupervised learning, decision tree pruning, and training vs. testing datasets.
Algorithms, Beginners, Datasets, Overfitting, Supervised Learning, Unsupervised Learning
- 10 Data Acquisition Strategies for Startups - Jun 14, 2016.
An interesting discussion of the myriad methods by which startups may choose to acquire data, often the most overlooked and important aspect of a startup's success (or failure).
Acquisitions, Crowdsourcing, Datasets, Startups
- Top 10 Open Dataset Resources on Github - May 31, 2016.
The top open dataset repositories on Github include a variety of data, freely available for use by researchers, practitioners, and students alike.
Datasets, GitHub, Machine Learning, Open Data
- Datasets Over Algorithms - May 3, 2016.
The average elapsed time between key algorithm proposals and corresponding advances is about 18 years; the average elapsed time between key dataset availabilities and corresponding advances is less than 3 years, 6 times faster.
Algorithms, Datasets
- 9 Must-Have Datasets for Investigating Recommender Systems - Feb 11, 2016.
Gain some insight into a variety of useful datasets for recommender systems, including data descriptions, appropriate uses, and some practical comparison.
Datasets, Lab41, Recommender Systems
- Tour of Real-World Machine Learning Problems - Dec 26, 2015.
The tour lists 20 interesting real-world machine learning problems for data science enthusiasts to learn by solving.
Datasets, Kaggle, Learning from Data, Machine Learning, Research, UCI
- Poll Results: Where is Big Data? For most, Largest Dataset Analyzed is in laptop-size GB range - Aug 18, 2015.
A majority of data scientists (56%) work in Gigabyte dataset range. We note a small increase in Petabyte (web-scale) data miners, and a decline in Megabyte data miners. US, Australia/NZ, and Asia lead in percentage of Terabyte and Petabyte analysts.
Asia, Australia, Big Data, Datasets, Europe, Largest, Poll, USA
- Awesome Public Datasets on GitHub - Apr 6, 2015.
A long, categorized list of large datasets (available for public use) to try your analytics skills on. Which one would you pick?
Datasets, Finance, GitHub, Government, Machine Learning, NLP, Open Data, Time series data
- TweetNLP: Twitter Natural Language Processing - Oct 24, 2014.
A short overview of Natural Language Processing tools and utilities developed by Prof. Noah Smith, CMU and his team to analyze Twitter data.
Advanced Analytics, ARK, CMU, Datasets, NLP, Speech, Tools, Twitter
- Interactive Network and Graph Data Repository - Oct 17, 2014.
The network repository currently hosts more than 500 graphs/networks that span 19 collections of graphs from social science, machine learning, scientific computing, and many others.
Datasets, Graph Analytics, Graph Visualization, Network Graph
- Interesting Social Media Datasets - Aug 13, 2014.
Learn about some of the many interesting social media datasets available to you, some of which are quite new, and the different features and challenges they offer you for your next big data science project.
Challenge, Data Visualization, Datasets, Open Data, Social Media Analytics