- 7 Top Open Source Datasets to Train Natural Language Processing (NLP) & Text Models - Nov 8, 2021.
With a lot of excitement and research around NLP, there are growing opportunities to apply these technologies to real-world scenarios. Becoming familiar with NLP is not trivial, and these open-source datasets can help you build your skills.
Dataset, NLP, Open Source
- Free dataset worth $1350 to test the accent gap! - Aug 3, 2021.
With so many accent variations, how do speech and voice technologies keep up? In a few words: accented speech training data, representative of diverse groups of people. The more people your model can understand, the more likely you are to acquire and retain customers.
Competition, Dataset, Marketplace, Speech Recognition
- Largest Dataset Analyzed – Poll Results and Trends - Jul 1, 2020.
The results show that despite the deluge of Big Data, a large majority still works with Gigabyte- or Megabyte-size datasets. Data Scientists work with the largest datasets, followed by Data Engineers, Data Analysts, and Business Analysts. Read more for details.
Data Scientist, Dataset, Largest, Poll, Trends
- Scikit-Learn & More for Synthetic Dataset Generation for Machine Learning - Sep 19, 2019.
While mature algorithms and extensive open-source libraries are widely available for machine learning practitioners, sufficient data to apply these techniques remains a core challenge. Discover how to leverage scikit-learn and other tools to generate synthetic data appropriate for optimizing and fine-tuning your models.
Dataset, Machine Learning, scikit-learn, Synthetic Data
- Amazing consistency: Largest Dataset Analyzed / Data Mined – Poll Results and Trends - Oct 29, 2018.
The poll results show amazing consistency with past years, with median answers still in the 10-100 gigabyte range. Really Big Data Scientists (100 Petabytes and more) continue to stand apart, but remain a small segment, in which Asian data scientists lead for the first time in this poll.
Asia, Dataset, Europe, Largest, Poll, USA
- Toward Increased k-means Clustering Efficiency with the Naive Sharding Centroid Initialization Method - Mar 13, 2017.
What if a simple, deterministic approach which did not rely on randomization could be used for centroid initialization? Naive sharding is such a method, and its time-saving and efficient results, though preliminary, are promising.
Algorithms, Clustering, Dataset, K-means
- Web Scraping for Dataset Curation, Part 2: Tidying Craft Beer Data - Feb 14, 2017.
This is the second part in a two-part series on curating data from the web. The first part focused on web scraping, while this post details the process of tidying the scraped data afterward.
Beer, Data Curation, Dataset, Python
- Web Scraping for Dataset Curation, Part 1: Collecting Craft Beer Data - Feb 13, 2017.
This post is the first in a two-part series on scraping and cleaning data from the web using Python. This first part is concerned with the scraping aspect, while the second part will focus on the cleaning. A concrete example is presented.
Beer, Data Curation, Dataset, Python, Web Scraping
- Apache Spark Key Terms, Explained - Jun 13, 2016.
An overview of 13 core Apache Spark concepts, presented with focus and clarity in mind. A great beginner's overview of essential Spark terminology.
Apache Spark, Databricks, Dataset, Explained, Key Terms, RDD, Tungsten
- Yahoo Releases the Largest-ever Machine Learning Dataset for Researchers - Jan 18, 2016.
Are you interested in massive amounts of data for research? Yahoo has just released the largest-ever machine learning dataset to the research community.
Anonymized, Dataset, Machine Learning, Yahoo