- Movie Recommendations with Spark Collaborative Filtering - Dec 1, 2021.
Not sure what movie to watch? Ask your recommender system.
Apache Spark, Collaborative, Knime, Low-Code, Recommender Systems
- The Most In Demand Skills for Data Engineers in 2021 - May 18, 2021.
If you are preparing to make a career in data or are looking for opportunities to skill-up in your current data-centric role, then this analysis of in-demand skills for 2021, based on over 17,000 Data Engineer job postings, should offer you a good idea as to which programming languages and software tools are increasing and decreasing in importance.
Apache Spark, AWS, Data Engineer, Data Science Skills, Data Scientist, Python, Skills, SQL
- How to Acquire the Most Wanted Data Science Skills - Nov 13, 2020.
We recently surveyed KDnuggets readers to determine the "most wanted" data science skills. Since they seem to be those most in demand from practitioners, here is a collection of resources for getting started with this learning.
Algorithms, Amazon, Apache Spark, AWS, Computer Vision, Data Science, Data Science Skills, Deep Learning, Docker, NLP, NoSQL, PyTorch, Reinforcement Learning, TensorFlow
- Working with Spark, Python or SQL on Azure Databricks - Aug 27, 2020.
Here we look at some ways to interchangeably work with Python, PySpark and SQL using Azure Databricks, an Apache Spark-based big data analytics service designed for data science and data engineering offered by Microsoft.
Apache Spark, Databricks, Microsoft Azure, Python, SQL
- Unifying Data Pipelines and Machine Learning with Apache Spark™ and Amazon SageMaker - Aug 25, 2020.
Roll up your sleeves and charge up because you’re invited to an interactive, virtual Machine Learning workshop run by Amazon Web Services, Databricks, and Immuta on September 10.
Apache Spark, AWS, Immuta, Sagemaker, Webinar
- Containerization of PySpark Using Kubernetes - Aug 6, 2020.
This article demonstrates the approach of how to use Spark on Kubernetes. It also includes a brief comparison between various cluster managers available for Spark.
Apache Spark, Containers, Kubernetes
- 5 Apache Spark Best Practices For Data Science - Aug 4, 2020.
Check out these best practices for Spark that the author wishes they knew before starting their project.
Apache Spark, Best Practices, Data Science
- KDnuggets™ News 20:n29, Jul 29: Easy Guide To Data Preprocessing In Python; Building a better Spark UI; Computational Algebra for Coders: The Free Course - Jul 29, 2020.
An easy guide to data pre-processing in Python; Monitoring Apache Spark with a better Spark UI; Computational Linear Algebra for Coders: the free course; Labelling data with Snorkel; Bayesian Statistics.
Apache Spark, Bayesian, Data Preprocessing, Linear Algebra, Python
- Monitoring Apache Spark – We’re building a better Spark UI - Jul 23, 2020.
Data Mechanics is developing a free monitoring UI tool for Apache Spark to replace the Spark UI with a better UX, new metrics, and automated performance recommendations. Preview these high-level feedback features, and consider trying it out to support its first release.
Apache Spark, Monitoring, UI/UX
- Apache Spark Cluster on Docker - Jul 22, 2020.
Build your own Apache Spark cluster in standalone mode on Docker with a JupyterLab interface.
Apache Spark, Data Engineering, Docker, Jupyter, Python
- Apache Spark on Dataproc vs. Google BigQuery - Jul 15, 2020.
This post looks at research undertaken to provide interactive business intelligence reports and visualizations for thousands of end users, in the hopes of addressing some of the challenges to architects and engineers looking at moving to Google Cloud Platform in selecting the best technology stack based on their requirements and to process large volumes of data in a cost effective yet reliable manner.
Apache Spark, BigQuery, Google
- The Benefits & Examples of Using Apache Spark with PySpark - Apr 21, 2020.
Apache Spark runs fast, offers robust, distributed, fault-tolerant data objects, and integrates beautifully with the world of machine learning and graph analytics. Learn more here.
Apache Spark, Data Management, Python, SQL
- Spark NLP 101: LightPipeline - Nov 27, 2019.
A Pipeline is specified as a sequence of stages, and each stage is either a Transformer or an Estimator. These stages are run in order, and the input DataFrame is transformed as it passes through each stage. Now let’s see how this can be done in Spark NLP using Annotators and Transformers.
Apache Spark, NLP, Pipeline, Spark NLP
- Time Series Analysis: A Simple Example with KNIME and Spark - Oct 23, 2019.
The task: train and evaluate a simple time series model using a random forest of regression trees and the NYC Yellow taxi dataset.
Apache Spark, Knime, Rosaria Silipo, Seasonality, Time Series
- Learn how to use PySpark in under 5 minutes (Installation + Tutorial) - Aug 13, 2019.
Apache Spark is one of the hottest and largest open source project in data processing framework with rich high-level APIs for the programming languages like Scala, Python, Java and R. It realizes the potential of bringing together both Big Data and machine learning.
Apache Spark, Big Data, Data Science, Python
- Spark NLP: Getting Started With The World’s Most Widely Used NLP Library In The Enterprise - Jun 18, 2019.
The Spark NLP library has become a popular AI framework that delivers speed and scalability to your projects. Check out what's under the hood and learn about how to getting started leveraging Spark NLP from John Snow Labs.
Apache Spark, Enterprise, John Snow Labs, NLP, Spark NLP
- Scalable Python Code with Pandas UDFs: A Data Science Application - Jun 13, 2019.
There is still a gap between the corpus of libraries that developers want to apply in a scalable runtime and the set of libraries that support distributed execution. This post discusses how to bridge this gap using the the functionality provided by Pandas UDFs in Spark 2.3+
Apache Spark, Big Data, Pandas, Python
- What you need to know: The Modern Open-Source Data Science/Machine Learning Ecosystem - Jun 10, 2019.
We identify the 6 tools in the modern open-source Data Science ecosystem, examine the Python vs R question, and determine which tools are used the most with Deep Learning and Big Data.
Anaconda, Apache Spark, Big Data Software, Deep Learning, Excel, Keras, Poll, Python, R, RapidMiner, scikit-learn, Software, SQL, Tableau, TensorFlow
- Why physical storage of your database tables might matter - May 31, 2019.
Follow this investigation into why physical storage of your database tables might matter, from problem identification to possible issue resolutions.
Apache Spark, Databases, Postgres, SQL
- Python leads the 11 top Data Science, Machine Learning platforms: Trends and Analysis - May 30, 2019.
Python continues to lead the top Data Science platforms, but R and RapidMiner hold their share; Almost 50% have used Deep Learning tools; SQL is steady; Consolidation continues.
Pages: 1 2
Anaconda, Apache Spark, Deep Learning, Excel, Keras, Poll, Python, R, RapidMiner, scikit-learn, Software, SQL, TensorFlow
- Analyzing Tweets with NLP in Minutes with Spark, Optimus and Twint - May 24, 2019.
Social media has been gold for studying the way people communicate and behave, in this article I’ll show you the easiest way of analyzing tweets without the Twitter API and scalable for Big Data.
Pages: 1 2
Apache Spark, Big Data, Deep Learning, Machine Learning, NLP, Optimus, Python, Twint
- Data Science with Optimus Part 2: Setting your DataOps Environment - Apr 16, 2019.
Breaking down data science with Python, Spark and Optimus. Today: Data Operations for Data Science. Here we’ll learn to set-up Git, Travis CI and DVC for our project.
Apache Spark, Data Operations, Data Science, Python, Workflow
- Data Science with Optimus Part 1: Intro - Apr 15, 2019.
With Optimus you can clean your data, prepare it, analyze it, create profilers and plots, and perform machine learning and deep learning, all in a distributed fashion, because on the back-end we have Spark, TensorFlow, Sparkling Water and Keras. It’s super easy to use.
Apache Spark, Data Science, Python, Workflow
- Practical Apache Spark in 10 Minutes - Jan 11, 2019.
Check out this series of articles on Apache Spark. Each part is a 10 minute tutorial on a particular Apache Spark topic. Read on to get up to speed using Spark.
Apache Spark
- Apache Spark Introduction for Beginners - Oct 18, 2018.
An extensive introduction to Apache Spark, including a look at the evolution of the product, use cases, architecture, ecosystem components, core concepts and more.
Apache Spark, Beginners, Hadoop, R
- Project Hydrogen, new initiative based on Apache Spark to support AI and Data Science - Aug 16, 2018.
An introduction to Project Hydrogen: how it can assist machine learning and AI frameworks on Apache Spark and what distinguishes it from other open source projects.
AI, Apache Spark, Data Science, Databricks, Distributed Computing, Production
- Introduction to Apache Spark - Jul 6, 2018.
This is the first blog in this series to analyze Big Data using Spark. It provides an introduction to Spark and its ecosystem.
Apache Spark, Data Processing, Distributed Systems
- The 6 components of Open-Source Data Science/ Machine Learning Ecosystem; Did Python declare victory over R? - Jun 6, 2018.
We find 6 tools form the modern open source Data Science / Machine Learning ecosystem; examine whether Python declared victory over R; and review which tools are most associated with Deep Learning and Big Data.
Anaconda, Apache Spark, Data Science, Keras, Machine Learning, Open Source, Poll, Python, R, RapidMiner, Scala, scikit-learn, TensorFlow
- Apache Spark : Python vs. Scala - May 4, 2018.
When it comes to using the Apache Spark framework, the data science community is divided in two camps; one which prefers Scala whereas the other preferring Python. This article compares the two, listing their pros and cons.
Apache Spark, Java, Python, Scala
- Deep Learning With Apache Spark: Part 1 - Apr 18, 2018.
First part on a full discussion on how to do Distributed Deep Learning with Apache Spark. This part: What is Spark, basics on Spark+DL and a little more.
Apache Spark, Databricks, Deep Learning, Pipeline
- [ebook] 7 Steps for a Developer to Learn Apache Spark - Apr 17, 2018.
We offer a step-by-step guide to technical content and related assets that to help you learn Apache Spark, whether you're getting started with Spark or are an accomplished developer.
Apache Spark, Databricks, Developer, ebook, Spark SQL
- Ranking Popular Distributed Computing Packages for Data Science - Mar 20, 2018.
We examined 140 frameworks and distributed programing packages and came up with a list of top 20 distributed computing packages useful for Data Science, based on a combination of Github, Stack Overflow, and Google results.
Apache Spark, Data Science, Distributed Systems, GitHub, Hadoop
- A powerful new IDE to build, test, and run Apache Spark applications on your desktop for free! - Feb 23, 2018.
Build enterprise-grade functionally rich Spark applications with the aid of an intuitive drag-and-drop user interface and a wide array of pre-built Spark operators.
Apache Spark, Impetus, StreamAnalytix
- Top 15 Scala Libraries for Data Science in 2018 - Feb 9, 2018.
For your convenience, we have prepared a comprehensive overview of the most important libraries used to perform machine learning and Data Science tasks in Scala.
Apache Spark, Data Analysis, Data Science, Data Visualization, Machine Learning, NLP, Scala
- Graph Analytics Using Big Data - Dec 4, 2017.
An overview and a small tutorial showing how to analyze a dataset using Apache Spark, graphframes, and Java.
Pages: 1 2
Apache Spark, Big Data, Graph Analytics, India, Java
- Natural Language Processing Library for Apache Spark – free to use - Nov 28, 2017.
Introducing the Natural Language Processing Library for Apache Spark - and yes, you can actually use it for free! This post will give you a great overview of John Snow Labs NLP Library for Apache Spark.
Apache Spark, API, GitHub, John Snow Labs, Machine Learning, NLP
- PySpark SQL Cheat Sheet: Big Data in Python - Nov 16, 2017.
PySpark is a Spark Python API that exposes the Spark programming model to Python - With it, you can speed up analytic applications. With Spark, you can get started with big data processing, as it has built-in modules for streaming, SQL, machine learning and graph processing.
Pages: 1 2
Apache Spark, Big Data, DataCamp, Python, SQL
- A Tale of Three Apache Spark APIs: RDDs, DataFrames, and Datasets - Aug 22, 2017.
In this blog, I explore three sets of APIs—RDDs, DataFrames, and Datasets—available in a pre-release preview of Apache Spark 2.0; why and when you should use each set; outline their performance and optimization benefits; and enumerate scenarios when to use DataFrames and Datasets instead of RDDs.
Apache Spark, API
- Apache Flink: The Next Distributed Data Processing Revolution? - Jul 5, 2017.
Will Apache Flink displace Apache Spark as the new champion of Big Data Processing? We compare Spark and Apache Flink performance for batch processing and stream processing.
Apache Spark, Big Data, Flink, Streaming Analytics
- How Feature Engineering Can Help You Do Well in a Kaggle Competition – Part I - Jun 8, 2017.
As I scroll through the leaderboard page, I found my name in the 19th position, which was the top 2% from nearly 1,000 competitors. Not bad for the first Kaggle competition I had decided to put a real effort in!
Apache Spark, Feature Engineering, Jupyter, Kaggle, Machine Learning, Python
- Data Science for Newbies: An Introductory Tutorial Series for Software Engineers - May 31, 2017.
This post summarizes and links to the individual tutorials which make up this introductory look at data science for newbies, mainly focusing on the tools, with a practical bent, written by a software engineer from the perspective of a software engineering approach.
Apache Spark, Data Science, Jupyter, Machine Learning, Pandas, Python, Reddit, Scala, SQL
- Gartner Data Science Platforms – A Deeper Look - Mar 3, 2017.
Thomas Dinsmore critical examination of Gartner 2017 MQ of Data Science Platforms, including vendors who out, in, have big changes, Hadoop and Spark integration, open source software, and what Data Scientists actually use.
Apache Spark, Data Science Platform, Gartner, IBM, Python, R, SAS, Thomas Dinsmore
- Apache Arrow and Apache Parquet: Why We Needed Different Projects for Columnar Data, On Disk and In-Memory - Feb 16, 2017.
Apache Parquet and Apache Arrow both focus on improving performance and efficiency of data analytics. These two projects optimize performance for on disk and in-memory processing
Apache, Apache Arrow, Apache Spark, Data Science, Dremio, In-Memory Computing, Machine Learning, Python
- A simple approach to anomaly detection in periodic big data streams - Aug 24, 2016.
We describe a simple and scaling algorithm that can detect rare and potentially irregular behavior in a time series with periodic patterns. It performs similarly to Twitter's more complex approach.
Anomaly Detection, Apache Spark, BMW, Time Series, Twitter
- Big Data Key Terms, Explained - Aug 11, 2016.
Just getting started with Big Data, or looking to iron out the wrinkles in your current understanding? Check out these 20 Big Data-related terms and their concise definitions.
Pages: 1 2
3Vs of Big Data, Apache Spark, Big Data, Business Intelligence, Cloud Computing, Data Warehouse, Explained, Hadoop, Key Terms, Predictive Analytics
- Apache Spark Key Terms, Explained - Jun 13, 2016.
An overview of 13 core Apache Spark concepts, presented with focus and clarity in mind. A great beginner's overview of essential Spark terminology.
Pages: 1 2
Apache Spark, Databricks, Dataset, Explained, Key Terms, RDD, Tungsten
- XGBoost: Implementing the Winningest Kaggle Algorithm in Spark and Flink - Mar 24, 2016.
An overview of XGBoost4J, a JVM-based implementation of XGBoost, one of the most successful recent machine learning algorithms in Kaggle competitions, with distributed support for Spark and Flink.
Apache Spark, Distributed Systems, Flink, Kaggle, XGBoost
- Introducing GraphFrames, a Graph Processing Library for Apache Spark - Mar 7, 2016.
An overview of Spark's new GraphFrames, a graph processing library based on DataFrames, built in a collaboration between Databricks, UC Berkeley's AMPLab, and MIT.
Apache Spark, Databricks, Graph Analytics
- Top Big Data Processing Frameworks - Mar 3, 2016.
A discussion of 5 Big Data processing frameworks: Hadoop, Spark, Flink, Storm, and Samza. An overview of each is given and comparative insights are provided, along with links to external resources on particular related topics.
Apache Samza, Apache Spark, Apache Storm, Flink, Hadoop
- Top Spark Ecosystem Projects - Mar 2, 2016.
Apache Spark has developed a rich ecosystem, including both official and third party tools. We have a look at 5 third party projects which complement Spark in 5 different ways.
Apache Mesos, Apache Spark, Cassandra, Databricks, Distributed Systems
- Data Science Skills for 2016 - Feb 12, 2016.
As demand for the hottest job is getting hotter in new year, the skill set required for them is getting larger. Here, we are discussing the skills which will be in high demand for data scientist which include data visualization, Apache Spark, R, python and many more.
Apache Spark, CrowdFlower, Data Science, Python, Skills, SQL
- Auto-Scaling scikit-learn with Spark - Feb 11, 2016.
Databricks gives us an overview of the spark-sklearn library, which automatically and seamlessly distributes model tuning on a Spark cluster, without impacting workflow.
Apache Spark, Databricks, Open Source, scikit-learn
- Python Data Science with Pandas vs Spark DataFrame: Key Differences - Jan 29, 2016.
A post describing the key differences between Pandas and Spark's DataFrame format, including specifics on important regular processing features, with code samples.
Apache Spark, Pandas, Python
- Deep Learning with Spark and TensorFlow - Jan 28, 2016.
The integration of TensorFlow with Spark leverages the distributed framework for hyperparameter tuning and model deployment at scale. Both time savings and improved error rates are demonstrated.
Apache Spark, Deep Learning, Distributed Systems, TensorFlow
- How to Check Hypotheses with Bootstrap and Apache Spark - Jan 28, 2016.
Learn how to leverage bootstrap sampling to test hypotheses, and how to implement in Apache Spark and Scala with a complete code example.
Apache Spark, Bootstrap sampling, Dmitry Petrov, Statistical Analysis
- Spark + Deep Learning: Distributed Deep Neural Network Training with SparkNet - Dec 4, 2015.
Training deep neural nets can take precious time and resources. By leveraging an existing distributed batch processing framework, SparkNet can train neural nets quickly and efficiently.
Pages: 1 2
Apache Spark, Caffe, Deep Learning, Distributed Systems, H2O, Matthew Mayo, Neural Networks
- Fast Big Data: Apache Flink vs Apache Spark for Streaming Data - Nov 10, 2015.
Real-time stream processing has been gaining momentum in recent past, and major tools which are enabling it are Apache Spark and Apache Flink. Learn with the help of a case study about Data processing, Data Flow, Data management using these tools.
Pages: 1 2
Apache Spark, Big Data, Flink, Streaming Analytics
- Spark SQL for Real-Time Analytics - Sep 4, 2015.
Apache Spark is the hottest topic in Big Data. This tutorial discusses why Spark SQL is becoming the preferred method for Real Time Analytics and for next frontier, IoT (Internet of Things).
Ajit Jaokar, Apache Spark, Real-time, SQL, Sumit Pal
- Which Big Data, Data Mining, and Data Science Tools go together? - Jun 11, 2015.
We analyze the associations between the top Big Data, Data Mining, and Data Science tools based on the results of 2015 KDnuggets Software Poll. Download anonymized data and analyze it yourself.
Apache Spark, Data Mining Software, Excel, Hadoop, Knime, Poll, Python, R, RapidMiner, SQL
- R leads RapidMiner, Python catches up, Big Data tools grow, Spark ignites - May 25, 2015.
R is the most popular overall tool among data miners, although Python usage is growing faster. RapidMiner continues to be most popular suite for data mining/data science. Hadoop/Big Data tools usage grew to 29%, propelled by 3x growth in Spark. Other tools with strong growth include H2O (0xdata), Actian, MLlib, and Alteryx.
Actian, Apache Spark, Data Mining Software, H2O, Knime, Poll, Python, R, RapidMiner, SQL
- Exclusive Interview: Matei Zaharia, creator of Apache Spark, on Spark, Hadoop, Flink, and Big Data in 2020 - May 22, 2015.
Apache Spark is one the hottest Big Data technologies in 2015. KDnuggets talks to Matei Zaharia, creator of Apache Spark, about key things to know about it, why it is not a replacement for Hadoop, how it is better than Flink, and vision for Big Data in 2020.
Apache Spark, Big Data, Databricks, Flink, Hadoop, Matei Zaharia, MLlib, Spark SQL
- Interview: Arno Candel, H2O.ai on the Basics of Deep Learning to Get You Started - Jan 20, 2015.
We discuss how Deep Learning is different from the other methods of Machine Learning, unique characteristics and benefits of Deep Learning, and the key components of H2O architecture.
Apache Spark, Arno Candel, Deep Learning, H2O, Machine Learning
- Predictions: 2015 Analytics and Data Science Hiring Market - Jan 13, 2015.
Thanks to Big Data, analytics have become inescapable. Forget the C-Suite if you’re not a Data Geek, recruiting for startups gets harder, analytics salary bands get a lift, and more 2015 predictions.
Apache Spark, Burtch Works, Data Science, Hiring, MOOC, Predictions for 2015, Salary, Startups
- Apache Spark: O’Reilly Certification, EU Training, University Program - Sep 26, 2014.
Recent news on Apache Spark includes developer certification from O'Reilly, upcoming training workshops in EU by Databricks, and Spark tutorial events at major universities.
Academics, Apache Spark, Big Data, Certification, Databricks, Paco Nathan, Strata, Training
- 18 essential Hadoop tools - Aug 1, 2014.
Hadoop tools develop at a rapid rate, and keeping up with the latest can be difficult. Here we detail 18 of the most essential tools that work well with Hadoop.
Apache Spark, Data Infrastructure, Hadoop