Which Languages Should You Learn for Data Science? A Detailed Look at Ruby
Choosing which programming languages to learn is one of the most important decisions you'll make as an aspiring data scientist. The languages you know will shape what tools and frameworks you can use, what companies you can work for, the types of analyses you can do, and more.
There's no one "best" language for data science. Different languages have different strengths and weaknesses. Some, like Python and R, have a vast ecosystem of data science libraries and tools. Others, like Scala and Java, offer better performance for big data processing. And some newer languages like Julia hold promise for the future.
In this article, we'll take a detailed look at one language that's not often thought of as a "data science language", but is well worth considering: Ruby.
We'll dive into the strengths and weaknesses of Ruby for data science, look at some companies and projects using Ruby, and discuss how it compares to the most popular data science languages like Python and R. By the end, you'll have a sense of whether Ruby might be a good fit for your data science journey.
But first, let's do a quick overview of some of the most popular programming languages used in data science.
Popular Data Science Programming Languages
Here are some of the programming languages most commonly used for data science, along with their key strengths:
Python
Python is probably the most popular language for data science today. It offers a great balance of performance, productivity, and a huge ecosystem of data science libraries for everything from data processing (NumPy, Pandas) to machine learning (scikit-learn, TensorFlow, PyTorch) to visualization (Matplotlib, Seaborn).
R
R is a language built specifically for statistics and data analysis. It has an extensive collection of packages for statistical modeling, machine learning, and data visualization. While it's not as general-purpose as Python, it's great at what it does.
SQL
SQL is a must-know for data scientists, since so much data lives in relational databases. It's the lingua franca for querying and manipulating data in databases. While not a general-purpose language, SQL is a critical tool in the data scientist's toolbelt.
Java & Scala
Java is a popular general-purpose language that's also used for data science. It offers great performance, especially for processing large datasets. Scala, which runs on the Java Virtual Machine (JVM), is particularly popular in the big data world thanks to tools like Apache Spark.
Julia
Julia is a newer language that shows a lot of promise for numerical computing and data science. It aims to combine the ease of use of Python with the performance of compiled languages like C++. It's still young, but has a growing community and ecosystem of packages.
This is just a sample – there are many other languages used for data science work as well, like MATLAB, SAS, C/C++, and more.
Now let's dive into Ruby and see how it fits into the data science landscape.
Ruby for Data Science
Ruby is a general-purpose, object-oriented programming language. It was designed to be easy and productive to write, with a focus on simplicity and elegance.
While Ruby is not one of the first languages that come to mind for data science, it actually has a lot going for it. Let's look at some of the strengths and weaknesses of Ruby from a data science perspective.
Strengths of Ruby for Data Science
Readability and Ease of Use
One of Ruby's biggest strengths is its clean, readable syntax. This makes it easy to write and maintain complex data processing pipelines and analysis scripts. Data scientists often spend more time cleaning and pre-processing data than actually analyzing it, so a language that's optimized for productivity can help a lot.
Here's a simple example of reading a CSV file and calculating the average of a column in Ruby:
require 'csv'

sum = 0.0
count = 0
# Read the file row by row, treating the first line as column headers
CSV.foreach("data.csv", headers: true) do |row|
  sum += row["value"].to_f
  count += 1
end
puts "Average: #{sum / count}"
As you can see, the code is concise and easy to understand, even if you're not a Ruby expert.
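Real files often contain blank cells, which the loop above would silently count as zero. The following is a minimal sketch of one way to skip them. It uses inline text instead of data.csv so it runs on its own; the "id" column and the sample values are made up for illustration.

```ruby
require 'csv'

# Hypothetical inline data standing in for data.csv. The "value" column
# matches the example above; the "id" column is invented for illustration.
csv_text = <<~DATA
  id,value
  1,10.5
  2,
  3,4.5
  4,9.0
DATA

# Parse with headers, then keep only rows whose "value" cell is non-empty.
rows = CSV.parse(csv_text, headers: true)
values = rows.map { |row| row["value"] }
             .reject { |v| v.nil? || v.strip.empty? }
             .map(&:to_f)

# Guard against dividing by zero when no usable values are present.
average = values.empty? ? nil : values.sum / values.size

puts "Rows read: #{rows.size}, values used: #{values.size}"
puts "Average: #{average}"
```

Whether blank cells should be skipped, treated as zero, or reported as errors depends on the dataset, so treat the filtering rule here as one option rather than a default.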
Scripting and ETL Capabilities
Ruby is a great language for writing scripts to process, clean, and transform data. Its standard library and ecosystem have great tools for tasks like reading/writing CSVs, parsing JSON and XML, calling APIs, and manipulating data.
For example, Ruby has built-in support for regular expressions, which are very handy for data cleaning tasks like parsing log files or extracting structured fields from text.
Here's an example of using a regular expression to parse a log file and count different types of log events in Ruby:
event_counts = Hash.new(0)
# Stream the file line by line instead of loading it all into memory
File.foreach("application.log") do |line|
  case line
  when /Completed 200 OK/ then event_counts["success"] += 1
  when /Completed 4\d\d/ then event_counts["client_error"] += 1
  when /Completed 5\d\d/ then event_counts["server_error"] += 1
  # Group the alternation so "Started" applies to every HTTP method
  when /Started (?:POST|GET|PUT)/ then event_counts["requests"] += 1
  end
end
puts event_counts
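Counting is only half of what regular expressions offer. They can also pull structured fields out of each line. Here is a small sketch using named capture groups; the log lines and their format are assumptions made up for this illustration, not taken from a real application.

```ruby
# Hypothetical log lines; the format is an assumption for illustration.
log_lines = [
  'Started GET "/orders" for 10.0.0.1',
  'Completed 200 OK in 35ms',
  'Started POST "/orders" for 10.0.0.2',
  'Completed 500 Internal Server Error in 120ms',
  'unrelated line'
]

# Named groups make each extracted field available by name.
completed_pattern = /Completed (?<status>\d{3}) .* in (?<ms>\d+)ms/

# filter_map keeps only the lines that matched (requires Ruby 2.7+).
records = log_lines.filter_map do |line|
  match = completed_pattern.match(line)
  next unless match
  { status: match[:status].to_i, ms: match[:ms].to_i }
end

slowest = records.max_by { |r| r[:ms] }
puts records.inspect
puts "Slowest response: #{slowest[:status]} in #{slowest[:ms]}ms"
```

Turning each line into a small hash like this makes the data easy to filter, sort, or write out to a CSV file later.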
Libraries like Nokogiri for parsing HTML and XML, HTTParty for making HTTP requests, and Sequel for querying databases make ETL tasks even easier.
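Even without extra gems, the standard library can handle a small extract-transform-load step. The sketch below parses JSON, cleans a couple of fields, and emits CSV. The payload and its field names (name, signup, spend) are hypothetical, and a real pipeline would read from an API or file rather than an inline string.

```ruby
require 'json'
require 'csv'

# Hypothetical API payload; the field names are made up for illustration.
payload = <<~JSON
  [
    {"name": " Alice ", "signup": "2023-01-15", "spend": "120.50"},
    {"name": "bob",     "signup": "2023-02-01", "spend": null},
    {"name": "Carol",   "signup": "2023-02-20", "spend": "80"}
  ]
JSON

# Extract: parse the JSON text into an array of hashes.
records = JSON.parse(payload)

# Transform: tidy the names and default a missing spend to 0.0
# (nil.to_f returns 0.0).
cleaned = records.map do |r|
  {
    name: r["name"].strip.capitalize,
    signup: r["signup"],
    spend: r["spend"].to_f
  }
end

# Load: write the cleaned rows out as CSV text.
csv_out = CSV.generate do |csv|
  csv << %w[name signup spend]
  cleaned.each { |row| csv << row.values }
end

puts csv_out
```

Defaulting a missing spend to zero is an assumption that suits this toy data; in practice you might prefer to keep the gap explicit.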
Active Community and Ecosystem
While the data science community in Ruby is much smaller than Python or R, Ruby does have an active, thriving open source community. Rubygems.org, the main repository for Ruby libraries (called gems), hosts over 170,000 gems. And there are many active user groups and conferences for Rubyists around the world.
This ecosystem includes a number of gems aimed at data analysis and scientific computing, such as Daru for data frame processing, Numo::NArray for n-dimensional arrays, and Rubex for writing high-performance native extensions in a Ruby-like syntax.
Scientific Computation Capabilities
In addition to libraries for general data processing and ETL tasks, there are a number of projects bringing scientific and numerical computing features to Ruby:
- SciRuby is a collection of gems for scientific computation, including tools for linear algebra, plotting, statistics, optimization, and more.
- Rubyvis is a data visualization library for Ruby, ported from Protovis, the JavaScript toolkit that preceded D3.js.
- Statsample and statsample-glm are gems that bring R-like statistical analysis and modeling capabilities to Ruby.
- Rubex allows writing high-performance, native extensions to Ruby in a Ruby-like syntax. This can be used to optimize critical paths in data science code.
These projects demonstrate that, while not as mature as the Python or R ecosystems, Ruby is capable of the performance and feature set needed for many data science applications.
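To give a rough feel for the kind of calculation these statistics gems package up, here is a plain-Ruby sketch of a mean, a sample standard deviation, and a Pearson correlation. The helper names and the sample data are hypothetical and are not part of any gem's API; a dedicated library would add validation, missing-value handling, and better numerical care.

```ruby
# Plain-Ruby helpers; illustrative only, not part of any gem's API.
def mean(xs)
  xs.sum.to_f / xs.size
end

# Sample standard deviation (divides by n - 1).
def sample_std_dev(xs)
  m = mean(xs)
  Math.sqrt(xs.sum { |x| (x - m)**2 } / (xs.size - 1))
end

# Pearson correlation coefficient between two equal-length arrays.
def pearson(xs, ys)
  mx = mean(xs)
  my = mean(ys)
  cov = xs.zip(ys).sum { |x, y| (x - mx) * (y - my) }
  cov / Math.sqrt(xs.sum { |x| (x - mx)**2 } * ys.sum { |y| (y - my)**2 })
end

# Hypothetical data: scores are exactly twice the hours studied.
hours  = [1, 2, 3, 4, 5]
scores = [2, 4, 6, 8, 10]

puts "Mean hours: #{mean(hours)}"
puts "Std dev of hours: #{sample_std_dev(hours).round(4)}"
puts "Correlation: #{pearson(hours, scores).round(4)}"
```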
Weaknesses of Ruby for Data Science
Now let's look at a few areas where Ruby currently falls short for data science.
Smaller Ecosystem and Community
The biggest weakness of Ruby for data science is that its ecosystem and community are much smaller than those of Python or R. While there are certainly many useful data science gems, they don't have the same level of adoption, active development, and community support as the main Python and R libraries.
This means it can be harder to find help and resources online when you run into issues, and some more advanced or niche data science techniques may not have a well-maintained Ruby implementation.
Performance Limitations
As an interpreted language, Ruby is generally slower than compiled languages like Java and C/C++. For many data science use cases this is not a major issue, but for applications that require processing huge datasets or training complex machine learning models, the performance gap can be significant.
Tools like Rubex and JRuby (Ruby on the JVM) can help close this gap for performance-critical code paths, but they require additional work and expertise to use effectively.
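Before reaching for those tools, it usually helps to measure where the time actually goes. The sketch below uses the standard library's Benchmark module to time a hand-written loop against the built-in Array#sum. The array size is arbitrary, and the timings will vary by machine and Ruby version, so no particular result is claimed here.

```ruby
require 'benchmark'

# Arbitrary workload: 200,000 floats that are all multiples of 0.5.
values = Array.new(200_000) { |i| i * 0.5 }

# Time a hand-written accumulation loop.
manual_total = 0.0
manual_time = Benchmark.realtime do
  values.each { |v| manual_total += v }
end

# Time the built-in method, which is implemented in C in CRuby.
builtin_total = nil
builtin_time = Benchmark.realtime do
  builtin_total = values.sum
end

puts "Manual loop: #{(manual_time * 1000).round(2)} ms"
puts "Array#sum:   #{(builtin_time * 1000).round(2)} ms"
puts "Totals agree: #{(manual_total - builtin_total).abs < 1e-6}"
```

If a measurement like this shows that a built-in method already covers the hot path, a native extension may not be needed at all.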
Less Adoption in Academia and Industry
Compared to Python and R, which are widely used in both industry and academia for data science and machine learning, Ruby has less adoption specifically in the data science world.
There are some notable companies using Ruby for data science, which we'll look at in the next section, but it's not one of the most common languages you'll see in data science job postings or academic research.
This means that, as a data scientist using Ruby, you may have to do more pioneering and advocacy to use Ruby in your projects, and you may have a harder time finding Ruby-specific data science resources and community support.
Ruby Data Science in Practice
Despite the limitations we've discussed, there are many companies and projects using Ruby for successful data science work. Here are a few notable examples:
Airbnb
Airbnb, the popular vacation rental platform, uses Ruby extensively for its data processing pipelines. The company has written about how it uses Ruby and Spark to process and analyze petabytes of data to power features like search rankings, pricing recommendations, and fraud detection.
Shopify
Shopify, an e-commerce platform, uses Ruby for a variety of data science and engineering tasks. The company has built a data science platform called Shard using Ruby and Spark to power things like A/B testing, machine learning model training, and ad-hoc data analysis.
The IRuby Notebook
Rubyists have created a kernel for the Jupyter notebook system (which grew out of Python's IPython notebook) called IRuby. It allows interactive, exploratory data analysis in Ruby through a web-based notebook interface.
Daru: Ruby Data Analysis Toolkit
Daru is a gem that provides data frame and vector structures for data analysis in Ruby, similar to Python's pandas library or R's data.frame. It integrates with many of the visualization, statistics, and machine learning gems in the SciRuby ecosystem.
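To make the data frame idea concrete without assuming any particular gem is installed, here is a plain-Ruby sketch of a group-and-aggregate step, the kind of operation a data frame library wraps in a single call. This is not Daru's API, and the sales records are hypothetical.

```ruby
# Hypothetical records; a data frame library would store these as columns.
sales = [
  { region: "north", amount: 120.0 },
  { region: "south", amount: 80.0 },
  { region: "north", amount: 60.0 },
  { region: "south", amount: 100.0 },
  { region: "west",  amount: 50.0 }
]

# Group rows by region, then reduce each group to its mean amount.
mean_by_region = sales
  .group_by { |row| row[:region] }
  .transform_values { |rows| rows.sum { |r| r[:amount] } / rows.size }

mean_by_region.each { |region, avg| puts "#{region}: #{avg}" }
```

An array of hashes works for small datasets; column-oriented structures like the ones Daru provides become more attractive as the data and the number of operations grow.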
These are just a few examples – there are many other companies and open source projects using Ruby for data science and machine learning applications.
Getting Started with Data Science in Ruby
If you're interested in trying out data science in Ruby, here are a few resources to get started:
- The SciRuby homepage has a great list of libraries and tools for scientific computing and data analysis in Ruby: https://sciruby.com/
- If you're coming from Python, this guide shows how to do many common data analysis tasks in Ruby using Daru and other libraries: https://github.com/SciRuby/sciruby-notebooks/blob/master/Data%20Analysis/Python_vs_Ruby_for_Data_Analysis.ipynb
- The Ruby Data blog has tutorials and examples of data science and machine learning in Ruby: https://rubydata.org/
- The Practical AI with Ruby book teaches machine learning in Ruby from scratch: https://www.practicalaiwithrby.com/
Conclusion
In this post, we've taken a detailed look at Ruby for data science.
We've seen that Ruby has significant strengths, like readability, productivity, and scripting/ETL capabilities, as well as weaknesses, like a smaller ecosystem and less adoption compared to Python and R.
Despite the limitations, there are many exciting projects and companies using Ruby for successful data science work, and a growing ecosystem of tools for data analysis, visualization, and machine learning.
Ruby may not be the obvious choice for data science, but for Rubyists interested in data, or polyglot data scientists looking to add another tool to their belt, Ruby is definitely worth considering.
At the end of the day, the "best" language for data science is the one that lets you be most productive, while still meeting the performance and capability needs of your projects and organization. For some, that may be Python or R; for others, it could be Ruby, Julia, or something else entirely.
The most important thing is to pick a language (or a few languages) and dive deep. Focus on learning the fundamentals of data science, and don't get too caught up in chasing the trendiest new tool.
And, if you're already a Rubyist, hopefully this post has shown you that it's completely viable to practice data science with the language you love. Happy data sciencing!