Parsing CSV Files in Ruby with SmarterCSV

Tilo Sloboda
Jul 8, 2024

Working with CSV (Comma Separated Values) files is a very common task, and SmarterCSV is the easiest and most powerful tool for the job. This article gives an overview of its capabilities, and contrasts it against Ruby’s “csv” library.

Introduction

CSV files are a common data exchange format. They are simple text files with rows of data, where each row has columns that are typically separated by commas. This article will show you how to use the SmarterCSV Ruby gem to easily read and process CSV files efficiently.

Intelligent Defaults

While Ruby’s “csv” library has a bare-bones feel to it, returning an array for each row, SmarterCSV uses intelligent default settings and standardized pre-processing of the input to massage the data into a more usable format, and it returns an easy-to-use Ruby hash for each row. This eliminates a lot of the manual post-processing that would otherwise be needed. These smart defaults help you get more consistent data quality when importing CSV data over which you typically have little or no quality control.

Here’s an example of a CSV file with a non-standard column separator, stray spaces, and emojis:

$ cat /tmp/my.csv
"First Name "|" Last Name"|Emoji | pets
José | Corazón| ❤️ | 2
Jürgen|Müller |😐 | 1
Michael |May| 😞 |0

SmarterCSV is like CSV processing on auto-pilot:

require 'smarter_csv'
filename = '/tmp/my.csv'

data = SmarterCSV.process(filename)

=> [
  {:first_name=>"José", :last_name=>"Corazón", :emoji=>"❤️", :pets=>2},
  {:first_name=>"Jürgen", :last_name=>"Müller", :emoji=>"😐", :pets=>1},
  {:first_name=>"Michael", :last_name=>"May", :emoji=>"😞", :pets=>0}
]

Using its default settings of col_sep: :auto and row_sep: :auto, SmarterCSV automatically figures out the line endings, and that the column separator is |. Emojis and UTF-8 characters are handled properly by default. It automatically trims the extra spaces from the headers and the data rows, and transforms the headers into Ruby symbols to be used as keys in the individual data hashes. Numerical values are converted automatically into the appropriate type. You can also add your own custom value converters.
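A custom value converter can be as small as a class with a convert method. The sketch below is only an illustration: the DateConverter class, the date format, the file name, and the signup_date column are all made up, and the commented-out call assumes the value_converters option as described in the SmarterCSV README.

```ruby
require 'date'

# Hypothetical converter: turns strings like "07/08/2024" into Date objects.
class DateConverter
  def self.convert(value)
    Date.strptime(value, '%m/%d/%Y')
  end
end

# Assumed usage (file name and column name are placeholders):
# data = SmarterCSV.process('/tmp/signups.csv',
#                           { value_converters: { signup_date: DateConverter } })

# The converter itself is plain Ruby and can be tried on its own:
signup_date = DateConverter.convert('07/08/2024')
signup_date.year  # => 2024
```

Keeping the converter as a plain class means you can unit-test it without touching any CSV file.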

This automatic detection of the most common CSV formats helps make your code more resilient against variations in the input data over which you have no control, e.g. when users upload CSV files to your service.
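For contrast, here is a rough sketch of what it takes to get a similar result from Ruby’s “csv” library for the same data. The separator has to be spelled out, and the two normalizing lambdas (normalize_header and normalize_value are names invented for this example) only cover the cases in this one file; they are not a reimplementation of SmarterCSV’s defaults.

```ruby
require 'csv'

raw = <<~DATA
  "First Name "|" Last Name"|Emoji | pets
  José | Corazón| ❤️ | 2
  Jürgen|Müller |😐 | 1
  Michael |May| 😞 |0
DATA

# Trim the header, snake_case it, and turn it into a symbol.
normalize_header = ->(header) { header.strip.downcase.gsub(/\s+/, '_').to_sym }

# Trim each value and convert plain integers; everything else stays a string.
normalize_value = lambda do |value|
  stripped = value.to_s.strip
  stripped.match?(/\A-?\d+\z/) ? stripped.to_i : stripped
end

rows = CSV.parse(raw,
                 col_sep: '|',            # must be known up front
                 headers: true,
                 header_converters: normalize_header,
                 converters: normalize_value).map(&:to_h)

rows.each { |row| puts "#{row[:first_name]} has #{row[:pets]} pet(s)" }
```

It works, but every one of these decisions is now yours to make and maintain for each new file format that shows up.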

And did you notice the output format?

Ruby Hashes for the Win!

Perhaps the most important design choice for SmarterCSV is to represent the rows of input data as simple Ruby hashes, using symbols as keys.

This allows very easy processing of the data, no matter if you need to pass it on to ActiveRecord, S3, or Sidekiq — SmarterCSV makes it a breeze to integrate with these use cases.

Memory Consumption

Reading large CSV files can consume a lot of memory if loaded entirely. SmarterCSV handles this efficiently by processing files line-by-line. For very large files and for feeding parallel processing, it can also process the input data in chunks. Here’s how to process very large files in chunks:

require 'smarter_csv'

SmarterCSV.process('your_file.csv', { chunk_size: 1000 }) do |array_of_hashes|
  # do bulk-inserts, -upserts, or pass chunks of data to a Sidekiq worker
  MyModel.insert_all(array_of_hashes)
end
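If the idea of chunking is new to you, this plain-Ruby sketch shows the shape of it: rows are read one line at a time and handed to the block in small arrays of hashes, with a smaller final chunk if the rows don’t divide evenly. It is a conceptual illustration only, not how SmarterCSV is implemented, and the comma splitting is deliberately naive (no quoting support).

```ruby
require 'stringio'

io = StringIO.new("name,age\nJohn,30\nJane,25\nAlice,41\n")

# Naive header and field splitting, for illustration only.
keys = io.readline.chomp.split(',').map(&:to_sym)

chunk_sizes = []
io.each_line.each_slice(2) do |lines|
  chunk = lines.map { |line| keys.zip(line.chomp.split(',')).to_h }
  # hand `chunk` to a bulk insert or a background job here, then let it go
  chunk_sizes << chunk.size
end

chunk_sizes # => [2, 1]
```

Because each chunk goes out of scope after the block runs, only one chunk’s worth of rows needs to be in memory at a time.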

Not Only File Input

SmarterCSV is not limited to processing input files, but can also process any input that responds to readline, e.g. StringIO, or an open file handle.

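require 'smarter_csv'
require 'stringio'

csv_string = "name,age,city\nJohn,30,New York\nJane,25,San Francisco"

# without chunk_size, the block receives an array holding a single row hash
SmarterCSV.process(StringIO.new(csv_string)) do |array|
  array.each do |row_hash|
    # Process each hash representing a row of data
  end
end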

Using StringIO, you can easily feed string data to SmarterCSV, but be careful to not read the complete contents into memory all at once. It is best to chunk your data, and release already processed chunks.

Importing CSV Data into Rails Models

Importing data from CSV into Rails models is straightforward with SmarterCSV. Here’s an example of how you can import company data, assuming that each row has a unique tax-id:

class Company < ApplicationRecord
end

file_path = 'path/to/companies.csv'

SmarterCSV.process(file_path, { chunk_size: 1000 }) do |chunk|
  Company.upsert_all(chunk, unique_by: :tax_id)
end

Importing CSV Data with Sidekiq

When passing large amounts of input data to Sidekiq, a good pattern is to use an enqueuer worker that accepts chunks of data and then kicks off an individual worker for each row. This way, you can enqueue and hand off the work quickly, and still track the progress of each worker, which represents one row of data.

file_path = 'path/to/upload.csv'

SmarterCSV.process(file_path, { chunk_size: 1000 }) do |array_of_hashes|
  MySidekiqEnqueuer.perform_async(array_of_hashes)
end

class MySidekiqEnqueuer
  include Sidekiq::Worker

  def perform(chunk)
    chunk.each do |record|
      # kick-off individual workers for each row
      MySidekiqWorker.perform_async(record)
    end
  end
end
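One detail to keep in mind with this pattern: Sidekiq stores job arguments as JSON, so the hashes a worker receives have string keys, not the symbol keys SmarterCSV produced. The sketch below mimics that with a plain JSON round trip (no Sidekiq involved); the sample data is made up.

```ruby
require 'json'

chunk = [
  { first_name: 'José', pets: 2 },
  { first_name: 'Jürgen', pets: 1 }
]

# Mimic what happens to job arguments on their way through the queue.
received = JSON.parse(JSON.generate(chunk))

received.first['pets'] # => 2
received.first[:pets]  # => nil, the symbol key is gone

# Inside the worker, convert back if you prefer symbol keys.
restored = received.map { |record| record.transform_keys(&:to_sym) }
restored == chunk # => true
```

Recent Sidekiq versions are also strict about job arguments being plain JSON types, so enqueueing string-keyed hashes is the safer route; SmarterCSV’s strings_as_keys option (see the README) is meant for that case.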

Example: Importing Course Data

When working with a Rails application, you might need to import data into your models. For instance, let’s consider a scenario where you have a list of courses in CSV format that you want to import into your Rails application as Course models.

First, ensure that you have a Course model in your Rails application. It might look something like this:

class Course < ApplicationRecord
  # your code
end

Normally you’d need to create a parser that reads the CSV file, normalizes the input data, and then creates Course records. You can do this in a Rake task, a service object, or directly in a controller action, depending on your needs and the application's overall architecture.

With SmarterCSV the rake task becomes trivial:

# Place in lib/tasks/
require "smarter_csv"

namespace :import do
  desc "Import courses from a CSV file"
  task courses: :environment do
    SmarterCSV.process("your_file.csv").each do |row_hash|
      # { name: "CS 101", description: "CS basics", instructor: "John Smith" }
      Course.create!(row_hash)
    end
    # this could be even faster using bulk-inserting (see chunked example)
  end
end

This rake task assumes the file you’d like to import is at your_file.csv, and that it contains Name, Description, and Instructor headers. It then creates and persists a Course record for each row, using the data in those columns.

In contrast to a solution using the csv gem, SmarterCSV already eliminates a lot of the possible data issues.

Example: Exporting Course Data to CSV

SmarterCSV can also be used to create CSV files, e.g. to provide export functionality that writes data from your application’s database into a CSV file.

For simplicity, let’s assume you’re exporting the data from a controller action.

Here’s a controller export action that creates a CSV file from the Course model, and only exports the given attributes:

require "smarter_csv"

class CoursesController < ApplicationController
  CSV_OPTIONS = {
    map_headers: {
      name: 'Course',
      description: 'Content Summary',
      instructor: 'Professor',
    }
  }.freeze

  def export
    file_path = "/tmp/courses-#{Date.today}.csv"

    # create the CSV temp file
    SmarterCSV.generate(file_path, CSV_OPTIONS) do |csv_writer|
      Course.find_in_batches(batch_size: 100) do |batch|
        # hand the writer plain hashes with only the attributes to export
        rows = batch.map do |course|
          { name: course.name, description: course.description, instructor: course.instructor }
        end
        csv_writer << rows
      end
    end

    # send it! send_data reads the file right away, so deleting it below is safe
    send_data File.read(file_path), filename: File.basename(file_path), type: 'text/csv'
  ensure
    # Ensure the temp file is deleted after sending
    File.delete(file_path) if File.exist?(file_path)
  end
end

SmarterCSV.generate is a convenience method around creating a SmarterCSV::Writer instance, which can take one or many calls of the << operator to append data to a file. Using map_headers is one way to define the CSV headers for each attribute. We get this output:

> cat /tmp/courses-2024-07-04.csv
Course,Content Summary,Professor
15-213,Introduction to Computer Systems,Brian Railing
15-852,Parallel and Concurrent Algorithms,Walter Tichy

Comparison with Ruby’s Standard CSV Library

While Ruby’s “csv” library provides basic CSV parsing capabilities, SmarterCSV offers several advantages.

SmarterCSV is optimized for performance. It can process files in chunks, which reduces memory consumption significantly and makes it easy to hand the chunks to parallel workers when processing large files.

Using intelligent defaults, SmarterCSV pre-processes and normalizes input data for you. Additionally, SmarterCSV supports custom data transformations and handles various file encodings effortlessly.

These features make SmarterCSV a superior choice for handling CSV data, especially when dealing with large datasets and complex processing requirements.

Conclusion

SmarterCSV is a powerful and efficient gem for handling CSV files in Ruby. With features like chunk processing, custom data transformations, intelligent default settings, and easy integration with Rails, it is an excellent choice for managing CSV data. Its ease-of-use, data-normalization, and flexibility make it superior to the Ruby “csv” library, especially when you have large datasets and want to do batch processing.

For more details and updates, check out the SmarterCSV GitHub repository.
