Rails: Use JSON to serialize model attributes

Scaling a Rails application’s response time is often done by moving parts of it to another programming language. As we may know, Twitter started out as a Rails application and ended up with a Scala (and later partly JavaScript) backend.

In my opinion, a first step towards this kind of migration is normalising all the serialised data you have in your database.

Personally, I use the ActiveRecord::Base.serialize method to handle most of the custom data produced by an STI model, or to store any dynamic extra data. A good example is when you need to keep some additional information about a user, like the company details when the user is a company.

Some Ruby on Rails projects already use the plain serialize method; examples include Ryan Bates’ PayPal Notifications tutorial, as well as the Diaspora and Spree codebases.

One easy trick that you can apply in any Rails 3.x or 4.x application is to declare your serialized attribute like this:

class User < ActiveRecord::Base 
  serialize :other_data, JSON
end

This way your application will use the JSON column coder which, in my opinion, is the better alternative and fixes some problems for you:

  • It allows you to use the same database backend for multiple applications, written in multiple languages (if that is the case)
  • It avoids the Syck vs Psych serialization problems (Psych is the default YAML engine starting with Ruby 1.9.3)
  • JSON is much faster than YAML (check here, here and here)

Some of the problems caused by YAML are described in Arne Brasseur’s post.
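
To make this concrete, here is a small sketch of how such a JSON-coded attribute behaves (the attribute comes from the example above, while the keys are hypothetical company details):

user = User.new
user.other_data = { "company_name" => "ACME SRL", "vat_number" => "RO123456" }
user.save!

# The column now holds a plain JSON string, readable from any language:
#   {"company_name":"ACME SRL","vat_number":"RO123456"}
user.reload.other_data["company_name"]  # => "ACME SRL"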

Updating an existing application to use JSON serialized fields

In order to make your existing application use JSON serialized fields, you need to make some changes to your models, mainly to convert:

class User < ActiveRecord::Base 
  serialize :other_data
end

To

class User < ActiveRecord::Base 
  serialize :other_data, JSON
end

Notice the JSON argument passed to the serialize method.

The other change you need to make is to convert your existing data from YAML serialized strings to JSON serialized strings. To do so, add a migration or a code snippet somewhere in your application that performs the conversion:

class ChangeSerializationOnUser < ActiveRecord::Migration
  class YamlUser < ActiveRecord::Base
    self.table_name = "users"
    serialize :other_data
  end

  class JsonUser < ActiveRecord::Base
    self.table_name = "users"
    serialize :other_data, JSON
  end

  def up
    # Blank out records that contain only an empty YAML document
    YamlUser.where(other_data: "--- \n").update_all(other_data: nil)

    YamlUser.find_each do |yaml_user|
      next unless yaml_user.other_data.present?
      next unless yaml_user.other_data.respond_to?(:to_hash)
      hash = yaml_user.other_data.to_hash

      JsonUser.where(id: yaml_user.id).update_all(other_data: nil)
      json_user = JsonUser.find(yaml_user.id)
      json_user.other_data = hash || {}
      json_user.save!
    end
  end

  def down
    raise ActiveRecord::IrreversibleMigration
  end
end

The migration above does the following:

  • It defines a YamlUser class that handles the YAML side of the migration. Assuming you have already added the JSON argument to your real model, YamlUser performs the simple task of deserializing the existing YAML strings back into whatever data you have stored.
  • It defines a JsonUser class that handles the JSON side of the migration. This class has a single purpose: to convert and save the serialized field, without validations and without your ActiveRecord callbacks.
  • It cleans up all the empty serialized values. Depending on your data, you may also need an update for empty arrays or hashes (see the snippet after this list).
  • Sometimes the stored value comes back as a HashWithIndifferentAccess, which for this operation needs a manual conversion; that is why I am calling .to_hash.
  • Before instantiating a JsonUser object, we blank the column with update_all in order to avoid errors caused by object hydration (the JSON coder choking on the old YAML string).
  • Of course, I consider this an ActiveRecord::IrreversibleMigration.
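
Depending on how the old records were written, the extra cleanup mentioned above could look roughly like this (the exact strings depend on which YAML engine wrote your data, so inspect a few rows first):

# Empty arrays / hashes serialized by Psych usually end up stored as these strings:
YamlUser.where(other_data: "--- []\n").update_all(other_data: nil)
YamlUser.where(other_data: "--- {}\n").update_all(other_data: nil)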

I consider this a first step towards migrating to multiple backend applications.

Exposing serialised fields – meta programming way

class SomeClass < ActiveRecord::Base
  belongs_to :user

  # Override serialize so that, besides the usual behaviour, it also defines
  # accessors for a whitelist of keys stored inside the serialized hash.
  def self.serialize(attr_name, class_name = Object, exposed_fields = [])
    super(attr_name, class_name)
    serialized_attr_accessor attr_name, exposed_fields
  end

  # Defines a reader and a writer for each exposed field, backed by the
  # serialized hash column, and whitelists it for mass assignment.
  def self.serialized_attr_accessor(attr_name, *args)
    args.first.each do |method_name|
      class_eval "
        def #{method_name}
          (self[:#{attr_name}] || {})[:#{method_name}]
        end
        def #{method_name}=(value)
          self[:#{attr_name}] ||= {}
          self[:#{attr_name}][:#{method_name}] = value
        end
        attr_accessible :#{method_name}
      "
    end
  end

  serialize :other_data,
    Hash,
    %w(some other values you want to store in serialized field)
end
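
To make the effect of the metaprogramming above concrete, here is a small usage sketch (the class, attribute and field names come from the example, and the behaviour assumes the Hash coder shown above):

record = SomeClass.new
record.some = "bar"   # writes into the serialized hash
record.some           # => "bar"
record.other_data     # => { :some => "bar" }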

Cleaning up a big MongoDB collection

Recently I came across a small problem I needed to fix: I had many records in a database that I did not need. I could not drop the entire collection, as some of the records had to be left alone. I came up with this script, which lets me delete records selectively.

// Keep documents that have a transaction; move everything else in the date
// range to a backup collection and remove it from the original one.
query = {created_at: {"$gte": new ISODate("2012-11-01T00:00:00Z"), "$lt": new ISODate("2012-12-01T00:00:00Z")}}

items = db.<COLLECTION>.find(query).count();
count = 0;                          // number of documents kept so far
batches = parseInt(items / 1000);   // process the collection in batches of 1000

for (var i = 0; i < batches; i++) {
  print("Remaining: " + parseInt(batches - i));
  // skip over the documents we decided to keep in previous batches
  db.<COLLECTION>.find(query).skip(count).limit(1000).forEach(function(p) {
    if (p.has_transaction && p.has_transaction == 1) {
      count++;
    } else {
      db.<MY BACKUP COLLECTION>.insert(p);
      db.<COLLECTION>.remove(p, 1);
    }
  });
}
print(db.<COLLECTION>.find(query).count());

Romanian Phone Number validator

Recently I had to implement a Romanian phone number validator, and I managed to implement it as a custom method for the jQuery Validation Plugin.

Here is the whole method:

$.validator.addMethod("phoneRO", function(phone_number, element) {
  phone_number = phone_number.replace(/\(|\)|\s+|-/g, "");
  return this.optional(element) || phone_number.length > 9 &&
phone_number.match(/^(?:(?:(?:00\s?|\+)40\s?|0)(?:7\d{2}\s?\d{3}\s?\d{3}|(21|31)\d{1}\s?\d{3}\s?\d{3}|((2|3)[3-7]\d{1})\s?\d{3}\s?\d{3}|(8|9)0\d{1}\s?\d{3}\s?\d{3}))$/);
}, "Please specify a valid romanian phone number");

The regex of interest is:

/^(?:(?:(?:00\s?|\+)40\s?|0)(?:7\d{2}\s?\d{3}\s?\d{3}|(21|31)\d{1}\s?\d{3}\s?\d{3}|((2|3)[3-7]\d{1})\s?\d{3}\s?\d{3}|(8|9)0\d{1}\s?\d{3}\s?\d{3}))$/

Some of the formats this regex is able to recognise are:

00 40 722 000 000
00 40 218 032 329
00 40 243 253 398
00 40 343 254 398
00 40 800 801 227
00 40 318 032 329
0722 000 000
0800 801 227
0800 801227
0318 032 329

Have a try: http://rubular.com/r/2ufyprKWGz

Sphinx MySQL command line

I was searching for a way to work with Sphinx from the console: I needed to access the interface to see what is indexed and how it is stored, and also to test my searches.

After some Google searches I ended up with this:

mysql -h 127.0.0.1 --prompt 'SphinxQL>' --port 9312

Horizontal scaling using Db Charmer

I was looking for a way to scale a Ruby on Rails application horizontally, and I tried several methods. One would be using a MySQL cluster, but that requires some serious database administrator skills, which unfortunately I don’t have.

Mainly, I have an application that is read intensive (80% reads vs 20% writes), so I considered a MySQL master-slave configuration. The problem is that there is nothing about this in the Rails documentation; however, after a short look on ruby-toolbox.com I discovered that I am not the only one who has run into it.

I tried Octopus as my first choice, but I soon discovered it was not a fit for my application. For some reason, not all my “read” queries were routed to my slave connection. I tried to find out why but, being pressed for time, I dismissed this gem, even though I love the simplicity of its models.

After dismissing Octopus, I tried the db_charmer gem, which is pretty actively maintained. This is yet another ActiveRecord sharding gem that offers the possibility to split database reads and writes.

The method I chose for my first try was to take the actions that were 100% reads and push them to a slave. That was pretty simple, using a before filter in my Rails controllers:

class ProfilesController < ApplicationController
  force_slave_reads :only => [ :show, :index ]
end

This change allowed me to scale the application while keeping the same number of servers, but the main effect was a drop in the application’s response time.

The second step was to get the heavy queries, such as counts, off the MySQL master server and move them to the slave.

class User < ActiveRecord::Base
  def self.some_heavy_query
    on_slave.joins(:profile, :messages).count(:group => ['messages.thread_id'])
  end
end
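
For on_slave (and force_slave_reads) to know which connection to hit, the models have to be wired to a slave. Based on db_charmer’s README, that wiring looks roughly like this (the connection name is hypothetical and must match an entry in your database.yml):

class User < ActiveRecord::Base
  db_magic :slave => :slave01
end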

In my enthusiasm about having a MySQL slave, I thought it would be nice to have three slave connections “ready” in my config. I later realised that this “optimisation” caused problems, because those 3 connections, multiplied by the maximum number of Apache child processes and again by the number of servers, exceeded max_connections on my MySQL slave server.
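
A quick back-of-the-envelope check makes the problem obvious (the numbers are made up for illustration):

slave_connections = 3    # "ready" slave connections per Apache child
apache_max_child  = 50   # maximum number of Apache child processes per server
app_servers       = 4

slave_connections * apache_max_child * app_servers
# => 600 connections, which has to stay below max_connections on the MySQL slave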

After a small fix in my database.yml files I was back online with a more performant application.

How to use aggressive file caching

Speed up your site

Recently I observed that one of my servers took a long time to respond to users. After an investigation I saw that I had a lot of TIME_WAIT connections, because each request needed to process some output. My application serves some user widgets that connect to a 3rd party server, which can add a lot of delay to my output. Given that the application did not serve secured content (it did not require the user to be signed in), I decided to use an aggressive file caching strategy: basically, I used PHP’s ob_start function and its callback to write the application’s response to disk.

I had a Yii Framework application, so I modified the index.php file to look like this:

<?php
// Output buffer callback: writes the generated response to disk so that
// NGINX can serve the cached file directly on subsequent requests.
function callback($buffer)
{
  if (empty($buffer)) {
    return $buffer;
  }
  try {
    $file_name = $_SERVER['REQUEST_URI'];
    // keep only the path, drop the query string
    if (preg_match("/\?/", $file_name)) {
      $file_name = substr($file_name, 0, strpos($file_name, '?'));
    }
    if (substr($file_name, -3, 3) == '.js') {
      file_put_contents(dirname(__FILE__) . $file_name, $buffer);
    } else if (substr($file_name, -9, 9) == 'some custom name') {
      // create the directory tree for the cached file, then write it
      mkdir(dirname(__FILE__) . substr($file_name, 0, -9), 0777, true);
      file_put_contents(dirname(__FILE__) . $file_name, $buffer);
    }
  } catch (Exception $e) { }
  return $buffer;
}

ob_start("callback");

// change the following paths if necessary
$yii=dirname(__FILE__).'/some/path/to/yii/framework/yii.php';
$config=dirname(__FILE__).'/protected/config/main.php';

// remove the following lines when in production mode
//defined('YII_DEBUG') or define('YII_DEBUG',true);
// specify how many levels of call stack should be shown in each log message
//defined('YII_TRACE_LEVEL') or define('YII_TRACE_LEVEL',3);

require_once($yii);

Yii::createWebApplication($config)->run();

ob_end_flush();

Given that my application needed to return JSON objects, I had to add the following lines to my NGINX configuration:

location ~ ^/js/.*\.js$ {
  #access_log  off;
  access_log    /var/log/nginx/hostname-access-log main;
  add_header Content-Type application/javascript;
  add_header Access-Control-Allow-Origin *;
  if (-f $request_filename) { break; }
  try_files $uri  @apachesite;
}

location ~ ^/js/.*/some custom name$ {
  #access_log off;
  access_log    /var/log/nginx/hostname-access-log main;
  add_header Content-Type application/json;
  add_header Access-Control-Allow-Origin *;
  if (-f $request_filename) { break; }
  try_files $uri  @apachesite;
}
location / {
  # some more config here 
}
location @apachesite {
  # some more config here 
}

The result was an immediate drop in TCP connections on that server, a decrease in CPU usage, and no difference in functionality; all I could see was a performance improvement. However, I now had two other issues: the size of the folder and cache expiration. Because I wrote all the files to a single folder on disk, the big number of files caused (again) a response time issue. Those two issues were easy to fix by adding a small script to my crontab:

#Added cronjob to delete old files
0 * * * * /some/path/for/cache/expire/script.sh

And the source of: /some/path/for/cache/expire/script.sh

#!/bin/bash

BASE='/just/another/htdocs/public/folder/matching/my/url'
#age in minutes
AGE=60

find $BASE/* -mmin +$AGE -exec rm -r {} \;

Warning! This aggressive file caching strategy can cause serious response time issues if the number of files grows too big (I let you decide what “big” means to you). The cron job above ensures cache expiration, and it also keeps the folder clean by deleting files that have not been modified in a while.

Database optimization techniques and how the storage engine works

Today I was discussing with a colleague of mine a MySQL database optimization technique and some of the things I have learned about how MySQL works on big datasets. This article is not about “MySQL” or “MongoDB”, but more about the principles behind them.

I will try to give you an idea about this using a book library example.

In any database engine you might have tables / collections with millions of rows, which need to be sorted, ordered and filtered using conditions, and every time it is a pain to find the right solution.

I have indexes, so this is not a problem for me.

Well, if you’re reading this, then you might be wrong. First of all, if it is a query (and by query I mean any SQL / NoSQL statement meant to retrieve information from the table) that is not run on a constant basis, then most probably your database engine does not have it cached. This means that every time you run that query, the database engine will actually go to the storage and do the filtering itself.

Yes, but I am running statements using indexes…

Well… imagine that you’re in a library and you ask the librarian for a certain book, knowing just the author. He will leave the office and go searching for your book, knowing just a fragment of its details. He will know that he needs to go to a certain row, but not which shelf or where the book sits on it, so he will lose some time reading all the authors on that shelf until he hands you the book.

This is what happens in any database engine when you ask for a “book” that is not read very often.

Now imagine that you’re asking for a book that was just returned by another reader. The librarian might not have put the book back on the shelf yet, so he can hand it to you faster, because it is just a few metres away.

Yeah… but the database engine uses “index tables”… there are pointers… etc.

Well… yes and no… imagine that you already know the row and the shelf the book is on: you will still lose time walking there and searching on that shelf. The story is pretty much the one above.

OK… I might get it… but how can I get a random record?

Well, in MySQL and other SQL dialects you might do something like this:

SELECT @v:=RAND() * (SELECT MAX(id) FROM your_table)

then:

SELECT * FROM your_table WHERE  id > @v AND  the_rest_of_your_condition = 'something';

In MySQL, the RAND() function always returns a real number between 0 and 1, which means any generated value will fall between 1 and your MAX(id).

This kind of hack is useful when you would otherwise perform a query like…

SELECT * FROM your_table WHERE your_condition = 'something' ORDER BY RAND();

At this very moment I don’t know how you could achieve that in a NoSQL engine.
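
If you are inside a Rails application, the same two-step trick can be sketched with ActiveRecord (the model name and condition are placeholders mirroring the SQL above):

# Pick a random starting id, then fetch the first matching row at or after it.
max_id    = YourModel.maximum(:id)
random_id = (rand * max_id).to_i

YourModel.where("id >= ? AND the_rest_of_your_condition = ?", random_id, 'something')
         .order(:id)
         .first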

Well, my problem is a pagination issue.

Good luck with that! Maybe you have something like:

SELECT COUNT(*) FROM my_table WHERE my_conditions = '1';

to find the number of rows matching your conditions… Well, that’s bad… a MySQL optimization technique here is the SQL_CALC_FOUND_ROWS modifier, which might solve your issue like this:

SELECT SQL_CALC_FOUND_ROWS * FROM my_table WHERE my_conditions = '1' LIMIT 0, 10;

Then you can run:

SELECT FOUND_ROWS();

However, your problem is still not entirely fixed… On big datasets you will still hit a MySQL issue when using OFFSET / LIMIT. A query like:

SELECT * FROM my_table LIMIT 10000, 10;

is equivalent to

SELECT * FROM my_table LIMIT 10 OFFSET 10000;

which, in MySQL, means that the storage engine will actually read (and load into memory) 10010 rows. As the offset gets higher and higher, so does the time you wait for the response.

Those are just two optimisations that you can perform, at least on MySQL, to get a faster response. There are more ways to optimize your MySQL queries but, because there is no “one size fits all” solution, I cannot speak about them, as I have not encountered them… YET.

How to reduce code duplication

If you use Ruby or Python you probably know that both are dynamically typed; as a result, a developer can choose from several techniques to remove duplication. Here are at least two methods that can be used to reduce code duplication.

Dynamic declaration:

class MyClass < ActiveRecord::Base
  STATUS_ARRAY = [ :pending, :closed, :rejected, :waiting_for_payment ]

  STATUS_ARRAY.each do |method|
    class_eval %{
      scope :#{method}, where('column_name = ?', '#{method}')
      def #{method}?
        self[:column_name] == '#{method}'
      end
      alias :is_#{method}? :#{method}?
    }
  end
end

In the example above we wrote about 8 lines that dynamically declare the equivalent of roughly 20 lines of hand-written code.
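
For one status, the block above generates the equivalent of:

scope :pending, where('column_name = ?', 'pending')

def pending?
  self[:column_name] == 'pending'
end
alias :is_pending? :pending?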

Inheritance

In Ruby you can easily use modules, which you can include into your classes, or you can use extend.

You will encounter the example below in every Rails application’s class definitions, but this method is somewhat limiting, as you cannot inherit from more than one class. In the example below, Foo inherits from Class1; better said, Class1 is an ancestor of Foo.

class Foo < Class1
end

However, you can get something close to multiple inheritance by mixing in several modules (extend accepts modules, not classes):

class Foo
  extend Module1
  extend Module2
end

The example above lets Foo pick up behaviour from several sources, with extend adding the modules’ methods at the class level. This might not be the best approach, because it can defy some OOP principles; more commonly, it is better to include modules, which adds instance-level methods.

Modules usage

class Foo < Class1
  include Module1 
  include Module2
end
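
For completeness, a module in this context is just a plain Ruby module whose instance methods become available on Foo once it is included (Module1 is a hypothetical name):

module Module1
  # Instance methods defined here become available on every class that includes the module.
  def shared_behaviour
    "defined once, reused everywhere"
  end
end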

I have shown here how to create a module.

MySQL field types and their charsets

When you have to administer a big database with tables containing millions of rows, a big issue emerges: optimizing the tables and the data they store.

The short version would tempt me to say: don’t use the utf8 encoding for the whole table, use it only for the damn fields where you really need it.

The longer version requires some calculations, which lead to a better understanding of what happens inside a MySQL server.

When VARCHAR is used, the storage engine allocates on disk exactly the number of bytes required to store the value, no matter the charset. However, when creating temporary tables and internal buffers (during joins, ordering, plain reads from storage and other operations), the maximum number of bytes for the column is allocated.

This means that if you have a VARCHAR(255) utf8 column and you write a single letter in it, the disk needs just 2 bytes to store it: 1 byte for its length and 1 byte for the value. The issue becomes trickier when you need to fetch that value: MySQL, not knowing in advance what is stored in that field, will allocate 768 bytes (you read that right, 768: 256 bytes for the maximum length of the field * 3 bytes per utf8 character).

Assume you have a table with 2 columns of type VARCHAR(255) utf8 and you run a query like this:

SELECT column1, column2 FROM table1 ORDER BY column1 DESC;

If you don’t have an index on column1 and the table has 1 million rows, MySQL needs to create a temporary table to do the sorting, which ends up at something like 1.43 GB. The calculation is simple:

768 (bytes) * 2 (columns) * 1,000,000 (rows) = 1,536,000,000 bytes, which gives roughly 1.43 gigabytes of temporary table on disk.

If your values really are that long, this makes sense; but if the maximum length is around 30-50 characters (think of a username, a password or even an email field) and you size the column accordingly, the temporary table would be roughly 290 megabytes:

150 (bytes) * 2 (columns) * 1,000,000 (rows) = 300,000,000 bytes, which gives roughly 290 megabytes.

It is obvious that 290 megabytes are faster to read than 1.43 gigabytes.

Of course, this can be pushed a little further: if you don’t need utf8 encoding you can switch to latin1, which brings the temporary table down to around 95 megabytes.

The above holds even if your table is only 20 megabytes on disk.

Lessons I am trying to convey (see the sketch after this list):

a) use only the charset you actually need!
b) choose the VARCHAR length responsibly; don’t just make it “as much as possible”
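
As a hedged illustration (table and column names are hypothetical, and you should double-check which charset actually fits your data), narrowing a single column in a Rails migration could look like this:

class TightenUsersEmailColumn < ActiveRecord::Migration
  def up
    # Shrink the column and store it as latin1 instead of utf8,
    # assuming the values are plain ASCII email addresses.
    execute "ALTER TABLE users MODIFY email VARCHAR(50) CHARACTER SET latin1"
  end

  def down
    execute "ALTER TABLE users MODIFY email VARCHAR(255) CHARACTER SET utf8"
  end
end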

Read more about: MySQL Storage Requirements

This post was written with help from my friend Rene Cannao.