Monday, June 14, 2010

When File Beats Database

It might seem odd to compare flat files to databases. You might think the comparison falls entirely in favor of databases, which offer a huge advantage in manipulating data in every way one could imagine. While this is true, there are cases where you discover that plain files are needed to assist the database, simply because they perform better.

In the financial field there are market data providers you can connect to in order to get information about financial instruments, their prices and values. The data arrive as a stream of ticks: each one carries a TickerID, a DateTime, and the values, which can be Ask, Bid, and Mean.
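Throughout this post a tick can be thought of as a small record like the following (a C# sketch of my own; real providers define their own types and field names):

// Illustrative shape of one incoming tick.
public class Tick {
    public string TickerId { get; set; } // instrument identifier
    public DateTime Time { get; set; }   // timestamp of the quote
    public double Ask { get; set; }      // ask price
    public double Bid { get; set; }      // bid price
    public double Mean { get; set; }     // typically (Ask + Bid) / 2
}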

You might receive thousands of ticks in a very short time. Certainly you will have to store them in a database in order to use them later in your computations. However, with a standard database setup you will experience very poor performance if you try to insert the data as they arrive.

Consider this MySQL example with a Tickers table like the following:


CREATE TABLE `tickers` (
`TIK_ID` varchar(50) NOT NULL,
`TIK_DATE` datetime NOT NULL,
`TIK_BID` double NOT NULL,
`TIK_ASK` double NOT NULL,
`TIK_MEAN` double NOT NULL,
PRIMARY KEY (`TIK_ID`,`TIK_DATE`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1;

Inserting market data into this table means executing an SQL statement like "INSERT INTO tickers (TIK_ID, TIK_DATE, TIK_ASK, TIK_BID, TIK_MEAN) VALUES (@id, @time, @ask, @bid, @mean);" for every tick.

Suppose you have 1000 ticks to store in the table. You have to execute the previous SQL statement 1000 times. Whether you wrap it in a stored procedure or not, the execution time varies between 30 and 40 seconds.
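For reference, the slow path looks roughly like this (a minimal sketch assuming the MySql.Data connector and the same "dbvsfileConnectionString" setting used in the snippet further down):

// Naive row-by-row insert: one network round trip to the server per tick.
private void SaveRowByRow() {
    Random rand = new Random();
    using (MySqlConnection con = new MySqlConnection(
        Properties.Settings.Default["dbvsfileConnectionString"] as string)) {
        con.Open();
        MySqlCommand cmd = con.CreateCommand();
        cmd.CommandText = "INSERT INTO tickers (TIK_ID, TIK_DATE, TIK_ASK, TIK_BID, TIK_MEAN) " +
                          "VALUES (@id, @time, @ask, @bid, @mean);";
        cmd.Parameters.Add("@id", MySqlDbType.VarChar);
        cmd.Parameters.Add("@time", MySqlDbType.DateTime);
        cmd.Parameters.Add("@ask", MySqlDbType.Double);
        cmd.Parameters.Add("@bid", MySqlDbType.Double);
        cmd.Parameters.Add("@mean", MySqlDbType.Double);
        for (int i = 0; i < 1000; i++) {
            double ask = rand.NextDouble() * 100;
            double bid = rand.NextDouble() * 100;
            cmd.Parameters["@id"].Value = "ticker" + i;
            cmd.Parameters["@time"].Value = DateTime.Now;
            cmd.Parameters["@ask"].Value = ask;
            cmd.Parameters["@bid"].Value = bid;
            cmd.Parameters["@mean"].Value = (ask + bid) / 2;
            cmd.ExecuteNonQuery(); // this call is what adds up to 30-40 seconds
        }
    }
}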

Certainly you don’t want your database to be locked down by a continuous, resource-consuming process like that.

Luckily there is a workaround for this problem. The solution consists of storing the stream in a text file and periodically importing it into the database. When ticks come in, instead of executing an SQL statement for each one, you append the data to the end of a text file. When the file grows big enough, you import it into the table using the bulk import facility that ships with most databases.
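In MySQL the import can be done with a LOAD DATA INFILE statement (the mysqlimport command-line tool is a thin wrapper around it). Here is a minimal sketch, assuming the file fields are written in the order id, date, ask, bid, mean and separated by semicolons:

LOAD DATA INFILE 'dbvsfile.txt'
INTO TABLE tickers
FIELDS TERMINATED BY ';'
LINES TERMINATED BY '\r\n'
(TIK_ID, TIK_DATE, TIK_ASK, TIK_BID, TIK_MEAN);

Listing the columns explicitly matters because the file stores Ask before Bid, while the table declares TIK_BID before TIK_ASK. Note also that without the LOCAL keyword the path is resolved on the server side, so the file must be readable by the MySQL server process.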

I have rewritten the same example, this time using a text file. The performance was clearly far better: the time needed to write 1000 ticks into the text file and import them into the database was only 500ms, roughly 60 times faster.

Here is the code snippet:

// Requires: using System; using System.Globalization; using System.IO;
// using System.Windows.Forms; using MySql.Data.MySqlClient;
private void SaveToFile() {
    DateTime t0 = DateTime.Now;
    DateTime t1 = DateTime.Now;
    StreamWriter wrt = null;
    MySqlConnection con = null;
    try {
        Random rand = new Random();

        // Step 1: append the incoming ticks to a plain text file.
        wrt = File.AppendText("dbvsfile.txt");
        for (int i = 0; i < 1000; i++) {
            String id = "ticker" + i;
            DateTime dt = DateTime.Now;
            double ask = rand.NextDouble() * 100;
            double bid = rand.NextDouble() * 100;
            double avg = (ask + bid) / 2;
            // The invariant culture and the yyyy-MM-dd HH:mm:ss date format keep
            // the fields in a shape that MySQL can parse during the import.
            wrt.WriteLine(String.Format(CultureInfo.InvariantCulture,
                "{0};{1:yyyy-MM-dd HH:mm:ss};{2};{3};{4}", id, dt, ask, bid, avg));
        }
        wrt.Close(); // flush the file before asking MySQL to read it

        // Step 2: bulk-import the whole file in a single statement.
        // Without LOCAL, the path is resolved by the MySQL server, so the file
        // must be readable by the server process (or use LOAD DATA LOCAL INFILE).
        con = new MySqlConnection(Properties.Settings.Default["dbvsfileConnectionString"] as string);
        con.Open();
        MySqlCommand cmd = con.CreateCommand();
        cmd.CommandText = @"LOAD DATA INFILE 'dbvsfile.txt' INTO TABLE tickers
                            FIELDS TERMINATED BY ';' LINES TERMINATED BY '\r\n'
                            (TIK_ID, TIK_DATE, TIK_ASK, TIK_BID, TIK_MEAN)";
        cmd.ExecuteNonQuery();
    } catch (Exception x) {
        MessageBox.Show(x.Message);
    } finally {
        if (wrt != null) {
            wrt.Close(); // harmless if already closed; covers the error path
        }
        if (con != null) {
            con.Close();
        }
        t1 = DateTime.Now;
        TimeSpan ts = t1 - t0; // total elapsed time: roughly 500ms for 1000 ticks
    }
}


There is no doubt that the second method is by far more advantageous than the first; however, everything comes at a price. The file solution has the following issues:

  • Data are not stored in the database in real time: there is a lapse of time (typically minutes) during which the data sit in the text file before being imported into the table. During that time the data cannot be queried.
  • Management overhead: extra code must be written to verify that the import succeeded and to handle any errors that occur during the process (see the sketch after this list).
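
A minimal sketch of that verification step, assuming the same table and settings as above (the rename-on-success convention is my own, not a MySQL feature):

// Import one file, check the reported row count against what we wrote,
// and archive the file so it cannot be imported twice.
private void ImportAndVerify(string file) {
    int linesWritten = File.ReadAllLines(file).Length;
    using (MySqlConnection con = new MySqlConnection(
        Properties.Settings.Default["dbvsfileConnectionString"] as string)) {
        con.Open();
        MySqlCommand cmd = con.CreateCommand();
        cmd.CommandText = @"LOAD DATA INFILE '" + file + @"' INTO TABLE tickers
                            FIELDS TERMINATED BY ';' LINES TERMINATED BY '\r\n'
                            (TIK_ID, TIK_DATE, TIK_ASK, TIK_BID, TIK_MEAN)";
        int rowsImported = cmd.ExecuteNonQuery(); // LOAD DATA reports the inserted row count
        if (rowsImported != linesWritten) {
            // Keep the file around for inspection and manual replay.
            throw new ApplicationException(String.Format(
                "Import mismatch: {0} lines written, {1} rows imported.",
                linesWritten, rowsImported));
        }
    }
    // Rename instead of deleting so there is a trace of what was imported.
    File.Move(file, file + "." + DateTime.Now.ToString("yyyyMMddHHmmss") + ".imported");
}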



In case you are asking yourself why this is so, the reason is simple, and it is similar to the one in the previous post "Why Excel COM API is a Bad Choice". Every time you execute an SQL command, it travels through layers of software and across the network to reach the database server, where it is finally executed. The bulk import makes that trip only once for the whole batch and lets the server parse and write the rows itself.

Once again, it is better to avoid (when possible) chatty, call-per-item API usage when a cleaner solution exists.
