For the first time, I was let down by my esteemed GSP (Google Search Professional) credentials when I couldn't find a single white paper or article about data segmentation techniques for storing a large number of files on a disk. So I had to spend a couple of hours building a test harness and gathering results to figure out the optimal number of files in a folder.
The Context
We have tens of millions of files to be served from our media server, and we were looking for a way to arrange them in a folder structure that doesn't hurt performance. The options were either stuffing all the files into a single folder or creating a hierarchical structure of folders to store them. If we opted for the latter, we had another problem to solve: what is the optimal number of folders in a given folder? (It turns out this really isn't a separate problem, as the performance characteristics of files and directories are very comparable.)
The Approach
I decided to write a simple program that would create N files in a folder and then try to locate a particular file by its name. It turned out to be a very simple program to write, but I had a lot of "doh" moments while running it. For creating the files, I wrote a simple method that copies an image file and saves the copy under a unique name --
private static void CreateFiles(int numberOfFiles)
{
    Stopwatch stopWatch = Stopwatch.StartNew();

    // Copy the same source image under a unique name, numberOfFiles times.
    for (int i = 0; i < numberOfFiles; i++)
        File.Copy(sourceFile, path + "thermometer" + i + ".jpg");

    stopWatch.Stop();
    Console.WriteLine("It took {0} seconds to create {1} files",
        stopWatch.Elapsed.TotalSeconds, numberOfFiles);
    Console.Read(); // pause so the output stays on screen
}
And, to locate a file --
private static void LocateFile(string filePath)
{
    Stopwatch stopper = Stopwatch.StartNew();
    bool isFound = File.Exists(filePath);
    stopper.Stop(); // stop before any console output so WriteLine isn't included in the timing

    Console.WriteLine(isFound);
    Console.WriteLine("It took {0} milliseconds to find the file", stopper.ElapsedMilliseconds);
    Console.Read(); // pause so the output stays on screen
}
I ran the program for the following configurations (a sketch of a driver that ties the two methods together appears after this list) --
- 1,000 files in a folder
- 10,000 files in a folder
- 15,000 files in a folder
- 18,000 files in a folder
- 1,000 files, each residing in its own unique folder; thus, 1,000 folders in total.
- 10,000 files, each residing in its own unique folder; thus, 10,000 folders in total.
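The original driver isn't shown here, but a minimal sketch of how the two methods could be wired together for the single-folder configurations might look like this. It assumes path and sourceFile are the same static fields used by CreateFiles, and that path can be repointed for each run:

// Hypothetical driver: exercise the harness for each single-folder configuration.
// CreateFiles and LocateFile are the methods shown above; each pauses on
// Console.Read(), so press Enter to move on to the next step.
private static void Main()
{
    int[] fileCounts = { 1000, 10000, 15000, 18000 };
    foreach (int count in fileCounts)
    {
        // Give every configuration its own fresh folder so the runs don't overlap.
        path = @"C:\filetest\" + count + @"\";
        Directory.CreateDirectory(path);

        CreateFiles(count);

        // Probe a file roughly in the middle of the freshly created set.
        LocateFile(path + "thermometer" + (count / 2) + ".jpg");
    }
}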
The Results
I ran the tests in single-user mode on a Windows XP laptop with a 1.6 GHz single-core CPU, 1 GB of RAM and a 5400 RPM disk. I've left out the results for configurations #5 and #6 as they were identical to those of #1 and #2 respectively.
Our overall SLA to serve an image is 500 ms, so we want the image retrieval cost to stay within 40-50 ms, since the application has to fire business rules before serving an image. Based on the test results, I've concluded that the ideal size for our case is really 10K files per folder, as the OS was able to serve an image in 37 ms on the first invocation and in 6 ms on subsequent invocations. The overall strategy we've devised is --
- Create a hash function that takes the name of the image file and outputs the folder name. The folder name will be no more than 4 digits, so it can support a maximum of 10K entries (see the sketch after this list).
- Create folders underneath the hashed folder, named by the unique ids (in our case, there would never be more than 10K of them at this level).
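A minimal sketch of that hashing scheme is shown below. The method names and the simple deterministic string hash are just placeholders; any stable hash that maps a file name to one of 10,000 buckets would do:

// Hypothetical sketch of the bucketing scheme: hash the image file name into
// a folder name of at most 4 digits (0000-9999), then nest the unique id below it.
private const int BucketCount = 10000; // at most 4 digits => at most 10K bucket folders

private static string GetBucketFolder(string fileName)
{
    // A simple, stable string hash. String.GetHashCode() is deliberately avoided
    // because its value isn't guaranteed to stay the same across runtimes.
    unchecked
    {
        int hash = 23;
        foreach (char c in fileName)
            hash = hash * 31 + c;
        return ((hash & 0x7FFFFFFF) % BucketCount).ToString("D4"); // e.g. "0042"
    }
}

private static string GetImagePath(string rootPath, string uniqueId, string fileName)
{
    // <root>\<hashed bucket>\<unique id>\<file name>
    return Path.Combine(rootPath, GetBucketFolder(fileName), uniqueId, fileName);
}

Because the bucket is derived only from the file name, the same image always resolves to the same folder, so the path can be reconstructed at serve time without a lookup table.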
Next Steps
- Create an ASP.NET site that serves the images.
- Load-test the ASP.NET site with a tool like JMeter to check the results in multi-user mode.
Please feel free to comment on the above approach and share your experiences if you have designed for such a scenario.