If you have millions of URLs to save for later searching, you will probably end up fighting database performance unless you tune things properly.

The reason is that URLs are usually stored in a text or long char column in a relational database, which makes them hard to index.
Here is a way to save, load and search them quickly.

I save each URL's md5 checksum as the index key, and later use a URL's md5 checksum to find its row in the database.

It is true that different URLs could end up with the same md5 checksum, but the chance is quite small and it is not hard to deal with: search by the md5sum value to get a short list of URLs, then search that list for the URL you want to find.
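The lookup described above can be sketched roughly as follows. This is a minimal illustration in Python, not the Perl used by idof.pl. It assumes an in-memory dictionary in place of a real database, and the names `md5_key` and `UrlIndex` are hypothetical. Each md5 key maps to a short list of (url, value) pairs, and the exact URL is matched inside that list.

```python
import hashlib
from collections import defaultdict


def md5_key(url):
    """Hex md5 of the URL, used as the index key (hypothetical helper)."""
    return hashlib.md5(url.encode("utf-8")).hexdigest()


class UrlIndex:
    """Toy index: md5 key -> short list of (url, value) pairs."""

    def __init__(self, key_func=md5_key):
        self._key_func = key_func
        self._buckets = defaultdict(list)

    def save(self, url, value):
        bucket = self._buckets[self._key_func(url)]
        for i, (stored_url, _) in enumerate(bucket):
            if stored_url == url:
                bucket[i] = (url, value)  # same URL: overwrite its value
                return
        bucket.append((url, value))  # new URL (or a colliding one): append

    def find(self, url):
        # Step 1: search by md5 key to get the short list.
        # Step 2: search that list for the exact URL.
        for stored_url, value in self._buckets.get(self._key_func(url), []):
            if stored_url == url:
                return value
        return None


index = UrlIndex()
index.save("http://fibrevillage.com/", "home")
print(index.find("http://fibrevillage.com/"))
```

Because the full URL is stored next to the key, a collision only makes the short list one entry longer; it never returns the wrong row.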

Indeed, the chance of getting the same md5sum value for two URLs is very low if a proper md5 module is used. I prefer to use md5->b64digest.

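The md5 checksum string returned by md5->b64digest is 22 characters long and uses characters from this set: 'A'..'Z', 'a'..'z', '0'..'9', '+' and '/'. That alphabet has 64 symbols, so a 22-character string could take 64^22 forms; md5 itself is 128 bits, so there are 2^128 (about 3.4 x 10^38) possible checksums, which is a huge number.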
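To make the key format concrete, here is a minimal sketch in Python (not the Perl used by idof.pl) that mimics the shape of a b64digest result: the 16-byte md5 digest encoded in base64 with the trailing '==' padding dropped. The function name `url_key` is hypothetical.

```python
import base64
import hashlib


def url_key(url):
    """Return the 22-character unpadded base64 md5 of a URL (hypothetical helper)."""
    digest = hashlib.md5(url.encode("utf-8")).digest()  # 16 raw bytes = 128 bits
    # Standard base64 of 16 bytes is 24 characters ending in '=='; strip the padding.
    return base64.b64encode(digest).decode("ascii").rstrip("=")


key = url_key("http://fibrevillage.com/scripting/108-how-to-calculate-md5-of-a-file-string-in-perl")
print(len(key))  # 22
```

A fixed 22-character key is short enough to index cheaply, no matter how long the original URL is.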

See more detail in my other article, how to calculate md5 of a file/string in perl.

Below is just an example I did using BerkeleyDB (version 1). Obviously, if you want to use another database for more complicated things, just switch to DBI for that database and use the same method described here; it won't be difficult. Let me know if you have trouble.
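For readers going the relational route, the same method might look like the sketch below. It uses Python's built-in sqlite3 module rather than Perl DBI, only because SQLite needs no server; the table name `url_map`, its columns, and the helper names are all assumptions for illustration. The short md5 column is indexed, and the full URL is compared only within the few rows that share the key.

```python
import hashlib
import sqlite3


def md5_hex(url):
    """Hex md5 of the URL, used as the indexed key (hypothetical helper)."""
    return hashlib.md5(url.encode("utf-8")).hexdigest()


# In-memory database for the sketch; a real setup would use a file or a server.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE url_map (md5 TEXT NOT NULL, url TEXT NOT NULL, id TEXT)")
# Index the short fixed-length md5 column instead of the long url column.
conn.execute("CREATE INDEX url_map_md5 ON url_map (md5)")


def save(url, id_):
    conn.execute(
        "INSERT INTO url_map (md5, url, id) VALUES (?, ?, ?)",
        (md5_hex(url), url, id_),
    )


def find(url):
    # The md5 condition narrows the search via the index;
    # the url condition picks the exact row if two URLs ever share a key.
    row = conn.execute(
        "SELECT id FROM url_map WHERE md5 = ? AND url = ?",
        (md5_hex(url), url),
    ).fetchone()
    return row[0] if row else None


save("http://fibrevillage.com/", "42")
print(find("http://fibrevillage.com/"))
```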

The example idof.pl is the tool I use to save filepath-to-id mappings; it works the same way for URLs. On my desktop, I loaded 9M file mappings into the db in less than 300 secs.

 $./idof.pl -r yes -if /home/idofpath/idspath
reading input file at 1378507569 ...
time elapsed 98 secs for md5 compute 9264153 pnfs mapping
time elapsed 164 secs for loading 9264153 pnfs mapping

As you can see, the text map file is 2GB, but after loading it into the db, the db file is only 680MB. So it saves you a lot of disk space, not just time.

$ls -l /home/idofpath/idspath
-rw-r--r--. 1 cindy cindy 2108873748 Sep  7  2013 /home/idofpath/idspath
$ls -l /home/iddb/idof.db
-rw-rw-r-- 1 cindy cindy 687144960 Sep  7 13:52 /home/iddb/idof.db

In the idspath file, each line holds a url and an id separated by '|'. The id has a special meaning to me, but in your case you can put anything you want there.
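As a rough illustration of loading such a file, here is a sketch in Python (idof.pl itself is Perl and may do this differently). It assumes the id is the last '|'-separated field, so a '|' inside the URL is left alone. A plain dict stands in for the on-disk database, the function names are hypothetical, and the sample lines are invented, not taken from the real idspath file.

```python
import hashlib


def parse_line(line):
    """Split 'url|id' on the last '|' (assumption: the id is the final field)."""
    url, sep, id_ = line.rstrip("\n").rpartition("|")
    if not sep or not url:
        raise ValueError("not a 'url|id' line: %r" % line)
    return url, id_


def load_map(lines, db):
    """Store every mapping under the md5 of its URL; return how many were loaded."""
    count = 0
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        url, id_ = parse_line(line)
        key = hashlib.md5(url.encode("utf-8")).hexdigest()
        # Keep the URL beside the id so a key collision can be resolved on lookup.
        db.setdefault(key, []).append((url, id_))
        count += 1
    return count


# Invented sample lines, for illustration only.
sample = ["http://example.com/a|1001\n", "http://example.com/b|1002\n"]
db = {}
print(load_map(sample, db))  # 2
```

With a real file, `lines` can simply be the open file handle, so the map is streamed rather than read into memory at once.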


To retrieve a mapping:

./idof.pl -durl http://fibrevillage.com/scripting/108-how-to-calculate-md5-of-a-file-string-in-perl

The whole perl script is attached at the bottom; give it a try.


