Since the IP geographic data operates on ranges of IP addresses, we cannot simply join the log data to the IP data (using RDBMS or MapReduce), and we cannot use a simple key/value cache to look-up the IP data. Currently the entire IP2Location DB-5 data set consumes 1+ GB of memory when loaded into a Java object array. While we can currently fit this in memory with the current infrastructure we will have issues if this data grows significantly beyond it’s current size. Therefore we have to look at some alternatives for looking up the data without loading it all in memory. We cannot use a middleware or shared database solution since these will quickly become overwhelmed by requests from our cluster. Likewise, we cannot afford to take a network hit for every look-up and we currently cannot batch look-up requests to reduce network hits. We need a shared-nothing architecture, and therefore copying a Java embedded database locally to pull our small data close to our big data seems the best approach. This page evaluates several Java embedded databases as a fit for the use case.
This test is Hadoop/Cascading specific in that it uses the standard set of jar files included in the class path of Apache Hadoop 20.2 and Cascading 1.2. For example, we used the hsqldb-1.8.0.10.jar bundled with Hadoop. Leveraging an embedded DB which is not only thread safe but also concurrent will allow re-use of in-process memory cache across multiple threads using the Hadoop Task JVM Reuse feature.
Databases Evaluated
The following databases were evaluated:
- Baseline load into in-memory array.
- Apache Derby 10.7.1.1
- HSQLDB 1.8.0.10
- H2 1.2.147
Each of the databases were set up, configured, and queried using identical SQL scripts and JDBC code, except where specifically noted in the test results section. Also, we require the exclusive use of a JDBC interface to the database so we can easily swap DBs based on performance and stability.
Test Results

Test results for the in-memory baseline and each of the embedded DBs follows:
| DB |
Load Time |
Load Rows |
Load Rate |
IP Lookup Rate |
| Derby |
1 min, 51 sec |
3,237,642 |
29168 / sec |
82486 / sec **** |
| H2 |
1 min, 10 sec |
3,237,642 |
46252 / sec *** |
61527 / sec |
| in-memory array |
32 sec |
3,237,642 |
101176 / sec |
133 / sec |
| HSQLDB |
3 min, 28 sec |
3,237,642 |
15565 / sec * |
1 / 22 sec ** |
* = -Xmx3G JVM setting was required to load the data
** = -Xmx1G JVM setting was required to query the data
*** = -Xmx1G JVM setting was required to load the data
**** = Only Derby supported concurrent thread access, and 4 threads were used in the IP Lookup test
Test Cases
Create Table
The table was created with the following columns as:
CREATE TABLE IP_INFO (
IP_FROM BIGINT,
IP_TO BIGINT,
COUNTRY_CODE VARCHAR(2),
COUNTRY_NAME VARCHAR(64),
REGION VARCHAR(64),
CITY VARCHAR(64),
LATITUDE DOUBLE,
LONGITUDE DOUBLE
)
Load Data
The following load scripts where used to load up to the maximum 3,237,642 rows. Data set based on IP2Location DB-5. Some of the Java database tested could not adequately handle the data size so for those databases a smaller number of rows were loaded, as noted in the test results section:
INSERT INTO IP_INFO (
IP_FROM,
IP_TO,
COUNTRY_CODE,
COUNTRY_NAME,
REGION,
CITY,
LATITUDE,
LONGITUDE
) VALUES (
?,
?,
?,
?,
?,
?,
?,
?
)
In addition, the following JDBC code to load the data utilizes prepared statements and JDBC batch processing:
Connection conn = DriverManager.getConnection(<db url>);
PreparedStatement ps = conn.prepareStatement(insertString);
// read data file and get fields for each line ...
// for each field, bind values to the PreparedStatement
ps.setLong(1, g.getIpFrom());
ps.addBatch();
// add batch for every line, and execute batch every 100 lines
if(lineNumber % 100 == 0) {
ps.executeBatch();
}
// at the end, execute final batch and commit ...
ps.executeBatch();
conn.commit();
Create Index
A b-tree index on IP_TO is required to select the matching IP range:
CREATE INDEX IP_INFO_I01 ON IP_INFO ( IP_TO )
Select IP Number in Range
The first row matching the following SELECT query matches the geographic information for an IP address range. Any additional rows are not read and are ignored:
SELECT *
FROM IP_INFO
WHERE IP_TO >= ?
ORDER BY IP_TO
The following JDBC code utilizes prepared statements and bind variables to select the first matching row:
PreparedStatement ps = conn.prepareStatement(<sql>);
ps.setFetchSize(1);
ps.setMaxRows(1);
// the following code iterates for each look up test execution
long random = (long)(Math.random() * Integer.MAX_VALUE);
ps.setLong(1, random);
ResultSet rs = ps.executeQuery();
if(rs.next()) {
// test getting fields, and verify result
}
Test System Configuration
Test was run on the following platform configuration:
Hardware Overview:
Model Name: MacBook Pro
Model Identifier: MacBookPro6,2
Processor Name: Intel Core i7
Processor Speed: 2.66 GHz
Number Of Processors: 1
Total Number Of Cores: 2
L2 Cache (per core): 256 KB
L3 Cache: 4 MB
Memory: 4 GB
Processor Interconnect Speed: 4.8 GT/s
Boot ROM Version: MBP61.0057.B0C
Serial-ATA:
Intel 5 Series Chipset:
Vendor: Intel
Product: 5 Series Chipset
Link Speed: 3 Gigabit
Negotiated Link Speed: 1.5 Gigabit
Description: AHCI Version 1.30 Supported
Disk:
Hitachi HTS545050B9SA02:
Capacity: 500.11 GB (500,107,862,016 bytes)
Model: Hitachi HTS545050B9SA02
Revision: PB4AC60W
System Software Overview:
System Version: Mac OS X 10.6.5 (10H574)
Kernel Version: Darwin 10.5.0
JVM Overview:
java version "1.6.0_22"
Java(TM) SE Runtime Environment (build 1.6.0_22-b04-307-10M3261)
Java HotSpot(TM) 64-Bit Server VM (build 17.1-b03-307, mixed mode)
Modifications to Tests
Some tests had to be modified, either due to time constraints (test could not run in a reasonable time) or because the test crashed or hung.
In Memory Array
We had to run the test with JVM parameter -Xmx2G to ensure all the data could be loaded into memory. Since we are not using a DB, the query logic is simply to iterate through the array to find the first row with IP_TO >= target IP.
Please note: we cannot use a binary search algorithm since we do not know the specific value we are searching for a priori. We considered writing a binary search variant to find the first element >= the target IP, however we could not find a library which does this and we don’t have the time to write and test one ourselves.
Derby
We had to set the block cache size above the default to improve cache hits for multi-threaded access:
System.getProperties().setProperty("derby.storage.pageCacheSize", "524288");
We set the JVM -xmx1G for the IP Lookups and used the JVM default 128M for the data load.
HSQLDB
We had to make many modifications to the code, the DML, and SQL as well as the JVM memory settings to make the data load work. We could not find a way to force HSql to use the b-tree index so the query performance represents the default plan selected by HSQLDB.
Here is a description of the changes and why they were made:
JVM Memory Settings
We started out with the 128M default JVM heap and received an out of memory error before we completely loaded 200,000 rows. We had to conclude that HSql only saves to disk after a commit or checkpoint, and keeps the entire set of data buffered in memory. This was confirmed when we added a CHECKPOINT after each 100,000 rows and still got the same “java.lang.OutOfMemoryError: Java heap space”, but we did see data on disk after each checkpoint.
So we had to set memory to -Xmx3G to load the data set, and we had to take the CHECKPOINT out because it looks like HSql writes all the data to one big file. So every time you do a CHECKPOINT the overhead becomes larger and larger to write the data. It is also possible it’s writing the entire file with every checkpoint as well. Definitely never use CHECKPOINT or incremental commits when writing data set on GB+ sizes, and make sure your available memory is 3 times the size of the data on disk after the load.
When querying the data, again we have the problem of HSql loading the entire data set into memory. I.e. we could not query the data at all with the default 128MB of JVM memory allocated. Setting -Xmx1G works around this problem.
Code and SQL Changes
We could not create the index due to this error:
java.sql.SQLException: java.io.IOException: S1000 Data file size limit is reached in statement [CREATE INDEX IP_INFO_I01 ON IP_INFO(IP_TO)]
We tried to load only 100,000 rows to see if the index is actually used but the best query rate we could get was 17 / sec so if the index was used the performance definitely would not scale to 320 times that data size.
We had to specify CREATE CACHED TABLE in the table DDL to get the data load to work.
H2
H2 also required changes to the JVM settings to work properly.
JVM Memory Settings
With the default 128M JVM settings we were able to reach just over 400,000 rows loaded before running out of heap space. Setting the heap to -Xmx1G fixed this issue. Similarly to HSQLDB it looks like H2 loads the entire data set into memory for the transaction.
We had to specify CREATE CACHED TABLE in the table DDL to get the data load to work.
Conclusion
Derby provides the best fit for this use case due to:
- Concurrency: multiple threads can query embedded Derby within the same JVM process.
- The performance of Derby in multi-threaded mode outstrips H2′s single threaded performance on a 2 core machine. We expect this to scale linearly to our target 16 core production deployment.
- Sharing process memory across multiple threads allows us to scale look up data size while maximizing shared cache hits.
However, H2 demonstrated superior single-threaded performance and admirable load speeds and used 40% less space on-disk than Derby.
All-in-all Derby wins for this use case, and H2 gets admirable mention.