
Pervasive Software is a company focused on dealing with the complexity of multicore CPU architectures. Much current software does not take advantage of multicore CPUs, so many systems use their hardware inefficiently. To address this, Pervasive offers an entire framework called DataRush. Here is an exclusive interview with Michael Hoskins, CTO and General Manager of Pervasive Software.

Michael Hoskins

With so many technologies out there, what does Pervasive DataRush offer?

Our framework provides a way for Java developers to implement highly parallel data-intensive applications. While some folks in the industry are suggesting that we need to shift to functional languages such as Haskell or Erlang, there’s a huge pool of Java coders in organizations of all sizes, and it is unlikely they will abandon their current and familiar IDEs, libraries, frameworks, and methodologies for these languages; the cost of switching would be immense. Multi-threaded apps are hard to code and test, and even harder to design.

As we move into the hyper-parallel world of hundreds or even thousands of threads, this problem can become unmanageable. Pervasive DataRush™ is a framework that hides the complexity of parallel programming: memory management, threading, queuing, and deadlock detection. It is an implementation of dataflow technology, which is a variant of flow-based programming. These concepts were developed over decades as an alternative to the von Neumann model of serial computation. Although we touch on this on our website, you might want to check out Wikipedia on both flow-based programming and dataflow. The dataflow technology is based on Kahn process networks as extended by Parks’s work on bounded scheduling.
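
Editor's note: to make the dataflow idea concrete, here is a minimal sketch of a three-stage process network in which each stage runs in its own thread and communicates only through bounded queues. It uses plain java.util.concurrent, not the DataRush API; the class name, the method name, and the end-of-stream sentinel are all assumptions made for illustration.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// Illustrative sketch only: standard JDK classes, NOT the DataRush API.
class DataflowSketch {
    // Hypothetical end-of-stream marker used by this sketch.
    private static final long END = Long.MIN_VALUE;

    // Runs source -> square -> sum as three concurrent processes.
    // 'capacity' bounds each queue, so a fast stage blocks instead of
    // buffering without limit.
    static long sumOfSquares(final int n, int capacity) {
        final BlockingQueue<Long> numbers = new ArrayBlockingQueue<>(capacity);
        final BlockingQueue<Long> squares = new ArrayBlockingQueue<>(capacity);
        final long[] result = new long[1];

        Thread source = new Thread(() -> {
            try {
                for (long i = 1; i <= n; i++) numbers.put(i); // blocks when full
                numbers.put(END);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        Thread square = new Thread(() -> {
            try {
                // take() blocks when the input queue is empty
                for (long v = numbers.take(); v != END; v = numbers.take()) {
                    squares.put(v * v);
                }
                squares.put(END);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        Thread sink = new Thread(() -> {
            try {
                long total = 0;
                for (long v = squares.take(); v != END; v = squares.take()) {
                    total += v;
                }
                result[0] = total;
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });

        source.start();
        square.start();
        sink.start();
        try {
            source.join();
            square.join();
            sink.join(); // join makes result[0] visible to this thread
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return result[0];
    }

    public static void main(String[] args) {
        System.out.println(sumOfSquares(1000, 16));
    }
}
```

No stage shares mutable state with another or takes an explicit lock; the blocking reads and bounded writes on the queues do all the coordination, which is the property the dataflow model relies on.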

What are some areas in which DataRush is heavily used?

Pervasive DataRush is targeted to applications that are both compute- and data-intensive. Typically these are batch jobs that crunch their way through massive datasets doing sorts, transforms, analysis, matching, and so on. Think of a retail chain, for example, which generates hundreds of gigabytes of transaction data every week. A classic scenario would involve data mining across millions of rows to look for interesting patterns. And of course, users demand results in ever-shrinking time windows. For example, a current prospective customer needs to validate and transform individual files containing as many as 1 billion records. Other classic data-intensive industries include bioinformatics, manufacturing, risk management, customer relationship management, fraud detection, business intelligence, environmental engineering, patent law, healthcare management, telecommunications, auditing, supply chain management, homeland security, and records management.

Is DataRush a compute or data grid?

Neither. Pervasive DataRush is designed for standard SMP boxes, powered by multicore CPUs. The design-time productivity of clusters and grids with their stone-age, message-passing interface is legendarily bad. With emerging quad-core chips and even greater SMP parallelism on deck via Sun’s “Victoria Falls” Niagara2 chip and Azul’s 768-core monster, we see no need for the complex IT challenges and power-hungry consumption that legacy brittle, heterogeneous multi-machine approaches present – not to mention their high maintenance and operations costs.

Is DataRush a competitor to JavaSpaces, MPI, or MapReduce?

Our framework is an alternative to these approaches. We believe it is easier for the corporate Java application developer to understand, design, and code for, and it lends itself particularly well to the commodity single address-space SMP boxes we see as a dominant fixture in future IT infrastructure.

Where do you think the future of CPUs is heading?

The next 5 years seem very clear. On the mainstream side, AMD, IBM, Intel and Sun will continue to proliferate multicore, putting more cores on each chip and more sockets on each motherboard. Many of these boxes will use virtualization to allow the data center to optimize its servers. This is already driving rapid adoption and deployment of multicore machines. In addition, a number of these multicore machines will be used for data-intensive applications. And the recent multicore-powered leap in the processing power of commercial off-the-shelf servers is quite remarkable. The inherent productivity of having a highly parallel SMP box at your fingertips for iterative data runs is tantalizing.

Another fascinating development in CPUs is the move away from general-purpose cores to much cheaper, dedicated processing cores. One of the players here who interests us is Azul Systems – they’ve put together a 768-core SMP box, all running under a single JVM.

What is the scalability of DataRush technology?

We continue to test the scalability and performance of Pervasive DataRush as bigger and bigger boxes become known to us. Results so far are exciting. Our internal tests using our benchmark applications show near-linear scalability up to 32 cores on hardware as diverse as AMD, HP, Intel and Sun. Externally, we have lit up as many as 384 cores on one server. In both cases, the magic is we didn’t have to rewrite the application to do it. Because a developer using the platform is no longer responsible for manually designing and coding for a specific number of cores, a DataRush-based application can easily be deployed across machines of different capabilities. This is especially important to ISVs and commercial software partners who want robust data-intensive applications to scale and deliver faster performance as the number of cores they run on multiplies.
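
Editor's note: the point about not coding for a specific number of cores can be illustrated with a small sketch in which the degree of parallelism is a runtime parameter rather than something baked into the application logic. This is plain java.util.concurrent, not the DataRush API and not a description of how DataRush schedules work; the class and method names are assumptions.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Illustrative sketch only: standard JDK classes, NOT the DataRush API.
class CoreScalingSketch {
    // Sums 'data' using 'workers' threads. The same code runs unchanged
    // whether 'workers' is 1 or several hundred.
    static long parallelSum(final long[] data, int workers) {
        ExecutorService pool = Executors.newFixedThreadPool(workers);
        try {
            List<Future<Long>> parts = new ArrayList<>();
            int chunk = (data.length + workers - 1) / workers; // ceiling division
            for (int w = 0; w < workers; w++) {
                final int from = Math.min(data.length, w * chunk);
                final int to = Math.min(data.length, from + chunk);
                parts.add(pool.submit(() -> {
                    long s = 0;
                    for (int i = from; i < to; i++) s += data[i];
                    return s;
                }));
            }
            long total = 0;
            for (Future<Long> f : parts) total += f.get();
            return total;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } catch (ExecutionException e) {
            throw new IllegalStateException(e);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        long[] data = new long[1_000_000];
        for (int i = 0; i < data.length; i++) data[i] = i;
        // Size the pool to whatever machine the program happens to run on.
        int cores = Runtime.getRuntime().availableProcessors();
        System.out.println(cores + " cores -> " + parallelSum(data, cores));
    }
}
```

Here the caller still has to partition the data by hand; a dataflow framework, as described in the interview, aims to take that responsibility away from the developer as well.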

What platforms does DataRush support or integrate with?

Pervasive DataRush is 100% pure Java, so it runs on any Java 6-enabled platform. Today we list Windows Server, Red Hat and SUSE Linux, Mac OS X, Azul Systems, Solaris, HP-UX and AIX.

How much does it cost?

Pervasive DataRush is in beta now, and is free to download at http://www.pervasivedatarush.com. It will always be free for academic and research use. We are evaluating general availability pricing options that best serve the market and maintain Pervasive’s reputation for high ROI.

What’s next?

We see the leap in platform power enabling whole new applications for existing and emerging markets. What were once esoteric Ph.D.-powered projects will now be within reach of mainstream Java developers: highly parallelized, scalable applications that crunch huge quantities of data in small time windows on mainstream, commodity SMPs. A full range of organizations will be able to take operational advantage of technologies that until now only global enterprises have had the resources to exploit.

Is there anything you would like to add?

I have attached a recent presentation given by our solutions architect at our user conference. These were great questions!