…and this database is Just Right

By cwardell • May 26th, 2009

I have been fortunate to have created data warehouses in Oracle, Teradata, Netezza, SQL Server and Vertica. (I did a POC with Greenplum when it was called Metapa but it was not quite ready for prime time.)

I have never been able to achieve performance on multi-terabyte databases using SMP (Symmetrical Multi Processing) servers and OLTP databases. In my opinion, the relational/OLTP databases like Oracle, SQLServer, MYSql, etc.. are just the wrong tool for adhoc queries, analytics and reporting of VLDB’s (very large databases). The results I obtained when utilizing MPP have been staggering when compared to the response times of query and analytics of OLTP databases.

There is no theoretical size limit on databases based on MPP architectures. (Massively parallel processing). The more nodes, SPUS, blades, or pizza boxes you have in your cluster, the more capacity your database has and the faster it will perform.  With that said, I am pretty confident that Teradata, Netezza and Vertica can scale to the Petabyte size. There seems to be no real inhibiting factor that I know of.

So what does one choose?

I maintained a Teradata environment for about 6 years and have gone under the hood pretty deeply with Netezza and Vertica. I have programmed at the FPGA level for Netezza and had the opportunity to pick the brains of the senior engineers at Vertica for a while. I am pretty confident that at this point, the MPP technologies I mention below should handle the workload.

The question comes down to “What is your budget and business need?”

Teradata

Teradata has a wide range of logical models and applications they can sell along with the WH. Teradata supports transactional processing and ad-hoc queries very nicely. Teradata also has a professional services team which contributes a large portion of their revenue, but all this comes at a premium price.

Netezza

Netezza is a beast of a machine that does full table scans extremely well, their FPGA technology is very impressive, and the zone indexing does a nice job in minimizing I/O. Netezza’s secret sauce is in their FPGA (Field programable gate arrays). Not to long ago, Netezza opened up the FPGA to developers and now give tremendous power by creating FPGA based UDF’s (User defined functions). These UDF’s perform exceptionally well and can put the afterburners on your query and analytics if done correctly. Netezza has proven the test of time and has a pretty impressive client list and is a very viable alternative to Teradata.

Vertica

Vertica is relatively a new comer to the VLDB space and they are definitely the one to watch. I like the technology and the fact that Vertica is led by Stonebraker who is a veteran in the industry. ALso quite notable is Jerry Held (Exec from Oracle and Tandem) who sits on the board.  You will need to grasp a slight paradigm shift in the way their database operates but it supports ansi standard SQL and ODBC connections so you shouldn’t have any problems. Vertica’s compression algorithms are extremely clever and when combined with a columnar data store they are breaking records. Vertica will definitely give you an A-HA! moment once you understand how the magic happens.

By the way, Syncsort and Vertica broke the world * TPC record for data loads at a fraction of a price when compared to Netezza and Teradata.

* A clarifying point on the TPC record. This test was not an instance of the TPC-H benchmark but rather a benchmark on load performance based on the TPC-H data produced by DBGEN.  The benchmark was audited by a 3rd party, published, and available for others to contest.

Summary

Technically they will all crunch the data and they will all do it MUCH faster than OLTP based databases. Some questions you will need to consider:

    • Q) Do I need or am I looking at other logical data models and applications to kick start my project?
    • Q) Do I want to leverage a professional services team?
    • Q) Do I need user defined functions or stored procedures?
    • Q) Does purchasing proprietary hardware scare you?
    • Q) Do I have a substantial budget or is this a warehouse on a shoestring?
    • Q) Is my DBA adaptable?
    • Q) Do I have a mixed workload?
    • Q) How many users will be hitting the database concurrently?

I know both Vertica and Netezza offers proof of concepts.

TAKE ADVANTAGE OF THE POCS!


One approach you may want to consider is to start with a technology like Vertica. They offer a free trial and it installs easily on commodity hardware. This will provide the lowest cost of ownership and it just may fit your needs.  In parallel, work on setting up the POC with Netezza. This may take a little longer because staging a Netezza environment within your walls has many logistics involved.  Choosing the right technology is a crucial step in the development of a Very Large Data Warehouse. You should take your time on this critical phase and choose wisely.

Comments

Syncsort and Vertica broke the world TPC record for data loads

This is a false statement. There is no TPC benchmark for data loads. They used the datagen tool for TPC-H, but that is the only association with TPC.

Lucas,

Have you reviewed the Audit report? It is very clear with regard to the Transaction Council’s TPC-H Benchmark that it was not an instance of the TPC-H benchmark but rather an instance of the load performance portion of the benchmark. Section 4.2 Benchmark Definition. The audit report goes into great detail about using DBGEN and ALL the players in the field have the opportunity to do exactly what they did to disprove the claim.

If have an audit report, benchmark, or similar report that you can disprove the Press Release, I would really love to see it. I have not seen any formal rebuttal from TPC, Netezza, Sybase, Teradata, or Paraccel but would welcome the post.

Thanks

Lucas,

I reread your post and went back to the audit report to review section 4.2 a few more times. I believe your point is valid enough that the blog should be clarified. I have adjusted the blog above.

Thanks for pointing it out.
Charlie

The issue is that the Vertica benchmark is not a “TPC record” as you claim. If this were so, it would be listed on the TPC website, http://tpc.org, which it is not.
It may be a ETL world record. It may be a data load record. But is certainly not a TPC record.
Vertica never once mentions “TPC record”. Read their press release, especially take notice to the foot note.

I’m sorry for being picky about this but making inaccurate claims is a pet peeve of mine.

No worries about being picky. This is an open forum and your feedback is welcome.

Thank you for your insight.
Charlie

A few comments on the “ETL World Record”:

1. Who judges this world record and what are the qualifying criteria? Does it just depend on which “independent” analyst is looking at what solution at a particular point in time?

2. Who held this “world record” previously – does Vertica even know or are they guessing? In which case, do they really know they’ve beaten it?

3. The most obvious point. Nice performance over an hour in a controlled lab environment. But what sort of comparable stats can Vertica produce for a very large production environment (100 TB+) over a sustained period (i.e. weeks/months/years) – assuming they have such a production environment. And not the marketing-speak terms of peak performance, but TB/hr in continued, 24/7, mixed workload. There are systems out there loading 2TB/hr in real time into something as mainstream as Oracle 10gR2 (peaking at 5x that in busy hours), so is this really that remarkable or just more hollow marketing bluster from the niche players?

Ash,

You ask some great questions. I am going to do a little research and see what I can find.

What I do know is that this effort involved three core technologies. DMExpress, HP Blade System C-Class, and Vertica. The DMExpress PR is here -> http://www.bcsolution.com/2009/05/syncsort-world-record/ – From what I understand, they were all extremely excited from the results.

I find Syncsort to be and extremely reputable brand. For over 30 years they have provided rock solid technology in the very large data space. I have only worked with them during the past 10 years but remember their advertising campaigns in the late 80′s.

So I believe you will find that Syncsort can stand behind their claims. They have some clients with enormous data deploys that I am sure have run through weekends.

I will see what I can find out.

Charlie

The strange thing is that you forgot about Sybase IQ.

Hi Alex,

I did not forget about Sybase IQ, When I first had the opportunity to come across them, I did not see MPP as being the underlying architecture. At that time, they still had many components of their technology that was shared and could be considered a potential bottleneck. Things may have changed since I did my eval, but at that time, Column Store – Yes, Shared Nothing/MPP – No. As a result, I took them off my VLDB and MPP watch list.

Charlie

I need to develop reporting and BI solution for CDRs DB which has new 50K CDRs every second.

Can you please share some comparison table for VLDB and DW, which can handle this kind of data.

Nir,

Have you taken a look at VOLTDB. http://voltdb.com/
It can handle a huge volume of concurrent transactions and I suspect that there may be some vertica integration points on the horizon to handle the Analytics.

Hi
would like to know more abt the Logical data models which teradata has(Retail LDM).
If a client has EDW and all the Data attributes with the Bottom line all KPI’s
Will they still go for Teradata’s Logical Data model.If yes why?
Where will this LDM go and sit in the EDW.
Sorry if it’s very basic

Swetha,

I believe that the Logical Data Models (Finance, Retail, Travel, etc) is a way for Teradata to sell “Solutions” as opposed to “Just Technology”. For example, you could very easily buy the NCR self service checkout registers as seen in Home-Depot knowing that it would feed a retail data model in Teradata.

Teradata has a pretty significant professional services group providing domain expertise not only for the Teradata database but for the vertical market as well.

Teradata handles transactions very well. So you can have your KPI’s sit on top of the transactional model. Depending on the size of data and the # of nodes, you may never need to aggregate your data. In some cases you would just let the parallel processing aggregate on the fly. This is what Teradata calls Real-Time data warehousing.

When you get into the MPP realm, many times it is not needed to create the KPIs as materialized tables.

I am pretty sure that Teradata will not sell there LDM without their appliance though. Let me know if you feel I addressed your questions.

Regards,
Charlie

My query is, How do we query this type of columnr database from Mysql. What differenrt settings would be requierd in the normal select query

Radhika,

When utilizing a columnar db for your data warehouse, you are typically abstracted from the underlying column store. To the client, the data appears to be stored in table, rows and columns. You simply create ansi standard sql as you would in any relational database. The the columnar db (Vertica in my case) takes care of the details behind the scenes.

Under the hood, the column store DB does not read a block of the table into memory but rather reads ONLY the columns needed to satisfy the query.

This alone minimizes the I/O. Vertica goes quite a few steps further with some advanced compression algorithms per column and a concept of bucket. Tremendously bringing down the cost of I/O.

Additionally, they have some more advanced topics utilizing their Flexstore technology that allows you to group columns for single disk reads allowing finer tuning when trying to minimize disk I/O

Hello,

Can you throw some light on Asterdata as compared to Netezza and Vertica.
Which do you think would really break the ice.

Aditya,

So let me start by sending a disclaimer. I have not had hands on with Aster Data so I can’t really comment on it as a product. However, the technology is interesting and I hope to eventually kick the tires.

Here is my take. It depends on what you want to do with your cluster. Aster is based on map-reduce so it opens up opportunities for many interesting things. With map-reduce, you have the ability to run application logic in parallel at the node level. However, it is a row-store database so the I/O is going to be higher than that of a columnar store like Vertica and I would expect that given the equal # of nodes, Vertica will out-perform Aster in query response.

Netezza has the ability to push algorithms down into the FPGA which made for lightning fast computations. With their twinfin product Netezza makes it relatively easy to move algorithms to the SPU level through user defined functions. This opens them up to a lot of opportunities to play in the HPC (high performance computing) space.

Vertica has done an awesome job with the compression combined with the Columnar approach to storing the data. I/O is really minimized and the query response times are incredibly fast. HPC was not their focus so running application logic at the node level was less of a priority. With that said, there is no reason why you could not have something like Hadoop running along side of your Vertica cluster. You would get map-reduce, incredible compression and the benefits of a column store. With the recent version 4.0 of Vertica, external user defined functions are possible, so they are closing that application logic gap.

So the question is what do you want to do? Query? HPC, both? Do you mind proprietary hardware? Do you want to grow your warehouse incrementally with the smallest possible initial capex? What does your data management staff look like? These are just a few of the many considerations you need to give when selecting the technology that suits your business. It will be easier to do a POC on Vertica or Aster, while Netezza will require quite a bit of co-ordination and qualification before they roll a Twinfin frame in your data center.

Netezza and Vertica have an INCREDIBLE client list. Aster is starting to make strides.

One last thing, the Patent office awarded Google a software method patent on MapReduce. What this means for hadoop I really can’t say, but make sure you investigate that path and ask the right questions.

Lot’s to consider as this is an area where you want to be very diligent. I hope this helps. Feel free to contact me if you would like to explore the options more formally. http://www.bcsolution.com/contact/

Best Regards,
Charlie

 

Leave a Comment

« | Home | »