Blue Chip Blog
Innovation – MPP and Streaming BIG DATA
Silent but busy! I have been pretty quiet in the world of data warehousing, but for good reason. I have been in Decooda’s innovation lab building a Massively Parallel Processing (MPP) platform for analytics and complex event processing. All I can say is that it is really quite amazing. We created an infrastructure that scales incredibly well and takes advantage of multi-core processors in a very efficient way. (Yes, we avoid threading.)
I can’t really say too much about the platform at this point in time, but it has the potential of being a MAJOR game changer because of its simplicity and performance. I am pretty confident that you will not find a simpler solution that scales as well as our platform.
As a use case, we will be setting up our dynamic real-time grid in a pretty beefy High Performance Computing center (HPC). We will be receiving over 250 Million documents per day and applying the majority of 100 different algorithms in real-time and in PARALLEL.
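To put that volume in perspective, here is a rough sketch in Python. It is not our platform's code: the three "algorithms", the function names, and the worker count are all made up for illustration. It only shows the arithmetic behind 250 million documents per day and the general idea of spreading work across cores with processes rather than threads.

```python
from multiprocessing import Pool

DOCS_PER_DAY = 250_000_000
SECONDS_PER_DAY = 24 * 60 * 60


def required_docs_per_second(docs_per_day: int = DOCS_PER_DAY) -> float:
    """Average sustained rate needed to keep up with the daily volume."""
    return docs_per_day / SECONDS_PER_DAY


# Hypothetical stand-ins for the real algorithms (scoring, tagging, etc.).
def word_count(doc: str) -> int:
    return len(doc.split())


def char_count(doc: str) -> int:
    return len(doc)


def has_question(doc: str) -> bool:
    return "?" in doc


ALGORITHMS = {
    "word_count": word_count,
    "char_count": char_count,
    "has_question": has_question,
}


def analyze(doc: str) -> dict:
    """Apply every registered algorithm to a single document."""
    return {name: fn(doc) for name, fn in ALGORITHMS.items()}


def analyze_all(docs, workers: int = 2):
    """Fan documents out to worker processes (no threads involved)."""
    with Pool(processes=workers) as pool:
        return pool.map(analyze, docs)


if __name__ == "__main__":
    print(f"{required_docs_per_second():.0f} documents per second on average")
    for result in analyze_all(["Is this fast?", "Yes it is"]):
        print(result)
```

The average works out to roughly 2,900 documents per second, every second of the day, before you account for peaks.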
I can see this platform sitting alongside current investments like Netezza, Vertica, Greenplum, etc. Its potential role within data warehousing is to act as the stored procedure or UDF enablement engine alongside the existing warehouse. It’s just philosophically wrong to embed business logic processing within your database, and this platform addresses that concern. Imagine having a 250 million++ record result passed on from Vertica that you would like to perform complex event processing on. Complex algorithms, calculation, notification, integration, scoring, tagging, AND linguistic analysis are NOW all made possible in PARALLEL without impacting the performance of your existing warehouse.
I am a co-founder and CTO at decooda.com and have been working very closely with David Johnson, CEO, to bring some of his forward-thinking concepts to the market. He is laser focused on solutions and I am laser focused on technology around big data. The blend of the two is extremely complementary.
The Parallel platform will be launched by Decooda.com and is currently being used to support our market research applications. The first application to market is the analysis of open-end text in the form of survey responses, and the same parallel processing platform is being used to analyze social media and the blogosphere.
Take a look at the “Decooda Market Research Survey Assistant” entry in Ideascale’s current contest. We can really use your vote.
* Please give us a Thumbs Up by clicking “I Agree” – this is the actual vote to get us past the elimination round.
* Please Facebook “Like” the Decooda Market Research Assistant
* Please re-tweet this post http://wp.me/pwYHU-cE
Take a look at what we are up to and vote for the Most Advanced Market Research Survey Assistant.
If you would like to know more about the parallel processing platform, please send me an email and I will be sure to get back to you.
Thanks,
Charlie
charles dot wardell @ bcsolution.com
Gartner DWH and my Mystical Quarter Circle Survey
By Vallabh on June 24th, 2010 at 6:10 am
Hello Charlie,
I found an interesting article in Gartner. Hope everybody will find it interesting too. This article compares all technologies with pros and cons we have been discussing: http://www.gartner.com/technology/media-products/reprints/microsoft/vol13/article5/article5.html
Thanks, Vallabh
Thanks Vallabh,
It’s really hard to use something like a magic quadrant to pick a technology for your firm. For example, if you were putting in a retail POS system and wanted to have all transactions feed a real-time DWH, I “might” select Teradata. Apart from very specific applications where an enterprise strategy makes sense, Teradata would not be my first choice for many reasons.
Do you agree with Vertica and HP being closely grouped? How many implementations of Neoview are there? Would you consider Microsoft a real DWH player? I guess it’s all about your perception of what a data warehouse is.
You know, I would be really interested in seeing what YOU say.
Would you take the Mystical 1/4 of a Circle Survey?
The Survey is short and the intention is for you to answer the questions quickly and off the cuff so we can get a sense of your actual perception.
Adhoc, MPP, and IN-MEMORY BI
By vallabh on June 21st, 2010 at 8:29 am
Hello Charlie,
Thanks for such a quick response. It is definitely very helpful information. I am trying to see where Oracle Exadata fits in when compared to the other tools. Could you also tell me what is a better way to handle ad-hoc analytics? Use an MPP in place of the existing database, or instead use MicroStrategy or Spotfire with WebLogic?
Thanks, Vallabh.
Oracle Exadata is Oracle RAC at its core, which means it’s based on Oracle’s OLTP engine. With that said, there are locking and memory sharing issues that RAC needs to deal with (not to mention the shared disk on which it stores its data). These could be potential reasons to lean more towards a BI tool that supports the slicing and dicing of cubes.
If you find that your queries are just not performing because of underlying technology bottlenecks, this approach will give you options as long as you can fit the cube refreshes within your maintenance window. The problem with utilizing cubes is that they need to be pre-defined and you lose your ability to do complete ad-hoc analysis outside of the cube definition.
The MPP databases I have worked with have all been “Shared Nothing” architectures, so there were no potential bottlenecks with the data warehouse technology. I would simply have my BI go straight against my transactional (relational) model. A recent query of mine runs against 9 billion records, filters on 2 columns in the predicate, performs a sum, and completes in about 45 seconds. With speeds like this, you typically do not generate cubes or aggregate tables unless you absolutely have to. So the BI really becomes an issue of function and presentation.
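For readers who want to picture the shape of that kind of query, here is a small sketch. It uses Python's built-in sqlite3 module purely as a stand-in (it is obviously not an MPP database), and the sales table, its columns, and the sample rows are invented for illustration. The point is only that the BI layer issues a plain SUM with a two-column predicate directly against transaction-level rows, with no cube or aggregate table in between.

```python
import sqlite3


def build_demo_db() -> sqlite3.Connection:
    """Create a tiny in-memory table of transaction-level rows (made-up data)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE sales (region TEXT, product TEXT, amount INTEGER)"
    )
    rows = [
        ("east", "widget", 100),
        ("east", "gadget", 250),
        ("west", "widget", 75),
        ("east", "widget", 40),
    ]
    conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", rows)
    return conn


def total_amount(conn: sqlite3.Connection, region: str, product: str):
    """SUM over the raw rows, filtered on two columns.

    Returns None when no rows match, because SQL SUM over an empty set is NULL.
    """
    cur = conn.execute(
        "SELECT SUM(amount) FROM sales WHERE region = ? AND product = ?",
        (region, product),
    )
    return cur.fetchone()[0]


if __name__ == "__main__":
    conn = build_demo_db()
    print(total_amount(conn, "east", "widget"))
```

On a shared-nothing system the same statement is split across the nodes, each node sums its own slice of the table, and the partial sums are combined at the end.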
Occasionally, I will run into ad hoc queries that will just put an overwhelming burden on the system. Only then will I try to aggregate or denormalize the data. If I can avoid it, I would rather not add additional operational aspects to the data warehouse (e.g., scheduling cube generation, report publication, maintaining the OLAP servers, etc.).
I came up with an architecture about 2 years ago and I coined it Executive Warehousing. It was primarily based on technologies like QlikView or Spotfire and you can consider it an alternative to a full blown DWH environment. I called it Executive Warehousing, because the architecture and components are within budget of most executives without having to get approval for a full blown enterprise data warehouse and the cost and committees that they bring.
It begins by keeping a copy of your purified source data extracted as flat files sitting on inexpensive commodity disk. You would then create a process to subset the flat files and populate the QlikView or Spotfire repositories. The IN-MEMORY model of QlikView or Spotfire would allow you to maintain the data at the transactional level, so you would not be losing ad hoc capabilities as you would when generating a cube. At the same time, your ad hoc queries would be fast, as it allows the slicing and dicing to happen in “near” real-time.
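Here is a toy sketch of that flow in Python. The flat file contents, column names, and function names are all made up, and this is not how QlikView or Spotfire load data internally. It only illustrates the two steps: subset a flat file, then keep the subset at the transactional level in memory so any dimension can be aggregated on demand.

```python
import csv
import io
from collections import defaultdict

# Stand-in for a flat file extracted from the source system (made-up data).
FLAT_FILE = """order_id,region,product,amount
1,east,widget,100
2,west,widget,75
3,east,gadget,250
4,east,widget,40
"""


def subset(flat_file_text: str, region: str) -> list:
    """Step 1: pull only the rows this dashboard needs out of the flat file."""
    reader = csv.DictReader(io.StringIO(flat_file_text))
    return [row for row in reader if row["region"] == region]


def sum_by(rows: list, dimension: str) -> dict:
    """Step 2: aggregate on the fly over whichever column the user picks.

    Because the rows are kept at the transactional level, the grouping
    column does not have to be decided ahead of time as it would in a cube.
    """
    totals = defaultdict(int)
    for row in rows:
        totals[row[dimension]] += int(row["amount"])
    return dict(totals)


if __name__ == "__main__":
    east_rows = subset(FLAT_FILE, "east")
    print(sum_by(east_rows, "product"))
    print(sum_by(east_rows, "order_id"))
```

The same in-memory rows answer both the by-product and the by-order question, which is exactly the ad hoc flexibility a pre-defined cube gives up.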
The size of your data, the allowable operations window for refreshes, and your budget are all the driving factors here. I hope that I have given you some food for thought and that you find this information helpful. In summary, if your query response times are fast, the world is your oyster for BI. Use what is easy, cost effective, and has great presentation capabilities. If your data warehouse is sluggish, opt for an IN-MEMORY or CUBE based approach to supplement your warehouse. I always try straight SQL against an MPP database first, aggregation tables second, and cubes last. IN-MEMORY based BI tools are great and may be all you need.
As always, I would really love to hear YOUR experiences out there in the Large Scale DWH world.
Best Regards,
Charlie
Microstrategy, Spotfire and Tableau?
By vallabh on June 18th, 2010 at 1:04 pm
I am also looking for comparison of Microstrategy, Spotfire and Tableau…. Please let me know on what parameters can I compare the tools. I am looking for a technology that offers ad-hoc analytics.
OK, so this has the potential of opening up quite a bit of dialog, so I am making it a standalone post. Please comment and contribute. I too would like to see what others are thinking in this space.
Your backend database and the size of your data are big considerations.
MicroStrategy has the ability to pull data into a repository for its multi-dimensional analysis OR perform pass-through SQL. They have accelerators for specific databases (including optimizations for Aster, Vertica, Greenplum, Netezza, Teradata, and on and on). It is obviously very powerful, and the local cubes provide the slicing and dicing you would expect. They even have a free version that you can use. The reason why I ask “What is your backend?” is that if you are using an MPP technology like Netezza, Teradata, or Vertica, you may not need to pre-aggregate your data, and you therefore just need a good visualization dashboard on top of SQL queries.
Spotfire is in a somewhat different class and fits in a space with QlikView as IN-MEMORY analytics. I have not had hands-on time with Spotfire, but in the QlikView world, you extract your data into MEMORY-MAPPED files. QlikView has pretty amazing compression and an AWESOME set of charting objects. I have been able to create incredible BI dashboards in a few hours that were extremely compelling. You need to keep your memory-mapped files updated with scheduled extractions and publication, and QlikView provides the reporting and publishing servers to do so. With that said, I would imagine Spotfire to be very similar. Being owned by Tibco is not necessarily a bad thing either, but QlikView may be a bit more nimble. The cool thing is that the data is stored at the transactional level, so you can aggregate on the fly “IN RAM”. It’s 64-bit, can support large memory-mapped files, and is pretty intelligent in how it retrieves and buffers the data off of disk.
Tableau may be more in the space with Pentaho, LogiXML, and Jaspersoft. Unfortunately, I have not used Tableau either, but I did work briefly with Pentaho and LogiXML. The thing I like about Tableau and tools like QlikView is the interactive way in which you can work with the data and build reports. Tableau can connect to just about anything from flat files to data warehouses.
Another product you may want to investigate is DUNDAS Dashboard. I was extremely impressed with the visualization and the speed with which I could create dashboards. The rendering is based on Silverlight in the browser and the objects looked awesome. The price is not bad either. http://www.dundas.com/Dashboard/Start/Samples/index.aspx
I happen to like QlikView and Dundas, but ALL these products have a free trial and in most cases even a free limited-use version. I typically work with MPP databases, so I tend to avoid the need for CUBES and multi-dimensional analysis. I am fortunate that my options are usually wide open.
I hope that helps.
Site Status
Hi Folks,
It’s been a while since my last blog posting. I am making a new commitment to get some of my experiences with varying technologies in the MPP space posted. If you read my blog, you can see that I have a passion for parallel processing and have had some pretty interesting opportunities to work with Netezza, Vertica, Teradata and Greenplum. I will be branching off into other areas of MPP technology like grid-based processing, but for the most part I will keep my posts primarily related to data warehousing and the effort to support my peers in branching out into the world of MPP, database appliances, and column store technologies.
Please feel free to ask questions. I do not claim to know all the answers, but there are experts lurking about willing to share their insights.
So in my efforts to give back to the DWH community a little bit, I thought I would give you a one-stop shop for relevant DWH news and comments. Curt Monash at DBMS2.com has a lot of this stuff nailed down pretty well. His analysis is great and his posts are timely. My slant, however, is from a slightly different perspective, as I focus primarily on implementation, architecture, and design. I am still very much in the battlefield architecting systems and deploying dashboards while using a variety of technologies.
JOB Board
The jobs posted on this site are relevant to technology. I know that “Technology” is a broad term, but I had to prime the pump with something. My hope is that over time, firms looking for experts in the various MPP technologies will post here. Take a look and let me know what you think. http://www.bcsolution.com/job-search/
VLDB Dashboard
I created a mashup for you to keep your finger on the pulse of the Very Large Database market. Although it is not perfect, it gives a pretty good sense of interest across the competing technologies over time. Check out the VLDB Dashboard and see how your favorite technology vendor is doing.
Forums
I could never get those forums to take off, and I am considering taking them down. First, though, I am going to take another stab at rearranging some of the topics. Perhaps a slant on technical questions will drive some participation.
This is not a job offer, but rather an opportunity to share the Blue Chip soap box with other passionate technologists. If you have a passion for data warehousing, primarily in the MPP and appliance space, let me know. Perhaps you specialize in one technology and would like to host a weekly column on your topic. Let me know if there is any interest out there. I can be reached at info@bcsolution.com
