Data Warehousing Continued
Part of the reason I was hired by my company was to help the company’s largest customer connect their data warehouse to the company’s website database. I have data basing experience but I have never been involved in such a large project (don’t tell my bosses). Luckily for me, I wont be doing everything, but for starters, I had to get some customer data, mainly email addresses and build a newsletter mailing list. Sounds simple enough..
Unfortunately for me, I had no idea how difficult it is to take even simple data fields from 2-3 sources and merge it into one. Some of the systems were so old, that back when they recorded the email address, there were no validation rules to enforce the input from the users. I saw emails like this http://www.163.com@yingying837.cn (163.com is a Chinese email server). It seemed to me that the people who inputted this email address only recently purchased a computer.
Anyway, after looking at the long list, running some filters and changing by hand some things, I figured I have 95% cleaned and the rest will be done with the email program to check if the email address still exists.
I was amazed how complicated it is to data cleanse old systems and afterwards make it comply with a data warehouse schema. Especially in the case of data that’s in more the one languages and even more so when it’s a language that uses characters like Chinese.
Conclusion
From now on I plan to put a strong emphasis on data quality from any data base I start working on.
I think it would be good to use data quality methods in my simplified business rules engine idea and this whole thing made me realize about possible directions my “new pet project” might go. I have already written a few things down about it on paper.
Thank you for reading my blog.
Check out Pentaho Data Integration (formerly Kettle):
ReplyDeletehttp://kettle.pentaho.org/
Cheers
"From now on I plan to put a strong emphasis on data quality from any data base I start working on."
ReplyDeleteGood luck with that while you're still using MySQL.
You should think to build that warehouse of yours in PostgreSQL, it's much more suited for this kind of tasks than MySQL.
Check out Pentaho Data Integration - aka Kettle. Kettle is basically an ETL tool - it's made to do what you are doing now by hand.
ReplyDeletehttp://sourceforge.net/project/showfiles.php?group_id=140317&package_id=186321
or navigate there from the pentaho main site www.pentaho.org
This will allow you to transparently input data from multiple data sources. So databases, text files, xml, excel....Or you can use ftp or http to get data, whatever...you decide.
Once in kettle you can change the data and join it with data from other sources, and / or store it in a database.
Validating E-mails is something a struggle with all the time. It isn't just a "valid" E-mail address, most of the people I deal with don't know their E-mails. I work with schools and almost everyone is either ".org" or ".k12.us". However, they put in their first initial / last name and then ".com". They just don't know. Sigh. You can only lead a horse to water I suppose...
ReplyDelete