So I was trying to play around with SQL statements to brush up on my skills.
Luckily (or unluckily), someone at my work, a java developer, needed to do a data export and thought that I can do it in 1 SQL line. He said.... and I quote "You only need 1 SQL line. Whats the problem?".
Whenever I hear the "whats the problem" line, there usually is one.
So, with him having great confidence in my SQL-ing ability or a complete misunderstanding of database developing, I was forced to do the data export.
I realized early on that it would take a long time. Just to start off, I needed like 20+ columns, but each column was a row somewhere in 4 different tables.
To make a long story short, I did complete the task and to simply pat myself on the back, I will paste it here. Just to show that, yes, you can program in SQL.
(but next time, please give me rights on the server to do it with several views).
The SQL statement is about 2-3 pages long, by the way and you don't really need to read all of the statement
SELECT
CONTACT.CON_ID as contact_id,
contact.CON_EMAIL_LOWER as email,
contact.CON_CREATION_DATE as creation_date,
lastname.CONATTR_VALUE as givenname,
firstname.CONATTR_VALUE as familyname,
languages.CONATTR_VALUE as "LANGUAGE",
gender2.SELVAL_VALUE as gender,
mobile.CONATTR_VALUE as mobilephone,
office.CONATTR_VALUE as officephone,
home.CONATTR_VALUE as home,
address.CONATTR_VALUE as address,
zipcode.CONATTR_VALUE as zipcode,
province2.SELVAL_VALUE as province,
city2.SELVAL_VALUE as city,
intendedpurchase2.SELVAL_VALUE as intendedpurchase,
ownedcarbrand2.SELVAL_VALUE as ownedcarbrand,
minimodel3.SELVAL_VALUE as minimodel,
dealer3.SELVAL_VALUE as dealer,
subject3.SELVAL_VALUE as subject,
SUBSCRIPTION.sub_id as subscription_id,
brochure13.SELVAL_VALUE as brochure1,
brochure23.SELVAL_VALUE as brochure2,
brochure33.SELVAL_VALUE as brochure3,
message2.CAMATTR_VALUE as message
from CONTACT contact
left join CONTACT_ATTRIBUTE lastname on lastname.CONATTR_CON_ID = CONTACT.CON_ID and lastname.CONATTR_ATTR_ID = 1010
left join CONTACT_ATTRIBUTE firstname on firstname.CONATTR_CON_ID = CONTACT.CON_ID and firstname.CONATTR_ATTR_ID = 1020
left join CONTACT_ATTRIBUTE languages on languages.CONATTR_CON_ID = CONTACT.CON_ID and languages.CONATTR_ATTR_ID = 1000
left join
(SELECTION_VALUE gender2 INNER JOIN CONTACT_ATTRIBUTE gender on gender.CONATTR_SEL_ID=gender2.SELVAL_SEL_ID and gender2.SELVAL_LANGUAGE = 'zh' )
on gender.CONATTR_CON_ID = CONTACT.CON_ID and gender.CONATTR_ATTR_ID = 1030
left join CONTACT_ATTRIBUTE mobile on mobile.CONATTR_CON_ID = CONTACT.CON_ID and mobile.CONATTR_ATTR_ID = 1100
left join CONTACT_ATTRIBUTE office on office.CONATTR_CON_ID = CONTACT.CON_ID and office.CONATTR_ATTR_ID = 1110
left join CONTACT_ATTRIBUTE home on home.CONATTR_CON_ID = CONTACT.CON_ID and home.CONATTR_ATTR_ID = 1120
left join CONTACT_ATTRIBUTE address on address.CONATTR_CON_ID = CONTACT.CON_ID and address.CONATTR_ATTR_ID = 1200
left join CONTACT_ATTRIBUTE zipcode on zipcode.CONATTR_CON_ID = CONTACT.CON_ID and zipcode.CONATTR_ATTR_ID = 1210
left join
(SELECTION_VALUE province2 INNER JOIN CONTACT_ATTRIBUTE province on province.CONATTR_SEL_ID=province2.SELVAL_SEL_ID and province2.SELVAL_LANGUAGE = 'zh' )
on province.CONATTR_CON_ID = CONTACT.CON_ID and province.CONATTR_ATTR_ID = 510
left join (SELECTION_VALUE city2 INNER JOIN CONTACT_ATTRIBUTE city on city.CONATTR_SEL_ID=city2.SELVAL_SEL_ID and city2.SELVAL_LANGUAGE = 'zh' )
on city.CONATTR_CON_ID = CONTACT.CON_ID and city.CONATTR_ATTR_ID = 520
left join (SELECTION_VALUE intendedpurchase2 INNER JOIN CONTACT_ATTRIBUTE intendedpurchase on intendedpurchase.CONATTR_SEL_ID=intendedpurchase2.SELVAL_SEL_ID and intendedpurchase2.SELVAL_LANGUAGE = 'zh' )
on intendedpurchase.CONATTR_CON_ID = CONTACT.CON_ID and intendedpurchase.CONATTR_ATTR_ID = 1300
left join (SELECTION_VALUE ownedcarbrand2 INNER JOIN CONTACT_ATTRIBUTE ownedcarbrand on ownedcarbrand.CONATTR_SEL_ID=ownedcarbrand2.SELVAL_SEL_ID and ownedcarbrand2.SELVAL_LANGUAGE = 'zh' )
on ownedcarbrand.CONATTR_CON_ID = CONTACT.CON_ID and ownedcarbrand.CONATTR_ATTR_ID = 1400
left join SUBSCRIPTION on CONTACT.CON_ID=SUBSCRIPTION.SUB_CON_ID
left join
(SELECTION_VALUE minimodel3 INNER JOIN
(CAMPAIGN_ATTRIBUTE minimodel2 INNER JOIN SUBSCRIPTION minimodel on minimodel2.CAMATTR_SUB_ID=minimodel.SUB_ID)
on minimodel2.CAMATTR_ATTR_ID=52001)
on minimodel2.CAMATTR_SEL_ID = minimodel3.SELVAL_SEL_ID and minimodel3.SELVAL_LANGUAGE = 'zh' and minimodel.SUB_CON_ID=CONTACT.CON_ID
and SUBSCRIPTION.SUB_ID in (select distinct(CAMATTR_SUB_ID) from CAMPAIGN_ATTRIBUTE group by CAMATTR_SUB_ID)
left join
(SELECTION_VALUE dealer3 INNER JOIN
(CAMPAIGN_ATTRIBUTE dealer2 INNER JOIN SUBSCRIPTION dealer on dealer2.CAMATTR_SUB_ID=dealer.SUB_ID)
on dealer2.CAMATTR_ATTR_ID=52002)
on dealer2.CAMATTR_SEL_ID = dealer3.SELVAL_SEL_ID and dealer3.SELVAL_LANGUAGE = 'zh' and dealer.SUB_CON_ID=CONTACT.CON_ID
and SUBSCRIPTION.SUB_ID in (select distinct(CAMATTR_SUB_ID) from CAMPAIGN_ATTRIBUTE group by CAMATTR_SUB_ID)
left join
(SELECTION_VALUE subject3 INNER JOIN
(CAMPAIGN_ATTRIBUTE subject2 INNER JOIN SUBSCRIPTION subject on subject2.CAMATTR_SUB_ID=subject.SUB_ID)
on subject2.CAMATTR_ATTR_ID=50001)
on subject2.CAMATTR_SEL_ID = subject3.SELVAL_SEL_ID and subject3.SELVAL_LANGUAGE = 'zh' and subject.SUB_CON_ID=CONTACT.CON_ID
and SUBSCRIPTION.SUB_ID in (select distinct(CAMATTR_SUB_ID) from CAMPAIGN_ATTRIBUTE group by CAMATTR_SUB_ID)
left join
(SELECTION_VALUE brochure13 INNER JOIN
(CAMPAIGN_ATTRIBUTE brochure12 INNER JOIN SUBSCRIPTION brochure1 on brochure12.CAMATTR_SUB_ID=brochure1.SUB_ID)
on brochure12.CAMATTR_ATTR_ID=51001 and brochure12.CAMATTR_SEL_ID=5100101)
on brochure12.CAMATTR_SEL_ID = brochure13.SELVAL_SEL_ID and brochure13.SELVAL_LANGUAGE = 'zh' and brochure1.SUB_CON_ID=CONTACT.CON_ID
and SUBSCRIPTION.SUB_ID in (select distinct(CAMATTR_SUB_ID) from CAMPAIGN_ATTRIBUTE group by CAMATTR_SUB_ID)
left join
(SELECTION_VALUE brochure23 INNER JOIN
(CAMPAIGN_ATTRIBUTE brochure22 INNER JOIN SUBSCRIPTION brochure2 on brochure22.CAMATTR_SUB_ID=brochure2.SUB_ID)
on brochure22.CAMATTR_ATTR_ID=51001 and brochure22.CAMATTR_SEL_ID=5100102)
on brochure22.CAMATTR_SEL_ID = brochure23.SELVAL_SEL_ID and brochure23.SELVAL_LANGUAGE = 'zh' and brochure2.SUB_CON_ID=CONTACT.CON_ID
and SUBSCRIPTION.SUB_ID in (select distinct(CAMATTR_SUB_ID) from CAMPAIGN_ATTRIBUTE group by CAMATTR_SUB_ID)
left join
(SELECTION_VALUE brochure33 INNER JOIN
(CAMPAIGN_ATTRIBUTE brochure32 INNER JOIN SUBSCRIPTION brochure3 on brochure32.CAMATTR_SUB_ID=brochure3.SUB_ID)
on brochure32.CAMATTR_ATTR_ID=51001 and brochure32.CAMATTR_SEL_ID=5100103)
on brochure32.CAMATTR_SEL_ID = brochure33.SELVAL_SEL_ID and brochure33.SELVAL_LANGUAGE = 'zh' and brochure3.SUB_CON_ID=CONTACT.CON_ID
and SUBSCRIPTION.SUB_ID in (select distinct(CAMATTR_SUB_ID) from CAMPAIGN_ATTRIBUTE group by CAMATTR_SUB_ID)
left join
(CAMPAIGN_ATTRIBUTE message2 INNER JOIN SUBSCRIPTION message on message2.CAMATTR_SUB_ID=message.SUB_ID and message2.CAMATTR_ATTR_ID=50002)
on message.SUB_CON_ID=CONTACT.CON_ID
and SUBSCRIPTION.SUB_ID in (select distinct(CAMATTR_SUB_ID) from CAMPAIGN_ATTRIBUTE group by CAMATTR_SUB_ID)
I apologize in advance if I get anyone's feed-reader stuck.
Thank you for reading my blog.
LOL! Yes, SQL is a programming language. And like most programming languages, there is a graceful way to do design and architecture and then there is the way that I call "blecherous" (it makes you say "blechh!").
ReplyDeleteAs soon as I saw the fourth field in your select-list I knew what was up:
lastname.CONATTR_VALUE
I feel sure this is an instance of a common SQL design blunder called Entity-Attribute-Value. It's not your fault they designed the database this way, but now you can see how difficult it makes your job using that design.
People who design their database with Entity-Attribute-Value practically don't need a SQL database at all -- they need an XML file.
I'm currently writing a book called "SQL Antipatterns" and EAV is high on the list.
I'm also presenting "SQL Antipatterns" as a tutorial at the MySQL Conference on April 14.
http://en.oreilly.com/mysql2008/public/schedule/detail/1639
More details here:
http://karwin.blogspot.com/2008/03/speaking-at-mysql-conference.html
Hi!
ReplyDeleteyes - definitely an Enity Attribute (anti)pattern.
This design should be shunned because of how cumbersome and opaque it is to code against it. All the ID's in the join clause to pick the right attributes are a tell-tale.
But still...what if you have an application like MS Project: it lets you add custom properties to tasks, resources, etc. And, I've done a few projects where we really needed that to calculate custom performance indicators for project progress. Under the hood, MS Project has a entity value module in the database schema to implement this, and indeed - you get these typical queries, which sucks.
But what about the alternatives? Surely, MS Project could just add a new column for a custom property, but the problem is that there are different types of projects and we'd end up with many columns that are basically useless to most types of projects. So that does not seem a good solution. Alternatively, a separate table could be created to hold the custom properties for a single project or group of projects of a certain type.
The disadvantage of modifying the schema to accomodate custom properties would mean that all users that save the project in the db would need to have the privileges to ALTER or CREATE tables. After that, access to those new objects must also be granted to all other users that might need to read those projects.
Not at all impossible, but still, application users with the rights to create database objects....
Another option would be to indeed store all those custom things in some kind of semi-structured format such as XML. But would that really make it better? I seriously doubt it - esp. once we need to combine date stored in custom properties with other data in the db, I think I'd stil rather settle for an entity-attribute approach - rather that than springling XQuery / XPath constructs in my queries to get it done - I mean, it seems a cure that's worse than the disease.
So - what are your thoughts? How to deal with partially extensible datamodels/ structures in relational databases? Should we just give up, or put up with a number of these annoying edges? I really don't know...
My short answer is: if you can't describe your data, then SQL isn't the right tool to query it.
ReplyDeleteSometimes you need semi-structured data, or you need the capability of adding attributes without changing the application code or the database metadata.
In those cases, you are asking for a greater amount of flexibility than SQL was designed to provide. Every programming task makes certain assumptions to achieve simplicity, and SQL makes the assumption that a given entity has a stable set of attributes.
EAV breaks this assumption -- it's designed to accommodate dynamic attributes. But it should be no surprise that the resulting SQL isn't very graceful.
There are emerging standards that treat data and metadata as virtually the same thing, and allow both to be dynamic and queriable. RDF/XML is one example, and there's even a standard query language for it -- SPARQL.
I'd say we're 10+ years years away from having de facto SPARQL interfaces as common as SQL interfaces, and RDF data stores that are as efficient as RDBMS products are today.
In the meantime, EAV can be used for semi-structured data. Though we must recognize it comes at a high cost, and perhaps in the majority of cases where it is used, it'd be simpler to add attributes with ALTER TABLE ADD COLUMN.
Other options include a hybrid between conventional and EAV tables. Store common attributes in the conventional way as columns, but allow extended attributes to be stored in an EAV table referencing the conventional table. At least this allows some queries to avoid the EAV table, if the query only needs to reference the base set of attributes.
Another option is to use "subtables" that extend a parent table, just like an object-oriented derived class adds properties and methods to those inherited from its parent class. It takes a join to reconstitute the full row, but it's better than EAV which requires a join per attribute.
All the solutions naturally add complexity, and therefore you should evaluate them on a case-by-case basis.
@bill karwin
ReplyDeleteI saw that you will be speaking in MySQL conference and I have to say that your topic seemed the most interesting to me :)
So please link the audio/video/slide files somewhere so I can see afterwards.
Regarding the design of this database, I just call it "highly normalized". The people who made the database had to factor in many different circumstances, locations in which it could be deployed and languages that it could be used with. So they opted to do it this way.
My main problem is that if you want to produce reports of some kind, its very very slow.
I worked somwhere else where they had very much the same problem and it was my task to make it go faster.
So I suggest just having an extra table (or atleast view) that duplicates the data from many smaller tables and integrates them.
On topic, to do this in java would probably taken 10x more lines to develop.
Yes, adding an extra table (or tables) to duplicate the data is an important first step.
ReplyDeleteThe next step is to DROP the EAV table! :-)
You shouldn't describe EAV as "highly normalized." In fact, the opposite is true - EAV is not a normalized design at all.
It's not even in 1st normal form, because it uses the attr_value column for all attributes in the relation. In a normalized design, each logical attribute should be stored in a distinct column.
"My short answer is: if you can't describe your data, then SQL isn't the right tool to query it."
ReplyDeleteWell, in this case, the data can be perfectly described, just not at the same time as the database schema was designed.
Whether it is "highly normalized" or "not normalized at all" is a matter of perspective.
From the point of view of the developers of the original application, that could not see whatever custom attributes the users would be adding, this EAV pattern is a normalized design to solve the problem of custom attributes. The could also have created a bunch of "spare" or "extra" columns in the existing tables which would depending on what they would be used for hardly be better normalized.
On another level, it is more arbitrary than it may seem whether something should be considered to be unnormalized or 1st normal form.
For example, take phone numbers. In some cases, we don't mind the phone number having a structure of it's own and we choose to regard it as an atomic value. In other cases, we want to separately store and search parts like country and area codes. Similar cases are URLs and email addresses.
For multivalued attributes, we can witness the same. Sometimes it might be necessary to store someones initials. Although we know that each letter stands for a proper name, we will usually choose to treat it as an atomic value.
In the end, the requirements of the application will govern what will be considered an atomic value. There is no objective generic criterion that can tell us that.
i come from a sybase/db2 programming background. we use the entity attribute value pattern very very extensively. there is a very simple way to get around complex queries. use the pattern
ReplyDeleteselect entity.id,
case when attribute.attrib_id = 1 then attribute.attrib_value end as attrib1,
case when attribute.attrib_id = 2 then attribute.attrib_value end as attrib2
from entity e, attribute a
where e.entity_id = a.entity_id
and (your conditions here)
im not sure if the above is possible in mysql.
btw we use the above in entities and attributes having upwards of 50 million rows (and more than 20gb each) without any problems. if you have proper indexes queries are fairly fast.
we have very fast apps running on top of the above tables.
i wouldn;t say the above is an antipattern. i think the problem here is ignorance of techniques in sql.
thx,
ks