7.21.2009

ActiveRecord and Bulk Deletes

I came across an issue the other day that I thought would be fairly straightforward: deleting a bunch of records from the database based on certain criteria. I've done things like this plenty of times before in Java, but this was my first time doing it in .NET, using NHibernate and ActiveRecord.

The Problem

Let's pretend I want to delete all of a user's settings when they choose to import a new batch of settings (rather than deal with merging the two sets). In HQL I would say something like:

delete from UserSettings where User = :user

and then, to actually run it, I'd replace the named parameter with an instance of the logged-in user.
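
For comparison, here's a minimal sketch of how I'd run that in Java with a plain Hibernate Session (the session and currentUser variables are assumed to be in scope):

// bulk DML goes through executeUpdate(), not list()
int deleted = session.createQuery("delete from UserSettings where User = :user")
        .setParameter("user", currentUser)
        .executeUpdate();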

I wanted to use this same kind of HQL in my .NET application, and that's where I ran into some issues. There is almost no good documentation on the ActiveRecord project. The API documents are incomplete and the user guide is only good for the very basics. The one approach the documentation does recommend, using the SimpleQuery or ScalarQuery objects, comes with one key caveat that the documentation forgets to mention: THEY DON'T WORK WITH DML-STYLE STATEMENTS. Seems like an important thing to mention.

Searching the internet for a solution wasn't much better. Most examples show some sort of retrieval happening before the DELETE: either retrieving the entire domain object, or retrieving just the PKs with a projection and passing those into a DeleteAll method. But I didn't want to have to retrieve things from the database just to delete them, so I came up with this approach instead.

The Solution

The first thing to do is extend the Castle.ActiveRecord.ActiveRecordBaseQuery, like so:

class DeleteUserSettingsQuery : ActiveRecordBaseQuery
{
    private readonly User user;

    public DeleteUserSettingsQuery(User user) : base(typeof(UserSetting))
    {
        this.user = user;
    }

    protected override IQuery CreateQuery(ISession session)
    {
        IQuery query = session.CreateQuery("delete from UserSetting where User = :user");
        query.SetParameter("user", this.user);

        return query;
    }

    protected override object InternalExecute(ISession session)
    {
        return CreateQuery(session).ExecuteUpdate();
    }
}

The trick here is to override the CreateQuery method AND the InternalExecute method. If you don't override InternalExecute, the default implementation will be used, which calls List(), and List() does not support DML-style operations.

Using it is as simple as:
UserSetting.ExecuteQuery(new DeleteUserSettingsQuery(user));

The ExecuteQuery method is available because UserSetting is a domain object that extends ActiveRecordBase.

That's it. Easy. Hopefully this solution saves someone else some time and frustration.

12.11.2008

Validating Domain Objects: Who's Responsible?

Overview

Whenever you persist data for an application, there are certain constraints that the data must adhere to. Some of those constraints can be defined on the database schema itself (assuming a database is the data store): things like a category name being unique, or a price that must be defined (not null) for an item. There are also other constraints, business constraints, that don't map as well to a database schema. Making sure a person's age is between 0 and 125, or that a date for scheduling a bill payment falls in the future, are rules that matter to an application's flow but aren't readily definable in your database schema. You don't want invalid data being persisted, so someone's got to step up to the plate and make sure that doesn't happen.

Who's in Charge?

So where is the data validation going to occur? There are a few options available. Validation can be done explicitly, either in the user interface itself or somewhere on the server, or it can be done implicitly by tossing the data to the database and seeing if the database has any complaints. You may also choose a combination of these approaches. Let's look at some of the pros and cons of each of these methods.

UI Validation

There are some definite benefits to putting data validation into the UI layer. You can do a large chunk of validation without ever having to make a call to the server. Because you never have to hit the server, validation can be done on every key event if needed (think determining password strength). Even if you don't go balls to the wall validating on keystrokes, you can still see everything that is invalid about an object all at once; you aren't locked into any kind of 'fail on first' validation error that masks other validation problems.

There are also some definite drawbacks to this approach. The UI can't validate data uniqueness without going back to the server. Even if you've opted to load the entire domain model into the UI, you can't guarantee uniqueness in a concurrent application (unless you've made the mistake of locking rows, never detaching objects, and leaving sessions open indefinitely in serializable-isolation transactions). The point is, you NEED to go to the database to guarantee uniqueness. Another drawback to validating in the UI is that you are putting intelligence into the user interface that many (myself included) believe doesn't belong there. In a good design, the user interface should be as ignorant of the business logic as possible, and it should definitely be ignorant of the schema of the data store. By doing certain UI validations, you're breaking this rule: the UI needs to know that there is a NOT NULL column so that it can validate the presence of that column's value. Changes to business rules or the data store schema then require corresponding changes in UI components, creating more development work and increasing the likelihood of a feature disconnect.

'Server' Tier Validation

I can't hide my true feelings: this is my preferred location for data validation. It shares many of the same benefits as UI validation. A well-designed validation framework on the server can tell you everything that is wrong with the data, preventing the 'fail on first' validation issue. Client-to-server communication won't be as quick as all-client validation, but an intelligent communication protocol and well-designed inputs and outputs could make even the validate-on-keystroke scenario viable with server validation. Server validation has the added benefit of supporting multiple user interfaces. If you need to provide server access as an API, or if you are developing an application that will have customizable interfaces, you don't want to reinvent the validation wheel every time. With server validation you don't have to.
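
As a minimal sketch of the collect-everything idea (the validator and its rules here are hypothetical, not from any particular framework):

public class UserValidator {
    public List<String> validate(User user) {
        List<String> errors = new ArrayList<String>();
        if (user.getFirstName() == null || user.getFirstName().trim().length() == 0) {
            errors.add("First name is required");
        }
        if (user.getAge() < 0 || user.getAge() > 125) {
            errors.add("Age must be between 0 and 125");
        }
        // collect every violation and report them all in one round trip,
        // instead of failing on the first problem found
        return errors;
    }
}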

Like the UI, the server by itself can't guarantee uniqueness. This seems to be the consistent wrinkle in otherwise clean validation framework solutions. You could query the data store prior to inserting or updating your data, but do you want to incur the extra query for every data manipulation? Maybe you do, maybe you don't; it's your call. Server validation also won't provide as rich a UI experience as actual UI validation, at least not as easily. If you are building something that visually rich, exciting and UI-centric, then my recommendation would be to enhance server validation with the necessary UI validation, instead of picking UI validation OVER server validation.

Database Validation

Obviously I've picked a favorite approach already, but this solution is still worth mentioning. Why? There may be situations where it is an appropriate solution. Please don't ask me what they are. The beauty of a database solution is that you don't really need to do any extra work. If you try to save data that isn't valid according to the schema, it simply won't work. You could pass the resulting error back up the stack to the UI as-is, or even parse it, make it a little nicer and THEN send it up the stack. How does it deal with uniqueness? The same way it deals with any other invalid data: it doesn't work. No precheck necessary to see if the value already exists.

A big drawback to this approach is that it doesn't validate business logic. You CAN do some business logic validation; it just requires tighter database integration, more complex SQL statements and probably some good, old-fashioned vendor-specific functionality. If you're going to go through all that trouble, you may as well put a validation framework in place. Another big drawback is that the database will only report one error at a time, the first one it comes across. If you are trying to save a user without a first name or a last name (both required), you'll get a message to provide a first name. Send it again with a first name and it'll give you another message asking for a last name. By the time you've sent your third try, someone's taken the username you wanted. Not an awesome user experience.

Wrapup

Data validation is an important part of a quality data-driven application. There are different ways to do validation, and an argument can be made for each of those ways depending on the situation. Overall, a server-based validation approach is the best way to go. A server approach can be enhanced with some well-placed UI validation in certain cases. Database validation isn't really validation so much as it is hopeful laziness. A future article will probably cover integrating and extending Hibernate's Validation Framework.

Entity Relationships and DTOs: A Better Way (Part 2)

Recap

In the previous article we looked at a scalability issue that can arise from using a findById method to establish entity relationships for a collection of transfer objects. Our solution was to look up the collection of related entities (Role) with a single find, store those entities in a map keyed off the entity's primary key, and then retrieve the entity from the map when building the collection of enclosing entities (User) from the transfer objects (UserDto). It's a good solution.

Good But Not Great

The previous solution isn't appropriate in all scenarios, as you're about to see. Consider the following entity:
public class User {
    private Long id;
    private String firstName;
    private String lastName;
    private String email;
    private Role role;                   // entity relationship
    private ContactInfo contactInfo;     // entity relationship #2
    private Collection<User> dependents; // entity relationship #3

    // getters/setters
    ...
}

We've added a new reference to the ContactInfo entity, which we've specifically chosen to be an entity rather than a component for management and lookup reasons. We've also added a collection of User entities to represent dependent Users (perhaps our fictitious system is of interest to the IRS). The DTO for this entity might now look something like this:
public class UserDto {
    public Long id;
    public String firstName;
    public String lastName;
    public String email;
    public Long roleId;
    public Long contactInfoId;
    public Collection<Long> dependentIds;
}

To use our previous solution, we would do the following:
  • Get set of Role entities into a map
  • Get set of ContactInfo entities into a map
  • Get set of dependent User entities into a map
  • Create Users from UserDtos, pulling from the maps as necessary

The 9 extra lines of code are now 27 extra lines of code, unless you built the generic utility method of course. You DID build the generic utility method, didn't you? For our collection of 100 transfer objects, we will have run 3 SELECTs, which is not as good as 1, but still better than the 300 we would have run otherwise. For a collection of 100 elements, or even 1000 elements, 3 additional statements against the database doesn't seem so bad. The problem is that managing a single User at a time (a likely scenario) still requires those same 3 SELECTs.

Taking Advantage of Proxies

It would be nice if we could eliminate the extra SELECT statements completely, but how? The entities don't have references to primary keys; they have references to full-fledged entity instances (and this is a good thing). But the database only needs certain keys to represent the entity relationships, not the whole entity structure, so any UPDATE or INSERT generated by Hibernate will only need the foreign key information from the related entities.
We need instances of the related entities that have their primary keys set AND that are managed by Hibernate's persistence context. We need the Session.load() method.

Session.load() Solution

Here's the code to use the Session.load() method. An explanation follows the code:
for (UserDto dto : userDtos) {
    User u = new User();
    u.setId(dto.id);
    u.setFirstName(dto.firstName);
    u.setLastName(dto.lastName);
    u.setEmail(dto.email);
    u.setRole(roleDao.loadReference(dto.roleId));
    u.setContactInfo(contactInfoDao.loadReference(dto.contactInfoId));
    u.setDependents(userDao.loadReferences(dto.dependentIds));

    // add u to list and save in bulk after loop
}

The loadReference method in the DAO would look something like this:
public Role loadReference(final Long id) {
    return (Role) getSession().load(Role.class, id);
}

The loadReference() methods in the other DAOs would look eerily similar. This load method will create a proxy instance of the Role and put it into the persistence context, all without actually hitting the database. This instance can now be used wherever that Role instance is needed. The getter for the primary key will return the id value that the proxy was initialized with, so Hibernate can use these proxy instances to generate the database statements. We now have a solution with simplified coding that eliminates the need to hit the database before saving an entity relationship.
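
The collection-flavored loadReferences() used for the dependents isn't shown, but it follows the same pattern; a minimal sketch:

public Collection<User> loadReferences(final Collection<Long> ids) {
    Collection<User> users = new HashSet<User>();
    for (Long id : ids) {
        users.add((User) getSession().load(User.class, id)); // proxy only, no SELECT
    }
    return users;
}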

NOT a Silver Bullet

There are times when a load isn't appropriate. If you don't know for sure that a row with the specified identifier exists in the database, then you shouldn't use load. A proxy instance will still be created, but a runtime exception will be thrown when the transaction tries to actually write the rows to the database and runs into a constraint violation.
Additionally, if you need any information from the proxied instance other than its identifier, load() has no advantage: the first call for any of that additional information will retrieve the row from the database.
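
If existence is in doubt, the usual alternative is Session.get(), which does hit the database immediately but returns null for a missing row instead of handing you a proxy that fails later:

Role role = (Role) getSession().get(Role.class, dto.roleId); // issues a SELECT now
if (role == null) {
    // handle the bad identifier here, not as an exception at flush time
}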

12.10.2008

Entity Relationships and DTOs: A Better Way (Part 1)

Introduction and Simple Case

When entities are represented as transfer objects for a client, the server must assemble an entity representation of that transfer object before using it in a persistence layer. Sometimes this can be simple:
public class User {
    private Long id;
    private String firstName;
    private String lastName;
    private String email;

    // getters/setters
    ...
}

In this case, the UserDto fields can exactly mimic the User fields and the server can create an entity with ease:
User u = new User();
u.setId(dto.id);
u.setFirstName(dto.firstName);
u.setLastName(dto.lastName);
u.setEmail(dto.email);

Entity Relationships and a Naive Approach

Sometimes, a transfer object needs to represent an entity relationship, which adds to the complexity. For instance, if we were to add a many-to-one relationship from the User entity to a Role entity, the User entity would change as follows:
public class User {
    private Long id;
    private String firstName;
    private String lastName;
    private String email;
    private Role role; // entity relationship

    // getters/setters
    ...
}

Typically you would not add an entire RoleDto to the UserDto, instead opting to just add a field to the UserDto that holds the Role's primary key identifier.
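
For concreteness, the flattened UserDto would look something like this:

public class UserDto {
    public Long id;
    public String firstName;
    public String lastName;
    public String email;
    public Long roleId; // the Role's primary key instead of a nested RoleDto
}

How does this affect the code to create a User entity from the UserDto?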

A sensible first approach to the problem would be the following:
User u = new User();
u.setId(dto.id);
u.setFirstName(dto.firstName);
u.setLastName(dto.lastName);
u.setEmail(dto.email);
u.setRole(roleDao.findById(dto.roleId));

A Problem of Scalability

From a coding standpoint, this seems like a nice, simple and clean approach. Unfortunately, this solution has a problem: the line that retrieves the Role entity using the findById() method. It hits the database with an additional SELECT statement every time you want to update the user. A SELECT and an UPDATE instead of just an UPDATE doesn't sound like the end of the world, and it isn't. Scalability is the problem. Suppose instead of having to create a single User entity to save, you have to create a collection of User entities. Obviously you could take the above code and put it in a for loop:
for (UserDto dto : userDtos) {
    User u = new User();
    u.setId(dto.id);
    u.setFirstName(dto.firstName);
    u.setLastName(dto.lastName);
    u.setEmail(dto.email);
    u.setRole(roleDao.findById(dto.roleId));

    // add u to list and save in bulk after loop
}

A collection of 100 User transfer objects now results in 100 extra SELECT statements! This seems highly unnecessary, especially considering that there probably aren't 100 distinct Role types in this fictitious system the example is using.

A Reasonable Solution

A straightforward approach to fixing the problem is to code around it. To do that, we'll first look through the collection of User transfer objects, pulling out the primary key identifiers of the Role entities. Then we'll get all the Role entities at once with a single find (single SELECT statement behind the scenes) and put them into a Map. Lastly, we'll set the User entity's role from an element in the Map, eliminating the SELECT statement per User entity that we previously had:
// collect the Role ids to load
Set<Long> roleIds = new HashSet<Long>(); // a set will manage dupes for us
for (UserDto dto : userDtos) {
    roleIds.add(dto.roleId);
}

// load the Role entities and store them in a Map keyed by id
List<Role> roles = roleDao.findByIds(roleIds);

Map<Long, Role> rolesMap = new HashMap<Long, Role>();
for (Role role : roles) {
    rolesMap.put(role.getId(), role);
}

// the original save loop, slightly modified
for (UserDto dto : userDtos) {
    User u = new User();
    u.setId(dto.id);
    u.setFirstName(dto.firstName);
    u.setLastName(dto.lastName);
    u.setEmail(dto.email);
    u.setRole(rolesMap.get(dto.roleId)); // the slight modification

    // add u to list and save in bulk after loop
}

The detail of the findByIds() method isn't that important. Maybe it uses an IN clause. Just as long as it doesn't do a SELECT for each id that is passed...
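
For what it's worth, one possible sketch of it using Hibernate's Criteria API (an HQL query with an IN clause would work equally well):

@SuppressWarnings("unchecked")
public List<Role> findByIds(final Collection<Long> ids) {
    // one SELECT ... WHERE id IN (...) for the whole collection
    return getSession().createCriteria(Role.class)
            .add(Restrictions.in("id", ids))
            .list();
}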

The upside to this code is that it addresses the issue of scalability and doesn't require any extra knowledge about Java or Hibernate or databases beyond what you probably already know.

The downside to this solution is that it requires an additional 9 lines of code for each relationship in the transfer object. The savvy developer could come up with a simple utility function that can produce the same Map through a clever combination of generics and reflection.
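
One hypothetical shape for that helper trades the reflection for a small interface exposing each entity's id (the Identifiable interface is my invention here, not part of Hibernate):

public interface Identifiable {
    Long getId();
}

public static <T extends Identifiable> Map<Long, T> mapById(Collection<T> entities) {
    Map<Long, T> byId = new HashMap<Long, T>();
    for (T entity : entities) {
        byId.put(entity.getId(), entity);
    }
    return byId;
}

With that in place, building the map above collapses to a one-liner: Map<Long, Role> rolesMap = mapById(roles);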

Be sure to check back for Part 2 of this article, which will explore a solution that takes advantage of Hibernate's ability to use proxy objects in its persistence context.