Real Lessons In Software Development Part 1
Design Considerations for Processing Data in Batch Processes
In my software development journey I've learned many lessons the hard way, through issues that caused me and the teams I've worked in much grief and extra effort. Most of these issues are completely avoidable and avoiding them will extend your life. OK! OK! They won't extend your life but they may help slow down the rate at which you either lose hair or acquire new gray hair. I want to emphasize that these lessons come from true pain and experience, and you should really consider the points I make whenever you are faced with them. I plan to write about these lessons as a series. I don't know how far the series will go but I can assure you that there will be more to come.
In this particular lesson I will provide a few tips about batch processes. I don't want you to think of batch processes in terms of any specific technology, batch (.bat) files for example. A batch process is simply a way to process records that have been queued up for processing. (I use the term records loosely; each data group can also be considered a transaction.) The queue may be explicitly created or it may be implicitly created as part of some other process.
Consider for example a process that read all of the emails in Exchange Server and automatically forwarded them to different recipients based on particular keywords. For example, it could pick out emails that may contain inappropriate content and automatically forward them to the HR department for further screening. I'm not saying that this happens, but if you are in HR or Security I'm sure this would be a handy process. Consider now that instead of running in real time, this process ran on a weekly basis when there was little traffic on the network, such as Sunday evening. That would be an example of an implicit queue. It is implicit because the queue is not really a queue but rather a source of information that is already available.
An example of an explicit queue would be a staging area where records are queued up for processing. For example, this queue could be a folder where you place images and their descriptions, and they are automatically uploaded to your website on a particular schedule. Another example could be a flat file or series of flat files with particular data elements. Consider for example a catalog of upcoming movies that must be entered into your database because your own system displays certain information from it, such as the description, the title, the rating, etc. Maybe you fetch this list weekly from a movie listing service and have it entered into your system automatically. There is a big benefit to fetching the data into your system automatically as opposed to having a user enter it manually.
The key concept in all of this is that at some point you are going to process all the records in the batch queue, one after the other. You may just be entering them into a database or you may be synchronizing them with your system's database. In some other cases you might be generating other queues from that data. Whatever the case may be, you have to consider the old adage: garbage in, garbage out. You are going to have to do some minimal validation to ensure that you maintain the integrity of the receiving system. This is where you have to be especially careful, because one bad record can choke your entire process.
When processing data in a batch process you can process records independently of one another. You can also process them as one big all-or-nothing transaction: either every record gets processed or not a single one does. What I mean by processing records independently is that when one record is not valid you set it to the side and move on to the next one. A record can be invalid for any number of reasons: it is not formatted properly, it is missing some information, it has badly formatted data elements such as a date without the month, some values are outside the allowed ranges, it contains invalid characters, and so on. Whatever the reason, you should quarantine that particular record and continue to process the rest of the batch. Failure to validate the record first can mean a couple of things:
- It can cause your entire process to fail due to some invalid aspect of the record
- It may allow invalid data to get into your system and cause future data integrity problems
But what if, instead of letting the record through, the system rejected it and the result was an unhandled exception? If your system is prepared for exceptions you can grab the record's information and send it to your quarantine area. If your system is not prepared, what will it do? Will it roll back all of the changes made so far? Will it process all records before that one, but not that one or any after it? Will the records before that one get committed even though they may depend on this record, leaving your system in an unstable state? These are all questions that you must consider when processing data in batches.
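As a rough illustration of processing records independently, here is a minimal Python sketch. The record layout (a dict with `title` and `release_date`), the `validate` rules, and the `handle` callback are all assumptions made up for this example; your own records and rules will differ. The point is only the shape of the loop: validate first, catch failures per record, and set bad records aside instead of stopping.

```python
from datetime import datetime


def validate(record):
    """Return a list of problems; an empty list means the record looks valid."""
    problems = []
    if not record.get("title"):
        problems.append("missing title")
    try:
        # Hypothetical rule: dates must be complete, e.g. 2024-05-01.
        datetime.strptime(str(record.get("release_date") or ""), "%Y-%m-%d")
    except ValueError:
        problems.append("bad release_date")
    return problems


def process_batch(records, handle):
    """Process each record on its own; quarantine the ones that fail."""
    processed, quarantined = [], []
    for record in records:
        problems = validate(record)
        if problems:
            quarantined.append((record, problems))
            continue  # set it aside and move on to the next record
        try:
            handle(record)
        except Exception as exc:
            # The receiving system rejected it: quarantine instead of crashing.
            quarantined.append((record, [f"processing error: {exc}"]))
        else:
            processed.append(record)
    return processed, quarantined
```

Here the quarantine area is just a list of (record, reasons) pairs. In a real system it might be a table, a folder, or a separate file that someone reviews later.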
There are times when it really is an all-or-nothing requirement and you need to process all the records or none at all. This type of processing is considered transactional and your system should be prepared for that sort of thing. Most databases already have transaction functionality built into them, but maybe you are not even using a database, in which case you must build or use a bulletproof transactional system. If you need to process all records as part of one big transaction then you must make sure that you have considered the ramifications of that decision. In most cases you will find that what you really need is to process all records individually and simply send the invalid ones to your quarantine area. In other cases you may do a combination of both quarantine and transactions.
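When you do lean on a database for the all-or-nothing case, the code can stay quite small. The sketch below uses Python's built-in `sqlite3` module purely as an example; the `movies` table and its columns are hypothetical. Using the connection as a context manager commits if every insert succeeds and rolls back if any of them fails.

```python
import sqlite3


def load_all_or_nothing(conn, rows):
    """Insert every row in a single transaction; keep none if any row fails."""
    try:
        with conn:  # commits on success, rolls back if an exception is raised
            conn.executemany(
                "INSERT INTO movies (title, rating) VALUES (?, ?)", rows
            )
    except sqlite3.Error:
        return False
    return True
```

A caller would check the return value and decide what to do with the rejected batch, for example by sending the whole batch to the quarantine area.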
Consider for example if you're processing flight reservations for a group. You wouldn't want to separate the members of the group into different flights. There may be cases when you validate them all and then inform the user of the combination of valid and invalid records, giving them the option to proceed as is or to fix the invalid records and then proceed.
All I am recommending is to be cognizant of the design options when processing records in a batch. I've seen it one too many times: the process chokes on one record and the entire process fails. Worse, that was not the intended design, or nobody gave much consideration to what happens when the receiving system fails to do its job because the batch process cannot process any of the data due to one invalid record.
Consider a mailing list job that prints out thousands of records. Because you know it takes a few hours to run, you start it and take off for a few hours, only to come back and realize that it choked on the second record. So instead of your three thousand and something processed records you have one big nasty message waiting to inform you that the process failed. Or worse yet, no output and no message. Now you've just wasted a couple of hours.
If, on the other hand, you came back to a report informing you that it processed 3999 records, was unable to process five others for various reasons, and told you which five those were, you could fix the issue or implement a workaround to handle those five exceptions. Consider for example how Windows behaves when moving files from one folder to another. If one file is set to Read Only the entire process fails. Wouldn't you rather it just moved the files that it can and gave you an error log for the rest? I've had to create my own file manipulation for this very reason in the past.
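To make the file example concrete, here is a small Python sketch of a move that keeps going when one file fails and reports what happened at the end. The `move_files` function and its `move` parameter are names invented for this illustration; it is not a reconstruction of the tool I wrote back then.

```python
import shutil
from pathlib import Path


def move_files(src_dir, dst_dir, move=shutil.move):
    """Move every file it can; collect failures instead of stopping."""
    moved, failed = [], []
    dst = Path(dst_dir)
    dst.mkdir(parents=True, exist_ok=True)
    for path in sorted(Path(src_dir).iterdir()):
        if not path.is_file():
            continue
        try:
            move(str(path), str(dst / path.name))
        except OSError as exc:
            # e.g. a read-only or locked file: note it and keep going
            failed.append((path.name, str(exc)))
        else:
            moved.append(path.name)
    return moved, failed


def summarize(moved, failed):
    """Build the kind of report you want waiting for you afterwards."""
    lines = [f"Moved {len(moved)} files, {len(failed)} could not be moved."]
    lines += [f"  {name}: {reason}" for name, reason in failed]
    return "\n".join(lines)
```

The `move` parameter defaults to `shutil.move` and exists mainly so the failure path can be exercised in a test without needing a genuinely locked file.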
This is only a small example of how serious a failure of this type can be. What if your process is part of critical operations for a factory, and you have, say, 100 or more employees idle because the process failed? That can mean a lot of lost productivity, and when you put a number on it I'm sure it's no insignificant matter.
To summarize everything I've discussed above, please remember the following:
A batch process requires these design considerations regardless of the technology used to implement it. It does not matter if you are using a message queuing system or a particular language. It does not matter if your data source is a flat file or records stored in a relational database. It does not matter if you use Java, C++ or FORTRAN. You will still have to consider these design options.
Consider what your process will do when it runs into an invalid record. Will it choke? Will it stop and inform the user? Will it clog the system? I've seen cases where the process popped up a message on a server and no other batch jobs could be processed until someone came into the room and responded to the prompt by clicking OK.
Consider quarantine options. There are many reporting mechanisms in existence to allow logging. Some of those are built right into the environment that you are working with. In other cases you may have plain text files that integrate with such systems. In this case you must consider what happens when those log files get too large. You must consider a purging mechanism.
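As one possible purging mechanism for plain text logs, Python's standard `logging` module ships a `RotatingFileHandler` that caps the file size and keeps only a fixed number of old files. The logger name, file path, and size limits below are placeholder values for illustration, not recommendations.

```python
import logging
from logging.handlers import RotatingFileHandler


def make_quarantine_logger(path, max_bytes=1_000_000, backups=5):
    """Log quarantined records to a file that rotates instead of growing forever."""
    logger = logging.getLogger("batch.quarantine")  # hypothetical logger name
    logger.setLevel(logging.INFO)
    if not logger.handlers:  # avoid adding a second handler on repeat calls
        handler = RotatingFileHandler(
            path, maxBytes=max_bytes, backupCount=backups
        )
        handler.setFormatter(
            logging.Formatter("%(asctime)s %(levelname)s %(message)s")
        )
        logger.addHandler(handler)
    return logger
```

Once the file would exceed `max_bytes` it is renamed with a numeric suffix and a fresh file is started. Anything older than `backups` files is deleted, so disk usage stays bounded.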
I hope that this post will save you headaches, loss of hair and the acquisition of new gray hair due to lost productivity and lost time debugging obscure bugs.
Until next time....happy coding!