Wednesday, December 27, 2006

Programming with Oracle SQL TYPE constructs, Part 2

In this two-part series I am following up on my UKOUG Annual conference presentation in an attempt to answer a question from Sue Harper on why more people don't program with Oracle's SQL Types. In the previous article I showed how Types still lack some crucial features which expereinced object-oriented programmers would expect. In this article I am going to address the other half of the question: whether PL/SQL programmers don't use Types because they do not understand when they would be appropriate. I hope to show why Types can still be a useful addition to the PL/SQL programmer's arsenal.

TYPE: what is it good for?


Let's look at the places where Types are used:

  • Collections in PL/SQL
  • Message payloads in Advanced Queuing
  • XML DB

These have one thing in common: Types are used to allow programs to handle different kinds of data in a generic fashion. For instance, a collection can be a table of NUMBER or a table of a Complex Data Type. In either case we can manipulate the set of records using standard collection methods such as count(), first(), last(), next(), prior, extend() and delete(). This flexibility has seen programmers to adopt collections in a wide variety of situations: Steven Feuerstein has a whole presentation devoted to nothing but collections. The application of Types in AQ and XML DB similarly provide a framework to handle data as an undifferentiated splodge which we can pass to consuming procedures which understand the data's actual structure.

This gives us a clue as to the kind of application which is suited to a Type application: when we have data with different structures which we want to process in the same general fashion. At the UKOUG conference I discussed my current project. This is a datawarehouse. Source systems provide daily feeds which have to be loaded in to a history schema. Each table has a different structure and so has its own set of DML statements but the whole process is entirely generic. This means that the decision making logic can be stored and executed in a supertype, which calls on the subtypes' overriding implementations to execute specific logic for each individual table. As I discovered at the conference the details of the process are too complex to explain in a presentation. Even in an article like this I don't think it is practical. Instead I am going to use a simpler example which we all understand: customers.

Customers: a worked example


I recently got told off for using contrived examples in my code so I hope the following business scenario doesn't seem too spurious (in real life I would not expect addresses to be stored in quite the way I show here). The business is a mail order company selling something like artists' supplies. It keeps a record of every customer. Some regular customers have accounts; these can either be retail customers (probably professional artists) or wholesalers. The system keeps additional information about its account customers. Wholesale customers may have multiple contact addresses, e.g. for payments and deliveries.

I am not going to build a whole implementation but just enough code to print the shipping address for all the customers. Furthermore, in order to avoid irrelevant complexity I am going to sidestep the whole issue of persistence and just focus on the Types I need to create an API. The first step is to create the Types for addresses and contacts. The ADDRESS Type has a couple of member functions to print the address's components as formatted string.

SQL> create or replace type address as object (
2 line_1 varchar2(100)
3 , line_2 varchar2(100)
4 , post_town varchar2(50)
5 , post_code varchar2(10)
6 , member function formatted_address return varchar2
7 , member function formatted_address
8 ( name in varchar2, fao in varchar := null)
9 return varchar2
10 );
11 /

Type created.

SQL> create or replace type body address as
2 member function formatted_address
3 return varchar2
4 is
5 return_value varchar2(4000);
6 begin
7 return_value := self.line_1;
8 if self.line_2 is not null then
9 return_value := return_value
10 ||chr(10)|| self.line_2;
11 end if;
12 if self.post_town is not null then
13 return_value := return_value
14 ||chr(10)|| self.post_town;
15 end if;
16 if self.post_code is not null then
17 return_value := return_value
18 ||chr(10)|| self.post_code;
19 end if;
20 return return_value;
21 end formatted_address;
22 member function formatted_address
23 ( name in varchar2, fao in varchar := null)
24 return varchar2
25 is
26 return_value varchar2(4000);
27 begin
28 return_value := name||chr(10);
29 if fao is not null then
30 return_value := return_value
31 || 'FAO: '||fao||chr(10);
32 end if;
33 return_value := return_value||self.formatted_address();
34 return return_value;
35 end formatted_address;
36 end;
37 /

Type body created.

SQL>

Wholesale customers may have multiple addresses, which I am representing as a collection. The collection type has methods to retrieve a specific contact address.

SQL> create or replace type contact as object (
2 cntct_code varchar2(10)
3 , name varchar2(50)
4 , cntct_address address
5 );
6 /

Type created.

SQL> create or replace type contact_nt as table of contact
2 /

Type created.

SQL> create or replace type contacts_list as object (
2 contacts contact_nt
3 , member function get_contact_by_code
4 (cntct_type varchar2) return contact
5 , member procedure get_contact_by_code
6 (self in contacts_list, cntct_type in varchar2, addr out address, fao out varchar2)
7 );
8 /

Type created.

SQL> create or replace type body contacts_list as
2 member function get_contact_by_code
3 (cntct_type varchar2) return contact
4 is
5 return_value contact;
6 begin
7 if self.contacts.count() > 0 then
8 for i in self.contacts.first()..self.contacts.last()
9 loop
10 if self.contacts(i).cntct_code = upper(trim(cntct_type)) then
11 return_value := self.contacts(i);
12 exit;
13 end if;
14 end loop;
15 end if;
16 return return_value;
17 end get_contact_by_code;
18 member procedure get_contact_by_code
19 (self in contacts_list, cntct_type in varchar2, addr out address, fao out varchar2)
20 is
21 l_cntct contact;
22 begin
23 l_cntct := self.get_contact_by_code(cntct_type);
24 addr := l_cntct.cntct_address;
25 fao := l_cntct.name;
26 end get_contact_by_code;
27 end ;
28 /

Type body created.

SQL>

Now I create the base type of Customer. This Type is used to represent the non-account customers. I have overloaded the get_mailing_address() function to provide a hook for extending the functionality in the subtypes.

SQL> create or replace type customer as object (
2 id number
3 , name varchar2(50)
4 , legal_address address
5 , final member function get_mailing_address return varchar2
6 , member function get_mailing_address
7 (addr_code in varchar2) return varchar2
8 ) not final;
9 /

Type created.

SQL> create or replace type body customer as
2 final member function get_mailing_address
3 return varchar2
4 is
5 begin
6 return self.legal_address.formatted_address(self.name);
7 end get_mailing_address;
8 member function get_mailing_address
9 (addr_code in varchar2) return varchar2
10 is
11 begin
12 return self.get_mailing_address();
13 end get_mailing_address;
14 end;
15 /

Type body created.

SQL>

To model the account customers I have an Account_Customer Type. It has member functions to get the payment and delivery addresses. This Type is not instantiable, which means that I will have to implement subtypes derived from it.

SQL> create or replace type account_customer under customer (
2 acct_no varchar2(20)
3 , credit_limit number
4 , member function get_billing_address return varchar2
5 , member function get_shipping_address return varchar2
6 , overriding member function get_mailing_address
7 (addr_code in varchar2) return varchar2
8 ) not final not instantiable;
9 /

Type created.

SQL> create or replace type body account_customer as
2 member function get_billing_address return varchar2
3 is
4 begin
5 return null;
6 end get_billing_address;
7 member function get_shipping_address return varchar2
8 is
9 begin
10 return null;
11 end get_shipping_address;
12 overriding member function get_mailing_address
13 (addr_code in varchar2) return varchar2
14 is
15 begin
16 case addr_code
17 when 'SHIPPING' then
18 return self.get_shipping_address();
19 when 'BILLING' then
20 return self.get_billing_address();
21 else
22 return self.get_mailing_address();
23 end case;
24 end get_mailing_address;
25 end;
26 /

Type body created.

SQL>

Now I will create two subtypes of Account_Customer to model retail and wholesale customers. These need to have code to implement the member functions specified in the parent Type.

SQL> create or replace type retail_customer under account_customer(
2 overriding member function get_billing_address return varchar2
3 , overriding member function get_shipping_address return varchar2
4 );
5 /

Type created.

SQL> create or replace type body retail_customer as
2 overriding member function get_billing_address return varchar2
3 is
4 begin
5 return self.get_mailing_address();
6 end get_billing_address;
7 overriding member function get_shipping_address return varchar2
8 is
9 begin
10 return self.get_mailing_address();
11 end get_shipping_address;
12 end;
13 /

Type body created.

SQL> create or replace type wholesale_customer under account_customer (
2 contact_addresses contacts
3 , overriding member function get_billing_address return varchar2
4 , overriding member function get_shipping_address return varchar2
5 , member procedure find_address
6 (self in wholesale_customer, cntct_type in varchar2
, addr out address, fao out varchar2)
7 );
8 /

Type created.

SQL> create or replace type body wholesale_customer
2 as
3 overriding member function get_billing_address
4 return varchar2
5 is
6 l_fao varchar2(50);
7 l_addr address;
8 begin
9 find_address('BILLING', l_addr, l_fao);
10 return l_addr.formatted_address(self.name, l_fao);
11 end get_billing_address;
12 overriding member function get_shipping_address
13 return varchar2
14 is
15 l_fao varchar2(50);
16 l_addr address;
17 begin
18 find_address('SHIPPING', l_addr, l_fao);
19 return l_addr.formatted_address(self.name, l_fao);
20 end get_shipping_address;
21 member procedure find_address
22 (self in wholesale_customer, cntct_type in varchar2, addr out address, fao out varchar2)
23 is
24 begin
25 if self.contact_addresses.contacts.count() = 0 then
26 addr := self.legal_address;
27 fao := null;
28 else
29 contact_addresses.get_contact_by_code(cntct_type, addr, fao);
30 end if;
31 end find_address;
32 end;
33 /

Type body created.

SQL>

In order to avoid having duplicated code in the get_billing_address() and get_shipping_address() functions I have introduced the find_address() procedure to hold the shared code. As discussed in Part 1 I must include this procedure in the Type specification, which clutters the interface but that is just the way it is.

Finally I need to create a nested table type which I can use for holding a collecton of customers.

SQL> create or replace type customer_nt as table of customer
2 /

Type created.

SQL>

The actual procedure to print the delivery addresses is very simple, with a straightforward interface. It takes a list of Customer objects and calls the get_mailing_address() member function to display the address. Because this function was overloaded in the Customer Type the procedure does not need to distinguish betweeen account and non-account customers.

SQL> create or replace procedure print_delivery_addresses (p_list in customer_nt)
2 is
3 begin
4 for k in p_list.first()..p_list.last()
5 loop
6 dbms_output.put_line(p_list(k).get_mailing_address('SHIPPING'));
7 end loop;
8 end print_delivery_addresses;
9 /

Procedure created.

SQL>

To use this procedure I instantiate four customers of three different Types. I pass them as a collection of Customer objects to my procedure.

SQL> set serveroutput on size 1000000
SQL> declare
2 -- sample address objects
3 a1 address :=
4 new address('1 Lister Road', 'Balham', 'London', 'SW12 8MM');
5 a2 address :=
6 new address('34 William St', 'Tooting', 'London', 'SW17 8YY');
7 a3 address :=
8 new address('124 Legal Road', null, 'London', 'NE4 7AA');
9 a4 address :=
10 new address('The Old Warehouse', '23 Museum Street', 'London', 'WC1 8UU');
11 a5 address :=
12 new address('Piggybank House', '86 Bill Row', 'London', 'WC2 8CC');
13 a6 address :=
14 new address('Stapler Mansions', '124 Monet Road', 'London', 'SW4 7GG');
15 c1 contacts_list := new contacts_list(contact_nt());
16 c2 contacts_list := new contacts_list(
17 contact_nt(contact('SHIPPING', 'Reg', a4)
18 , contact('BILLING', 'Raj Pasanda', a5)));
19 -- sample customer objects
20 cus1 customer :=
21 new customer(1111, 'Mr O Knox', a1);
22 rcus1 retail_customer :=
23 new retail_customer(2222, 'Ms Cecilia Foxx', a2, 'A/C 1234P', 1000);
24 wcus1 wholesale_customer :=
25 new wholesale_customer(3333, 'Cornelian '||chr(38)||' Sons', a3
, 'A/C 7890C',10000, c1);
26 wcus2 wholesale_customer :=
27 new wholesale_customer(4444, 'Brushes R Us', a6, 'A/C 9876C',200000, c2);
28 mailing_list customer_nt := new customer_nt (cus1, rcus1, wcus1, wcus2);
29 begin
30 print_delivery_addresses(mailing_list);
31 end;
32 /
Mr O Knox
1 Lister Road
Balham
London
SW12 8MM

Ms Cecilia Foxx
34 William St
Tooting
London
SW17 8YY

Cornelian & Sons
124 Legal Road
London
NE4 7AA

Brushes R Us
FAO: Raj Pasanda
Piggybank House
86 Bill Row
London
WC2 8CC

PL/SQL procedure successfully completed.

SQL>

Note that I edited the DBMS_OUTPUT output for clarity.

How it works as a Type implementation


The procedure print_delivery_addresses issues a call on the method Customer.get_mailing_address(): that is encapsulation. This method is implemented in different fashions in the Customer and Account_Customer Types. The version defined in Account_Customer is the one used by the Reatil_Customer anmd Wholesale_Customer subtypes: that is inheritance. The procedure takes a list of Customer Types but executes the code appropriate for the actual subtype: that is polymorphism.

Conclusion


Assembling this example was an interesting exercise. The starting idea was simple enough but modelling it in objects proved tricky. For instance I am not altogether happy about overloading the get_mailing_address() function in the Customer Type. But the only alternative seemed to be putting an IS OF test in the print_delivery_addresses() procedure to check each Customer object to see whether it is an account customer.

This may seem like a lot of complexity to do something so simple. What needs to be borne in mind is that this complexity is largely overhead. Writing a procedure to print out the billing addresses is another single call. We can submit a list of any or all kinds of Customer objects and be confident that the correct address will be printed. We can extend Customer or AccountCustomer with different implementations of get_mailing_address() and the original procedure will consume them quite happily.

Whether these benefits are sufficient to justify using Types is debatable. Modelling in Types requires a different approach from modelling with packages. Moreover it requires perseverence and flexibility. It is very easy to go off down one route and then find oneself derailed by some limitation of Oracle's Type implementation, or just by some unanticipated aspect of object behaviour. But it is worth having a play around with Types. Because some day you might find yourself designing an application with some complicated generic logic and lots of repetitious specific behaviours and Types may just fit the scenario.

Tuesday, December 19, 2006

What is the point of SQL%NOTFOUND?

A recent post on Oracle Brains reminded me of how the above question used to puzzle me. Then one day I stumbled across the answer in the documentation while I was looking for an answer to some other problem. There are no secrets, only parts of the manual we have yet to read.

%NOTFOUND is one of the four cursor attributes (%FOUND, %ISOPEN and %ROWCOUNT are the others) which give us information about the state of our cursors. The more common situation is to monitor the status of an explicit cursor. In the following example (all examples were run on Oracle 9.2.0.6) I use the %NOTFOUND attribute to determine whether my fetch has returned a row:

SQL> declare
2 cursor cur is
3 select *
4 from emp
5 where ename = 'APC';
6 rec cur%rowtype;
7 begin
8 open cur;
9 fetch cur into rec;
10 if cur%notfound then
11 dbms_output.put_line('There is no employee called APC');
12 end if;
13 close cur;
14 exception
15 when no_data_found then
16 dbms_output.put_line('EXCEPTION: no data found');
17 end;
18 /
There is no employee called APC

PL/SQL procedure successfully completed.

SQL>

%FOUND does the reverse of %NOTFOUND, returning true if the fetch succeeded. %ISOPEN allows us to test whether a cursor is open before we attempt to close it. And %ROWCOUNT gives us a running total of the number of rows fetched so far.

The attributes SQL%NOTFOUND, SQL%FOUND, SQL%ISOPEN and SQL%ROWCOUNT perform exactly the same functions but for implicit cursors. Except that SQL%NOTFOUND is ignored, because the NO_DATA_FOUND exception takes precedence:

SQL> declare
2 rec emp%rowtype;
3 begin
4 select * into rec
5 from emp
6 where ename = 'APC';
7 if sql%notfound then
8 dbms_output.put_line('There is no employee called APC');
9 end if;
10 exception
11 when no_data_found then
12 dbms_output.put_line('EXCEPTION: no data found');
13 end;
14 /
EXCEPTION: no data found

PL/SQL procedure successfully completed.

SQL>

So what is the point of SQL%NOTFOUND? Well, selects are not the only type of statement which open a cursor. DML statements do too. In the following example I use that cursor attribute to see whether my statement updated any rows:

SQL> begin
2 update emp
3 set sal = 10000
4 where ename = 'APC';
5 if sql%notfound then
6 dbms_output.put_line('There is no employee called APC ...');
7 end if;
8 exception
9 when no_data_found then
10 dbms_output.put_line('EXCEPTION: no data found');
11 end;
12 /
There is no employee called APC ...

PL/SQL procedure successfully completed.

SQL>

I could have used if sql%rowcount = 0 to achieve the same effect.

The PL/SQL documentation has more information on these attributes, including some useful advice on the positioning of calls in relation to implicit cursors.

Thursday, December 14, 2006

Why DUAL?

One of my colleagues has just started reading Oracle Insights: Tales From The Oak Table. He is halfway through Dave Ensor's chapter on the history of Oracle. This explains many things - why that ubiquitous file is called afiedt.buf, why Oracle had the world's first commercial RDBMS even though it was based on IBM's System-R, why nobody ever used MTS. One thing it doesn't cover is why the DUAL table is so named.

Now the reason for this appeared in an Oracle Magazine column a few years back. I clicked on my bookmark so I could forward the URL only to get a 404 error. Turns out there's been a recent purge of the Oracle Magazine online archive. I presume this is because of a desire to retire stale and potentially misleading information rather than because OTN is running out of disk space.

Fortunately the Wayback Machine still has a copy of the web page in question. But it is the Internet Archive only works if we know what and when we know what we are looking for. So, s a public service and to keep this information in a Google-isable domain, I reproduce the text here.


The History of Dual
In the November/December issue of Oracle Magazine you recount the history of the company, so I'm wondering if you can tell me how the DUAL table got its name. People I've asked, including seasoned Oracle gurus, have not been able to answer this.
Sean O'Neill

I created the DUAL table as an underlying object in the Oracle Data Dictionary. It was never meant to be seen itself, but instead used inside a view that was expected to be queried. The idea was that you could do a JOIN to the DUAL table and create two rows in the result for every one row in your table. Then, by using GROUP BY, the resulting join could be summarized to show the amount of storage for the DATA extent and for the INDEX extent(s). The name, DUAL, seemed apt for the process of creating a pair of rows from just one.
Chuck Weiss, Oracle

Source: Oracle Magazine "Sendmail column" January 2002

Update


In the Comments Marco Gralike linked to a thread on AskTom. That URL was broken by the migration of AskTom to a new version of ApEx. Here is the new URL for the thread.

Tuesday, December 12, 2006

Title is NULL

Gary Myers has just written about NULL, making the great point that NULL does not mean "unknown". He cites his middle name as proof of this. His parents never gave him one. Hence, a NULL for his middle name does not represent an unknown value because Gary knows his middle name does not exist. But how can we tell that from a SELECT on the CUSTOMER_NAMES table? This is why NULL is such a slippery customer: it is not just the absence of value, it is the absence of meaning too. A NULL in a column is fundamentally uninterpretable.

For instance, a NULL in the column MIDDLE_NAME might represent non-existence, someone like Gary who doesn't have a middle name. Alternatively it might mean inapplicability, because the entity we are recording is an organisation for which the concept of middle names is nonsensical. It might indicate data withheld because the person has an embarrassing middle name and declined to tell us what it is. It might portend cultural incomprehension, when the respondee literally does not understand the concept of "middle name". It might simply be a case of slack input: the operator forgot to ask or forgot to enter the value.

Hugh Darwen, in his UKOUG 2005 presentation, said that this aspect of NULL really bothered Ted Codd and that in his later life Codd devised a scheme for representing all these different flavours of NULL. Apparently Codd had identified twelve or thirteen distinct "meanings" of NULL. A project that surely invites an index finger twirling by one's temple.

The fact that NULL is not just an absence of value but an absence of meaning is one that often escapes people. Once on a project I had the task of editing a whole slew of Forms modules, setting every mandatory Y/N flag to be optional and changing the default value to be NULL instead of N. For the customer it was quite obvious that the default value for the question "Have we received the letter?" should be NULL because the answer is unknowable until the operator actually gets to that page and sets a value. So for them the meaning of NULL was "This question has not been answered". Unfortunately, in practice NULL could also mean "The operator doesn't know whether we have received the letter". Or "The operator skipped this question". Or "The customer has not sent us a letter." Or ... well, you get the drift.

Unfortunately the solution is not as simple as adopting Fabian Pascal's policy of no NULLs in the database. Because it is not just enough to slap a NOT NULL constraint on every column: that path leads to magic values and they are worse than NULLs. We need to rejig our data models to properly handle those legitimate situations for which no value exists. Which ultimately means having a MIDDLE_NAMES table, consisting of keys and a solitary MIDDLE_NAME column. But we also need better joining syntax to handle all those SELECT statements attempting to retrieve data from the MIDDLE_NAMES table. Outer joins are no good: they just re-introduce null values into the result set, so we are no better off. Modern SQL databases - not just Oracle - are not capable of representing all the different nuances of NULL.

Given the current state of SQL I think we have no choice but to live with NULLs. We just need to remember that when we find a NULL in our data it is not just that we don't have a value in this instance; we also don't know why we don't have a value in this instance.

Further reading


'NULL: Nothing to worry about' by Lex de Haan and Jonathan Gennick
'How to Handle Missing Information without Using NULL' by Hugh Darwen

Wednesday, December 06, 2006

Back to the Future of Forms Pt.94

An old chestnut from the UKOUG Development SIG has appeared in a thread on the ODTUG listserv: the Future of Forms. Specifically whether it is worthwhile doing new development work in Oracle Forms. The Tools Statement Of Direction states that extended support for Forms 11g will be available until 2013. There is no statement beyond that, but seven years is a long time in technology.

We do know that by the time Project Fusion has finished in 2013 the whole of the Oracle E-Business Suite will be all Java web stuff, no Forms whatsoever. But the next version of Oracle Applications (12) will still have a hefty slice of Forms in it. Whilst Apps is a consumer of Forms there will be support.

What happens to Forms after Apps no longer uses them is a different matter. Presumably as long as enough people are prepared to keep paying for the licences I guess Oracle will provide support, maybe even enhancements. Which is, I suppose, the real reason for the de-support of client/server Forms.

In the old days Oracle could afford to just licence the Forms IDE, as a way of subsidising the development environment for Oracle Apps. But once Oracle had decided to take the Java open road that stopped being such a big concern. It became more important that Forms customers pay their own way. By requiring Forms users to stump up for Application Server licences Oracle have turned the product into a sustainable revenue stream. Or killed it off. Take your pick.

Update: 19-DEC-2006


I have just noticed that Forms Product Manager Grant Ronald recently wrote on The 2013 Conspiracy.

Friday, December 01, 2006

Repetition

"This is the three Rs:
Repetition Repetition Repetition"

Repetition - The Fall

Bob Baillie's most recent blog entry reminds us of the Pragmatic Programmers' DRY principle: Don't Repeat Yourself. In other words, don't have duplicated code. Repetitious code is harder to maintain. It can also be harder to understand, because it is more verbose. Code shorn of duplication tends to be more modular and so lends itself to reuse.

These are all noble aims. The trick is to identify the duplicated code.

If you're using cut'n'paste you're probably making a design error


It is relatively easy to spot the duplication in a chunk of code like this:

...
x := get_some_value(4);
y := my_array.next();
call_some_routine(x,y,z);
if z = 0 then
x := get_some_value(5);
y := my_array.next();
call_some_routine(x,y,z);
if z = 0 then
x := get_some_value(6);
y := my_array.next();
call_some_routine(x,y,z);
...
end if;
end if;

There has to be a better way to manipulate that array. And those nested IF statements are the Suck! We can extract some of the duplicated code into a common function and call it from inside a loop

function dry (p1 in number, p2 in number) return number
is
n number;
return_value number;
begin
n := get_some_value(p1);
call_some_routine(n,p2,return_value);
return return_value;
end;
...
for i in my_array.first()..my_array.last()
loop
dry(i,my_array(i), z);
if z != 0 then
exit;
end if;
end loop;

Spotting repetitious code like that is fairly easy because the duplication is co-located and the same shape. Repetition can be harder to spot when the problem is code in different PL/SQL packages. Or when the duplication occurs in different shapes. For instance we might write an insert statement which assigns a sequence.nextval to the primary key column when the table owner has already created a before insert trigger to do precisely the same thing.

Digging out repetition


Repetition can arise in two ways: we are writing a new piece of functionality and we find ourselves doing cut'n'paste a lot. Or when related pieces of functionality get written by different people or at different times.

When we find repeating code in existing systems we can decide to leave it be or to remove it. Normally this happens because we are working on the program, during a bug fix or enhancement. So it makes sense to tackle it now by (dread word) refactoring the code. There are a number of different approaches to refactor the code, depending upon the precise form of the repetition. The important thing is to ensure that deduplicated code works exactly the same as the duplicated code. This means we need to have tests - preferably automated ones - which we can run before and after we change the code.

It is preferable to avoid repetitious code in the first case. Easier said than done. It can be hard to write shared code and get it right first time. Carl Manaster has written a very interesting article on a technique he calls Triangulation. The basic principle is:
  1. Write the code for the first case in full, to get the solution right.
  2. Write the code for the second case in full, to identify the common features.
  3. Extract the duplication into a common function and refactor the two cases to use it.

In other words, to avoid repetition we must first introduce it. Otherwise we cannot be sure that our common function correctly and completely captures the duplicated functionality. Carl's article again emphasises the importance of having automated unit tests to demonstrate that removing the repetition has not broken our code. There are just some things we can't repeat often enough.