Urine dipstick analysis is one of the most commonly performed tests in clinical laboratories. It is simple and rapid, which makes it suitable for both emergency and primary care settings, where it is often used to diagnose urinary tract infections, proteinuria, haematuria and some other conditions (1, 2).
Unfortunately, urine dipsticks vary substantially among manufacturers with respect to sensitivity, specificity and measurement range (3). Some urine dipsticks have been shown to detect proteinuria poorly because of their low sensitivity (4). Dipsticks may also differ in their diagnostic performance for leukocyte and erythrocyte detection (5), and there is evidence that dipstick pH measurement is insufficiently accurate (6).
Such differences between manufacturers increase the likelihood of diagnostic errors, which can lead to inappropriate clinical decisions and thus create a serious risk for the patient. It is therefore highly desirable that urine dipstick results be comparable across test strip manufacturers.
There are 195 medical laboratories in Croatia, of which the majority (N = 174) perform urine dipstick testing. According to our national External Quality Assessment (EQA) provider (Croatian Centre for Quality Assessment in Laboratory Medicine, CROQALM), 14 urine dipstick manufacturers are present on the Croatian market, together offering 24 different types of urine dipsticks (EQA – CROQALM laboratory reports, unpublished data). We hypothesized that the dipsticks used for qualitative urinalysis in Croatia are heterogeneous and poorly standardized. Although many authors have studied the comparability of several dipsticks, no such comprehensive analysis of 12 different dipstick brands has been performed so far. Our aims were therefore: a) to determine the level of agreement between the 12 most commonly used dipsticks in Croatia using urine samples, and b) to examine their analytical performance by determining their repeatability and their analytical accuracy for glucose and total protein (by comparison with quantitative measurement on chemistry analysers).
Materials and methods
This analytical validation study was performed at the University Hospital “Sveti Duh” (Zagreb, Croatia) between March and May 2017. We collected 75 urine samples from inpatients and outpatients to validate the comparability and accuracy of 12 dipstick brands used in Croatia. Random (untimed) urine samples were collected in polystyrene tubes (10 mL, 16 × 95 mm, Deltalab, Barcelona, Spain) and analysed within 2 hours of receipt. An additional 12 urine samples, one per dipstick brand, were used to validate repeatability. The 12 dipsticks used in this study are listed in Table 1.
Urine samples were selected according to the results (negative, 1+, 2+ and 3+) obtained on an automated urinalysis chemistry analyser (iChem Velocity, Beckman Coulter, Brea, USA) to ensure a wide range of concentrations for each dipstick parameter. Only samples of adequate volume (at least 5 mL) were selected; three 1 mL aliquots were separated from each, and the remainder of the sample was used for dipping the test strips. The aliquots were measured on three automated analysers to assess dipstick accuracy for glucose and total protein. Patient data privacy was ensured throughout the study. The study was approved by the hospital Ethics Committee.
Dipsticks comparability and repeatability
Comparability and repeatability of the dipsticks were assessed according to the Clinical and Laboratory Standards Institute (CLSI) guideline EP12-A2 (7). Comparability of the urine dipsticks was examined on 75 urine samples for the following parameters: glucose, total protein, erythrocytes, leukocytes, ketones, bilirubin, urobilinogen, nitrite, specific gravity (SG) and pH (acidity or basicity). Test strips were read visually by three observers at the same time, using the colour scale provided by the manufacturer. When the observers disagreed, the strip was reassessed and the final colour was agreed by consensus of all three observers.
Repeatability was tested with 20 repeated measurements for each dipstick brand. Replicates were performed on the same urine sample in one laboratory, under the same ambient conditions (e.g. the same room temperature and light exposure). These dipsticks were also read visually by three observers.
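The repeatability check can be sketched in Python. This is a minimal illustration only, assuming that repeatability is scored as the share of the 20 replicate readings that match the most frequent (modal) category; the study states a 90% (18/20) acceptance criterion but does not spell out the scoring rule. The function name `repeatability` and the example readings are hypothetical, not study data.

```python
from collections import Counter

# Acceptance criterion stated in the study: at least 90% (18/20) of replicates
REQUIRED_SHARE = 0.90


def repeatability(readings, required_share=REQUIRED_SHARE):
    """Return (modal category, share of replicates matching it, pass/fail).

    Assumption (not stated in the study): each replicate is compared with
    the most frequent (modal) reading for that dipstick field.
    """
    if not readings:
        raise ValueError("at least one replicate reading is required")
    modal, count = Counter(readings).most_common(1)[0]
    share = count / len(readings)
    return modal, share, share >= required_share


# Hypothetical pH readings from 20 replicates of one brand (not study data)
ph_readings = ["6.0"] * 17 + ["6.5"] * 3
print(repeatability(ph_readings))  # ('6.0', 0.85, False)
```

With 17 of 20 replicates in the modal category the share is 85%, which falls below the 18/20 threshold.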
Analytical accuracy: comparison of dipstick and quantitative measurement
Analytical accuracy was assessed according to the CLSI EP09-A3 guideline (8). Accuracy of urine dipsticks for glucose and total protein was investigated on 75 urine samples. Glucose and total protein were measured quantitatively on three different analysers at three locations in Zagreb: AU400 (Beckman Coulter, Brea, USA) at University Hospital “Sveti Duh”, Architect plus c4000 (Abbott, Abbott Park, USA) at Children’s Hospital Zagreb, and Cobas 6000 c501 (Roche Diagnostics GmbH, Mannheim, Germany) at University Hospital Centre Zagreb. Urine aliquots (1 mL) were wrapped in aluminium foil, transported to the other two laboratories on the same day and analysed within 4 hours. Urinary total protein was measured with the manufacturers' original reagents, using a photometric dye-binding pyrogallol red molybdate assay on the AU400 and a turbidimetric benzethonium chloride method on the Cobas 6000 c501 and Architect plus c4000. Glucose was measured by the hexokinase method on all three analysers, with original reagents. Systems were monitored daily using commercial internal quality control (IQC) materials: AU400 (Liquichek urine chemistry control, Bio-Rad Laboratories Inc., Hercules, USA, LOT: 66781 and 66782), Architect plus c4000 (Multichem U, Technopath, New York, USA, LOT: 23110161 and 23109162) and Cobas 6000 c501 (Liquichek urine chemistry control, Bio-Rad Laboratories Inc., Hercules, USA, LOT: 66771 and 66752). Analysers were recalibrated whenever IQC results were out of range.
Because no reference method for urinary total protein measurement has been recommended, and because the two methods differ considerably, dipstick protein results were compared separately with the quantitative results obtained by each method (pyrogallol red molybdate and benzethonium chloride) (9). Dipstick glucose results were compared with the mean value obtained on all three chemistry analysers.
Day-to-day precision of glucose and total protein in urine samples
For each analyser included in this study, day-to-day precision was evaluated by measuring two levels of control material (Liquichek urine chemistry control, Bio-Rad Laboratories Inc. and Multichem U, Technopath) over 20 days. Performance criteria for day-to-day precision, expressed as the coefficient of variation (CV, %), were set in accordance with the Reference Institute for Bioanalytics (RfB): 19.73% and 10.13% for protein (at concentrations of 0.15 and 0.97 g/L) and 10.94% and 7.81% for glucose (at concentrations of 1.2 and 11 mmol/L).
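The day-to-day precision criterion reduces to comparing a coefficient of variation with the RfB limits. A minimal Python sketch, assuming the CV is computed as the sample standard deviation divided by the mean of the daily control results; the helper names `cv_percent` and `meets_rfb` and the control values in the usage lines are hypothetical, not study data.

```python
import statistics

# RfB day-to-day CV limits (%) quoted in the text, keyed by analyte and
# control concentration (protein in g/L, glucose in mmol/L)
RFB_CV_LIMITS = {
    ("protein", 0.15): 19.73,
    ("protein", 0.97): 10.13,
    ("glucose", 1.2): 10.94,
    ("glucose", 11.0): 7.81,
}


def cv_percent(values):
    """Coefficient of variation (%) = sample SD / mean * 100."""
    if len(values) < 2:
        raise ValueError("at least two results are needed")
    return statistics.stdev(values) / statistics.mean(values) * 100


def meets_rfb(values, analyte, level):
    """True if the CV does not exceed the RfB limit for the given level."""
    return cv_percent(values) <= RFB_CV_LIMITS[(analyte, level)]


# Hypothetical daily results for a low-level glucose control (not study data)
low_glucose = [1.45, 1.50, 1.43, 1.48, 1.52, 1.47, 1.44, 1.49, 1.51, 1.46]
print(f"CV = {cv_percent(low_glucose):.2f}%", meets_rfb(low_glucose, "glucose", 1.2))
```

In practice each control material would be matched to the nearest RfB concentration level before the limit is looked up; that matching step is left out here.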
Statistical analysis
Agreement between each dipstick and the reference dipstick was assessed with the weighted kappa test and expressed as Cohen's kappa (κ). The most commonly used brand in Croatia in 2017, according to data from our national EQA provider, served as the reference. A kappa value of ≥ 0.80 was considered acceptable (10). Because the number of colour fields for each parameter differed between dipstick brands, the observers merged some categories (those with few observations) for the agreement analysis, and results were classified into four categories (neg/norm (N), 1+, 2+, 3+). Each category contained at least 10 samples.
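Weighted kappa on the four merged categories can be sketched as follows. The text does not state which weighting scheme was used, so this sketch assumes linear agreement weights by default and offers quadratic weights as an option; MedCalc may differ in details. The function and variable names are illustrative, and the ratings in the usage lines are hypothetical.

```python
# The four merged ordinal categories used for the agreement analysis
CATEGORIES = ["N", "1+", "2+", "3+"]
ACCEPTABLE_KAPPA = 0.80  # acceptance threshold stated in the text


def weighted_kappa(reference, test, categories=CATEGORIES, weighting="linear"):
    """Weighted Cohen's kappa between two ordinal rating series.

    Agreement weights: w = 1 - |i - j| / (k - 1) for "linear",
    w = 1 - ((i - j) / (k - 1)) ** 2 for "quadratic".
    """
    if len(reference) != len(test) or not reference:
        raise ValueError("need two non-empty rating lists of equal length")
    if weighting not in ("linear", "quadratic"):
        raise ValueError("weighting must be 'linear' or 'quadratic'")
    k = len(categories)
    pos = {c: i for i, c in enumerate(categories)}
    n = len(reference)

    # Joint proportions of (reference category, test category)
    joint = [[0.0] * k for _ in range(k)]
    for r, t in zip(reference, test):
        joint[pos[r]][pos[t]] += 1 / n
    row = [sum(joint[i]) for i in range(k)]
    col = [sum(joint[i][j] for i in range(k)) for j in range(k)]

    def weight(i, j):
        d = abs(i - j) / (k - 1)
        return 1 - d if weighting == "linear" else 1 - d ** 2

    # Weighted observed and chance-expected agreement
    p_obs = sum(weight(i, j) * joint[i][j] for i in range(k) for j in range(k))
    p_exp = sum(weight(i, j) * row[i] * col[j] for i in range(k) for j in range(k))
    if p_exp == 1:
        raise ValueError("kappa is undefined when expected agreement is 1")
    return (p_obs - p_exp) / (1 - p_exp)


# Hypothetical readings for the reference dipstick and one other brand
reference = ["N", "N", "1+", "2+", "3+", "1+", "N", "2+"]
brand_x = ["N", "1+", "1+", "2+", "2+", "1+", "N", "2+"]
kappa = weighted_kappa(reference, brand_x)
print(f"kappa = {kappa:.2f}, acceptable: {kappa >= ACCEPTABLE_KAPPA}")
```

On this four-point scale, linear weights give a one-category disagreement a partial credit of 2/3 while quadratic weights give 8/9, so the choice of scheme can shift κ relative to the 0.80 threshold.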
Dipstick brands that did not have concentrations assigned to their categories were excluded from the comparability analysis for the affected parameter: ChoiceLine 10 (Roche), Combur 10 Test UX (Roche), ComboStik 10M (DFI Co., Ltd.), ComboStik 11M (DFI Co., Ltd.), Combina 10M (Human) and Multistix 10SG (Siemens) for bilirubin, and UriGnost 11 (BioGnost Ltd.) for erythrocytes.
Analytical accuracy of urine dipsticks for glucose and total protein was assessed by comparing dipstick readings with the quantitative results from the chemistry analysers, which were taken as the true value. Glucose and total protein concentrations were assigned to categories as follows: total protein, N = 0–0.29 g/L, 1+ = 0.30–0.99 g/L, 2+ = 1.00–2.99 g/L, 3+ = 3.00 g/L or more; glucose, N = 0–2.79 mmol/L, 1+ = 2.80–8.29 mmol/L, 2+ = 8.30–27.99 mmol/L, 3+ = 28.00 mmol/L or more. Categories obtained by dipstick and quantitative testing were compared, and the numbers of true positive, true negative, false positive and false negative findings were determined. From these, analytical sensitivity and specificity were calculated for each dipstick brand. Dipsticks with sensitivity and specificity ≥ 90% were considered excellent, those with ≥ 80% satisfactory, and the others (< 80%) of less than acceptable quality. The acceptance criterion for repeatability was agreement in at least 90% (18/20) of repeated measurements.
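The categorisation and accuracy calculations can be sketched in Python. The category limits and grading thresholds are taken from the text, and the glucose reference uses the mean of the three analysers, as described for glucose. The text does not define exactly what counts as a positive finding, so this sketch assumes a binary split in which any category above N is positive; a category-by-category comparison would be a stricter alternative. All function names are hypothetical.

```python
from bisect import bisect_right
from statistics import mean

LABELS = ["N", "1+", "2+", "3+"]
# Lower limits of the 1+, 2+ and 3+ categories, as listed in the text
PROTEIN_CUTS = [0.30, 1.00, 3.00]   # g/L
GLUCOSE_CUTS = [2.80, 8.30, 28.00]  # mmol/L


def categorize(value, cuts):
    """Place a quantitative result on the N/1+/2+/3+ scale."""
    return LABELS[bisect_right(cuts, value)]


def glucose_reference(au400, architect, cobas):
    """Glucose reference category from the mean of the three analysers."""
    return categorize(mean([au400, architect, cobas]), GLUCOSE_CUTS)


def sensitivity_specificity(dipstick, reference):
    """Sensitivity and specificity; assumption: any category above N is positive."""
    tp = fp = tn = fn = 0
    for d, r in zip(dipstick, reference):
        d_pos, r_pos = d != "N", r != "N"
        if d_pos and r_pos:
            tp += 1
        elif d_pos:
            fp += 1
        elif r_pos:
            fn += 1
        else:
            tn += 1
    sens = tp / (tp + fn) if tp + fn else float("nan")
    spec = tn / (tn + fp) if tn + fp else float("nan")
    return sens, spec


def grade(sens, spec):
    """Quality grade using the thresholds stated in the text."""
    if sens >= 0.90 and spec >= 0.90:
        return "excellent"
    if sens >= 0.80 and spec >= 0.80:
        return "satisfactory"
    return "less than acceptable"


# Hypothetical protein results (g/L) and dipstick readings (not study data)
protein_ref = [categorize(v, PROTEIN_CUTS) for v in [0.12, 0.45, 1.60, 3.40, 0.25]]
protein_dip = ["N", "1+", "1+", "3+", "1+"]
sens, spec = sensitivity_specificity(protein_dip, protein_ref)
print(protein_ref, sens, spec, grade(sens, spec))
```

Using `bisect_right` places a value equal to a lower limit (e.g. 0.30 g/L) in the higher category, matching the category boundaries listed in the text.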
Data were analysed using MedCalc statistical software (MedCalc Software, Ostend, Belgium).
Results
Dipsticks comparability and repeatability
Combur 10 Test M (Roche) was chosen as the reference because it was the most commonly used dipstick brand in Croatia in 2017 according to the national EQA provider (44/174, 25%). Levels of agreement between each dipstick and the reference for each parameter, expressed as κ, are shown in Table 2. Combur 10 Test UX (Roche) showed the best agreement with the reference dipstick (κ > 0.80) for all assessed parameters. The lowest agreement with the reference was found for Combina 13 (Human), particularly for bilirubin, urobilinogen, pH and SG (κ < 0.46).
The best overall comparability (κ > 0.80) was achieved for glucose and nitrite (11/11 brands) and total protein (10/11 brands). Moderate agreement (κ = 0.60–0.79) was observed for erythrocytes (9/10 brands) and leukocytes (9/11 brands). The lowest kappa values overall were obtained for bilirubin: agreement was weak (κ = 0.44–0.54) for 3/5 brands and minimal to none (κ = 0.16–0.33) for the other two.
Repeatability was assessed on 20 replicates of each dipstick brand (Table 3). Repeatability for at least one parameter was < 90% for 6/12 dipstick brands. The most problematic parameter was pH, for which as many as three dipstick brands had < 90% repeatability: ChoiceLine 10 (Roche), CombiScreen 10SL (Analyticon) and Combina 13 (Human).
Day-to-day precision of glucose and total protein in urine samples
Day-to-day precision (CV, %) for total protein measurement on all three analysers ranged from 1.90% to 3.90% at lower concentrations (0.18–0.27 g/L) and from 1.10% to 2.88% at higher concentrations (0.62–1.26 g/L). For urinary glucose, CVs were 1.60–3.29% at lower concentrations (1.43–1.89 mmol/L) and 1.21–1.71% at higher concentrations (16.28–20.40 mmol/L) of control materials on all three analysers.
Analytical accuracy: comparison of dipstick and quantitative measurement
Analytical sensitivity and specificity of each dipstick for urinary glucose are presented in Table 4. While sensitivity for glucose was > 90% for 5/12 dipstick brands, their specificity was modest (71–83%). Only three dipstick brands, Combina 13 (Human), UriGnost 11 (BioGnost Ltd.) and Multistix 10SG (Siemens), detected glucose with high specificity (> 90%), but their sensitivity was much lower and their false negative rates higher.
Analytical accuracy for urinary proteins is presented separately for each method (pyrogallol red molybdate and benzethonium chloride) in Table 5. With respect to the pyrogallol red molybdate assay (AU400, Beckman Coulter), none of the twelve dipsticks detected proteins with an analytical sensitivity or specificity > 80%. Sensitivity was highest (75%) for Combina 11S (Human), but this brand had the lowest specificity (only 45%). Specificity was highest (75%) for Combur 10 Test M (Roche), but its sensitivity was average (70%). Combina 13 (Human) had the lowest sensitivity for proteins (41%) and the highest false negative rate. The specificity of the other dipsticks ranged from 63% to 74%.
With respect to the turbidimetric benzethonium chloride method, Combina 10M (Human) had the highest analytical sensitivity (92%), and several other dipsticks achieved sensitivity > 80%. However, the analytical specificity of these dipsticks ranged from 41% to 72%. Combina 11S (Human) had the lowest specificity for proteins (42%) and the highest false positive rate (24/75). The lowest sensitivity (56%) was observed for Combina 13 (Human), which also had the highest false negative rate (15/75) and only average specificity (71%).
Discussion
In this study, we performed a comprehensive analytical verification of the 12 most commonly used dipsticks in Croatia. Our results show that these dipsticks are not sufficiently comparable and that their analytical performance varies. Agreement between the dipsticks was acceptable for nitrites, proteins and glucose, but there was marked diversity for other parameters such as bilirubin, urobilinogen, pH and specific gravity. The most important clinically relevant finding was that most of the dipsticks did not accurately detect glucose and proteins.
As previously described in the literature, quantitative methods for urinary proteins are not mutually comparable, and none of the available methods is considered a “gold standard” (9). In our study, dipsticks agreed better with the turbidimetric method for total urinary protein. With respect to the pyrogallol red molybdate assay, none of the dipsticks showed acceptable accuracy for total urinary protein. With respect to the turbidimetric benzethonium chloride method, on the other hand, seven out of twelve dipsticks showed satisfactory sensitivity but lacked adequate specificity for urinary proteins. Consistent with these observations, the reference intervals for total urinary protein excretion recommended by the European Urinalysis Group are higher for the pyrogallol red molybdate assay (< 180 mg/day) than for turbidimetric methods (< 75 mg/day) (11).
In general, our results demonstrate that dipsticks have unacceptably high false negative rates and even higher false positive rates for total protein. These findings are in line with several previous studies that have also confirmed the suboptimal accuracy of qualitative urine dipstick analysis for total urinary protein (4, 12). Our findings also point to the low accuracy of urine dipstick analysis for glucose: only four dipstick brands achieved both sensitivity and specificity above 80%, in line with some earlier observations (13). In view of this limitation, the International Diabetes Federation suggests using glucose dipstick testing only in low-resource settings where other glucose tests are not affordable (14). Substantial improvement in the accuracy of dipsticks for protein and glucose is clearly warranted.
Whereas the level of agreement between the dipsticks in our study was acceptable for nitrites, it was less than acceptable for erythrocytes and leukocytes. Given the wide range of dipstick brands available in Croatia, and probably worldwide, such lack of agreement between manufacturers creates the opportunity for patient misclassification in conditions where nitrites, erythrocytes and leukocytes are of diagnostic relevance (e.g. urinary tract infections). Moreover, at least for some manufacturers, low reproducibility for leukocytes might be an additional issue. Urine dipstick testing (especially the combination of leukocytes, blood and nitrites) has been proposed as a first step in diagnosing urinary tract infection (UTI) (15, 16). The National Institute for Health and Care Excellence (NICE) guidelines recommend using dipsticks as a screening tool, on the assumption that UTI can be safely ruled out in asymptomatic patients when both leukocyte esterase and nitrite are negative (17). While this may hold for some dipsticks, others may not be as accurate. Therefore, unless improvements are made in this respect, users of some dipstick brands can be expected to remain less than acceptably able to detect UTI. This is even more worrying given that positive leukocytes in extravascular fluids such as ascites and synovial fluid have recently been proposed as a useful indicator of conditions such as spontaneous bacterial peritonitis and periprosthetic joint infection, respectively (18-22).
Low agreement between dipsticks is also an issue in other health conditions where erythrocytes alone are used in the diagnostic process. For example, dipstick blood testing is often used in regular bladder cancer check-ups. NICE guidelines state that asymptomatic microhaematuria may be an early sign of bladder cancer in people aged 60 and older, but do not specify whether dipsticks or microscopy should be used to assess it (23). Moreover, the American Urological Association recommends that a positive dipstick blood result with a negative sediment count be followed by three additional microscopic sediment examinations; if at least one of these is positive, further evaluation and treatment decisions should follow (24). It appears that these guidelines and recommendations do not take into account the low accuracy of dipstick testing for erythrocytes (haematuria) or the low agreement between manufacturers, and they may thus lead to either over- or underestimation of haematuria, which may seriously jeopardize patient safety. Because of the unacceptably high false negative rate, a negative dipstick test cannot rule out disease in symptomatic patients. A false positive dipstick result for haematuria can also lead to additional microscopic sediment examinations, further urological examinations and unnecessary testing such as imaging or cystoscopy (25). Hence, a high false positive rate for erythrocytes may also substantially increase laboratory workload and healthcare costs. For these reasons, it is essential that manufacturers improve the ability of dipsticks to accurately detect erythrocytes in urine. Otherwise, the diagnostic value of dipstick blood testing should be considered limited or even questionable.
In our study of the 12 most common dipsticks in Croatia, kappa values for bilirubin, urobilinogen, pH and specific gravity were widely heterogeneous, pointing to low comparability of the results obtained with different dipstick brands. In addition, some dipsticks had unacceptable repeatability for pH. Previous reports have also demonstrated unacceptable precision and accuracy of dipstick pH compared with the gold standard, the pH meter (26). It has also been reported that dipstick accuracy varies with the proportions and combinations of reagents (such as methyl red and bromthymol blue) in the pH fields of different manufacturers (27). Previous studies described specific gravity as a useful additional parameter that increases the accuracy of proteinuria detection, on the assumption that concentrated urine is more likely to produce a positive protein field on the dipstick (28). Hillege disputed this, reporting that this algorithm yields no significant gain in diagnostic accuracy (29). Furthermore, earlier studies are inconsistent regarding the use of specific gravity to evaluate the degree of dehydration and optimal urine output in patients with nephrolithiasis (30). Although urinary bilirubin and urobilinogen indicate several liver conditions, such as hepatocellular disease, biliary obstruction and cholestatic jaundice, liver diseases are diagnosed on the basis of clinical examination, obvious signs such as yellow discoloration of the skin and eyes, imaging studies and blood liver tests. Therefore, bilirubin and urobilinogen dipstick tests have no real diagnostic value (11). Given the low analytical quality and limited clinical utility of these parameters, it is reasonable to question whether they are needed at all.
Our study has some potential limitations. We assessed the agreement of the 12 most common dipstick brands by comparing them with the single most common brand in Croatia; agreement might have been different had another manufacturer been chosen as the reference. We also analysed dipstick repeatability using a different urine sample for each dipstick brand, because it was logistically challenging to obtain enough urine to perform all testing on the same sample. We acknowledge this as a limitation and a potential source of bias due to matrix effects. Furthermore, only pathological samples were chosen for this part of the study, so endogenous and exogenous interferences may also have affected our results. Finally, we assessed accuracy only for glucose and proteins. It would be beneficial to also evaluate accuracy for other parameters, such as leukocytes, erythrocytes and nitrites, by comparison with urine sediment microscopy and microbiological testing; however, owing to local challenges and operational difficulties, we were not able to perform such an analysis in this study.
In summary, the 12 most commonly used dipsticks in Croatia showed a low level of agreement with each other. Dipstick accuracy and precision varied considerably between manufacturers, and most dipsticks did not accurately detect glucose and proteins. Given the wide range of dipstick brands available in Croatia, and possibly worldwide, these issues create the opportunity for patient misclassification, jeopardize patient safety and increase healthcare costs. Improvement in this respect (i.e. standardization among manufacturers and better dipstick quality) is needed to minimize patient risk. We believe that, although our study addresses the situation in Croatia, it is also relevant to other countries in Europe and beyond.