How to use UTF-8, UTF-8 with BOM marker, XML and Java iostreams together

UTF_BOM FAQ
www Escapes
Wikipedia UTF-8
How to get ääkköset (the letters ä, ö, å) working in a servlet (in Finnish)

Use UTF-8 for your HTML files
You should use UTF-8 for all your HTML files; it just makes life easier. There are two things to keep in mind; see the example HTML below. If you follow these simple rules, your site's readers should not have problems displaying text.


<html>
<head>
  <title>Page Title</title>
  <meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
  <meta name="keywords" content="some,fine,keywords" />
</head>
<body>
your html content goes here....
</body>
</html>

XML
You should put a BOM marker at the start of text files if possible. To be even safer, add an XML header row that specifies the encoding used within the document.
    <?xml version="1.0" encoding="UTF-8"?>

Windows Notepad (Win2k, XP) can save files with a BOM marker. Switch to another text editor if your favourite one cannot cope with standard BOM markers.

Windows WordPad (Win2k, XP) can't save files using the UTF-8 charset.

Here is a small example xml document.

<?xml version="1.0" encoding="UTF-8"?>
<note>
   <body1>J&#228;ttil&#228;inen meni keittiöön 
  ja kaatoi kaikki kattilat.    Hiiri
            meni puutarhaan
   ja söi kaikki puut.
   </body1>
   <body2>char entities: &lt; &gt; &amp; &quot; &apos;</body2>
   <body3>safe xml chars: \O/</body3>
   <body4>Decimal Numeric Character Reference: &#196; &#8364;</body4>
   <body5>Hex Numeric Character Reference: &#x00C4; &#x20AC;</body5>
</note>
You should see this after unescaping the document.

   Jättiläinen meni keittiöön 
  ja kaatoi kaikki kattilat.    Hiiri
            meni puutarhaan
   ja söi kaikki puut.

   char entities: < > & " '
   safe xml chars: \O/
   Decimal Numeric Character Reference: Ä €
   Hex Numeric Character Reference: Ä €
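
To check the unescaping yourself, here is a minimal sketch that runs a shortened copy of the note through the JDK's built-in DOM parser and reads the element text back. The class name UnescapeDemo, the SAMPLE constant and the textOf helper are names I made up for this illustration; they are not part of the test application.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

public class UnescapeDemo {
    // Shortened copy of the example note (hypothetical constant, for illustration only).
    static final String SAMPLE =
          "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n"
        + "<note>"
        + "<body2>char entities: &lt; &gt; &amp; &quot; &apos;</body2>"
        + "<body4>Decimal Numeric Character Reference: &#196; &#8364;</body4>"
        + "<body5>Hex Numeric Character Reference: &#x00C4; &#x20AC;</body5>"
        + "</note>";

    // Parses the XML text as UTF-8 bytes and returns the unescaped text
    // of the first element with the given tag name.
    static String textOf(String xml, String tag) {
        try {
            DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(
                new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            return doc.getElementsByTagName(tag).item(0).getTextContent();
        } catch (Exception e) {
            // Wrapped so the sketch can be called without checked exceptions.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // What the console shows depends on the console's own encoding.
        System.out.println(textOf(SAMPLE, "body2"));
        System.out.println(textOf(SAMPLE, "body4"));
        System.out.println(textOf(SAMPLE, "body5"));
    }
}
```

The decimal and hex references decode to the same two characters, so body4 and body5 differ only in their label text.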

Java BOM recognition
UnicodeReader class
JDK bug 4508058

Java's default I/O readers do not recognize all BOM markers. This is said to be fixed in JDK 6, but I haven't tested it yet. You can use the UnicodeReader class to work around the problem and auto-recognize BOM markers. It behaves transparently on top of the underlying input stream.
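
To show the general idea behind this kind of auto-recognition, here is a minimal sketch. It is not the UnicodeReader source: the class name BomSniffer and its methods are hypothetical, and it only handles the UTF-8 and UTF-16 markers (UTF-32 is left out to keep it short). It peeks at the first bytes with a PushbackInputStream, picks a charset if a BOM is found, and pushes back any bytes that were not part of the BOM.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.PushbackInputStream;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class BomSniffer {
    // Returns a reader that uses the charset named by a leading BOM,
    // or the given fallback charset when no BOM is present.
    static Reader open(InputStream in, Charset fallback) throws IOException {
        PushbackInputStream pin = new PushbackInputStream(in, 3);
        byte[] head = new byte[3];
        int n = 0;
        while (n < head.length) {            // read() may return fewer bytes than asked
            int r = pin.read(head, n, head.length - n);
            if (r < 0) break;
            n += r;
        }
        Charset cs = fallback;
        int bomLen = 0;
        if (n >= 3 && head[0] == (byte) 0xEF && head[1] == (byte) 0xBB
                   && head[2] == (byte) 0xBF) {
            cs = StandardCharsets.UTF_8;    bomLen = 3;
        } else if (n >= 2 && head[0] == (byte) 0xFE && head[1] == (byte) 0xFF) {
            cs = StandardCharsets.UTF_16BE; bomLen = 2;
        } else if (n >= 2 && head[0] == (byte) 0xFF && head[1] == (byte) 0xFE) {
            cs = StandardCharsets.UTF_16LE; bomLen = 2;
        }
        // Give back everything that was read but is not part of the BOM.
        if (n > bomLen) pin.unread(head, bomLen, n - bomLen);
        return new InputStreamReader(pin, cs);
    }

    // Convenience helper for trying the sketch on in-memory bytes.
    static String decode(byte[] data, Charset fallback) {
        try (Reader r = open(new ByteArrayInputStream(data), fallback)) {
            StringBuilder sb = new StringBuilder();
            char[] buf = new char[1024];
            int read;
            while ((read = r.read(buf)) != -1) sb.append(buf, 0, read);
            return sb.toString();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] withBom = { (byte) 0xEF, (byte) 0xBB, (byte) 0xBF, 'o', 'k' };
        System.out.println(decode(withBom, StandardCharsets.ISO_8859_1)); // prints "ok"
    }
}
```

The pushback buffer only needs to be as large as the longest BOM checked, which is three bytes here.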

Example code using UnicodeReader class
Here is an example method that reads a text file. It recognizes the BOM marker and skips it while reading.


   // Needs: java.io.BufferedReader, java.io.CharArrayWriter,
   //        java.io.FileInputStream, java.io.IOException,
   //        and the UnicodeReader class linked above.
   public static char[] loadFile(String file) throws IOException {
      // Read a text file; auto-recognize the BOM marker, or use the
      // system default encoding (null) if no marker is found.
      char[] buffer = new char[16 * 1024];   // 16k buffer
      // try-with-resources closes everything in reverse order, even on errors
      try (FileInputStream in = new FileInputStream(file);
           BufferedReader reader = new BufferedReader(new UnicodeReader(in, null));
           CharArrayWriter writer = new CharArrayWriter()) {
         int read;
         while ((read = reader.read(buffer)) != -1) {
            writer.write(buffer, 0, read);
         }
         return writer.toCharArray();
      }
   }

Example code to write UTF-8 with bom marker
Write the BOM marker bytes at the start of an empty file, and all proper text editors will have no problem choosing the correct charset when reading it. Java's OutputStreamWriter does not write the UTF-8 BOM marker bytes itself.


   // Needs: java.io.BufferedWriter, java.io.File, java.io.FileOutputStream,
   //        java.io.IOException, java.io.OutputStreamWriter,
   //        java.nio.charset.StandardCharsets
   public static void saveFile(String file, String data, boolean append) throws IOException {
      File f = new File(file);
      try (FileOutputStream fos = new FileOutputStream(f, append);
           BufferedWriter bw = new BufferedWriter(
                 new OutputStreamWriter(fos, StandardCharsets.UTF_8))) {
         // write UTF-8 BOM mark if the file is empty
         // (a new file, or an old one truncated because append is false)
         if (f.length() < 1) {
            final byte[] bom = new byte[] { (byte)0xEF, (byte)0xBB, (byte)0xBF };
            fos.write(bom);
         }
         if (data != null) bw.write(data);
      }  // closing bw flushes it; a failed flush is no longer silently swallowed
   }
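
If you want to see for yourself that OutputStreamWriter leaves the UTF-8 BOM out, this small in-memory check may help. It is only a sketch; the class name BomWriteCheck and the encode helper are made up for this illustration.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.StandardCharsets;

public class BomWriteCheck {
    static final byte[] UTF8_BOM = { (byte) 0xEF, (byte) 0xBB, (byte) 0xBF };

    // Encodes text as UTF-8, optionally writing the BOM bytes by hand first.
    static byte[] encode(String text, boolean withBom) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try {
            if (withBom) out.write(UTF8_BOM);
            try (Writer w = new OutputStreamWriter(out, StandardCharsets.UTF_8)) {
                w.write(text);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        System.out.println(encode("A", false).length); // prints 1: no BOM was added
        System.out.println(encode("A", true).length);  // prints 4: 3 BOM bytes + 'A'
    }
}
```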

XML Test Application, Config Test Application
An example application, with full sources, that uses the UnicodeReader class. It reads various Unicode XML text files and writes the values to a UTF-8 text file with a BOM. The application uses the UnicodeReader class to auto-recognize Unicode BOM markers.
TestXML = read and write xml file
TestConfig = read and write properties file
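
For the properties case, here is a rough sketch of why the BOM matters. It is not the TestConfig source; the class name BomProperties and the load helper are hypothetical. Java's plain UTF-8 decoding hands the BOM over as the character U+FEFF, which would otherwise end up glued to the first key, so the sketch drops that one character before calling Properties.load(Reader) (available since Java 6).

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.PushbackReader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Properties;

public class BomProperties {
    // Loads properties from UTF-8 bytes, skipping a leading BOM if there is one.
    static Properties load(byte[] utf8Bytes) {
        try (PushbackReader reader = new PushbackReader(
                 new InputStreamReader(new ByteArrayInputStream(utf8Bytes),
                                       StandardCharsets.UTF_8), 1)) {
            int first = reader.read();
            // Keep the first character unless it is the BOM (U+FEFF).
            if (first != -1 && first != 0xFEFF) reader.unread(first);
            Properties props = new Properties();
            props.load(reader);
            return props;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] text = "name=J\u00E4ttil\u00E4inen\n".getBytes(StandardCharsets.UTF_8);
        byte[] file = new byte[text.length + 3];
        file[0] = (byte) 0xEF; file[1] = (byte) 0xBB; file[2] = (byte) 0xBF;
        System.arraycopy(text, 0, file, 3, text.length);
        // The key is found because the BOM was skipped before parsing.
        System.out.println(load(file).getProperty("name") != null); // prints true
    }
}
```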

javaXMLTest.zip
Reference image of xml file output
Html test page


Run the test application, then open the data.txt.rtf file in WordPad or any text editor that can use Unicode TrueType/OpenType fonts. I have found the Arial Unicode MS font to be very good. The file is just a plain text file even though it has an .rtf suffix. You can also open it in Notepad, but by default it might not show all characters properly. You can still save the file from Notepad; just do not edit the characters that are shown as unknown black boxes.