Spider | Description |
Notes spider | The Notes spider is fully multilingual-enabled thanks to LMBCS (the Lotus Multi-Byte Character Set). Attachments stored in Lotus Notes documents are processed by the file filter module, which extracts the text, strips nonsignificant data (font attributes, images, and so on) from the attachment, and passes the result on for XML encoding. To get optimal fidelity from documents stored as attachments, consider upgrading their format to a file type that supports correct file filtering to Unicode, such as Lotus Notes. Check the Discovery Server release notes for a list of issues with specific file types. (The Notes spider is also used for QuickPlace, Domino.Doc, and Notes email.) |
File System spider | The File System spider relies heavily on the file filter, so the same considerations apply as for the Notes spider. HTML is the only file type that does not go through the file filter; HTML files are processed by Domino's HTML parser module instead. |
Web spider | The Web spider uses Domino's HTML parser, which generally works well. If you only spider Web sites whose encoding matches the code page of the server's operating system, you are unlikely to have any multilingual or code page issues with the Web spider. However, many Web sites do not declare their encoding properly according to W3C standards. If you run into this, one workaround is to configure the "preferred codepage list" in Domino. For configuring Domino’s MIME preference, see the LDD Today article "Worldwide messaging: Using International MIME in R5." |
XML spider | The Discovery Server XML spider uses the Xerces/Xalan XML parsing engine, so the character sets it supports are limited to those the engine supports. The list of encodings that Xerces/Xalan supports is fairly thorough and covers all major character sets. However, when you generate XML data, we recommend that you always use one of the Unicode encoding schemes, namely UTF-8 or UTF-16BE, so that any conforming XML parser can read your XML correctly. |
Exchange spider | The Exchange spider processes attachments in the same way as the Notes spider. The document text is retrieved from Microsoft Exchange using the Messaging Application Programming Interface (MAPI) and is returned in Unicode format. |
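
The recommendation in the XML spider row, to generate XML in a Unicode encoding, can be illustrated with a minimal sketch. It uses Python's standard library rather than anything from Discovery Server, and the element names (`document`, `title`, `body`) and the function name `build_document_xml` are hypothetical, chosen only for illustration. The point is that the output is serialized as UTF-8 and carries an explicit encoding declaration, so a parser does not have to guess the code page.

```python
import xml.etree.ElementTree as ET


def build_document_xml(title: str, body: str) -> bytes:
    """Serialize a small document as UTF-8 XML with an explicit declaration.

    The element names are illustrative assumptions, not a Discovery Server schema.
    """
    root = ET.Element("document")
    ET.SubElement(root, "title").text = title
    ET.SubElement(root, "body").text = body
    # encoding="utf-8" makes tostring() return UTF-8 bytes;
    # xml_declaration=True (Python 3.8+) writes the <?xml ...?> header
    # that names the encoding.
    return ET.tostring(root, encoding="utf-8", xml_declaration=True)


if __name__ == "__main__":
    data = build_document_xml("日本語のタイトル", "Grüße aus Zürich")
    # Write the bytes unchanged; reopening in text mode with another
    # code page would defeat the purpose.
    with open("sample.xml", "wb") as handle:
        handle.write(data)
```

Because the function returns bytes rather than a string, the encoding is fixed at serialization time and cannot be altered silently by a later text-mode write that uses the operating system's default code page.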