Searching legacy data using the Discovery Server XML spider

Element name	Description	Mandatory/optional	Can be empty?	Multiple entries?
DOCUMENT	The root element that contains an <IDENTIFIER> element followed by either a <DELETE/> tag or a set of metadata elements.	Mandatory	No	No
IDENTIFIER	Contains a document ID, a unique alphanumeric string that identifies the document. This tag has to be present for each document and appears only once. It cannot be empty. The <HTTPURL> or <NATIVEURL> elements may be good candidates for the <IDENTIFIER>. Note: A semicolon is not a valid character for the <IDENTIFIER>.	Mandatory	No	No
AUTHOR	Contains the name of the person who created the document. If the author is unknown, the tag can be left empty. Each <AUTHOR> element can contain only one name, but multiple occurrences of the element are allowed. For example, if the authors are Paul White and Jake Black, the <DOCUMENT> element contains: <AUTHOR>Paul White</AUTHOR> <AUTHOR>Jake Black</AUTHOR>. Authors are used to generate affinities.	Mandatory	Yes	Yes
UPDATEDBY	Contains the name of the person who last modified or saved the document. This tag follows the same rules as <AUTHOR>. In addition, it cannot be empty when the <AUTHOR> tag is filled. If the people who modified the document are unknown, copy the authors into <UPDATEDBY> elements.	Mandatory	Yes	Yes
CREATED	Contains the date/time when the document was first created. All dates/times must be in GMT; the XML spider performs no conversion. The date/time should be presented in the YYYY-MM-DD-HH.MM.SS format, for example, 1999-03-30-21.19.26. Each <DOCUMENT> has to contain exactly one <CREATED> tag, unless the document is being deleted. This element cannot be empty. If the creation date is unknown, default to the current time.	Mandatory	No	No
LASTREAD	Contains the date/time when the document was last accessed. This tag follows the same rules as the <CREATED> element. If the time of the last access is unknown, copy the <CREATED> time into this tag.	Mandatory	No	No
MODIFIED	Contains the date/time when the document was last modified. This tag follows the same rules as the <CREATED> element. If the time of the last modification is unknown, copy the <CREATED> time into this tag.	Mandatory	No	No
REVISIONS	Contains a set of <REVISION> elements. Only one <REVISIONS> per document is permitted.	Mandatory	No	No
REVISION	This child element of <REVISIONS> contains a date/time when the document was modified. This tag follows the same rules as the <CREATED> element, but multiple <REVISION> elements are permitted. If revision information is unknown, copy the <CREATED> time into this tag.	Mandatory	No	Yes
FIELD	Optional element. Reserved for future use.	Optional	Yes	Yes
TITLE	Contains the title of the document. Same requirements as for the <SUMMARY> element below. If omitted, Discovery Server attempts to generate a title using document subject, file name derived from the <HTTPURL>/<NATIVEURL> elements, or the first line of the document body. If unsuccessful, Discovery Server defaults to "[Untitled]."	Optional	No	No
SUBJECT	Contains the subject of the document. Same requirements as for the <SUMMARY> element below. If subject is omitted, Discovery Server attempts to generate one.	Optional	No	No
SUMMARY	Provides a short description of the document. If present, the element should not be empty. Only one <SUMMARY> per document is permitted. If this tag is omitted, Discovery Server tries to create a summary, provided that the document is written in a supported language. See the list of supported languages later in this article. The summary should be 256 characters or shorter.	Optional	No	No
KEYWORDS	Contains a set of <KEYWORD> elements. This element is optional, but if present, it should contain at least one <KEYWORD>. Only one <KEYWORDS> element per document is allowed. All keywords combined should be MAX_LEN characters or shorter, where MAX_LEN is computed according to the formula: MAX_LEN = 256 - (number of <KEYWORD> elements).	Optional	Yes	No
KEYWORD	A child element of <KEYWORDS>. Same requirements as for <SUMMARY>, except that multiple <KEYWORD> elements are allowed.	Optional	No	Yes
BODY	Contains the document body. This tag can be empty. Only one <BODY> element per document is allowed.	Mandatory	Yes	No
APPLICATION	Provides the name of the application that created the document. If present, this tag should not be empty. Note: This setting does not affect how the K-map and the K-map Editor display the document. Non-Domino files are viewed in the application registered by the operating system for the document's file extension.	Mandatory	No	No
LANGUAGE	Contains the default language of the document represented in the two-letter ISO 639 language abbreviation followed by an optional ISO 3166 country code. In the ISO 639/ISO 3166 convention, language names are written in lowercase, while country codes are written in uppercase, for example, en-US. If this tag is empty, Discovery Server guesses the correct language by examining the document body. This information is used to create a document summary and a list of keywords. Only one <LANGUAGE> element per document is allowed. In addition to the two-letter ISO 639 language abbreviations, use bk for Bokmal and ny for Nynorsk.	Mandatory	Yes	No
NATIVEURL	Contains the Notes or File URL of the document, for example: NOTES://epr.acme.com/Wsj.nsf/0/00E5CFF068B33B1A852569760021DDDF?Open or file:////epracmepr/fdrive/Testfiles/fsspr/93sales.xls. <NATIVEURL> is used by the K-map and the K-map Editor to display the document. If the document cannot be retrieved, the <HTTPURL> tag is used instead. For Notes documents, the native URL is used when the <USENOTES/> tag is present.	Optional		No
HTTPURL	Contains the HTTP URL of the document, for example: HTTP://epr.acme.com/Wsj.nsf/0/00E5CFF068B33B1A852569760021DDDF?OpenDocument. URLs should be represented in absolute form and be consistent with Uniform Resource Identifiers (URI): Generic Syntax and Semantics, RFC 2396.	Mandatory		No
ACL	Optional tag that contains a document's access control list. Contains a collection of <ALLOW> and <DENY> elements. If empty or missing, it is assumed that everyone with repository access can view the document. Only one access control list per document is permitted. In addition to the document ACL, Discovery Server takes into account the repository ACL supplied in the Database.xml file. The document is exposed to a user only if the user is included in the repository and document ALLOW lists and is not included in the DENY list. Ensure that user identities defined in external repositories map to the user identities that Discovery Server uses to grant access to the K-map (via HTTP authentication and the user identity and password contained within the Discovery Server Directory - Person record).	Mandatory	Yes	No
ALLOW	This child element of <ACL> contains a group or a user who can read the document. This element is optional, but if present, it should not be empty. Each <ALLOW> element can contain only one name or group, but multiple occurrences of the element are allowed. If <ALLOW> tags are not present, Discovery Server assumes that all users with repository access can view the document, except for those explicitly stated in the DENY list. NT users and groups should be listed in the DOMAIN/USER_NAME format. Currently, access checking of users in external NT domains is not supported. Names are case-insensitive.	Optional	No	Yes
DENY	This child element of <ACL> contains a group or a user who is denied reader access. Same requirements as for <ALLOW>.	Optional	No	Yes
LINKS	Contains a collection of <LINK> elements. This element is optional, but if present, it should contain at least one <LINK>. Only one <LINKS> element per document is allowed.	Optional	No	No
LINK	This child element of <LINKS> describes a link contained in the document. The URL attribute is an address that points to the destination anchor. It should be represented in absolute form and be consistent with Uniform Resource Identifiers (URI): Generic Syntax and Semantics, RFC 2396. Multiple occurrences of the element are allowed.	Optional	Yes	Yes
USENOTES	Specifies a preferred viewer for a Lotus Notes document. If this tag is present, the document opens in a new Lotus Notes window. If the Notes client is not available, the default Internet browser becomes the fallback viewer. If the tag is omitted, the document is displayed in a new browser window. This element is always empty.	Optional	Yes	No
INCLUDE	Optional tag. Only one <INCLUDE> tag per document is allowed. If this element is present, Discovery Server merges the included file and the container. The FILEPATH attribute should not be empty. The resulting document is registered with the Discovery Server. The merge happens according to the following rules: (1) Always use the text (document body), keywords, and summary from the included file. (2) Use the container's metadata. (3) If the container is missing an author or title, get the omitted information from the included file. (4) Use the container's ACLs. Attribute: FILEPATH - location of the attached file.	Optional	Yes	No
ATTACHMENT	Optional tag. If this element is present, the attributes should not be empty. For multiple attachments in a document, Discovery Server processes the attached file in a manner similar to how attachments are processed for Domino. Attachments are registered with Discovery Server independently from the container. Rules used to process attachment metadata: (1) If the attachment is missing an author, title, summary, or keywords, get the omitted metadata from the container. (2) Combine the title/subject of the container with the attachment name. For example, if the container's title is Container1 and the attachment is named file1.doc, the new title is Container1(attachment - file1.doc). (3) Use the container's ACLs. Attributes: FILEPATH - specifies the location of the attached file; IDENTIFIER - contains a unique ID for the attachment; NATIVEURL - specifies the Notes or File URL of the attachment; HTTPURL - specifies the HTTP URL of the attachment; USENOTES - specifies a preferred viewer for the attachment.	Optional	Yes	Yes
DELETE	Used to remove previously processed documents from the data repository. If this tag is present, the document is deleted. This element is always empty. Note: If the document is new and has not been registered during a previous run of the spider, a message is logged.
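
To make the table above concrete, here is a minimal sketch of a metadata file for one document. It is an illustration rather than a verified sample. The element order simply follows the table and is an assumption. The identifier, title, summary, keywords, body text, application name, HTTP URL, ACL names, and link target are hypothetical. The author names, timestamp, and file URL are reused from the examples in the table. Whether the spider expects an XML declaration or a DTD reference is not stated here, so both are left out.

```xml
<!-- Illustrative sketch only: values are hypothetical unless they appear in the table above. -->
<DOCUMENT>
  <!-- Unique alphanumeric ID; must not contain a semicolon -->
  <IDENTIFIER>fsspr-93sales</IDENTIFIER>
  <!-- One name per AUTHOR element -->
  <AUTHOR>Paul White</AUTHOR>
  <AUTHOR>Jake Black</AUTHOR>
  <!-- Editors are unknown here, so the authors are copied into UPDATEDBY -->
  <UPDATEDBY>Paul White</UPDATEDBY>
  <UPDATEDBY>Jake Black</UPDATEDBY>
  <!-- GMT, in YYYY-MM-DD-HH.MM.SS format -->
  <CREATED>1999-03-30-21.19.26</CREATED>
  <!-- Unknown access, modification, and revision times default to the CREATED time -->
  <LASTREAD>1999-03-30-21.19.26</LASTREAD>
  <MODIFIED>1999-03-30-21.19.26</MODIFIED>
  <REVISIONS>
    <REVISION>1999-03-30-21.19.26</REVISION>
  </REVISIONS>
  <TITLE>1993 sales figures</TITLE>
  <SUMMARY>Quarterly sales totals for 1993.</SUMMARY>
  <KEYWORDS>
    <KEYWORD>sales</KEYWORD>
    <KEYWORD>forecast</KEYWORD>
  </KEYWORDS>
  <BODY>Plain-text content of the document goes here.</BODY>
  <APPLICATION>Microsoft Excel</APPLICATION>
  <LANGUAGE>en-US</LANGUAGE>
  <NATIVEURL>file:////epracmepr/fdrive/Testfiles/fsspr/93sales.xls</NATIVEURL>
  <HTTPURL>http://files.example.com/fsspr/93sales.xls</HTTPURL>
  <ACL>
    <!-- NT users and groups use the DOMAIN/USER_NAME format -->
    <ALLOW>ACME/SalesTeam</ALLOW>
    <DENY>ACME/Contractors</DENY>
  </ACL>
  <LINKS>
    <LINK URL="http://www.example.com/reports/1993.html"/>
  </LINKS>
</DOCUMENT>
```

With two <KEYWORD> elements, the keyword limit works out to MAX_LEN = 256 - 2 = 254 characters. The combined keywords here ("sales" and "forecast") total 13 characters, well under that limit.

Under the same assumptions, removing this document on a later run would need only the identifier followed by the empty <DELETE/> tag:

```xml
<DOCUMENT>
  <IDENTIFIER>fsspr-93sales</IDENTIFIER>
  <DELETE/>
</DOCUMENT>
```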

Element name	Description	Mandatory/optional	Can be empty?	Multiple entries?
DATABASE	The root element that contains <DATABASEID> and <ACL> elements.	Mandatory	No	No
DATABASEID	Contains a data repository ID as an alphanumeric string that identifies the repository. This tag is required, can appear only once, and cannot be empty. On the initial traversal, the ID should be new and unique. In other words, there should be no other data repository registered with this ID. On the following traversals, the ID should not be changed. Note: The semicolon is not a valid character for the <DATABASEID>.	Mandatory	No	No
NATIVEURL	Contains the Notes or File URL for the repository. <NATIVEURL> is used by the K-map and the K-map Editor to display the repository. If the repository cannot be retrieved, <HTTPURL> is used instead. For Notes databases, <NATIVEURL> is used when the <USENOTES/> tag is present.	Mandatory	No	No
HTTPURL	Contains the HTTP URL for the repository. URLs should be represented in absolute form and be consistent with Uniform Resource Identifiers (URI): Generic Syntax and Semantics, RFC 2396.	Mandatory	No	No
SUMMARY	Provides a short description of the repository. If present, the element should not be empty. Only one <SUMMARY> tag per database is permitted. The summary should be 256 characters or shorter.	Optional	No	No
OWNER	Contains the repository owner's name. If present, the element should not be empty. Only one <OWNER> tag per database is permitted.	Optional	No	No
ACL	Contains the repository access control list. Contains a collection of <ALLOW> and <DENY> elements. If not present or if empty, it is assumed that everyone has repository access rights. Only one access control list per repository is permitted. In addition to the repository ACL, Discovery Server takes into account ACLs provided for each individual document. You need to ensure that user identities defined in external repositories map to user identities that Discovery Server uses to grant access to the K-map (via HTTP authentication and user identity and password contained within the Discovery Server Directory - Person record). Attributes: DOMAINTYPE - Set to WindowsNT or Domino DEFAULTDOMAIN - This default name is used if <ALLOW> or <DENY> elements lack domain names and only contain user names (for example, <ALLOW>User1</ALLOW>). The Authentication part of Discovery Server must know about this domain to handle it.	Optional	Yes	No
ALLOW	This child element of <ACL> contains a group or a user who can read documents in the repository. This element is optional, but if present, it should not be empty. Each <ALLOW> element can contain only one name or group, but multiple occurrences of the element are allowed. If <ALLOW> tags are not present, Discovery Server assumes that all users are granted repository access rights, except for those explicitly stated in the <DENY> list. NT users and groups should be listed in the DOMAIN/USER_NAME format. Currently, access checking of users in external NT domains is not supported. Names are case-insensitive.	Optional	No	Yes
DENY	This child element of <ACL> contains a group or a user who is denied reader access. This element is optional. Same requirements as for <ALLOW>.	Optional	No	Yes
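
As a rough illustration of the repository-level table above, a Database.xml file might look like the following sketch. The repository ID, URLs, summary, owner, domain name, and group names are hypothetical, and the element order follows the table as an assumption. Only the element and attribute names come from the table.

```xml
<!-- Illustrative sketch only: all values are hypothetical. -->
<DATABASE>
  <!-- Must be new and unique on the first traversal and unchanged afterwards; no semicolons -->
  <DATABASEID>legacy-sales-archive</DATABASEID>
  <NATIVEURL>file:////epracmepr/fdrive/Testfiles/fsspr</NATIVEURL>
  <HTTPURL>http://files.example.com/fsspr/</HTTPURL>
  <!-- 256 characters or shorter -->
  <SUMMARY>Archived sales spreadsheets from the legacy file server.</SUMMARY>
  <OWNER>Paul White</OWNER>
  <!-- DEFAULTDOMAIN applies to entries that carry no domain name of their own -->
  <ACL DOMAINTYPE="WindowsNT" DEFAULTDOMAIN="ACME">
    <ALLOW>SalesTeam</ALLOW>
    <DENY>ACME/Contractors</DENY>
  </ACL>
</DATABASE>
```

In this sketch the <ALLOW> entry omits the domain, so it would be resolved against the DEFAULTDOMAIN value, while the <DENY> entry spells out the domain in the DOMAIN/USER_NAME format.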