Lotusphere 2001 Daily Digest: KM108 Building a Corporate Taxonomy

KM108 Building a Corporate Taxonomy

Download the presentation from the session

Before diving into the subject of corporate taxonomies, Wendi Pohs presented a brief overview of the Discovery Server. The Discovery Server categorizes expertise and content and discovers meaning, relationships, and values from that information. It personalizes and organizes knowledge both for individual users and for the community of users. It also serves as a place for communities and teams to work together, make decisions, and take action.

The Discovery Server's services act on the gathered information, such as content, user profiles, user actions, and places. Wendi described each service briefly. The spidering service goes out and collects information from all the sources being used for input. Clustering creates the categories that will be used; this is the basis for your corporate taxonomy, which is also called the K-map. The index service provides enhanced indexing, or as Wendi put it, "full-text index on steroids." The profile service holds information about people, developing users' profiles. The metrics service keeps track of all categories and actions, and the categorize service looks at documents and groups them under the categories of the taxonomy. The user interfaces for all this are the administrative UI, the K-map editor UI, and the end user UI.

With this overview, Wendi turned her attention to the question: What is a taxonomy? She explained it was a hierarchical collection of categories and documents; you can think of Yahoo as having a taxonomy with its categories, subcategories, and documents. A taxonomy includes both a structure and content, both labeling and clustering information. The benefit of organizing knowledge in this way is that it provides access to corporate information such that users don't have to know what they are looking for in order to find it. Like a book index, it organizes topics in a way that clusters similar information, saving users trouble and time.
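As a rough illustration of "a hierarchical collection of categories and documents," the sketch below models a taxonomy as a small tree and lists it the way a user might browse it. The `Category` class, the `browse` function, and the sample category and file names are all hypothetical; nothing here reflects how the Discovery Server stores its K-map.

```python
from dataclasses import dataclass, field


@dataclass
class Category:
    """One node of a hypothetical taxonomy: a label, its documents, its subcategories."""
    name: str
    documents: list = field(default_factory=list)
    children: list = field(default_factory=list)

    def add_child(self, child):
        self.children.append(child)
        return child


def browse(category, depth=0):
    """Return indented lines listing every category and document under `category`."""
    lines = ["  " * depth + category.name + "/"]
    for doc in category.documents:
        lines.append("  " * (depth + 1) + doc)
    for child in category.children:
        lines.extend(browse(child, depth + 1))
    return lines


# Sample data (invented for illustration only).
root = Category("Company")
sales = root.add_child(Category("Sales"))
sales.add_child(Category("Proposals", documents=["acme-proposal.doc"]))
root.add_child(Category("Engineering", documents=["build-guide.doc"]))

listing = browse(root)
print("\n".join(listing))
```

A user browsing this structure can reach `acme-proposal.doc` by drilling down through Sales and Proposals without knowing in advance which words the document contains.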

"You can think about creating a taxonomy as search turned inside out," Wendi explained. When you search for something, you need to know the topic area well, know what you want to find, and know the words or phrases that will result in finding it. Searching derives information from the words in the text of the content you are searching; the taxonomy does this up front, before users start looking for anything. You need similar knowledge to create the taxonomy, but once you create it, you are opening the door for people to browse through categories and discover information they may not have thought to look for.

Returning to Yahoo as an example, Wendi explained that the categorization at Yahoo is created and maintained by people. It takes about 50 people to maintain Yahoo currently. As she pointed out, corporations are not likely to devote that number of employees to maintain their corporate taxonomy. In addition, maintaining a taxonomy—sifting through new documents and adding new categories when necessary, for example—gets boring. It's also subjective, with different people making different judgements and placing documents in different categories. When people change jobs, it opens the process up to even more subjectivity. Automating the creation and maintenance of the taxonomy as much as possible can reduce these problems.

Wendi next defined several terms more precisely. Clustering involves automatically creating groups (categories) of similar documents based on distance or proximity measures. This is the technique the Discovery Server uses in its initial data mining to create the first taxonomy and group the documents. Categorizing involves analyzing the words in documents and assigning the documents to predefined categories (as the Yahoo indexers do manually). A category is the name for a group of documents, and a thesaurus, in this context, is a list of synonyms.
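The difference between clustering and categorizing can be sketched in a few lines. The example below is only an illustration under simple assumptions: documents are reduced to word counts, similarity is cosine similarity, and clustering is a greedy single pass with an arbitrary threshold. The session did not describe the Discovery Server's actual algorithms, so none of the function names or parameters here should be read as its API.

```python
import math
from collections import Counter


def vectorize(text):
    """Turn text into a bag-of-words vector (word -> count)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def cluster(docs, threshold=0.3):
    """Clustering: discover groups. Each document joins the most similar
    existing group, or starts a new one if nothing is similar enough."""
    clusters = []  # each entry: {"centroid": Counter, "members": [names]}
    for name, text in docs.items():
        vec = vectorize(text)
        best = max(clusters, key=lambda c: cosine(vec, c["centroid"]), default=None)
        if best is not None and cosine(vec, best["centroid"]) >= threshold:
            best["members"].append(name)
            best["centroid"] += vec
        else:
            clusters.append({"centroid": Counter(vec), "members": [name]})
    return [c["members"] for c in clusters]


def categorize(text, categories):
    """Categorizing: assign a document to the best-matching predefined category."""
    vec = vectorize(text)
    return max(categories, key=lambda name: cosine(vec, vectorize(categories[name])))


# Invented sample documents and category descriptions.
docs = {
    "a": "quarterly sales forecast for retail customers",
    "b": "sales forecast and customer pipeline review",
    "c": "server installation guide for administrators",
    "d": "administrators guide to server security",
}
groups = cluster(docs)

categories = {
    "Sales": "sales forecast customer pipeline",
    "IT": "server administrators security installation",
}
assigned = categorize("new server security checklist", categories)
print(groups, assigned)
```

Clustering starts with no categories and produces them; categorizing starts with the categories already defined and only decides where a new document belongs.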

In developing a taxonomy, Wendi said her mantra is: "Knowledge is in the eye of the beholder, but reflecting end user needs is as critical as representing texts...and it takes work." It isn't sufficient to just group documents; you have to group them in ways that reflect the way your users will ask for information. In essence, when creating the taxonomy, you are building an information retrieval application.

The end user benefits because they now can discover information they wouldn't have found before, browse and discover relationships between categories, and analyze previously unknown related topics or even discover gaps in information.

There are many corporate benefits as well. Employees can easily reuse information; for example, they can use an existing proposal as the basis for a new one. Information quality also improves; Wendi mentioned that when she works with corporate groups they inevitably clean up their data before submitting it for analysis. In bringing all the corporate knowledge together, they also discover information they didn't know was in the company and make new connections among the information. The process also starts to define people's affinities for topics by seeing how individuals relate to documents and categories. Additionally, using categories broadens and enhances search capabilities.

The methodology for creating the taxonomy consists of several steps. First you need to determine what information users need. You can do this with a formal or informal information audit. You want to get a sense of the way people look for information. You also should do a content audit to see what the existing systems and information are. If you've got an existing taxonomy, you should consider using it as a basis for your work. At this point you can also review the meta-data attached to documents and make sure it is cleaned up. You might decide to "go wide" and select databases from different functional areas to get a representative sampling across the corporation, or you might decide to "go deep" and get nitty gritty detailed information in one area. In either case, you should select sources of information that you know you'll have access to, which will require buy-in from the groups that own the information. Additionally, as you gather sources, you should look for existing keyword fields; they are meaningful to users and are clues for potential categories.

From all this work, you will have a mental model of what the taxonomy might look like. If there is an existing taxonomy, you may just massage it manually; or you may decide to take an existing taxonomy, have the Discovery Server automatically create one, and then compare the two. If there is no existing taxonomy, you can have the Discovery Server generate a first draft from a "training set" of the documents. Wendi stressed that the training set documents should be of reasonable length so the algorithms can work properly, and that they should cover a variety of topics. You also need to decide how deep the level of categories should be; if your users are impatient, you may want to have more top-level categories so they do not have to drill down as many levels. You also should consider the number of categories any one document can be assigned to and how many documents any one category can have. Too many in either case can create usability issues.
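The limits mentioned above (how deep the taxonomy goes, how many categories a document may belong to, and how many documents a category may hold) can be expressed as a simple check. This is a minimal sketch: the nested-dictionary representation, the function names, and the limit parameters are assumptions made for illustration and are not Discovery Server settings.

```python
def depth(tree):
    """Number of category levels in a nested-dict taxonomy ({} means no subcategories)."""
    return 0 if not tree else 1 + max(depth(sub) for sub in tree.values())


def check_limits(tree, assignments, max_depth, max_cats_per_doc, max_docs_per_cat):
    """Return a list of human-readable problems; an empty list means all limits are met."""
    problems = []
    if depth(tree) > max_depth:
        problems.append(f"taxonomy is {depth(tree)} levels deep (limit {max_depth})")
    per_category = {}
    for doc, cats in assignments.items():
        if len(cats) > max_cats_per_doc:
            problems.append(f"{doc} is in {len(cats)} categories (limit {max_cats_per_doc})")
        for cat in cats:
            per_category[cat] = per_category.get(cat, 0) + 1
    for cat, count in per_category.items():
        if count > max_docs_per_cat:
            problems.append(f"{cat} holds {count} documents (limit {max_docs_per_cat})")
    return problems


# Invented sample taxonomy and document assignments.
tree = {"Sales": {"Proposals": {"Europe": {}}}, "IT": {}}
assignments = {
    "memo.doc": ["Sales", "IT", "Europe"],
    "plan.doc": ["Sales"],
}
problems = check_limits(tree, assignments, max_depth=2, max_cats_per_doc=2, max_docs_per_cat=1)
for p in problems:
    print(p)
```

Running a check like this against a draft taxonomy gives a quick list of places where users may have to drill too deep or wade through overloaded categories.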

The next step is to review this initial taxonomy, matching the way users ask questions with the way they will find information. This step involves reviewing terminology, editing, renaming, eliminating, and moving categories. Focus on using unique terms and looking for ways to reduce the taxonomy, perhaps by merging or eliminating parallel categories. As you do all this, keep in mind that categories are used for people's affinities as well as for documents. You should also spot check category assignments for documents to find misplaced documents.

Once you are satisfied with the taxonomy, you should categorize more documents to see how well it works. You can do this manually, automatically, or more likely, both. When you move documents manually, you begin to train the system to make similar decisions the next time it encounters similar documents. Also at this point, the taxonomy can be reviewed by content experts and others for appropriate terminology, and it may become apparent where meta-data for existing information needs to be added or updated.

Finally, you test the taxonomy out on users to see if the categories and their hierarchy actually work for your users. Can they find what they need? Are there any missing categories? Do the groups of documents make sense? Do the categories complement the full-text search? Are the affinities meaningful? Additional changes to the taxonomy may be necessary based on the answers to these questions and on user feedback.

Having explained the process in some detail, Wendi explored the different roles that might be needed to develop a taxonomy and implement knowledge management in a corporation. Roles that might already exist in an organization include content managers (who understand the content), corporate librarians (who know the data and how to classify and structure information), database managers (who understand the existing data structure and index), Web site managers (who also have to build categories and group information), technical writers and editors (who understand content and are good at organizing and structuring information), and IT data warehouse managers (who understand clustering and data mining).

There are also new roles that may be created within the corporation. A knowledge manager/steward may be one person or, more likely, a committee. This person/group is responsible for getting buy-in from the different functional groups providing content. They must assess the different databases and sources of information and understand the value of the content based on their understanding of the information needs of users. They make decisions about content, based on user feedback and on their own knowledge of the company and user needs. In short, they manage the content.

Another new role is that of editor. The editor is a subject matter expert who also has expertise with words and organizing information. They must understand the overall taxonomy, its levels, and categories. The editor is really the person who is responsible for the corporate taxonomy or K-map. They should also understand basic user interface design concepts and be able to work closely with IT and the administrator. The Discovery Server administrator manages the server and its services, including tasks specific to the Discovery Server such as spidering data, and standard administrative tasks such as managing security.

Having reviewed the processes and roles, Wendi gave a demonstration of working with the K-map. She showed how you can use the K-map settings to control what the resulting taxonomy will look like, defining how many levels of categories you want, how many categories a document can be assigned to, and so on. The K-map Editor lets you see the categories and the documents assigned to them, including information about the documents. Wendi explained that fit values were an important indicator and that documents with the lowest fit values for a category are most likely the ones whose assignment you need to review. As you make changes to the taxonomy and document assignments to categories, you are training the system to make similar choices in the future.
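The advice to review the documents with the lowest fit values first amounts to sorting assignments by fit. The sketch below assumes fit values are already available as numbers where lower means a weaker match; the session did not say how they are computed or scaled, and the `review_queue` function and sample data are hypothetical.

```python
def review_queue(fit_values, limit=3):
    """Return (category, document, fit) tuples, weakest fit first, up to `limit` rows."""
    rows = [
        (category, doc, fit)
        for category, docs in fit_values.items()
        for doc, fit in docs.items()
    ]
    return sorted(rows, key=lambda row: row[2])[:limit]


# Invented fit values for illustration.
fit_values = {
    "Sales": {"forecast.doc": 0.91, "picnic.doc": 0.12},
    "IT": {"install.doc": 0.84, "menu.doc": 0.20},
}
queue = review_queue(fit_values, limit=2)
for category, doc, fit in queue:
    print(f"{category}: {doc} (fit {fit})")
```

An editor working from such a queue would look at `picnic.doc` and `menu.doc` first, since they are the most likely to be misplaced.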

Wendi concluded her presentation by mentioning some issues you should be aware of when setting up the Discovery Server and knowledge management system. For example, it's important to set appropriate expectations; the taxonomy created by the Discovery Server will need some tweaking to optimize it for users. At the same time, you have to learn to trust the system; it's new technology, but it works.
