Network Working Group                                        Ryan Moats
Internet-Draft                                               Rick Huber
Category: Informational                                            AT&T
Expires: October 1998                                         April 1998

             Directories and DNS: Experiences from Netfind

Status of This Memo

   This document is an Internet-Draft.  Internet-Drafts are working
   documents of the Internet Engineering Task Force (IETF), its areas,
   and its working groups.  Note that other groups may also distribute
   working documents as Internet-Drafts.

   Internet-Drafts are draft documents valid for a maximum of six
   months and may be updated, replaced, or obsoleted by other documents
   at any time.  It is inappropriate to use Internet-Drafts as
   reference material or to cite them other than as "work in progress."

   To view the entire list of current Internet-Drafts, please check the
   "1id-abstracts.txt" listing contained in the Internet-Drafts Shadow
   Directories on ftp.is.co.za (Africa), ftp.nordu.net (Northern
   Europe), ftp.nis.garr.it (Southern Europe), munnari.oz.au (Pacific
   Rim), ftp.ietf.org (US East Coast), or ftp.isi.edu (US West Coast).

   Distribution of this document is unlimited.

Abstract

   There have been several Internet-Drafts and RFCs written about the
   need for Internet directories.  This draft discusses lessons learned
   during the InterNIC Directory and Database Services project's
   custodianship of the Netfind search engine and database.  These
   lessons have direct implications for providing and maintaining the
   mappings between domain names and company information that are
   essential for an Internet directory.  This work builds on that of
   Mike Schwartz and his team at the University of Colorado at Boulder
   [1].

1. Introduction

   There have been several Internet-Drafts [2, 3] and RFCs [4, 5, 6]
   written about approaches for providing Internet directories.  Many
   of the earlier documents discussed white pages directories that
   supply mappings from a person's name to their telephone number,
   email address, etc.  More recently, there has been discussion of
   directories that map from a company name to a domain name or web
   site.

   From July 1996 until our shutdown in March 1998, the InterNIC
   Directory and Database Services project maintained the Netfind
   search engine [1] and the associated "Seed Database" that maps
   organization information to domain names and thus acts as the type
   of Internet directory that associates company names with domain
   names.  The experience gained from maintaining and growing this
   database has provided valuable insight into the issues of providing
   a directory service.

   Many people use DNS as a directory today to find information about a
   given company.  Typically, when DNS is used this way, users guess
   the domain name of the company they are looking for and then prepend
   "www.".  This makes it highly desirable for a company to have an
   easily guessable name.

   There are two major difficulties here.  As the number of assigned
   names increases, it becomes more difficult to get an easily
   guessable name.  Also, the TLD must be guessed as well as the name.
   While many users just guess ".COM" today, there are many two-letter
   country code top-level domains in current use as well as other gTLDs
   (.NET, .ORG, and possibly .EDU in addition to .COM), with the
   prospect of additional gTLDs in the near future.  Since both of
   these problems are or will soon be present in DNS, guessing is
   getting more difficult every day.
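   To make this guessing procedure concrete, the following Python
   fragment is a minimal sketch (it is not part of Netfind itself; the
   helper name and the candidate TLD list are assumptions made for this
   example).  It generates candidate hostnames for a company name and
   reports which of them actually resolve in DNS:

      import socket

      # Hypothetical helper mimicking the "guess a name, prepend www."
      # behavior described above.  The TLD list is only a plausible
      # sample; real users try whatever occurs to them.
      def guess_hostnames(company, tlds=(".com", ".net", ".org")):
          name = company.lower().replace(" ", "")
          hits = []
          for tld in tlds:
              host = "www." + name + tld
              try:
                  socket.gethostbyname(host)  # does the guess resolve?
                  hits.append(host)
              except socket.error:
                  pass                        # guess failed; try next TLD
          return hits

   As noted above, every additional TLD multiplies the guesses a user
   must try, which is part of why guessing keeps getting harder.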
2. Building a Directory

   We are dealing here with directories whose goal is to map company
   names to domain names.  The reverse lookup (domain name to "owning"
   company) can be done using WHOIS or similar tools (for TLDs where
   such tools are supported).  A database that contains the mapping we
   want can be built from the WHOIS data, but we must first get data on
   what DNS names exist.  There are three issues that must be
   addressed:

   - Finding new domain names for directory updates (and finding all
     domain names for the initial directory build).

   - Finding the company name associated with each domain name.

   - Determining when the data associated with an existing domain name
     has changed.

3. Finding New Domain Names

   One proposal for determining domain name existence is to use a
   variant of a "Tree Walk" to find the domains that need to be added
   to the directory.  Our experience with the Netfind database is that
   this is neither a reasonable nor an efficient mechanism for
   maintaining such a directory.  DNS "Tree Walks" tend to be
   discouraged (as they should be) by the Internet community for both
   security and load reasons.  In addition, our experience has shown
   that data on allocated DNS domains can often be retrieved via other
   methods (FTP, HTTP, etc.).  Therefore, to find new domain names, FTP
   or HTTP should be used to download lists of allocated domains, and
   DNS "Tree Walks" should be used only as a last resort.

4. Associating Company Information with a Domain Name

   WHOIS appears to be the logical starting point for information
   relating company names to domain names, and several of the directory
   proposals [2, 3] discuss using WHOIS for this purpose.  As of the
   March 1998 release, the Netfind seed database had approximately 2.7
   million records that contained data retrievable by WHOIS.  This
   constituted 82.8% of the entire Netfind database, but our experience
   has shown that this information contains a number of factual and
   typographical errors.  Further, those TLDs that have registrars
   supporting WHOIS typically only provide WHOIS information for
   second-level domains, as opposed to lower-level domains.  There also
   remains the other 17.2%: TLDs without registrars, TLDs without WHOIS
   support, and TLDs that use tools other than WHOIS (HTTP, FTP,
   gopher) for providing organizational information.  In summary, using
   WHOIS alone is not sufficient to populate an Internet directory.
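   For concreteness, the WHOIS lookups discussed above follow a very
   simple protocol: open a TCP connection to port 43, send the query
   followed by CRLF, and read the free-text reply until the server
   closes the connection.  The Python sketch below is illustrative only
   (it is not the Netfind production code, and the server name shown is
   just an example; each registrar runs its own server):

      import socket

      def whois_query(domain, server="whois.internic.net"):
          # Minimal WHOIS client: send the query on TCP port 43 and
          # read the reply until the server closes the connection.
          s = socket.create_connection((server, 43), timeout=30)
          try:
              s.sendall(domain.encode("ascii") + b"\r\n")
              reply = b""
              while True:
                  chunk = s.recv(4096)
                  if not chunk:
                      break
                  reply += chunk
              return reply.decode("ascii", errors="replace")
          finally:
              s.close()

   The organization name must then be parsed out of the unstructured
   reply, which is exactly where the factual and typographical errors
   noted above enter the directory.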
5. Keeping Data Current

   Given the current size of the Netfind database and a reasonable
   processor, it requires somewhere between 7.2 million and 9.0 million
   seconds of CPU time to rebuild the entire portion of the Netfind
   database that is available from WHOIS lookups.  This is roughly
   85-105 CPU days if no parallel processing is done.  Note that this
   estimate does not include other considerations that would increase
   the amount of time needed to rebuild the database.

   During our maintenance of the Netfind database, we provided monthly
   updates; a full database rebuild every month would have required
   between 3 and 5 machines dedicated full time to the task.  Such a
   dedication was unreasonable, given that the set of allocated domains
   currently grows by around 150,000 new domains per month.
   Checkpointing the allocated domain list and rebuilding during one
   weekend of the month would instead have required between 40 and 60
   machines for a full update.

   A more reasonable approach was to do incremental updates of the
   directory.  Such an approach allowed updates to be handled on a
   monthly basis using a reasonable number (between 1 and 4) of
   machines.  Coupling this approach with a periodic refresh of already
   allocated domains allowed older records to be updated when the
   underlying information changed.  Note that the periodic refresh was
   not triggered by any event; rather, it was a scheduled procedure.

   When using an incremental approach, it was necessary to verify the
   information for domains that were already in the database.  This was
   done by direct DNS lookups to verify the existence of the domain
   name in question, and then by WHOIS lookups to determine whether the
   associated information had changed.  This was done on a rotating
   basis so that acceptable performance was maintained.  In practice,
   we did a 100% check by direct DNS lookup and checked about 10% of
   the names in WHOIS each month.
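   The verification cycle just described can be sketched as follows.
   This is again an illustration only, with assumed helper names; the
   WHOIS function is passed in (e.g., the whois_query() sketch in
   Section 4), and the comparison against stored records is omitted:

      import socket

      def monthly_verification(domains, month, whois_fn, slices=10):
          # Every domain gets a DNS existence check each month; a
          # rotating one-in-'slices' subset also gets a WHOIS re-check
          # via whois_fn, so the whole database is re-checked in about
          # ten months.
          missing = []
          for i, domain in enumerate(domains):
              try:
                  # Rough existence test; a domain may exist without an
                  # A record, so the production check used direct DNS
                  # lookups rather than simple address resolution.
                  socket.gethostbyname(domain)
              except socket.error:
                  missing.append(domain)    # candidate for removal
                  continue
              if i % slices == month % slices:
                  record = whois_fn(domain)
                  # ...compare 'record' with the stored entry and flag
                  # the record if the organization data has changed.
          return missing

   Rotating the WHOIS checks spreads the query load over many months,
   which is how acceptable performance was maintained.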
6. Distributed vs. Monolithic

   While a distributed directory is a desirable goal, the March 1998
   Netfind database was monolithic in nature.  Given past growth, it is
   not clear at what point migrating to a distributed directory
   actually becomes necessary to support customer queries.

   The current Netfind database holds approximately 3.26 million
   records in a flat ASCII file.  Searching is done via a Perl script
   and an inverted tree.  While admittedly primitive, this
   configuration supported over 70,000 queries per month (with a peak
   level of 200,000 in one month) from our production servers.
   Increasing the database size only requires more disk space to hold
   the database and the inverted tree.  Of course, using actual
   database technology would probably improve performance and
   scalability, but such technology has not yet been required.

7. Other Directory Considerations

   Availability goals can be met by having multiple copies of the
   database in place.  InterNIC Directory and Database Services
   maintained 3 production copies of the Netfind database, and there
   are about a dozen other copies maintained by other organizations
   throughout the world.  This ensures that users almost always have
   access to the database.  At the InterNIC Directory and Database
   Services sites, service downtime for database updates was avoided by
   doing updates in series; only one server was being updated at any
   given time.

8. Security Considerations

   This document specifies methods of collecting and accessing data
   that is already freely accessible to anyone on the Internet.  Such
   gathering will make access to this data easier and may increase
   opportunities for abuse.

9. Acknowledgments

   The work described in this document was partially supported by the
   National Science Foundation through Cooperative Agreement
   NCR-9218179.

10. References

   Requests for Comments (RFCs) and Internet-Drafts are available from
   the IETF and from numerous mirror sites.

   [1] Schwartz, M. F., and C. Pu, "Applying an Information Gathering
       Architecture to Netfind: A White Pages Tool for a Changing and
       Growing Internet", University of Colorado Technical Report
       CU-CS-656-93, December 1993, revised July 1994.

   [2] Mansfield, G., et al., "A Directory for Organizations and
       Services from DNS and WHOIS", Internet Draft (work in progress),
       November 1997.

   [3] Klensin, J., and T. Wolf, Jr., "Domain Names and Company Name
       Retrieval", Internet Draft (work in progress), July 1997.

   [4] Sollins, K., "Plan for Internet Directory Services", RFC 1107,
       M.I.T. Laboratory for Computer Science, July 1989.

   [5] Hardcastle-Kille, S., "Replication Requirements to provide an
       Internet Directory using X.500", RFC 1275, University College
       London, November 1991.

   [6] Postel, J., and C. Anderson, "White Pages Meeting Report",
       RFC 1588, February 1994.

11. Authors' Addresses

   Ryan Moats
   AT&T
   15621 Drexel Circle
   Omaha, NE 68135-2358
   USA

   EMail: jayhawk@att.com

   Rick Huber
   AT&T
   Room 1B-433, 101 Crawfords Corner Road
   Holmdel, NJ 07733-3030
   USA

   EMail: rvh@att.com