Network Working Group                                        Ryan Moats
Internet-Draft                                               Rick Huber
Category: Informational                                            AT&T
Expires: October 1998                                         April 1998

             Directories and DNS: Experiences from Netfind

Status of This Memo

   This document is an Internet-Draft.  Internet-Drafts are working
   documents of the Internet Engineering Task Force (IETF), its areas,
   and its working groups.  Note that other groups may also distribute
   working documents as Internet-Drafts.

   Internet-Drafts are draft documents valid for a maximum of six
   months and may be updated, replaced, or obsoleted by other documents
   at any time.  It is inappropriate to use Internet-Drafts as
   reference material or to cite them other than as "work in progress."

   To view the entire list of current Internet-Drafts, please check the
   "1id-abstracts.txt" listing contained in the Internet-Drafts Shadow
   Directories on ftp.is.co.za (Africa), ftp.nordu.net (Northern
   Europe), ftp.nis.garr.it (Southern Europe), munnari.oz.au (Pacific
   Rim), ftp.ietf.org (US East Coast), or ftp.isi.edu (US West Coast).

   Distribution of this document is unlimited.

Abstract

   There have been several Internet-Drafts and RFCs written about the
   need for Internet directories.  This draft discusses lessons learned
   during the InterNIC Directory and Database Services project's
   custodianship of the Netfind search engine and database.  These
   lessons have direct implications for providing and maintaining the
   mappings between domain names and company information that are
   essential for an Internet directory.  This work builds on that of
   Mike Schwartz and his team at the University of Colorado at Boulder
   [1].

1. Introduction

   There have been several Internet-Drafts [2, 3] and RFCs [4, 5, 6]
   written about approaches for providing Internet directories.  Many
   of the earlier documents discussed white pages directories that
   supply mappings from a person's name to their telephone number,
   email address, etc.  More recently, there has been discussion of
   directories that map from a company name to a domain name or web
   site.

   From July 1996 until our shutdown in March 1998, the InterNIC
   Directory and Database Services project maintained the Netfind
   search engine [1] and the associated "Seed Database" that maps
   organization information to domain names and thus acts as the type
   of Internet directory that associates company names with domain
   names.  The experience gained from maintaining and growing this
   database has provided valuable insight into the issues of providing
   a directory service.

   Many people use DNS as a directory today to find information about a
   given company.  Typically, when DNS is used this way, users guess
   the domain name of the company they are looking for and then prepend
   "www.".  This makes it highly desirable for a company to have an
   easily guessable name.

   There are two major difficulties here.  As the number of assigned
   names increases, it becomes more difficult to get an easily
   guessable name.  Also, the TLD must be guessed as well as the name.
   While many users just guess ".COM" today, there are many two-letter
   country code top-level domains in current use as well as other gTLDs
   (.NET, .ORG, and possibly .EDU in addition to .COM), with the
   prospect of additional gTLDs in the near future.  Since both of
   these problems are or will soon be present in DNS, guessing is
   getting more difficult every day.
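   To make this guessing procedure concrete, the following Python
   fragment is a minimal sketch (it is not part of Netfind itself; the
   helper name and the candidate TLD list are assumptions made for this
   example).  It generates candidate hostnames for a company name and
   reports which of them actually resolve in DNS:

      import socket

      # Hypothetical helper mimicking the "guess a name, prepend www."
      # behavior described above.  The TLD list is only a plausible
      # sample; real users try whatever occurs to them.
      def guess_hostnames(company, tlds=(".com", ".net", ".org")):
          name = company.lower().replace(" ", "")
          hits = []
          for tld in tlds:
              host = "www." + name + tld
              try:
                  socket.gethostbyname(host)  # does the guess resolve?
                  hits.append(host)
              except socket.error:
                  pass                        # guess failed; try next TLD
          return hits

   As noted above, every additional TLD multiplies the guesses a user
   must try, which is part of why guessing keeps getting harder.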
2. Building a Directory

   We are dealing here with directories whose goal is to map company
   names to domain names.  The reverse lookup (domain name to "owning"
   company) can be done using WHOIS or similar tools (for TLDs where
   such tools are supported).  A database that contains the mapping we
   want can be built from the WHOIS data, but we must first get data on
   what DNS names exist.  There are three issues that must be
   addressed:

   - Finding new domain names for directory updates (and finding all
     domain names for the initial directory build).

   - Finding the company name associated with each domain name.

   - Determining when the data associated with an existing domain name
     has changed.

3. Finding New Domain Names

   One proposal for determining domain name existence is to use a
   variant of a "Tree Walk" to find the domains that need to be added
   to the directory.  Our experience with the Netfind database is that
   this is neither a reasonable nor an efficient mechanism for
   maintaining such a directory.  DNS "Tree Walks" tend to be
   discouraged (as they should be) by the Internet community for both
   security and load reasons.  In addition, our experience has shown
   that data on allocated DNS domains can often be retrieved via other
   methods (FTP, HTTP, etc.).  Therefore, to find new domain names, FTP
   or HTTP should be used to download lists of allocated domains, and
   DNS "Tree Walks" should be used only as a last resort.

4. Associating Company Information with a Domain Name

   WHOIS appears to be the logical starting point for information
   relating company names to domain names, and several of the directory
   proposals [2, 3] discuss using WHOIS for this purpose.  As of the
   March 1998 release, the Netfind seed database had approximately 2.7
   million records that contained data retrievable by WHOIS.  This
   constituted 82.8% of the entire Netfind database, but our experience
   has shown that this information contains a number of factual and
   typographical errors.  Further, those TLDs that have registrars
   supporting WHOIS typically only provide WHOIS information for
   second-level domains, as opposed to lower-level domains.  There also
   remains the other 17.2%: TLDs without registrars, TLDs without WHOIS
   support, and TLDs that use tools other than WHOIS (HTTP, FTP,
   gopher) for providing organizational information.  In summary, using
   WHOIS alone is not sufficient to populate an Internet directory.
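   For concreteness, the WHOIS lookups discussed above follow a very
   simple protocol: open a TCP connection to port 43, send the query
   followed by CRLF, and read the free-text reply until the server
   closes the connection.  The Python sketch below is illustrative only
   (it is not the Netfind production code, and the server name shown is
   just an example; each registrar runs its own server):

      import socket

      def whois_query(domain, server="whois.internic.net"):
          # Minimal WHOIS client: send the query on TCP port 43 and
          # read the reply until the server closes the connection.
          s = socket.create_connection((server, 43), timeout=30)
          try:
              s.sendall(domain.encode("ascii") + b"\r\n")
              reply = b""
              while True:
                  chunk = s.recv(4096)
                  if not chunk:
                      break
                  reply += chunk
              return reply.decode("ascii", errors="replace")
          finally:
              s.close()

   The organization name must then be parsed out of the unstructured
   reply, which is exactly where the factual and typographical errors
   noted above enter the directory.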
5. Keeping Data Current

   Given the current size of the Netfind database and a reasonable
   processor, it requires somewhere between 7.2 million and 9.0 million
   seconds of CPU time to rebuild the entire portion of the Netfind
   database that is available from WHOIS lookups.  This is roughly
   85-105 CPU days if no parallel processing is done.  Note that this
   estimate does not include other considerations that would increase
   the amount of time needed to rebuild the database.

   During our maintenance of the Netfind database, we provided monthly
   updates; a full database rebuild every month would have required
   between 3 and 5 machines dedicated full time to the task.  Such a
   dedication was unreasonable, given that the set of allocated domains
   currently grows by around 150,000 new domains per month.
   Checkpointing the allocated domain list and rebuilding during one
   weekend of the month would instead have required between 40 and 60
   machines for a full update.

   A more reasonable approach was to do incremental updates of the
   directory.  Such an approach allowed updates to be handled on a
   monthly basis using a reasonable number (between 1 and 4) of
   machines.  Coupling this approach with a periodic refresh of already
   allocated domains allowed older records to be updated when the
   underlying information changed.  Note that the periodic refresh was
   not triggered by any event; rather, it was a scheduled procedure.

   When using an incremental approach, it was necessary to verify the
   information for domains that were already in the database.  This was
   done by direct DNS lookups to verify the existence of the domain
   name in question, and then by WHOIS lookups to determine whether the
   associated information had changed.  This was done on a rotating
   basis so that acceptable performance was maintained.  In practice,
   we did a 100% check by direct DNS lookup and checked about 10% of
   the names in WHOIS each month.
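   The verification cycle just described can be sketched as follows.
   This is again an illustration only, with assumed helper names; the
   WHOIS function is passed in (e.g., the whois_query() sketch in
   Section 4), and the comparison against stored records is omitted:

      import socket

      def monthly_verification(domains, month, whois_fn, slices=10):
          # Every domain gets a DNS existence check each month; a
          # rotating one-in-'slices' subset also gets a WHOIS re-check
          # via whois_fn, so the whole database is re-checked in about
          # ten months.
          missing = []
          for i, domain in enumerate(domains):
              try:
                  # Rough existence test; a domain may exist without an
                  # A record, so the production check used direct DNS
                  # lookups rather than simple address resolution.
                  socket.gethostbyname(domain)
              except socket.error:
                  missing.append(domain)    # candidate for removal
                  continue
              if i % slices == month % slices:
                  record = whois_fn(domain)
                  # ...compare 'record' with the stored entry and flag
                  # the record if the organization data has changed.
          return missing

   Rotating the WHOIS checks spreads the query load over many months,
   which is how acceptable performance was maintained.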
6. Distributed vs. Monolithic

   While a distributed directory is a desirable goal, the March 1998
   Netfind database was monolithic in nature.  Given past growth, it is
   not clear at what point migrating to a distributed directory
   actually becomes necessary to support customer queries.

   The current Netfind database holds approximately 3.26 million
   records in a flat ASCII file.  Searching is done via a Perl script
   and an inverted tree.  While admittedly primitive, this
   configuration supported over 70,000 queries per month (with a peak
   level of 200,000 in one month) from our production servers.
   Increasing the database size only requires more disk space to hold
   the database and the inverted tree.  Of course, using actual
   database technology would probably improve performance and
   scalability, but such technology has not yet been required.

7. Other Directory Considerations

   Availability goals can be met by having multiple copies of the
   database in place.  InterNIC Directory and Database Services
   maintained 3 production copies of the Netfind database, and there
   are about a dozen other copies maintained by other organizations
   throughout the world.  This ensures that users almost always have
   access to the database.  At the InterNIC Directory and Database
   Services sites, service downtime for database updates was avoided by
   doing updates in series; only one server was being updated at any
   given time.

8. Security Considerations

   This document specifies methods of collecting and accessing data
   that is already freely accessible to anyone on the Internet.  Such
   gathering will make access to this data easier and may increase
   opportunities for abuse.

9. Acknowledgments

   The work described in this document was partially supported by the
   National Science Foundation through Cooperative Agreement
   NCR-9218179.

10. References

   Requests for Comments (RFCs) and Internet-Drafts are available from
   the IETF and from numerous mirror sites.

   [1] Schwartz, M. F., and C. Pu, "Applying an Information Gathering
       Architecture to Netfind: A White Pages Tool for a Changing and
       Growing Internet", University of Colorado Technical Report
       CU-CS-656-93, December 1993, revised July 1994.

   [2] Mansfield, G., et al., "A Directory for Organizations and
       Services from DNS and WHOIS", Internet Draft (work in progress),
       November 1997.

   [3] Klensin, J., and T. Wolf, Jr., "Domain Names and Company Name
       Retrieval", Internet Draft (work in progress), July 1997.

   [4] Sollins, K., "Plan for Internet Directory Services", RFC 1107,
       M.I.T. Laboratory for Computer Science, July 1989.

   [5] Hardcastle-Kille, S., "Replication Requirements to provide an
       Internet Directory using X.500", RFC 1275, University College
       London, November 1991.

   [6] Postel, J., and C. Anderson, "White Pages Meeting Report",
       RFC 1588, February 1994.

11. Authors' Addresses

   Ryan Moats
   AT&T
   15621 Drexel Circle
   Omaha, NE 68135-2358
   USA

   EMail: jayhawk@att.com

   Rick Huber
   AT&T
   Room 1B-433, 101 Crawfords Corner Road
   Holmdel, NJ 07733-3030
   USA

   EMail: rvh@att.com