MSR ’18, May 28–29, 2018, Gothenburg, Sweden Jinqiu Yang, Erik Wittern, Annie T.T. Ying, Julian Dolby, and Lin Tan
Documentation with Identical Structures. APIs from the same
providers may have identical documentation structures (e.g., Google
web APIs), whereas documentation structures may differ across
API providers. To show the generalizability of our approach,
we applied it to APIs from different providers: our evaluation
set contains 120 APIs from 98 different web API providers.
5 RELATED WORK
We discuss related works that address extracting or inferring web
API specifications, as well as works that rely on information
extraction approaches, both for extracting software entities and for
extracting any type of information from web pages generally.
Hanyang et al. describe AutoREST, a tool that, like this work,
aims to extract web API specifications from HTML-based
documentation [9]. AutoREST uses a preprocessing step to select crawled
web pages that likely contain information relevant to the
specification, which could be used in combination with the work
presented here. AutoREST relies on a set of simple, fixed rules to
extract information from the selected HTML pages, whereas our
methods are designed to remain applicable despite stark differences
in the way APIs are documented. We furthermore present
a more extensive and detailed evaluation in this work. Gao et al.
propose to infer constraints on the data required by web APIs (i.e.,
payload or parameter values) by mining both API documentation
and error messages [15]. In contrast to the work presented here,
their focus is thus on data definitions, making their work
complementary to ours.
Further related works on extracting web API specifications rely
on sources of information other than documentation. Wittern and
Suter use dynamic traces in the form of web-server logs [26]. The
SpyREST tool, presented by Sohan et al., intercepts HTTP requests
to an API using a proxy and then attempts to infer the API
specification from them [23]. In later work, the same authors discuss
the application of SpyREST at Cisco, where requests to the proxy
are driven by existing tests against APIs [24]. Ed-douibi et al.
propose an approach to generate web API specifications from example
request-response pairs [12]. One benefit of our approach, as
compared to these works, is that API documentation is typically publicly
available. In contrast, access to web logs is limited to those with
access to the private web servers, proxying may not be an option, and
providing extensive examples of API usage may require (manual)
effort that could instead be spent on writing specifications directly.
Many software engineering researchers have looked into the
problem of identifying code elements (more specifically, Java code
elements such as method signatures and calls) in API
documentation. Dagenais and Robillard proposed an approach that extracts
code elements from API documentation and links the elements to
an index of known code elements, i.e., signatures from a Java
library [11]. Subramanian et al. subsequently applied this approach to
identify code elements in Stack Overflow posts and augmented the
code elements in the posts with links to their official JavaDoc [25].
Rigby and Robillard use a lightweight, regular-expression-based
approach to identify code elements that relaxes the requirement
of a known index [21]. Another line of work focuses on extracting
more complex specifications of code entities from natural language
descriptions. Pandita et al. [20] extract method pre-conditions and
post-conditions from natural language API documentation. Lin et
al. [27] extract code contracts from comments and statically check
for violations in the code. Our work differs in two ways. First, we
extract web API endpoints and related information as opposed to
code elements. Second, there is arguably greater value in our
recovered index (i.e., OpenAPI Specifications) because such an index
is often not available or known to the clients: while clients of Java
libraries (or of libraries in other statically typed languages) are
always exposed at least to method signatures, callers of web APIs
often do not have such information.
There have been many efforts in information extraction from web
pages [8, 10, 14, 16, 19, 31, 32]. For example, techniques for
extracting product information from e-commerce sites [31, 32]
leverage the structure of the sites: the sites’ organizational structure
usually consists of a search page and a set of individual product
pages, which typically share the same structure because they are
generated by scripts. These techniques exploit this common structure
across the pages within the same site. However, for extracting
endpoints and other information from web API documentation
pages, we cannot rely on such an assumption: there is no standard
structure for API documentation. For many APIs, the documentation
content is semi-structured at best, written by humans
using free-form text and/or diverse HTML structures. For example,
the GitHub API documentation uses an example-based style,
where the base URL https://api.github.com and the path template
/users/{username}/orgs are embedded in free-form text
and a curl command. Other documentation uses a more structured,
reference-based style.
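To illustrate why example-based documentation is harder to process than a uniform page template, the following minimal Python sketch pulls URL candidates out of free-form text and splits each into a base URL and a path. It is not the extraction method of D2Spec or of any cited tool; the sample text DOC_TEXT, the pattern URL_PATTERN, and the function names are assumptions made for this illustration.

```python
import re
from urllib.parse import urlsplit

# Hypothetical documentation snippet in an example-based style:
# the URL only appears inside a curl command in free-form text.
DOC_TEXT = """
List the organizations of a user:

    curl https://api.github.com/users/{username}/orgs
"""

# Simplistic assumption: a URL runs until whitespace, a quote, or an angle bracket.
URL_PATTERN = re.compile(r"https?://[^\s\"'<>]+")


def split_url(url):
    """Split a URL into (base URL, path); here the base URL is scheme plus host."""
    parts = urlsplit(url)
    return f"{parts.scheme}://{parts.netloc}", parts.path


def find_candidates(text):
    """Return (base URL, path) pairs for every URL-like string found in the text."""
    return [split_url(url) for url in URL_PATTERN.findall(text)]


if __name__ == "__main__":
    for base_url, path in find_candidates(DOC_TEXT):
        print(base_url, path)
```

Such a pattern-based pass only yields candidates. Deciding which of them are real endpoints, and where the base URL ends when it includes a path prefix, requires further evidence from the surrounding documentation.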
6 CONCLUSION
In this paper, we presented D2Spec, a tool that extracts parts of
web API specifications from documentation, including base URLs,
path templates, and HTTP methods. D2Spec is based on three
assumptions: (1) documentation includes multiple web API URLs
(so that a base URL can be extracted); (2) path templates are either
denoted explicitly (e.g., using brackets) or multiple example
URLs for paths exist from which templates can be inferred; and (3)
descriptions close to the path templates contain information about
HTTP methods.
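Assumptions (1) and (2) can be made concrete with a small Python sketch. It is a simplified stand-in rather than D2Spec's actual algorithm: the base URL is taken to be the longest prefix of whole path segments shared by all URLs, and a path template is obtained by replacing segments that vary across example paths with placeholders. The function names, the placeholder names (param1, ...), and the api.example.com URLs are assumptions made for this illustration.

```python
from urllib.parse import urlsplit


def _segments(path):
    """Split a path into its non-empty segments."""
    return [segment for segment in path.split("/") if segment]


def common_base_url(urls):
    """Assumption (1): with several URLs, take their longest shared segment prefix."""
    parts = [urlsplit(url) for url in urls]
    origins = {f"{p.scheme}://{p.netloc}" for p in parts}
    if len(origins) != 1:
        return None  # URLs point at different hosts; no single base URL
    shared = []
    for column in zip(*(_segments(p.path) for p in parts)):
        if len(set(column)) != 1:
            break
        shared.append(column[0])
    return "/".join([next(iter(origins))] + shared)


def infer_path_template(paths):
    """Assumption (2): segments that differ across examples become parameters."""
    segment_lists = [_segments(path) for path in paths]
    if len({len(segments) for segments in segment_lists}) != 1:
        raise ValueError("example paths must have the same number of segments")
    template, param_count = [], 0
    for column in zip(*segment_lists):
        if len(set(column)) == 1:
            template.append(column[0])
        else:
            param_count += 1
            template.append(f"{{param{param_count}}}")
    return "/" + "/".join(template)


if __name__ == "__main__":
    urls = [
        "https://api.example.com/v1/users/alice/orgs",
        "https://api.example.com/v1/users/bob/orgs",
        "https://api.example.com/v1/repos/42",
    ]
    print(common_base_url(urls))  # https://api.example.com/v1
    print(infer_path_template(["/users/alice/orgs", "/users/bob/orgs"]))  # /users/{param1}/orgs
```

The sketch also shows why both assumptions matter: with a single example URL, neither the end of the base URL nor the variable segments of a path can be determined this way.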
One missing piece so far is understanding the data that is
returned by the APIs that we discover. We believe this is feasible
in several ways. The first is extending our extraction
from documentation: documentation often includes examples
of API usage, and we could extract those examples and statically
analyze that code for the data it expects back. Alternatively,
existing client code that uses the API could be analyzed, either
dynamically or statically, to infer data structures.
Our evaluation of D2Spec shows that our assumptions mostly
hold when it comes to extracting base URLs, path templates,
and HTTP methods. It furthermore shows that D2Spec is not only
useful for creating specifications from scratch, but also for checking
existing ones for consistency with documentation. We reported
the inconsistencies we found to the respective API providers. In the
future, we aim to expand the scope of D2Spec to also extract
information on data structures, HTTP headers, and authentication methods.