Towards Extracting Web API Specifications from Documentation

Jinqiu Yang

University of Waterloo

Waterloo, Ontario, Canada

[email protected]

Erik Wittern

IBM T.J. Watson Research Center

Yorktown Heights, NY, USA

[email protected]

Annie T.T. Ying

EquitySim

Vancouver, BC, Canada

[email protected]

Julian Dolby

IBM T.J. Watson Research Center

Yorktown Heights, NY, USA

[email protected]

Lin Tan

University of Waterloo

Waterloo, ON, Canada

[email protected]

ABSTRACT

Web API speciﬁcations are machine-readable descriptions of APIs.

These speciﬁcations, in combination with related tooling, simplify

and support the consumption of APIs. However, despite the in-

creased distribution of web APIs, speciﬁcations are rare and their

creation and maintenance heavily rely on manual eﬀorts by third

parties. In this paper, we propose an automatic approach and an

associated tool called D2Spec for extracting signiﬁcant parts of

such speciﬁcations from web API documentation pages. Given a

seed online documentation page of an API, D2Spec ﬁrst crawls all

documentation pages on the API, and then uses a set of machine-

learning techniques to extract the base URL, path templates, and

HTTP methods – collectively describing the endpoints of the API.

We evaluate whether D2Spec can accurately extract endpoints

from documentation on 116 web APIs. The results show that D2Spec

achieves a precision of 87.1% in identifying base URLs, a precision

of 80.3% and a recall of 80.9% in generating path templates, and a

precision of 83.8% and a recall of 77.2% in extracting HTTP methods.

In addition, in an evaluation on 64 APIs with pre-existing API spec-

iﬁcations, D2Sp ec revealed many inconsistencies between web API

documentation and their corresponding publicly available speciﬁ-

cations. API consumers would beneﬁt from D2Spec pointing them

to, and allowing them thus to ﬁx, such inconsistencies.

ACM Reference Format:

Jinqiu Yang, Erik Wittern, Annie T.T. Ying, Julian Dolby, and Lin Tan. 2018.

Towards Extracting Web API Speciﬁcations from Documentation. In MSR

’18: 15th International Conference on Mining Software Repositories , May 28–

29, 2018, Gothenburg, Sweden. ACM, New York, NY, USA, 11 pages. https:

//doi.org/10.1145/3196398.3196411

1 INTRODUCTION

Web Application Programming Interfaces (web APIs or simply APIs

from hereon) provide applications remote, programmatic access to

resources such as data or functionalities. For application developers,

Permission to make digital or hard copies of all or part of this work for personal or

classroom use is granted without fee provided that copies are not made or distributed

for proﬁt or commercial advantage and that copies bear this notice and the full citation

on the ﬁrst page. Copyrights for components of this work owned by others than the

author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or

republish, to post on servers or to redistribute to lists, requires prior speciﬁc permission

and/or a fee. Request permissions from [email protected].

MSR ’18, May 28–29, 2018, Gothenburg, Sweden

2018 Copyright held by the owner/author(s). Publication rights licensed to the

Association for Computing Machinery.

ACM ISBN 978-1-4503-5716-6/18/05. .. $15.00

https://doi.org/10.1145/3196398.3196411

the proliferation of such APIs provides tremendous opportunities.

Applications can take advantage of vast amount of existing data,

like location-based information from the Google Places API

, hook

into established and global social networks, using for example

Facebook’s

or Twitter’s

APIs, or outsource critical and hard to

implement functionalities, such as payment processing using the

Stripe API.

To consume APIs, though, developers face numerous challenges [

The ne ed to ﬁnd and select the APIs meeting their requirements,

both from a functional and non-functional point-of-view [

]. They

need to familiarize with the capabilities provided by an API and

how to invoke these capabilities, which typically involves studying

HTML-based documentation pages that vary across APIs. Com-

pared to using library APIs, for example of a Java library, consumers

of web APIs do not have interface signatures readily available or

accessible via development tools. In addition, web APIs are under

the control of independent providers who can change the API in a

way that can break client code [

]. Even for supposedly stan-

dardized notions such as the APIs’ URL structures, HTTP methods,

or HTTP status codes, the semantics and implementation styles

diﬀer across APIs [22].

One attempt to mitigate these problems is to describe APIs in a

well-deﬁned way using web API speciﬁcations.

Web API speciﬁ-

cation formats, like the OpenAPI Speciﬁcation [

] or the RESTful

API Modeling Language (RAML) [

], describ e the URL templates,

HTTP methods, headers, parameters, and input and output data

required to interact with an API. Being machine-understandable,

web API speciﬁcations are the basis for various tools that support

API consumption: speciﬁcation are input for generating consis-

tent API documentation pages

, they are used to catalog APIs

, to

auto-generate software development kits that wrap APIs in vari-

ous languages

, or even to statically check client co de for possible

errors [29].

https://developers.google.com/places

https://developers.facebook.com/docs/graph-api

https://developer.twitter.com/en/docs/api-reference-index

https://stripe.com/docs/api

We acknowledge that the term speciﬁcation sometimes means a much more compre-

hensive description of an application’s or system’s syntax and semantics. Even though

web API speciﬁcations like OpenAPI Speciﬁcation do not provide any semantics, we

use the term here nonetheless due to its prevalence in the web API community.

e.g., https://editor.swagger.io/ or https://github.com/Rebilly/ReDoc

e.g., https://apiharmony-open.mybluemix.net/ or https://any-api.com

e.g., https://swagger.io/swagger-codegen or https://apimatic.docs.apiary.io

Towards Extracting Web API Specifications from Documentation MSR ’18, May 28–29, 2018, Gothenburg, Sweden

focuses on extracting three components of API speciﬁcations: base

URLs, path templates, and descriptions (i.e., HTTP methods).

The web API’s

base URL

is essential in a web API speciﬁca-

tion: any URL of a Web API request must contain the base URL

and the relative path of the corresponding endp oint. More for-

mally, a base URL is a common preﬁx of all URLs for web API

invocations, excluding other URLs such as documentation pages.

In OpenAPI speciﬁcations, a base URL is constructed via three

ﬁelds: a scheme (e.g.,

https

), the host (e.g.,

api.instagram.com

and optionally a base path (e.g.,

/v1

). In many APIs (e.g., Insta-

gram API ), the base URL is the longest common preﬁx of all the

URLs for invoking the web API. However, for other APIs, such as

Microsoft’s The DevTest Labs Client API, the longest common pre-

ﬁx is

https://management. azure.com/subscriptions

while

the actual base URL is

https://management.azure.com

, because

/subscriptions

is deﬁned to be part of the endpoint paths. Whether

a base URL is indeed the longest common preﬁx is a design decision

of the API provider.

path template

deﬁnes ﬁxed components of a URL as well

as ones to be instantiated dynamically. For example, in the path

template

/users/{userId}/posts

, the part

{userId}

is a path

parameter that needs to be instantiate d with a concrete value of a

user ID before performing a request. A path parameter is typically

denoted via enclosing brackets (i.e., “{}”, “[]”, “

”, or “()”) or a

preﬁx “:”.

D2Spec focuses on one type of

description

associated with the

path template: the

HTTP metho d

. It reﬂects the type of interac-

tion to be p erformed on a resource exposed by a web API. While

many web APIs long relied on

GET

and

POST

, now a much broader

spectrum of methods is used [

]. As proposed in related work, we

denote every valid combination of a path template and an HTTP

method as an endpoint of the API [26].

D2Spec combines a set of techniques to infer the base URL, path

templates, and HTTP methods, given a seed documentation page.

Figure 3 provides an overview. In the ﬁrst step, D2Spec uses a simple

crawling approach to obtain complete documentation sources for an

API. The crawler, starting from the provided seed page, iteratively

downloads all linked sub pages. For crawling, D2Spec uses the

headless browser Splash

to execute any JavaScript on each page

before downloading it, as this may impact the resulting HTML

structure. In order to extract the

base URL

(see Section 2.1), D2Spec

ﬁrst extracts all candidate URLs that can represent a web API call

from the crawled documentation pages; D2Spec next leverages

machine learning classiﬁcation to determine for each candidate

URL whether or not it is likely to represent an invocation to the

documented web API. Finally, D2Spec selects the longest common

preﬁx of these URLs. For the

path templates

(see Section 2.2),

D2Spec leverages the URLs likely to be invocations of the API and

extracts the URL paths (the part of the URL after removing the

base URL). From these paths, using an agglomerative hierarchical

clustering algorithm, D2Spec infers path templates by identifying

path parameters and aggregating paths. D2Spec then ﬁnds the

descriptions co-lo cated with the URL paths in the documentation,

from which it extracts the

HTTP method(s)

(see Se ction 2.3) that

can be combine d with each path template (forming endpoints).

https://scrapinghub.com/splash/

2.1 Base URL Extraction

Identifying the base URL in online documentation is not as straight-

forward as searching for keywords or templates such as “The base

URL is

base URL

”; base URLs are often not explicitly mentioned

in the documentation. Rather, base URLs are often included as

part of depicted examples of web API requests, as in the case of

the GitHub documentation shown in Figure 1. Thus, D2Spec’s ap-

proach is to infer the base URL from all the URLs provided in online

documentation.

Step 1 - Extracting URLs

As a ﬁrst step, D2Spec extracts all candidate URLs in the documen-

tation that represent web API calls. These candidate URLs consist

of standard URLs (according to the W3C deﬁnition [

]) and URLs

containing path parameters enclosed in “

{}

”, “

[]

”, “

()

”, or “

”.

We do not include in this list URL links within

href

attributes of

link tags, nor inside

tags: URLs that represent web API

calls are one of the main content in a documentation page to be

communicated to the readers; hence, such URLs tend to be rendered

in the documentation rather than appear as links or in scripts. Even

excluding such links, some of the URLs in the candidate list may

not represent web API calls, e.g., URLs of related or even unrelated

resources. In fact, we studied a set of 15 web APIs

and found that

42% of the contained URLs are not invocations of web APIs.

Step 2 - Identifying URLs of web API calls

To ﬁlter out spurious URLs that do not represent web API calls,

we use supervised machine learning classiﬁcation to determine

whether each URL from Step 1 is likely to represent a web API

call. For each URL, D2Spec generates two categories of features:

properties of documentation pages and properties of the URL itself.

The ﬁrst category consists of four features related to properties of

the do cumentation pages from which URLs were extracted:

• clickable

: True, if the URL is part of the link text enclosed

in <a> tags with the “href” attribute.

• code_tag

: True, if the URL appears inside

<code>

tags, and

false otherwise.

• within_json

: True, if the URL is inside valid JSON within a

pair of matched HTML tags.

• same_domain_with_doc_link

: True, if the URL has the same

host name as the URL of the documentation page itself, and

false otherwise.

The second category consists of four features related to properties

of the URL itself:

• query_parameter

: True, if the URL contains query parame-

ters which are denoted by

and/or

. For example, in the URL

https://api.github.com/.../issues?state=closed

state

is a parameter with the value

closed

. URLs with parameters

are more likely to depict web API calls.

• api_convention

: The number of conventions exhibited by

the URL indicates whether it likely corresponds to a web

API call. We include three conventions described in previous

work [

]: (1) whether the URL contains the term

rest

; (2)

whether the URL contains the term

api

; and (3) whether

the URL contains version related information, including the

This data set of

web APIs, which we stu died to design D2Spec, is independent of

our evaluation set.

Towards Extracting Web API Specifications from Documentation MSR ’18, May 28–29, 2018, Gothenburg, Sweden

Algorithm 1: Clustering algorithm

Input: paths /*a set of paths that represent endpoints*/

Input: T /* Threshold for merging clusters */

Output: c

, ..., c

/*each cluster c

groups the paths invoking

the same endpoint*/

1 Function hierarchical_clustering (T, paths)

2 C ← make each path a singleton cluster

3 do

4 proдress ← f alse

5 foreach c

, c

∈ C with min dist (c

, c

) do

6 if dist (c

, c

) < T then

7 proдress ← true

8 C ← C −



, c



∪



merдe



, c





9 end

10 end

11 while |C| > 1 ∧ proдress;

We propose an iterative algorithm to infer whether a path seg-

ment is a ﬁxed segment of an endpoint, a path parameter, or an

instantiated value. The algorithm consists of two main ideas. First,

it uses clustering to group paths that we infer to invoke the same

endpoint. For example, if we found four paths in the documentation

for an API:

/users/{username}/repos

/users/alice/repos

/users/alice/received_events

/users/bob/received_events

the clustering algorithm groups the ﬁrst two into one cluster and

the last two into the second cluster. From the ﬁrst cluster, we know

that

alice

is an instantiated value of

{username}

. Second, in sub-

sequent iterations, the algorithm then leverages the fact that

alice

is an already inferred instantiated value to improve the clustering in

the next iteration, marking both

alice

and

bob

as two instantiated

values.

D2Spec uses hierarchical agglomerative clustering [

], as de-

scribed in Algorithm 1. Given a set of paths with the same number

of segments, the goal is to group the paths so that paths in a cluster

invoke the same endpoint. We begin with one data point (i.e., one

path) per cluster ( line 2 in Algorithm 1). At each iteration (lines

4-10), we calculate the distance among all the pair-wise clusters and

picks the pair with the shortest distance (line 5) to merge (line 8).

For our implementation, the distance function (Algorithm 2) consid-

ers two paths the “closest” if they have exactly the same segments

– each matching concrete (i.e. not a path parameter) segment

gets

one point (Algorithm 2, line 8). Because two paths can never invoke

the same endpoint when they have a diﬀerent number of segments,

the distance of such a pair is inﬁnite (Algorithm 2, line 5). If the

-th segment of a path is a path parameter, the distance function

considers the segment a match on the

-th segment of any other

paths of the same length, with a discounted point of 0.8 instead

of 1 (Algorithm 2, line 8). The clustering algorithm stops when

the next pair of clusters to merge has the distance larger than a

threshold T (Algorithm 1, lines 6, 7, and 11). In our implementation,

the threshold is set to 1, meaning that we allow paths in a cluster

to have a single path segment diﬀerent from each other.

Algorithm 2: Distance functions

1 Function dist(cluster c

, cluster c

)

2 return min

path

∈c

, path

∈c

dist_sinдles(path

, path

)

3 Function dist_singles(list of segments S

, list of segments S

)

4 if |S

| , |S

| then

5 return ∞

6 end

7 else

8 sim ←



concrete(S

[i]) ∧ S

[i] = S

[i]

0.8 ×

param(S

[j]) ∨ param(S

[j])



9 d ← |S

| − sim

10 return d

11 end

To leverage the instantiated values that are already inferred,

such as

alice

in the

received_events

cluster in the example, we

adapt the standard hierarchical agglomerative algorithm as follows

(Algorithm 3): The algorithm keeps track of a list of instantiated

values of the path parameters per API (line 9), and stops when

no additional instantiated values are found from the function in-

fer_parameter_value (lines 10 and 12). Each iteration starts by up-

dating the paths with the currently known instantiated values (lines

5-7). These paths are the input to the hierarchical agglomerative

clustering algorithm (line 8). Clustering is performed after updating

the newly instantiated values because when new path parameters

are identiﬁed, the similarities will be updated. Within each cluster,

new values of path parameters are inferred (line 10, the call to in-

fer_parameter_value). This adapted algorithm can correctly cluster

the four paths into two endpoints:

/users/{username}/repos

and

/users/ {username}/received_events.

2.3 HTTP Methods

In web API documentation, the paths (whether or not path parame-

ters are denoted explicitly or not) are typically co-located with other

valuable information that D2Spec aims to extract, namely valid

HTTP methods to use with a path template (

GET

PUT

DELETE

...).

We call the context in which this information exists a description

block of a path template. In this section, we ﬁrst describe how we

locate the description block associated with a path template, and

then how D2Spec extracts the HTTP method.

D2Spec uses the URLs from the original documentation page

that match with the inferred path templates as anchors in docu-

mentation pages (in HTML) to locate the scope of the description

block. If there are multiple URLs in the page that match the path

template, D2Spec combines the contexts of all the URLs as the de-

scription block of the path template. D2Spec locates a description

block for each path template as follows. First, D2Spec parses the

documentation page into a DOM tree (Figure 4), with each node

representing the rendered text from the fragment of the HTML

page enclosed in a pair of matched tags. Second, D2Spec marks the

nodes whose rendered text contains at least one URL that matches

a path template as gray, and locates the description block for each

of these nodes. More speciﬁcally, for each gray node, D2Spec com-

bines its description block by expanding to include (1) the siblings

Towards Extracting Web API Specifications from Documentation MSR ’18, May 28–29, 2018, Gothenburg, Sweden

selected Google APIs in our evaluation. Overall, from APIs.guru, the

APIs used in the evaluation consist of 48 APIs from the individual

providers (we will call this set GuruIndividual) and 20 Google APIs

(we will call this set GuruGoogleSample), thus a total of 68 APIs.

API Harmony

[

] is a catalog of web APIs that helps developers

to ﬁnd and choose web APIs, and learn how to use them. API Har-

mony collects information on public web APIs. When we collected

the evaluation data, API Harmony listed 1

019 web APIs in total,

772 of which contained links to the API’s documentation page. We

crawled the links to these documentation pages with the help of

API Harmony’s sitemap.xml ﬁle. We took a sample of 48 APIs (Har-

monySample) from the 681 APIs from API Harmony and manually

veriﬁed that the documentation indeed describes the API (from 772,

we excluded 91 APIs that overlapped with APIs.guru).

From these two sources, collectively, we obtained 116 unique

APIs. For RQ1 and RQ2, we will examine the set of 116 APIs as

follows:

•

The GuruIndividual dataset consists of the 48 APIs in APIs.guru

that are not from Google or Microsoft.

•

The GuruGoogleSample dataset consists of the 20 Google

APIs from APIs.guru.

•

The HarmonySample dataset consists of a sample of 48 APIs

from the 681 APIs from API Harmony.

3.2 RQ1: Can D2Spec accurately extract web

API speciﬁcations from documentation?

Approach:

To assess the accuracy of D2Spec, we aim to determine

how well the produced speciﬁcations match the input, which is the

online documentation.

To increase the generalizability of our results, we performed

the evaluation in two stages: First, we applied D2Spec to all 68

APIs obtained from APIs.guru (GuruIndividual and GuruGoogle-

Sample). These APIs do not contain the 15 APIs we used to train

D2Spec (see Section 2.1). We decided to use all APIs from APIs.guru

in this evaluation because the required manual examination of

them is also required to conduct the se cond research question, and

there are limited APIs in APIs.guru to study. Second, we performed

the evaluation on HarmonySample, which is a completely separate

data source from APIs.guru. The results for these APIs thus bet-

ter quantify how well D2Spec can potentially generalize to other

API documentation pages. We limited the number of APIs in Har-

monySample because the evaluation requires signiﬁcant manual

eﬀort.

Overall, for RQ1, we considered 116 APIs (GuruIndividual + Gu-

ruGoogleSample + HarmonySample). To create the ground truth, we

manually identiﬁed base URLs, path templates and HTTP methods

from web API documentation. We then compared the ground truth

with the speciﬁcations created by D2Spec for the same API. For

base URLs, we calculated precision, which is the percentage the

base URLs generated by D2Spec that are correct. Since each API

documentation describes only one base URL, and by design D2Spec

generates one base URL for each API documentation, recall is equal

to precision for base URLs. For path templates and HTTP methods,

we consider precision to be the percentage the results generated

by D2Spec that are correct and recall to be the percentage of the

Table 1: Precision and recall of D2Spec

HarmonySample

(48 APIs)

GuruGoogleSample

(20 APIs)

GuruIndividual

(48 APIs)

All APIs

(116 APIs)

Base URL

# of APIs with

correct base URL 45 16 40 101

Precision 93.8% 80.0% 83.3% 87.1%

Path Template

# created D2Spec 967 188 1,331 2,486

# in documentation

(with correct base URLs) 747 196 1,526 2,469

# matches 683 187 1,127 1,997

Precision 70.6% 99.5% 84.7% 80.3%

Recall 91.4% 95.4% 73.9% 80.9%

HTTP Method

# created D2Spec 817 188 1,142 2,147

# in documentation

(with correct base URLs) 815 219 1,297 2,331

# matches 658 184 957 1,799

Precision 80.5% 97.9% 83.8% 83.8%

Recall 80.7% 84.0% 73.8% 77.2%

given information type (e.g., path templates) in the documentation

that D2Spec correctly generates. Because path templates and HTTP

methods can only be extracted if a base URL was previously de-

tected (see Se ctions 2.2 and 2.3), we fo cus on APIs for which D2Spec

was able to do so in these parts of the evaluation.

Results:

D2Spec recovered base URLs with a precision of 87.1%,

inferred path templates with a precision of 80.3% and a recall of

80.9%, and extracted HTTP methods with a precision of 83.8% and

a recall of 77.2%. Table 1 provides a break-down of the results.

3.2.1 Base URL Results. For the 116 web APIs, D2Spec generated

correct base URLs for 101 of them, yielding a precision of 87.1%. In

the subsequent evaluation for path templates and HTTP methods

for RQ1, the evaluation was based on the 101 APIs.

Upon manual inspection, we found that there were two reasons

that D2Spec generated incorrect base URLs. First, when the docu-

mentation described multiple API versions, D2Spec was unable to

tell which one was preferred by the writer of the documentation. For

example, in the documentation of the CityContext web API, two end-

points were described with

https://api.citycontext.com/v1/postcodes

and

https://api.citycontext.com/v2/<location>

. D2Spec determined the

base URL to be

https://api.citycontext.com

by selecting the longest

common preﬁx of these two URLs. However, the oﬃcial documen-

tation listed

https://api.citycontext.com/v2

as base URL. Second, al-

though the classiﬁcation achieved a good precision, it is unable to

remove all URLs that are not web API requests. Such URLs with the

same preﬁx caused D2Spec to generate incorrect base URLs when

they outnumbered the true web API requests.

3.2.2 Path Template Results. D2Spec was able to generate the

majority (80.9% recall) of path templates correctly (80.3% precision)

for the 101 web APIs whose base URLs are correctly identiﬁed by

D2Spec. There were in total 2,469 path templates described in the

documentation. D2Spec generated 2,486 path templates in total,

and 1,997 of them were correct. Thus, the overall precision of path

template extraction was 80.3%, and the recall was 80.9%. Figure 5

illustrates stacked histograms on precision and recall of the path

MSR ’18, May 28–29, 2018, Gothenburg, Sweden Jinqiu Yang, Erik Wiern, Annie T.T. Ying, Julian Dolby, and Lin Tan

(a) Precision of D2Spec on the 101 web APIs.

(b) Recall of D2Spec on the 101 web APIs.

Figure 5: Stacked histograms showing precision and recall

of D2Spec on the

101

web APIs for which the base URL was

correctly extracted.

template extraction on the 101 APIs that D2Spec can generate

correct base URLs for. For example, Figure 5a shows that for 58 (out

of 101) web APIs, D2Spec achieves a precision above 90%.

3.2.3 HTTP Method Results. D2Spec achieved a precision of

83.8% and a recall of 77.2% in extracting HTTP methods for the

path templates in the evaluated 101 web APIs. In total, there were

2,331 endpoints with the associated HTTP methods described in

the web API documentation evaluated; D2Spec produced a result

for 2,147 of them and 1,799 HTTP methods were correct. D2Spec

failed to locate the correct HTTP method when its position in the

documentation was far away from the path templates. For example,

the Mandrillapp do cumentation has a consolidated description for

all endpoints: “All API calls should be made with HTTP POST”,

instead of listing the method

POST

individually for each of the path

template. Thus, D2Spec failed to identify correct method names for

Mandrillapp’s path templates.

3.3 RQ2: Can D2Spec be used to identify

inconsistencies between a pre-existing API

speciﬁcation and the API’s documentation,

pointing to the two being out of

synchronization?

Approach:

We fo cused on the 68 APIs from APIs.guru (GuruIn-

dividual + GuruGoogleSample). For these APIs, we compared the

speciﬁcations generated by D2Spec (from hereon denoted as

Tool-

Specs

) with the speciﬁcations provided by APIs.guru (from hereon

denoted as

GuruSpecs

). Our comparison focused, again, on the

three pieces of information extracted by D2Spec, namely, base URLs,

path templates, and HTTP methods. For

base URLs

, we compared

whether the ones extracted by D2Spec per API match those deﬁned

in the OpenAPI speciﬁcations. We obtained base URLs from the

speciﬁcations by concatenating the schemes, host, and basePath

ﬁelds. We then compared whether the extracted

path templates

and the associated

HTTP methods

match the ones in the speciﬁ-

cations.

For each of the three extracted pieces of information, we counted

the number of matches, and then manually inspected the mis-

matches to determine their origin.

Results:

We found that mismatches between GuruSpecs and Tool-

Specs were partly caused by limitations of D2Spec, and partly by

publicly-available speciﬁcations (i.e., the GuruSpecs) being out of

synchronization with API documentation: Our manual inspection

showed that for base URLs and HTTP methods, GuruSpecs were

often up-to-date with documentation, and mismatches between

ToolSpecs and GuruSpecs were due to inaccuracies of D2Spec. How-

ever, for path templates, our manual inspection found that many

mismatches were due to the documentation and GuruSpecs being

out of synchronization with each other, or due to errors in the doc-

umentation. Speciﬁcally, for the 68 APIs evaluated, we identiﬁed

394 path templates from 24 APIs where GuruSpecs and the docu-

mentation were diﬀerent. One reason for the mismatches is that as

web APIs evolve, API providers tend to keep documentation up-to-

date since it is, as a human-readable medium, often the ﬁrst source

that developers inquire to use APIs. In the following, we present

the results of manually examining the mismatches between path

templates in GuruSpecs and ToolSpecs. We found that mismatches

fall into four categories:

• Inconsistencies

were mismatches resulting from the docu-

mentation and speciﬁcation in GuruSpecs being inconsistent

with each other. Such inconsistencies were not caused by

deﬁciencies of D2Spec or by errors in the documentation,

but indicated that the API provider should either update the

documentation or the speciﬁcation. For example, in the doc-

umentation of Slack, there were eight endpoints on getting

information on members from a given Slack team. The paths

of all eight endpoints start with

/users.<action>

. How-

ever, only one path template was listed in the speciﬁcation–

/users.list

; the remaining seven (e.g.,

/users.info

and

/users.setPresence

) were missing in the speciﬁcation. In

this case, we considered that there were seven inconsisten-

cies between the documentation and the speciﬁcation.

• Errors in the documentation

referred to obvious errors

in the documentation, e.g., typos. Incorrect information in

the documentation led D2Spec to generate path templates

that, while being labeled as correct with regard to RQ1, did

not match the speciﬁcations. For example, in the documen-

tation of the ClickMeter API, many path templates starting

with

/datatpoints

were misspelled as

/datapoints

. Thus,

D2Spec generated several mismatched path templates com-

pared to the oﬃcial speciﬁcation.

• Partially correct path templates

occurred if D2Spec failed

to infer path parameters correctly (i.e., the path templates

Towards Extracting Web API Specifications from Documentation MSR ’18, May 28–29, 2018, Gothenburg, Sweden

generated by D2Spec still contain path parameter values).

A common reason for this problem was that the documen-

tation contained only one instance of an instantiated value

for a path parameter. In such cases, even though D2Spec’s

clustering-based algorithm can correctly place the path in its

own cluster, D2Spec could not distinguish which segments

of the path were instantiated values and which ones are ﬁxed

segments.

• Deﬁciencies in the algorithm

occurred when D2Spec failed

to extract certain path templates or generated incorrect path

templates because of deﬁciencies in its design. For instance,

D2Spec failed to extract certain path templates if the way the

path templates appeared in the documentation was beyond

the scope of the conventions used by D2Spec. D2Spec relies

on the format of URL and relative path to extract path tem-

plate information. If the path templates are not presented

as such, D2Spec will not extract them correctly. For exam-

ple, HealthCare.gov’s documentation describes a series of

path templates as follows: “The following content types are

available: articles, blog, questions, glossary, states, and top-

ics. The request structure is https://.../api/:content-type.json.”

D2Spec extracts one path template–

/api/{content-type.

json}

instead of six path templates (e.g.,

/api/articles

and

/api/blog

). On the other hand, using the conventions

mentioned above, D2Spec also extracted false path templates

which did not describe path templates in the documentation.

For example, the documentation of dweet.io listed a ﬁle path

–

/play/definitions

, which was not a true path template.

Figure 6 visualizes the comparison of path templates from Gu-

ruSpecs and ToolSpecs. The breakdown of the mismatches from both

aspects (

¬ToolSpecs ∩ GuruSpecs

and

ToolSpecs ∩ ¬GuruSpecs

) is

shown as well. In total, there are 929 path template matches be-

tween ToolSpecs and GuruSpecs. Among the 1509 path templates

generated by D2Spec, there are 590 mismatches with the path tem-

plates deﬁne d in the GuruSpecs. Our manual analysis shows that 394

(67

8%) of the mismatches are caused by de-synchronization, i.e.,

“inconsistencies” and “errors in the doc.”. The other two categories

– “partially correct” and “deﬁciencies” – are due to the limitations

of D2Sp ec.

Overall, while the manual examination of mismatches also pointed

to some weaknesses of D2Spec, it also highlights that D2Spec can

be used to ﬁnd documentation and existing speciﬁcations being

out of synchronization. To focus on this aspect, Figure 7 shows

a histogram on the percentage of mismatches that are caused by

de-synchronization for each web API. It shows that, for example,

for 11 web APIs, over 90% of the mismatches detected by D2Spec

indicate that documentation and pre-existing sp eciﬁcations were

out of synchronization with each other.

4 THREATS TO VALIDITY AND DISCUSSIONS

Generalizability

. D2Spec generates base URL from documenta-

tion by ﬁrstly identifying URLs that represent web API calls through

a classiﬁcation algorithm. The classiﬁcation algorithm uses a set of

pre-labeled URLs for training. We built the training set from a set

of web API documentation that were independent of the ones used

in the evaluation. The precision of D2Spec in base URL extraction

Figure 6: Comparison between path templates in speciﬁca-

tions from D2Spec and the ones from APIs.guru.

Figure 7: Percentage of “Inconsistencies” and “Errors on the

documentation” between documentation and spe ciﬁcations

by APIs.guru across APIs from APIs.guru

may be diﬀerent if we use a diﬀerent training set. However, we

mitigated this bias by cho osing a random set of web APIs for build-

ing the training set, and evaluating on a diﬀerent set of APIs (the

GuruIndividual and GuruGoogleSample datasets described in Sec-

tion 3.1). In addition, we evaluated on a completely separate dataset

(HarmonySample) and even achieved a better precision (97.5%) com-

pared to a 80.0% precision on GuruGoogle and a 84.1% precision

on GuruIndividual, demonstrating that our approach is likely to

generalize to other unseen documentation pages.

Thresholds Used in the Clustering-Based Path Template Ex-

traction

. D2Spec leverages an iterative clustering-based algorithm

to identify path parameters by inferring values of path parame-

ters from similar web API calls. The proposed algorithm contains

thresholds to control the hierarchical clustering (e.g., determining

whether two web API calls are similar through a threshold T, see

Algorithm 1 in Section 2.2). In this evaluation, we set the thresholds

based on our observations on the training set. We found that the

chosen thresholds also worked well for the evaluation set. Nev-

ertheless, future studies should investigate the eﬀects of diﬀerent

thresholds on the path template extraction results.

MSR ’18, May 28–29, 2018, Gothenburg, Sweden Jinqiu Yang, Erik Wiern, Annie T.T. Ying, Julian Dolby, and Lin Tan

Documentation with Identical Structures

. APIs from the same

providers may have identical documentation structures (e.g., Google

web APIs). Documentation structures may be diﬀerent across dif-

ferent API providers. To show the generalizability of our approach,

we applie d our approach to APIs from diﬀerent providers: Our eval-

uation set contains 120 APIs from 98 diﬀerent web API providers.

5 RELATED WORK

We discuss related works that address extracting or inferring web

API speciﬁcations, as well as works that rely on information ex-

traction approaches, both for extracting software entities and for

extracting any type of information from web pages generally.

Hanyang et al. describe AutoREST, a tool that, as does this work,

aims to extract web API speciﬁcations from HTML-based documen-

tation [

]. AutoREST uses a preprocessing step to select crawled

web pages that likely contain information relevant to the speciﬁca-

tion, which could be used in combination with the here presented

work. AutoREST relies on a set of simple, ﬁxed rules to extract in-

formation from selected HTML pages, whereas the here presented

methods are designed to be applicable also in light of stark diﬀer-

ences in the way APIs are do cumented. We furthermore present

a more extensive and detailed evaluation in this work. Gao et al.

propose to infer constraints on the data required by web APIs (i.e.,

payload or parameters values) by mining both, API documentation

and error-messages [

]. In contrast to the here presented work, the

focus is thus on data deﬁnitions, making this work complementary

to ours.

Further related works on extracting web API spe ciﬁcations rely

on sources of information other than documentation. Wittern and

Suter use dynamic traces in form of web-server logs [

]. The

SpyREST tool, presented by Sohan et al., intercepts HTTP requests

to an API using a proxy and then attempts to infer the API spec-

iﬁcation from them [

]. In later work, the same authors discuss

the application of SpyREST at Cisco, where requests to the proxy

are driven by existing tests against APIs [

]. Ed-douibi et al. pro-

pose an approach to generate web API speciﬁcations from example

request-response pairs [

]. One beneﬁt of our approach, as com-

pared to these works, is that API documentation is typically publicly

available, while access to web logs are limited to those with access

to the private web servers, proxying may not be an option, and

providing extensive examples for API usages may require (manual)

eﬀort, which could be targeted to generate speciﬁcations directly.

Many software engineering researchers have looked into the

problem of identifying co de elements—more speciﬁcally, Java code

elements such as method signatures and calls—from API documen-

tation. Dagenais and Robillard proposed an approach that extracts

code elements from API documentation and links the elements to

an index of known code elements, i.e., signatures from a Java li-

brary [

]. Subramanian et al. subsequently applied this approach to

identify code elements on Stack Overﬂow posts and augmented the

code elements in the posts with links to their oﬃcial JavaDo c [

Rigby and Robillard use a light-weight, regular expression base d

approach to identify code elements that relaxes the requirement

on a known index [

]. Another line of work focuses on extracting

more complex speciﬁcations on code entities from natural language

descriptions. Pandita et al. [

] extract method pre-conditions and

post-conditions from natural language API documentation. Lin et

al. [

] extract code contracts from comments and statically check

for violations in the code. Our work diﬀers in two ways. First, we

extract web API endpoints and related information as opposed to

code elements. Second, there is arguably greater value in our re-

covered index (i.e., OpenAPI Speciﬁcations) because such an index

is often not available or known to the clients; while clients of Java

libraries (or other statically-typed languages) are always exposed

at least to method signatures, but callers of web APIs often do not

have such information.

There have been many eﬀorts in information extraction on web

pages [

]. For example, techniques for ex-

tracting product information from e-commerce sites [

] lever-

age the structure from the sites: the sites’ organizational structure

usually consists of a search page and a set of individual product

pages, which typically have the same structure as they are gener-

ated from scripts. These techniques exploit this common structure

across the pages within the same site. However, for extracting

endpoints and other information from web API documentation

pages, we cannot rely on such an assumption: There is no stan-

dard structure for API documentation. For many API documen-

tation the content is semi-structured at best, written by humans

using free-form text and/or diverse HTML structures. For exam-

ple, the GitHub API documentation uses an example-based style,

where the base URL

https://api.github.com

and the path tem-

plate

/users/{username}/orgs

are embedded in free-form text

and a

curl

command. Other documentation uses a more structured,

reference-based documentation style.

6 CONCLUSION

In this paper, we presented D2Spec, a tool which extracts parts of

web API speciﬁcations from documentation, including base URLs,

path templates, and HTTP methods. D2Spec is based on the three

assumptions: (1) documentation includes multiple web API URLs

(so that a base URL can be extracted); (2) path templates are either

denoted explicitly (e.g., using brackets) or that multiple example

URLs for paths exist from which templates can be inferred; and (3)

descriptions close to the path templates contain information about

HTTP methods.

One missing piece so far is understanding the data that is re-

turned by the APIs that we discover. We believe it is feasible to

do this in several possible ways. The ﬁrst is extending our extrac-

tion from documentation; documentation often includes example

of API usage, and we could extract those examples and statically

analyze that code for what data it expects back. Given example API

usage, existing client code could be analyzed either dynamically or

statically to infer data structures.

Our evaluation of D2Spec shows that our assumptions hold

mostly true when it comes to extracting base URLs, path templates,

and HTTP methods. It furthermore shows that D2Spec is not only

useful for creating speciﬁcations from scratch, but also for checking

existing ones for consistency with documentation. We contacted

API providers for the found inconsistencies. In the future, we aim

to expand the scop e of D2Spec to also extract information on data

structures, HTTP headers, and authentication methods.

Towards Extracting Web API Specifications from Documentation MSR ’18, May 28–29, 2018, Gothenburg, Sweden

REFERENCES

[1] 2016. scikit-learn. (2016). http://scikit-learn.org/stable/index.html.

[2] 2016. URL - W3C. (2016). https://www.w3.org/TR/url-1/.

[3] 2017. API Harmony. (2017). https://apiharmony-open.mybluemix.net.

[4]

2018. APIs.guru - Wikipedia for Web APIs. (2018). https://apis.guru/openapi-

directory.

[5]

2018. OpenAPI Speciﬁcation. (2018). https://github.com/OAI/OpenAPI-

Speciﬁcation.

[6]

2018. ProgrammableWeb API Directory. (2018).

https://www.programmableweb.com/apis/directory/.

[7] 2018. RESTful API Modeling Language (RAML). (2018). https://raml.org/.

[8]

Manuel Álvarez, Alberto Pan, Juan Raposo, Fernando Bellas, and Fidel Cacheda.

2008. Extracting lists of data records from semi-structured web pages. Data &

Knowledge Engineering 64, 2 (2008), 491–509.

[9]

Hanyang Cao, Jean-Rémy Falleri, and Xavier Blanc. 2017. Automated Generation

of REST API Speciﬁcation from Plain HTML Documentation. In Service-Oriented

Computing (ICSOC). Springer International Publishing, Cham, 453–461.

[10]

Valter Crescenzi, Giansalvatore Mecca, Paolo Merialdo, et al

2001. RoadRunner:

Towards Automatic Data Extraction from Large Web Sites. In Proceedings of the

International Conference of Very Large Data Bases (VLDB). Morgan Kaufmann,

109–118.

[11]

Barthélémy Dagenais and Martin P. Robillard. 2012. Recovering Traceability

Links Between an API and Its Learning Resources. In Proceedings of the 34th

International Conference on Software Engineering (ICSE). IEEE Press, 47–57.

[12]

Hamza Ed-douibi, Javier Luis Cánovas Izquierdo, and Jordi Cabot. 2017. Example-

Driven Web API Speciﬁcation Discovery". In Modelling Foundations and Applica-

tions. Springer International Publishing, Cham, 267–284.

[13]

Tiago Espinha, Andy Zaidman, and Hans-Gerhard Gross. 2015. Web API Growing

Pains: Loosely Coupled yet Strongly Tied. Journal of Systems and Software 100

(2015), 27–43.

[14]

Emilio Ferrara, Pasquale De Meo, Giacomo Fiumara, and Robert Baumgartner.

2014. Web data extraction, applications and techniques: a survey. Knowledge-

based systems 70 (2014), 301–323.

[15]

Chushu Gao, Jun Wei, Hua Zhong, and Tao Huang. 2014. Inferring Data Contract

for Web-Based API. In IEEE International Conference on Web Services (ICWS).

IEEE, 65–72.

[16]

Alberto HF Laender, Berthier A Ribeiro-Neto, Altigran S da Silva, and Juliana S

Teixeira. 2002. A brief survey of web data extraction tools. ACM Sigmod Record

31, 2 (2002), 84–93.

[17]

Jun Li, Yingfei Xiong, Xuanzhe Liu, and Lu Zhang. 2013. How Does Web Service

API Evolution Aﬀect Clients?. In 2013 IEEE 20th International Conference on Web

Services (ICWS). 300–307.

[18] Manning. 2009. Introduction to Information Retrieval. Cambridge Press.

[19]

Jussi Myllymaki. 2002. Eﬀective web data extraction with standard XML tech-

nologies. Computer Networks 39, 5 (2002), 635–644.

[20]

Rahul Pandita, Xusheng Xiao, Hao Zhong, Tao Xie, Stephen Oney, and Amit

Paradkar. 2012. Inferring Method Speciﬁcations from Natural Language API

Descriptions. In Proceedings of the 34th International Conference on Software

Engineering. 815–825.

[21]

Peter C Rigby and Martin P Robillard. 2013. Discovering essential code elements

in informal documentation. In Proceedings of the 35th International Conference on

Software Engineering (ICSE). IEEE Press, 832–841.

[22]

Carlos Rodríguez, Marcos Baez, Florian Daniel, Fabio Casati, Juan Carlos Tra-

bucco, Luigi Canali, and Gianraﬀaele Percannella. 2016. REST APIs: A Large-Scale

Analysis of Compliance with Principles and Best Practices. In Web Engineering.

Springer International Publishing, Cham, 21–39.

[23]

S M Sohan, Craig Anslow, and Frank Maurer. 2015. SpyREST: Automated RESTful

API Documentation Using an HTTP Proxy Server. In Proceedings of the 30th

IEEE/ACM International Conference on Automated Software Engineering (ASE).

IEEE, 271–276.

[24]

S. M. Sohan, C. Anslow, and F. Maurer. 2017. Automated example oriented REST

API do cumentation at Cisco. In 2017 IEEE/ACM 39th International Conference on

Software Engineering: Software Engineering in Practice Track (ICSE-SEIP). 213–222.

[25]

Siddharth Subramanian, Laura Inozemtseva, and Reid Holmes. 2014. Live API

documentation. In Proceedings of the 36th International Conference on Software

Engineering. ACM, 643–652.

[26]

Philippe Suter and Erik Wittern. 2015. Inferring web API descriptions from usage

data. In Proceedings of the Third IEEE Workshop on Hot Topics in Web Systems and

Technologies. 7–12.

[27]

Lin Tan, Ding Yuan, Gopal Krishna, and Yuanyuan Zhou. 2007. /*Icomment:

Bugs or Bad Comments?*/. SIGOPS Oper. Syst. Rev. 41, 6 (Oct. 2007), 145–158.

https://doi.org/10.1145/1323293.1294276

[28]

Erik Wittern, Vinod Muthusamy, Jim Alain Laredo, Maja Vukovic, Aleksander A.

Slominski, Shriram Rajagopalan, Hani Jamjoom, and Arjun Natarajan. 2016. API

Harmony: Graph-based search and selection of APIs in the cloud. IBM Journal of

Research and Development 60, 2-3 (March 2016), 12:1–12:11.

[29]

Erik Wittern, Annie T. T. Ying, Yunhui Zheng, Julian Dolby, and Jim Alain Laredo.

2017. Statically Checking Web API Requests in JavaScript. In Proceedings of

the 39th International Conference on Software Engineering (ICSE). IEEE Press,

244–254.

[30]

Erik Wittern, Annie T. T. Ying, Yunhui Zheng, Jim Alain Laredo, Julian Dolby,

Christopher C. Young, and Aleksaner A. Slominski. 2017. Opportunities in

Software Engineering Research for Web API Consumption. In 2017 IEEE/ACM

1st International Workshop on API Usage and Evolution (WAPI). IEEE, 7–10.

[31]

Yanhong Zhai and Bing Liu. 2005. Web data extraction based on partial tree

alignment. In Proceedings of the International Conference on World Wide Web.

ACM, 76–85.

[32]

Yanhong Zhai and Bing Liu. 2007. Extracting web data using instance-based

learning. World Wide Web: Internet and Web Information Systems 10, 2 (2007),

113–132.