scalability landscape surrounding multiple existing use cases
in spreadsheets.
To accomplish this, we collected over 700 posts from an Excel
Reddit forum. Our analysis of the posts surfaced four main types
of operations that posed problems, in scalability and otherwise:
importing, managing, querying, and presenting data.
We dig deeper to characterize these problem areas to
highlight concrete areas of improvement for spreadsheet soft-
ware, with an eye towards expanding the reach and usability of
spreadsheets, especially for very large and complex datasets.
The contributions of this paper are (1) a mapping of the challenges
users face when using spreadsheets in general, (2) an account of
how those challenges pertain to scalability, (3) a methodology for
broadly evaluating problems in features, capabilities, and intent
specification for end-user software, and (4) a discussion of
how to fix these problems, as a means towards building a more
robust, scalable, and powerful spreadsheet tool.
RELATED WORK
Our work builds on prior work in two different research areas,
(1) spreadsheets and (2) the use of online community discourse
as a primary data source for study.
Prior work in spreadsheets
Previous research on spreadsheets has focused on improving
spreadsheets through understanding existing problems via user
studies and analyzing existing spreadsheets. Researchers have
conducted user studies to understand users’ conceptual models
of spreadsheets to identify how the cognitive process can affect
error rates [27], to see how users navigate large spreadsheets
[23], to evaluate how multiple users interact with a single
spreadsheet [25], and to characterize the strengths and weak-
nesses of spreadsheets [26, 14]. Other studies have focused
on errors: Powell et al. explore different types of errors that
occur and how they can be minimized [28], while others study
real spreadsheets to discover errors [9, 20]. Our approach
is instead to identify scalability problems in spreadsheets by
exploring troubleshooting posts on an online forum.
Using online communities as a data source
The availability of diverse and large amounts of online com-
munity data has led researchers to mine this data to answer
research questions [15, 8, 21, 18, 16]. While this method
may bias the sample toward users with internet access and some
degree of technological savvy, prior work has successfully
created rich characterizations of users via this approach. In
addition to the papers mentioned in the introduction [8, 18,
17], Kulshrestha et al. [21] measured the political bias of an
individual Twitter search result by extracting features from
the Twitter user’s account, while Keelan et al. [16] extracted
YouTube videos to measure the sentiment (positive/negative)
surrounding immunization.
CHARACTERIZING REDDIT
Reddit is a website that hosts a variety of forums called sub-
reddits. Each subreddit is a forum dedicated to a specific topic
and is named /r/topic_name. Within these forums, users can
post questions or notes and can comment on each other’s posts,
leading to discussions. Reddit has an API [3] to access data
from the site, making it ideal to use for those interested in
automating the scraping and categorizing of data.
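As a rough illustration of what such automation might look like, the
sketch below parses one page of a subreddit listing in the JSON form
that Reddit returns (a "Listing" whose children of kind "t3" are posts).
The function name parse_listing and the choice of fields to keep are
hypothetical and for illustration only; this is not the collection
pipeline used in this paper, and fetching and authentication are omitted.

```python
from typing import Any, Dict, List, Optional, Tuple


def parse_listing(payload: Dict[str, Any]) -> Tuple[List[Dict[str, str]], Optional[str]]:
    """Extract posts and the pagination cursor from one listing page.

    Assumes `payload` is an already-decoded Reddit listing: a dict with a
    "data" entry holding "children" (the items) and "after" (the cursor
    for the next page, or None on the last page).
    """
    data = payload["data"]
    posts = [
        {
            "id": child["data"]["id"],
            "title": child["data"]["title"],
            # Self-post text; may be empty for link posts.
            "body": child["data"].get("selftext", ""),
        }
        for child in data["children"]
        if child.get("kind") == "t3"  # "t3" marks a post; other kinds are skipped
    ]
    return posts, data.get("after")
```

A caller would pass the returned cursor back as the "after" parameter of
the next request until it comes back empty.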
The Forum
The /r/excel forum has been around for over 8 years, and as of
September 2017, had over 70,000 subscribers [4]. Note that a
subscriber is someone who subscribes to (follows) the forum;
anyone with a Reddit account can read, post, or comment
in the /r/excel forum without following it. Reddit recently
removed its subreddit traffic statistics to improve user
privacy [5], leaving the number of subscribers as the most
informative statistic publicly available. Note also
that this forum is monitored (meaning a moderator can take
down posts as they see fit), and one of the moderators of the
forum disclosed in a post entitled "Please welcome our new
Corporate Overlords" that the subreddit is involved in the
Excel Influencer Program [2].
The Users of the Forum
Reddit does not collect any information from the user other
than a username, password, and email, so it is difficult to
characterize the types of users who frequent Reddit, let alone
those that visit the /r/excel subreddit. However, in their posts
and comments, some users share their experience level with
Excel. Often users who ask questions begin with a statement of
their unfamiliarity with Excel. Others state that they use Excel
frequently for work but that they need help performing a new
and/or complex operation. The users who answer questions
typically do not state their credentials, but often their complex
solutions indicate a high level of experience with Excel.
In some cases, tools other than Excel are useful for managing
data (particularly large amounts of it), such as Microsoft’s
Access software [22] or relational databases. Users
of the forum (both post makers and commenters) showed vary-
ing degrees of knowledge about these alternatives. In some
cases, the user wanted to know if a tool was more appropriate
than Excel. In others, the users specifically said they knew
other programs would perform better, but they were forced
to use Excel. Sometimes the user did not mention Access
or databases, but commenters recommended using a database
instead of a spreadsheet. Out of the 712 posts we
collected, 89 mentioned one or more of the terms “Access”,
“SQL”, or “database” in the post body or comments.
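A count of this kind can be sketched as follows. This is a minimal
illustration, not the procedure used in this paper: the Post record, the
function names, and the matching rules are assumptions. In particular, we
assume here that "Access" is matched case-sensitively (to avoid the
ordinary word "access"), while "SQL" and "database" are matched
case-insensitively on word boundaries.

```python
import re
from dataclasses import dataclass, field
from typing import List


@dataclass
class Post:
    """Hypothetical record for one collected post and its comments."""
    body: str
    comments: List[str] = field(default_factory=list)


# Assumed matching rules; the paper does not specify them.
PATTERNS = [
    re.compile(r"\bAccess\b"),                     # case-sensitive product name
    re.compile(r"\bSQL\b", re.IGNORECASE),
    re.compile(r"\bdatabases?\b", re.IGNORECASE),  # singular or plural
]


def mentions_alternative(post: Post) -> bool:
    """True if any term appears in the post body or in any comment."""
    texts = [post.body, *post.comments]
    return any(p.search(t) for p in PATTERNS for t in texts)


def count_mentions(posts: List[Post]) -> int:
    """Number of posts that mention at least one of the terms."""
    return sum(mentions_alternative(p) for p in posts)


example_posts = [
    Post("How do I speed up VLOOKUP on 500k rows?", ["Consider a database instead."]),
    Post("Can I access this cell from VBA?"),
    Post("Should I move to Access?"),
    Post("Pivot table help", ["Use SQL via Power Query"]),
]
print(count_mentions(example_posts))
```

Each post is counted once no matter how many terms it contains. A
sentence that begins with the ordinary word "Access" would still be a
false positive under these rules, so manual review of the matches remains
advisable.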
The Uses of Excel
In the 712 posts, the uses of Excel seemed to fall into two
overarching categories: Excel for personal use, and Excel for
professional use. Traditionally, in both of these areas, we think
of spreadsheets as being used for record keeping of data like
addresses and emails, time trackers and schedules, or financial
information and budgets. There were many posts regarding
these topics in both personal and professional settings. However,
the more unusual uses of Excel were impressive; we detail them
next.
Regarding personal uses, several users asked about how to
keep track of and calculate sports statistics; fantasy football
was particularly popular. One user wanted to design a spread-
sheet that automatically organizes a table tennis tournament.