In any project centered on data analysis or visualization, the biggest challenge (by far) is finding the right dataset. Except in rare circumstances, you probably won’t want to build your own dataset from scratch. The number of available datasets grows every year, but not every dataset is equally usable. Here are some general tips for finding data, along with some resources to get you started.

General Tips

  • Data can come in many different formats, but for this class you’ve mainly learned how to work with tabular data (i.e., spreadsheets). Look for datasets stored in a format you can download to your computer (files ending in .csv, .xlsx, or .xls), or in something like Google Sheets that you can export to one of those formats.
  • Always look at the metadata! Just because a dataset exists doesn’t mean it’s accurate or complete. Be on the lookout for the who, what, when, how, and why of the dataset’s creation.
  • Roll up your sleeves and look at the actual contents of the dataset as soon as possible to get a feel for what’s in there, what’s not, and how it’s formatted.
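
As a rough illustration of that last tip, here is a minimal Python sketch of a "first look" at a tabular dataset using pandas. The sample data, its column names, and the `first_look` helper are all made up for this example. With a real download you would instead pass your own file name to `pd.read_csv` (or `pd.read_excel` for .xlsx and .xls files).

```python
import io

import pandas as pd

# Hypothetical stand-in for a downloaded file, so the sketch runs on its own.
# With a real dataset you would call pd.read_csv("your_file.csv") instead.
SAMPLE_CSV = """city,year,population
Springfield,2020,116000
Shelbyville,2020,
Springfield,2021,117500
"""


def first_look(df):
    """Summarize what is in a table, what is missing, and how it is formatted."""
    return {
        "rows": len(df),
        "columns": list(df.columns),
        # Column types show how each field is formatted (numbers, text, ...)
        "dtypes": df.dtypes.astype(str).to_dict(),
        # Count of empty cells per column shows what is not there
        "missing": df.isna().sum().to_dict(),
    }


df = pd.read_csv(io.StringIO(SAMPLE_CSV))
summary = first_look(df)

print(df.head())  # the first few rows, to eyeball the contents
print(summary)
```

A summary like this is only a starting point. It can flag empty cells and unexpected column types, but it can’t tell you whether the values are accurate, which is why the metadata still matters.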

Where to Find Datasets

  • Data.Gov
    • Strengths: Good documentation and metadata; easy-to-use search function
    • Weaknesses: Limited to data produced/collected by government agencies
    • Example dataset: Walkability Index
  • Kaggle
    • Strengths: Broad mix of topics; good usability
    • Weaknesses: Sometimes lacks documentation and metadata; might have incomplete data
    • Example dataset: Video Game Sales
  • Data Is Plural archive of datasets
    • Strengths: Eclectic mix of curated datasets, often relevant to current events; good descriptions of data
    • Weaknesses: Small size
  • Google Dataset Search
    • Strengths: Size (it pulls in datasets from thousands of different sources)
    • Weaknesses: Useful as a starting point, but requires extra navigating (it points you to other websites to locate the actual dataset)
  • Harvard Dataverse
    • Strengths: Good documentation and metadata
    • Weaknesses: Focused on more academic topics
