Data Life Cycle#

Introduction to the stages of the Data Life Cycle#

  • When we talk about data, what do we mean?

  • What is the data life cycle? Where are you on this cycle? Which steps apply to you?

  • We are discussing these steps as a list, which may make you think of them as sequential. However, you may skip some steps, and you may go back and forth between steps; it all depends on what you are trying to do and what your role is in your current research project.

    • Discussion topic: In your experience, how does this relate to other research frameworks, e.g. Research Process, Research Life Cycle, Scientific Method, etc.? Which steps in those processes are similar to what we are discussing for the Data Life Cycle?

Data Life Cycle

  • Research questions

    • Why are you looking for data? What kind of data do you think will have useful information for your questions?

  • Data Search/Reuse

    • You are looking for already existing data or data sets that you can reuse for your research.

    • What is the difference between data and a data set?

    • Search Strategies

    • Data Sources @ SMU Libraries

    • Is the data you found in the proper format for analysis?

    • Is it clean and structured?

  • Data Management Plan (DMP)

    • Do you need a long term plan to manage your data?

    • Do you want to share it with others?

    • Does the grant you are working on require you to have a plan for sharing and/or preserving your data?

    • Research Data Management Support @ SMU

  • Data Storage (Collection, Description, Recollection)

    • Where are you going to store the data?

    • Do you want to store the description/metadata?

  • Analysis

    • You are deciding which tools (Excel) or scripts (Python, R) to use on your data to ask questions.

  • Archive

    • You are done using the data, but you want to save it for the long term.

  • Publications

    • You are now publishing your results, research and/or the data set.

    • Data visualizations are a type of publication.

Additional Resources:

  • Read 8 STEPS IN THE DATA LIFE CYCLE

  • What do you think about this framework? Do you find it helpful?

  • Does this framework change how you think about your data?

  • Take notes on your responses to this and we will discuss this in the session.

  • We will be discussing how to think like a data professional (scientist, journalist, etc.). We will be applying this framework to a project.

What is Data?#

What are the differences between data, a dataset, and a database?#

  • “Data are observations or measurements (unprocessed or processed) represented as text, numbers, or multimedia.

  • A dataset is a structured collection of data generally associated with a unique body of work.

  • A database is an organized collection of data stored as multiple datasets. Those datasets are generally stored and accessed electronically from a computer system that allows the data to be easily accessed, manipulated, and updated.”
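
One way to make the three definitions above concrete is a minimal Python sketch using only the standard library. The table names and the vet visit record are hypothetical and exist only for illustration; the cat names are borrowed from the examples later on this page.

```python
import sqlite3

# Data: a single observation (the names recorded for one cat).
observation = ("Smally", "McTiny")

# Dataset: a structured collection of observations from one body of work.
cats_dataset = [
    ("Smally", "McTiny"),
    ("Kitty", "Kitty"),
]

# Database: several datasets stored together and accessed through one system.
# The table names and the vet visit row are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE cats (first_name TEXT, last_name TEXT)")
conn.execute("CREATE TABLE vet_visits (first_name TEXT, visit_date TEXT)")
conn.executemany("INSERT INTO cats VALUES (?, ?)", cats_dataset)
conn.execute("INSERT INTO vet_visits VALUES (?, ?)", ("Smally", "2024-01-15"))

# List the datasets (tables) the database holds and count the cats.
table_names = [
    row[0]
    for row in conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
    )
]
cat_count = conn.execute("SELECT COUNT(*) FROM cats").fetchone()[0]
print(table_names, cat_count)
```

The in-memory SQLite database is just a convenient stand-in; any database system that holds more than one dataset would illustrate the same distinction.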

Forms of Data#

There are many ways to represent data, just as there are many sources of data. After processing our data, we turn it into a number of products. For example:

  • Non-digital text (lab books, field notebooks)

  • Digital texts or digital copies of text

  • Spreadsheets

  • Audio

  • Video

  • Computer Aided Design/CAD

  • Statistical analysis (SPSS, SAS)

  • Databases

  • Geographic Information Systems (GIS) and spatial data

  • Digital copies of images

  • Web files

  • Scientific sample collections

  • Matlab files & 3D Models

  • Metadata & Paradata

  • Data visualizations

  • Computer code

  • Standard operating procedures and protocols

  • Protein or genetic sequences

  • Artistic products

  • Curriculum materials

  • Collection of digital objects acquired and generated during research

Adapted from: Georgia Tech Library Guide

Discussion: Forms of Data#

These are some (most!) of the shapes your research data might transform into. What are some forms of data you use in your work? What about forms of data that you produce as your output? Perhaps there are some forms that are typical of your field? Where do you usually get your data from?

Stages of Data#

We begin without data. Then it is observed, or made, or imagined, or generated. After that, it goes through further transformations:

Raw#

Raw data is yet to be processed, meaning it has yet to be manipulated by a human or computer. Received or collected data could be in any number of formats, locations, etc. It could be in any of the forms listed above.

But “raw data” is a relative term, inasmuch as when one person finishes processing data and presents it as a finished product, another person may take that product and work on it further, and for them that data is “raw data”.

Discussion: Raw Data#

For example, is “big data” “raw data”? How do we understand data that we have “scraped”?

Processed/Transformed#

Processing data puts it into a state more readily available for analysis, and makes the data legible. For instance, it could be rendered as structured data. Structured data can itself take many forms, e.g., a table.

Here are a few you’re likely to come across, all representing the same data:

XML#

<Cats> 
    <Cat> 
        <firstName>Smally</firstName> <lastName>McTiny</lastName> 
    </Cat> 
    <Cat> 
        <firstName>Kitty</firstName> <lastName>Kitty</lastName> 
    </Cat> 
    <Cat> 
        <firstName>Foots</firstName> <lastName>Smith</lastName> 
    </Cat> 
    <Cat> 
        <firstName>Tiger</firstName> <lastName>Jaws</lastName> 
    </Cat> 
</Cats> 

JSON#

{"Cats":[ 
    { "firstName":"Smally", "lastName":"McTiny" }, 
    { "firstName":"Kitty", "lastName":"Kitty" }, 
    { "firstName":"Foots", "lastName":"Smith" }, 
    { "firstName":"Tiger", "lastName":"Jaws" } 
]} 

CSV#

First Name,Last Name
Smally,McTiny
Kitty,Kitty
Foots,Smith
Tiger,Jaws
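
To check the claim that these formats represent the same data, here is a small sketch that parses a shortened version of each example with Python's standard library and compares the results. The helper function names are made up for this illustration.

```python
import csv
import io
import json
import xml.etree.ElementTree as ET

# Shortened copies of the three examples (two cats each).
xml_text = """<Cats>
    <Cat><firstName>Smally</firstName><lastName>McTiny</lastName></Cat>
    <Cat><firstName>Kitty</firstName><lastName>Kitty</lastName></Cat>
</Cats>"""

json_text = """{"Cats": [
    {"firstName": "Smally", "lastName": "McTiny"},
    {"firstName": "Kitty", "lastName": "Kitty"}
]}"""

csv_text = "First Name,Last Name\nSmally,McTiny\nKitty,Kitty\n"


def from_xml(text):
    """Return (first, last) tuples from the XML representation."""
    root = ET.fromstring(text)
    return [
        (cat.findtext("firstName"), cat.findtext("lastName"))
        for cat in root.findall("Cat")
    ]


def from_json(text):
    """Return (first, last) tuples from the JSON representation."""
    return [(c["firstName"], c["lastName"]) for c in json.loads(text)["Cats"]]


def from_csv(text):
    """Return (first, last) tuples from the CSV representation."""
    reader = csv.DictReader(io.StringIO(text))
    return [(row["First Name"], row["Last Name"]) for row in reader]


# All three parsers should yield the same list of records.
print(from_xml(xml_text) == from_json(json_text) == from_csv(csv_text))
```

Once parsed, the format no longer matters: each parser produces the same list of records, which is one reason converting between open formats is usually straightforward.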

The importance of using open data formats#

A small detour to discuss (the ethics of?) data formats. For accessibility, future-proofing, and preservation, keep your data in open, sustainable formats. A demonstration:

  1. Open this file (cats.csv) in a text editor, and then in an app like Excel. This is a CSV: an open, text-only file format.

  2. Now try to do the same with this one. This is a proprietary format!

Sustainable formats are generally unencrypted, uncompressed, and follow an open standard. A small list:

  • ASCII

  • PDF

  • .csv

  • FLAC

  • TIFF

  • JPEG2000

  • MPEG-4

  • XML

  • RDF

  • .txt

  • .r
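
The text-editor demonstration above can be roughly approximated in code. The sketch below is a crude heuristic, not a definition of "open": it only checks whether bytes decode as UTF-8 text, which is what lets a plain text editor display a CSV. Open formats such as TIFF are binary and would fail this check while still following an open standard. The function name and the sample bytes are made up for this illustration.

```python
def looks_like_plain_text(data: bytes) -> bool:
    """Return True if the bytes decode as UTF-8 and contain no NUL bytes.

    This is only a heuristic for 'a text editor can show this'.
    """
    if b"\x00" in data:
        return False
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


# A CSV is plain text, so it passes.
csv_bytes = "First Name,Last Name\nSmally,McTiny\n".encode("utf-8")

# Made-up binary bytes standing in for a file a text editor cannot display.
binary_bytes = bytes([0x50, 0x4B, 0x03, 0x04, 0x00, 0xFF, 0xFE])

print(looks_like_plain_text(csv_bytes))
print(looks_like_plain_text(binary_bytes))
```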

Discussion: Processed/Transformed#

How do you decide the formats to store your data when you transition from ‘raw’ to ‘processed/transformed’ data? What are some of your considerations?

Tidy Data#

There are guidelines for processing data, sometimes referred to as Tidy Data.1 One formulation of these rules:

  1. Each variable is in a column.

  2. Each observation is a row.

  3. Each value is a cell.

Look back at our example of cats to see how they may or may not follow those guidelines. Important note: some data formats allow for more than one dimension of data! How might that complicate the concept of Tidy Data?

{"Cats":[
    {"Calico":[
        { "firstName":"Smally", "lastName":"McTiny" },
        { "firstName":"Kitty", "lastName":"Kitty" }],
     "Tortoiseshell":[
        { "firstName":"Foots", "lastName":"Smith" },
        { "firstName":"Tiger", "lastName":"Jaws" }]}
]}
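
One possible answer to the question above, sketched in Python: the extra dimension (the grouping key) can be turned into its own column so that each cat is again one row. The column name "coat" is an assumption about what "Calico" and "Tortoiseshell" are meant to describe, and the function name is made up for this illustration.

```python
import json

nested_text = """
{"Cats": [
    {"Calico": [
        {"firstName": "Smally", "lastName": "McTiny"},
        {"firstName": "Kitty", "lastName": "Kitty"}],
     "Tortoiseshell": [
        {"firstName": "Foots", "lastName": "Smith"},
        {"firstName": "Tiger", "lastName": "Jaws"}]}
]}
"""


def flatten_cats(text):
    """Turn the nested structure into one row per cat.

    The grouping key becomes a 'coat' column (an assumed name).
    """
    rows = []
    for group in json.loads(text)["Cats"]:
        for coat, cats in group.items():
            for cat in cats:
                rows.append(
                    {
                        "coat": coat,
                        "firstName": cat["firstName"],
                        "lastName": cat["lastName"],
                    }
                )
    return rows


tidy_rows = flatten_cats(nested_text)
for row in tidy_rows:
    print(row)
```

After flattening, each variable (coat, first name, last name) is a column and each cat is a row, so the result could be written out as a CSV with three columns.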

1 Wickham, Hadley. “Tidy Data”. Journal of Statistical Software.

Group Watch:#

  • We will watch this as a group and discuss the stages of the data life cycle for this research question: We measured pop music’s falsetto obsession - Vox Earworm

  • Optional extra: “Along the way, we discovered that using [social media platform] data to concretely answer this question is quite a challenge. Our process included creating dozens of custom data sets, careful fact-checking, and conversations with hit songwriters and music industry executives to match data with real experiences.” - We tracked what happens after TikTok songs go viral

Big data#

Big Data 3 Vs

Big Data More Vs

What are the potentials of big data? What are the big problems?#

“We define Big Data as a cultural, technological, and scholarly phenomenon that rests on the interplay of:

  1. Technology: maximizing computation power and algorithmic accuracy to gather, analyze, link, and compare large data sets.

  2. Analysis: drawing on large data sets to identify patterns in order to make economic, social, technical, and legal claims.

  3. Mythology: the widespread belief that large data sets offer a higher form of intelligence and knowledge that can generate insights that were previously impossible, with the aura of truth, objectivity, and accuracy.” (boyd & Crawford, 2012)

“The next time you hear someone talking about algorithms, replace the term with ‘God’ and ask yourself if the meaning changes. Our supposedly algorithmic culture is not a material phenomenon so much as a devotional one….It gives us an excuse not to intervene in the social shifts wrought by big corporations like Google or Facebook or their kindred, to see their outcomes as beyond our influence [and] it makes us forget that particular computational systems are abstractions, caricatures of the world, one perspective among many. The first error turns computers into gods, the second treats their outputs as scripture.” (Bogost, 2015)

“We believe ‘big data’ research can be similarly improved by working with, rather than denying the importance of, ‘small data’ (Kitchin and Lauriault, 2014; Thatcher and Burns, 2013) and other existing approaches to research….Furthermore, doing critical work with ‘big data’ involves understanding not only data’s formal characteristics, but also the social context of the research amidst shifting technologies and broad social processes. Done right, ‘big’ and small data utilized in concert opens new possibilities: topics, methods, concepts, and meanings for what can be understood and done through research.” (Dalton & Thatcher, 2014)

Ten simple rules for responsible big data research#

  1. Acknowledge that data are people and can do harm

  2. Recognize that privacy is more than a binary value

  3. Guard against the reidentification of your data

  4. Practice ethical data sharing

  5. Consider the strengths and limitations of your data; big does not automatically mean better

  6. Debate the tough, ethical choices

  7. Develop a code of conduct for your organization, research community, or industry

  8. Design your data and systems for auditability

  9. Engage with the broader consequences of data and analysis practices

  10. Know when to break these rules

Zook, Matthew, et al. “Ten simple rules for responsible big data research.” PLoS Computational Biology, vol. 13, no. 3, e1005399. 30 Mar. 2017. doi:10.1371/journal.pcbi.1005399

Checklists, principles, examples#

  • Data Ethics Decision Aid: DEDA

  • Data Harm Record: DHR

  • Data Science Ethics Checklist & Examples of Data Harms: Deon

  • “Feminist Data Visualization” (D’Ignazio & Klein, 2018): FDV

Handout for data discussion#

About your data project#

Data Lifecycle (SouthernMethodistUniversity/datalifecycle)

Lessons for big data#

Resources at SMU#

Glossaries#

References#