Research by surgical residents and surgeons is essential to advancing the field. However, there are many obstacles to executing a good research project. As a health services analyst often working with surgical residents interested in research, I work with study data through all stages of a project and am familiar with data collection traps that are common among researchers. As you follow best practices to design and implement your study1, the following are a few pieces of advice should you need to collect your own data.
Consult with an expert
This is the most important piece of advice I can give: consult with an expert in data analysis (statistician or health services researcher) before you collect any data. They can help you decide what you need to collect and how you will collect it. Before you spend any time collecting data, make sure it will be suitable to be read into their software package and usable for analysis.
Decide what to collect
Carefully define your research question, and ensure that what you are collecting will be able to answer your question. Avoid the temptation of collecting every variable available to you—only collect the fields you need to answer your question. The more you collect, the more time both data entry and data cleaning will take.
Select data collection software
Decide which software you will use to collect your data. Consult with an expert in data analysis before making this decision. Common tools at my institution include REDcap, JMP, and Microsoft Excel, but you may have others available to you at your institution. Be aware that tools like Excel may be flexible and allow for easy data entry, but this flexibility can complicate the data preparation and analysis after the data are collected, so the onus is on you to collect clean and usable data.
Collect clean data
Here are some tips for collecting clean data and avoiding some common mistakes. All of these tips apply specifically to Excel, but many are applicable regardless of the software package you choose to use. Be sure to consult with the person who will be analyzing your data before you begin collecting to understand if they have requirements in addition to my suggestions.
- Carefully define each column as numeric, character, or date. Be rigidly consistent. A column should be entirely numeric, entirely character, or entirely dates. (Example A)
- Define the categories you will use in each column before starting data entry for that column.
When possible, use numeric codes instead of character strings. For example, instead of “Yes” and “No”, enter as 1 and 0. If you have so many categories that you have trouble keeping them straight with numeric codes, consider whether the variable will be useful in analysis with so many categories and reevaluate your categories as needed. Provide a key for the codes you use. (Example B)
If you choose to enter a column as character values, make sure you use the exact same spelling and capitalization. For example, avoid entering female as “female” in some cells and “Female” in other cells. Consider using the software’s data validation tools to limit what can be entered in each column.
- Enter all dates with the same format (mm/dd/yyyy). Enter the year with 4 digits. If you use only 2 digits for year, Excel assumes that years <30 are in the 21st century (“29” becomes “2029”), while years ?30 are in the 20th century (“30” becomes “1930”). Alternatively, create three separate columns for month, day, and year. (Example C)
- If just part of a date is unknown (for example, the day or month), do not enter “99” or some other arbitrary value for the unknown value—even though it may look nice to the human eye, Excel will not recognize this as a date value. Discuss how to handle this scenario with a data expert. (Example D)
- If someone else will be analyzing your data, they may want you to record the raw data required for a calculation rather than the calculated value. For example, collect date of birth and date of surgery instead of collecting age at surgery. (Example E)
- If a patient can be in more than one category of a variable, create separate indicator variables instead of one multi-level categorical variable. Alternatively, create multiple variables with the same multi-level codes. (Example F)
- If you are collecting an outcome for which you have varying lengths of follow-up (eg, mortality within 5 years after surgery), collect (1) whether the patient had the outcome, and (2) the date of the outcome or the last date at which it was known that they did not have the outcome. If you have the same length of follow-up on all patients, you can simply record whether the outcome occurred or not (eg, collect death within 30 days following surgery as yes/no if you have 30 day follow-up for all patients). Consult with an expert before collecting outcomes data. (Example G)
- Ensure that each row has a unique identifier (or set of variables that create a unique identifier). For example, if your data set has one row per patient, the unique identifier might be the patient ID. If your data set has more than one row per patient, the unique identifier might be the patient ID and date. Make sure the unique identifier is entered in every row. (Example H)
Collect data for a few cases and ask your expert to review what you have collected. Consider having them do a preliminary analysis to ensure that you are collecting what you need and in a format that is usable for analysis.
Through careful attention to data collection, you will be well on your way to a sound research study that contributes to moving your field forward.
References
- Haider AH, Bilimoria KY, Kibbe MR. A Checklist to Elevate the Science of Surgical Database. JAMA Surg. 2018 Apr 4 [Epub ahead of print]. doi:10.1001/jamasurg.2018.0628.