Preprocessing of Learning Behavior Analysis

As an example, we implement the preprocessing used in a learning behavior analysis (Yin, 2019). That study investigated the relationship between reading behavior and learning outcomes using four reading-behavior features: RP, PT, RT, and BRR. RP is the total number of pages a student read (including duplicate pages). PT is the number of times a student previewed a lecture. RT is the total time spent reading the learning materials, measured in hours. BRR is the backtrack reading rate, calculated by dividing the number of times the student turned to the previous page by the number of times the student turned to the next page. For the preprocessing, the following three rules were defined:

Invalid reading time:

If a student spends less than five seconds on a page, the student is considered not to have read that page.

Invalid record:

If the time difference between two consecutive actions is longer than 20 minutes, the record is invalid. The student performed no action for more than 20 minutes and is therefore assumed not to have been reading the contents.

Invalid preview:

If a student read the learning content before class (previewed the lesson) for three minutes or less in total, he/she is considered not to have previewed the lesson.
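
The three rules reduce to simple threshold checks. The following is a minimal sketch, not the implementation used by Yin (2019) or by OpenLA. The constant and function names are hypothetical and only mirror the thresholds stated above, with rule 2 applied to the duration of a single page view.

```python
# Thresholds taken from the three rules above; all names are illustrative.
MIN_READING_SECONDS = 5        # rule 1: shorter page views do not count as reading
MAX_GAP_SECONDS = 20 * 60      # rule 2: longer durations are treated as invalid
MIN_PREVIEW_SECONDS = 3 * 60   # rule 3: previewing must exceed three minutes


def is_valid_page_view(duration_seconds):
    """Return True if a page view passes rules 1 and 2."""
    return MIN_READING_SECONDS <= duration_seconds <= MAX_GAP_SECONDS


def counts_as_preview(preview_seconds):
    """Return True if the total pre-class reading time passes rule 3."""
    return preview_seconds > MIN_PREVIEW_SECONDS


# Made-up page-view durations in seconds.
durations = [3, 12, 25 * 60, 90]
valid = [d for d in durations if is_valid_page_view(d)]
print(valid)
print(counts_as_preview(sum(valid)))
```

Here the 3-second view fails rule 1 and the 25-minute view fails rule 2, so only 102 seconds of valid reading remain, which is not enough to count as a preview.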

Preprocessing with OpenLA

import OpenLA as la
import pandas as pd

course_info, event_stream = la.start_analysis(files_dir="dataset_sample", course_id="A")

page_transition = la.convert_into_page_transition(event_stream, invalid_seconds=4, timeout_seconds=20*60, count_operation=False)

# Events logged before each lecture, used for the preview feature (PT).
stream_preview = la.select_by_lecture_time(course_info, event_stream, timing="before")
transition_preview = la.convert_into_page_transition(stream_preview, invalid_seconds=4, timeout_seconds=20*60, count_operation=False)

users_feature = pd.DataFrame(columns=["RP", "PT", "RT", "BRR"])
for user_id in course_info.user_ids():
    PT = 0
    for lecture in course_info.lecture_weeks():
        contents_id = course_info.lecture_week_to_contents_id(lecture)
        preview_seconds = transition_preview.reading_seconds(user_id, contents_id)
        if preview_seconds > 3*60:
            PT += 1

    NN = event_stream.operation_count("NEXT", user_id=user_id)
    NP = event_stream.operation_count("PREV", user_id=user_id)
    BRR = NP / NN
    RP = page_transition.num_transition(user_id=user_id)
    RT = page_transition.reading_seconds(user_id=user_id) / (60*60) # hourly basis
    users_feature.loc[user_id] = pd.Series({"RP":RP, "PT":PT, "RT":RT, "BRR":BRR})
users_feature.to_csv("example.csv")
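
In the OpenLA loop above, BRR is the plain ratio NP / NN, so a student with no NEXT operations would trigger a division by zero. A minimal sketch of a guarded version follows. The helper name `backtrack_reading_rate` and the choice of NaN for the undefined case are assumptions for illustration and are not part of OpenLA.

```python
import math


def backtrack_reading_rate(num_prev, num_next):
    """Return NP / NN, or NaN when no NEXT operation was recorded."""
    if num_next == 0:
        return math.nan  # assumption: treat the undefined ratio as missing
    return num_prev / num_next


print(backtrack_reading_rate(12, 48))  # 0.25
print(backtrack_reading_rate(3, 0))    # nan
```

Returning NaN keeps such students in the feature table while making the missing value explicit. Whether to drop them instead depends on the analysis.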

Preprocessing without OpenLA

import pandas as pd
event_stream = pd.read_csv("dataset_sample/Course_A_EventStream.csv")
lecture_schedule = pd.read_csv("dataset_sample/Course_A_LectureTime.csv")
contents_information = pd.read_csv("dataset_sample/Course_A_LectureMaterial.csv")

event_stream["eventtime"] = pd.to_datetime(event_stream["eventtime"])
lecture_schedule['lecture'] = lecture_schedule['lecture'].apply(int)
lecture_schedule = lecture_schedule.set_index('lecture')
lecture_schedule["starttime"] = pd.to_datetime(lecture_schedule["starttime"])

contents_information['lecture'] = contents_information['lecture'].apply(int)
contents_information = contents_information.set_index('lecture')

users_feature = pd.DataFrame(columns=["RP", "PT", "RT", "BRR"])
for user_id in event_stream["userid"].unique():
    user_stream = event_stream[event_stream["userid"] == user_id]
    PT = 0
    RP = 0
    RT = 0
    for lecture in lecture_schedule.index:
        contents_id = contents_information.at[lecture, "contentsid"]
        lecture_stream = user_stream[user_stream["contentsid"] == contents_id]
        lecture_start_time = lecture_schedule.at[lecture, "starttime"]
        event_time_list = lecture_stream["eventtime"].tolist()
        page_list = lecture_stream["pageno"].tolist()
        operation_list = lecture_stream["operationname"].tolist()
        page_enter_idx = 0
        preview_seconds = 0
        for i in range(len(lecture_stream)):

            # A page view ends at the last event, on a page change, or on CLOSE.
            if (i == len(lecture_stream) - 1) or\
               ((i+1 < len(lecture_stream)) and (page_list[i] != page_list[i+1])) or\
               (operation_list[i] == "CLOSE"):

                # total_seconds() includes whole days; .seconds would drop them.
                page_duration = (event_time_list[i] - event_time_list[page_enter_idx]).total_seconds()
                # Rules 1 and 2: count only views longer than 4 s and at most 20 min.
                if page_duration > 4 and page_duration <= 20*60:
                    RP += 1
                    RT += page_duration
                    if event_time_list[i] < lecture_start_time:
                        preview_seconds += page_duration

                page_enter_idx = i if operation_list[i] != "CLOSE" else i+1

            # A gap of more than 20 minutes to the next event starts a new page view.
            if (i+1 < len(lecture_stream)) and ((event_time_list[i + 1] - event_time_list[i]).total_seconds() > 20 * 60):
                page_enter_idx = i + 1

        if preview_seconds > 3*60:
            PT += 1

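    operation_count = user_stream["operationname"].value_counts()
    NN = operation_count.get("NEXT", 0)  # 0 if the user never turned to the next page
    NP = operation_count.get("PREV", 0)  # 0 if the user never turned back
    BRR = NP / NN if NN else float("nan")  # undefined without any NEXT operation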
    RT = RT / (60*60) # hourly basis
    users_feature.loc[user_id] = pd.Series({"RP":RP, "PT":PT, "RT":RT, "BRR":BRR})
users_feature.to_csv("example.csv")
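
The inner loop above compares consecutive timestamps one pair at a time. As an alternative illustration, the gaps between consecutive events can be computed in one step with pandas. This is a sketch on made-up data that reuses the column names of the CSV files above. It only flags gaps longer than 20 minutes and does not reproduce the full feature extraction.

```python
import pandas as pd

# Toy event stream (hypothetical values) with the column names used above.
events = pd.DataFrame({
    "userid": ["U1", "U1", "U1", "U1"],
    "contentsid": ["C1", "C1", "C1", "C1"],
    "eventtime": pd.to_datetime([
        "2020-04-01 10:00:00",
        "2020-04-01 10:00:30",
        "2020-04-01 10:30:30",  # 30 minutes after the previous event
        "2020-04-02 10:30:40",  # one day and 10 seconds later
    ]),
})

events = events.sort_values(["userid", "contentsid", "eventtime"])

# Time elapsed since the previous event of the same user and content.
# total_seconds() includes whole days, whereas the .seconds attribute
# would report only 10 for the last row.
gap = events.groupby(["userid", "contentsid"])["eventtime"].diff()
events["gap_seconds"] = gap.dt.total_seconds()

# Rule 2: a gap longer than 20 minutes marks an invalid record.
events["timeout"] = events["gap_seconds"] > 20 * 60
print(events[["eventtime", "gap_seconds", "timeout"]])
```

The first event of each group has no predecessor, so its gap is missing and it is not flagged as a timeout.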