Finding memes with a suffix array

Another programming challenge from Julia Evans: Twitter is full of memes, and some tweets reuse the same sentence template. The task is to find some of these tweets using a suffix array.

Let's abstract the problem first: we have a pile of strings, some of which may share a common substring, and we want to use a suffix array to find a group of strings that do share one.

To describe the problem more precisely, two parameters are introduced (a possible definition as constants is sketched right after this list):

  • MIN_COMMOM_SUBSTRING_SIZE: the minimum length of the common substring.

  • MIN_COMMOM_NUM: the minimum number of strings in the group we find (this is not entirely accurate; more on that later).
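
In the code below these show up as constants. Their definitions are not included in this post; one possible form, using the values from the results section at the end, would be:

// Assumed definitions (not shown in the post); the values match the results section below
constexpr int MIN_COMMOM_SUBSTRING_SIZE = 40;  // minimum length of the common substring
constexpr int MIN_COMMOM_NUM = 5;              // minimum size of a group of matching suffixes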

The overall algorithm is (a toy sketch follows the list):

  1. Concatenate all the strings into one long string.

  2. Build the suffix array of this big string.

  3. Walk through the suffix array: for the current element, take its first MIN_COMMOM_SUBSTRING_SIZE characters as the target common substring Target (if the element is shorter than that, skip to the next element), then scan the elements that follow until reaching one that does not have Target as a prefix. If the number of elements prefixed with Target is at least MIN_COMMOM_NUM, return this group of strings; otherwise continue with the next element.
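
Before diving into the real code, here is a tiny self-contained sketch of these three steps on toy data (hypothetical strings and small hand-picked parameter values, not the project's code):

// A toy, self-contained sketch of the three steps above (not the project code).
// Two of the three strings share the 5-character substring "hello".
#include <algorithm>
#include <iostream>
#include <string>
#include <vector>

int main() {
    const int kMinCommonSubstringSize = 5;  // toy stand-in for MIN_COMMOM_SUBSTRING_SIZE
    const int kMinCommonNum = 2;            // toy stand-in for MIN_COMMOM_NUM

    // step 1: concatenate everything into one big string
    std::string big = std::string("say hello!") + "no match" + "hello there";

    // step 2: build a (naive) suffix array: every suffix, sorted lexicographically
    std::vector<std::string> suffixes;
    for (std::size_t i = 0; i < big.size(); i++) {
        suffixes.push_back(big.substr(i));
    }
    std::sort(suffixes.begin(), suffixes.end());

    // step 3: scan for a run of suffixes sharing the same first kMinCommonSubstringSize chars
    for (std::size_t i = 0; i < suffixes.size(); i++) {
        if (suffixes[i].size() < static_cast<std::size_t>(kMinCommonSubstringSize)) {
            continue;
        }
        std::string target = suffixes[i].substr(0, kMinCommonSubstringSize);
        std::size_t j = i + 1;
        while (j < suffixes.size() && suffixes[j].compare(0, target.size(), target) == 0) {
            j++;
        }
        if (static_cast<int>(j - i) >= kMinCommonNum) {
            std::cout << "common substring: " << target << std::endl;  // prints "hello"
            break;
        }
        i = j - 1;  // the whole run failed, skip past it
    }
    return 0;
}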


Now for the concrete implementation.

Suffix is the element type of the suffix array; it stores a suffix together with the position where that suffix starts.

struct Suffix {
    // suffix string
    std::string str;
    // where does the suffix start in the original string?
    int start_idx;

    Suffix(const std::string& str, int start_idx) : str(str), start_idx(start_idx) {}
    Suffix& operator=(const Suffix& other) = default;
};

using SuffixArray = std::vector<Suffix>;

For a single string, the suffix array can be built with a straightforward sort.

// create suffix array from a string
SuffixArray createSuffixArray(const std::string& str) {
    SuffixArray suffix_arr;
    for (int start_idx = 0; start_idx < str.size(); start_idx++) {
        suffix_arr.emplace_back(str.substr(start_idx), start_idx);
    }
    std::sort(suffix_arr.begin(), suffix_arr.end(), [](const Suffix& lhs, const Suffix& rhs) {
        return lhs.str < rhs.str;
    });
    return suffix_arr;
}
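
As a quick sanity check, here is what this produces for a short string (a hypothetical usage snippet, not from the project):

// Hypothetical usage: the suffix array of "banana"
SuffixArray arr = createSuffixArray("banana");
for (const Suffix& suffix : arr) {
    std::cout << suffix.start_idx << ": " << suffix.str << std::endl;
}
// prints, in lexicographic order:
// 5: a
// 3: ana
// 1: anana
// 0: banana
// 4: na
// 2: nana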

Each tweet read from the database is stored in a Tweet struct, which holds the tweet text and the index where that text ends in the big concatenated string.

struct Tweet {
    // text in the tweet
    std::string text;
    // end index in the concatenated string
    int end_idx;

    Tweet(const std::string& text, int end_idx) : text(text), end_idx(end_idx) {}
    Tweet& operator=(const Tweet& other) = default;
};

Julia provides a SQLite database. I use the third-party library SQLiteCpp to open it, read the tweets into a vector of Tweet, and concatenate them into one big string. The original database has over a million rows, so a TWEET_NUM_LIMIT parameter caps how many tweets are read.

try {
    SQLite::Database db(DB_PATH);

    SQLite::Statement query(db, "SELECT tweet FROM tweets LIMIT ?");
    query.bind(1, TWEET_NUM_LIMIT);

    std::vector<Tweet> tweets;
    std::string big_str = "";

    while (query.executeStep()) {
        std::string text = query.getColumn(0);
        big_str.append(text);
        int end_idx = big_str.size() - 1;
        tweets.emplace_back(text, end_idx);
    }

    // ......
}
catch (std::exception& e) {
    std::cout << "Exception: " << e.what() << std::endl;
}
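
DB_PATH and TWEET_NUM_LIMIT are defined elsewhere in the project; placeholder definitions along these lines would work (the path and the limit here are assumptions, not the project's actual values):

// Placeholder definitions (see the full project for the real ones)
const std::string DB_PATH = "tweets.db";  // hypothetical path to the SQLite database
constexpr int TWEET_NUM_LIMIT = 100000;   // cap on the number of tweets to read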

Here is the core of the algorithm. The lambda builds a suffix array from the big string constructed above, then uses it to find a group of strings that meets the requirements (see the algorithm description earlier), and finally returns the start positions of all the matching suffixes in the big string. These positions are sorted in descending order and packed into a stack, so the smallest start position ends up on top.

std::stack<int> start_idxs = [=]() -> std::stack<int> {
    // create a suffix array from the big string
    SuffixArray suffix_arr = createSuffixArray(big_str);
    for (SuffixArray::iterator curr_it = suffix_arr.begin(); curr_it != suffix_arr.end(); curr_it++) {
        const std::string& curr_suffix_str = curr_it->str;
        if (curr_suffix_str.size() < MIN_COMMOM_SUBSTRING_SIZE) {
            continue;
        }
        std::string target = curr_suffix_str.substr(0, MIN_COMMOM_SUBSTRING_SIZE);
        // find the first iter where the target is not a prefix
        SuffixArray::iterator target_end_it = std::find_if_not(curr_it + 1, suffix_arr.end(), [target](const Suffix& suffix) {
            return suffix.str.size() >= target.size() && isPrefix(target, suffix.str);
        });
        // check if the number of tweets we found is enough
        int tweet_found_num = target_end_it - curr_it;
        if (tweet_found_num < MIN_COMMOM_NUM) {
            curr_it = target_end_it - 1;
            continue;
        }
        else {
            std::cout << "[Common substring: " << target << "]" << std::endl;
            std::vector<int> start_idxs;
            for (int i = 0; i < tweet_found_num; i++) {
                start_idxs.emplace_back((curr_it + i)->start_idx);
            }
            std::sort(start_idxs.begin(), start_idxs.end(), std::greater<int>());
            return std::stack<int>(std::deque<int>(start_idxs.begin(), start_idxs.end()));
        }
    }
    return std::stack<int>();
}();
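
The isPrefix helper used above is not shown in this excerpt; a minimal version (the exact signature is an assumption, see the full project for the real one) could be:

// Assumed helper: does `str` start with `prefix`?
bool isPrefix(const std::string& prefix, const std::string& str) {
    return str.size() >= prefix.size() && str.compare(0, prefix.size(), prefix) == 0;
}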

If the stack returned above is empty, nothing was found. Otherwise, for each suffix start position we look up which tweet it belongs to. Note that several start positions may fall inside the same tweet (the common substring occurs more than once in that tweet); such duplicates are skipped, which is why the final result may contain fewer than MIN_COMMOM_NUM tweets.

if (start_idxs.empty()) {
    std::cout << "No results found" << std::endl;
}
else {
    // print tweets
    int curr_idx = 0;
    while (curr_idx < tweets.size()) {
        int target_idx = searchTweet(tweets, curr_idx, tweets.size() - 1, start_idxs.top());
        std::cout << std::endl
                  << std::string(50, '=') << "[ Tweet ]" << std::string(50, '=') << std::endl
                  << tweets[target_idx].text << std::endl
                  << std::string(109, '=') << std::endl;
        start_idxs.pop();
        // remove repeated tweets
        while (!start_idxs.empty() && tweets[target_idx].end_idx >= start_idxs.top()) {
            start_idxs.pop();
        }
        if (start_idxs.empty()) {
            break;
        }
        curr_idx = target_idx + 1;
    }
}

The tweet lookup uses binary search: it returns the first tweet whose end_idx is at least the given start index.

int searchTweet(const std::vector<Tweet>& tweets, int left, int right, int start_idx) {
    while (left < right) {
        int mid = left + (right - left) / 2;
        if (tweets[mid].end_idx >= start_idx) {
            right = mid;
        }
        else {
            left = mid + 1;
        }
    }
    return right;
}
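
For instance (hypothetical data), with three tweets concatenated into the big string, a suffix starting at index 12 belongs to the second tweet, because that tweet is the first one whose end_idx is at least 12:

// Hypothetical layout: "hello world" occupies indices 0-10, "hello there" 11-21, "goodbye" 22-28
std::vector<Tweet> tweets;
tweets.emplace_back("hello world", 10);
tweets.emplace_back("hello there", 21);
tweets.emplace_back("goodbye", 28);
int owner = searchTweet(tweets, 0, static_cast<int>(tweets.size()) - 1, 12);  // owner == 1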

The full project is available on GitHub.


With MIN_COMMOM_SUBSTRING_SIZE set to 40 and MIN_COMMOM_NUM set to 5, the output is as follows:

[Common substring:  a disclaimer and would like to mention ]

==================================================[ Tweet ]==================================================
⚠ ATTENTION

due to twitter's new policy, i'm posting a disclaimer and would like to mention that i am not any of t… https://t.co/tzJRIIfjm1
=============================================================================================================

==================================================[ Tweet ]==================================================
ATTENTION

due to twitter's new policy, i'm posting a disclaimer and would like to mention that i am not any of the… https://t.co/e8wre8ftPy
=============================================================================================================

==================================================[ Tweet ]==================================================
due to twitter's new policy, i'm posting a disclaimer and would like to mention that i am not any of the celebritie… https://t.co/RBFygQNOMY
=============================================================================================================

==================================================[ Tweet ]==================================================
ATTENTION

due to Twitter’s new policy, I’m posting a disclaimer and would like to mention that I am not any of th… https://t.co/QcepQojThz
=============================================================================================================

==================================================[ Tweet ]==================================================
ATTENTION

due to Twitter’s new policy, I’m posting a disclaimer and would like to mention that I am not any of th… https://t.co/APBBMnGxeu
=============================================================================================================

==================================================[ Tweet ]==================================================
due to twitter's new policy, i'm posting a disclaimer and would like to mention that i am not any of the celebritie… https://t.co/6YkEhOwOjf
=============================================================================================================

==================================================[ Tweet ]==================================================
due to twitter's new policy, i'm posting a disclaimer and would like to mention that i am not any of the celebritie… https://t.co/rqDlvdmdDz
=============================================================================================================

Updated: