yanyiwu / gojieba

"结巴"中文分词的Golang版本
MIT License
2.43k stars 303 forks

A 2.3 GB dictionary exhausts a 64 GB machine #55

Closed yaokun123 closed 2 weeks ago

yaokun123 commented 5 years ago
package main

import (
    "encoding/json"
    "flag"
    "fmt"
    "io"
    "log"
    "net/http"
    "runtime"
    "strings"
    "time"

    "github.com/yanyiwu/gojieba"
)

var (
    host = flag.String("host", "127.0.0.1", "HTTP server hostname")
    port = flag.Int("port", 8888, "HTTP server port")
    x    = gojieba.NewJieba("/tmp/test.dict.utf8")
)

// Start the server with (host and port are optional and default to 127.0.0.1 and 8888):
//
//	go run server.go -host 0.0.0.0 -port 3306
func main() {
    flag.Parse()

    // Use as many OS threads as CPUs (this is already the default since Go 1.5).
    runtime.GOMAXPROCS(runtime.NumCPU())

    http.HandleFunc("/segmentation", Handler)
    addr := fmt.Sprintf("%s:%d", *host, *port)
    fmt.Println(addr)
    log.Fatal(http.ListenAndServe(addr, nil))
}

func Handler(w http.ResponseWriter, req *http.Request) {
    startTime := time.Now().UnixNano() / 1000000

    // Read the text to segment from the query string or the POST form.
    text := req.URL.Query().Get("company_name")
    if text == "" {
        text = req.PostFormValue("company_name")
    }

    // Tag returns entries of the form "word/pos"; keep only the nouns ("n").
    words := x.Tag(text)
    list := make([]string, 0)
    for _, word := range words {
        parts := strings.Split(word, "/")
        if len(parts) >= 2 && parts[len(parts)-1] == "n" {
            list = append(list, strings.Join(parts[:len(parts)-1], "/"))
        }
    }
    endTime := time.Now().UnixNano() / 1000000
    fmt.Println("processing time:", endTime-startTime, "ms")

    response, err := json.Marshal(list)
    if err != nil {
        http.Error(w, err.Error(), http.StatusInternalServerError)
        return
    }
    w.Header().Set("Content-Type", "application/json")
    io.WriteString(w, string(response))
}

Note: /tmp/test.dict.utf8 is a single file with roughly 50 million entries, in the following dictionary format:

常州市伟芳机械有限公司 2 n
兰州金乐塑胶有限公司 2 n
河南兆龙电气设备有限公司 2 n
青岛德润鑫文化传媒有限公司 2 n
重庆禾加合科技发展有限公司 2 n
潍坊崔旺建材销售有限公司 2 n
甘肃龙发装饰工程有限公司 2 n
任丘市大卫电动车有限公司 2 n
建湖县众友服饰有限公司 2 n
曹县小金豆电子商务有限公司 2 n
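
For reference, a minimal sketch (not part of the original report) of a client that exercises the /segmentation handler above, assuming the default host and port from server.go; the sample input string is arbitrary:

package main

import (
    "encoding/json"
    "fmt"
    "net/http"
    "net/url"
)

func main() {
    // Query the /segmentation endpoint on the default host/port from server.go.
    params := url.Values{"company_name": {"常州市伟芳机械有限公司"}}
    resp, err := http.Get("http://127.0.0.1:8888/segmentation?" + params.Encode())
    if err != nil {
        panic(err)
    }
    defer resp.Body.Close()

    // The handler responds with a JSON array of the extracted nouns.
    var nouns []string
    if err := json.NewDecoder(resp.Body).Decode(&nouns); err != nil {
        panic(err)
    }
    fmt.Println(nouns)
}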
yaokun123 commented 5 years ago

@yanyiwu

yanyiwu commented 5 years ago

Oh, the usage looks fine. With a dictionary of that order of magnitude, it seems 64 GB may genuinely not be enough...

yaokun123 commented 5 years ago

@yanyiwu Got it, thanks 🙏

yaokun123 commented 5 years ago

Hi, is there any workaround? For example, could some performance be sacrificed to fit within memory?


yanyiwu commented 5 years ago

My suggestion is to clean up the dictionary; it looks like the dictionary was not built in a reasonable way.

yaokun123 commented 5 years ago

The dictionary entries are company names, and the frequency and POS tag that follow are always the fixed "2 n". In what direction can a dictionary like this be optimized?


mmcer commented 3 years ago

Step one: split your roughly 50-million-entry dictionary into several batches, for example 500 files, so that each pass only has to handle 100,000 lines. Step two: deduplicate the processed results.

Alternatively, open the document as a stream and read one line at a time, segmenting each line as it is read.
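
For illustration, a rough Go sketch of the streaming variant described above; the input path /tmp/companies.txt is a placeholder, and it uses gojieba's built-in dictionaries rather than the 2.3 GB custom one:

package main

import (
    "bufio"
    "fmt"
    "log"
    "os"
    "strings"

    "github.com/yanyiwu/gojieba"
)

func main() {
    // Use the built-in dictionaries instead of loading the huge custom file.
    x := gojieba.NewJieba()
    defer x.Free()

    // Placeholder path for the large document to be processed.
    f, err := os.Open("/tmp/companies.txt")
    if err != nil {
        log.Fatal(err)
    }
    defer f.Close()

    seen := make(map[string]struct{}) // step two: deduplicate the results
    scanner := bufio.NewScanner(f)
    for scanner.Scan() {
        // Segment one line at a time so memory stays bounded by the line size.
        for _, tagged := range x.Tag(scanner.Text()) {
            parts := strings.Split(tagged, "/")
            if len(parts) >= 2 && parts[len(parts)-1] == "n" {
                seen[strings.Join(parts[:len(parts)-1], "/")] = struct{}{}
            }
        }
    }
    if err := scanner.Err(); err != nil {
        log.Fatal(err)
    }
    fmt.Println("distinct nouns:", len(seen))
}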

github-actions[bot] commented 1 month ago

This issue has not been updated for over 1 year and will be marked as stale. If the issue still exists, please comment or update the issue, otherwise it will be closed after 7 days.

github-actions[bot] commented 2 weeks ago

This issue has been automatically closed due to inactivity. If the issue still exists, please reopen it.