Process:
1) First, split the article into phrases at the commas.
2) Count the number of characters in each phrase.
3) Take the first two phrases longer than 10 characters, search each one on Baidu, and count how many of the Baidu results contain the phrase verbatim (see the sketch after this list).
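As a standalone illustration of steps 1–3, phrase extraction could look like the minimal sketch below. The helper name extract_phrases is hypothetical and not part of the script further down; it assumes the article text is a unicode string, and since Chinese copy usually uses the full-width comma '，', it splits on both comma forms:

# -*- coding: utf-8 -*-
import re

def extract_phrases(article, min_len=10, count=2):
    # Split on half-width and full-width commas, trim whitespace, and keep
    # the first `count` phrases longer than `min_len` characters.
    phrases = re.split(u'[,\uff0c]', article)
    return [p.strip() for p in phrases if len(p.strip()) > min_len][:count]

Each returned phrase is then searched on Baidu and its verbatim hits counted, which is what getContent() does in the full script below.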
If an article has been widely reposted by other sites, then almost any phrase pulled from it will turn up exact duplicates in a Baidu search. Conversely, if we search two of its phrases in a row and Baidu returns very few exact matches, that suggests the content probably has not been widely reposted, i.e. its originality is relatively high.
The script below carries out the three steps above:
In the output, the left column is the article ID and the right column is the number of Baidu results in which the two phrases appeared verbatim. The higher the count, the heavier the duplication; where exactly to draw the line is up to you. I usually treat >= 30% as heavily duplicated: searching 2 phrases returns 20 results in total, and 6 or more of them contain the phrase verbatim.
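As a rough sketch of that threshold rule (the function name and the 30% default simply restate the example above, they are not part of the script):

def is_duplicate(hits, total_results=20, threshold=0.3):
    # hits: number of results containing a searched phrase verbatim;
    # total_results: 2 phrases * 10 Baidu results each = 20 by default.
    return float(hits) / total_results >= threshold

# e.g. is_duplicate(6) -> True (6/20 = 30%), is_duplicate(3) -> False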
#coding:utf-8
# Python 2 script: relies on reload(sys)/setdefaultencoding, MySQLdb and the
# print statement.
import requests, re, time, sys, json, datetime
import multiprocessing
import MySQLdb as mdb

reload(sys)
sys.setdefaultencoding('utf-8')

current_date = time.strftime('%Y-%m-%d', time.localtime(time.time()))

def search(req, html):
    # Return the first regex capture group, or 'no' if there is no match.
    text = re.search(req, html)
    if text:
        data = text.group(1)
    else:
        data = 'no'
    return data

def date(timeStamp):
    # Convert a Unix timestamp to a 'YYYY-MM-DD HH:MM:SS' string.
    timeArray = time.localtime(timeStamp)
    otherStyleTime = time.strftime("%Y-%m-%d %H:%M:%S", timeArray)
    return otherStyleTime

def getHTml(url):
    # Extract the host name so it can be sent in the Host header.
    host = search('^([^/]*?)/', re.sub(r'(https|http)://', '', url))
    headers = {
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8",
        "Accept-Encoding": "gzip, deflate, sdch",
        "Accept-Language": "zh-CN,zh;q=0.8,en;q=0.6",
        "Cache-Control": "no-cache",
        "Connection": "keep-alive",
        #"Cookie": "",
        "Host": host,
        "Pragma": "no-cache",
        "Upgrade-Insecure-Requests": "1",
        "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/52.0.2743.116 Safari/537.36",
    }
    # Proxy server
    proxyHost = "proxy.abuyun.com"
    proxyPort = "9010"
    # Proxy tunnel authentication credentials
    proxyUser = "XXXX"
    proxyPass = "XXXX"
    proxyMeta = "http://%(user)s:%(pass)s@%(host)s:%(port)s" % {
        "host": proxyHost,
        "port": proxyPort,
        "user": proxyUser,
        "pass": proxyPass,
    }
    proxies = {
        "http": proxyMeta,
        "https": proxyMeta,
    }
    html = requests.get(url, headers=headers, timeout=30)
    # html = requests.get(url, headers=headers, timeout=30, proxies=proxies)
    return html.content

def getContent(word):
    # Query Baidu's JSON result interface for the phrase and count how many of
    # the returned entries contain the phrase verbatim in their abstract.
    pcurl = 'http://www.baidu.com/s?q=&tn=json&ct=2097152&si=&ie=utf-8&cl=3&wd=%s&rn=10' % word
    # print 'start crawl %s' % pcurl
    html = getHTml(pcurl)
    a = 0
    html_dict = json.loads(html)
    for tag in html_dict['feed']['entry']:
        if 'title' in tag:
            title = tag['title']
            url = tag['url']
            rank = tag['pn']
            pub_time = date(tag['time'])
            abstract = tag['abs']
            if word in abstract:
                a += 1
    return a

con = mdb.connect('127.0.0.1', 'root', '', 'wddis', charset='utf8', unix_socket='/tmp/mysql.sock')
cur = con.cursor()
with con:
    cur.execute("select aid,content from pre_portal_article_content limit 10")
    numrows = int(cur.rowcount)
    for i in range(numrows):
        row = cur.fetchone()
        aid = row[0]
        content = row[1]
        # Strip HTML tags, split on commas, keep the first two phrases longer
        # than 10 characters, and sum their verbatim hit counts on Baidu.
        content_format = re.sub('<[^>]*?>', '', content)
        a = 0
        for z in [x for x in content_format.split(',') if len(x) > 10][:2]:
            a += getContent(z)
        print "%s --> %s" % (aid, a)

# words = open(wordfile).readlines()
# pool = multiprocessing.Pool(processes=10)
# for word in words:
#     word = word.strip()
#     pool.apply_async(getContent, (word,))
# pool.close()
# pool.join()
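The commented-out block at the end hints at running the lookups in parallel. A hedged sketch of that idea, reusing getContent() from the script above (the phrase file name 'phrases.txt' and the pool size are assumptions, and it presumes this lives in the same module as the script or imports getContent from it):

if __name__ == '__main__':
    # One phrase per line in the file; 10 worker processes query Baidu in parallel.
    words = [w.strip() for w in open('phrases.txt').readlines()]
    pool = multiprocessing.Pool(processes=10)
    results = [pool.apply_async(getContent, (w,)) for w in words]
    pool.close()
    pool.join()
    for w, r in zip(words, results):
        print "%s --> %s" % (w, r.get())

Keep the worker count modest: Baidu throttles aggressive crawling, which is presumably what the proxy hook in getHTml() is for.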