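Update, one year later:
At home I have started updating my résumé and sorting through the materials from my internship, racking my brain over the storyline and insights that future interviewers will cue me for. In my spare time I am writing this post to talk about what was on my mind during those days as an intern.
All told, I spent fewer than ten days as a PTA at BDA. A normal internship has a fixed term, usually three months at minimum, but BDA is unusual: every PTA stint is scheduled around the timeline of the project the full-time staff are hiring for, so it is very flexible. In under half a month I worked on two projects, price comparison and a survey.
PTAs never get into BDA's own office. Instead we are placed in the 共享际 co-working space at 国贸大饭店 (China World Hotel), next door to the CICC offices. BDA rents some space there specifically to seat PTAs and the leaders who supervise us.
The leaders are the full-time employees who recruited us in the PTA pool group chat. They are all young and recently hired. The leader of the survey project, for example, only joined BDA last year as an experienced hire. These young juniors, like us, spend most of each day working in the rented offices, sitting right next to us.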
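PTA work at a consulting firm is nothing but dirty work, the grunt work. Price comparison means endless Ctrl+C and Ctrl+V: pulling data into Excel, then QC-ing (quality control, in plain terms checking) each other's work. Once a batch is finished we package it and send it to the leader, who runs some analysis on it. Reportedly, the data our whole PTA team spent two hard weeks producing takes up less than one page of the final report.
The survey was easier: making phone calls. Following a questionnaire designed in advance, we used a virtual dialing platform to interview customers. Interviews are all about efficiency. It took me less than a day to go from happily listening to customers chat to cutting off their small talk on the spot.
btw, the survey project urgently needed samples from third-tier and smaller cities, but the leader and all of us PTAs came from first- and second-tier cities or from overseas. Nobody could connect to China's vast lower-tier market.
This is exactly where the contradiction lies. A BDA PTA does not follow a project closely, but does the same kind of task from start to finish. If you sign up for a two-week price comparison project, you spend two weeks on Ctrl+C and Ctrl+V; if you join the survey, you make phone calls from morning to night.
The upside is that once you have the whole process figured out, a few days and a few months amount to the same thing, because the work is repetitive and mechanical.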
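The downside, however, is fatal. This repetitive dirty work is so dirty that you cannot even write it up on a résumé. During the internship I asked fellow interns from 北大 (Peking University) and 对外经贸 (UIBE) how to describe the experience, and got some decent templates.
But the moment an interviewer asks "tell me your experience at BDA consulting firm" or "show me your insights in this field", I freeze. I can hardly say that I mastered the copy and paste shortcuts, or that over lunch and dinner I often discussed how to deal with the leader with my fellow interns.
One of my peers did sound an alarm for us. He knew someone who had also been a BDA PTA and later interviewed with ByteDance's strategic investment department. The interviewer was very interested in the BDA experience and kept digging for details. The candidate could not hold out and finally confessed the truth about the dirty work.
The interviewer asked: is that all? Do you have any views of your own on the project? The candidate said: that is all. The interviewer then wrote in the feedback that the candidate lacked independent thinking and did not think proactively during the internship. The candidate was rejected.
(btw, a day later we heard that ByteDance's strategic investment department had been disbanded, so there is no need to worry about interviewing there any more.)
I find this laughable. If that interviewer had ever been a BDA PTA as a student, they would know the question itself is pointless. Even if I bring my brain to work here, I end up as an unthinking machine.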
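PTAs have no access to any project information. We do not know who the client is, what the project is for, or even what the project looks like as a whole. Guessing at the whole picture from our less-than-one-page contribution is like the blind men feeling the elephant.
It is not only us. Our leaders may not see the whole picture either. What they do each day is not much more advanced than what we do. We submit data to them, they do a final QC, and then they do some data analysis. If the QC turns up errors, the data comes back to us for correction; if the data does not support the result they want to present, it comes back to us to be "beautified".
Another thing that gets in the way of thinking is the working hours. We clock in before 10 a.m. and, in theory, clock out at 8 p.m., which counts as a day with overtime and pays 350 yuan (without overtime you leave at 6 p.m. and get 250 yuan). But when the leader is pushy, leaving between 9:30 and 10:30 p.m. is routine.
From what I observed, full-time staff leave even later. The leaders supervising us, all junior consultants, finish at 1 or 2 a.m. every day (that is when they finish work, not when they go to bed).
One night during the internship I got carried away scrolling short videos and suddenly noticed it was 2:30 a.m. I checked WeChat and found the leader still posting files in the project group and @-ing people to assign tasks. Reportedly, during a project's final sprint, consultants do not finish until 3 or 4 a.m.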
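After the internship I knew this was a way of working I could not accept. Even if BDA's starting pay is higher than MBB's, even if a new graduate makes 500,000 to 600,000 yuan a year, even if you can lead projects on your own after three to five years, even if you can comfortably pass a million a year before 30, I cannot accept these hours.
I can accept going to bed around midnight every day, but I cannot accept even a short stretch of going to bed at 2 or 3 a.m., let alone 3 or 4 a.m. As I see it, going to bed after midnight is merely unhealthy, while regularly going to bed after 1 or 2 a.m. takes years off your life. I am all for "working hard", but not when the stake is my health or even my life.
Besides, if high-intensity work leaves no room to breathe, where is the time to reflect, or to study and improve outside of work?
With all this in mind, my career plan feels clearer and clearer. A job search has to weigh value for money, not just absolute pay. I am convinced that in the 2020s nobody reaches financial freedom purely by working for a company, even in top-paying industries such as investment banking and consulting. Likewise, nobody should expect to climb from the bottom of a firm all the way to partner, because every seat in the middle and upper ranks has long been taken.
If you want financial freedom or want to build something bigger, starting a business is the only route; otherwise, think about value for money instead of simply chasing the highest salary. A career in a state-owned enterprise or a public institution is not necessarily worse than one in a private company, and working inside the system is not necessarily worse than working outside it. They are just different tracks.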
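Having written this far, I have a few reflections.
Consulting really is a curious industry. Management consulting in particular, represented by MBB, is an import that swept the world from Boston and arrived in China along with market reforms. It has always presented itself as selling IQ and brainpower. Consultants in suits are stationed on site at local companies, talk to the CEO in the executive office and to frontline staff on the shop floor, then sell all kinds of strategy models, turn into business coaches, draw up a tailor-made deck, and charge sky-high fees.
On the other hand, the barrier to entry is close to zero. Consulting is open to nearly all majors. Whether you studied business, science and engineering, humanities, or even art, as long as you have memorized Case Interview Secrets, do well in case mocks, and are fluent with Excel and PPT shortcuts, the only thing a top consulting firm still asks of you is an elite-school title. Staying up until 2 or 3 a.m., as mentioned earlier, just means being an Excel boy/girl or drawing slides. It is hard to feel that the industry creates value.
Consulting has another trait: it strongly attracts students with overseas degrees. Perhaps it is the image of the industry that gives consulting its glamour. In-depth conversations with partners at companies in the industry, the chance to show how smart you are, and keeping track of different business models all make for a very fancy experience.
What is more, in theory consulting offers employees a future of unlimited possibilities. After years in consulting you can move to a client company to do strategy, jump to PE/VC to do primary-market investing, or, after accumulating countless business cases, start your own company, put the business plan that has long been burning in your mind into practice, and break through the pay ceiling. After all, the example of 小红书 (Xiaohongshu) founder 毛文超 is out there inspiring every consultant.
I would rather not comment much on these career paths, but I have long felt that for the generation born after 2000, many of these windows are getting smaller and smaller.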
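In the survey group there was an older student studying in Arizona. She had done a two-month regular internship in the consulting arm of one of the Big Four, has now been a PTA at Roland Berger for six months, and is also working in strategy at NIO.
She recalled how tough it was as a PTA at the Big Four firm and at Roland Berger, and told us that you learn nothing in those places, that it is all dirty work, much the same as being a PTA at BDA. I admit that whether you "learn something" is subjective, like the fable of the little horse crossing the river. Still, her description gave us a rough idea.
She also shared some fun facts from her internships. At Roland Berger, PTAs who work overtime on weekends are not paid for it. NIO is great: the space is comfortable and the work is light, the company gave her an office of her own, and she sits in it drawing slides for Roland Berger.
I know a senior from 北师大数院 (the School of Mathematical Sciences at Beijing Normal University) who got into KPMG's 精英计划 (elite programme) last year and chose the consulting line she liked. She told me the recruiting HR was delighted and said that the whole cohort had science and engineering backgrounds, with not a single pure business major.
Recently, though, I saw on her WeChat Moments that in the final placement within the consulting line she had been assigned to operations. I accept that operations has its own career path, but it is a world away from the consulting work she wanted, and she no longer has any idea how to write it up on her résumé.
It is undeniable that business recruiting increasingly favors science and engineering backgrounds. But celebrating those hires while casually assigning them to non-core roles with no real value exploits an information gap, at the expense of students who worked hard to build a solid quantitative foundation.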
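In a word, industries with no barriers are actually the hardest. When people have no core competitive advantage, all they can trade for income is long hours and late nights. Or else you end up with partners who hold the client resources dominating everything.
This applies not only to consulting. The Big Four, domestic investment banks whose business is mostly compliance-driven, and PE/VC (mainly VC) are all industries without barriers, or rather industries whose barrier is work experience. They tend to come with very bad hours and a lot of dirty work.
More broadly, the primary market as a whole puts more weight on barriers of platform and resources, while the secondary market puts more weight on barriers of individual ability. The primary market values law and accounting; the secondary market prefers a quantitative foundation.
Take it slowly, then, like another student in the survey group. She is from Guangzhou, an undergraduate at South China University of Technology, who came to Beijing for fun over the winter break and set aside some time to work as an onsite PTA at BDA. She said the internship was just part of the trip.
Fine. I will treat exploring the road ahead as part of life too. But whatever happens, I will not work anywhere with terrible hours. I want to sleep.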
In the world of finance, the term BDA has gained significant attention in recent years. BDA stands for Big Data Analytics, and it refers to the process of collecting, analyzing, and interpreting large sets of financial data to gain insights and make informed decisions.
In finance, data plays a crucial role in decision-making. However, with the increasing volume and complexity of data, traditional methods of analysis have become inadequate. This is where BDA comes into play.
BDA is a technology-driven approach that leverages advanced analytics techniques to extract meaningful patterns, trends, and correlations from massive amounts of data. It involves the use of specialized software and tools to process structured and unstructured data from various sources, such as market data, financial statements, social media, and more.
By applying statistical models, machine learning algorithms, and artificial intelligence, BDA enables financial institutions to uncover valuable insights that can drive strategic decision-making, risk management, fraud detection, and customer relationship management.
BDA has revolutionized the finance industry by providing a more accurate and comprehensive understanding of market trends, customer behavior, and financial risks. Here are some key reasons why BDA is important in finance:
As technology continues to advance, BDA is expected to play an even more significant role in the finance industry. With the proliferation of digital platforms, mobile banking, and online transactions, the amount of data generated is growing exponentially.
This presents both challenges and opportunities for financial institutions. BDA will be crucial in efficiently processing and analyzing this vast amount of data, enabling quick decision-making, risk management, and innovation.
In conclusion, BDA, or Big Data Analytics, is a game-changer in the finance industry. It empowers financial institutions with data-driven decision-making, enhanced risk management, improved customer experiences, fraud detection, and compliance with regulatory requirements. As the demand for data-driven insights grows, BDA will continue to shape the future of finance.
Thank you for reading this article on the meaning of BDA in finance. We hope it has provided you with valuable insights into the importance of Big Data Analytics in the finance industry.
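Big data analytics (BDA) is an important department within a securities brokerage. In an age of information overload, data has become key to decision-making and business growth. As the financial industry develops, brokerages pay more and more attention to data analytics to help them understand market trends, client needs, and risk.
Which department BDA is within a brokerage is a question that bears on how the whole firm operates and grows. Big data analytics helps a brokerage understand client behavior patterns and so offer personalized services and products. By analyzing market data and asset price movements, a brokerage can try to forecast market direction more accurately and make better-informed investment decisions. It can also help the brokerage spot potential risks and take measures to avoid them.
In a fiercely competitive financial market, brokerages need to keep innovating and improving their competitiveness. With big data analytics, a brokerage can read market dynamics better, streamline business processes, and raise client satisfaction. This is why the BDA department is becoming increasingly prominent within brokerages.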
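Big data analytics not only helps a brokerage understand its markets and clients, it can also improve operating efficiency. By analyzing large volumes of data, a brokerage can find bottlenecks and problems in its business processes and take targeted steps to fix them. Analytics tools can also generate reports and analysis results automatically, reducing staff workload so that people can focus on more strategic work.
The BDA department can also use data mining and machine learning algorithms to help the brokerage find opportunities and potential risks in the market. By building forecasting models and risk assessment models, a brokerage can try to predict market direction more accurately, adjust portfolios in time, and improve investment returns.
Take one brokerage as an example. It used big data analytics to mine its client data in depth and identified the characteristics and needs of different client groups. Based on these results, it managed clients by segment and launched a series of personalized services, which greatly improved client satisfaction and loyalty.
The same brokerage also used big data analytics to build risk control and asset allocation models that help clients avoid investment risk and grow their assets steadily. These cases illustrate the role of the BDA department in a brokerage and the potential of big data analytics in the financial industry.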
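In summary, the BDA department has become an indispensable part of a brokerage's development. Applying big data analytics helps a brokerage understand its markets and clients and strengthen its competitiveness, and also helps it streamline business processes and work more efficiently.
As technology advances and digitalization deepens, big data analytics will be applied ever more widely in brokerages, bringing both opportunities and challenges. Brokerages therefore need to give the BDA department more attention, invest more resources and people in it, and push for deeper use of big data technology.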
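A blue screen at boot can have several causes:
1. Startup items. Fix: click Start, then Run, enter msconfig, open the Startup tab, untick every entry, click OK, then reboot and try again.
2. Memory. Fix: pull out the RAM module, rub the gold contacts with an eraser, reseat the module, and try booting again.
3. Graphics card. Fix: (1) reinstall the graphics driver; (2) reseat the graphics card; (3) have the graphics card serviced.
4. Virus. Fix: boot into safe mode and run a full scan with antivirus software.
5. Operating system. Fix: check the startup items for anything abnormal; if there is nothing, reinstall the system.
6. Hard disk. Fix: check the disk for bad sectors.
How to reinstall the system:
1. On another computer, download a tool for building a PE boot drive (typically 通用PE 5.0).
2. Download the system image file.
3. Open the downloaded tool, choose the one-click option to build a USB PE boot drive, and copy the image file onto the USB drive.
4. Boot from the USB PE drive.
5. Format the system partition.
6. Reinstall the system from the image file.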
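Both the CDA and the BDA data analyst certifications are fine.
The BDA data analyst certification is an upgrade of the "调查分析师" (survey analyst) certification sponsored by the Ministry of Education, and it carries on that programme's specialization in statistical analysis and data mining. BDA is widely recognized in the market research industry in China and abroad. The certificate is issued by 中国信息协会市场研究业分会 (the market research branch of the China Information Industry Association), a national association made up of groups of experts in economics, science and technology, and social fields, established with the consent of the State Council and the approval of the Ministry of Civil Affairs. It is the professional industry association for market surveys, market research, data analysis, data mining, and data insight.
The "CDA data analyst" certification is a professional international qualification, open to all industries, created against the background of the digital economy and the trend toward artificial intelligence. It aims to raise digital skills across the population, help companies with digital transformation, and promote the digital development of industries. A "CDA data analyst" specifically means someone in the internet, finance, retail, consulting, telecom, healthcare, tourism, or similar industries who specializes in collecting, cleaning, processing, and analyzing data and can produce business reports and support decisions, a new kind of data analy…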
In today's constantly evolving financial landscape, it can be challenging to stay abreast of all the different investment options available. One term that has gained traction in recent years is BDA finance. But what exactly is BDA finance, and how does it work? In this article, we will delve into the basics of BDA finance, exploring its structure, operations, and benefits.
BDA finance, as used here, refers to a Business Development Company, a structure more commonly abbreviated BDC. It is a type of closed-end investment company that primarily focuses on providing capital and financial support to small and mid-sized businesses. Similar to a mutual fund or a private equity firm, a BDA operates by pooling funds from individual and institutional investors. These funds are then invested in various companies to enable their growth and expansion.
A BDA is structured as a corporation and is regulated under the Investment Company Act of 1940 in the United States. It must meet certain requirements to qualify as a BDA. For example, at least 70% of its total assets must be invested in private or public companies with market values of under $250 million.
BDAs can be internally or externally managed. Internally managed BDAs have their own investment professionals who make decisions regarding which companies to invest in and manage the day-to-day operations. Externally managed BDAs, on the other hand, outsource these responsibilities to an external investment advisor.
Investing in BDA finance offers several benefits for both individual and institutional investors:
BDA finance, or Business Development Companies, offer investors the opportunity to invest in small and mid-sized businesses and benefit from their growth and success. With a focus on diversification, income generation, and professional management, BDAs can be an attractive investment option. However, like all investments, it is essential to conduct thorough research and consider your individual financial goals and risk tolerance before investing in BDA finance.
Thank you for taking the time to read this article. We hope it has provided you with a better understanding of BDA finance and its potential benefits. If you have any further questions or would like more information, please don't hesitate to reach out.
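Big Data Analytics in Finance (BDA Finance for short) means applying big data analytics in the financial industry to acquire, process, and analyze massive volumes of financial data in support of business decisions and risk management.
As the financial industry becomes more digitized and data volumes surge, big data analytics has become a hot topic. Traditional methods of processing financial data cannot cope with massive amounts of structured and unstructured data, while big data analytics can help financial institutions find hidden patterns, trends, and correlations in that data and so support business decisions and risk management.
Big data analytics has a wide range of applications in the financial industry. A few typical areas are:
The use of BDA Finance in the financial industry is still developing and being explored. As big data technology matures and the industry's demand for data analysis grows, it will be applied more and more widely. Advances in artificial intelligence and machine learning are expected to push BDA Finance further and bring more innovation and change to the industry.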
In the fast-paced world of finance, acronyms abound. One such acronym that has gained traction is BDA. But what exactly does BDA mean in finance? In this article, we will delve into the meaning of BDA, its significance in the finance industry, and how it is used by professionals.
BDA stands for "Big Data Analytics." It refers to the process of collecting, analyzing, and interpreting large volumes of data to derive meaningful insights and support decision-making within the financial sector. As technology continues to advance and data becomes increasingly abundant, BDA has emerged as a vital tool for financial institutions, investment firms, and professionals seeking a competitive edge.
The significance of BDA in finance cannot be overstated. By leveraging sophisticated algorithms and advanced analytics techniques, financial professionals can make data-driven decisions more accurately and efficiently. BDA allows them to identify patterns, trends, and correlations in vast amounts of data that would otherwise be impossible to uncover manually. Moreover, it enables them to predict market movements, identify investment opportunities, manage risks, detect fraud, and enhance operational efficiency.
BDA finds numerous applications in the finance industry. Some key areas where BDA is widely used include:
While BDA holds immense potential, it comes with its own set of challenges. Data privacy, security, and regulatory compliance are critical considerations when dealing with sensitive financial information. Additionally, the complexity of managing and analyzing large data sets requires advanced technological infrastructure and skilled professionals.
Looking ahead, the future of BDA in finance looks bright. As technologies like artificial intelligence (AI) and machine learning continue to evolve, the capabilities of BDA will expand further. The ability to harness the power of ever-growing data sets will enable financial institutions to make more accurate predictions, develop sophisticated risk models, and enhance overall decision-making processes.
In conclusion, BDA, which stands for Big Data Analytics, is a significant concept in the world of finance. It enables financial professionals to gain insights from large volumes of data, make data-driven decisions, and stay ahead in an increasingly competitive industry. Embracing BDA can lead to improved risk management, investment analysis, algorithmic trading, and customer relationship management. Just remember, in this age of data, understanding BDA is essential for anyone operating in the finance realm.
Thank you for reading this article. We hope it has provided you with a comprehensive understanding of what BDA means in finance and its importance in the industry. By harnessing the power of BDA, financial professionals can unlock valuable insights and make informed decisions to navigate the complex world of finance.
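The three keys with the Toyota logo should belong to a low- or mid-trim Toyota 锐志 (Reiz) or Camry (two with remote plus one spare without remote, the standard set for most mid-size Toyotas); the two with the Honda logo are keys for an older Accord.
I had previously looked at how Mahout's official 20news example is invoked, so I wanted to implement another example following the same flow. Online I found one about whether the weather is suitable for playing badminton (the dataset below labels the outcome PlayTennis).
Training data: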
Day Outlook Temperature Humidity Wind PlayTennis
D1 Sunny Hot High Weak No
D2 Sunny Hot High Strong No
D3 Overcast Hot High Weak Yes
D4 Rain Mild High Weak Yes
D5 Rain Cool Normal Weak Yes
D6 Rain Cool Normal Strong No
D7 Overcast Cool Normal Strong Yes
D8 Sunny Mild High Weak No
D9 Sunny Cool Normal Weak Yes
D10 Rain Mild Normal Weak Yes
D11 Sunny Mild Normal Strong Yes
D12 Overcast Mild High Strong Yes
D13 Overcast Hot Normal Weak Yes
D14 Rain Mild High Strong No
Test sample:
sunny,hot,high,weak
Result:
Yes => 0.007039
No => 0.027418
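The two scores above can be checked by hand. The following is a minimal sketch, assuming plain unsmoothed naive Bayes (the class prior multiplied by the per-feature conditional frequencies) over the 14 training rows. The class name `PlayTennisByHand` and the helper `score` are made up for illustration and are not part of Mahout. Computed this way, the values come out near 0.00705 for Yes and 0.02743 for No, close to the figures quoted above; the small gap is presumably rounding in the original calculation. Mahout's own classifier works on TF-IDF weighted vectors, so its printed scores should not be expected to match these numbers.

```java
public class PlayTennisByHand {

    // The 14 training rows: Outlook, Temperature, Humidity, Wind, PlayTennis.
    static final String[][] ROWS = {
        {"Sunny", "Hot", "High", "Weak", "No"},
        {"Sunny", "Hot", "High", "Strong", "No"},
        {"Overcast", "Hot", "High", "Weak", "Yes"},
        {"Rain", "Mild", "High", "Weak", "Yes"},
        {"Rain", "Cool", "Normal", "Weak", "Yes"},
        {"Rain", "Cool", "Normal", "Strong", "No"},
        {"Overcast", "Cool", "Normal", "Strong", "Yes"},
        {"Sunny", "Mild", "High", "Weak", "No"},
        {"Sunny", "Cool", "Normal", "Weak", "Yes"},
        {"Rain", "Mild", "Normal", "Weak", "Yes"},
        {"Sunny", "Mild", "Normal", "Strong", "Yes"},
        {"Overcast", "Mild", "High", "Strong", "Yes"},
        {"Overcast", "Hot", "Normal", "Weak", "Yes"},
        {"Rain", "Mild", "High", "Strong", "No"},
    };

    /**
     * Unsmoothed naive Bayes score: P(label) * product of P(feature_i | label).
     * features[i] is compared (ignoring case) with column i of each row.
     */
    static double score(String label, String... features) {
        int labelCount = 0;
        int[] matches = new int[features.length];
        for (String[] row : ROWS) {
            if (!row[4].equals(label)) {
                continue;
            }
            labelCount++;
            for (int i = 0; i < features.length; i++) {
                if (row[i].equalsIgnoreCase(features[i])) {
                    matches[i]++;
                }
            }
        }
        if (labelCount == 0) {
            return 0.0;
        }
        double result = (double) labelCount / ROWS.length; // class prior
        for (int m : matches) {
            result *= (double) m / labelCount;             // conditional frequency
        }
        return result;
    }

    public static void main(String[] args) {
        String[] query = {"sunny", "hot", "high", "weak"};
        System.out.println("Yes => " + score("Yes", query));
        System.out.println("No  => " + score("No", query));
    }
}
```

Because No scores higher than Yes, the hand calculation assigns the sample to the No class.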
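So I used Java code to call Mahout's tool classes and implement the classification.
Basic idea:
1. Build the classification data.
2. Train with Mahout's tool classes to get a trained model.
3. Convert the sample to be classified into vector data.
4. Have the classifier classify the vector data.
Here is my implementation =>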
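1. Build the classification data:
On HDFS, create the folder /zhoujianfeng/playtennis/input and upload the data for the two class folders, no and yes, to it.
Data file format, for example the content of file D1: Sunny Hot High Weak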
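The folder layout described above (one folder per class, one small text file per day) can be generated locally before uploading it to HDFS. The following is a rough sketch under that assumption; the class name `PlayTennisInputWriter`, the method `writeDataset`, and the default output folder `playtennis-input` are hypothetical and not from the original post. It only uses the standard `java.nio.file` API and does not touch HDFS.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Locale;

public class PlayTennisInputWriter {

    // Day id, Outlook, Temperature, Humidity, Wind, PlayTennis.
    static final String[][] ROWS = {
        {"D1", "Sunny", "Hot", "High", "Weak", "No"},
        {"D2", "Sunny", "Hot", "High", "Strong", "No"},
        {"D3", "Overcast", "Hot", "High", "Weak", "Yes"},
        {"D4", "Rain", "Mild", "High", "Weak", "Yes"},
        {"D5", "Rain", "Cool", "Normal", "Weak", "Yes"},
        {"D6", "Rain", "Cool", "Normal", "Strong", "No"},
        {"D7", "Overcast", "Cool", "Normal", "Strong", "Yes"},
        {"D8", "Sunny", "Mild", "High", "Weak", "No"},
        {"D9", "Sunny", "Cool", "Normal", "Weak", "Yes"},
        {"D10", "Rain", "Mild", "Normal", "Weak", "Yes"},
        {"D11", "Sunny", "Mild", "Normal", "Strong", "Yes"},
        {"D12", "Overcast", "Mild", "High", "Strong", "Yes"},
        {"D13", "Overcast", "Hot", "Normal", "Weak", "Yes"},
        {"D14", "Rain", "Mild", "High", "Strong", "No"},
    };

    /** Writes root/yes/Dn and root/no/Dn, one file per row; returns the file count. */
    static int writeDataset(Path root) throws IOException {
        int written = 0;
        for (String[] row : ROWS) {
            // The class label becomes the folder name ("yes" or "no").
            Path classDir = root.resolve(row[5].toLowerCase(Locale.ROOT));
            Files.createDirectories(classDir);
            // File content is the four feature words separated by spaces.
            String content = String.join(" ", row[1], row[2], row[3], row[4]);
            Files.write(classDir.resolve(row[0]), content.getBytes(StandardCharsets.UTF_8));
            written++;
        }
        return written;
    }

    public static void main(String[] args) throws IOException {
        Path root = Paths.get(args.length > 0 ? args[0] : "playtennis-input");
        int count = writeDataset(root);
        System.out.println("Wrote " + count + " files under " + root.toAbsolutePath());
    }
}
```

The resulting folder can then be copied to the HDFS input path with whatever upload tool is already in use, for example the Hadoop file system shell.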
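2. Train with Mahout's tool classes to get a trained model.
3. Convert the sample to be classified into vector data.
4. Have the classifier classify the vector data.
For these three steps I will paste all the code in one go. There are two main classes, PlayTennis1 and BayesCheckData =>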
package myTesting.bayes;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.util.ToolRunner;
import org.apache.mahout.classifier.naivebayes.training.TrainNaiveBayesJob;
import org.apache.mahout.text.SequenceFilesFromDirectory;
import org.apache.mahout.vectorizer.SparseVectorsFromSequenceFiles;
public class PlayTennis1 {
private static final String WORK_DIR = "hdfs://192.168.9.72:9000/zhoujianfeng/playtennis";
/*
 * Test driver
 */
public static void main(String[] args) {
// Convert the training data into vectors
makeTrainVector();
// Build the training model
makeModel(false);
// Classify the test sample
BayesCheckData.printResult();
}
public static void makeCheckVector(){
// Convert the test data into sequence files
try {
Configuration conf = new Configuration();
conf.addResource(new Path("/usr/local/hadoop/conf/core-site.xml"));
String input = WORK_DIR+Path.SEPARATOR+"testinput";
String output = WORK_DIR+Path.SEPARATOR+"tennis-test-seq";
Path in = new Path(input);
Path out = new Path(output);
FileSystem fs = FileSystem.get(conf);
if(fs.exists(in)){
if(fs.exists(out)){
// the boolean argument means: delete recursively
fs.delete(out, true);
}
SequenceFilesFromDirectory sffd = new SequenceFilesFromDirectory();
String[] params = new String[]{"-i",input,"-o",output,"-ow"};
ToolRunner.run(sffd, params);
}
} catch (Exception e) {
e.printStackTrace();
System.out.println("Failed to convert the files into sequence files!");
System.exit(1);
}
// Convert the sequence files into vector files
try {
Configuration conf = new Configuration();
conf.addResource(new Path("/usr/local/hadoop/conf/core-site.xml"));
String input = WORK_DIR+Path.SEPARATOR+"tennis-test-seq";
String output = WORK_DIR+Path.SEPARATOR+"tennis-test-vectors";
Path in = new Path(input);
Path out = new Path(output);
FileSystem fs = FileSystem.get(conf);
if(fs.exists(in)){
if(fs.exists(out)){
// the boolean argument means: delete recursively
fs.delete(out, true);
}
SparseVectorsFromSequenceFiles svfsf = new SparseVectorsFromSequenceFiles();
String[] params = new String[]{"-i",input,"-o",output,"-lnorm","-nv","-wt","tfidf"};
ToolRunner.run(svfsf, params);
}
} catch (Exception e) {
e.printStackTrace();
System.out.println("Failed to convert the sequence files into vectors!");
// exit with code 2 (the original printed the number 2 instead of exiting)
System.exit(2);
}
}
public static void makeTrainVector(){
// Convert the training data into sequence files
try {
Configuration conf = new Configuration();
conf.addResource(new Path("/usr/local/hadoop/conf/core-site.xml"));
String input = WORK_DIR+Path.SEPARATOR+"input";
String output = WORK_DIR+Path.SEPARATOR+"tennis-seq";
Path in = new Path(input);
Path out = new Path(output);
FileSystem fs = FileSystem.get(conf);
if(fs.exists(in)){
if(fs.exists(out)){
// the boolean argument means: delete recursively
fs.delete(out, true);
}
SequenceFilesFromDirectory sffd = new SequenceFilesFromDirectory();
String[] params = new String[]{"-i",input,"-o",output,"-ow"};
ToolRunner.run(sffd, params);
}
} catch (Exception e) {
e.printStackTrace();
System.out.println("Failed to convert the files into sequence files!");
System.exit(1);
}
// Convert the sequence files into vector files
try {
Configuration conf = new Configuration();
conf.addResource(new Path("/usr/local/hadoop/conf/core-site.xml"));
String input = WORK_DIR+Path.SEPARATOR+"tennis-seq";
String output = WORK_DIR+Path.SEPARATOR+"tennis-vectors";
Path in = new Path(input);
Path out = new Path(output);
FileSystem fs = FileSystem.get(conf);
if(fs.exists(in)){
if(fs.exists(out)){
// the boolean argument means: delete recursively
fs.delete(out, true);
}
SparseVectorsFromSequenceFiles svfsf = new SparseVectorsFromSequenceFiles();
String[] params = new String[]{"-i",input,"-o",output,"-lnorm","-nv","-wt","tfidf"};
ToolRunner.run(svfsf, params);
}
} catch (Exception e) {
e.printStackTrace();
System.out.println("Failed to convert the sequence files into vectors!");
// exit with code 2 (the original printed the number 2 instead of exiting)
System.exit(2);
}
}
public static void makeModel(boolean completelyNB){
try {
Configuration conf = new Configuration();
conf.addResource(new Path("/usr/local/hadoop/conf/core-site.xml"));
String input = WORK_DIR+Path.SEPARATOR+"tennis-vectors"+Path.SEPARATOR+"tfidf-vectors";
String model = WORK_DIR+Path.SEPARATOR+"model";
String labelindex = WORK_DIR+Path.SEPARATOR+"labelindex";
Path in = new Path(input);
Path out = new Path(model);
Path label = new Path(labelindex);
FileSystem fs = FileSystem.get(conf);
if(fs.exists(in)){
if(fs.exists(out)){
// the boolean argument means: delete recursively
fs.delete(out, true);
}
if(fs.exists(label)){
// the boolean argument means: delete recursively
fs.delete(label, true);
}
TrainNaiveBayesJob tnbj = new TrainNaiveBayesJob();
String[] params =null;
if(completelyNB){
params = new String[]{"-i",input,"-el","-o",model,"-li",labelindex,"-ow","-c"};
}else{
params = new String[]{"-i",input,"-el","-o",model,"-li",labelindex,"-ow"};
}
ToolRunner.run(tnbj, params);
}
} catch (Exception e) {
e.printStackTrace();
System.out.println("Failed to build the training model!");
System.exit(3);
}
}
}
package myTesting.bayes;
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;
import org.apache.commons.lang.StringUtils;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.PathFilter;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.mahout.classifier.naivebayes.BayesUtils;
import org.apache.mahout.classifier.naivebayes.NaiveBayesModel;
import org.apache.mahout.classifier.naivebayes.StandardNaiveBayesClassifier;
import org.apache.mahout.common.Pair;
import org.apache.mahout.common.iterator.sequencefile.PathType;
import org.apache.mahout.common.iterator.sequencefile.SequenceFileDirIterable;
import org.apache.mahout.math.RandomAccessSparseVector;
import org.apache.mahout.math.Vector;
import org.apache.mahout.math.Vector.Element;
import org.apache.mahout.vectorizer.TFIDF;
import com.google.common.collect.ConcurrentHashMultiset;
import com.google.common.collect.Multiset;
public class BayesCheckData {
private static StandardNaiveBayesClassifier classifier;
private static Map<String, Integer> dictionary;
private static Map<Integer, Long> documentFrequency;
private static Map<Integer, String> labelIndex;
public void init(Configuration conf){
try {
String modelPath = "/zhoujianfeng/playtennis/model";
String dictionaryPath = "/zhoujianfeng/playtennis/tennis-vectors/dictionary.file-0";
String documentFrequencyPath = "/zhoujianfeng/playtennis/tennis-vectors/df-count";
String labelIndexPath = "/zhoujianfeng/playtennis/labelindex";
dictionary = readDictionnary(conf, new Path(dictionaryPath));
documentFrequency = readDocumentFrequency(conf, new Path(documentFrequencyPath));
labelIndex = BayesUtils.readLabelIndex(conf, new Path(labelIndexPath));
NaiveBayesModel model = NaiveBayesModel.materialize(new Path(modelPath), conf);
classifier = new StandardNaiveBayesClassifier(model);
} catch (IOException e) {
e.printStackTrace();
System.out.println("Error while initializing the data needed to vectorize the test sample");
System.exit(4);
}
}
/**
 * Load the dictionary file. Key: term text; Value: term ID
 * @param conf
 * @param dictionnaryDir
 * @return
 */
private static Map<String, Integer> readDictionnary(Configuration conf, Path dictionnaryDir) {
Map<String, Integer> dictionnary = new HashMap<String, Integer>();
PathFilter filter = new PathFilter() {
@Override
public boolean accept(Path path) {
String name = path.getName();
return name.startsWith("dictionary.file");
}
};
for (Pair<Text, IntWritable> pair : new SequenceFileDirIterable<Text, IntWritable>(dictionnaryDir, PathType.LIST, filter, conf)) {
dictionnary.put(pair.getFirst().toString(), pair.getSecond().get());
}
return dictionnary;
}
/**
 * Load the term document-frequency files under df-count. Key: term ID; Value: document frequency
 * @param conf
 * @param documentFrequencyDir
 * @return
 */
private static Map<Integer, Long> readDocumentFrequency(Configuration conf, Path documentFrequencyDir) {
Map<Integer, Long> documentFrequency = new HashMap<Integer, Long>();
PathFilter filter = new PathFilter() {
@Override
public boolean accept(Path path) {
return path.getName().startsWith("part-r");
}
};
for (Pair<IntWritable, LongWritable> pair : new SequenceFileDirIterable<IntWritable, LongWritable>(documentFrequencyDir, PathType.LIST, filter, conf)) {
documentFrequency.put(pair.getFirst().get(), pair.getSecond().get());
}
return documentFrequency;
}
public static String getCheckResult(){
Configuration conf = new Configuration();
conf.addResource(new Path("/usr/local/hadoop/conf/core-site.xml"));
String classify = "NaN";
BayesCheckData cdv = new BayesCheckData();
cdv.init(conf);
System.out.println("init done...............");
Vector vector = new RandomAccessSparseVector(10000);
TFIDF tfidf = new TFIDF();
//sunny,hot,high,weak
Multiset<String> words = ConcurrentHashMultiset.create();
words.add("sunny",1);
words.add("hot",1);
words.add("high",1);
words.add("weak",1);
int documentCount = documentFrequency.get(-1).intValue(); // key -1 holds the total number of documents
for (Multiset.Entry<String> entry : words.entrySet()) {
String word = entry.getElement();
int count = entry.getCount();
Integer wordId = dictionary.get(word); // the word ID comes from the dictionary.file-0 file written during vectorization
if (wordId == null){
// skip words that never appeared in the training data
continue;
}
if (documentFrequency.get(wordId) == null){
continue;
}
Long freq = documentFrequency.get(wordId);
double tfIdfValue = tfidf.calculate(count, freq.intValue(), 1, documentCount);
vector.setQuick(wordId, tfIdfValue);
}
// Classify with naive Bayes and pick the label with the best score
Vector resultVector = classifier.classifyFull(vector);
double bestScore = -Double.MAX_VALUE;
int bestCategoryId = -1;
for(Element element: resultVector.all()) {
int categoryId = element.index();
double score = element.get();
System.out.println("categoryId:"+categoryId+" score:"+score);
if (score > bestScore) {
bestScore = score;
bestCategoryId = categoryId;
}
}
classify = labelIndex.get(bestCategoryId)+"(categoryId="+bestCategoryId+")";
return classify;
}
public static void printResult(){
System.out.println("Predicted class: "+getCheckResult());
}
}